* [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list
@ 2020-04-03  5:40 js1304
  2020-04-03  5:40 ` [PATCH v5 01/10] mm/vmscan: make active/inactive ratio as 1:1 for anon lru js1304
                   ` (12 more replies)
  0 siblings, 13 replies; 26+ messages in thread
From: js1304 @ 2020-04-03  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Hello,

This patchset implements workingset protection and detection on
the anonymous LRU list.

* Changes on v5
- change memcg charge timing for the swapped-in page (fault -> swap-in)
- avoid readahead if previous owner of the swapped-out page isn't me
- use another lruvec to update the reclaim_stat for a new anonymous page
- add two more cases to fix up the reclaim_stat

* Changes on v4
- In the patch "mm/swapcache: support to handle the exceptional
entries in swapcache":
-- replace the word "value" with "exceptional entries"
-- add to handle the shadow entry in add_to_swap_cache()
-- support the huge page
-- remove the registration code for shadow shrinker

- remove the patch "mm/workingset: use the node counter
if memcg is the root memcg" since workingset detection for
anonymous page doesn't use shadow shrinker now
- minor style fixes

* Changes on v3
- rework the patch, "mm/vmscan: protect the workingset on anonymous LRU"
(use almost same reference tracking algorithm to the one for the file
mapped page)

* Changes on v2
- fix a critical bug that uses out of index lru list in
workingset_refault()
- fix a bug that reuses the rotate value for previous page

* SUBJECT
workingset protection

* PROBLEM
In the current implementation, a newly created or swapped-in anonymous
page starts on the active list. Growing the active list triggers
rebalancing of the active/inactive lists, so old pages on the active
list are demoted to the inactive list. Hence, hot pages on the active
list aren't protected at all.

Following is an example of this situation.

Assume that there are 50 hot pages on the active list and that the
system can hold 100 pages in total. Numbers denote the number of pages
on the active/inactive lists (active | inactive). (h) stands for hot
pages and (uo) stands for used-once pages.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

As we can see, the hot pages are swapped out, which will cause swap-ins
later.

* SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection. As on the file LRU list, a newly created or swapped-in
anonymous page now starts on the inactive list. Also as on the file LRU
list, the page is promoted once it has been referenced enough. This
simple modification changes the above example as follows (a condensed
sketch of the new insertion policy follows the example).

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

The hot pages remain on the active list. :)
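
Below is a condensed sketch of the new insertion policy, based on the
lru_cache_add_inactive_or_unevictable() change in patch 02 of this
series. It is an illustration, not the exact patched code (the real
function also updates the NR_MLOCK counter for newly mlocked pages):

void lru_cache_add_inactive_or_unevictable(struct page *page,
					   struct vm_area_struct *vma)
{
	bool unevictable;

	VM_BUG_ON_PAGE(PageLRU(page), page);

	/* Only a fully mlocked VMA bypasses the normal LRU lists. */
	unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
	if (unevictable && !TestSetPageMlocked(page))
		count_vm_event(UNEVICTABLE_PGMLOCKED);

	/*
	 * No SetPageActive() any more: a new anonymous page lands on the
	 * inactive (or unevictable) LRU and must earn promotion through
	 * references, exactly like a file page.
	 */
	lru_cache_add(page);
}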

* EXPERIMENT
I tested this scenario on my test bed and confirmed that this problem
happens with the current implementation. I also checked that it is
fixed by this patchset.

I did another test to show the performance effect of this patchset.

- ebizzy (with modified random function)
ebizzy is a test program in which the main thread allocates lots of
memory and child threads access it randomly for a given time.
Swap-in/out happens if the allocated memory is larger than the system
memory.

A random function following a Zipf distribution is used to create hot
and cold memory, and the hot/cold ratio is controlled by a parameter.
If the parameter is high, hot memory is accessed far more often than
cold memory; if the parameter is low, the number of accesses to each
memory region is similar. I use various parameters in order to show the
effect of the patchset on workloads with various hot/cold ratios.
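
For reference, here is a minimal userspace sketch of the kind of
Zipf-based index generator described above. It is an illustration only,
not the actual ebizzy modification; chunk 0 is the hottest:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double *cdf;
static int nr_chunks;

/* Build the cumulative distribution for Zipf(s) over n chunks. */
static void zipf_init(int n, double s)
{
	double sum = 0.0;
	int i;

	nr_chunks = n;
	cdf = malloc(n * sizeof(*cdf));
	for (i = 0; i < n; i++)
		sum += 1.0 / pow(i + 1, s);
	for (i = 0; i < n; i++)
		cdf[i] = (i ? cdf[i - 1] : 0.0) + 1.0 / pow(i + 1, s) / sum;
}

/* Return a chunk index in [0, nr_chunks); low indices are hot. */
static int zipf_next(void)
{
	double u = drand48();
	int lo = 0, hi = nr_chunks - 1;

	while (lo < hi) {	/* binary search the cumulative table */
		int mid = (lo + hi) / 2;

		if (cdf[mid] < u)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

int main(void)
{
	int i;

	zipf_init(1024, 0.9);	/* 1024 chunks, parameter 0.9 */
	for (i = 0; i < 10; i++)
		printf("%d\n", zipf_next());
	return 0;
}

A higher parameter concentrates the returned indices on a few hot
chunks; a parameter near zero makes the accesses almost uniform.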

My test setup is a virtual machine with 8 cpus and 1024MB memory.

The result format is as follows.

Parameter 0.1 ... 1.3
Allocated memory size
Throughput for base (larger is better)
Throughput for patchset (larger is better)
Improvement (larger is better)


* single thread

     0.1      0.3      0.5      0.7      0.9      1.1      1.3
<512M>
  7009.0   7372.0   7774.0   8523.0   9569.0  10724.0  11936.0
  6973.0   7342.0   7745.0   8576.0   9441.0  10730.0  12033.0
   -0.01     -0.0     -0.0     0.01    -0.01      0.0     0.01
<768M>
   915.0   1039.0   1275.0   1687.0   2328.0   3486.0   5445.0
   920.0   1037.0   1238.0   1689.0   2384.0   3638.0   5381.0
    0.01     -0.0    -0.03      0.0     0.02     0.04    -0.01
<1024M>
   425.0    471.0    539.0    753.0   1183.0   2130.0   3839.0
   414.0    468.0    553.0    770.0   1242.0   2187.0   3932.0
   -0.03    -0.01     0.03     0.02     0.05     0.03     0.02
<1280M>
   320.0    346.0    410.0    556.0    871.0   1654.0   3298.0
   316.0    346.0    411.0    550.0    892.0   1652.0   3293.0
   -0.01      0.0      0.0    -0.01     0.02     -0.0     -0.0
<1536M>
   273.0    290.0    341.0    458.0    733.0   1381.0   2925.0
   271.0    293.0    344.0    462.0    740.0   1398.0   2969.0
   -0.01     0.01     0.01     0.01     0.01     0.01     0.02
<2048M>
    77.0     79.0     95.0    147.0    276.0    690.0   1816.0
    91.0     94.0    115.0    170.0    321.0    770.0   2018.0
    0.18     0.19     0.21     0.16     0.16     0.12     0.11


* multi thread (8)

     0.1      0.3      0.5      0.7      0.9      1.1      1.3
<512M>
 29083.0  29648.0  30145.0  31668.0  33964.0  38414.0  43707.0
 29238.0  29701.0  30301.0  31328.0  33809.0  37991.0  43667.0
    0.01      0.0     0.01    -0.01     -0.0    -0.01     -0.0
<768M>
  3332.0   3699.0   4673.0   5830.0   8307.0  12969.0  17665.0
  3579.0   3992.0   4432.0   6111.0   8699.0  12604.0  18061.0
    0.07     0.08    -0.05     0.05     0.05    -0.03     0.02
<1024M>
  1921.0   2141.0   2484.0   3296.0   5391.0   8227.0  14574.0
  1989.0   2155.0   2609.0   3565.0   5463.0   8170.0  15642.0
    0.04     0.01     0.05     0.08     0.01    -0.01     0.07
<1280M>
  1524.0   1625.0   1931.0   2581.0   4155.0   6959.0  12443.0
  1560.0   1707.0   2016.0   2714.0   4262.0   7518.0  13910.0
    0.02     0.05     0.04     0.05     0.03     0.08     0.12
<1536M>
  1303.0   1399.0   1550.0   2137.0   3469.0   6712.0  12944.0
  1356.0   1465.0   1701.0   2237.0   3583.0   6830.0  13580.0
    0.04     0.05      0.1     0.05     0.03     0.02     0.05
<2048M>
   172.0    184.0    215.0    289.0    514.0   1318.0   4153.0
   175.0    190.0    225.0    329.0    606.0   1585.0   5170.0
    0.02     0.03     0.05     0.14     0.18      0.2     0.24

As we can see, as the allocated memory grows, the patched kernel gets
better results. The maximum improvement is 21% for the single-thread
test and 24% for the multi-thread test.


* SUBJECT
workingset detection

* PROBLEM
The later part of the patchset implements workingset detection for the
anonymous LRU list. There is a corner case where workingset protection
alone could cause thrashing. If we can avoid that thrashing through
workingset detection, we get better performance.

Following is an example of thrashing due to the workingset protection.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (will be hot) pages
50(h) | 50(wh)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)

4. workload: 50 (will be hot) pages
50(h) | 50(wh), swap-in 50(wh)

5. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)

6. repeat 4, 5

Without workingset detection, the pages of this kind of workload can
never be promoted and thrashing happens forever.

* SOLUTION
Therefore, this patchset implements workingset detection.
All the infrastructure for workingset detecion is already implemented,
so there is not much work to do. First, extend workingset detection
code to deal with the anonymous LRU list. Then, make swap cache handles
the exceptional value for the shadow entry. Lastly, install/retrieve
the shadow value into/from the swap cache and check the refault distance.
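
Roughly, the eviction/refault path for an anonymous page then looks
like the condensed sketch below, pieced together from the
workingset_eviction(), get_shadow_from_swap_cache() and
workingset_refault() changes later in this series (declarations, error
handling and locking omitted):

/* Eviction: leave a shadow entry behind in the swap cache. */
shadow = workingset_eviction(page, target_memcg);
__delete_from_swap_cache(page, swap, shadow);

/* Swap-in: look up the shadow entry and feed it to the workingset code. */
shadow = get_shadow_from_swap_cache(entry);
if (shadow)
	workingset_refault(page, shadow);	/* may SetPageActive(page) */

/* Inside workingset_refault(), simplified: */
refault_distance = (refault - eviction) & EVICTION_MASK;
active = lruvec_page_state(eviction_lruvec,
			   page_lru_base_type(page) + LRU_ACTIVE);
if (refault_distance <= active)
	SetPageActive(page);	/* refaulting workingset page: activate it */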

* EXPERIMENT
I made a test program that imitates the above scenario and confirmed
that the problem exists. Then, I checked that this patchset fixes it.

My test setup is a virtual machine with 8 CPUs and 6100MB of memory,
but the amount of memory that the test program can use is only about
280MB. This is because the system uses a large RAM-backed swap device
and a large ramdisk to capture the trace.

The test scenario is as below.

1. allocate cold memory (512MB)
2. allocate hot-1 memory (96MB)
3. activate hot-1 memory (96MB)
4. allocate another hot-2 memory (96MB)
5. access cold memory (128MB)
6. access hot-2 memory (96MB)
7. repeat 5, 6

Since the hot-1 memory (96MB) is on the active list, the inactive list
can contain roughly 190MB of pages. The re-access distance of the hot-2
memory (96+128 MB) is more than 190MB, so it cannot be promoted without
workingset detection, and swap-in/out happens repeatedly. With this
patchset, workingset detection works and promotion happens, so
swap-in/out occurs less.
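
For illustration, a rough userspace sketch of the access pattern above
(hypothetical code, not the author's actual test program):

#include <stdlib.h>
#include <string.h>

#define MB	(1024UL * 1024UL)

/* Touch one byte per page to force the range into memory. */
static void touch(char *buf, size_t size)
{
	size_t i;

	for (i = 0; i < size; i += 4096)
		buf[i]++;
}

int main(void)
{
	char *cold = malloc(512 * MB);
	char *hot1 = malloc(96 * MB);
	char *hot2 = malloc(96 * MB);

	memset(cold, 0, 512 * MB);	/* 1. allocate cold memory */
	memset(hot1, 0, 96 * MB);	/* 2. allocate hot-1 memory */
	touch(hot1, 96 * MB);		/* 3. activate hot-1 memory */
	memset(hot2, 0, 96 * MB);	/* 4. allocate hot-2 memory */

	for (;;) {
		touch(cold, 128 * MB);	/* 5. access 128MB of cold memory */
		touch(hot2, 96 * MB);	/* 6. access hot-2 memory */
	}				/* 7. repeat 5, 6 */
	return 0;
}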

Here is the result. (average of 5 runs)

type      swap-in   swap-out
base       863240     989945
patch      681565     809273

As we can see, the patched kernel does less swap-in/out.

The patchset is based on v5.6.
It is also available at

https://github.com/JoonsooKim/linux/tree/improve-anonymous-lru-management-v5.00-v5.6

Enjoy it.

Thanks.

Joonsoo Kim (10):
  mm/vmscan: make active/inactive ratio as 1:1 for anon lru
  mm/vmscan: protect the workingset on anonymous LRU
  mm/workingset: extend the workingset detection for anon LRU
  mm/swapcache: support to handle the exceptional entries in swapcache
  mm/swap: charge the page when adding to the swap cache
  mm/swap: implement workingset detection for anonymous LRU
  mm/workingset: support to remember the previous owner of the page
  mm/swap: do not readahead if the previous owner of the swap entry
    isn't me
  mm/vmscan: restore active/inactive ratio for anonymous LRU
  mm/swap: reinforce the reclaim_stat changed by anon LRU algorithm
    change

 include/linux/mmzone.h        |  14 ++++--
 include/linux/pagevec.h       |   2 +-
 include/linux/swap.h          |  24 +++++++---
 include/linux/vmstat.h        |   2 +-
 include/trace/events/vmscan.h |   3 +-
 kernel/events/uprobes.c       |   2 +-
 mm/huge_memory.c              |   6 +--
 mm/khugepaged.c               |   2 +-
 mm/memcontrol.c               |  12 +++--
 mm/memory.c                   |  23 +++++++--
 mm/migrate.c                  |   2 +-
 mm/mlock.c                    |   2 +-
 mm/shmem.c                    |  21 +++++----
 mm/swap.c                     | 105 +++++++++++++++++++++++++++++++++++-------
 mm/swap_state.c               |  95 ++++++++++++++++++++++++++++++++------
 mm/swapfile.c                 |   2 +-
 mm/userfaultfd.c              |   2 +-
 mm/vmscan.c                   |  38 +++++++++------
 mm/vmstat.c                   |   6 ++-
 mm/workingset.c               |  90 +++++++++++++++++++++++++++---------
 mm/zswap.c                    |   2 +-
 21 files changed, 343 insertions(+), 112 deletions(-)

-- 
2.7.4



* [PATCH v5 01/10] mm/vmscan: make active/inactive ratio as 1:1 for anon lru
  2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
@ 2020-04-03  5:40 ` js1304
  2020-04-03  5:40 ` [PATCH v5 02/10] mm/vmscan: protect the workingset on anonymous LRU js1304
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: js1304 @ 2020-04-03  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

The current implementation of LRU management for anonymous pages has
some problems. The most important one is that it doesn't protect the
workingset, that is, the pages on the active LRU list. Although this
problem will be fixed by the following patches, some preparation is
required first and this patch provides it.

What the following patches do is restore workingset protection. In that
scheme, newly created or swapped-in pages start their lifetime on the
inactive list. If the inactive list is too small, there is not enough
chance for them to be referenced and the pages cannot become part of
the workingset.

In order to give newly added anonymous pages enough chance, this patch
makes the active/inactive LRU ratio 1:1. (Previously the same formula
was used for anon and file: for example, with 16GB on an LRU the hunk
below yields inactive_ratio = int_sqrt(10 * 16) = 12, i.e. roughly a
12:1 active:inactive balance point.)

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8763705..df92119 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2203,7 +2203,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
 	active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
 
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
-	if (gb)
+	if (gb && is_file_lru(inactive_lru))
 		inactive_ratio = int_sqrt(10 * gb);
 	else
 		inactive_ratio = 1;
-- 
2.7.4



* [PATCH v5 02/10] mm/vmscan: protect the workingset on anonymous LRU
  2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
  2020-04-03  5:40 ` [PATCH v5 01/10] mm/vmscan: make active/inactive ratio as 1:1 for anon lru js1304
@ 2020-04-03  5:40 ` js1304
  2020-04-03  5:40 ` [PATCH v5 03/10] mm/workingset: extend the workingset detection for anon LRU js1304
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: js1304 @ 2020-04-03  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

In the current implementation, a newly created or swapped-in anonymous
page starts on the active list. Growing the active list results in
rebalancing of the active/inactive lists, so old pages on the active
list are demoted to the inactive list. Hence, pages on the active list
aren't protected at all.

Following is an example of this situation.

Assume that there are 50 hot pages on the active list. Numbers denote
the number of pages on the active/inactive lists (active | inactive).

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

This patch tries to fix this issue.
As on the file LRU, newly created or swapped-in anonymous pages are now
inserted into the inactive list. They are promoted to the active list
once they have been referenced enough. This simple modification changes
the above example as follows.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

As you can see, the hot pages on the active list are now protected.

Note that this implementation has a drawback: a page cannot be promoted
and will be swapped out if its re-access distance is greater than the
size of the inactive list but less than the size of the total list
(active + inactive). For example, with 50 pages on each list, a page
re-accessed every 70 pages is evicted from the inactive list before it
is referenced again, even though it would fit in memory. To solve this
potential issue, a following patch will apply the workingset detection
scheme that was applied to the file LRU some time ago.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/swap.h    |  2 +-
 kernel/events/uprobes.c |  2 +-
 mm/huge_memory.c        |  6 +++---
 mm/khugepaged.c         |  2 +-
 mm/memory.c             |  9 ++++-----
 mm/migrate.c            |  2 +-
 mm/swap.c               | 13 +++++++------
 mm/swapfile.c           |  2 +-
 mm/userfaultfd.c        |  2 +-
 mm/vmscan.c             |  4 +---
 10 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e99f7a..954e13e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -344,7 +344,7 @@ extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
-extern void lru_cache_add_active_or_unevictable(struct page *page,
+extern void lru_cache_add_inactive_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ece7e13..14156fc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -190,7 +190,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		get_page(new_page);
 		page_add_new_anon_rmap(new_page, vma, addr, false);
 		mem_cgroup_commit_charge(new_page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(new_page, vma);
+		lru_cache_add_inactive_or_unevictable(new_page, vma);
 	} else
 		/* no new page, just dec_mm_counter for old_page */
 		dec_mm_counter(mm, MM_ANONPAGES);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 24ad53b..7da49cc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -641,7 +641,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr, true);
 		mem_cgroup_commit_charge(page, memcg, false, true);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1285,7 +1285,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
 		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
 		mem_cgroup_commit_charge(pages[i], memcg, false, false);
-		lru_cache_add_active_or_unevictable(pages[i], vma);
+		lru_cache_add_inactive_or_unevictable(pages[i], vma);
 		vmf->pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*vmf->pte));
 		set_pte_at(vma->vm_mm, haddr, vmf->pte, entry);
@@ -1438,7 +1438,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
-		lru_cache_add_active_or_unevictable(new_page, vma);
+		lru_cache_add_inactive_or_unevictable(new_page, vma);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		if (!page) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908..246c155 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1092,7 +1092,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	page_add_new_anon_rmap(new_page, vma, address, true);
 	mem_cgroup_commit_charge(new_page, memcg, false, true);
 	count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
-	lru_cache_add_active_or_unevictable(new_page, vma);
+	lru_cache_add_inactive_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
diff --git a/mm/memory.c b/mm/memory.c
index e8bfdf0..127379a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2568,7 +2568,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
 		page_add_new_anon_rmap(new_page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(new_page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(new_page, vma);
+		lru_cache_add_inactive_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
 		 * mmu page tables (such as kvm shadow page tables), we want the
@@ -3093,11 +3093,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(page != swapcache && swapcache)) {
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true, false);
-		activate_page(page);
 	}
 
 	swap_free(entry);
@@ -3241,7 +3240,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
-	lru_cache_add_active_or_unevictable(page, vma);
+	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3504,7 +3503,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page, false);
diff --git a/mm/migrate.c b/mm/migrate.c
index b109287..e2b504e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2808,7 +2808,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	page_add_new_anon_rmap(page, vma, addr, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	if (!is_zone_device_page(page))
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	get_page(page);
 
 	if (flush) {
diff --git a/mm/swap.c b/mm/swap.c
index cf39d24..d14a2fd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -448,23 +448,24 @@ void lru_cache_add(struct page *page)
 }
 
 /**
- * lru_cache_add_active_or_unevictable
+ * lru_cache_add_inactive_or_unevictable
  * @page:  the page to be added to LRU
  * @vma:   vma in which page is mapped for determining reclaimability
  *
- * Place @page on the active or unevictable LRU list, depending on its
+ * Place @page on the inactive or unevictable LRU list, depending on its
  * evictability.  Note that if the page is not evictable, it goes
  * directly back onto it's zone's unevictable list, it does NOT use a
  * per cpu pagevec.
  */
-void lru_cache_add_active_or_unevictable(struct page *page,
+void lru_cache_add_inactive_or_unevictable(struct page *page,
 					 struct vm_area_struct *vma)
 {
+	bool unevictable;
+
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
-	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
-		SetPageActive(page);
-	else if (!TestSetPageMlocked(page)) {
+	unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
+	if (unevictable && !TestSetPageMlocked(page)) {
 		/*
 		 * We use the irq-unsafe __mod_zone_page_stat because this
 		 * counter is not modified from interrupt context, and the pte
diff --git a/mm/swapfile.c b/mm/swapfile.c
index be33e61..e00c1b9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1888,7 +1888,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_inactive_or_unevictable(page, vma);
 	}
 	swap_free(entry);
 	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 1b0d7ab..875e329 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -120,7 +120,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	inc_mm_counter(dst_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
-	lru_cache_add_active_or_unevictable(page, dst_vma);
+	lru_cache_add_inactive_or_unevictable(page, dst_vma);
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df92119..9853035 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -996,8 +996,6 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_RECLAIM;
 
 	if (referenced_ptes) {
-		if (PageSwapBacked(page))
-			return PAGEREF_ACTIVATE;
 		/*
 		 * All mapped pages start out with page table
 		 * references from the instantiating fault, so we need
@@ -1020,7 +1018,7 @@ static enum page_references page_check_references(struct page *page,
 		/*
 		 * Activate file-backed executable pages after first usage.
 		 */
-		if (vm_flags & VM_EXEC)
+		if ((vm_flags & VM_EXEC) && !PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
 
 		return PAGEREF_KEEP;
-- 
2.7.4



* [PATCH v5 03/10] mm/workingset: extend the workingset detection for anon LRU
  2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
  2020-04-03  5:40 ` [PATCH v5 01/10] mm/vmscan: make active/inactive ratio as 1:1 for anon lru js1304
  2020-04-03  5:40 ` [PATCH v5 02/10] mm/vmscan: protect the workingset on anonymous LRU js1304
@ 2020-04-03  5:40 ` js1304
  2020-04-03  5:40 ` [PATCH v5 04/10] mm/swapcache: support to handle the exceptional entries in swapcache js1304
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: js1304 @ 2020-04-03  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

In a following patch, workingset detection will be applied to the
anonymous LRU. To prepare for that, this patch adds some code to
distinguish and handle both LRUs.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/mmzone.h | 14 +++++++++-----
 mm/memcontrol.c        | 12 ++++++++----
 mm/vmscan.c            | 15 ++++++++++-----
 mm/vmstat.c            |  6 ++++--
 mm/workingset.c        | 33 ++++++++++++++++++++-------------
 5 files changed, 51 insertions(+), 29 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f687..57fcb89 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -220,8 +220,12 @@ enum node_stat_item {
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_NODES,
-	WORKINGSET_REFAULT,
-	WORKINGSET_ACTIVATE,
+	WORKINGSET_REFAULT_BASE,
+	WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
+	WORKINGSET_REFAULT_FILE,
+	WORKINGSET_ACTIVATE_BASE,
+	WORKINGSET_ACTIVATE_ANON = WORKINGSET_ACTIVATE_BASE,
+	WORKINGSET_ACTIVATE_FILE,
 	WORKINGSET_RESTORE,
 	WORKINGSET_NODERECLAIM,
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
@@ -304,10 +308,10 @@ enum lruvec_flags {
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat	reclaim_stat;
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t			inactive_age;
+	/* Evictions & activations on the inactive list, anon=0, file=1 */
+	atomic_long_t			inactive_age[2];
 	/* Refaults at the time of last reclaim cycle */
-	unsigned long			refaults;
+	unsigned long			refaults[2];
 	/* Various lruvec state flags (enum lruvec_flags) */
 	unsigned long			flags;
 #ifdef CONFIG_MEMCG
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7ddf91c..dda278a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1444,10 +1444,14 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGMAJFAULT),
 		       memcg_events(memcg, PGMAJFAULT));
 
-	seq_buf_printf(&s, "workingset_refault %lu\n",
-		       memcg_page_state(memcg, WORKINGSET_REFAULT));
-	seq_buf_printf(&s, "workingset_activate %lu\n",
-		       memcg_page_state(memcg, WORKINGSET_ACTIVATE));
+	seq_buf_printf(&s, "workingset_refault_anon %lu\n",
+		       memcg_page_state(memcg, WORKINGSET_REFAULT_ANON));
+	seq_buf_printf(&s, "workingset_refault_file %lu\n",
+		       memcg_page_state(memcg, WORKINGSET_REFAULT_FILE));
+	seq_buf_printf(&s, "workingset_activate_anon %lu\n",
+		       memcg_page_state(memcg, WORKINGSET_ACTIVATE_ANON));
+	seq_buf_printf(&s, "workingset_activate_file %lu\n",
+		       memcg_page_state(memcg, WORKINGSET_ACTIVATE_FILE));
 	seq_buf_printf(&s, "workingset_nodereclaim %lu\n",
 		       memcg_page_state(memcg, WORKINGSET_NODERECLAIM));
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9853035..7196ccc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2705,7 +2705,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	if (!sc->force_deactivate) {
 		unsigned long refaults;
 
-		if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+		refaults = lruvec_page_state(target_lruvec,
+				WORKINGSET_ACTIVATE_ANON);
+		if (refaults != target_lruvec->refaults[0] ||
+			inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
 			sc->may_deactivate |= DEACTIVATE_ANON;
 		else
 			sc->may_deactivate &= ~DEACTIVATE_ANON;
@@ -2716,8 +2719,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		 * rid of any stale active pages quickly.
 		 */
 		refaults = lruvec_page_state(target_lruvec,
-					     WORKINGSET_ACTIVATE);
-		if (refaults != target_lruvec->refaults ||
+				WORKINGSET_ACTIVATE_FILE);
+		if (refaults != target_lruvec->refaults[1] ||
 		    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
 			sc->may_deactivate |= DEACTIVATE_FILE;
 		else
@@ -2994,8 +2997,10 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	unsigned long refaults;
 
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
-	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
-	target_lruvec->refaults = refaults;
+	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
+	target_lruvec->refaults[0] = refaults;
+	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_FILE);
+	target_lruvec->refaults[1] = refaults;
 }
 
 /*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d5337..3cdf8e9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1146,8 +1146,10 @@ const char * const vmstat_text[] = {
 	"nr_isolated_anon",
 	"nr_isolated_file",
 	"workingset_nodes",
-	"workingset_refault",
-	"workingset_activate",
+	"workingset_refault_anon",
+	"workingset_refault_file",
+	"workingset_activate_anon",
+	"workingset_activate_file",
 	"workingset_restore",
 	"workingset_nodereclaim",
 	"nr_anon_pages",
diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b..59415e0 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -6,6 +6,7 @@
  */
 
 #include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
 #include <linux/writeback.h>
 #include <linux/shmem_fs.h>
 #include <linux/pagemap.h>
@@ -156,7 +157,7 @@
  *
  *		Implementation
  *
- * For each node's file LRU lists, a counter for inactive evictions
+ * For each node's anon/file LRU lists, a counter for inactive evictions
  * and activations is maintained (node->inactive_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
@@ -213,7 +214,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat,
+				bool file)
 {
 	/*
 	 * Reclaiming a cgroup means reclaiming all its children in a
@@ -230,7 +232,7 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
 		struct lruvec *lruvec;
 
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		atomic_long_inc(&lruvec->inactive_age);
+		atomic_long_inc(&lruvec->inactive_age[file]);
 	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
 }
 
@@ -245,6 +247,7 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 {
 	struct pglist_data *pgdat = page_pgdat(page);
+	bool file = page_is_file_cache(page);
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	int memcgid;
@@ -254,12 +257,12 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	advance_inactive_age(page_memcg(page), pgdat);
+	advance_inactive_age(page_memcg(page), pgdat, file);
 
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->inactive_age);
+	eviction = atomic_long_read(&lruvec->inactive_age[file]);
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
@@ -274,15 +277,16 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
  */
 void workingset_refault(struct page *page, void *shadow)
 {
+	bool file = page_is_file_cache(page);
 	struct mem_cgroup *eviction_memcg;
 	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
 	struct pglist_data *pgdat;
-	unsigned long active_file;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
+	unsigned long active;
 	bool workingset;
 	int memcgid;
 
@@ -308,9 +312,11 @@ void workingset_refault(struct page *page, void *shadow)
 	eviction_memcg = mem_cgroup_from_id(memcgid);
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		goto out;
+
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->inactive_age);
-	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+	refault = atomic_long_read(&eviction_lruvec->inactive_age[file]);
+	active = lruvec_page_state(eviction_lruvec,
+				page_lru_base_type(page) + LRU_ACTIVE);
 
 	/*
 	 * Calculate the refault distance
@@ -341,19 +347,19 @@ void workingset_refault(struct page *page, void *shadow)
 	memcg = page_memcg(page);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
-	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
+	inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
 
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't act on pages that couldn't stay resident even if all
 	 * the memory was available to the page cache.
 	 */
-	if (refault_distance > active_file)
+	if (refault_distance > active)
 		goto out;
 
 	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
-	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+	advance_inactive_age(memcg, pgdat, file);
+	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file);
 
 	/* Page was active prior to eviction */
 	if (workingset) {
@@ -370,6 +376,7 @@ void workingset_refault(struct page *page, void *shadow)
  */
 void workingset_activation(struct page *page)
 {
+	bool file = page_is_file_cache(page);
 	struct mem_cgroup *memcg;
 
 	rcu_read_lock();
@@ -383,7 +390,7 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	advance_inactive_age(memcg, page_pgdat(page));
+	advance_inactive_age(memcg, page_pgdat(page), file);
 out:
 	rcu_read_unlock();
 }
-- 
2.7.4



* [PATCH v5 04/10] mm/swapcache: support to handle the exceptional entries in swapcache
  2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
                   ` (2 preceding siblings ...)
  2020-04-03  5:40 ` [PATCH v5 03/10] mm/workingset: extend the workingset detection for anon LRU js1304
@ 2020-04-03  5:40 ` js1304
  2020-04-03  5:40 ` [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache js1304
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: js1304 @ 2020-04-03  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

The swap cache doesn't handle exceptional (shadow) entries, since there
has been no user of them so far. In a following patch, workingset
detection for anonymous pages will be implemented, and it stores shadow
entries as exceptional entries in the swap cache. So, we need to handle
exceptional entries, and this patch implements that.
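
For background, the "exceptional" entries here are XArray value
entries: tagged integers stored where a struct page pointer would
normally live. A minimal sketch using the stock XArray helpers (the
helper names are illustrative, not part of the patch):

#include <linux/xarray.h>

/* A shadow entry is an unsigned long packed into an XArray value entry. */
static inline void *shadow_from_packed(unsigned long packed)
{
	return xa_mk_value(packed);
}

/* Distinguish a shadow (value) entry from a real page pointer. */
static inline bool entry_is_shadow(void *entry)
{
	return xa_is_value(entry);
}

add_to_swap_cache() below uses exactly this distinction to count and
replace stale shadow entries, and __delete_from_swap_cache() stores the
shadow entry (if any) in place of the departing page.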

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/swap.h | 10 ++++++----
 mm/shmem.c           |  3 ++-
 mm/swap_state.c      | 26 ++++++++++++++++++++------
 mm/vmscan.c          |  2 +-
 4 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 954e13e..273de48 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -408,9 +408,11 @@ extern struct address_space *swapper_spaces[];
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *page);
-extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
+			gfp_t gfp, void **shadowp);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
-extern void __delete_from_swap_cache(struct page *, swp_entry_t entry);
+extern void __delete_from_swap_cache(struct page *page,
+			swp_entry_t entry, void *shadow);
 extern void delete_from_swap_cache(struct page *);
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
@@ -565,13 +567,13 @@ static inline int add_to_swap(struct page *page)
 }
 
 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
-							gfp_t gfp_mask)
+					gfp_t gfp_mask, void **shadowp)
 {
 	return -1;
 }
 
 static inline void __delete_from_swap_cache(struct page *page,
-							swp_entry_t entry)
+					swp_entry_t entry, void *shadow)
 {
 }
 
diff --git a/mm/shmem.c b/mm/shmem.c
index aad3ba7..9e34b4e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1370,7 +1370,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		list_add(&info->swaplist, &shmem_swaplist);
 
 	if (add_to_swap_cache(page, swap,
-			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN) == 0) {
+			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
+			NULL) == 0) {
 		spin_lock_irq(&info->lock);
 		shmem_recalc_inode(inode);
 		info->swapped++;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8e7ce9a..f06af84 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -111,12 +111,15 @@ void show_swap_cache_info(void)
  * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
  * but sets SwapCache flag and private instead of mapping and index.
  */
-int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
+int add_to_swap_cache(struct page *page, swp_entry_t entry,
+			gfp_t gfp, void **shadowp)
 {
 	struct address_space *address_space = swap_address_space(entry);
 	pgoff_t idx = swp_offset(entry);
 	XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page));
 	unsigned long i, nr = compound_nr(page);
+	unsigned long nrexceptional = 0;
+	void *old;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapCache(page), page);
@@ -132,10 +135,17 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
 			goto unlock;
 		for (i = 0; i < nr; i++) {
 			VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
+			old = xas_load(&xas);
+			if (xa_is_value(old)) {
+				nrexceptional++;
+				if (shadowp)
+					*shadowp = old;
+			}
 			set_page_private(page + i, entry.val + i);
 			xas_store(&xas, page);
 			xas_next(&xas);
 		}
+		address_space->nrexceptional -= nrexceptional;
 		address_space->nrpages += nr;
 		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
 		ADD_CACHE_INFO(add_total, nr);
@@ -155,7 +165,8 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
  * This must be called only on pages that have
  * been verified to be in the swap cache.
  */
-void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
+void __delete_from_swap_cache(struct page *page,
+			swp_entry_t entry, void *shadow)
 {
 	struct address_space *address_space = swap_address_space(entry);
 	int i, nr = hpage_nr_pages(page);
@@ -167,12 +178,14 @@ void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(PageWriteback(page), page);
 
 	for (i = 0; i < nr; i++) {
-		void *entry = xas_store(&xas, NULL);
+		void *entry = xas_store(&xas, shadow);
 		VM_BUG_ON_PAGE(entry != page, entry);
 		set_page_private(page + i, 0);
 		xas_next(&xas);
 	}
 	ClearPageSwapCache(page);
+	if (shadow)
+		address_space->nrexceptional += nr;
 	address_space->nrpages -= nr;
 	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
 	ADD_CACHE_INFO(del_total, nr);
@@ -209,7 +222,7 @@ int add_to_swap(struct page *page)
 	 * Add it to the swap cache.
 	 */
 	err = add_to_swap_cache(page, entry,
-			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
+			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
 	if (err)
 		/*
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
@@ -247,7 +260,7 @@ void delete_from_swap_cache(struct page *page)
 	struct address_space *address_space = swap_address_space(entry);
 
 	xa_lock_irq(&address_space->i_pages);
-	__delete_from_swap_cache(page, entry);
+	__delete_from_swap_cache(page, entry, NULL);
 	xa_unlock_irq(&address_space->i_pages);
 
 	put_swap_page(page, entry);
@@ -418,7 +431,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/* May fail (-ENOMEM) if XArray node allocation failed. */
 		__SetPageLocked(new_page);
 		__SetPageSwapBacked(new_page);
-		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
+		err = add_to_swap_cache(new_page, entry,
+				gfp_mask & GFP_KERNEL, NULL);
 		if (likely(!err)) {
 			/* Initiate read into locked page */
 			SetPageWorkingset(new_page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7196ccc..d46e3e5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -895,7 +895,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		mem_cgroup_swapout(page, swap);
-		__delete_from_swap_cache(page, swap);
+		__delete_from_swap_cache(page, swap, NULL);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
 	} else {
-- 
2.7.4



* [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache
  2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
                   ` (3 preceding siblings ...)
  2020-04-03  5:40 ` [PATCH v5 04/10] mm/swapcache: support to handle the exceptional entries in swapcache js1304
@ 2020-04-03  5:40 ` js1304
  2020-04-03 18:29   ` Yang Shi
  2020-04-16 16:11   ` Johannes Weiner
  2020-04-03  5:40 ` [PATCH v5 06/10] mm/swap: implement workingset detection for anonymous LRU js1304
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 26+ messages in thread
From: js1304 @ 2020-04-03  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Currently, some swapped-in pages are not charged to the memcg until the
page is actually accessed. I checked the code and found that this could
cause a problem. With this implementation, even if the memcg is
enabled, one can consume a lot of system memory by exploiting this
hole. For example, one can have all of one's pages swapped out and then
call madvise_willneed() to load all the swapped-out pages back without
putting pressure on the memcg. Although actual access still requires
charging, being able to load the swapped-out pages into memory without
pressuring the memcg is a really big benefit.

Also, for the workingset detection implemented in a following patch,
a memcg should be committed before workingset detection is executed.
For this purpose, the best solution, I think, is to charge the page
when adding it to the swap cache. Charging there is not that hard: the
caller adding the page to the swap cache has enough information about
the memcg to charge, so all we need to do is pass that information to
the right place.

With this patch, a specific memcg could be pressured more, since
readahead pages are now also charged to it. This could degrade
performance for that user, but it is fair since the readahead is done
on that user's behalf.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/swap.h |  4 ++--
 mm/shmem.c           | 18 ++++++++++--------
 mm/swap_state.c      | 25 +++++++++++++++++++++----
 3 files changed, 33 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 273de48..eea0700 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -409,7 +409,7 @@ extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *page);
 extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
-			gfp_t gfp, void **shadowp);
+			struct vm_area_struct *vma, gfp_t gfp, void **shadowp);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
 extern void __delete_from_swap_cache(struct page *page,
 			swp_entry_t entry, void *shadow);
@@ -567,7 +567,7 @@ static inline int add_to_swap(struct page *page)
 }
 
 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
-					gfp_t gfp_mask, void **shadowp)
+			struct vm_area_struct *vma, gfp_t gfp, void **shadowp)
 {
 	return -1;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 9e34b4e..8e28c1f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1369,7 +1369,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (list_empty(&info->swaplist))
 		list_add(&info->swaplist, &shmem_swaplist);
 
-	if (add_to_swap_cache(page, swap,
+	if (add_to_swap_cache(page, swap, NULL,
 			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
 			NULL) == 0) {
 		spin_lock_irq(&info->lock);
@@ -1434,10 +1434,11 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
 #endif
 
 static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
-		struct shmem_inode_info *info, pgoff_t index)
+			struct mm_struct *mm, struct shmem_inode_info *info,
+			pgoff_t index)
 {
 	/* Create a pseudo vma that just contains the policy */
-	vma_init(vma, NULL);
+	vma_init(vma, mm);
 	/* Bias interleave by inode number to distribute better across nodes */
 	vma->vm_pgoff = index + info->vfs_inode.i_ino;
 	vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
@@ -1450,13 +1451,14 @@ static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
 }
 
 static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
-			struct shmem_inode_info *info, pgoff_t index)
+			struct mm_struct *mm, struct shmem_inode_info *info,
+			pgoff_t index)
 {
 	struct vm_area_struct pvma;
 	struct page *page;
 	struct vm_fault vmf;
 
-	shmem_pseudo_vma_init(&pvma, info, index);
+	shmem_pseudo_vma_init(&pvma, mm, info, index);
 	vmf.vma = &pvma;
 	vmf.address = 0;
 	page = swap_cluster_readahead(swap, gfp, &vmf);
@@ -1481,7 +1483,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
 								XA_PRESENT))
 		return NULL;
 
-	shmem_pseudo_vma_init(&pvma, info, hindex);
+	shmem_pseudo_vma_init(&pvma, NULL, info, hindex);
 	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
 			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
 	shmem_pseudo_vma_destroy(&pvma);
@@ -1496,7 +1498,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
 	struct vm_area_struct pvma;
 	struct page *page;
 
-	shmem_pseudo_vma_init(&pvma, info, index);
+	shmem_pseudo_vma_init(&pvma, NULL, info, index);
 	page = alloc_page_vma(gfp, &pvma, 0);
 	shmem_pseudo_vma_destroy(&pvma);
 
@@ -1652,7 +1654,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
 			count_memcg_event_mm(charge_mm, PGMAJFAULT);
 		}
 		/* Here we actually start the io */
-		page = shmem_swapin(swap, gfp, info, index);
+		page = shmem_swapin(swap, gfp, charge_mm, info, index);
 		if (!page) {
 			error = -ENOMEM;
 			goto failed;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index f06af84..1db73a2 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -112,7 +112,7 @@ void show_swap_cache_info(void)
  * but sets SwapCache flag and private instead of mapping and index.
  */
 int add_to_swap_cache(struct page *page, swp_entry_t entry,
-			gfp_t gfp, void **shadowp)
+			struct vm_area_struct *vma, gfp_t gfp, void **shadowp)
 {
 	struct address_space *address_space = swap_address_space(entry);
 	pgoff_t idx = swp_offset(entry);
@@ -120,14 +120,26 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
 	unsigned long i, nr = compound_nr(page);
 	unsigned long nrexceptional = 0;
 	void *old;
+	bool compound = !!compound_order(page);
+	int error;
+	struct mm_struct *mm = vma ? vma->vm_mm : current->mm;
+	struct mem_cgroup *memcg;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapCache(page), page);
 	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
 	page_ref_add(page, nr);
+	/* PageSwapCache() prevent the page from being re-charged */
 	SetPageSwapCache(page);
 
+	error = mem_cgroup_try_charge(page, mm, gfp, &memcg, compound);
+	if (error) {
+		ClearPageSwapCache(page);
+		page_ref_sub(page, nr);
+		return error;
+	}
+
 	do {
 		xas_lock_irq(&xas);
 		xas_create_range(&xas);
@@ -153,11 +165,16 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp));
 
-	if (!xas_error(&xas))
+	if (!xas_error(&xas)) {
+		mem_cgroup_commit_charge(page, memcg, false, compound);
 		return 0;
+	}
+
+	mem_cgroup_cancel_charge(page, memcg, compound);
 
 	ClearPageSwapCache(page);
 	page_ref_sub(page, nr);
+
 	return xas_error(&xas);
 }
 
@@ -221,7 +238,7 @@ int add_to_swap(struct page *page)
 	/*
 	 * Add it to the swap cache.
 	 */
-	err = add_to_swap_cache(page, entry,
+	err = add_to_swap_cache(page, entry, NULL,
 			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
 	if (err)
 		/*
@@ -431,7 +448,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/* May fail (-ENOMEM) if XArray node allocation failed. */
 		__SetPageLocked(new_page);
 		__SetPageSwapBacked(new_page);
-		err = add_to_swap_cache(new_page, entry,
+		err = add_to_swap_cache(new_page, entry, vma,
 				gfp_mask & GFP_KERNEL, NULL);
 		if (likely(!err)) {
 			/* Initiate read into locked page */
-- 
2.7.4



* [PATCH v5 06/10] mm/swap: implement workingset detection for anonymous LRU
  2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
                   ` (4 preceding siblings ...)
  2020-04-03  5:40 ` [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache js1304
@ 2020-04-03  5:40 ` js1304
  2020-04-03  5:40 ` [PATCH v5 07/10] mm/workingset: support to remember the previous owner of the page js1304
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: js1304 @ 2020-04-03  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

This patch implements workingset detection for the anonymous LRU.
All the infrastructure is implemented by the previous patches, so this
patch just activates workingset detection by installing and retrieving
the shadow entries.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/swap.h |  6 ++++++
 mm/memory.c          |  8 ++++++++
 mm/swap.c            |  3 +--
 mm/swap_state.c      | 22 ++++++++++++++++++++--
 mm/vmscan.c          |  7 +++++--
 5 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index eea0700..97e8a2e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -408,6 +408,7 @@ extern struct address_space *swapper_spaces[];
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *page);
+extern void *get_shadow_from_swap_cache(swp_entry_t entry);
 extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
 			struct vm_area_struct *vma, gfp_t gfp, void **shadowp);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
@@ -566,6 +567,11 @@ static inline int add_to_swap(struct page *page)
 	return 0;
 }
 
+static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+	return NULL;
+}
+
 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
 			struct vm_area_struct *vma, gfp_t gfp, void **shadowp)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 127379a..9effb23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2945,6 +2945,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	int locked;
 	int exclusive = 0;
 	vm_fault_t ret = 0;
+	void *shadow = NULL;
 
 	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
 		goto out;
@@ -2983,6 +2984,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				__SetPageLocked(page);
 				__SetPageSwapBacked(page);
 				set_page_private(page, entry.val);
+				shadow = get_shadow_from_swap_cache(entry);
 				lru_cache_add_anon(page);
 				swap_readpage(page, true);
 			}
@@ -3099,6 +3101,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		mem_cgroup_commit_charge(page, memcg, true, false);
 	}
 
+	if (shadow) {
+		workingset_refault(page, shadow);
+		if (PageActive(page) && PageLRU(page))
+			activate_page(page);
+	}
+
 	swap_free(entry);
 	if (mem_cgroup_swap_full(page) ||
 	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
diff --git a/mm/swap.c b/mm/swap.c
index d14a2fd..d9f2005 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -394,8 +394,7 @@ void mark_page_accessed(struct page *page)
 		else
 			__lru_cache_activate_page(page);
 		ClearPageReferenced(page);
-		if (page_is_file_cache(page))
-			workingset_activation(page);
+		workingset_activation(page);
 	}
 	if (page_is_idle(page))
 		clear_page_idle(page);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1db73a2..de994f9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -107,6 +107,20 @@ void show_swap_cache_info(void)
 	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
 }
 
+void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+	struct address_space *address_space = swap_address_space(entry);
+	pgoff_t idx = swp_offset(entry);
+	struct page *page;
+
+	page = find_get_entry(address_space, idx);
+	if (xa_is_value(page))
+		return page;
+	if (page)
+		put_page(page);
+	return NULL;
+}
+
 /*
  * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
  * but sets SwapCache flag and private instead of mapping and index.
@@ -393,6 +407,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	struct page *found_page = NULL, *new_page = NULL;
 	struct swap_info_struct *si;
 	int err;
+	void *shadow;
 	*new_page_allocated = false;
 
 	do {
@@ -448,12 +463,15 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/* May fail (-ENOMEM) if XArray node allocation failed. */
 		__SetPageLocked(new_page);
 		__SetPageSwapBacked(new_page);
+		shadow = NULL;
 		err = add_to_swap_cache(new_page, entry, vma,
-				gfp_mask & GFP_KERNEL, NULL);
+				gfp_mask & GFP_KERNEL, &shadow);
 		if (likely(!err)) {
 			/* Initiate read into locked page */
 			SetPageWorkingset(new_page);
-			lru_cache_add_anon(new_page);
+			if (shadow)
+				workingset_refault(new_page, shadow);
+			lru_cache_add(new_page);
 			*new_page_allocated = true;
 			return new_page;
 		}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d46e3e5..d8aa42a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -853,6 +853,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 {
 	unsigned long flags;
 	int refcount;
+	void *shadow = NULL;
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -895,12 +896,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		mem_cgroup_swapout(page, swap);
-		__delete_from_swap_cache(page, swap, NULL);
+		if (reclaimed && !mapping_exiting(mapping))
+			shadow = workingset_eviction(page, target_memcg);
+		__delete_from_swap_cache(page, swap, shadow);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
 	} else {
 		void (*freepage)(struct page *);
-		void *shadow = NULL;
 
 		freepage = mapping->a_ops->freepage;
 		/*
@@ -1462,6 +1464,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			SetPageActive(page);
 			stat->nr_activate[type] += nr_pages;
 			count_memcg_page_event(page, PGACTIVATE);
+			workingset_activation(page);
 		}
 keep_locked:
 		unlock_page(page);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 07/10] mm/workingset: support to remember the previous owner of the page
  2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
                   ` (5 preceding siblings ...)
  2020-04-03  5:40 ` [PATCH v5 06/10] mm/swap: implement workingset detection for anonymous LRU js1304
@ 2020-04-03  5:40 ` js1304
  2020-04-03  5:40 ` [PATCH v5 08/10] mm/swap: do not readahead if the previous owner of the swap entry isn't me js1304
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: js1304 @ 2020-04-03  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

This patch adds support for remembering the previous owner of the page.
Since there are not enough spare bits in the shadow entry for this
information, only a few bits of the page's memcg id are stored. Although
this information is lossy, it is sufficient to check the previous owner
with reasonable accuracy.

This patch is preparation for the following patch, "mm/swap: do not
readahead if the previous owner of the swap entry isn't me".
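
As an illustration only (not part of the diff below; all names are taken
from it), the tag is just the low PAGE_MEMCG_TAG_SHIFT bits of the page's
memcg id, packed into the shadow entry next to the reclaimer's memcg id:

  /* illustrative sketch of the pack_shadow() change below */
  int page_memcg_tag = mem_cgroup_id(page_memcg(page)) &
                       ((1UL << PAGE_MEMCG_TAG_SHIFT) - 1);

  eviction = (eviction << PAGE_MEMCG_TAG_SHIFT) | page_memcg_tag;
  eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;

Because only 4 (32-bit) or 8 (64-bit) bits are kept, the comparison can
produce false positives but never false negatives, which is good enough
for this purpose.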

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/workingset.c | 41 +++++++++++++++++++++++++++++++----------
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 59415e0..871b867 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -168,8 +168,20 @@
  * refault distance will immediately activate the refaulting page.
  */
 
+/*
+ * memcg_tag will be used to check the previous owner of the page.
+ * Since we don't have enough spare bits in shadow entry, just a few
+ * bits are used to approximate the previous owner.
+ */
+#if BITS_PER_LONG == 32
+#define PAGE_MEMCG_TAG_SHIFT 4
+#else
+#define PAGE_MEMCG_TAG_SHIFT 8
+#endif
+
 #define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) +	\
-			 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
+			 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT + \
+			 PAGE_MEMCG_TAG_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -182,11 +194,12 @@
  */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
-			 bool workingset)
+static void *pack_shadow(int memcgid, int page_memcg_tag, pg_data_t *pgdat,
+			unsigned long eviction, bool workingset)
 {
 	eviction >>= bucket_order;
 	eviction &= EVICTION_MASK;
+	eviction = (eviction << PAGE_MEMCG_TAG_SHIFT) | page_memcg_tag;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
 	eviction = (eviction << 1) | workingset;
@@ -194,11 +207,12 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 	return xa_mk_value(eviction);
 }
 
-static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-			  unsigned long *evictionp, bool *workingsetp)
+static void unpack_shadow(void *shadow, int *memcgidp, int *page_memcg_tagp,
+			pg_data_t **pgdat, unsigned long *evictionp,
+			bool *workingsetp)
 {
 	unsigned long entry = xa_to_value(shadow);
-	int memcgid, nid;
+	int memcgid, page_memcg_tag, nid;
 	bool workingset;
 
 	workingset = entry & 1;
@@ -207,8 +221,11 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
 	entry >>= MEM_CGROUP_ID_SHIFT;
+	page_memcg_tag = entry & ((1UL << PAGE_MEMCG_TAG_SHIFT) - 1);
+	entry >>= PAGE_MEMCG_TAG_SHIFT;
 
 	*memcgidp = memcgid;
+	*page_memcg_tagp = page_memcg_tag;
 	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
 	*workingsetp = workingset;
@@ -248,9 +265,9 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 {
 	struct pglist_data *pgdat = page_pgdat(page);
 	bool file = page_is_file_cache(page);
+	int memcgid, page_memcg_tag;
 	unsigned long eviction;
 	struct lruvec *lruvec;
-	int memcgid;
 
 	/* Page is fully exclusive and pins page->mem_cgroup */
 	VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -262,8 +279,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
+	page_memcg_tag = mem_cgroup_id(page_memcg(page));
+	page_memcg_tag &= (1UL << PAGE_MEMCG_TAG_SHIFT) - 1;
 	eviction = atomic_long_read(&lruvec->inactive_age[file]);
-	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
+	return pack_shadow(memcgid, page_memcg_tag, pgdat, eviction,
+			PageWorkingset(page));
 }
 
 /**
@@ -281,6 +301,7 @@ void workingset_refault(struct page *page, void *shadow)
 	struct mem_cgroup *eviction_memcg;
 	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
+	int memcgid, page_memcg_tag;
 	struct pglist_data *pgdat;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
@@ -288,9 +309,9 @@ void workingset_refault(struct page *page, void *shadow)
 	unsigned long refault;
 	unsigned long active;
 	bool workingset;
-	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+	unpack_shadow(shadow, &memcgid, &page_memcg_tag,
+			&pgdat, &eviction, &workingset);
 
 	rcu_read_lock();
 	/*
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 08/10] mm/swap: do not readahead if the previous owner of the swap entry isn't me
  2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
                   ` (6 preceding siblings ...)
  2020-04-03  5:40 ` [PATCH v5 07/10] mm/workingset: support to remember the previous owner of the page js1304
@ 2020-04-03  5:40 ` js1304
  2020-04-03  5:40 ` [PATCH v5 09/10] mm/vmscan: restore active/inactive ratio for anonymous LRU js1304
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: js1304 @ 2020-04-03  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

To implement workingset detection for anonymous pages, a page that is
swapped in by readahead but not yet touched is also charged to the memcg
when it enters the swap cache. However, if such a readahead page is not
ours, it is wrongly charged to us and our memcg could be wrongly
pressured.

To solve this problem, this patch skips readahead if the previous owner
of the swap entry isn't me. With this patch, readahead only happens for
my own entries, so the wrong memcg pressure cannot occur.

The purpose of readahead is to improve performance by exploiting the
locality of the swap space. If the entry isn't mine, there is no such
locality, so it is safe to skip readahead.
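
The check itself is small; stripped down, the hunk in add_to_swap_cache()
below boils down to the following (simplified sketch, error paths
trimmed):

  if (readahead) {
          void *shadow = get_shadow_from_swap_cache(entry);

          if (shadow) {
                  struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);

                  /* previous owner's memcg tag doesn't match mine */
                  if (memcg && !shadow_from_memcg(shadow, memcg)) {
                          mem_cgroup_put(memcg);
                          return -EINVAL;
                  }
                  mem_cgroup_put(memcg);
          }
  }

__read_swap_cache_async() then treats -EINVAL like -ENOMEM and gives up
on that entry, so the readahead window simply skips entries it doesn't
own.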

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/swap.h |  9 ++++++---
 mm/shmem.c           |  2 +-
 mm/swap_state.c      | 34 ++++++++++++++++++++++++++--------
 mm/workingset.c      | 16 ++++++++++++++++
 mm/zswap.c           |  2 +-
 5 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 97e8a2e..d204cc7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,6 +307,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
+bool shadow_from_memcg(void *shadow, struct mem_cgroup *memcg);
 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
@@ -410,7 +411,8 @@ extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *page);
 extern void *get_shadow_from_swap_cache(swp_entry_t entry);
 extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
-			struct vm_area_struct *vma, gfp_t gfp, void **shadowp);
+			struct vm_area_struct *vma, gfp_t gfp,
+			void **shadowp, bool readahead);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
 extern void __delete_from_swap_cache(struct page *page,
 			swp_entry_t entry, void *shadow);
@@ -425,7 +427,7 @@ extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
 			bool do_poll);
 extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
 			struct vm_area_struct *vma, unsigned long addr,
-			bool *new_page_allocated);
+			bool *new_page_allocated, bool readahead);
 extern struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 				struct vm_fault *vmf);
 extern struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
@@ -573,7 +575,8 @@ static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
 }
 
 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
-			struct vm_area_struct *vma, gfp_t gfp, void **shadowp)
+				struct vm_area_struct *vma, gfp_t gfp,
+				void **shadowp, bool readahead)
 {
 	return -1;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 8e28c1f..483c32d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1371,7 +1371,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 
 	if (add_to_swap_cache(page, swap, NULL,
 			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
-			NULL) == 0) {
+			NULL, false) == 0) {
 		spin_lock_irq(&info->lock);
 		shmem_recalc_inode(inode);
 		info->swapped++;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index de994f9..9bf0251f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -126,7 +126,8 @@ void *get_shadow_from_swap_cache(swp_entry_t entry)
  * but sets SwapCache flag and private instead of mapping and index.
  */
 int add_to_swap_cache(struct page *page, swp_entry_t entry,
-			struct vm_area_struct *vma, gfp_t gfp, void **shadowp)
+			struct vm_area_struct *vma, gfp_t gfp,
+			void **shadowp, bool readahead)
 {
 	struct address_space *address_space = swap_address_space(entry);
 	pgoff_t idx = swp_offset(entry);
@@ -143,6 +144,23 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
 	VM_BUG_ON_PAGE(PageSwapCache(page), page);
 	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
+	if (readahead) {
+		void *shadow = get_shadow_from_swap_cache(entry);
+
+		/*
+		 * In readahead case, check the memcgid of the shadow entry
+		 * in order to stop to readahead other's page
+		 */
+		if (shadow) {
+			memcg = get_mem_cgroup_from_mm(mm);
+			if (memcg && !shadow_from_memcg(shadow, memcg)) {
+				mem_cgroup_put(memcg);
+				return -EINVAL;
+			}
+			mem_cgroup_put(memcg);
+		}
+	}
+
 	page_ref_add(page, nr);
 	/* PageSwapCache() prevent the page from being re-charged */
 	SetPageSwapCache(page);
@@ -253,7 +271,7 @@ int add_to_swap(struct page *page)
 	 * Add it to the swap cache.
 	 */
 	err = add_to_swap_cache(page, entry, NULL,
-			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
+			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL, false);
 	if (err)
 		/*
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
@@ -402,7 +420,7 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
 
 struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr,
-			bool *new_page_allocated)
+			bool *new_page_allocated, bool readahead)
 {
 	struct page *found_page = NULL, *new_page = NULL;
 	struct swap_info_struct *si;
@@ -465,7 +483,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		__SetPageSwapBacked(new_page);
 		shadow = NULL;
 		err = add_to_swap_cache(new_page, entry, vma,
-				gfp_mask & GFP_KERNEL, &shadow);
+				gfp_mask & GFP_KERNEL, &shadow, readahead);
 		if (likely(!err)) {
 			/* Initiate read into locked page */
 			SetPageWorkingset(new_page);
@@ -481,7 +499,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * clear SWAP_HAS_CACHE flag.
 		 */
 		put_swap_page(new_page, entry);
-	} while (err != -ENOMEM);
+	} while (err != -ENOMEM && err != -EINVAL);
 
 	if (new_page)
 		put_page(new_page);
@@ -499,7 +517,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 {
 	bool page_was_allocated;
 	struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
-			vma, addr, &page_was_allocated);
+			vma, addr, &page_was_allocated, false);
 
 	if (page_was_allocated)
 		swap_readpage(retpage, do_poll);
@@ -624,7 +642,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		/* Ok, do the async read-ahead now */
 		page = __read_swap_cache_async(
 			swp_entry(swp_type(entry), offset),
-			gfp_mask, vma, addr, &page_allocated);
+			gfp_mask, vma, addr, &page_allocated, true);
 		if (!page)
 			continue;
 		if (page_allocated) {
@@ -796,7 +814,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 		if (unlikely(non_swap_entry(entry)))
 			continue;
 		page = __read_swap_cache_async(entry, gfp_mask, vma,
-					       vmf->address, &page_allocated);
+					vmf->address, &page_allocated, true);
 		if (!page)
 			continue;
 		if (page_allocated) {
diff --git a/mm/workingset.c b/mm/workingset.c
index 871b867..1fa46cf 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -231,6 +231,22 @@ static void unpack_shadow(void *shadow, int *memcgidp, int *page_memcg_tagp,
 	*workingsetp = workingset;
 }
 
+bool shadow_from_memcg(void *shadow, struct mem_cgroup *memcg)
+{
+	int memcgid, page_memcg_tag;
+	struct pglist_data *pgdat;
+	unsigned long eviction;
+	bool workingset;
+	int memcg_tag;
+
+	unpack_shadow(shadow, &memcgid, &page_memcg_tag,
+			&pgdat, &eviction, &workingset);
+	memcg_tag = mem_cgroup_id(memcg);
+	memcg_tag &= (1UL << PAGE_MEMCG_TAG_SHIFT) - 1;
+
+	return memcg_tag == page_memcg_tag;
+}
+
 static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat,
 				bool file)
 {
diff --git a/mm/zswap.c b/mm/zswap.c
index 55094e6..92f1463 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -848,7 +848,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
 	bool page_was_allocated;
 
 	*retpage = __read_swap_cache_async(entry, GFP_KERNEL,
-			NULL, 0, &page_was_allocated);
+			NULL, 0, &page_was_allocated, false);
 	if (page_was_allocated)
 		return ZSWAP_SWAPCACHE_NEW;
 	if (!*retpage)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 09/10] mm/vmscan: restore active/inactive ratio for anonymous LRU
  2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
                   ` (7 preceding siblings ...)
  2020-04-03  5:40 ` [PATCH v5 08/10] mm/swap: do not readahead if the previous owner of the swap entry isn't me js1304
@ 2020-04-03  5:40 ` js1304
  2020-04-03  5:45 ` [PATCH v5 10/10] mm/swap: reinforce the reclaim_stat changed by anon LRU algorithm change js1304
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: js1304 @ 2020-04-03  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Now that workingset detection is implemented for the anonymous LRU,
we don't have to worry about the workingset being missed because of
the active/inactive ratio. Let's restore the ratio.
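
With the is_file_lru() check removed, the anonymous LRU gets the same
sqrt-based inactive target as the file LRU. Roughly (illustrative
numbers, assuming the usual "inactive * inactive_ratio < active"
comparison in inactive_is_low()):

  total anon      inactive_ratio           target inactive
  --------------------------------------------------------
  < 1GB           1                        ~1/2 of the pages
    1GB           int_sqrt(10)   =  3      ~1/4  (~250MB)
   10GB           int_sqrt(100)  = 10      ~1/11 (~0.9GB)
  100GB           int_sqrt(1000) = 31      ~1/32 (~3GB)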

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d8aa42a..9c5d20c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2204,7 +2204,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
 	active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
 
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
-	if (gb && is_file_lru(inactive_lru))
+	if (gb)
 		inactive_ratio = int_sqrt(10 * gb);
 	else
 		inactive_ratio = 1;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 10/10] mm/swap: reinforce the reclaim_stat changed by anon LRU algorithm change
  2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
                   ` (8 preceding siblings ...)
  2020-04-03  5:40 ` [PATCH v5 09/10] mm/vmscan: restore active/inactive ratio for anonymous LRU js1304
@ 2020-04-03  5:45 ` js1304
       [not found] ` <20200406091814.17256-1-hdanton@sina.com>
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: js1304 @ 2020-04-03  5:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

reclaim_stat's scanned/rotated counters are used to control the scan
ratio between the file and anonymous LRUs. The earlier patch in this
series that protects anonymous pages on the active LRU changes the LRU
management algorithm for anonymous pages. Due to this change, the
reclaim_stat is affected and it causes a regression on a kernel build
workload.

There are three cases that affect the reclaim_stat.

First, before the patch, all new anonymous pages were counted as
rotated since they were attached to the active LRU. After the patch,
they are not counted since they are attached to the inactive LRU.

Second, before the patch, a swapped-out page faulted in once was
activated and then counted as rotated, but it isn't now.

Lastly, reclaim checks the references on a page and used to activate
the page, counting it as rotated, when the first reference was found.
Now, activation and counting need two references to be found.

These changes affect reclaim_stat's scanned/rotated, and because of
them more scanning of the anonymous LRU is observed on the kernel
build workload, which leads to the performance regression. Since many
systems are tuned for the previous file/anon scan ratio, it is better
to maintain the previous scan ratio even though the LRU management
algorithm has changed.

For this purpose, this patch counts reclaim_stat's rotated at the
previous counting places. Although counting the rotations there is not
logical under the current algorithm, reducing the regression is more
important.

I found this regression with a kernel-build test; it is roughly a 2~5%
performance degradation. With this workaround, performance is
completely restored.
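
For context (not part of the patch), the balance being preserved comes
from get_scan_count(), which weights the anon/file scan pressure roughly
as follows (simplified from mm/vmscan.c of this kernel), so
under-counting anon rotations directly translates into more anon
scanning:

  /* simplified from get_scan_count():
   * anon_prio = swappiness, file_prio = 200 - swappiness */
  ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
  ap /= reclaim_stat->recent_rotated[0] + 1;

  fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
  fp /= reclaim_stat->recent_rotated[1] + 1;

  /* fewer recorded anon rotations -> larger ap -> more anon scanning */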

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/pagevec.h       |  2 +-
 include/linux/swap.h          |  1 +
 include/linux/vmstat.h        |  2 +-
 include/trace/events/vmscan.h |  3 +-
 mm/memory.c                   |  6 +++
 mm/mlock.c                    |  2 +-
 mm/swap.c                     | 89 ++++++++++++++++++++++++++++++++++++++-----
 mm/vmscan.c                   | 12 ++++--
 8 files changed, 100 insertions(+), 17 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 081d934..7090c38 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -24,7 +24,7 @@ struct pagevec {
 };
 
 void __pagevec_release(struct pagevec *pvec);
-void __pagevec_lru_add(struct pagevec *pvec);
+void __pagevec_lru_add(struct pagevec *pvec, bool new_anon);
 unsigned pagevec_lookup_entries(struct pagevec *pvec,
 				struct address_space *mapping,
 				pgoff_t start, unsigned nr_entries,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d204cc7..1023d4a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -334,6 +334,7 @@ extern void lru_cache_add_anon(struct page *page);
 extern void lru_cache_add_file(struct page *page);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
+extern void update_anon_page_reclaim_stat(struct page *page);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 292485f..104ea36 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -27,7 +27,7 @@ struct reclaim_stat {
 	unsigned nr_writeback;
 	unsigned nr_immediate;
 	unsigned nr_activate[2];
-	unsigned nr_ref_keep;
+	unsigned nr_ref_keep[2];
 	unsigned nr_unmap_fail;
 };
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index a5ab297..b42825f 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -366,7 +366,8 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 		__entry->nr_immediate = stat->nr_immediate;
 		__entry->nr_activate0 = stat->nr_activate[0];
 		__entry->nr_activate1 = stat->nr_activate[1];
-		__entry->nr_ref_keep = stat->nr_ref_keep;
+		__entry->nr_ref_keep = stat->nr_ref_keep[0] +
+					stat->nr_ref_keep[1];
 		__entry->nr_unmap_fail = stat->nr_unmap_fail;
 		__entry->priority = priority;
 		__entry->reclaim_flags = trace_reclaim_flags(file);
diff --git a/mm/memory.c b/mm/memory.c
index 9effb23..fe30b02 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3099,6 +3099,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	} else {
 		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true, false);
+
+		/*
+		 * To reduce difference of the reclaim ratio caused by LRU
+		 * algorithm change, update reclaim_stat manually.
+		 */
+		update_anon_page_reclaim_stat(page);
 	}
 
 	if (shadow) {
diff --git a/mm/mlock.c b/mm/mlock.c
index a72c1ee..c70d0734 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -273,7 +273,7 @@ static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
 	 *__pagevec_lru_add() calls release_pages() so we don't call
 	 * put_page() explicitly
 	 */
-	__pagevec_lru_add(pvec);
+	__pagevec_lru_add(pvec, false);
 	count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index d9f2005..9dd1cb5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -45,6 +45,7 @@
 int page_cluster;
 
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
+static DEFINE_PER_CPU(struct pagevec, lru_add_new_anon_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
@@ -272,6 +273,23 @@ static void update_page_reclaim_stat(struct lruvec *lruvec,
 		reclaim_stat->recent_rotated[file]++;
 }
 
+void update_anon_page_reclaim_stat(struct page *page)
+{
+	pg_data_t *pgdat;
+	struct lruvec *lruvec;
+	struct zone_reclaim_stat *reclaim_stat;
+
+	page = compound_head(page);
+	pgdat = page_pgdat(page);
+	spin_lock_irq(&pgdat->lru_lock);
+	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		reclaim_stat = &lruvec->reclaim_stat;
+		update_page_reclaim_stat(lruvec, 0, 1);
+	}
+	spin_unlock_irq(&pgdat->lru_lock);
+}
+
 static void __activate_page(struct page *page, struct lruvec *lruvec,
 			    void *arg)
 {
@@ -333,9 +351,8 @@ void activate_page(struct page *page)
 }
 #endif
 
-static void __lru_cache_activate_page(struct page *page)
+static bool lookup_activate_page(struct pagevec *pvec, struct page *page)
 {
-	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
 	int i;
 
 	/*
@@ -352,12 +369,35 @@ static void __lru_cache_activate_page(struct page *page)
 		struct page *pagevec_page = pvec->pages[i];
 
 		if (pagevec_page == page) {
-			SetPageActive(page);
-			break;
+			return true;
 		}
 	}
+	return false;
+}
+
+static void __lru_cache_activate_page(struct page *page)
+{
+	struct pagevec *pvec;
+	bool ret;
 
+	pvec = &get_cpu_var(lru_add_pvec);
+	ret = lookup_activate_page(pvec, page);
 	put_cpu_var(lru_add_pvec);
+
+	if (ret) {
+		SetPageActive(page);
+		return;
+	}
+
+	if (!PageSwapBacked(page))
+		return;
+
+	pvec = &get_cpu_var(lru_add_new_anon_pvec);
+	ret = lookup_activate_page(pvec, page);
+	put_cpu_var(lru_add_new_anon_pvec);
+
+	if (ret)
+		SetPageActive(page);
 }
 
 /*
@@ -407,7 +447,17 @@ static void __lru_cache_add(struct page *page)
 
 	get_page(page);
 	if (!pagevec_add(pvec, page) || PageCompound(page))
-		__pagevec_lru_add(pvec);
+		__pagevec_lru_add(pvec, false);
+	put_cpu_var(lru_add_pvec);
+}
+
+static void __lru_cache_add_new_anon(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_new_anon_pvec);
+
+	get_page(page);
+	if (!pagevec_add(pvec, page) || PageCompound(page))
+		__pagevec_lru_add(pvec, true);
 	put_cpu_var(lru_add_pvec);
 }
 
@@ -474,6 +524,11 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
 				    hpage_nr_pages(page));
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 	}
+
+	if (PageSwapBacked(page) && !unevictable) {
+		__lru_cache_add_new_anon(page);
+		return;
+	}
 	lru_cache_add(page);
 }
 
@@ -596,7 +651,11 @@ void lru_add_drain_cpu(int cpu)
 	struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);
 
 	if (pagevec_count(pvec))
-		__pagevec_lru_add(pvec);
+		__pagevec_lru_add(pvec, false);
+
+	pvec = &per_cpu(lru_add_new_anon_pvec, cpu);
+	if (pagevec_count(pvec))
+		__pagevec_lru_add(pvec, true);
 
 	pvec = &per_cpu(lru_rotate_pvecs, cpu);
 	if (pagevec_count(pvec)) {
@@ -744,6 +803,7 @@ void lru_add_drain_all(void)
 		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
 
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
+		    pagevec_count(&per_cpu(lru_add_new_anon_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
@@ -928,6 +988,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 {
 	enum lru_list lru;
 	int was_unevictable = TestClearPageUnevictable(page);
+	bool new_anon = (bool)arg;
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
@@ -962,8 +1023,18 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 
 	if (page_evictable(page)) {
 		lru = page_lru(page);
+		/*
+		 * Count new_anon page for rotation due to performance reason.
+		 * Previous LRU management algorithm for anonymous page attaches
+		 * a new anonymous page to the active list and it is counted as
+		 * rotation. However, current one attaches a new anonymous page
+		 * to the inactive list so rotation isn't counted. This count
+		 * difference results in difference of file/anon reclaim ratio
+		 * and then performance regression. To workaround it, new_anon
+		 * is counted for rotation here.
+		 */
 		update_page_reclaim_stat(lruvec, page_is_file_cache(page),
-					 PageActive(page));
+					 PageActive(page) | new_anon);
 		if (was_unevictable)
 			count_vm_event(UNEVICTABLE_PGRESCUED);
 	} else {
@@ -982,9 +1053,9 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
-void __pagevec_lru_add(struct pagevec *pvec)
+void __pagevec_lru_add(struct pagevec *pvec, bool new_anon)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
+	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, (void *)new_anon);
 }
 EXPORT_SYMBOL(__pagevec_lru_add);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c5d20c..45eede0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1088,11 +1088,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		enum page_references references = PAGEREF_RECLAIM;
 		bool dirty, writeback;
 		unsigned int nr_pages;
+		int file;
 
 		cond_resched();
 
 		page = lru_to_page(page_list);
 		list_del(&page->lru);
+		file = page_is_file_cache(page);
 
 		if (!trylock_page(page))
 			goto keep;
@@ -1223,7 +1225,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
 		case PAGEREF_KEEP:
-			stat->nr_ref_keep += nr_pages;
+			stat->nr_ref_keep[file] += nr_pages;
 			goto keep_locked;
 		case PAGEREF_RECLAIM:
 		case PAGEREF_RECLAIM_CLEAN:
@@ -1316,7 +1318,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * the rest of the LRU for clean pages and see
 			 * the same dirty pages again (PageReclaim).
 			 */
-			if (page_is_file_cache(page) &&
+			if (file &&
 			    (!current_is_kswapd() || !PageReclaim(page) ||
 			     !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
 				/*
@@ -1460,9 +1462,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			try_to_free_swap(page);
 		VM_BUG_ON_PAGE(PageActive(page), page);
 		if (!PageMlocked(page)) {
-			int type = page_is_file_cache(page);
 			SetPageActive(page);
-			stat->nr_activate[type] += nr_pages;
+			stat->nr_activate[file] += nr_pages;
 			count_memcg_page_event(page, PGACTIVATE);
 			workingset_activation(page);
 		}
@@ -1955,6 +1956,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
 	reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
 
+	/* to mitigate impact on scan ratio due to LRU algorithm change */
+	reclaim_stat->recent_rotated[0] += stat.nr_ref_keep[0];
+
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache
  2020-04-03  5:40 ` [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache js1304
@ 2020-04-03 18:29   ` Yang Shi
  2020-04-06  1:03     ` Joonsoo Kim
  2020-04-16 16:11   ` Johannes Weiner
  1 sibling, 1 reply; 26+ messages in thread
From: Yang Shi @ 2020-04-03 18:29 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim,
	Vlastimil Babka, Mel Gorman, kernel-team, Joonsoo Kim

On Thu, Apr 2, 2020 at 10:41 PM <js1304@gmail.com> wrote:
>
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> Currently, some swapped-in pages are not charged to the memcg until
> actual access to the page happens. I checked the code and found that
> it could cause a problem. In this implementation, even if the memcg
> is enabled, one can consume a lot of memory in the system by exploiting
> this hole. For example, one can make all the pages swapped out and
> then call madvise_willneed() to load the all swapped-out pages without
> pressing the memcg. Although actual access requires charging, it's really
> big benefit to load the swapped-out pages to the memory without pressing
> the memcg.
>
> And, for workingset detection which is implemented on the following patch,
> a memcg should be committed before the workingset detection is executed.
> For this purpose, the best solution, I think, is charging the page when
> adding to the swap cache. Charging there is not that hard. Caller of
> adding the page to the swap cache has enough information about the charged
> memcg. So, what we need to do is just passing this information to
> the right place.
>
> With this patch, specific memcg could be pressured more since readahead
> pages are also charged to it now. This would result in performance
> degradation to that user but it would be fair since that readahead is for
> that user.

If I read the code correctly, the readahead pages may *not* be charged
to it at all but to other memcgs, since mem_cgroup_try_charge() would
retrieve the target memcg id from the swap entry and then charge to it
(generally the memcg from which the page was swapped out). So, it
may open a backdoor that lets one memcg stress other memcgs?

>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  include/linux/swap.h |  4 ++--
>  mm/shmem.c           | 18 ++++++++++--------
>  mm/swap_state.c      | 25 +++++++++++++++++++++----
>  3 files changed, 33 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 273de48..eea0700 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -409,7 +409,7 @@ extern unsigned long total_swapcache_pages(void);
>  extern void show_swap_cache_info(void);
>  extern int add_to_swap(struct page *page);
>  extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
> -                       gfp_t gfp, void **shadowp);
> +                       struct vm_area_struct *vma, gfp_t gfp, void **shadowp);
>  extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
>  extern void __delete_from_swap_cache(struct page *page,
>                         swp_entry_t entry, void *shadow);
> @@ -567,7 +567,7 @@ static inline int add_to_swap(struct page *page)
>  }
>
>  static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
> -                                       gfp_t gfp_mask, void **shadowp)
> +                       struct vm_area_struct *vma, gfp_t gfp, void **shadowp)
>  {
>         return -1;
>  }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9e34b4e..8e28c1f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1369,7 +1369,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>         if (list_empty(&info->swaplist))
>                 list_add(&info->swaplist, &shmem_swaplist);
>
> -       if (add_to_swap_cache(page, swap,
> +       if (add_to_swap_cache(page, swap, NULL,
>                         __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
>                         NULL) == 0) {
>                 spin_lock_irq(&info->lock);
> @@ -1434,10 +1434,11 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
>  #endif
>
>  static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
> -               struct shmem_inode_info *info, pgoff_t index)
> +                       struct mm_struct *mm, struct shmem_inode_info *info,
> +                       pgoff_t index)
>  {
>         /* Create a pseudo vma that just contains the policy */
> -       vma_init(vma, NULL);
> +       vma_init(vma, mm);
>         /* Bias interleave by inode number to distribute better across nodes */
>         vma->vm_pgoff = index + info->vfs_inode.i_ino;
>         vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
> @@ -1450,13 +1451,14 @@ static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
>  }
>
>  static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
> -                       struct shmem_inode_info *info, pgoff_t index)
> +                       struct mm_struct *mm, struct shmem_inode_info *info,
> +                       pgoff_t index)
>  {
>         struct vm_area_struct pvma;
>         struct page *page;
>         struct vm_fault vmf;
>
> -       shmem_pseudo_vma_init(&pvma, info, index);
> +       shmem_pseudo_vma_init(&pvma, mm, info, index);
>         vmf.vma = &pvma;
>         vmf.address = 0;
>         page = swap_cluster_readahead(swap, gfp, &vmf);
> @@ -1481,7 +1483,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
>                                                                 XA_PRESENT))
>                 return NULL;
>
> -       shmem_pseudo_vma_init(&pvma, info, hindex);
> +       shmem_pseudo_vma_init(&pvma, NULL, info, hindex);
>         page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
>                         HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
>         shmem_pseudo_vma_destroy(&pvma);
> @@ -1496,7 +1498,7 @@ static struct page *shmem_alloc_page(gfp_t gfp,
>         struct vm_area_struct pvma;
>         struct page *page;
>
> -       shmem_pseudo_vma_init(&pvma, info, index);
> +       shmem_pseudo_vma_init(&pvma, NULL, info, index);
>         page = alloc_page_vma(gfp, &pvma, 0);
>         shmem_pseudo_vma_destroy(&pvma);
>
> @@ -1652,7 +1654,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
>                         count_memcg_event_mm(charge_mm, PGMAJFAULT);
>                 }
>                 /* Here we actually start the io */
> -               page = shmem_swapin(swap, gfp, info, index);
> +               page = shmem_swapin(swap, gfp, charge_mm, info, index);
>                 if (!page) {
>                         error = -ENOMEM;
>                         goto failed;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index f06af84..1db73a2 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -112,7 +112,7 @@ void show_swap_cache_info(void)
>   * but sets SwapCache flag and private instead of mapping and index.
>   */
>  int add_to_swap_cache(struct page *page, swp_entry_t entry,
> -                       gfp_t gfp, void **shadowp)
> +                       struct vm_area_struct *vma, gfp_t gfp, void **shadowp)
>  {
>         struct address_space *address_space = swap_address_space(entry);
>         pgoff_t idx = swp_offset(entry);
> @@ -120,14 +120,26 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
>         unsigned long i, nr = compound_nr(page);
>         unsigned long nrexceptional = 0;
>         void *old;
> +       bool compound = !!compound_order(page);
> +       int error;
> +       struct mm_struct *mm = vma ? vma->vm_mm : current->mm;
> +       struct mem_cgroup *memcg;
>
>         VM_BUG_ON_PAGE(!PageLocked(page), page);
>         VM_BUG_ON_PAGE(PageSwapCache(page), page);
>         VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>
>         page_ref_add(page, nr);
> +       /* PageSwapCache() prevent the page from being re-charged */
>         SetPageSwapCache(page);
>
> +       error = mem_cgroup_try_charge(page, mm, gfp, &memcg, compound);
> +       if (error) {
> +               ClearPageSwapCache(page);
> +               page_ref_sub(page, nr);
> +               return error;
> +       }
> +
>         do {
>                 xas_lock_irq(&xas);
>                 xas_create_range(&xas);
> @@ -153,11 +165,16 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
>                 xas_unlock_irq(&xas);
>         } while (xas_nomem(&xas, gfp));
>
> -       if (!xas_error(&xas))
> +       if (!xas_error(&xas)) {
> +               mem_cgroup_commit_charge(page, memcg, false, compound);
>                 return 0;
> +       }
> +
> +       mem_cgroup_cancel_charge(page, memcg, compound);
>
>         ClearPageSwapCache(page);
>         page_ref_sub(page, nr);
> +
>         return xas_error(&xas);
>  }
>
> @@ -221,7 +238,7 @@ int add_to_swap(struct page *page)
>         /*
>          * Add it to the swap cache.
>          */
> -       err = add_to_swap_cache(page, entry,
> +       err = add_to_swap_cache(page, entry, NULL,
>                         __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
>         if (err)
>                 /*
> @@ -431,7 +448,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>                 /* May fail (-ENOMEM) if XArray node allocation failed. */
>                 __SetPageLocked(new_page);
>                 __SetPageSwapBacked(new_page);
> -               err = add_to_swap_cache(new_page, entry,
> +               err = add_to_swap_cache(new_page, entry, vma,
>                                 gfp_mask & GFP_KERNEL, NULL);
>                 if (likely(!err)) {
>                         /* Initiate read into locked page */
> --
> 2.7.4
>
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache
  2020-04-03 18:29   ` Yang Shi
@ 2020-04-06  1:03     ` Joonsoo Kim
  2020-04-07  0:22       ` Yang Shi
  0 siblings, 1 reply; 26+ messages in thread
From: Joonsoo Kim @ 2020-04-06  1:03 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim,
	Vlastimil Babka, Mel Gorman, kernel-team, Joonsoo Kim

2020년 4월 4일 (토) 오전 3:29, Yang Shi <shy828301@gmail.com>님이 작성:
>
> On Thu, Apr 2, 2020 at 10:41 PM <js1304@gmail.com> wrote:
> >
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > Currently, some swapped-in pages are not charged to the memcg until
> > actual access to the page happens. I checked the code and found that
> > it could cause a problem. In this implementation, even if the memcg
> > is enabled, one can consume a lot of memory in the system by exploiting
> > this hole. For example, one can make all the pages swapped out and
> > then call madvise_willneed() to load the all swapped-out pages without
> > pressing the memcg. Although actual access requires charging, it's really
> > big benefit to load the swapped-out pages to the memory without pressing
> > the memcg.
> >
> > And, for workingset detection which is implemented on the following patch,
> > a memcg should be committed before the workingset detection is executed.
> > For this purpose, the best solution, I think, is charging the page when
> > adding to the swap cache. Charging there is not that hard. Caller of
> > adding the page to the swap cache has enough information about the charged
> > memcg. So, what we need to do is just passing this information to
> > the right place.
> >
> > With this patch, specific memcg could be pressured more since readahead
> > pages are also charged to it now. This would result in performance
> > degradation to that user but it would be fair since that readahead is for
> > that user.
>
> If I read the code correctly, the readahead pages may be *not* charged
> to it at all but other memcgs since mem_cgroup_try_charge() would
> retrieve the target memcg id from the swap entry then charge to it
> (generally it is the memcg from who the page is swapped out). So, it
> may open a backdoor to let one memcg stress other memcgs?

It looks like you are talking about the call path with CONFIG_MEMCG_SWAP.

The owner (task) of an anonymous page cannot be changed. It means that
the previous owner recorded in the swap entry will also be the next user.
So, I think that using the target memcg id from the swap entry for
readahead pages is a valid way.

As you are concerned about, if someone could control swap readahead to
read ahead another memcg's swap entries, one memcg could stress another
memcg by using the fact above. However, as far as I know, there is no
explicit way to readahead someone else's swap entry, so there is no
problem.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache
  2020-04-06  1:03     ` Joonsoo Kim
@ 2020-04-07  0:22       ` Yang Shi
  2020-04-07  1:27         ` Joonsoo Kim
  0 siblings, 1 reply; 26+ messages in thread
From: Yang Shi @ 2020-04-07  0:22 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim,
	Vlastimil Babka, Mel Gorman, kernel-team, Joonsoo Kim

On Sun, Apr 5, 2020 at 6:03 PM Joonsoo Kim <js1304@gmail.com> wrote:
>
> 2020년 4월 4일 (토) 오전 3:29, Yang Shi <shy828301@gmail.com>님이 작성:
> >
> > On Thu, Apr 2, 2020 at 10:41 PM <js1304@gmail.com> wrote:
> > >
> > > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > >
> > > Currently, some swapped-in pages are not charged to the memcg until
> > > actual access to the page happens. I checked the code and found that
> > > it could cause a problem. In this implementation, even if the memcg
> > > is enabled, one can consume a lot of memory in the system by exploiting
> > > this hole. For example, one can make all the pages swapped out and
> > > then call madvise_willneed() to load the all swapped-out pages without
> > > pressing the memcg. Although actual access requires charging, it's really
> > > big benefit to load the swapped-out pages to the memory without pressing
> > > the memcg.
> > >
> > > And, for workingset detection which is implemented on the following patch,
> > > a memcg should be committed before the workingset detection is executed.
> > > For this purpose, the best solution, I think, is charging the page when
> > > adding to the swap cache. Charging there is not that hard. Caller of
> > > adding the page to the swap cache has enough information about the charged
> > > memcg. So, what we need to do is just passing this information to
> > > the right place.
> > >
> > > With this patch, specific memcg could be pressured more since readahead
> > > pages are also charged to it now. This would result in performance
> > > degradation to that user but it would be fair since that readahead is for
> > > that user.
> >
> > If I read the code correctly, the readahead pages may be *not* charged
> > to it at all but other memcgs since mem_cgroup_try_charge() would
> > retrieve the target memcg id from the swap entry then charge to it
> > (generally it is the memcg from who the page is swapped out). So, it
> > may open a backdoor to let one memcg stress other memcgs?
>
> It looks like you talk about the call path on CONFIG_MEMCG_SWAP.
>
> The owner (task) for a anonymous page cannot be changed. It means that
> the previous owner written on the swap entry will be the next user. So,
> I think that using the target memcg id from the swap entry for readahead pages
> is valid way.
>
> As you concerned, if someone can control swap-readahead to readahead
> other's swap entry, one memcg could stress other memcg by using the fact above.
> However, as far as I know, there is no explicit way to readahead other's swap
> entry so no problem.

Swap cluster readahead would read in pages at consecutive swap
entries, which may belong to different memcgs. However, I just figured
out that patch #8 ("mm/swap: do not readahead if the previous owner of
the swap entry isn't me") would prevent reading ahead pages belonging
to other memcgs. This would kill the potential problem.

> Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 02/10] mm/vmscan: protect the workingset on anonymous LRU
       [not found] ` <20200406091814.17256-1-hdanton@sina.com>
@ 2020-04-07  0:40   ` Joonsoo Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2020-04-07  0:40 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, Linux Memory Management List, LKML,
	Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim,
	Vlastimil Babka, Mel Gorman, kernel-team, Joonsoo Kim

2020년 4월 6일 (월) 오후 6:18, Hillf Danton <hdanton@sina.com>님이 작성:
>
>
> On Fri,  3 Apr 2020 14:40:40 +0900 Joonsoo Kim wrote:
> >
> > @@ -3093,11 +3093,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       if (unlikely(page != swapcache && swapcache)) {
> >               page_add_new_anon_rmap(page, vma, vmf->address, false);
> >               mem_cgroup_commit_charge(page, memcg, false, false);
> > -             lru_cache_add_active_or_unevictable(page, vma);
> > +             lru_cache_add_inactive_or_unevictable(page, vma);
> >       } else {
> >               do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
> >               mem_cgroup_commit_charge(page, memcg, true, false);
> > -             activate_page(page);
> >       }
> >
> >       swap_free(entry);
> ...
> > @@ -996,8 +996,6 @@ static enum page_references page_check_references(struct page *page,
> >               return PAGEREF_RECLAIM;
> >
> >       if (referenced_ptes) {
> > -             if (PageSwapBacked(page))
> > -                     return PAGEREF_ACTIVATE;
> >               /*
> >                * All mapped pages start out with page table
> >                * references from the instantiating fault, so we need
> > @@ -1020,7 +1018,7 @@ static enum page_references page_check_references(struct page *page,
> >               /*
> >                * Activate file-backed executable pages after first usage.
> >                */
> > -             if (vm_flags & VM_EXEC)
> > +             if ((vm_flags & VM_EXEC) && !PageSwapBacked(page))
> >                       return PAGEREF_ACTIVATE;
> >
> >               return PAGEREF_KEEP;
> > --
> > 2.7.4
>
> Both changes other than
> s/lru_cache_add_active_or_unevictable/lru_cache_add_inactive_or_unevictable/
> are likely worth their own seperate commits with a concise log.

IMO, all of the changes in this patch make up just one logical change
to LRU management for anonymous pages, so it's better to keep them
together.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache
       [not found] ` <20200406115804.4440-1-hdanton@sina.com>
@ 2020-04-07  0:42   ` Joonsoo Kim
       [not found]   ` <20200407022144.11164-1-hdanton@sina.com>
  1 sibling, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2020-04-07  0:42 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, Linux Memory Management List, LKML,
	Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim,
	Vlastimil Babka, Mel Gorman

2020년 4월 6일 (월) 오후 8:58, Hillf Danton <hdanton@sina.com>님이 작성:
>
>
> On Fri,  3 Apr 2020 14:40:43 +0900 Joonsoo Kim wrote:
> >
> > @@ -153,11 +165,16 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
> >               xas_unlock_irq(&xas);
> >       } while (xas_nomem(&xas, gfp));
> >
> > -     if (!xas_error(&xas))
> > +     if (!xas_error(&xas)) {
> > +             mem_cgroup_commit_charge(page, memcg, false, compound);
>
> Add a tp
>                 trace_mm_add_to_swap_cache(page);

Please let me know the reason for this comment. IMO, adding a tracepoint
isn't a good idea until it is really necessary, since it is considered
kernel ABI and cannot be changed easily.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache
  2020-04-07  0:22       ` Yang Shi
@ 2020-04-07  1:27         ` Joonsoo Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2020-04-07  1:27 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim,
	Vlastimil Babka, Mel Gorman, kernel-team, Joonsoo Kim

2020년 4월 7일 (화) 오전 9:22, Yang Shi <shy828301@gmail.com>님이 작성:
>
> On Sun, Apr 5, 2020 at 6:03 PM Joonsoo Kim <js1304@gmail.com> wrote:
> >
> > 2020년 4월 4일 (토) 오전 3:29, Yang Shi <shy828301@gmail.com>님이 작성:
> > >
> > > On Thu, Apr 2, 2020 at 10:41 PM <js1304@gmail.com> wrote:
> > > >
> > > > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > >
> > > > Currently, some swapped-in pages are not charged to the memcg until
> > > > actual access to the page happens. I checked the code and found that
> > > > it could cause a problem. In this implementation, even if the memcg
> > > > is enabled, one can consume a lot of memory in the system by exploiting
> > > > this hole. For example, one can make all the pages swapped out and
> > > > then call madvise_willneed() to load the all swapped-out pages without
> > > > pressing the memcg. Although actual access requires charging, it's really
> > > > big benefit to load the swapped-out pages to the memory without pressing
> > > > the memcg.
> > > >
> > > > And, for workingset detection which is implemented on the following patch,
> > > > a memcg should be committed before the workingset detection is executed.
> > > > For this purpose, the best solution, I think, is charging the page when
> > > > adding to the swap cache. Charging there is not that hard. Caller of
> > > > adding the page to the swap cache has enough information about the charged
> > > > memcg. So, what we need to do is just passing this information to
> > > > the right place.
> > > >
> > > > With this patch, specific memcg could be pressured more since readahead
> > > > pages are also charged to it now. This would result in performance
> > > > degradation to that user but it would be fair since that readahead is for
> > > > that user.
> > >
> > > If I read the code correctly, the readahead pages may be *not* charged
> > > to it at all but other memcgs since mem_cgroup_try_charge() would
> > > retrieve the target memcg id from the swap entry then charge to it
> > > (generally it is the memcg from who the page is swapped out). So, it
> > > may open a backdoor to let one memcg stress other memcgs?
> >
> > It looks like you talk about the call path on CONFIG_MEMCG_SWAP.
> >
> > The owner (task) for a anonymous page cannot be changed. It means that
> > the previous owner written on the swap entry will be the next user. So,
> > I think that using the target memcg id from the swap entry for readahead pages
> > is valid way.
> >
> > As you concerned, if someone can control swap-readahead to readahead
> > other's swap entry, one memcg could stress other memcg by using the fact above.
> > However, as far as I know, there is no explicit way to readahead other's swap
> > entry so no problem.
>
> Swap cluster readahead would readahead in pages on consecutive swap
> entries which may belong to different memcgs, however I just figured
> out patch #8 ("mm/swap: do not readahead if the previous owner of the
> swap entry isn't me") would prevent from reading ahead pages belonging
> to other memcgs. This would kill the potential problem.

Yes, that patch kills the potential problem. However, I think that swap
cluster readahead would not open the backdoor even without patch #8 in
the CONFIG_MEMCG_SWAP case, because:

1. consecutive swap space is usually filled by the same task.
2. swap cluster readahead imposes a large I/O cost on the offender while
the effect on the target isn't serious.
3. those pages would be charged to their previous owner, which is valid.
Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list
  2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
                   ` (11 preceding siblings ...)
       [not found] ` <20200406115804.4440-1-hdanton@sina.com>
@ 2020-04-08 16:55 ` Vlastimil Babka
  2020-04-09  0:50   ` Joonsoo Kim
  12 siblings, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2020-04-08 16:55 UTC (permalink / raw)
  To: js1304, Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Hocko,
	Hugh Dickins, Minchan Kim, Mel Gorman, kernel-team, Joonsoo Kim

On 4/3/20 7:40 AM, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Hello,
> 
> This patchset implements workingset protection and detection on
> the anonymous LRU list.

Hi!

> I did another test to show the performance effect of this patchset.
> 
> - ebizzy (with modified random function)
> ebizzy is the test program that main thread allocates lots of memory and
> child threads access them randomly during the given times. Swap-in/out
> will happen if allocated memory is larger than the system memory.
> 
> The random function that represents the zipf distribution is used to
> make hot/cold memory. Hot/cold ratio is controlled by the parameter. If
> the parameter is high, hot memory is accessed much larger than cold one.
> If the parameter is low, the number of access on each memory would be
> similar. I uses various parameters in order to show the effect of
> patchset on various hot/cold ratio workload.
> 
> My test setup is a virtual machine with 8 cpus and 1024MB memory.
> 
> Result format is as following.
> 
> Parameter 0.1 ... 1.3
> Allocated memory size
> Throughput for base (larger is better)
> Throughput for patchset (larger is better)
> Improvement (larger is better)
> 
> 
> * single thread
> 
>      0.1      0.3      0.5      0.7      0.9      1.1      1.3
> <512M>
>   7009.0   7372.0   7774.0   8523.0   9569.0  10724.0  11936.0
>   6973.0   7342.0   7745.0   8576.0   9441.0  10730.0  12033.0
>    -0.01     -0.0     -0.0     0.01    -0.01      0.0     0.01
> <768M>
>    915.0   1039.0   1275.0   1687.0   2328.0   3486.0   5445.0
>    920.0   1037.0   1238.0   1689.0   2384.0   3638.0   5381.0
>     0.01     -0.0    -0.03      0.0     0.02     0.04    -0.01
> <1024M>
>    425.0    471.0    539.0    753.0   1183.0   2130.0   3839.0
>    414.0    468.0    553.0    770.0   1242.0   2187.0   3932.0
>    -0.03    -0.01     0.03     0.02     0.05     0.03     0.02
> <1280M>
>    320.0    346.0    410.0    556.0    871.0   1654.0   3298.0
>    316.0    346.0    411.0    550.0    892.0   1652.0   3293.0
>    -0.01      0.0      0.0    -0.01     0.02     -0.0     -0.0
> <1536M>
>    273.0    290.0    341.0    458.0    733.0   1381.0   2925.0
>    271.0    293.0    344.0    462.0    740.0   1398.0   2969.0
>    -0.01     0.01     0.01     0.01     0.01     0.01     0.02
> <2048M>
>     77.0     79.0     95.0    147.0    276.0    690.0   1816.0
>     91.0     94.0    115.0    170.0    321.0    770.0   2018.0
>     0.18     0.19     0.21     0.16     0.16     0.12     0.11
> 
> 
> * multi thread (8)
> 
>      0.1      0.3      0.5      0.7      0.9      1.1      1.3
> <512M>
>  29083.0  29648.0  30145.0  31668.0  33964.0  38414.0  43707.0
>  29238.0  29701.0  30301.0  31328.0  33809.0  37991.0  43667.0
>     0.01      0.0     0.01    -0.01     -0.0    -0.01     -0.0
> <768M>
>   3332.0   3699.0   4673.0   5830.0   8307.0  12969.0  17665.0
>   3579.0   3992.0   4432.0   6111.0   8699.0  12604.0  18061.0
>     0.07     0.08    -0.05     0.05     0.05    -0.03     0.02
> <1024M>
>   1921.0   2141.0   2484.0   3296.0   5391.0   8227.0  14574.0
>   1989.0   2155.0   2609.0   3565.0   5463.0   8170.0  15642.0
>     0.04     0.01     0.05     0.08     0.01    -0.01     0.07
> <1280M>
>   1524.0   1625.0   1931.0   2581.0   4155.0   6959.0  12443.0
>   1560.0   1707.0   2016.0   2714.0   4262.0   7518.0  13910.0
>     0.02     0.05     0.04     0.05     0.03     0.08     0.12
> <1536M>
>   1303.0   1399.0   1550.0   2137.0   3469.0   6712.0  12944.0
>   1356.0   1465.0   1701.0   2237.0   3583.0   6830.0  13580.0
>     0.04     0.05      0.1     0.05     0.03     0.02     0.05
> <2048M>
>    172.0    184.0    215.0    289.0    514.0   1318.0   4153.0
>    175.0    190.0    225.0    329.0    606.0   1585.0   5170.0
>     0.02     0.03     0.05     0.14     0.18      0.2     0.24
> 
> As we can see, as allocated memory grows, patched kernel get the better
> result. Maximum improvement is 21% for the single thread test and
> 24% for the multi thread test.

So, these results seem to be identical since v1. After the various changes up to
v5, should the benchmark be redone? And was that with a full patchset or
patches 1+2?

> * EXPERIMENT
> I made a test program to imitates above scenario and confirmed that
> problem exists. Then, I checked that this patchset fixes it.
> 
> My test setup is a virtual machine with 8 cpus and 6100MB memory. But,
> the amount of the memory that the test program can use is about 280 MB.
> This is because the system uses large ram-backed swap and large ramdisk
> to capture the trace.
> 
> Test scenario is like as below.
> 
> 1. allocate cold memory (512MB)
> 2. allocate hot-1 memory (96MB)
> 3. activate hot-1 memory (96MB)
> 4. allocate another hot-2 memory (96MB)
> 5. access cold memory (128MB)
> 6. access hot-2 memory (96MB)
> 7. repeat 5, 6
> 
> Since hot-1 memory (96MB) is on the active list, the inactive list can
> contains roughly 190MB pages. hot-2 memory's re-access interval
> (96+128 MB) is more 190MB, so it cannot be promoted without workingset
> detection and swap-in/out happens repeatedly. With this patchset,
> workingset detection works and promotion happens. Therefore, swap-in/out
> occurs less.
> 
> Here is the result. (average of 5 runs)
> 
> type swap-in swap-out
> base 863240 989945
> patch 681565 809273
> 
> As we can see, patched kernel do less swap-in/out.

Same comment, also v1 has this note:

> Note that, this result is gotten from v5.1. Although workingset detection
> works on v5.1, it doesn't work well on v5.5. It looks like that recent
> code change on workingset.c is the reason of this problem. I will
> track it soon.

What was the problem then, assuming it's fixed? Maybe I just missed it
mentioned. Can results now be gathered on 5.6?

In general, this patchset seems to be doing the right thing. I haven't reviewed
the code yet, but hope to do so soon. But inevitably, with any changes in this
area there will be workloads that will suffer instead of benefit. That can be
because we are actually doing a wrong thing, or there's a bug in the code, or
the workloads happen to benefit from the current behavior even if it's not the
generally optimal one. And I'm afraid only testing on a variety of workloads can
show that. You mentioned somewhere that your production workloads benefit? Can
it be quantified more? Could e.g. Johannes test this a bit at Facebook, or
somebody at Google?

Thanks,
Vlastimil

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list
  2020-04-08 16:55 ` [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list Vlastimil Babka
@ 2020-04-09  0:50   ` Joonsoo Kim
  2020-06-03  3:57     ` Suren Baghdasaryan
  0 siblings, 1 reply; 26+ messages in thread
From: Joonsoo Kim @ 2020-04-09  0:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Linux Memory Management List, LKML,
	Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim,
	Mel Gorman, kernel-team, Joonsoo Kim

On Thu, Apr 9, 2020 at 1:55 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 4/3/20 7:40 AM, js1304@gmail.com wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > Hello,
> >
> > This patchset implements workingset protection and detection on
> > the anonymous LRU list.
>
> Hi!

Hi!

> > I did another test to show the performance effect of this patchset.
> >
> > - ebizzy (with modified random function)
> > ebizzy is the test program that main thread allocates lots of memory and
> > child threads access them randomly during the given times. Swap-in/out
> > will happen if allocated memory is larger than the system memory.
> >
> > The random function that represents the zipf distribution is used to
> > make hot/cold memory. Hot/cold ratio is controlled by the parameter. If
> > the parameter is high, hot memory is accessed much larger than cold one.
> > If the parameter is low, the number of access on each memory would be
> > similar. I uses various parameters in order to show the effect of
> > patchset on various hot/cold ratio workload.
> >
> > My test setup is a virtual machine with 8 cpus and 1024MB memory.
> >
> > Result format is as following.
> >
> > Parameter 0.1 ... 1.3
> > Allocated memory size
> > Throughput for base (larger is better)
> > Throughput for patchset (larger is better)
> > Improvement (larger is better)
> >
> >
> > * single thread
> >
> >      0.1      0.3      0.5      0.7      0.9      1.1      1.3
> > <512M>
> >   7009.0   7372.0   7774.0   8523.0   9569.0  10724.0  11936.0
> >   6973.0   7342.0   7745.0   8576.0   9441.0  10730.0  12033.0
> >    -0.01     -0.0     -0.0     0.01    -0.01      0.0     0.01
> > <768M>
> >    915.0   1039.0   1275.0   1687.0   2328.0   3486.0   5445.0
> >    920.0   1037.0   1238.0   1689.0   2384.0   3638.0   5381.0
> >     0.01     -0.0    -0.03      0.0     0.02     0.04    -0.01
> > <1024M>
> >    425.0    471.0    539.0    753.0   1183.0   2130.0   3839.0
> >    414.0    468.0    553.0    770.0   1242.0   2187.0   3932.0
> >    -0.03    -0.01     0.03     0.02     0.05     0.03     0.02
> > <1280M>
> >    320.0    346.0    410.0    556.0    871.0   1654.0   3298.0
> >    316.0    346.0    411.0    550.0    892.0   1652.0   3293.0
> >    -0.01      0.0      0.0    -0.01     0.02     -0.0     -0.0
> > <1536M>
> >    273.0    290.0    341.0    458.0    733.0   1381.0   2925.0
> >    271.0    293.0    344.0    462.0    740.0   1398.0   2969.0
> >    -0.01     0.01     0.01     0.01     0.01     0.01     0.02
> > <2048M>
> >     77.0     79.0     95.0    147.0    276.0    690.0   1816.0
> >     91.0     94.0    115.0    170.0    321.0    770.0   2018.0
> >     0.18     0.19     0.21     0.16     0.16     0.12     0.11
> >
> >
> > * multi thread (8)
> >
> >      0.1      0.3      0.5      0.7      0.9      1.1      1.3
> > <512M>
> >  29083.0  29648.0  30145.0  31668.0  33964.0  38414.0  43707.0
> >  29238.0  29701.0  30301.0  31328.0  33809.0  37991.0  43667.0
> >     0.01      0.0     0.01    -0.01     -0.0    -0.01     -0.0
> > <768M>
> >   3332.0   3699.0   4673.0   5830.0   8307.0  12969.0  17665.0
> >   3579.0   3992.0   4432.0   6111.0   8699.0  12604.0  18061.0
> >     0.07     0.08    -0.05     0.05     0.05    -0.03     0.02
> > <1024M>
> >   1921.0   2141.0   2484.0   3296.0   5391.0   8227.0  14574.0
> >   1989.0   2155.0   2609.0   3565.0   5463.0   8170.0  15642.0
> >     0.04     0.01     0.05     0.08     0.01    -0.01     0.07
> > <1280M>
> >   1524.0   1625.0   1931.0   2581.0   4155.0   6959.0  12443.0
> >   1560.0   1707.0   2016.0   2714.0   4262.0   7518.0  13910.0
> >     0.02     0.05     0.04     0.05     0.03     0.08     0.12
> > <1536M>
> >   1303.0   1399.0   1550.0   2137.0   3469.0   6712.0  12944.0
> >   1356.0   1465.0   1701.0   2237.0   3583.0   6830.0  13580.0
> >     0.04     0.05      0.1     0.05     0.03     0.02     0.05
> > <2048M>
> >    172.0    184.0    215.0    289.0    514.0   1318.0   4153.0
> >    175.0    190.0    225.0    329.0    606.0   1585.0   5170.0
> >     0.02     0.03     0.05     0.14     0.18      0.2     0.24
> >
> > As we can see, as allocated memory grows, patched kernel get the better
> > result. Maximum improvement is 21% for the single thread test and
> > 24% for the multi thread test.
>
> So, these results seem to be identical since v1. After the various changes up to
> v5, should  the benchmark be redone? And was that with a full patchset or
> patches 1+2?

It was done with the full patchset. I think these results would not change
even on v5, since the improvement comes from the concept of this patchset and
the implementation details don't matter much. However, I will redo the
benchmark.

> > * EXPERIMENT
> > I made a test program to imitates above scenario and confirmed that
> > problem exists. Then, I checked that this patchset fixes it.
> >
> > My test setup is a virtual machine with 8 cpus and 6100MB memory. But,
> > the amount of the memory that the test program can use is about 280 MB.
> > This is because the system uses large ram-backed swap and large ramdisk
> > to capture the trace.
> >
> > Test scenario is like as below.
> >
> > 1. allocate cold memory (512MB)
> > 2. allocate hot-1 memory (96MB)
> > 3. activate hot-1 memory (96MB)
> > 4. allocate another hot-2 memory (96MB)
> > 5. access cold memory (128MB)
> > 6. access hot-2 memory (96MB)
> > 7. repeat 5, 6
> >
> > Since hot-1 memory (96MB) is on the active list, the inactive list can
> > contains roughly 190MB pages. hot-2 memory's re-access interval
> > (96+128 MB) is more 190MB, so it cannot be promoted without workingset
> > detection and swap-in/out happens repeatedly. With this patchset,
> > workingset detection works and promotion happens. Therefore, swap-in/out
> > occurs less.
> >
> > Here is the result. (average of 5 runs)
> >
> > type swap-in swap-out
> > base 863240 989945
> > patch 681565 809273
> >
> > As we can see, patched kernel do less swap-in/out.
>
> Same comment, also v1 has this note:

I had tested this scenario on every version of the patchset and found the same
trend.

> > Note that, this result is gotten from v5.1. Although workingset detection
> > works on v5.1, it doesn't work well on v5.5. It looks like that recent
> > code change on workingset.c is the reason of this problem. I will
> > track it soon.
> What was the problem then, assuming it's fixed? Maybe I just missed it
> mentioned. Can results now be gathered on 5.6?

It was fixed in v2. The v2 changelog entry "fix a critical bug that uses out
of index lru list in workingset_refault()" refers to this problem. I should
have noted that clearly.

> In general, this patchset seems to be doing the right thing. I haven't reviewed
> the code yet, but hope to do so soon. But inevitably, with any changes in this
> area there will be workloads that will suffer instead of benefit. That can be
> because we are actually doing a wrong thing, or there's a bug in the code, or
> the workloads happen to benefit from the current behavior even if it's not the
> generally optimal one. And I'm afraid only testing on a variety of workloads can
> show that. You mentioned somewhere that your production workloads benefit? Can
> it be quantified more? Could e.g. Johannes test this a bit at Facebook, or

I cannot share the details of the test for my production (smart TV) workload.
Roughly, it repeats various actions and apps (channel change, volume change,
YouTube, etc.) on the smart TV, and it is a memory stress test. The result
after the workload is:

base
pswpin 328211
pswpout 304015

patched
pswpin 261884
pswpout 276062

So, the reduction in pswpin and pswpout is roughly 20% and 9%, respectively.

> it be quantified more? Could e.g. Johannes test this a bit at Facebook, or
> somebody at Google?

It would be really helpful if someone else could test this on their workload.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache
       [not found]   ` <20200407022144.11164-1-hdanton@sina.com>
@ 2020-04-09  0:53     ` Joonsoo Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2020-04-09  0:53 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, Linux Memory Management List, LKML,
	Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim,
	Vlastimil Babka, Mel Gorman

On Tue, Apr 7, 2020 at 11:21 AM Hillf Danton <hdanton@sina.com> wrote:
>
>
> On Tue, 7 Apr 2020 09:42:54 +0900 Joonsoo Kim wrote:
> >
> > On Mon, Apr 6, 2020 at 8:58 PM Hillf Danton <hdanton@sina.com> wrote:
> > >
> > >
> > > On Fri,  3 Apr 2020 14:40:43 +0900 Joonsoo Kim wrote:
> > > >
> > > > @@ -153,11 +165,16 @@ int add_to_swap_cache(struct page *page, swp_entr=
> > y_t entry,
> > > >               xas_unlock_irq(&xas);
> > > >       } while (xas_nomem(&xas, gfp));
> > > >
> > > > -     if (!xas_error(&xas))
> > > > +     if (!xas_error(&xas)) {
> > > > +             mem_cgroup_commit_charge(page, memcg, false, compound);
> > >
> > > Add a tp
> > >                 trace_mm_add_to_swap_cache(page);
> >
> > Please let me know the reason of this comment.
>
> It is derived from
>
>         trace_mm_filemap_add_to_page_cache(page);
>
> in __add_to_page_cache_locked().

Maybe it has its own purpose that doesn't apply here. If someone wants to add
a tracepoint here, it can be done with a justification, after this patchset.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache
  2020-04-03  5:40 ` [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache js1304
  2020-04-03 18:29   ` Yang Shi
@ 2020-04-16 16:11   ` Johannes Weiner
  2020-04-17  1:38     ` Joonsoo Kim
  1 sibling, 1 reply; 26+ messages in thread
From: Johannes Weiner @ 2020-04-16 16:11 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

Hello Joonsoo,

On Fri, Apr 03, 2020 at 02:40:43PM +0900, js1304@gmail.com wrote:
> @@ -112,7 +112,7 @@ void show_swap_cache_info(void)
>   * but sets SwapCache flag and private instead of mapping and index.
>   */
>  int add_to_swap_cache(struct page *page, swp_entry_t entry,
> -			gfp_t gfp, void **shadowp)
> +			struct vm_area_struct *vma, gfp_t gfp, void **shadowp)
>  {
>  	struct address_space *address_space = swap_address_space(entry);
>  	pgoff_t idx = swp_offset(entry);
> @@ -120,14 +120,26 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
>  	unsigned long i, nr = compound_nr(page);
>  	unsigned long nrexceptional = 0;
>  	void *old;
> +	bool compound = !!compound_order(page);
> +	int error;
> +	struct mm_struct *mm = vma ? vma->vm_mm : current->mm;
> +	struct mem_cgroup *memcg;
>  
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	VM_BUG_ON_PAGE(PageSwapCache(page), page);
>  	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>  
>  	page_ref_add(page, nr);
> +	/* PageSwapCache() prevent the page from being re-charged */
>  	SetPageSwapCache(page);
>  
> +	error = mem_cgroup_try_charge(page, mm, gfp, &memcg, compound);
> +	if (error) {
> +		ClearPageSwapCache(page);
> +		page_ref_sub(page, nr);
> +		return error;
> +	}
> +
>  	do {
>  		xas_lock_irq(&xas);
>  		xas_create_range(&xas);
> @@ -153,11 +165,16 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
>  		xas_unlock_irq(&xas);
>  	} while (xas_nomem(&xas, gfp));
>  
> -	if (!xas_error(&xas))
> +	if (!xas_error(&xas)) {
> +		mem_cgroup_commit_charge(page, memcg, false, compound);

Unfortunately you cannot commit here yet because the rmap isn't set up
and that will cause memcg_charge_statistics() to account the page
incorrectly as file. And rmap is only set up during a page fault.
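
To illustrate what I mean (roughly what the charge statistics do today,
simplified, so don't take the exact lines as gospel): the anon/file split keys
off PageAnon(), which only becomes true once the anon rmap has been set up
during the fault:

	if (PageAnon(page))
		__mod_memcg_state(memcg, MEMCG_RSS, nr_pages);
	else
		/* an anon page committed before rmap setup ends up here */
		__mod_memcg_state(memcg, MEMCG_CACHE, nr_pages);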

This needs a bit of a rework of the memcg charging code that appears
out of scope for your patches. I'm preparing a series right now to do
just that. It'll also fix the swap ownership tracking problem when the
swap controller is disabled, so we don't have to choose between
charging the wrong cgroup or hampering swap readahead efficiency.

The patches also unblock Alex Shi's "per lruvec lru_lock for memcg"
series, which is all the more indication that memcg needs fixing in
that area.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache
  2020-04-16 16:11   ` Johannes Weiner
@ 2020-04-17  1:38     ` Joonsoo Kim
  2020-04-17  3:31       ` Johannes Weiner
  0 siblings, 1 reply; 26+ messages in thread
From: Joonsoo Kim @ 2020-04-17  1:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux Memory Management List, LKML, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Fri, Apr 17, 2020 at 1:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hello Joonsoo,
>
> On Fri, Apr 03, 2020 at 02:40:43PM +0900, js1304@gmail.com wrote:
> > @@ -112,7 +112,7 @@ void show_swap_cache_info(void)
> >   * but sets SwapCache flag and private instead of mapping and index.
> >   */
> >  int add_to_swap_cache(struct page *page, swp_entry_t entry,
> > -                     gfp_t gfp, void **shadowp)
> > +                     struct vm_area_struct *vma, gfp_t gfp, void **shadowp)
> >  {
> >       struct address_space *address_space = swap_address_space(entry);
> >       pgoff_t idx = swp_offset(entry);
> > @@ -120,14 +120,26 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
> >       unsigned long i, nr = compound_nr(page);
> >       unsigned long nrexceptional = 0;
> >       void *old;
> > +     bool compound = !!compound_order(page);
> > +     int error;
> > +     struct mm_struct *mm = vma ? vma->vm_mm : current->mm;
> > +     struct mem_cgroup *memcg;
> >
> >       VM_BUG_ON_PAGE(!PageLocked(page), page);
> >       VM_BUG_ON_PAGE(PageSwapCache(page), page);
> >       VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> >
> >       page_ref_add(page, nr);
> > +     /* PageSwapCache() prevent the page from being re-charged */
> >       SetPageSwapCache(page);
> >
> > +     error = mem_cgroup_try_charge(page, mm, gfp, &memcg, compound);
> > +     if (error) {
> > +             ClearPageSwapCache(page);
> > +             page_ref_sub(page, nr);
> > +             return error;
> > +     }
> > +
> >       do {
> >               xas_lock_irq(&xas);
> >               xas_create_range(&xas);
> > @@ -153,11 +165,16 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
> >               xas_unlock_irq(&xas);
> >       } while (xas_nomem(&xas, gfp));
> >
> > -     if (!xas_error(&xas))
> > +     if (!xas_error(&xas)) {
> > +             mem_cgroup_commit_charge(page, memcg, false, compound);
>
> Unfortunately you cannot commit here yet because the rmap isn't set up
> and that will cause memcg_charge_statistics() to account the page
> incorrectly as file. And rmap is only set up during a page fault.

I also found this problem a few days ago. In my investigation, what we need
to make the accounting correct for an anonymous page is a way to find the type
of the page, file or anon, since there is no special code that uses the rmap.
And I think that could be done by checking for a NULL mapping or something
else. Is there anything I missed? Also, I cannot find the function
memcg_charge_statistics(). Please let me know which file this function is in.

This is just my curiosity, and I agree with your comment below.

> This needs a bit of a rework of the memcg charging code that appears
> out of scope for your patches. I'm preparing a series right now to do
> just that. It'll also fix the swap ownership tracking problem when the
> swap controller is disabled, so we don't have to choose between
> charging the wrong cgroup or hampering swap readahead efficiency.

Sounds good! I also think that these patches are outside the scope of my
series. I will wait for your patches. Could you let me know when your series
is submitted? I'd like to plan my work schedule based on your patch schedule.

> The patches also unblock Alex Shi's "per lruvec lru_lock for memcg"
> series, which is all the more indication that memcg needs fixing in
> that area.

Okay.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache
  2020-04-17  1:38     ` Joonsoo Kim
@ 2020-04-17  3:31       ` Johannes Weiner
  2020-04-17  3:57         ` Joonsoo Kim
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Weiner @ 2020-04-17  3:31 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Linux Memory Management List, LKML, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Fri, Apr 17, 2020 at 10:38:53AM +0900, Joonsoo Kim wrote:
> On Fri, Apr 17, 2020 at 1:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Hello Joonsoo,
> >
> > On Fri, Apr 03, 2020 at 02:40:43PM +0900, js1304@gmail.com wrote:
> > > @@ -112,7 +112,7 @@ void show_swap_cache_info(void)
> > >   * but sets SwapCache flag and private instead of mapping and index.
> > >   */
> > >  int add_to_swap_cache(struct page *page, swp_entry_t entry,
> > > -                     gfp_t gfp, void **shadowp)
> > > +                     struct vm_area_struct *vma, gfp_t gfp, void **shadowp)
> > >  {
> > >       struct address_space *address_space = swap_address_space(entry);
> > >       pgoff_t idx = swp_offset(entry);
> > > @@ -120,14 +120,26 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
> > >       unsigned long i, nr = compound_nr(page);
> > >       unsigned long nrexceptional = 0;
> > >       void *old;
> > > +     bool compound = !!compound_order(page);
> > > +     int error;
> > > +     struct mm_struct *mm = vma ? vma->vm_mm : current->mm;
> > > +     struct mem_cgroup *memcg;
> > >
> > >       VM_BUG_ON_PAGE(!PageLocked(page), page);
> > >       VM_BUG_ON_PAGE(PageSwapCache(page), page);
> > >       VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> > >
> > >       page_ref_add(page, nr);
> > > +     /* PageSwapCache() prevent the page from being re-charged */
> > >       SetPageSwapCache(page);
> > >
> > > +     error = mem_cgroup_try_charge(page, mm, gfp, &memcg, compound);
> > > +     if (error) {
> > > +             ClearPageSwapCache(page);
> > > +             page_ref_sub(page, nr);
> > > +             return error;
> > > +     }
> > > +
> > >       do {
> > >               xas_lock_irq(&xas);
> > >               xas_create_range(&xas);
> > > @@ -153,11 +165,16 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
> > >               xas_unlock_irq(&xas);
> > >       } while (xas_nomem(&xas, gfp));
> > >
> > > -     if (!xas_error(&xas))
> > > +     if (!xas_error(&xas)) {
> > > +             mem_cgroup_commit_charge(page, memcg, false, compound);
> >
> > Unfortunately you cannot commit here yet because the rmap isn't set up
> > and that will cause memcg_charge_statistics() to account the page
> > incorrectly as file. And rmap is only set up during a page fault.
> 
> I also found this problem a few days ago. In my investigation, what we need for
> anonymous page to make accounting correct is a way to find the type of the page,
> file or anon, since there is no special code to use the rmap. And, I
> think that it
> could be done by checking NULL mapping or something else.

page->mapping is NULL for truncated file pages, file pages before they
are inserted into the page cache, and anon pages before the rmap. It's
not straight-forward to robustly tell those apart inside memcg.

But fundamentally, it's a silly problem to fix. We only need to tell
page types apart to maintain the MEMCG_CACHE and MEMCG_RSS
counters. But these are unnecessary duplicates of the NR_FILE_PAGES
and NR_ANON_MAPPED vmstat counters - counters for which we already
have accounting sites in generic code, and that already exist in the
per-cgroup vmstat arrays. We just need to link them.

So that's what I'm fixing instead: I'm adjusting the charging sequence
slightly so that page->mem_cgroup is stable when the core VM code
accounts pages by type. And then I'm hooking into these places with
mod_lruvec_page_state and friends, and ditching MEMCG_CACHE/MEMCG_RSS.
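
As a rough sketch of that direction (illustrative only, not the actual
series): once page->mem_cgroup is stable at the generic accounting site, the
site that already maintains NR_ANON_MAPPED can feed the per-cgroup counter
through the same hook, e.g. where the new anon rmap is established:

	/* updates the node counter and the page's memcg counter together */
	__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr_pages);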

After that, memcg doesn't have to know about the types of pages at
all. It can focus on maintaining page_counters and page->mem_cgroup,
and leave the vmstat breakdown to generic VM code.

Then we can charge pages right after allocation, regardless of type.

[ Eventually, the memcg accounting interface shouldn't be much more
  than GFP_ACCOUNT (with memalloc_use_memcg() for faults, swap etc.),
  and the vmstat hooks. My patches don't quite get there, but that's
  the direction they're pushing. ]
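
For the swap readahead case, that would roughly look like the following sketch
(not code from my series; memalloc_use_memcg() and alloc_page_vma() exist
today, but the surrounding shape and the way "memcg" is resolved from the swap
entry's recorded owner are assumptions about the eventual interface):

	/* charge the readahead page to the swap entry's recorded owner */
	memalloc_use_memcg(memcg);
	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
	memalloc_unuse_memcg();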

> Is there anything I missed? And, I cannot find the function,
> memcg_charge_statistics(). Please let me know the file name of this
> function.

The correct name is mem_cgroup_charge_statistics(), my apologies.

> > This needs a bit of a rework of the memcg charging code that appears
> > out of scope for your patches. I'm preparing a series right now to do
> > just that. It'll also fix the swap ownership tracking problem when the
> > swap controller is disabled, so we don't have to choose between
> > charging the wrong cgroup or hampering swap readahead efficiency.
> 
> Sound good! I also think that these patches are out of scope of my series.
> I will wait your patches. Could you let me know when your series is submitted?
> I'd like to plan my work schedule based on your patch schedule.

I just finished the first draft of the series a few minutes ago. I'll
polish it a bit, make sure it compiles etc. ;-), and send it out for
review, testing and rebasing, hopefully tomorrow or early next week.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache
  2020-04-17  3:31       ` Johannes Weiner
@ 2020-04-17  3:57         ` Joonsoo Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2020-04-17  3:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux Memory Management List, LKML, Michal Hocko,
	Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman,
	kernel-team, Joonsoo Kim

On Fri, Apr 17, 2020 at 12:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Apr 17, 2020 at 10:38:53AM +0900, Joonsoo Kim wrote:
> > On Fri, Apr 17, 2020 at 1:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Hello Joonsoo,
> > >
> > > On Fri, Apr 03, 2020 at 02:40:43PM +0900, js1304@gmail.com wrote:
> > > > @@ -112,7 +112,7 @@ void show_swap_cache_info(void)
> > > >   * but sets SwapCache flag and private instead of mapping and index.
> > > >   */
> > > >  int add_to_swap_cache(struct page *page, swp_entry_t entry,
> > > > -                     gfp_t gfp, void **shadowp)
> > > > +                     struct vm_area_struct *vma, gfp_t gfp, void **shadowp)
> > > >  {
> > > >       struct address_space *address_space = swap_address_space(entry);
> > > >       pgoff_t idx = swp_offset(entry);
> > > > @@ -120,14 +120,26 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
> > > >       unsigned long i, nr = compound_nr(page);
> > > >       unsigned long nrexceptional = 0;
> > > >       void *old;
> > > > +     bool compound = !!compound_order(page);
> > > > +     int error;
> > > > +     struct mm_struct *mm = vma ? vma->vm_mm : current->mm;
> > > > +     struct mem_cgroup *memcg;
> > > >
> > > >       VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > >       VM_BUG_ON_PAGE(PageSwapCache(page), page);
> > > >       VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> > > >
> > > >       page_ref_add(page, nr);
> > > > +     /* PageSwapCache() prevent the page from being re-charged */
> > > >       SetPageSwapCache(page);
> > > >
> > > > +     error = mem_cgroup_try_charge(page, mm, gfp, &memcg, compound);
> > > > +     if (error) {
> > > > +             ClearPageSwapCache(page);
> > > > +             page_ref_sub(page, nr);
> > > > +             return error;
> > > > +     }
> > > > +
> > > >       do {
> > > >               xas_lock_irq(&xas);
> > > >               xas_create_range(&xas);
> > > > @@ -153,11 +165,16 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
> > > >               xas_unlock_irq(&xas);
> > > >       } while (xas_nomem(&xas, gfp));
> > > >
> > > > -     if (!xas_error(&xas))
> > > > +     if (!xas_error(&xas)) {
> > > > +             mem_cgroup_commit_charge(page, memcg, false, compound);
> > >
> > > Unfortunately you cannot commit here yet because the rmap isn't set up
> > > and that will cause memcg_charge_statistics() to account the page
> > > incorrectly as file. And rmap is only set up during a page fault.
> >
> > I also found this problem a few days ago. In my investigation, what we need for
> > anonymous page to make accounting correct is a way to find the type of the page,
> > file or anon, since there is no special code to use the rmap. And, I
> > think that it
> > could be done by checking NULL mapping or something else.
>
> page->mapping is NULL for truncated file pages, file pages before they
> are inserted into the page cache, and anon pages before the rmap. It's
> not straight-forward to robustly tell those apart inside memcg.

Okay.

> But fundamentally, it's a silly problem to fix. We only need to tell
> page types apart to maintain the MEMCG_CACHE and MEMCG_RSS
> counters. But these are unnecessary duplicates of the NR_FILE_PAGES
> and NR_ANON_MAPPED vmstat counters - counters for which we already
> have accounting sites in generic code, and that already exist in the
> per-cgroup vmstat arrays. We just need to link them.
>
> So that's what I'm fixing instead: I'm adjusting the charging sequence
> slightly so that page->mem_cgroup is stable when the core VM code
> accounts pages by type. And then I'm hooking into these places with
> mod_lruvec_page_state and friends, and ditching MEMCG_CACHE/MEMCG_RSS.
>
> After that, memcg doesn't have to know about the types of pages at
> all. It can focus on maintaining page_counters and page->mem_cgroup,
> and leave the vmstat breakdown to generic VM code.
>
> Then we can charge pages right after allocation, regardless of type.
>
> [ Eventually, the memcg accounting interface shouldn't be much more
>   than GFP_ACCOUNT (with memalloc_use_memcg() for faults, swap etc.),
>   and the vmstat hooks. My patches don't quite get there, but that's
>   the direction they're pushing. ]

Thanks for the explanation! It will help me understand your future patches.

> > Is there anything I missed? And, I cannot find the function,
> > memcg_charge_statistics(). Please let me know the file name of this
> > function.
>
> The correct name is mem_cgroup_charge_statistics(), my apologies.

No problem.

> > > This needs a bit of a rework of the memcg charging code that appears
> > > out of scope for your patches. I'm preparing a series right now to do
> > > just that. It'll also fix the swap ownership tracking problem when the
> > > swap controller is disabled, so we don't have to choose between
> > > charging the wrong cgroup or hampering swap readahead efficiency.
> >
> > Sound good! I also think that these patches are out of scope of my series.
> > I will wait your patches. Could you let me know when your series is submitted?
> > I'd like to plan my work schedule based on your patch schedule.
>
> I just finished the first draft of the series a few minutes ago. I'll
> polish it a bit, make sure it compiles etc. ;-), and send it out for
> review, testing and rebasing, hopefully tomorrow or early next week.

Sounds good! See you again next week. :)

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list
  2020-04-09  0:50   ` Joonsoo Kim
@ 2020-06-03  3:57     ` Suren Baghdasaryan
  2020-06-03  5:46       ` Joonsoo Kim
  0 siblings, 1 reply; 26+ messages in thread
From: Suren Baghdasaryan @ 2020-06-03  3:57 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vlastimil Babka, Andrew Morton, Linux Memory Management List,
	LKML, Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim,
	Mel Gorman, kernel-team, Joonsoo Kim

On Wed, Apr 8, 2020 at 5:50 PM Joonsoo Kim <js1304@gmail.com> wrote:
>
> On Thu, Apr 9, 2020 at 1:55 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 4/3/20 7:40 AM, js1304@gmail.com wrote:
> > > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > >
> > > Hello,
> > >
> > > This patchset implements workingset protection and detection on
> > > the anonymous LRU list.
> >
> > Hi!
>
> Hi!
>
> > > I did another test to show the performance effect of this patchset.
> > >
> > > - ebizzy (with modified random function)
> > > ebizzy is the test program that main thread allocates lots of memory and
> > > child threads access them randomly during the given times. Swap-in/out
> > > will happen if allocated memory is larger than the system memory.
> > >
> > > The random function that represents the zipf distribution is used to
> > > make hot/cold memory. Hot/cold ratio is controlled by the parameter. If
> > > the parameter is high, hot memory is accessed much larger than cold one.
> > > If the parameter is low, the number of access on each memory would be
> > > similar. I uses various parameters in order to show the effect of
> > > patchset on various hot/cold ratio workload.
> > >
> > > My test setup is a virtual machine with 8 cpus and 1024MB memory.
> > >
> > > Result format is as following.
> > >
> > > Parameter 0.1 ... 1.3
> > > Allocated memory size
> > > Throughput for base (larger is better)
> > > Throughput for patchset (larger is better)
> > > Improvement (larger is better)
> > >
> > >
> > > * single thread
> > >
> > >      0.1      0.3      0.5      0.7      0.9      1.1      1.3
> > > <512M>
> > >   7009.0   7372.0   7774.0   8523.0   9569.0  10724.0  11936.0
> > >   6973.0   7342.0   7745.0   8576.0   9441.0  10730.0  12033.0
> > >    -0.01     -0.0     -0.0     0.01    -0.01      0.0     0.01
> > > <768M>
> > >    915.0   1039.0   1275.0   1687.0   2328.0   3486.0   5445.0
> > >    920.0   1037.0   1238.0   1689.0   2384.0   3638.0   5381.0
> > >     0.01     -0.0    -0.03      0.0     0.02     0.04    -0.01
> > > <1024M>
> > >    425.0    471.0    539.0    753.0   1183.0   2130.0   3839.0
> > >    414.0    468.0    553.0    770.0   1242.0   2187.0   3932.0
> > >    -0.03    -0.01     0.03     0.02     0.05     0.03     0.02
> > > <1280M>
> > >    320.0    346.0    410.0    556.0    871.0   1654.0   3298.0
> > >    316.0    346.0    411.0    550.0    892.0   1652.0   3293.0
> > >    -0.01      0.0      0.0    -0.01     0.02     -0.0     -0.0
> > > <1536M>
> > >    273.0    290.0    341.0    458.0    733.0   1381.0   2925.0
> > >    271.0    293.0    344.0    462.0    740.0   1398.0   2969.0
> > >    -0.01     0.01     0.01     0.01     0.01     0.01     0.02
> > > <2048M>
> > >     77.0     79.0     95.0    147.0    276.0    690.0   1816.0
> > >     91.0     94.0    115.0    170.0    321.0    770.0   2018.0
> > >     0.18     0.19     0.21     0.16     0.16     0.12     0.11
> > >
> > >
> > > * multi thread (8)
> > >
> > >      0.1      0.3      0.5      0.7      0.9      1.1      1.3
> > > <512M>
> > >  29083.0  29648.0  30145.0  31668.0  33964.0  38414.0  43707.0
> > >  29238.0  29701.0  30301.0  31328.0  33809.0  37991.0  43667.0
> > >     0.01      0.0     0.01    -0.01     -0.0    -0.01     -0.0
> > > <768M>
> > >   3332.0   3699.0   4673.0   5830.0   8307.0  12969.0  17665.0
> > >   3579.0   3992.0   4432.0   6111.0   8699.0  12604.0  18061.0
> > >     0.07     0.08    -0.05     0.05     0.05    -0.03     0.02
> > > <1024M>
> > >   1921.0   2141.0   2484.0   3296.0   5391.0   8227.0  14574.0
> > >   1989.0   2155.0   2609.0   3565.0   5463.0   8170.0  15642.0
> > >     0.04     0.01     0.05     0.08     0.01    -0.01     0.07
> > > <1280M>
> > >   1524.0   1625.0   1931.0   2581.0   4155.0   6959.0  12443.0
> > >   1560.0   1707.0   2016.0   2714.0   4262.0   7518.0  13910.0
> > >     0.02     0.05     0.04     0.05     0.03     0.08     0.12
> > > <1536M>
> > >   1303.0   1399.0   1550.0   2137.0   3469.0   6712.0  12944.0
> > >   1356.0   1465.0   1701.0   2237.0   3583.0   6830.0  13580.0
> > >     0.04     0.05      0.1     0.05     0.03     0.02     0.05
> > > <2048M>
> > >    172.0    184.0    215.0    289.0    514.0   1318.0   4153.0
> > >    175.0    190.0    225.0    329.0    606.0   1585.0   5170.0
> > >     0.02     0.03     0.05     0.14     0.18      0.2     0.24
> > >
> > > As we can see, as allocated memory grows, patched kernel get the better
> > > result. Maximum improvement is 21% for the single thread test and
> > > 24% for the multi thread test.
> >
> > So, these results seem to be identical since v1. After the various changes up to
> > v5, should  the benchmark be redone? And was that with a full patchset or
> > patches 1+2?
>
> It was done with a full patchset. I think that these results would not
> be changed
> even on v5 since it is improvement from the concept of this patchset and
> implementation detail doesn't much matter. However, I will redo.
>
> > > * EXPERIMENT
> > > I made a test program to imitates above scenario and confirmed that
> > > problem exists. Then, I checked that this patchset fixes it.
> > >
> > > My test setup is a virtual machine with 8 cpus and 6100MB memory. But,
> > > the amount of the memory that the test program can use is about 280 MB.
> > > This is because the system uses large ram-backed swap and large ramdisk
> > > to capture the trace.
> > >
> > > Test scenario is like as below.
> > >
> > > 1. allocate cold memory (512MB)
> > > 2. allocate hot-1 memory (96MB)
> > > 3. activate hot-1 memory (96MB)
> > > 4. allocate another hot-2 memory (96MB)
> > > 5. access cold memory (128MB)
> > > 6. access hot-2 memory (96MB)
> > > 7. repeat 5, 6
> > >
> > > Since hot-1 memory (96MB) is on the active list, the inactive list can
> > > contains roughly 190MB pages. hot-2 memory's re-access interval
> > > (96+128 MB) is more 190MB, so it cannot be promoted without workingset
> > > detection and swap-in/out happens repeatedly. With this patchset,
> > > workingset detection works and promotion happens. Therefore, swap-in/out
> > > occurs less.
> > >
> > > Here is the result. (average of 5 runs)
> > >
> > > type swap-in swap-out
> > > base 863240 989945
> > > patch 681565 809273
> > >
> > > As we can see, patched kernel do less swap-in/out.
> >
> > Same comment, also v1 has this note:
>
> I had tested this scenario on every version of the patchset and found the same
> trend.
>
> > > Note that, this result is gotten from v5.1. Although workingset detection
> > > works on v5.1, it doesn't work well on v5.5. It looks like that recent
> > > code change on workingset.c is the reason of this problem. I will
> > > track it soon.
> > What was the problem then, assuming it's fixed? Maybe I just missed it
> > mentioned. Can results now be gathered on 5.6?
>
> It was fixed on v2. Change log on v2 "fix a critical bug that uses out of index
> lru list in workingset_refault()" is for this problem. I should note
> that clearly.
>
> > In general, this patchset seems to be doing the right thing. I haven't reviewed
> > the code yet, but hope to do so soon. But inevitably, with any changes in this
> > area there will be workloads that will suffer instead of benefit. That can be
> > because we are actually doing a wrong thing, or there's a bug in the code, or
> > the workloads happen to benefit from the current behavior even if it's not the
> > generally optimal one. And I'm afraid only testing on a variety of workloads can
> > show that. You mentioned somewhere that your production workloads benefit? Can
> > it be quantified more? Could e.g. Johannes test this a bit at Facebook, or
>
> I cannot share the detail of the test for my production (smart TV)
> workload. Roughly,
> it is repeat of various action and app (channel change, volume change,
> youtube, etc.)
> on smart TV and it is memory stress test. Result after the workload is:
>
> base
> pswpin 328211
> pswpout 304015
>
> patched
> pswpin 261884
> pswpout 276062
>
> So, improvement on pswpin and pswpout is roughly 20% and 9%, respectively.
>
> > it be quantified more? Could e.g. Johannes test this a bit at Facebook, or
> > somebody at Google?
>
> It's really helpful if someone else could test this on their workload.

Please let me know when the new version (after Johannes' memcg charging
changes) is available for testing. I'll try it on an Android workload.
Thanks.

>
> Thanks.
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list
  2020-06-03  3:57     ` Suren Baghdasaryan
@ 2020-06-03  5:46       ` Joonsoo Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2020-06-03  5:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka, Andrew Morton, Linux Memory Management List,
	LKML, Johannes Weiner, Michal Hocko, Hugh Dickins, Minchan Kim,
	Mel Gorman, kernel-team, Joonsoo Kim

On Wed, Jun 3, 2020 at 12:57 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Apr 8, 2020 at 5:50 PM Joonsoo Kim <js1304@gmail.com> wrote:
> >
> > On Thu, Apr 9, 2020 at 1:55 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 4/3/20 7:40 AM, js1304@gmail.com wrote:
> > > > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > >
> > > > Hello,
> > > >
> > > > This patchset implements workingset protection and detection on
> > > > the anonymous LRU list.
> > >
> > > Hi!
> >
> > Hi!
> >
> > > > I did another test to show the performance effect of this patchset.
> > > >
> > > > - ebizzy (with modified random function)
> > > > ebizzy is the test program that main thread allocates lots of memory and
> > > > child threads access them randomly during the given times. Swap-in/out
> > > > will happen if allocated memory is larger than the system memory.
> > > >
> > > > The random function that represents the zipf distribution is used to
> > > > make hot/cold memory. Hot/cold ratio is controlled by the parameter. If
> > > > the parameter is high, hot memory is accessed much larger than cold one.
> > > > If the parameter is low, the number of access on each memory would be
> > > > similar. I uses various parameters in order to show the effect of
> > > > patchset on various hot/cold ratio workload.
> > > >
> > > > My test setup is a virtual machine with 8 cpus and 1024MB memory.
> > > >
> > > > Result format is as following.
> > > >
> > > > Parameter 0.1 ... 1.3
> > > > Allocated memory size
> > > > Throughput for base (larger is better)
> > > > Throughput for patchset (larger is better)
> > > > Improvement (larger is better)
> > > >
> > > >
> > > > * single thread
> > > >
> > > >      0.1      0.3      0.5      0.7      0.9      1.1      1.3
> > > > <512M>
> > > >   7009.0   7372.0   7774.0   8523.0   9569.0  10724.0  11936.0
> > > >   6973.0   7342.0   7745.0   8576.0   9441.0  10730.0  12033.0
> > > >    -0.01     -0.0     -0.0     0.01    -0.01      0.0     0.01
> > > > <768M>
> > > >    915.0   1039.0   1275.0   1687.0   2328.0   3486.0   5445.0
> > > >    920.0   1037.0   1238.0   1689.0   2384.0   3638.0   5381.0
> > > >     0.01     -0.0    -0.03      0.0     0.02     0.04    -0.01
> > > > <1024M>
> > > >    425.0    471.0    539.0    753.0   1183.0   2130.0   3839.0
> > > >    414.0    468.0    553.0    770.0   1242.0   2187.0   3932.0
> > > >    -0.03    -0.01     0.03     0.02     0.05     0.03     0.02
> > > > <1280M>
> > > >    320.0    346.0    410.0    556.0    871.0   1654.0   3298.0
> > > >    316.0    346.0    411.0    550.0    892.0   1652.0   3293.0
> > > >    -0.01      0.0      0.0    -0.01     0.02     -0.0     -0.0
> > > > <1536M>
> > > >    273.0    290.0    341.0    458.0    733.0   1381.0   2925.0
> > > >    271.0    293.0    344.0    462.0    740.0   1398.0   2969.0
> > > >    -0.01     0.01     0.01     0.01     0.01     0.01     0.02
> > > > <2048M>
> > > >     77.0     79.0     95.0    147.0    276.0    690.0   1816.0
> > > >     91.0     94.0    115.0    170.0    321.0    770.0   2018.0
> > > >     0.18     0.19     0.21     0.16     0.16     0.12     0.11
> > > >
> > > >
> > > > * multi thread (8)
> > > >
> > > >      0.1      0.3      0.5      0.7      0.9      1.1      1.3
> > > > <512M>
> > > >  29083.0  29648.0  30145.0  31668.0  33964.0  38414.0  43707.0
> > > >  29238.0  29701.0  30301.0  31328.0  33809.0  37991.0  43667.0
> > > >     0.01      0.0     0.01    -0.01     -0.0    -0.01     -0.0
> > > > <768M>
> > > >   3332.0   3699.0   4673.0   5830.0   8307.0  12969.0  17665.0
> > > >   3579.0   3992.0   4432.0   6111.0   8699.0  12604.0  18061.0
> > > >     0.07     0.08    -0.05     0.05     0.05    -0.03     0.02
> > > > <1024M>
> > > >   1921.0   2141.0   2484.0   3296.0   5391.0   8227.0  14574.0
> > > >   1989.0   2155.0   2609.0   3565.0   5463.0   8170.0  15642.0
> > > >     0.04     0.01     0.05     0.08     0.01    -0.01     0.07
> > > > <1280M>
> > > >   1524.0   1625.0   1931.0   2581.0   4155.0   6959.0  12443.0
> > > >   1560.0   1707.0   2016.0   2714.0   4262.0   7518.0  13910.0
> > > >     0.02     0.05     0.04     0.05     0.03     0.08     0.12
> > > > <1536M>
> > > >   1303.0   1399.0   1550.0   2137.0   3469.0   6712.0  12944.0
> > > >   1356.0   1465.0   1701.0   2237.0   3583.0   6830.0  13580.0
> > > >     0.04     0.05      0.1     0.05     0.03     0.02     0.05
> > > > <2048M>
> > > >    172.0    184.0    215.0    289.0    514.0   1318.0   4153.0
> > > >    175.0    190.0    225.0    329.0    606.0   1585.0   5170.0
> > > >     0.02     0.03     0.05     0.14     0.18      0.2     0.24
> > > >
> > > > As we can see, as allocated memory grows, patched kernel get the better
> > > > result. Maximum improvement is 21% for the single thread test and
> > > > 24% for the multi thread test.
> > >
> > > So, these results seem to be identical since v1. After the various changes up to
> > > v5, should  the benchmark be redone? And was that with a full patchset or
> > > patches 1+2?
> >
> > It was done with a full patchset. I think that these results would not
> > be changed
> > even on v5 since it is improvement from the concept of this patchset and
> > implementation detail doesn't much matter. However, I will redo.
> >
> > > > * EXPERIMENT
> > > > I made a test program to imitates above scenario and confirmed that
> > > > problem exists. Then, I checked that this patchset fixes it.
> > > >
> > > > My test setup is a virtual machine with 8 cpus and 6100MB memory. But,
> > > > the amount of the memory that the test program can use is about 280 MB.
> > > > This is because the system uses large ram-backed swap and large ramdisk
> > > > to capture the trace.
> > > >
> > > > Test scenario is like as below.
> > > >
> > > > 1. allocate cold memory (512MB)
> > > > 2. allocate hot-1 memory (96MB)
> > > > 3. activate hot-1 memory (96MB)
> > > > 4. allocate another hot-2 memory (96MB)
> > > > 5. access cold memory (128MB)
> > > > 6. access hot-2 memory (96MB)
> > > > 7. repeat 5, 6
> > > >
> > > > Since hot-1 memory (96MB) is on the active list, the inactive list can
> > > > contains roughly 190MB pages. hot-2 memory's re-access interval
> > > > (96+128 MB) is more 190MB, so it cannot be promoted without workingset
> > > > detection and swap-in/out happens repeatedly. With this patchset,
> > > > workingset detection works and promotion happens. Therefore, swap-in/out
> > > > occurs less.
> > > >
> > > > Here is the result. (average of 5 runs)
> > > >
> > > > type swap-in swap-out
> > > > base 863240 989945
> > > > patch 681565 809273
> > > >
> > > > As we can see, patched kernel do less swap-in/out.
> > >
> > > Same comment, also v1 has this note:
> >
> > I had tested this scenario on every version of the patchset and found the same
> > trend.
> >
> > > > Note that, this result is gotten from v5.1. Although workingset detection
> > > > works on v5.1, it doesn't work well on v5.5. It looks like that recent
> > > > code change on workingset.c is the reason of this problem. I will
> > > > track it soon.
> > > What was the problem then, assuming it's fixed? Maybe I just missed it
> > > mentioned. Can results now be gathered on 5.6?
> >
> > It was fixed on v2. Change log on v2 "fix a critical bug that uses out of index
> > lru list in workingset_refault()" is for this problem. I should note
> > that clearly.
> >
> > > In general, this patchset seems to be doing the right thing. I haven't reviewed
> > > the code yet, but hope to do so soon. But inevitably, with any changes in this
> > > area there will be workloads that will suffer instead of benefit. That can be
> > > because we are actually doing a wrong thing, or there's a bug in the code, or
> > > the workloads happen to benefit from the current behavior even if it's not the
> > > generally optimal one. And I'm afraid only testing on a variety of workloads can
> > > show that. You mentioned somewhere that your production workloads benefit? Can
> > > it be quantified more? Could e.g. Johannes test this a bit at Facebook, or
> >
> > I cannot share the detail of the test for my production (smart TV)
> > workload. Roughly,
> > it is repeat of various action and app (channel change, volume change,
> > youtube, etc.)
> > on smart TV and it is memory stress test. Result after the workload is:
> >
> > base
> > pswpin 328211
> > pswpout 304015
> >
> > patched
> > pswpin 261884
> > pswpout 276062
> >
> > So, improvement on pswpin and pswpout is roughly 20% and 9%, respectively.
> >
> > > it be quantified more? Could e.g. Johannes test this a bit at Facebook, or
> > > somebody at Google?
> >
> > It's really helpful if someone else could test this on their workload.
>
> Please let me know when the new version (after Johannes' memcg
> charging changes) is available for testing. I'll try them on Android
> workload.
> Thanks.

I have a new version after Johannes' memcg charging changes, but the MM tree
now has other changes in this area ("mm: balance LRU lists based on relative
thrashing v2"), so I need to rebase on top of them. I guess that at least two
weeks are required.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-06-03  5:46 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-03  5:40 [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list js1304
2020-04-03  5:40 ` [PATCH v5 01/10] mm/vmscan: make active/inactive ratio as 1:1 for anon lru js1304
2020-04-03  5:40 ` [PATCH v5 02/10] mm/vmscan: protect the workingset on anonymous LRU js1304
2020-04-03  5:40 ` [PATCH v5 03/10] mm/workingset: extend the workingset detection for anon LRU js1304
2020-04-03  5:40 ` [PATCH v5 04/10] mm/swapcache: support to handle the exceptional entries in swapcache js1304
2020-04-03  5:40 ` [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache js1304
2020-04-03 18:29   ` Yang Shi
2020-04-06  1:03     ` Joonsoo Kim
2020-04-07  0:22       ` Yang Shi
2020-04-07  1:27         ` Joonsoo Kim
2020-04-16 16:11   ` Johannes Weiner
2020-04-17  1:38     ` Joonsoo Kim
2020-04-17  3:31       ` Johannes Weiner
2020-04-17  3:57         ` Joonsoo Kim
2020-04-03  5:40 ` [PATCH v5 06/10] mm/swap: implement workingset detection for anonymous LRU js1304
2020-04-03  5:40 ` [PATCH v5 07/10] mm/workingset: support to remember the previous owner of the page js1304
2020-04-03  5:40 ` [PATCH v5 08/10] mm/swap: do not readahead if the previous owner of the swap entry isn't me js1304
2020-04-03  5:40 ` [PATCH v5 09/10] mm/vmscan: restore active/inactive ratio for anonymous LRU js1304
2020-04-03  5:45 ` [PATCH v5 10/10] mm/swap: reinforce the reclaim_stat changed by anon LRU algorithm change js1304
     [not found] ` <20200406091814.17256-1-hdanton@sina.com>
2020-04-07  0:40   ` [PATCH v5 02/10] mm/vmscan: protect the workingset on anonymous LRU Joonsoo Kim
     [not found] ` <20200406115804.4440-1-hdanton@sina.com>
2020-04-07  0:42   ` [PATCH v5 05/10] mm/swap: charge the page when adding to the swap cache Joonsoo Kim
     [not found]   ` <20200407022144.11164-1-hdanton@sina.com>
2020-04-09  0:53     ` Joonsoo Kim
2020-04-08 16:55 ` [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list Vlastimil Babka
2020-04-09  0:50   ` Joonsoo Kim
2020-06-03  3:57     ` Suren Baghdasaryan
2020-06-03  5:46       ` Joonsoo Kim
