linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/24] Swapin path refactor for optimization and bugfix
@ 2023-11-19 19:47 Kairui Song
  2023-11-19 19:47 ` [PATCH 01/24] mm/swap: fix a potential undefined behavior issue Kairui Song
                   ` (24 more replies)
  0 siblings, 25 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

This series tries to unify and clean up the swapin path, fixing a few
issues with optimizations:

1. Memcg leak issue: when a process that previously swapped out some
   memory is migrated to another cgroup, and the original cgroup is dead,
   doing a swapoff will account the swapped-in pages to the process doing
   the swapoff instead of the new cgroup. This easily allows the process
   to use more memory than expected.

   This can be easily reproduced by:
   - Setup a swap.
   - Create memory cgroup A, B and C.
   - Spawn process P1 in cgroup A and make it swap out some pages.
   - Move process P1 to memory cgroup B.
   - Destroy cgroup A.
   - Do a swapoff in cgroup C.
   - Swapped-in pages are accounted into cgroup C.

   This series fixes it by making the swapped-in pages accounted to cgroup B.

2. When there are multiple swap devices configured, if one of these
   devices is not an SSD, VMA readahead will be globally disabled.

   This series will make the readahead policy check per swap entry.

3. This series also includes many refactors and optimizations:
   - The swap readahead policy check is unified for page-fault/swapoff/shmem,
     so swapin from a ramdisk (e.g. ZRAM) will always bypass the swapcache
     (see the sketch below). Previously shmem and swapoff had different
     behavior on this.
   - Some micro optimizations (e.g. skipping duplicated xarray lookups)
     in the swapin path while doing the refactor.
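
   For reference, the swapcache-bypass condition referred to above is the
   existing SWP_SYNCHRONOUS_IO fast path; after this series it is checked
   in a single helper, roughly like the simplified sketch below (see the
   later patches in this series for the exact form):

	static inline bool swap_use_no_readahead(struct swap_info_struct *si,
						 swp_entry_t entry)
	{
		/* Synchronous (ramdisk-like) device and no other swap count user */
		return data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
		       __swap_count(entry) == 1;
	}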

Some benchmark:

1. fio test for shmem (within a 2G memcg limit, using lzo-rle ZRAM swap):
fio -name=tmpfs --numjobs=16 --directory=/tmpfs --size=256m --ioengine=mmap \
  --iodepth=128 --rw=randrw --random_distribution=<RANDOM> --time_based\
  --ramp_time=1m --runtime=1m --group_reporting

RANDOM=zipf:1.2 ZRAM
Before (R/W, bw): 7339MiB/s / 7341MiB/s
After  (R/W, bw): 7305MiB/s / 7308MiB/s (-0.5%)

RANDOM=zipf:0.5 ZRAM
Before (R/W, bw): 770MiB/s / 770MiB/s
After  (R/W, bw): 775MiB/s / 774MiB/s (+0.6%)

RANDOM=random ZRAM
Before (R/W, bw): 537MiB/s / 537MiB/s
After  (R/W, bw): 552MiB/s / 552MiB/s (+2.7%)

We can see readahead barely helps here, and for random R/W there is an
observable performance gain.

2. A micro benchmark which uses madvise to swap out 10G of zero-filled data
   to ZRAM and then reads it back in shows a performance gain for the
   swapin path:

Before: 12480532 us
After:  12013318 us (+3.8%)
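
For reference, a minimal sketch of such a benchmark (illustrative only;
the exact setup, page size and timing method used for the numbers above
may differ), relying on MADV_PAGEOUT to push the pages out to the ZRAM
swap device:

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/time.h>

	#define SIZE (10UL << 30)	/* 10G of zero-filled data */

	int main(void)
	{
		char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		struct timeval t0, t1;
		unsigned long i;

		if (buf == MAP_FAILED)
			return 1;

		memset(buf, 0, SIZE);			/* populate the pages */
		madvise(buf, SIZE, MADV_PAGEOUT);	/* swap them out */

		gettimeofday(&t0, NULL);
		for (i = 0; i < SIZE; i += 4096)	/* assumes 4K pages */
			(void)*(volatile char *)(buf + i);	/* fault back in */
		gettimeofday(&t1, NULL);

		printf("swapin: %ld us\n",
		       (t1.tv_sec - t0.tv_sec) * 1000000L +
		       (t1.tv_usec - t0.tv_usec));
		return 0;
	}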

3. The final vmlinux is also a little bit smaller (gcc 8.5.0):
./scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 8/7 grow/shrink: 5/6 up/down: 5737/-5789 (-52)
Function                                     old     new   delta
unuse_vma                                      -    3204   +3204
swapin_page_fault                              -    1804   +1804
swapin_page_non_fault                          -     437    +437
swapin_no_readahead                            -     165    +165
swap_cache_get_folio                         291     326     +35
__pfx_unuse_vma                                -      16     +16
__pfx_swapin_page_non_fault                    -      16     +16
__pfx_swapin_page_fault                        -      16     +16
__pfx_swapin_no_readahead                      -      16     +16
read_swap_cache_async                        179     191     +12
swap_cluster_readahead                       912     921      +9
__read_swap_cache_async                      669     673      +4
zswap_writeback_entry                       1463    1466      +3
__do_sys_swapon                             4923    4920      -3
nr_rotate_swap                                 4       -      -4
__pfx_unuse_pte_range                         16       -     -16
__pfx_swapin_readahead                        16       -     -16
__pfx___swap_count                            16       -     -16
__x64_sys_swapoff                           1347    1312     -35
__ia32_sys_swapoff                          1346    1311     -35
__swap_count                                  72       -     -72
shmem_swapin_folio                          1697    1535    -162
do_swap_page                                2404    1942    -462
try_to_unuse                                1867     880    -987
swapin_readahead                            1377       -   -1377
unuse_pte_range                             2604       -   -2604
Total: Before=30085393, After=30085341, chg -0.00%

Kairui Song (24):
  mm/swap: fix a potential undefined behavior issue
  mm/swapfile.c: add back some comment
  mm/swap: move no readahead swapin code to a stand alone helper
  mm/swap: avoid setting page lock bit and doing extra unlock check
  mm/swap: move readahead policy checking into swapin_readahead
  swap: rework swapin_no_readahead arguments
  mm/swap: move swap_count to header to be shared
  mm/swap: check readahead policy per entry
  mm/swap: inline __swap_count
  mm/swap: remove nr_rotate_swap and related code
  mm/swap: also handle swapcache lookup in swapin_readahead
  mm/swap: simplify arguments for swap_cache_get_folio
  swap: simplify swap_cache_get_folio
  mm/swap: do shadow lookup as well when doing swap cache lookup
  mm/swap: avoid a duplicated swap cache lookup for SYNCHRONOUS_IO
    device
  mm/swap: reduce scope of get_swap_device in swapin path
  mm/swap: fix false error when swapoff race with swapin
  mm/swap: introduce a helper non fault swapin
  shmem, swap: refactor error check on OOM or race
  swap: simplify and make swap_find_cache static
  swap: make swapin_readahead result checking argument mandatory
  swap: make swap_cluster_readahead static
  swap: fix multiple swap leak after cgroup migration
  mm/swap: change swapin_readahead to swapin_page_fault

 include/linux/swap.h |   7 --
 mm/memory.c          | 109 +++++++--------------
 mm/shmem.c           |  55 ++++-------
 mm/swap.h            |  34 ++++---
 mm/swap_state.c      | 222 ++++++++++++++++++++++++++++++++-----------
 mm/swapfile.c        |  70 ++++++--------
 mm/zswap.c           |   2 +-
 7 files changed, 269 insertions(+), 230 deletions(-)

-- 
2.42.0


* [PATCH 01/24] mm/swap: fix a potential undefined behavior issue
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-19 20:55   ` Matthew Wilcox
  2023-11-19 19:47 ` [PATCH 02/24] mm/swapfile.c: add back some comment Kairui Song
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

When folio is NULL, taking the address of its struct member is
undefined behavior; the UB is caused by applying the -> operator
to a pointer that does not point to any object. Although in practice
this won't lead to a real issue, it is still better to fix it. It
also makes the code less error-prone: when folio is NULL, page is
also NULL, instead of holding a meaningless offset value.
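
A simplified illustration of the pattern being fixed (types reduced for
brevity, not taken from the kernel source):

	struct page { unsigned long flags; };
	struct folio { struct page page; };

	void example(void)
	{
		struct folio *folio = NULL;
		struct page *page = &folio->page;	/* UB: '->' applied to a
							 * null pointer, even though
							 * nothing is dereferenced */
	}

With this patch, page is only derived from folio after the "if (folio)"
check succeeds.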

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e27e2e5beb3f..70ffa867b1be 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3861,7 +3861,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			/* skip swapcache */
 			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
 						vma, vmf->address, false);
-			page = &folio->page;
 			if (folio) {
 				__folio_set_locked(folio);
 				__folio_set_swapbacked(folio);
@@ -3879,6 +3878,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 					workingset_refault(folio, shadow);
 
 				folio_add_lru(folio);
+				page = &folio->page;
 
 				/* To provide entry to swap_readpage() */
 				folio->swap = entry;
-- 
2.42.0


* [PATCH 02/24] mm/swapfile.c: add back some comment
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
  2023-11-19 19:47 ` [PATCH 01/24] mm/swap: fix a potential undefined behavior issue Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-19 19:47 ` [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper Kairui Song
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Some useful comments were dropped in commit b56a2d8af914 ("mm: rid
swapoff of quadratic complexity"); add them back.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4bc70f459164..756104ebd585 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1879,6 +1879,17 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				folio = page_folio(page);
 		}
 		if (!folio) {
+			/*
+			 * The entry could have been freed, and will not
+			 * be reused since swapoff() already disabled
+			 * allocation from here, or alloc_page() failed.
+			 *
+			 * We don't hold a lock here, so the swap entry could be
+			 * SWAP_MAP_BAD (when the cluster is discarding).
+			 * Instead of failing out, we can just skip the swap
+			 * entry because swapoff will wait for the discard to
+			 * finish anyway.
+			 */
 			swp_count = READ_ONCE(si->swap_map[offset]);
 			if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
 				continue;
-- 
2.42.0


* [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
  2023-11-19 19:47 ` [PATCH 01/24] mm/swap: fix a potential undefined behavior issue Kairui Song
  2023-11-19 19:47 ` [PATCH 02/24] mm/swapfile.c: add back some comment Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-19 21:00   ` Matthew Wilcox
                     ` (2 more replies)
  2023-11-19 19:47 ` [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check Kairui Song
                   ` (21 subsequent siblings)
  24 siblings, 3 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

No feature change, simply move the routine to a standalone function to
be used later. The error path handling is copied from the "out_page"
label, to keep the code change minimal for easier reviewing.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 33 +++++----------------------------
 mm/swap.h       |  2 ++
 mm/swap_state.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 70ffa867b1be..fba4a5229163 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3794,7 +3794,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swp_entry_t entry;
 	pte_t pte;
 	vm_fault_t ret = 0;
-	void *shadow = NULL;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -3858,33 +3857,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/* skip swapcache */
-			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-						vma, vmf->address, false);
-			if (folio) {
-				__folio_set_locked(folio);
-				__folio_set_swapbacked(folio);
-
-				if (mem_cgroup_swapin_charge_folio(folio,
-							vma->vm_mm, GFP_KERNEL,
-							entry)) {
-					ret = VM_FAULT_OOM;
-					goto out_page;
-				}
-				mem_cgroup_swapin_uncharge_swap(entry);
-
-				shadow = get_shadow_from_swap_cache(entry);
-				if (shadow)
-					workingset_refault(folio, shadow);
-
-				folio_add_lru(folio);
-				page = &folio->page;
-
-				/* To provide entry to swap_readpage() */
-				folio->swap = entry;
-				swap_readpage(page, true, NULL);
-				folio->private = NULL;
-			}
+			/* skip swapcache and readahead */
+			page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
+						vmf);
+			if (page)
+				folio = page_folio(page);
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						vmf);
diff --git a/mm/swap.h b/mm/swap.h
index 73c332ee4d91..ea4be4791394 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -56,6 +56,8 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 				    struct mempolicy *mpol, pgoff_t ilx);
 struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
 			      struct vm_fault *vmf);
+struct page *swapin_no_readahead(swp_entry_t entry, gfp_t flag,
+				 struct vm_fault *vmf);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 85d9e5806a6a..ac4fa404eaa7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -853,6 +853,54 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	return page;
 }
 
+/**
+ * swapin_no_readahead - swap in pages skipping swap cache and readahead
+ * @entry: swap entry of this memory
+ * @gfp_mask: memory allocation flags
+ * @vmf: fault information
+ *
+ * Returns the struct page for entry and addr after the swap entry is read
+ * in.
+ */
+struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
+				 struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page = NULL;
+	struct folio *folio;
+	void *shadow = NULL;
+
+	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
+				vma, vmf->address, false);
+	if (folio) {
+		__folio_set_locked(folio);
+		__folio_set_swapbacked(folio);
+
+		if (mem_cgroup_swapin_charge_folio(folio,
+					vma->vm_mm, GFP_KERNEL,
+					entry)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			return NULL;
+		}
+		mem_cgroup_swapin_uncharge_swap(entry);
+
+		shadow = get_shadow_from_swap_cache(entry);
+		if (shadow)
+			workingset_refault(folio, shadow);
+
+		folio_add_lru(folio);
+
+		/* To provide entry to swap_readpage() */
+		folio->swap = entry;
+		page = &folio->page;
+		swap_readpage(page, true, NULL);
+		folio->private = NULL;
+	}
+
+	return page;
+}
+
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
-- 
2.42.0


* [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (2 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-20  4:17   ` Chris Li
  2023-11-19 19:47 ` [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead Kairui Song
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

When swapping in a page, mem_cgroup_swapin_charge_folio is called for the
newly allocated folio; nothing else is referencing the folio yet, so there
is no need to set the lock bit before charging. This avoids the unlock
check on the error path.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index ac4fa404eaa7..45dd8b7c195d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -458,6 +458,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 						mpol, ilx, numa_node_id());
 		if (!folio)
                         goto fail_put_swap;
+		if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
+			goto fail_put_folio;
 
 		/*
 		 * Swap entry may have been freed since our caller observed it.
@@ -483,13 +485,9 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	/*
 	 * The swap entry is ours to swap in. Prepare the new page.
 	 */
-
 	__folio_set_locked(folio);
 	__folio_set_swapbacked(folio);
 
-	if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
-		goto fail_unlock;
-
 	/* May fail (-ENOMEM) if XArray node allocation failed. */
 	if (add_to_swap_cache(folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
 		goto fail_unlock;
@@ -510,6 +508,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 fail_unlock:
 	put_swap_folio(folio, entry);
 	folio_unlock(folio);
+fail_put_folio:
 	folio_put(folio);
 fail_put_swap:
 	put_swap_device(si);
@@ -873,16 +872,15 @@ struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
 				vma, vmf->address, false);
 	if (folio) {
-		__folio_set_locked(folio);
-		__folio_set_swapbacked(folio);
-
-		if (mem_cgroup_swapin_charge_folio(folio,
-					vma->vm_mm, GFP_KERNEL,
-					entry)) {
-			folio_unlock(folio);
+		if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
+						   GFP_KERNEL, entry)) {
 			folio_put(folio);
 			return NULL;
 		}
+
+		__folio_set_locked(folio);
+		__folio_set_swapbacked(folio);
+
 		mem_cgroup_swapin_uncharge_swap(entry);
 
 		shadow = get_shadow_from_swap_cache(entry);
-- 
2.42.0


* [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (3 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-21  6:15   ` Chris Li
  2023-11-19 19:47 ` [PATCH 06/24] swap: rework swapin_no_readahead arguments Kairui Song
                   ` (19 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

This makes swapin_readahead the main entry point for swapping in pages,
preparing for optimizations in later commits.

This also allows swapoff to make use of the per-entry readahead check.
Swapping off a 10G ZRAM (lzo-rle) swap device is faster:

Before:
time swapoff /dev/zram0
real    0m12.337s
user    0m0.001s
sys     0m12.329s

After:
time swapoff /dev/zram0
real    0m9.728s
user    0m0.001s
sys     0m9.719s

What's more, because swapoff now also makes use of the no-readahead
swapin helper, this also fixes a bug in the no-readahead case (e.g. ZRAM):
when a process that previously swapped out some memory is moved to a new
cgroup, and the original cgroup is dead, swapping off the swap device will
account the swapped-in pages to the process doing the swapoff instead of
the new cgroup the process was moved to.

This can be easily reproduced by:
- Setup a ramdisk (eg. ZRAM) swap.
- Create memory cgroup A, B and C.
- Spawn process P1 in cgroup A and make it swap out some pages.
- Move process P1 to memory cgroup B.
- Destroy cgroup A.
- Do a swapoff in cgroup C.
- Swapped-in pages are accounted into cgroup C.

This patch fixes it by making the swapped-in pages accounted to cgroup B.

The same bug exists for readahead path too, we'll fix it in later
commits.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 22 +++++++---------------
 mm/swap.h       |  6 ++----
 mm/swap_state.c | 33 ++++++++++++++++++++++++++-------
 mm/swapfile.c   |  2 +-
 4 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fba4a5229163..f4237a2e3b93 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3792,6 +3792,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	rmap_t rmap_flags = RMAP_NONE;
 	bool exclusive = false;
 	swp_entry_t entry;
+	bool swapcached;
 	pte_t pte;
 	vm_fault_t ret = 0;
 
@@ -3855,22 +3856,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swapcache = folio;
 
 	if (!folio) {
-		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
-		    __swap_count(entry) == 1) {
-			/* skip swapcache and readahead */
-			page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						vmf);
-			if (page)
-				folio = page_folio(page);
+		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+					vmf, &swapcached);
+		if (page) {
+			folio = page_folio(page);
+			if (swapcached)
+				swapcache = folio;
 		} else {
-			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						vmf);
-			if (page)
-				folio = page_folio(page);
-			swapcache = folio;
-		}
-
-		if (!folio) {
 			/*
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
diff --git a/mm/swap.h b/mm/swap.h
index ea4be4791394..f82d43d7b52a 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -55,9 +55,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 				    struct mempolicy *mpol, pgoff_t ilx);
 struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
-			      struct vm_fault *vmf);
-struct page *swapin_no_readahead(swp_entry_t entry, gfp_t flag,
-				 struct vm_fault *vmf);
+			      struct vm_fault *vmf, bool *swapcached);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -89,7 +87,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
 }
 
 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf)
+			struct vm_fault *vmf, bool *swapcached)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 45dd8b7c195d..fd0047ae324e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -316,6 +316,11 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 	release_pages(pages, nr);
 }
 
+static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
+{
+	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
+}
+
 static inline bool swap_use_vma_readahead(void)
 {
 	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
@@ -861,8 +866,8 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * Returns the struct page for entry and addr after the swap entry is read
  * in.
  */
-struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				 struct vm_fault *vmf)
+static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
+					struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page = NULL;
@@ -904,6 +909,8 @@ struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @vmf: fault information
+ * @swapcached: pointer to a bool used as indicator if the
+ *              page is swapped in through swapcache.
  *
  * Returns the struct page for entry and addr, after queueing swapin.
  *
@@ -912,17 +919,29 @@ struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
  * or vma-based(ie, virtual address based on faulty address) readahead.
  */
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				struct vm_fault *vmf)
+			      struct vm_fault *vmf, bool *swapcached)
 {
 	struct mempolicy *mpol;
-	pgoff_t ilx;
 	struct page *page;
+	pgoff_t ilx;
+	bool cached;
 
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
-	page = swap_use_vma_readahead() ?
-		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
-		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+	if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
+		page = swapin_no_readahead(entry, gfp_mask, vmf);
+		cached = false;
+	} else if (swap_use_vma_readahead()) {
+		page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
+		cached = true;
+	} else {
+		page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+		cached = true;
+	}
 	mpol_cond_put(mpol);
+
+	if (swapcached)
+		*swapcached = cached;
+
 	return page;
 }
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 756104ebd585..0142bfc71b81 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1874,7 +1874,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			};
 
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						&vmf);
+						&vmf, NULL);
 			if (page)
 				folio = page_folio(page);
 		}
-- 
2.42.0


* [PATCH 06/24] swap: rework swapin_no_readahead arguments
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (4 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-20  0:20   ` kernel test robot
  2023-11-21  6:44   ` Chris Li
  2023-11-19 19:47 ` [PATCH 07/24] mm/swap: move swap_count to header to be shared Kairui Song
                   ` (18 subsequent siblings)
  24 siblings, 2 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Make it use alloc_pages_mpol instead of vma_alloc_folio, and accept an
mm_struct directly as an argument instead of taking a vmf. Make its
arguments similar to swap_{cluster,vma}_readahead, to keep the code
more aligned.

Also prepare for following commits which will have no vmf available for
certain swapin paths.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index fd0047ae324e..ff6756f2e8e4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -867,17 +867,17 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * in.
  */
 static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
-					struct vm_fault *vmf)
+					struct mempolicy *mpol, pgoff_t ilx,
+					struct mm_struct *mm)
 {
-	struct vm_area_struct *vma = vmf->vma;
-	struct page *page = NULL;
 	struct folio *folio;
+	struct page *page;
 	void *shadow = NULL;
 
-	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-				vma, vmf->address, false);
+	page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
+	folio = (struct folio *)page;
 	if (folio) {
-		if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
+		if (mem_cgroup_swapin_charge_folio(folio, mm,
 						   GFP_KERNEL, entry)) {
 			folio_put(folio);
 			return NULL;
@@ -896,7 +896,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
 
 		/* To provide entry to swap_readpage() */
 		folio->swap = entry;
-		page = &folio->page;
 		swap_readpage(page, true, NULL);
 		folio->private = NULL;
 	}
@@ -928,7 +927,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 	if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
-		page = swapin_no_readahead(entry, gfp_mask, vmf);
+		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
 		cached = false;
 	} else if (swap_use_vma_readahead()) {
 		page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
-- 
2.42.0


* [PATCH 07/24] mm/swap: move swap_count to header to be shared
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (5 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 06/24] swap: rework swapin_no_readahead arguments Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-21  6:51   ` Chris Li
  2023-11-19 19:47 ` [PATCH 08/24] mm/swap: check readahead policy per entry Kairui Song
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

No feature change, just preparation for later commits.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h     | 5 +++++
 mm/swapfile.c | 5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index f82d43d7b52a..a9a654af791e 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -61,6 +61,11 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
 {
 	return page_swap_info(&folio->page)->flags;
 }
+
+static inline unsigned char swap_count(unsigned char ent)
+{
+	return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
+}
 #else /* CONFIG_SWAP */
 struct swap_iocb;
 static inline void swap_readpage(struct page *page, bool do_poll,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0142bfc71b81..a8ae472ed2b6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -114,11 +114,6 @@ static struct swap_info_struct *swap_type_to_swap_info(int type)
 	return READ_ONCE(swap_info[type]); /* rcu_dereference() */
 }
 
-static inline unsigned char swap_count(unsigned char ent)
-{
-	return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
-}
-
 /* Reclaim the swap entry anyway if possible */
 #define TTRS_ANYWAY		0x1
 /*
-- 
2.42.0


* [PATCH 08/24] mm/swap: check readahead policy per entry
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (6 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 07/24] mm/swap: move swap_count to header to be shared Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-20  6:04   ` Huang, Ying
  2023-11-21  7:54   ` Chris Li
  2023-11-19 19:47 ` [PATCH 09/24] mm/swap: inline __swap_count Kairui Song
                   ` (16 subsequent siblings)
  24 siblings, 2 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently VMA readahead is globally disabled when any rotating disk is
used as a swap backend. So when multiple swap devices are enabled, if a
slower hard disk is set as a low-priority fallback and a high-performance
SSD is used as the high-priority swap device, VMA readahead is disabled
globally and the SSD swap device's performance will drop by a lot.

Check the readahead policy per entry to avoid this problem.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index ff6756f2e8e4..fb78f7f18ed7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
 	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
 }
 
-static inline bool swap_use_vma_readahead(void)
+static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
 {
-	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
+	return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);
 }
 
 /*
@@ -341,7 +341,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
 
 	folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
 	if (!IS_ERR(folio)) {
-		bool vma_ra = swap_use_vma_readahead();
+		bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
 		bool readahead;
 
 		/*
@@ -920,16 +920,18 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			      struct vm_fault *vmf, bool *swapcached)
 {
+	struct swap_info_struct *si;
 	struct mempolicy *mpol;
 	struct page *page;
 	pgoff_t ilx;
 	bool cached;
 
+	si = swp_swap_info(entry);
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
-	if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
+	if (swap_use_no_readahead(si, entry)) {
 		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
 		cached = false;
-	} else if (swap_use_vma_readahead()) {
+	} else if (swap_use_vma_readahead(si)) {
 		page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
 		cached = true;
 	} else {
-- 
2.42.0


* [PATCH 09/24] mm/swap: inline __swap_count
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (7 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 08/24] mm/swap: check readahead policy per entry Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-20  7:41   ` Huang, Ying
  2023-11-19 19:47 ` [PATCH 10/24] mm/swap: remove nr_rotate_swap and related code Kairui Song
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

There is only one caller in the swap subsystem now, where it can be
inlined smoothly, avoiding the memory access and function call overheads.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h | 6 ------
 mm/swap_state.c      | 6 +++---
 mm/swapfile.c        | 8 --------
 3 files changed, 3 insertions(+), 17 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2401990d954d..64a37819a9b3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -485,7 +485,6 @@ int swap_type_of(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
 extern unsigned int count_swap_pages(int, int);
 extern sector_t swapdev_block(int, pgoff_t);
-extern int __swap_count(swp_entry_t entry);
 extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -559,11 +558,6 @@ static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
 {
 }
 
-static inline int __swap_count(swp_entry_t entry)
-{
-	return 0;
-}
-
 static inline int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 {
 	return 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index fb78f7f18ed7..d87c20f9f7ec 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -316,9 +316,9 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 	release_pages(pages, nr);
 }
 
-static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
+static inline bool swap_use_no_readahead(struct swap_info_struct *si, pgoff_t offset)
 {
-	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
+	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && swap_count(si->swap_map[offset]) == 1;
 }
 
 static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
@@ -928,7 +928,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 
 	si = swp_swap_info(entry);
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
-	if (swap_use_no_readahead(si, entry)) {
+	if (swap_use_no_readahead(si, swp_offset(entry))) {
 		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
 		cached = false;
 	} else if (swap_use_vma_readahead(si)) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a8ae472ed2b6..e15a6c464a38 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1431,14 +1431,6 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
 		spin_unlock(&p->lock);
 }
 
-int __swap_count(swp_entry_t entry)
-{
-	struct swap_info_struct *si = swp_swap_info(entry);
-	pgoff_t offset = swp_offset(entry);
-
-	return swap_count(si->swap_map[offset]);
-}
-
 /*
  * How many references to @entry are currently swapped out?
  * This does not give an exact answer when swap count is continued,
-- 
2.42.0


* [PATCH 10/24] mm/swap: remove nr_rotate_swap and related code
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (8 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 09/24] mm/swap: inline __swap_count Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-21 15:45   ` Chris Li
  2023-11-19 19:47 ` [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead Kairui Song
                   ` (14 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

No longer needed after switching to a per-entry swap readahead policy.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 11 -----------
 2 files changed, 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 64a37819a9b3..cc83fb884757 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -454,7 +454,6 @@ extern void free_pages_and_swap_cache(struct encoded_page **, int);
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
-extern atomic_t nr_rotate_swap;
 extern bool has_usable_swap(void);
 
 /* Swap 50% full? Release swapcache more aggressively.. */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e15a6c464a38..01c3f53b6521 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -104,8 +104,6 @@ static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
 /* Activity counter to indicate that a swapon or swapoff has occurred */
 static atomic_t proc_poll_event = ATOMIC_INIT(0);
 
-atomic_t nr_rotate_swap = ATOMIC_INIT(0);
-
 static struct swap_info_struct *swap_type_to_swap_info(int type)
 {
 	if (type >= MAX_SWAPFILES)
@@ -2486,9 +2484,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	if (p->flags & SWP_CONTINUED)
 		free_swap_count_continuations(p);
 
-	if (!p->bdev || !bdev_nonrot(p->bdev))
-		atomic_dec(&nr_rotate_swap);
-
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
@@ -2990,7 +2985,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	struct swap_cluster_info *cluster_info = NULL;
 	struct page *page = NULL;
 	struct inode *inode = NULL;
-	bool inced_nr_rotate_swap = false;
 
 	if (swap_flags & ~SWAP_FLAGS_VALID)
 		return -EINVAL;
@@ -3112,9 +3106,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
 			cluster_set_null(&cluster->index);
 		}
-	} else {
-		atomic_inc(&nr_rotate_swap);
-		inced_nr_rotate_swap = true;
 	}
 
 	error = swap_cgroup_swapon(p->type, maxpages);
@@ -3218,8 +3209,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	spin_unlock(&swap_lock);
 	vfree(swap_map);
 	kvfree(cluster_info);
-	if (inced_nr_rotate_swap)
-		atomic_dec(&nr_rotate_swap);
 	if (swap_file)
 		filp_close(swap_file, NULL);
 out:
-- 
2.42.0


* [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (9 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 10/24] mm/swap: remove nr_rotate_swap and related code Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-20  0:47   ` kernel test robot
  2023-11-21 16:06   ` Chris Li
  2023-11-19 19:47 ` [PATCH 12/24] mm/swap: simplify arguments for swap_cache_get_folio Kairui Song
                   ` (13 subsequent siblings)
  24 siblings, 2 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

No feature change, just preparation for later commits.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 61 +++++++++++++++++++++++--------------------------
 mm/swap.h       | 10 ++++++--
 mm/swap_state.c | 26 +++++++++++++--------
 mm/swapfile.c   | 30 +++++++++++-------------
 4 files changed, 66 insertions(+), 61 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f4237a2e3b93..22af9f3e8c75 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3786,13 +3786,13 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct folio *swapcache, *folio = NULL;
+	struct folio *swapcache = NULL, *folio = NULL;
+	enum swap_cache_result cache_result;
 	struct page *page;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
 	bool exclusive = false;
 	swp_entry_t entry;
-	bool swapcached;
 	pte_t pte;
 	vm_fault_t ret = 0;
 
@@ -3850,42 +3850,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(!si))
 		goto out;
 
-	folio = swap_cache_get_folio(entry, vma, vmf->address);
-	if (folio)
-		page = folio_file_page(folio, swp_offset(entry));
-	swapcache = folio;
-
-	if (!folio) {
-		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-					vmf, &swapcached);
-		if (page) {
-			folio = page_folio(page);
-			if (swapcached)
-				swapcache = folio;
-		} else {
+	page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+				vmf, &cache_result);
+	if (page) {
+		folio = page_folio(page);
+		if (cache_result != SWAP_CACHE_HIT) {
+			/* Had to read the page from swap area: Major fault */
+			ret = VM_FAULT_MAJOR;
+			count_vm_event(PGMAJFAULT);
+			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+		}
+		if (cache_result != SWAP_CACHE_BYPASS)
+			swapcache = folio;
+		if (PageHWPoison(page)) {
 			/*
-			 * Back out if somebody else faulted in this pte
-			 * while we released the pte lock.
+			 * hwpoisoned dirty swapcache pages are kept for killing
+			 * owner processes (which may be unknown at hwpoison time)
 			 */
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
-			if (likely(vmf->pte &&
-				   pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
-				ret = VM_FAULT_OOM;
-			goto unlock;
+			ret = VM_FAULT_HWPOISON;
+			goto out_release;
 		}
-
-		/* Had to read the page from swap area: Major fault */
-		ret = VM_FAULT_MAJOR;
-		count_vm_event(PGMAJFAULT);
-		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
-	} else if (PageHWPoison(page)) {
+	} else {
 		/*
-		 * hwpoisoned dirty swapcache pages are kept for killing
-		 * owner processes (which may be unknown at hwpoison time)
+		 * Back out if somebody else faulted in this pte
+		 * while we released the pte lock.
 		 */
-		ret = VM_FAULT_HWPOISON;
-		goto out_release;
+		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+				vmf->address, &vmf->ptl);
+		if (likely(vmf->pte &&
+			   pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
+			ret = VM_FAULT_OOM;
+		goto unlock;
 	}
 
 	ret |= folio_lock_or_retry(folio, vmf);
diff --git a/mm/swap.h b/mm/swap.h
index a9a654af791e..ac9136eee690 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -30,6 +30,12 @@ extern struct address_space *swapper_spaces[];
 	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
 		>> SWAP_ADDRESS_SPACE_SHIFT])
 
+enum swap_cache_result {
+	SWAP_CACHE_HIT,
+	SWAP_CACHE_MISS,
+	SWAP_CACHE_BYPASS,
+};
+
 void show_swap_cache_info(void);
 bool add_to_swap(struct folio *folio);
 void *get_shadow_from_swap_cache(swp_entry_t entry);
@@ -55,7 +61,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 				    struct mempolicy *mpol, pgoff_t ilx);
 struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
-			      struct vm_fault *vmf, bool *swapcached);
+			      struct vm_fault *vmf, enum swap_cache_result *result);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -92,7 +98,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
 }
 
 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf, bool *swapcached)
+			struct vm_fault *vmf, enum swap_cache_result *result)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d87c20f9f7ec..e96d63bf8a22 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -908,8 +908,7 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @vmf: fault information
- * @swapcached: pointer to a bool used as indicator if the
- *              page is swapped in through swapcache.
+ * @result: a return value to indicate swap cache usage.
  *
  * Returns the struct page for entry and addr, after queueing swapin.
  *
@@ -918,30 +917,39 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
  * or vma-based(ie, virtual address based on faulty address) readahead.
  */
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-			      struct vm_fault *vmf, bool *swapcached)
+			      struct vm_fault *vmf, enum swap_cache_result *result)
 {
+	enum swap_cache_result cache_result;
 	struct swap_info_struct *si;
 	struct mempolicy *mpol;
+	struct folio *folio;
 	struct page *page;
 	pgoff_t ilx;
-	bool cached;
+
+	folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
+	if (folio) {
+		page = folio_file_page(folio, swp_offset(entry));
+		cache_result = SWAP_CACHE_HIT;
+		goto done;
+	}
 
 	si = swp_swap_info(entry);
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 	if (swap_use_no_readahead(si, swp_offset(entry))) {
 		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
-		cached = false;
+		cache_result = SWAP_CACHE_BYPASS;
 	} else if (swap_use_vma_readahead(si)) {
 		page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
-		cached = true;
+		cache_result = SWAP_CACHE_MISS;
 	} else {
 		page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
-		cached = true;
+		cache_result = SWAP_CACHE_MISS;
 	}
 	mpol_cond_put(mpol);
 
-	if (swapcached)
-		*swapcached = cached;
+done:
+	if (result)
+		*result = cache_result;
 
 	return page;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 01c3f53b6521..b6d57fff5e21 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1822,13 +1822,21 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 	si = swap_info[type];
 	do {
-		struct folio *folio;
+		struct page *page;
 		unsigned long offset;
 		unsigned char swp_count;
+		struct folio *folio = NULL;
 		swp_entry_t entry;
 		int ret;
 		pte_t ptent;
 
+		struct vm_fault vmf = {
+			.vma = vma,
+			.address = addr,
+			.real_address = addr,
+			.pmd = pmd,
+		};
+
 		if (!pte++) {
 			pte = pte_offset_map(pmd, addr);
 			if (!pte)
@@ -1847,22 +1855,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		offset = swp_offset(entry);
 		pte_unmap(pte);
 		pte = NULL;
-
-		folio = swap_cache_get_folio(entry, vma, addr);
-		if (!folio) {
-			struct page *page;
-			struct vm_fault vmf = {
-				.vma = vma,
-				.address = addr,
-				.real_address = addr,
-				.pmd = pmd,
-			};
-
-			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						&vmf, NULL);
-			if (page)
-				folio = page_folio(page);
-		}
+		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+					&vmf, NULL);
+		if (page)
+			folio = page_folio(page);
 		if (!folio) {
 			/*
 			 * The entry could have been freed, and will not
-- 
2.42.0


* [PATCH 12/24] mm/swap: simplify arguments for swap_cache_get_folio
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (10 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-21 16:36   ` Chris Li
  2023-11-19 19:47 ` [PATCH 13/24] swap: simplify swap_cache_get_folio Kairui Song
                   ` (12 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

There are only two callers now, so simplify the arguments.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c      |  2 +-
 mm/swap.h       |  2 +-
 mm/swap_state.c | 15 +++++++--------
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 0d1ce70bce38..72239061c655 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1875,7 +1875,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 
 	/* Look it up and read it in.. */
-	folio = swap_cache_get_folio(swap, NULL, 0);
+	folio = swap_cache_get_folio(swap, NULL);
 	if (!folio) {
 		/* Or update major stats only when swapin succeeds?? */
 		if (fault_type) {
diff --git a/mm/swap.h b/mm/swap.h
index ac9136eee690..e43e965f123f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -47,7 +47,7 @@ void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
 struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr);
+				   struct vm_fault *vmf);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
 		pgoff_t index);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e96d63bf8a22..91461e26a8cc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -334,8 +334,7 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
  *
  * Caller must lock the swap device or hold a reference to keep it valid.
  */
-struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr)
+struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)
 {
 	struct folio *folio;
 
@@ -352,22 +351,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
 			return folio;
 
 		readahead = folio_test_clear_readahead(folio);
-		if (vma && vma_ra) {
+		if (vmf && vma_ra) {
 			unsigned long ra_val;
 			int win, hits;
 
-			ra_val = GET_SWAP_RA_VAL(vma);
+			ra_val = GET_SWAP_RA_VAL(vmf->vma);
 			win = SWAP_RA_WIN(ra_val);
 			hits = SWAP_RA_HITS(ra_val);
 			if (readahead)
 				hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
-			atomic_long_set(&vma->swap_readahead_info,
-					SWAP_RA_VAL(addr, win, hits));
+			atomic_long_set(&vmf->vma->swap_readahead_info,
+					SWAP_RA_VAL(vmf->address, win, hits));
 		}
 
 		if (readahead) {
 			count_vm_event(SWAP_RA_HIT);
-			if (!vma || !vma_ra)
+			if (!vmf || !vma_ra)
 				atomic_inc(&swapin_readahead_hits);
 		}
 	} else {
@@ -926,7 +925,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	struct page *page;
 	pgoff_t ilx;
 
-	folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
+	folio = swap_cache_get_folio(entry, vmf);
 	if (folio) {
 		page = folio_file_page(folio, swp_offset(entry));
 		cache_result = SWAP_CACHE_HIT;
-- 
2.42.0


* [PATCH 13/24] swap: simplify swap_cache_get_folio
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (11 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 12/24] mm/swap: simplify arguments for swap_cache_get_folio Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-21 16:50   ` Chris Li
  2023-11-19 19:47 ` [PATCH 14/24] mm/swap: do shadow lookup as well when doing swap cache lookup Kairui Song
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Rearrange the if statement and reduce the code indentation; no feature change.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 58 ++++++++++++++++++++++++-------------------------
 1 file changed, 28 insertions(+), 30 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 91461e26a8cc..3b5a34f47192 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -336,41 +336,39 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)
 {
+	bool vma_ra, readahead;
 	struct folio *folio;
 
 	folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
-	if (!IS_ERR(folio)) {
-		bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
-		bool readahead;
+	if (IS_ERR(folio))
+		return NULL;
 
-		/*
-		 * At the moment, we don't support PG_readahead for anon THP
-		 * so let's bail out rather than confusing the readahead stat.
-		 */
-		if (unlikely(folio_test_large(folio)))
-			return folio;
-
-		readahead = folio_test_clear_readahead(folio);
-		if (vmf && vma_ra) {
-			unsigned long ra_val;
-			int win, hits;
-
-			ra_val = GET_SWAP_RA_VAL(vmf->vma);
-			win = SWAP_RA_WIN(ra_val);
-			hits = SWAP_RA_HITS(ra_val);
-			if (readahead)
-				hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
-			atomic_long_set(&vmf->vma->swap_readahead_info,
-					SWAP_RA_VAL(vmf->address, win, hits));
-		}
+	/*
+	 * At the moment, we don't support PG_readahead for anon THP
+	 * so let's bail out rather than confusing the readahead stat.
+	 */
+	if (unlikely(folio_test_large(folio)))
+		return folio;
 
-		if (readahead) {
-			count_vm_event(SWAP_RA_HIT);
-			if (!vmf || !vma_ra)
-				atomic_inc(&swapin_readahead_hits);
-		}
-	} else {
-		folio = NULL;
+	vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
+	readahead = folio_test_clear_readahead(folio);
+	if (vmf && vma_ra) {
+		unsigned long ra_val;
+		int win, hits;
+
+		ra_val = GET_SWAP_RA_VAL(vmf->vma);
+		win = SWAP_RA_WIN(ra_val);
+		hits = SWAP_RA_HITS(ra_val);
+		if (readahead)
+			hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
+		atomic_long_set(&vmf->vma->swap_readahead_info,
+				SWAP_RA_VAL(vmf->address, win, hits));
+	}
+
+	if (readahead) {
+		count_vm_event(SWAP_RA_HIT);
+		if (!vmf || !vma_ra)
+			atomic_inc(&swapin_readahead_hits);
 	}
 
 	return folio;
-- 
2.42.0


* [PATCH 14/24] mm/swap: do shadow lookup as well when doing swap cache lookup
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (12 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 13/24] swap: simplify swap_cache_get_folio Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-21 16:55   ` Chris Li
  2023-11-19 19:47 ` [PATCH 15/24] mm/swap: avoid a duplicated swap cache lookup for SYNCHRONOUS_IO device Kairui Song
                   ` (10 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Make swap_cache_get_folio capable of returning the shadow value when the
xarray contains a shadow instead of a valid folio. This only extends the
arguments; the shadow value will be consumed by later commits.
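
An illustrative sketch (not part of this patch) of how a caller could
consume the new argument, assuming the extended signature below; the
shadow is only meaningful on a cache miss where the xarray holds a
shadow entry:

	void *shadow = NULL;
	struct folio *folio;

	folio = swap_cache_get_folio(entry, NULL, &shadow);
	if (!folio && shadow) {
		/* Cache miss: the shadow from this single lookup can be
		 * reused later (e.g. for a workingset refault check)
		 * without a second xarray lookup. */
	}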

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c      |  2 +-
 mm/swap.h       |  2 +-
 mm/swap_state.c | 11 +++++++----
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 72239061c655..f9ce4067c742 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1875,7 +1875,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 
 	/* Look it up and read it in.. */
-	folio = swap_cache_get_folio(swap, NULL);
+	folio = swap_cache_get_folio(swap, NULL, NULL);
 	if (!folio) {
 		/* Or update major stats only when swapin succeeds?? */
 		if (fault_type) {
diff --git a/mm/swap.h b/mm/swap.h
index e43e965f123f..da9deb5ba37d 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -47,7 +47,7 @@ void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
 struct folio *swap_cache_get_folio(swp_entry_t entry,
-				   struct vm_fault *vmf);
+				   struct vm_fault *vmf, void **shadowp);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
 		pgoff_t index);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3b5a34f47192..e057c79fb06f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -334,14 +334,17 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
  *
  * Caller must lock the swap device or hold a reference to keep it valid.
  */
-struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)
+struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
 {
 	bool vma_ra, readahead;
 	struct folio *folio;
 
-	folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
-	if (IS_ERR(folio))
+	folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
+	if (xa_is_value(folio)) {
+		if (shadowp)
+			*shadowp = folio;
 		return NULL;
+	}
 
 	/*
 	 * At the moment, we don't support PG_readahead for anon THP
@@ -923,7 +926,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	struct page *page;
 	pgoff_t ilx;
 
-	folio = swap_cache_get_folio(entry, vmf);
+	folio = swap_cache_get_folio(entry, vmf, NULL);
 	if (folio) {
 		page = folio_file_page(folio, swp_offset(entry));
 		cache_result = SWAP_CACHE_HIT;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 15/24] mm/swap: avoid a duplicated swap cache lookup for SYNCHRONOUS_IO device
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (13 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 14/24] mm/swap: do shadow lookup as well when doing swap cache lookup Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-21 17:15   ` Chris Li
  2023-11-19 19:47 ` [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path Kairui Song
                   ` (9 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

When an xa_value is returned by the cache lookup, keep it to be used
later for the workingset refault check instead of doing the lookup again
in swapin_no_readahead.

This does have the side effect of making swapoff also trigger the
workingset check, but that should be fine since swapoff already affects
the workload in many ways.
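
A rough sketch of the resulting flow in swapin_readahead (illustrative
only, simplified from the diff below):

	void *shadow = NULL;

	folio = swap_cache_get_folio(entry, vmf, &shadow);
	if (!folio && swap_use_no_readahead(si, swp_offset(entry))) {
		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx,
					   vmf->vma->vm_mm);
		/* Reuse the shadow from the first lookup instead of
		 * calling get_shadow_from_swap_cache() again. */
		if (shadow)
			workingset_refault(page_folio(page), shadow);
	}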

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index e057c79fb06f..51de2a0412df 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -872,7 +872,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
 {
 	struct folio *folio;
 	struct page *page;
-	void *shadow = NULL;
 
 	page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
 	folio = (struct folio *)page;
@@ -888,10 +887,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
 
 		mem_cgroup_swapin_uncharge_swap(entry);
 
-		shadow = get_shadow_from_swap_cache(entry);
-		if (shadow)
-			workingset_refault(folio, shadow);
-
 		folio_add_lru(folio);
 
 		/* To provide entry to swap_readpage() */
@@ -922,11 +917,12 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	enum swap_cache_result cache_result;
 	struct swap_info_struct *si;
 	struct mempolicy *mpol;
+	void *shadow = NULL;
 	struct folio *folio;
 	struct page *page;
 	pgoff_t ilx;
 
-	folio = swap_cache_get_folio(entry, vmf, NULL);
+	folio = swap_cache_get_folio(entry, vmf, &shadow);
 	if (folio) {
 		page = folio_file_page(folio, swp_offset(entry));
 		cache_result = SWAP_CACHE_HIT;
@@ -938,6 +934,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (swap_use_no_readahead(si, swp_offset(entry))) {
 		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
 		cache_result = SWAP_CACHE_BYPASS;
+		if (shadow)
+			workingset_refault(page_folio(page), shadow);
 	} else if (swap_use_vma_readahead(si)) {
 		page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
 		cache_result = SWAP_CACHE_MISS;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (14 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 15/24] mm/swap: avoid a duplicated swap cache lookup for SYNCHRONOUS_IO device Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-19 21:12   ` Matthew Wilcox
                     ` (2 more replies)
  2023-11-19 19:47 ` [PATCH 17/24] mm/swap: fix false error when swapoff race with swapin Kairui Song
                   ` (8 subsequent siblings)
  24 siblings, 3 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Move get_swap_device into swapin_readahead to simplify the code and
prepare for follow-up commits.

For the later part of do_swap_page, using swp_swap_info directly is fine
since in that context the swap device is pinned by the swapcache
reference.
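
A minimal sketch (illustrative only) of the error contract this creates
for callers: NULL still means the swapin failed (e.g. allocation
failure), while an error pointer now signals a racing swapoff, roughly:

	page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
				vmf, &cache_result);
	if (IS_ERR(page))
		/* get_swap_device() failed: swapoff is in progress */
		goto out;
	if (page)
		folio = page_folio(page);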

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 16 ++++------------
 mm/swap_state.c |  8 ++++++--
 mm/swapfile.c   |  4 +++-
 3 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 22af9f3e8c75..e399b37ef395 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3789,7 +3789,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct folio *swapcache = NULL, *folio = NULL;
 	enum swap_cache_result cache_result;
 	struct page *page;
-	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
 	bool exclusive = false;
 	swp_entry_t entry;
@@ -3845,14 +3844,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out;
 	}
 
-	/* Prevent swapoff from happening to us. */
-	si = get_swap_device(entry);
-	if (unlikely(!si))
-		goto out;
-
 	page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 				vmf, &cache_result);
-	if (page) {
+	if (PTR_ERR(page) == -EBUSY) {
+		goto out;
+	} else if (page) {
 		folio = page_folio(page);
 		if (cache_result != SWAP_CACHE_HIT) {
 			/* Had to read the page from swap area: Major fault */
@@ -3964,7 +3960,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			 */
 			exclusive = true;
 		} else if (exclusive && folio_test_writeback(folio) &&
-			  data_race(si->flags & SWP_STABLE_WRITES)) {
+			   (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
 			/*
 			 * This is tricky: not all swap backends support
 			 * concurrent page modifications while under writeback.
@@ -4068,8 +4064,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
-	if (si)
-		put_swap_device(si);
 	return ret;
 out_nomap:
 	if (vmf->pte)
@@ -4082,8 +4076,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_unlock(swapcache);
 		folio_put(swapcache);
 	}
-	if (si)
-		put_swap_device(si);
 	return ret;
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 51de2a0412df..ff8a166603d0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -922,6 +922,11 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	struct page *page;
 	pgoff_t ilx;
 
+	/* Prevent swapoff from happening to us */
+	si = get_swap_device(entry);
+	if (unlikely(!si))
+		return ERR_PTR(-EBUSY);
+
 	folio = swap_cache_get_folio(entry, vmf, &shadow);
 	if (folio) {
 		page = folio_file_page(folio, swp_offset(entry));
@@ -929,7 +934,6 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		goto done;
 	}
 
-	si = swp_swap_info(entry);
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 	if (swap_use_no_readahead(si, swp_offset(entry))) {
 		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
@@ -944,8 +948,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		cache_result = SWAP_CACHE_MISS;
 	}
 	mpol_cond_put(mpol);
-
 done:
+	put_swap_device(si);
 	if (result)
 		*result = cache_result;
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b6d57fff5e21..925ad92486a4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1857,7 +1857,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		pte = NULL;
 		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 					&vmf, NULL);
-		if (page)
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+		else if (page)
 			folio = page_folio(page);
 		if (!folio) {
 			/*
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 17/24] mm/swap: fix false error when swapoff race with swapin
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (15 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-19 19:47 ` [PATCH 18/24] mm/swap: introduce a helper non fault swapin Kairui Song
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

When swapoff races with swapin, get_swap_device may fail and cause
swapin_readahead to return -EBUSY. In such a case, check whether the page
was already swapped in by the swapoff path.
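
For reference, a simplified sketch (not the exact hunk below) of the
recheck this adds: retake the pte lock and see whether the entry was
already resolved for us:

	if (IS_ERR_OR_NULL(page)) {
		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
					       vmf->address, &vmf->ptl);
		if (vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))
			/* pte unchanged: genuine OOM, or retry on -EBUSY */
			ret = page ? VM_FAULT_RETRY : VM_FAULT_OOM;
		/* otherwise swapoff already faulted the page in */
		goto unlock;
	}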

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e399b37ef395..620fa87557fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3846,9 +3846,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 				vmf, &cache_result);
-	if (PTR_ERR(page) == -EBUSY) {
-		goto out;
-	} else if (page) {
+	if (IS_ERR_OR_NULL(page)) {
+		/*
+		 * Back out if somebody else faulted in this pte
+		 * while we released the pte lock.
+		 */
+		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+				vmf->address, &vmf->ptl);
+		if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+			if (!page)
+				ret = VM_FAULT_OOM;
+			else
+				ret = VM_FAULT_RETRY;
+		}
+		goto unlock;
+	} else {
 		folio = page_folio(page);
 		if (cache_result != SWAP_CACHE_HIT) {
 			/* Had to read the page from swap area: Major fault */
@@ -3866,17 +3878,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			ret = VM_FAULT_HWPOISON;
 			goto out_release;
 		}
-	} else {
-		/*
-		 * Back out if somebody else faulted in this pte
-		 * while we released the pte lock.
-		 */
-		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-				vmf->address, &vmf->ptl);
-		if (likely(vmf->pte &&
-			   pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
-			ret = VM_FAULT_OOM;
-		goto unlock;
 	}
 
 	ret |= folio_lock_or_retry(folio, vmf);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 18/24] mm/swap: introduce a helper non fault swapin
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (16 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 17/24] mm/swap: fix false error when swapoff race with swapin Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-20  1:07   ` kernel test robot
  2023-11-22  4:40   ` Chris Li
  2023-11-19 19:47 ` [PATCH 19/24] shmem, swap: refactor error check on OOM or race Kairui Song
                   ` (6 subsequent siblings)
  24 siblings, 2 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

There are two places where swapin is not directly caused by a page
fault: shmem swapin is invoked through the shmem mapping, and swapoff
causes swapin by walking the page table. Both used to construct a pseudo
vm_fault struct for the swapin function.

Shmem has dropped the pseudo vm_fault recently in commit ddc1a5cbc05d
("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). The
swapoff path is still using a pseudo vm_fault.

Introduce a helper for them both; this helps save stack usage for the
swapoff path, and helps apply a unified swapin cache and readahead
policy check.

Also prepare for follow-up commits.
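
An illustrative sketch (not part of the diff) of how a non-fault caller
is expected to use the new helper, assuming the signature added to
mm/swap.h below:

	struct mempolicy *mpol;
	enum swap_cache_result result;
	struct page *page;
	pgoff_t ilx;

	mpol = get_vma_policy(vma, addr, 0, &ilx);
	page = swapin_page_non_fault(entry, GFP_HIGHUSER_MOVABLE,
				     mpol, ilx, vma->vm_mm, &result);
	mpol_cond_put(mpol);
	if (IS_ERR(page))
		/* raced with swapoff */
		return PTR_ERR(page);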

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c      | 51 ++++++++++++++++---------------------------------
 mm/swap.h       | 11 +++++++++++
 mm/swap_state.c | 38 ++++++++++++++++++++++++++++++++++++
 mm/swapfile.c   | 23 +++++++++++-----------
 4 files changed, 76 insertions(+), 47 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index f9ce4067c742..81d129aa66d1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1565,22 +1565,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
 static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
 			pgoff_t index, unsigned int order, pgoff_t *ilx);
 
-static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
-			struct shmem_inode_info *info, pgoff_t index)
-{
-	struct mempolicy *mpol;
-	pgoff_t ilx;
-	struct page *page;
-
-	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
-	page = swap_cluster_readahead(swap, gfp, mpol, ilx);
-	mpol_cond_put(mpol);
-
-	if (!page)
-		return NULL;
-	return page_folio(page);
-}
-
 /*
  * Make sure huge_gfp is always more limited than limit_gfp.
  * Some of the flags set permissions, while others set limitations.
@@ -1854,9 +1838,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct swap_info_struct *si;
+	enum swap_cache_result result;
 	struct folio *folio = NULL;
+	struct mempolicy *mpol;
+	struct page *page;
 	swp_entry_t swap;
+	pgoff_t ilx;
 	int error;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
@@ -1866,34 +1853,30 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (is_poisoned_swp_entry(swap))
 		return -EIO;
 
-	si = get_swap_device(swap);
-	if (!si) {
+	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
+	page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
+	mpol_cond_put(mpol);
+
+	if (PTR_ERR(page) == -EBUSY) {
 		if (!shmem_confirm_swap(mapping, index, swap))
 			return -EEXIST;
 		else
 			return -EINVAL;
-	}
-
-	/* Look it up and read it in.. */
-	folio = swap_cache_get_folio(swap, NULL, NULL);
-	if (!folio) {
-		/* Or update major stats only when swapin succeeds?? */
-		if (fault_type) {
+	} else if (!page) {
+		error = -ENOMEM;
+		goto failed;
+	} else {
+		folio = page_folio(page);
+		if (fault_type && result != SWAP_CACHE_HIT) {
 			*fault_type |= VM_FAULT_MAJOR;
 			count_vm_event(PGMAJFAULT);
 			count_memcg_event_mm(fault_mm, PGMAJFAULT);
 		}
-		/* Here we actually start the io */
-		folio = shmem_swapin_cluster(swap, gfp, info, index);
-		if (!folio) {
-			error = -ENOMEM;
-			goto failed;
-		}
 	}
 
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
-	if (!folio_test_swapcache(folio) ||
+	if ((result != SWAP_CACHE_BYPASS && !folio_test_swapcache(folio)) ||
 	    folio->swap.val != swap.val ||
 	    !shmem_confirm_swap(mapping, index, swap)) {
 		error = -EEXIST;
@@ -1930,7 +1913,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	delete_from_swap_cache(folio);
 	folio_mark_dirty(folio);
 	swap_free(swap);
-	put_swap_device(si);
 
 	*foliop = folio;
 	return 0;
@@ -1944,7 +1926,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		folio_unlock(folio);
 		folio_put(folio);
 	}
-	put_swap_device(si);
 
 	return error;
 }
diff --git a/mm/swap.h b/mm/swap.h
index da9deb5ba37d..b073c29c9790 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -62,6 +62,10 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 				    struct mempolicy *mpol, pgoff_t ilx);
 struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
 			      struct vm_fault *vmf, enum swap_cache_result *result);
+struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
+				   struct mempolicy *mpol, pgoff_t ilx,
+				   struct mm_struct *mm,
+				   enum swap_cache_result *result);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -103,6 +107,13 @@ static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 	return NULL;
 }
 
+static inline struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
+		struct mempolicy *mpol, pgoff_t ilx, struct mm_struct *mm,
+		enum swap_cache_result *result)
+{
+	return NULL;
+}
+
 static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
 {
 	return 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ff8a166603d0..eef66757c615 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -956,6 +956,44 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	return page;
 }
 
+struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
+				   struct mempolicy *mpol, pgoff_t ilx,
+				   struct mm_struct *mm, enum swap_cache_result *result)
+{
+	enum swap_cache_result cache_result;
+	struct swap_info_struct *si;
+	void *shadow = NULL;
+	struct folio *folio;
+	struct page *page;
+
+	/* Prevent swapoff from happening to us */
+	si = get_swap_device(entry);
+	if (unlikely(!si))
+		return ERR_PTR(-EBUSY);
+
+	folio = swap_cache_get_folio(entry, NULL, &shadow);
+	if (folio) {
+		page = folio_file_page(folio, swp_offset(entry));
+		cache_result = SWAP_CACHE_HIT;
+		goto done;
+	}
+
+	if (swap_use_no_readahead(si, swp_offset(entry))) {
+		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, mm);
+		if (shadow)
+			workingset_refault(page_folio(page), shadow);
+		cache_result = SWAP_CACHE_BYPASS;
+	} else {
+		page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+		cache_result = SWAP_CACHE_MISS;
+	}
+done:
+	put_swap_device(si);
+	if (result)
+		*result = cache_result;
+	return page;
+}
+
 #ifdef CONFIG_SYSFS
 static ssize_t vma_ra_enabled_show(struct kobject *kobj,
 				     struct kobj_attribute *attr, char *buf)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 925ad92486a4..f8c5096fe0f0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1822,20 +1822,15 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 	si = swap_info[type];
 	do {
+		int ret;
+		pte_t ptent;
+		pgoff_t ilx;
+		swp_entry_t entry;
 		struct page *page;
 		unsigned long offset;
+		struct mempolicy *mpol;
 		unsigned char swp_count;
 		struct folio *folio = NULL;
-		swp_entry_t entry;
-		int ret;
-		pte_t ptent;
-
-		struct vm_fault vmf = {
-			.vma = vma,
-			.address = addr,
-			.real_address = addr,
-			.pmd = pmd,
-		};
 
 		if (!pte++) {
 			pte = pte_offset_map(pmd, addr);
@@ -1855,8 +1850,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		offset = swp_offset(entry);
 		pte_unmap(pte);
 		pte = NULL;
-		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-					&vmf, NULL);
+
+		mpol = get_vma_policy(vma, addr, 0, &ilx);
+		page = swapin_page_non_fault(entry, GFP_HIGHUSER_MOVABLE,
+					     mpol, ilx, vma->vm_mm, NULL);
+		mpol_cond_put(mpol);
+
 		if (IS_ERR(page))
 			return PTR_ERR(page);
 		else if (page)
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 19/24] shmem, swap: refactor error check on OOM or race
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (17 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 18/24] mm/swap: introduce a helper non fault swapin Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-20  7:04   ` Chris Li
  2023-11-19 19:47 ` [PATCH 20/24] swap: simplify and make swap_find_cache static Kairui Song
                   ` (5 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Shmem should always check whether a swap entry has already been swapped
in when an error is returned. This fixes a potential false error issue.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 81d129aa66d1..6154b5b68b8f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1857,13 +1857,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
 	mpol_cond_put(mpol);
 
-	if (PTR_ERR(page) == -EBUSY) {
-		if (!shmem_confirm_swap(mapping, index, swap))
-			return -EEXIST;
+	if (IS_ERR_OR_NULL(page)) {
+		if (!page)
+			error = -ENOMEM;
 		else
-			return -EINVAL;
-	} else if (!page) {
-		error = -ENOMEM;
+			error = -EINVAL;
 		goto failed;
 	} else {
 		folio = page_folio(page);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 20/24] swap: simplify and make swap_find_cache static
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (18 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 19/24] shmem, swap: refactor error check on OOM or race Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-22  5:01   ` Chris Li
  2023-11-19 19:47 ` [PATCH 21/24] swap: make swapin_readahead result checking argument mandatory Kairui Song
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

There are only two callers left, both within the same file now, so make
it static and drop the redundant if check on the shadow variable.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h       | 2 --
 mm/swap_state.c | 5 ++---
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index b073c29c9790..4402970547e7 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -46,8 +46,6 @@ void __delete_from_swap_cache(struct folio *folio,
 void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
-struct folio *swap_cache_get_folio(swp_entry_t entry,
-				   struct vm_fault *vmf, void **shadowp);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
 		pgoff_t index);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index eef66757c615..6f39aa8394f1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -334,15 +334,14 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
  *
  * Caller must lock the swap device or hold a reference to keep it valid.
  */
-struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
+static struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
 {
 	bool vma_ra, readahead;
 	struct folio *folio;
 
 	folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
 	if (xa_is_value(folio)) {
-		if (shadowp)
-			*shadowp = folio;
+		*shadowp = folio;
 		return NULL;
 	}
 
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 21/24] swap: make swapin_readahead result checking argument mandatory
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (19 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 20/24] swap: simplify and make swap_find_cache static Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-22  5:15   ` Chris Li
  2023-11-19 19:47 ` [PATCH 22/24] swap: make swap_cluster_readahead static Kairui Song
                   ` (3 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

There is only one caller now, in the page fault path, so make the result
return argument mandatory.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6f39aa8394f1..0433a2586c6d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -913,7 +913,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			      struct vm_fault *vmf, enum swap_cache_result *result)
 {
-	enum swap_cache_result cache_result;
 	struct swap_info_struct *si;
 	struct mempolicy *mpol;
 	void *shadow = NULL;
@@ -928,29 +927,27 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 
 	folio = swap_cache_get_folio(entry, vmf, &shadow);
 	if (folio) {
+		*result = SWAP_CACHE_HIT;
 		page = folio_file_page(folio, swp_offset(entry));
-		cache_result = SWAP_CACHE_HIT;
 		goto done;
 	}
 
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 	if (swap_use_no_readahead(si, swp_offset(entry))) {
+		*result = SWAP_CACHE_BYPASS;
 		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
-		cache_result = SWAP_CACHE_BYPASS;
 		if (shadow)
 			workingset_refault(page_folio(page), shadow);
-	} else if (swap_use_vma_readahead(si)) {
-		page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
-		cache_result = SWAP_CACHE_MISS;
 	} else {
-		page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
-		cache_result = SWAP_CACHE_MISS;
+		*result = SWAP_CACHE_MISS;
+		if (swap_use_vma_readahead(si))
+			page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
+		else
+			page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
 	}
 	mpol_cond_put(mpol);
 done:
 	put_swap_device(si);
-	if (result)
-		*result = cache_result;
 
 	return page;
 }
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 22/24] swap: make swap_cluster_readahead static
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (20 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 21/24] swap: make swapin_readahead result checking argument mandatory Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-22  5:20   ` Chris Li
  2023-11-19 19:47 ` [PATCH 23/24] swap: fix multiple swap leak after cgroup migrate Kairui Song
                   ` (2 subsequent siblings)
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Now that there is no caller outside this file, make it static.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h       | 8 --------
 mm/swap_state.c | 4 ++--
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 4402970547e7..795a25df87da 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -56,8 +56,6 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				     struct mempolicy *mpol, pgoff_t ilx,
 				     bool *new_page_allocated);
-struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
-				    struct mempolicy *mpol, pgoff_t ilx);
 struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
 			      struct vm_fault *vmf, enum swap_cache_result *result);
 struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
@@ -93,12 +91,6 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-static inline struct page *swap_cluster_readahead(swp_entry_t entry,
-			gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx)
-{
-	return NULL;
-}
-
 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 			struct vm_fault *vmf, enum swap_cache_result *result)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0433a2586c6d..b377e55cb850 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -627,8 +627,8 @@ static unsigned long swapin_nr_pages(unsigned long offset)
  * are used for every page of the readahead: neighbouring pages on swap
  * are fairly likely to have been swapped out from the same node.
  */
-struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				    struct mempolicy *mpol, pgoff_t ilx)
+static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
+					   struct mempolicy *mpol, pgoff_t ilx)
 {
 	struct page *page;
 	unsigned long entry_offset = swp_offset(entry);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 23/24] swap: fix multiple swap leak after cgroup migrate
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (21 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 22/24] swap: make swap_cluster_readahead static Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-20  7:35   ` Huang, Ying
  2023-11-19 19:47 ` [PATCH 24/24] mm/swap: change swapin_readahead to swapin_page_fault Kairui Song
  2023-11-20 19:09 ` [PATCH 00/24] Swapin path refactor for optimization and bugfix Yosry Ahmed
  24 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

When a process which previously swapped out some memory is moved to
another cgroup, and the cgroup it was previously in is dead, swapped-in
pages will be leaked into the root cgroup. Previous commits fixed this
bug for the no-readahead path; this commit fixes the same issue for the
readahead path.

This can be easily reproduced by:
- Setup an SSD or HDD swap.
- Create memory cgroups A, B and C.
- Spawn process P1 in cgroup A and make it swap out some pages.
- Move process P1 to memory cgroup B.
- Destroy cgroup A.
- Do a swapoff in cgroup C.
- Swapped-in pages are accounted into cgroup C.

This patch fixes it, making the swapped-in pages accounted to cgroup B.
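
The core of the change, sketched (illustrative, simplified from the diff
below): pass the faulting mm down to __read_swap_cache_async so the
memcg charge is attributed using that mm instead of NULL:

	page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
				       vmf->vma->vm_mm, &page_allocated);

	/* inside __read_swap_cache_async(): */
	if (mem_cgroup_swapin_charge_folio(folio, mm, gfp_mask, entry))
		goto fail_put_folio;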

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h       |  2 +-
 mm/swap_state.c | 19 ++++++++++---------
 mm/zswap.c      |  2 +-
 3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 795a25df87da..4374bf11ca41 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -55,7 +55,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				   struct swap_iocb **plug);
 struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				     struct mempolicy *mpol, pgoff_t ilx,
-				     bool *new_page_allocated);
+				     struct mm_struct *mm, bool *new_page_allocated);
 struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
 			      struct vm_fault *vmf, enum swap_cache_result *result);
 struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b377e55cb850..362a6f674b36 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -416,7 +416,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 
 struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				     struct mempolicy *mpol, pgoff_t ilx,
-				     bool *new_page_allocated)
+				     struct mm_struct *mm, bool *new_page_allocated)
 {
 	struct swap_info_struct *si;
 	struct folio *folio;
@@ -462,7 +462,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 						mpol, ilx, numa_node_id());
 		if (!folio)
                         goto fail_put_swap;
-		if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
+		if (mem_cgroup_swapin_charge_folio(folio, mm, gfp_mask, entry))
 			goto fail_put_folio;
 
 		/*
@@ -540,7 +540,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 	mpol = get_vma_policy(vma, addr, 0, &ilx);
 	page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
-					&page_allocated);
+				       vma->vm_mm, &page_allocated);
 	mpol_cond_put(mpol);
 
 	if (page_allocated)
@@ -628,7 +628,8 @@ static unsigned long swapin_nr_pages(unsigned long offset)
  * are fairly likely to have been swapped out from the same node.
  */
 static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
-					   struct mempolicy *mpol, pgoff_t ilx)
+					   struct mempolicy *mpol, pgoff_t ilx,
+					   struct mm_struct *mm)
 {
 	struct page *page;
 	unsigned long entry_offset = swp_offset(entry);
@@ -657,7 +658,7 @@ static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		/* Ok, do the async read-ahead now */
 		page = __read_swap_cache_async(
 				swp_entry(swp_type(entry), offset),
-				gfp_mask, mpol, ilx, &page_allocated);
+				gfp_mask, mpol, ilx, mm, &page_allocated);
 		if (!page)
 			continue;
 		if (page_allocated) {
@@ -675,7 +676,7 @@ static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 skip:
 	/* The page was likely read above, so no need for plugging here */
 	page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
-					&page_allocated);
+				       mm, &page_allocated);
 	if (unlikely(page_allocated))
 		swap_readpage(page, false, NULL);
 	return page;
@@ -830,7 +831,7 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 		pte_unmap(pte);
 		pte = NULL;
 		page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
-						&page_allocated);
+					       vmf->vma->vm_mm, &page_allocated);
 		if (!page)
 			continue;
 		if (page_allocated) {
@@ -850,7 +851,7 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 skip:
 	/* The page was likely read above, so no need for plugging here */
 	page = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
-					&page_allocated);
+				       vmf->vma->vm_mm, &page_allocated);
 	if (unlikely(page_allocated))
 		swap_readpage(page, false, NULL);
 	return page;
@@ -980,7 +981,7 @@ struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
 			workingset_refault(page_folio(page), shadow);
 		cache_result = SWAP_CACHE_BYPASS;
 	} else {
-		page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+		page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx, mm);
 		cache_result = SWAP_CACHE_MISS;
 	}
 done:
diff --git a/mm/zswap.c b/mm/zswap.c
index 030cc137138f..e2712ff169b1 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1081,7 +1081,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	/* try to allocate swap cache page */
 	mpol = get_task_policy(current);
 	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
-				NO_INTERLEAVE_INDEX, &page_was_allocated);
+				NO_INTERLEAVE_INDEX, NULL, &page_was_allocated);
 	if (!page) {
 		ret = -ENOMEM;
 		goto fail;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH 24/24] mm/swap: change swapin_readahead to swapin_page_fault
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (22 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 23/24] swap: fix multiple swap leak after cgroup migrate Kairui Song
@ 2023-11-19 19:47 ` Kairui Song
  2023-11-20 19:09 ` [PATCH 00/24] Swapin path refactor for optimization and bugfix Yosry Ahmed
  24 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-19 19:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Huang, Ying, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Kairui Song

From: Kairui Song <kasong@tencent.com>

Now swapin_readahead is only called from the direct page fault path, so
rename it and drop the gfp argument, since the only caller always uses
the same flag for userspace page faults.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     |  4 ++--
 mm/swap.h       |  6 +++---
 mm/swap_state.c | 15 +++++++++------
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 620fa87557fd..4907a5b1b75b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3844,8 +3844,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out;
 	}
 
-	page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-				vmf, &cache_result);
+	page = swapin_page_fault(entry, GFP_HIGHUSER_MOVABLE,
+				 vmf, &cache_result);
 	if (IS_ERR_OR_NULL(page)) {
 		/*
 		 * Back out if somebody else faulted in this pte
diff --git a/mm/swap.h b/mm/swap.h
index 4374bf11ca41..2f8f8befff89 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -56,8 +56,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				     struct mempolicy *mpol, pgoff_t ilx,
 				     struct mm_struct *mm, bool *new_page_allocated);
-struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
-			      struct vm_fault *vmf, enum swap_cache_result *result);
+struct page *swapin_page_fault(swp_entry_t entry, gfp_t flag,
+			       struct vm_fault *vmf, enum swap_cache_result *result);
 struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
 				   struct mempolicy *mpol, pgoff_t ilx,
 				   struct mm_struct *mm,
@@ -91,7 +91,7 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
+static inline struct page *swapin_page_fault(swp_entry_t swp, gfp_t gfp_mask,
 			struct vm_fault *vmf, enum swap_cache_result *result)
 {
 	return NULL;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 362a6f674b36..2f51d2e64e59 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -899,7 +899,7 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
 }
 
 /**
- * swapin_readahead - swap in pages in hope we need them soon
+ * swapin_page_fault - swap in a page from page fault context
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @vmf: fault information
@@ -911,8 +911,8 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
  * it will read ahead blocks by cluster-based(ie, physical disk based)
  * or vma-based(ie, virtual address based on faulty address) readahead.
  */
-struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-			      struct vm_fault *vmf, enum swap_cache_result *result)
+struct page *swapin_page_fault(swp_entry_t entry, gfp_t gfp_mask,
+			       struct vm_fault *vmf, enum swap_cache_result *result)
 {
 	struct swap_info_struct *si;
 	struct mempolicy *mpol;
@@ -936,15 +936,18 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 	if (swap_use_no_readahead(si, swp_offset(entry))) {
 		*result = SWAP_CACHE_BYPASS;
-		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
+		page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
+					   mpol, ilx, vmf->vma->vm_mm);
 		if (shadow)
 			workingset_refault(page_folio(page), shadow);
 	} else {
 		*result = SWAP_CACHE_MISS;
 		if (swap_use_vma_readahead(si))
-			page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
+			page = swap_vma_readahead(entry, GFP_HIGHUSER_MOVABLE,
+						  mpol, ilx, vmf);
 		else
-			page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+			page = swap_cluster_readahead(entry, GFP_HIGHUSER_MOVABLE,
+						      mpol, ilx, vmf->vma->vm_mm);
 	}
 	mpol_cond_put(mpol);
 done:
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH 01/24] mm/swap: fix a potential undefined behavior issue
  2023-11-19 19:47 ` [PATCH 01/24] mm/swap: fix a potential undefined behavior issue Kairui Song
@ 2023-11-19 20:55   ` Matthew Wilcox
  2023-11-20  3:35     ` Chris Li
  0 siblings, 1 reply; 93+ messages in thread
From: Matthew Wilcox @ 2023-11-19 20:55 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Michal Hocko, linux-kernel

On Mon, Nov 20, 2023 at 03:47:17AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> When folio is NULL, taking the address of its struct member is an
> undefined behavior, the UB is caused by applying -> operator
> to a pointer not pointing to any object. Although in practice this
> won't lead to a real issue, still better to fix it, also makes the
> code less error-prone, when folio is NULL, page is also NULL,
> instead of a meaningless offset value.

Um, &folio->page is NULL if folio is NULL.  The offset of 'page' within
'folio' is 0.  By definition; and this will never change.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper
  2023-11-19 19:47 ` [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper Kairui Song
@ 2023-11-19 21:00   ` Matthew Wilcox
  2023-11-20 11:14     ` Kairui Song
  2023-11-20 14:55   ` Dan Carpenter
  2023-11-21  5:34   ` Chris Li
  2 siblings, 1 reply; 93+ messages in thread
From: Matthew Wilcox @ 2023-11-19 21:00 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Michal Hocko, linux-kernel

On Mon, Nov 20, 2023 at 03:47:19AM +0800, Kairui Song wrote:
> +			/* skip swapcache and readahead */
> +			page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +						vmf);
> +			if (page)
> +				folio = page_folio(page);

I think this should rather be:

			folio = swapin_no_readahead(entry,
					GFP_HIGHUSER_MOVABLE, vma);
			page = &folio->page;

and have swapin_no_readahead() work entirely in terms of folios.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path
  2023-11-19 19:47 ` [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path Kairui Song
@ 2023-11-19 21:12   ` Matthew Wilcox
  2023-11-20 11:14     ` Kairui Song
  2023-11-21 17:25   ` Chris Li
  2023-11-22  0:36   ` Huang, Ying
  2 siblings, 1 reply; 93+ messages in thread
From: Matthew Wilcox @ 2023-11-19 21:12 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Michal Hocko, linux-kernel

On Mon, Nov 20, 2023 at 03:47:32AM +0800, Kairui Song wrote:
>  	page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
>  				vmf, &cache_result);
> -	if (page) {
> +	if (PTR_ERR(page) == -EBUSY) {

	if (page == ERR_PTR(-EBUSY)) {


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 06/24] swap: rework swapin_no_readahead arguments
  2023-11-19 19:47 ` [PATCH 06/24] swap: rework swapin_no_readahead arguments Kairui Song
@ 2023-11-20  0:20   ` kernel test robot
  2023-11-21  6:44   ` Chris Li
  1 sibling, 0 replies; 93+ messages in thread
From: kernel test robot @ 2023-11-20  0:20 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: llvm, oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Huang, Ying, David Hildenbrand, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, linux-kernel, Kairui Song

Hi Kairui,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.7-rc2 next-20231117]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-fix-a-potential-undefined-behavior-issue/20231120-035926
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20231119194740.94101-7-ryncsn%40gmail.com
patch subject: [PATCH 06/24] swap: rework swapin_no_readahead arguments
config: i386-buildonly-randconfig-003-20231120 (https://download.01.org/0day-ci/archive/20231120/202311200826.8Nl5w3h8-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231120/202311200826.8Nl5w3h8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311200826.8Nl5w3h8-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/swap_state.c:872: warning: Function parameter or member 'mpol' not described in 'swapin_no_readahead'
>> mm/swap_state.c:872: warning: Function parameter or member 'ilx' not described in 'swapin_no_readahead'
>> mm/swap_state.c:872: warning: Function parameter or member 'mm' not described in 'swapin_no_readahead'
>> mm/swap_state.c:872: warning: Excess function parameter 'vmf' description in 'swapin_no_readahead'


vim +872 mm/swap_state.c

d9bfcfdc41e8e5 Huang Ying  2017-09-06  859  
19f582d2684e47 Kairui Song 2023-11-20  860  /**
19f582d2684e47 Kairui Song 2023-11-20  861   * swapin_no_readahead - swap in pages skipping swap cache and readahead
19f582d2684e47 Kairui Song 2023-11-20  862   * @entry: swap entry of this memory
19f582d2684e47 Kairui Song 2023-11-20  863   * @gfp_mask: memory allocation flags
19f582d2684e47 Kairui Song 2023-11-20  864   * @vmf: fault information
19f582d2684e47 Kairui Song 2023-11-20  865   *
19f582d2684e47 Kairui Song 2023-11-20  866   * Returns the struct page for entry and addr after the swap entry is read
19f582d2684e47 Kairui Song 2023-11-20  867   * in.
19f582d2684e47 Kairui Song 2023-11-20  868   */
598f2616cde014 Kairui Song 2023-11-20  869  static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
2538a5e96fe62f Kairui Song 2023-11-20  870  					struct mempolicy *mpol, pgoff_t ilx,
2538a5e96fe62f Kairui Song 2023-11-20  871  					struct mm_struct *mm)
19f582d2684e47 Kairui Song 2023-11-20 @872  {
19f582d2684e47 Kairui Song 2023-11-20  873  	struct folio *folio;
2538a5e96fe62f Kairui Song 2023-11-20  874  	struct page *page;
19f582d2684e47 Kairui Song 2023-11-20  875  	void *shadow = NULL;
19f582d2684e47 Kairui Song 2023-11-20  876  
2538a5e96fe62f Kairui Song 2023-11-20  877  	page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
2538a5e96fe62f Kairui Song 2023-11-20  878  	folio = (struct folio *)page;
19f582d2684e47 Kairui Song 2023-11-20  879  	if (folio) {
2538a5e96fe62f Kairui Song 2023-11-20  880  		if (mem_cgroup_swapin_charge_folio(folio, mm,
c2ac0dcbf9ab6a Kairui Song 2023-11-20  881  						   GFP_KERNEL, entry)) {
19f582d2684e47 Kairui Song 2023-11-20  882  			folio_put(folio);
19f582d2684e47 Kairui Song 2023-11-20  883  			return NULL;
19f582d2684e47 Kairui Song 2023-11-20  884  		}
c2ac0dcbf9ab6a Kairui Song 2023-11-20  885  
c2ac0dcbf9ab6a Kairui Song 2023-11-20  886  		__folio_set_locked(folio);
c2ac0dcbf9ab6a Kairui Song 2023-11-20  887  		__folio_set_swapbacked(folio);
c2ac0dcbf9ab6a Kairui Song 2023-11-20  888  
19f582d2684e47 Kairui Song 2023-11-20  889  		mem_cgroup_swapin_uncharge_swap(entry);
19f582d2684e47 Kairui Song 2023-11-20  890  
19f582d2684e47 Kairui Song 2023-11-20  891  		shadow = get_shadow_from_swap_cache(entry);
19f582d2684e47 Kairui Song 2023-11-20  892  		if (shadow)
19f582d2684e47 Kairui Song 2023-11-20  893  			workingset_refault(folio, shadow);
19f582d2684e47 Kairui Song 2023-11-20  894  
19f582d2684e47 Kairui Song 2023-11-20  895  		folio_add_lru(folio);
19f582d2684e47 Kairui Song 2023-11-20  896  
19f582d2684e47 Kairui Song 2023-11-20  897  		/* To provide entry to swap_readpage() */
19f582d2684e47 Kairui Song 2023-11-20  898  		folio->swap = entry;
19f582d2684e47 Kairui Song 2023-11-20  899  		swap_readpage(page, true, NULL);
19f582d2684e47 Kairui Song 2023-11-20  900  		folio->private = NULL;
19f582d2684e47 Kairui Song 2023-11-20  901  	}
19f582d2684e47 Kairui Song 2023-11-20  902  
19f582d2684e47 Kairui Song 2023-11-20  903  	return page;
19f582d2684e47 Kairui Song 2023-11-20  904  }
19f582d2684e47 Kairui Song 2023-11-20  905  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead
  2023-11-19 19:47 ` [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead Kairui Song
@ 2023-11-20  0:47   ` kernel test robot
  2023-11-21 16:06   ` Chris Li
  1 sibling, 0 replies; 93+ messages in thread
From: kernel test robot @ 2023-11-20  0:47 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Huang, Ying, David Hildenbrand, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, linux-kernel, Kairui Song

Hi Kairui,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.7-rc2 next-20231117]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-fix-a-potential-undefined-behavior-issue/20231120-035926
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20231119194740.94101-12-ryncsn%40gmail.com
patch subject: [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20231120/202311200813.x056cCRJ-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231120/202311200813.x056cCRJ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311200813.x056cCRJ-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from mm/filemap.c:62:
>> mm/swap.h:101:52: warning: 'enum swap_cache_result' declared inside parameter list will not be visible outside of this definition or declaration
     101 |                         struct vm_fault *vmf, enum swap_cache_result *result)
         |                                                    ^~~~~~~~~~~~~~~~~
--
   In file included from mm/memory.c:93:
>> mm/swap.h:101:52: warning: 'enum swap_cache_result' declared inside parameter list will not be visible outside of this definition or declaration
     101 |                         struct vm_fault *vmf, enum swap_cache_result *result)
         |                                                    ^~~~~~~~~~~~~~~~~
   mm/memory.c: In function 'do_swap_page':
>> mm/memory.c:3790:32: error: storage size of 'cache_result' isn't known
    3790 |         enum swap_cache_result cache_result;
         |                                ^~~~~~~~~~~~
>> mm/memory.c:3857:37: error: 'SWAP_CACHE_HIT' undeclared (first use in this function); did you mean 'DQST_CACHE_HITS'?
    3857 |                 if (cache_result != SWAP_CACHE_HIT) {
         |                                     ^~~~~~~~~~~~~~
         |                                     DQST_CACHE_HITS
   mm/memory.c:3857:37: note: each undeclared identifier is reported only once for each function it appears in
>> mm/memory.c:3863:37: error: 'SWAP_CACHE_BYPASS' undeclared (first use in this function); did you mean 'SMP_CACHE_BYTES'?
    3863 |                 if (cache_result != SWAP_CACHE_BYPASS)
         |                                     ^~~~~~~~~~~~~~~~~
         |                                     SMP_CACHE_BYTES
>> mm/memory.c:3790:32: warning: unused variable 'cache_result' [-Wunused-variable]
    3790 |         enum swap_cache_result cache_result;
         |                                ^~~~~~~~~~~~


vim +3790 mm/memory.c

  3777	
  3778	/*
  3779	 * We enter with non-exclusive mmap_lock (to exclude vma changes,
  3780	 * but allow concurrent faults), and pte mapped but not yet locked.
  3781	 * We return with pte unmapped and unlocked.
  3782	 *
  3783	 * We return with the mmap_lock locked or unlocked in the same cases
  3784	 * as does filemap_fault().
  3785	 */
  3786	vm_fault_t do_swap_page(struct vm_fault *vmf)
  3787	{
  3788		struct vm_area_struct *vma = vmf->vma;
  3789		struct folio *swapcache = NULL, *folio = NULL;
> 3790		enum swap_cache_result cache_result;
  3791		struct page *page;
  3792		struct swap_info_struct *si = NULL;
  3793		rmap_t rmap_flags = RMAP_NONE;
  3794		bool exclusive = false;
  3795		swp_entry_t entry;
  3796		pte_t pte;
  3797		vm_fault_t ret = 0;
  3798	
  3799		if (!pte_unmap_same(vmf))
  3800			goto out;
  3801	
  3802		entry = pte_to_swp_entry(vmf->orig_pte);
  3803		if (unlikely(non_swap_entry(entry))) {
  3804			if (is_migration_entry(entry)) {
  3805				migration_entry_wait(vma->vm_mm, vmf->pmd,
  3806						     vmf->address);
  3807			} else if (is_device_exclusive_entry(entry)) {
  3808				vmf->page = pfn_swap_entry_to_page(entry);
  3809				ret = remove_device_exclusive_entry(vmf);
  3810			} else if (is_device_private_entry(entry)) {
  3811				if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
  3812					/*
  3813					 * migrate_to_ram is not yet ready to operate
  3814					 * under VMA lock.
  3815					 */
  3816					vma_end_read(vma);
  3817					ret = VM_FAULT_RETRY;
  3818					goto out;
  3819				}
  3820	
  3821				vmf->page = pfn_swap_entry_to_page(entry);
  3822				vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  3823						vmf->address, &vmf->ptl);
  3824				if (unlikely(!vmf->pte ||
  3825					     !pte_same(ptep_get(vmf->pte),
  3826								vmf->orig_pte)))
  3827					goto unlock;
  3828	
  3829				/*
  3830				 * Get a page reference while we know the page can't be
  3831				 * freed.
  3832				 */
  3833				get_page(vmf->page);
  3834				pte_unmap_unlock(vmf->pte, vmf->ptl);
  3835				ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
  3836				put_page(vmf->page);
  3837			} else if (is_hwpoison_entry(entry)) {
  3838				ret = VM_FAULT_HWPOISON;
  3839			} else if (is_pte_marker_entry(entry)) {
  3840				ret = handle_pte_marker(vmf);
  3841			} else {
  3842				print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
  3843				ret = VM_FAULT_SIGBUS;
  3844			}
  3845			goto out;
  3846		}
  3847	
  3848		/* Prevent swapoff from happening to us. */
  3849		si = get_swap_device(entry);
  3850		if (unlikely(!si))
  3851			goto out;
  3852	
  3853		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
  3854					vmf, &cache_result);
  3855		if (page) {
  3856			folio = page_folio(page);
> 3857			if (cache_result != SWAP_CACHE_HIT) {
  3858				/* Had to read the page from swap area: Major fault */
  3859				ret = VM_FAULT_MAJOR;
  3860				count_vm_event(PGMAJFAULT);
  3861				count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
  3862			}
> 3863			if (cache_result != SWAP_CACHE_BYPASS)
  3864				swapcache = folio;
  3865			if (PageHWPoison(page)) {
  3866				/*
  3867				 * hwpoisoned dirty swapcache pages are kept for killing
  3868				 * owner processes (which may be unknown at hwpoison time)
  3869				 */
  3870				ret = VM_FAULT_HWPOISON;
  3871				goto out_release;
  3872			}
  3873		} else {
  3874			/*
  3875			 * Back out if somebody else faulted in this pte
  3876			 * while we released the pte lock.
  3877			 */
  3878			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  3879					vmf->address, &vmf->ptl);
  3880			if (likely(vmf->pte &&
  3881				   pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
  3882				ret = VM_FAULT_OOM;
  3883			goto unlock;
  3884		}
  3885	
  3886		ret |= folio_lock_or_retry(folio, vmf);
  3887		if (ret & VM_FAULT_RETRY)
  3888			goto out_release;
  3889	
  3890		if (swapcache) {
  3891			/*
  3892			 * Make sure folio_free_swap() or swapoff did not release the
  3893			 * swapcache from under us.  The page pin, and pte_same test
  3894			 * below, are not enough to exclude that.  Even if it is still
  3895			 * swapcache, we need to check that the page's swap has not
  3896			 * changed.
  3897			 */
  3898			if (unlikely(!folio_test_swapcache(folio) ||
  3899				     page_swap_entry(page).val != entry.val))
  3900				goto out_page;
  3901	
  3902			/*
  3903			 * KSM sometimes has to copy on read faults, for example, if
  3904			 * page->index of !PageKSM() pages would be nonlinear inside the
  3905			 * anon VMA -- PageKSM() is lost on actual swapout.
  3906			 */
  3907			page = ksm_might_need_to_copy(page, vma, vmf->address);
  3908			if (unlikely(!page)) {
  3909				ret = VM_FAULT_OOM;
  3910				goto out_page;
  3911			} else if (unlikely(PTR_ERR(page) == -EHWPOISON)) {
  3912				ret = VM_FAULT_HWPOISON;
  3913				goto out_page;
  3914			}
  3915			folio = page_folio(page);
  3916	
  3917			/*
  3918			 * If we want to map a page that's in the swapcache writable, we
  3919			 * have to detect via the refcount if we're really the exclusive
  3920			 * owner. Try removing the extra reference from the local LRU
  3921			 * caches if required.
  3922			 */
  3923			if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
  3924			    !folio_test_ksm(folio) && !folio_test_lru(folio))
  3925				lru_add_drain();
  3926		}
  3927	
  3928		folio_throttle_swaprate(folio, GFP_KERNEL);
  3929	
  3930		/*
  3931		 * Back out if somebody else already faulted in this pte.
  3932		 */
  3933		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
  3934				&vmf->ptl);
  3935		if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
  3936			goto out_nomap;
  3937	
  3938		if (unlikely(!folio_test_uptodate(folio))) {
  3939			ret = VM_FAULT_SIGBUS;
  3940			goto out_nomap;
  3941		}
  3942	
  3943		/*
  3944		 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
  3945		 * must never point at an anonymous page in the swapcache that is
  3946		 * PG_anon_exclusive. Sanity check that this holds and especially, that
  3947		 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
  3948		 * check after taking the PT lock and making sure that nobody
  3949		 * concurrently faulted in this page and set PG_anon_exclusive.
  3950		 */
  3951		BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
  3952		BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
  3953	
  3954		/*
  3955		 * Check under PT lock (to protect against concurrent fork() sharing
  3956		 * the swap entry concurrently) for certainly exclusive pages.
  3957		 */
  3958		if (!folio_test_ksm(folio)) {
  3959			exclusive = pte_swp_exclusive(vmf->orig_pte);
  3960			if (folio != swapcache) {
  3961				/*
  3962				 * We have a fresh page that is not exposed to the
  3963				 * swapcache -> certainly exclusive.
  3964				 */
  3965				exclusive = true;
  3966			} else if (exclusive && folio_test_writeback(folio) &&
  3967				  data_race(si->flags & SWP_STABLE_WRITES)) {
  3968				/*
  3969				 * This is tricky: not all swap backends support
  3970				 * concurrent page modifications while under writeback.
  3971				 *
  3972				 * So if we stumble over such a page in the swapcache
  3973				 * we must not set the page exclusive, otherwise we can
  3974				 * map it writable without further checks and modify it
  3975				 * while still under writeback.
  3976				 *
  3977				 * For these problematic swap backends, simply drop the
  3978				 * exclusive marker: this is perfectly fine as we start
  3979				 * writeback only if we fully unmapped the page and
  3980				 * there are no unexpected references on the page after
  3981				 * unmapping succeeded. After fully unmapped, no
  3982				 * further GUP references (FOLL_GET and FOLL_PIN) can
  3983				 * appear, so dropping the exclusive marker and mapping
  3984				 * it only R/O is fine.
  3985				 */
  3986				exclusive = false;
  3987			}
  3988		}
  3989	
  3990		/*
  3991		 * Some architectures may have to restore extra metadata to the page
  3992		 * when reading from swap. This metadata may be indexed by swap entry
  3993		 * so this must be called before swap_free().
  3994		 */
  3995		arch_swap_restore(entry, folio);
  3996	
  3997		/*
  3998		 * Remove the swap entry and conditionally try to free up the swapcache.
  3999		 * We're already holding a reference on the page but haven't mapped it
  4000		 * yet.
  4001		 */
  4002		swap_free(entry);
  4003		if (should_try_to_free_swap(folio, vma, vmf->flags))
  4004			folio_free_swap(folio);
  4005	
  4006		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
  4007		dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
  4008		pte = mk_pte(page, vma->vm_page_prot);
  4009	
  4010		/*
  4011		 * Same logic as in do_wp_page(); however, optimize for pages that are
  4012		 * certainly not shared either because we just allocated them without
  4013		 * exposing them to the swapcache or because the swap entry indicates
  4014		 * exclusivity.
  4015		 */
  4016		if (!folio_test_ksm(folio) &&
  4017		    (exclusive || folio_ref_count(folio) == 1)) {
  4018			if (vmf->flags & FAULT_FLAG_WRITE) {
  4019				pte = maybe_mkwrite(pte_mkdirty(pte), vma);
  4020				vmf->flags &= ~FAULT_FLAG_WRITE;
  4021			}
  4022			rmap_flags |= RMAP_EXCLUSIVE;
  4023		}
  4024		flush_icache_page(vma, page);
  4025		if (pte_swp_soft_dirty(vmf->orig_pte))
  4026			pte = pte_mksoft_dirty(pte);
  4027		if (pte_swp_uffd_wp(vmf->orig_pte))
  4028			pte = pte_mkuffd_wp(pte);
  4029		vmf->orig_pte = pte;
  4030	
  4031		/* ksm created a completely new copy */
  4032		if (unlikely(folio != swapcache && swapcache)) {
  4033			page_add_new_anon_rmap(page, vma, vmf->address);
  4034			folio_add_lru_vma(folio, vma);
  4035		} else {
  4036			page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
  4037		}
  4038	
  4039		VM_BUG_ON(!folio_test_anon(folio) ||
  4040				(pte_write(pte) && !PageAnonExclusive(page)));
  4041		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
  4042		arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
  4043	
  4044		folio_unlock(folio);
  4045		if (folio != swapcache && swapcache) {
  4046			/*
  4047			 * Hold the lock to avoid the swap entry to be reused
  4048			 * until we take the PT lock for the pte_same() check
  4049			 * (to avoid false positives from pte_same). For
  4050			 * further safety release the lock after the swap_free
  4051			 * so that the swap count won't change under a
  4052			 * parallel locked swapcache.
  4053			 */
  4054			folio_unlock(swapcache);
  4055			folio_put(swapcache);
  4056		}
  4057	
  4058		if (vmf->flags & FAULT_FLAG_WRITE) {
  4059			ret |= do_wp_page(vmf);
  4060			if (ret & VM_FAULT_ERROR)
  4061				ret &= VM_FAULT_ERROR;
  4062			goto out;
  4063		}
  4064	
  4065		/* No need to invalidate - it was non-present before */
  4066		update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
  4067	unlock:
  4068		if (vmf->pte)
  4069			pte_unmap_unlock(vmf->pte, vmf->ptl);
  4070	out:
  4071		if (si)
  4072			put_swap_device(si);
  4073		return ret;
  4074	out_nomap:
  4075		if (vmf->pte)
  4076			pte_unmap_unlock(vmf->pte, vmf->ptl);
  4077	out_page:
  4078		folio_unlock(folio);
  4079	out_release:
  4080		folio_put(folio);
  4081		if (folio != swapcache && swapcache) {
  4082			folio_unlock(swapcache);
  4083			folio_put(swapcache);
  4084		}
  4085		if (si)
  4086			put_swap_device(si);
  4087		return ret;
  4088	}
  4089	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 18/24] mm/swap: introduce a helper non fault swapin
  2023-11-19 19:47 ` [PATCH 18/24] mm/swap: introduce a helper non fault swapin Kairui Song
@ 2023-11-20  1:07   ` kernel test robot
  2023-11-22  4:40   ` Chris Li
  1 sibling, 0 replies; 93+ messages in thread
From: kernel test robot @ 2023-11-20  1:07 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: llvm, oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Huang, Ying, David Hildenbrand, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, linux-kernel, Kairui Song

Hi Kairui,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.7-rc2 next-20231117]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-fix-a-potential-undefined-behavior-issue/20231120-035926
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20231119194740.94101-19-ryncsn%40gmail.com
patch subject: [PATCH 18/24] mm/swap: introduce a helper non fault swapin
config: i386-buildonly-randconfig-002-20231120 (https://download.01.org/0day-ci/archive/20231120/202311200850.FrQj7bMD-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231120/202311200850.FrQj7bMD-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311200850.FrQj7bMD-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/shmem.c:43:
   mm/swap.h:105:31: warning: declaration of 'enum swap_cache_result' will not be visible outside of this function [-Wvisibility]
                           struct vm_fault *vmf, enum swap_cache_result *result)
                                                      ^
   mm/swap.h:112:8: warning: declaration of 'enum swap_cache_result' will not be visible outside of this function [-Wvisibility]
                   enum swap_cache_result *result)
                        ^
>> mm/shmem.c:1841:25: error: variable has incomplete type 'enum swap_cache_result'
           enum swap_cache_result result;
                                  ^
   mm/shmem.c:1841:7: note: forward declaration of 'enum swap_cache_result'
           enum swap_cache_result result;
                ^
>> mm/shmem.c:1870:31: error: use of undeclared identifier 'SWAP_CACHE_HIT'
                   if (fault_type && result != SWAP_CACHE_HIT) {
                                               ^
>> mm/shmem.c:1879:17: error: use of undeclared identifier 'SWAP_CACHE_BYPASS'
           if ((result != SWAP_CACHE_BYPASS && !folio_test_swapcache(folio)) ||
                          ^
   2 warnings and 3 errors generated.


vim +1841 mm/shmem.c

  1827	
  1828	/*
  1829	 * Swap in the folio pointed to by *foliop.
  1830	 * Caller has to make sure that *foliop contains a valid swapped folio.
  1831	 * Returns 0 and the folio in foliop if success. On failure, returns the
  1832	 * error code and NULL in *foliop.
  1833	 */
  1834	static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
  1835				     struct folio **foliop, enum sgp_type sgp,
  1836				     gfp_t gfp, struct mm_struct *fault_mm,
  1837				     vm_fault_t *fault_type)
  1838	{
  1839		struct address_space *mapping = inode->i_mapping;
  1840		struct shmem_inode_info *info = SHMEM_I(inode);
> 1841		enum swap_cache_result result;
  1842		struct folio *folio = NULL;
  1843		struct mempolicy *mpol;
  1844		struct page *page;
  1845		swp_entry_t swap;
  1846		pgoff_t ilx;
  1847		int error;
  1848	
  1849		VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
  1850		swap = radix_to_swp_entry(*foliop);
  1851		*foliop = NULL;
  1852	
  1853		if (is_poisoned_swp_entry(swap))
  1854			return -EIO;
  1855	
  1856		mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
  1857		page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
  1858		mpol_cond_put(mpol);
  1859	
  1860		if (PTR_ERR(page) == -EBUSY) {
  1861			if (!shmem_confirm_swap(mapping, index, swap))
  1862				return -EEXIST;
  1863			else
  1864				return -EINVAL;
  1865		} else if (!page) {
  1866			error = -ENOMEM;
  1867			goto failed;
  1868		} else {
  1869			folio = page_folio(page);
> 1870			if (fault_type && result != SWAP_CACHE_HIT) {
  1871				*fault_type |= VM_FAULT_MAJOR;
  1872				count_vm_event(PGMAJFAULT);
  1873				count_memcg_event_mm(fault_mm, PGMAJFAULT);
  1874			}
  1875		}
  1876	
  1877		/* We have to do this with folio locked to prevent races */
  1878		folio_lock(folio);
> 1879		if ((result != SWAP_CACHE_BYPASS && !folio_test_swapcache(folio)) ||
  1880		    folio->swap.val != swap.val ||
  1881		    !shmem_confirm_swap(mapping, index, swap)) {
  1882			error = -EEXIST;
  1883			goto unlock;
  1884		}
  1885		if (!folio_test_uptodate(folio)) {
  1886			error = -EIO;
  1887			goto failed;
  1888		}
  1889		folio_wait_writeback(folio);
  1890	
  1891		/*
  1892		 * Some architectures may have to restore extra metadata to the
  1893		 * folio after reading from swap.
  1894		 */
  1895		arch_swap_restore(swap, folio);
  1896	
  1897		if (shmem_should_replace_folio(folio, gfp)) {
  1898			error = shmem_replace_folio(&folio, gfp, info, index);
  1899			if (error)
  1900				goto failed;
  1901		}
  1902	
  1903		error = shmem_add_to_page_cache(folio, mapping, index,
  1904						swp_to_radix_entry(swap), gfp);
  1905		if (error)
  1906			goto failed;
  1907	
  1908		shmem_recalc_inode(inode, 0, -1);
  1909	
  1910		if (sgp == SGP_WRITE)
  1911			folio_mark_accessed(folio);
  1912	
  1913		delete_from_swap_cache(folio);
  1914		folio_mark_dirty(folio);
  1915		swap_free(swap);
  1916	
  1917		*foliop = folio;
  1918		return 0;
  1919	failed:
  1920		if (!shmem_confirm_swap(mapping, index, swap))
  1921			error = -EEXIST;
  1922		if (error == -EIO)
  1923			shmem_set_folio_swapin_error(inode, index, folio, swap);
  1924	unlock:
  1925		if (folio) {
  1926			folio_unlock(folio);
  1927			folio_put(folio);
  1928		}
  1929	
  1930		return error;
  1931	}
  1932	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 01/24] mm/swap: fix a potential undefined behavior issue
  2023-11-19 20:55   ` Matthew Wilcox
@ 2023-11-20  3:35     ` Chris Li
  2023-11-20 11:14       ` Kairui Song
  0 siblings, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-20  3:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kairui Song, linux-mm, Andrew Morton, Huang, Ying,
	David Hildenbrand, Hugh Dickins, Johannes Weiner, Michal Hocko,
	LKML

Hi Kairui,

On Sun, Nov 19, 2023 at 12:55 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Nov 20, 2023 at 03:47:17AM +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > When folio is NULL, taking the address of its struct member is an
> > undefined behavior, the UB is caused by applying -> operator

I think dereferencing a NULL pointer is undefined behavior, but there is
no dereference here. It is just pointer arithmetic on a NULL pointer:
adding the offset of page to the NULL pointer, which yields NULL.

> > won't lead to a real issue, still better to fix it, also makes the
> > code less error-prone, when folio is NULL, page is also NULL,
> > instead of a meanless offset value.

I consider your reasoning invalid. NULL pointer arithmetic should
be legal. This patch is not needed.
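
To make the distinction concrete, here is a tiny stand-alone sketch
(illustration only, with simplified stand-in structs, not code from the
patch):

struct page  { int flags; };
struct folio { struct page page; };

int main(void)
{
	struct folio *folio = NULL;

	/*
	 * Taking the address of a member: compilers lower this to
	 * (char *)folio + offsetof(struct folio, page), so nothing is
	 * loaded from memory and the result here is simply NULL. A
	 * strict reading of the C standard can still call the "->"
	 * access itself undefined, which is the point being argued.
	 */
	struct page *page = &folio->page;

	/* An actual dereference would read through NULL and crash: */
	/* struct page copy = folio->page; */

	return page != NULL;
}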

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check
  2023-11-19 19:47 ` [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check Kairui Song
@ 2023-11-20  4:17   ` Chris Li
  2023-11-20 11:15     ` Kairui Song
  0 siblings, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-20  4:17 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> When swapping in a page, mem_cgroup_swapin_charge_folio is called for new
> allocated folio, nothing else is referencing the folio so no need to set
> the lock bit. This avoided doing unlock check on error path.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap_state.c | 20 +++++++++-----------
>  1 file changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ac4fa404eaa7..45dd8b7c195d 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -458,6 +458,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,

You move the mem_cgroup_swapin_charge_folio() inside the for loop:


        for (;;) {
                int err;
                /*
                 * First check the swap cache.  Since this is normally
                 * called after swap_cache_get_folio() failed, re-calling
                 * that would confuse statistics.
                 */
                folio = filemap_get_folio(swap_address_space(entry),
                                                swp_offset(entry));


>                                                 mpol, ilx, numa_node_id());
>                 if (!folio)
>                          goto fail_put_swap;
> +               if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
> +                       goto fail_put_folio;

Wouldn't it cause repeat charging of the folio when it is racing
against others in the for loop?

>
>                 /*
>                  * Swap entry may have been freed since our caller observed it.
> @@ -483,13 +485,9 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>         /*
>          * The swap entry is ours to swap in. Prepare the new page.
>          */
> -
>         __folio_set_locked(folio);
>         __folio_set_swapbacked(folio);
>
> -       if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
> -               goto fail_unlock;
> -

The original code makes the charge outside of the for loop. Only the
winner can charge once.

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 08/24] mm/swap: check readahead policy per entry
  2023-11-19 19:47 ` [PATCH 08/24] mm/swap: check readahead policy per entry Kairui Song
@ 2023-11-20  6:04   ` Huang, Ying
  2023-11-20 11:17     ` Kairui Song
  2023-11-21  7:54   ` Chris Li
  1 sibling, 1 reply; 93+ messages in thread
From: Huang, Ying @ 2023-11-20  6:04 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Currently VMA readahead is globally disabled when any rotate disk is
> used as swap backend. So multiple swap devices are enabled, if a slower
> hard disk is set as a low priority fallback, and a high performance SSD
> is used and high priority swap device, vma readahead is disabled globally.
> The SSD swap device performance will drop by a lot.
>
> Check readahead policy per entry to avoid such problem.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap_state.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ff6756f2e8e4..fb78f7f18ed7 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
>  	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
>  }
>  
> -static inline bool swap_use_vma_readahead(void)
> +static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
>  {
> -	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> +	return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);
>  }
>  
>  /*
> @@ -341,7 +341,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
>  
>  	folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
>  	if (!IS_ERR(folio)) {
> -		bool vma_ra = swap_use_vma_readahead();
> +		bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
>  		bool readahead;
>  
>  		/*
> @@ -920,16 +920,18 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			      struct vm_fault *vmf, bool *swapcached)
>  {
> +	struct swap_info_struct *si;
>  	struct mempolicy *mpol;
>  	struct page *page;
>  	pgoff_t ilx;
>  	bool cached;
>  
> +	si = swp_swap_info(entry);
>  	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> -	if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> +	if (swap_use_no_readahead(si, entry)) {
>  		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
>  		cached = false;
> -	} else if (swap_use_vma_readahead()) {
> +	} else if (swap_use_vma_readahead(si)) {

It's possible that some pages are swapped out to SSD while others are
swapped out to HDD in a readahead window.

I suspect that there are practical requirements to use swap on SSD and
HDD at the same time.

>  		page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
>  		cached = true;
>  	} else {

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 19/24] shmem, swap: refactor error check on OOM or race
  2023-11-19 19:47 ` [PATCH 19/24] shmem, swap: refactor error check on OOM or race Kairui Song
@ 2023-11-20  7:04   ` Chris Li
  2023-11-20 11:17     ` Kairui Song
  0 siblings, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-20  7:04 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Hi Kairui,

On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> It should always check if a swap entry is already swaped in on error,
> fix potential false error issue.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/shmem.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 81d129aa66d1..6154b5b68b8f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1857,13 +1857,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>         page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
>         mpol_cond_put(mpol);
>
> -       if (PTR_ERR(page) == -EBUSY) {
> -               if (!shmem_confirm_swap(mapping, index, swap))
> -                       return -EEXIST;

Do you intentionally remove checking shmem_confirm_swap()?
I am not sure I am following.

> +       if (IS_ERR_OR_NULL(page)) {
> +               if (!page)
> +                       error = -ENOMEM;
>                 else
> -                       return -EINVAL;
> -       } else if (!page) {
> -               error = -ENOMEM;
> +                       error = -EINVAL;

The resulting code is a bit hard to read in the diff. In plain source it looks like:

        if (IS_ERR_OR_NULL(page)) {
                if (!page)
                        error = -ENOMEM;
                else
                        error = -EINVAL;
                goto failed;
        } else {
                folio = page_folio(page);
                if (fault_type && result != SWAP_CACHE_HIT) {
                        *fault_type |= VM_FAULT_MAJOR;
                        count_vm_event(PGMAJFAULT);
                        count_memcg_event_mm(fault_mm, PGMAJFAULT);
                }
        }

First of all, if the first branch always does "goto failed", the second else is not needed.
The whole else block can be flattened out by one level of indentation.

 if (IS_ERR_OR_NULL(page)) {
                if (!page)
                        error = -ENOMEM;
                else
                        error = -EINVAL;
                goto failed;
  } else {

It can be rewritten as follows with less indentation:

if (!page) {
     error = -ENOMEM;
     goto failed;
}
if (IS_ERR(page)) {
     error = -EINVAL;
     goto failed;
}
/* else block */

Am I missing something and misreading your code?

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 23/24] swap: fix multiple swap leak when after cgroup migrate
  2023-11-19 19:47 ` [PATCH 23/24] swap: fix multiple swap leak when after cgroup migrate Kairui Song
@ 2023-11-20  7:35   ` Huang, Ying
  2023-11-20 11:17     ` Kairui Song
  0 siblings, 1 reply; 93+ messages in thread
From: Huang, Ying @ 2023-11-20  7:35 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel, Tejun Heo

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> When a process which previously swapped some memory was moved to
> another cgroup, and the cgroup it previous in is dead, then swapped in
> pages will be leaked into rootcg. Previous commits fixed the bug for
> no readahead path, this commit fix the same issue for readahead path.
>
> This can be easily reproduced by:
> - Setup a SSD or HDD swap.
> - Create memory cgroup A, B and C.
> - Spawn process P1 in cgroup A and make it swap out some pages.
> - Move process P1 to memory cgroup B.
> - Destroy cgroup A.
> - Do a swapoff in cgroup C
> - Swapped in pages is accounted into cgroup C.
>
> This patch will fix it make the swapped in pages accounted in cgroup B.

According to the "Memory Ownership" section of
Documentation/admin-guide/cgroup-v2.rst,

"
A memory area is charged to the cgroup which instantiated it and stays
charged to the cgroup until the area is released.  Migrating a process
to a different cgroup doesn't move the memory usages that it
instantiated while in the previous cgroup to the new cgroup.
"

Because we don't move the charge when we move a task from one cgroup to
another, it's controversial which cgroup the memory should be charged to.
According to the above document, it's acceptable to charge it to cgroup C
(the cgroup where swapoff happens).

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 09/24] mm/swap: inline __swap_count
  2023-11-19 19:47 ` [PATCH 09/24] mm/swap: inline __swap_count Kairui Song
@ 2023-11-20  7:41   ` Huang, Ying
  2023-11-21  8:02     ` Chris Li
  0 siblings, 1 reply; 93+ messages in thread
From: Huang, Ying @ 2023-11-20  7:41 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> There is only one caller in swap subsystem now, where it can be inline
> smoothly, avoid the memory access and function call overheads.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swap.h | 6 ------
>  mm/swap_state.c      | 6 +++---
>  mm/swapfile.c        | 8 --------
>  3 files changed, 3 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2401990d954d..64a37819a9b3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -485,7 +485,6 @@ int swap_type_of(dev_t device, sector_t offset);
>  int find_first_swap(dev_t *device);
>  extern unsigned int count_swap_pages(int, int);
>  extern sector_t swapdev_block(int, pgoff_t);
> -extern int __swap_count(swp_entry_t entry);
>  extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -559,11 +558,6 @@ static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>  {
>  }
>  
> -static inline int __swap_count(swp_entry_t entry)
> -{
> -	return 0;
> -}
> -
>  static inline int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
>  {
>  	return 0;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index fb78f7f18ed7..d87c20f9f7ec 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -316,9 +316,9 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
>  	release_pages(pages, nr);
>  }
>  
> -static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
> +static inline bool swap_use_no_readahead(struct swap_info_struct *si, pgoff_t offset)
>  {
> -	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> +	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && swap_count(si->swap_map[offset]) == 1;
>  }
>  
>  static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> @@ -928,7 +928,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  
>  	si = swp_swap_info(entry);
>  	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> -	if (swap_use_no_readahead(si, entry)) {
> +	if (swap_use_no_readahead(si, swp_offset(entry))) {
>  		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
>  		cached = false;
>  	} else if (swap_use_vma_readahead(si)) {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a8ae472ed2b6..e15a6c464a38 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1431,14 +1431,6 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
>  		spin_unlock(&p->lock);
>  }
>  
> -int __swap_count(swp_entry_t entry)
> -{
> -	struct swap_info_struct *si = swp_swap_info(entry);
> -	pgoff_t offset = swp_offset(entry);
> -
> -	return swap_count(si->swap_map[offset]);
> -}
> -

I'd rather keep __swap_count() in its original place together with the other
swap count related functions.  And si->swap_map[] was hidden inside
swapfile.c before.  I don't think the change will bring any real
performance improvement.

>  /*
>   * How many references to @entry are currently swapped out?
>   * This does not give an exact answer when swap count is continued,

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper
  2023-11-19 21:00   ` Matthew Wilcox
@ 2023-11-20 11:14     ` Kairui Song
  0 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-20 11:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Michal Hocko, linux-kernel

On Mon, Nov 20, 2023 at 5:00 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Nov 20, 2023 at 03:47:19AM +0800, Kairui Song wrote:
> > +                     /* skip swapcache and readahead */
> > +                     page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > +                                             vmf);
> > +                     if (page)
> > +                             folio = page_folio(page);
>
> I think this should rather be:
>
>                         folio = swapin_no_readahead(entry,
>                                         GFP_HIGHUSER_MOVABLE, vma);
>                         page = &folio->page;
>
> and have swapin_no_readahead() work entirely in terms of folios.
>

Thanks for the review!

Good suggestion. I was actually thinking about converting more of the swapin
functions to use folios in a later series, since that involves more
changes and this series is getting a bit too long.
I'll try to see if I can reorganize the series to cover that too in a
sensible way.
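
As a quick illustration (just a sketch, not the final patch), the helper
converted to work on folios could look roughly like this, keeping the same
steps as the current swapin_no_readahead():

static struct folio *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
					 struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct folio *folio;
	void *shadow;

	/* The current helper hard-codes GFP_HIGHUSER_MOVABLE here. */
	folio = vma_alloc_folio(gfp_mask, 0, vma, vmf->address, false);
	if (!folio)
		return NULL;

	__folio_set_locked(folio);
	__folio_set_swapbacked(folio);

	if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
					   gfp_mask, entry)) {
		folio_unlock(folio);
		folio_put(folio);
		return NULL;
	}
	mem_cgroup_swapin_uncharge_swap(entry);

	shadow = get_shadow_from_swap_cache(entry);
	if (shadow)
		workingset_refault(folio, shadow);

	folio_add_lru(folio);

	/* Provide the entry to swap_readpage() */
	folio->swap = entry;
	swap_readpage(&folio->page, true, NULL);
	folio->private = NULL;

	return folio;
}

The caller would then only convert with page = &folio->page after checking
that folio is not NULL.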

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path
  2023-11-19 21:12   ` Matthew Wilcox
@ 2023-11-20 11:14     ` Kairui Song
  0 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-20 11:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Michal Hocko, linux-kernel

On Mon, Nov 20, 2023 at 5:12 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Nov 20, 2023 at 03:47:32AM +0800, Kairui Song wrote:
> >       page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> >                               vmf, &cache_result);
> > -     if (page) {
> > +     if (PTR_ERR(page) == -EBUSY) {
>
>         if (page == ERR_PTR(-EBUSY)) {
>

Thanks, I'll fix this.
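
For the record, the two checks accept exactly the same pointer value; the
suggested form just compares against the encoded error directly instead of
passing a possibly-valid pointer through PTR_ERR():

	/* compare against the encoded error value */
	if (page == ERR_PTR(-EBUSY))
		...

	/* equivalent in effect, but decodes the pointer first */
	if (PTR_ERR(page) == -EBUSY)
		...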

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 01/24] mm/swap: fix a potential undefined behavior issue
  2023-11-20  3:35     ` Chris Li
@ 2023-11-20 11:14       ` Kairui Song
  2023-11-20 17:34         ` Chris Li
  0 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-20 11:14 UTC (permalink / raw)
  To: Chris Li
  Cc: Matthew Wilcox, linux-mm, Andrew Morton, Huang, Ying,
	David Hildenbrand, Hugh Dickins, Johannes Weiner, Michal Hocko,
	LKML

On Mon, Nov 20, 2023 at 11:36 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> On Sun, Nov 19, 2023 at 12:55 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Nov 20, 2023 at 03:47:17AM +0800, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > When folio is NULL, taking the address of its struct member is an
> > > undefined behavior, the UB is caused by applying -> operator
>
> I think dereferencing a NULL pointer is undefined behavior, but there is
> no dereference here. It is just pointer arithmetic on a NULL pointer:
> adding the offset of page to the NULL pointer, which yields NULL.
>
> > > won't lead to a real issue, still better to fix it, also makes the
> > > code less error-prone, when folio is NULL, page is also NULL,
> > > instead of a meanless offset value.
>
> I consider your reasoning invalid. NULL pointer arithmetic should
> be legal. This patch is not needed.
>
> Chris

Hi, Chris and Matthew.

Thanks for the comments.

Right, it's just a language syntax level thing: since "->" has a
higher precedence, at the syntax level it is doing a member access
first, then taking the address. By the C definition a member access should
not happen if the object is invalid (NULL). Only a hypothetical problem
on paper...

This is indeed not needed since in reality it's just pointer
arithmetic. I'm OK dropping this.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check
  2023-11-20  4:17   ` Chris Li
@ 2023-11-20 11:15     ` Kairui Song
  2023-11-20 17:44       ` Chris Li
  0 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-20 11:15 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Mon, Nov 20, 2023 at 12:18 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > When swapping in a page, mem_cgroup_swapin_charge_folio is called for new
> > allocated folio, nothing else is referencing the folio so no need to set
> > the lock bit. This avoided doing unlock check on error path.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swap_state.c | 20 +++++++++-----------
> >  1 file changed, 9 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index ac4fa404eaa7..45dd8b7c195d 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -458,6 +458,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>
> You move the mem_cgroup_swapin_charge_folio() inside the for loop:
>
>
>         for (;;) {
>                 int err;
>                 /*
>                  * First check the swap cache.  Since this is normally
>                  * called after swap_cache_get_folio() failed, re-calling
>                  * that would confuse statistics.
>                  */
>                 folio = filemap_get_folio(swap_address_space(entry),
>                                                 swp_offset(entry));
>
>
> >                                                 mpol, ilx, numa_node_id());
> >                 if (!folio)
> >                          goto fail_put_swap;
> > +               if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
> > +                       goto fail_put_folio;
>
> Wouldn't it cause repeat charging of the folio when it is racing
> against others in the for loop?

The race loser will call folio_put and discharge it?

>
> >
> >                 /*
> >                  * Swap entry may have been freed since our caller observed it.
> > @@ -483,13 +485,9 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >         /*
> >          * The swap entry is ours to swap in. Prepare the new page.
> >          */
> > -
> >         __folio_set_locked(folio);
> >         __folio_set_swapbacked(folio);
> >
> > -       if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
> > -               goto fail_unlock;
> > -
>
> The original code makes the charge outside of the for loop. Only the
> winner can charge once.

Right, this patch may make the charge/dis-charge path more complex for
race swapin, I'll re-check this part.

>
> Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 08/24] mm/swap: check readahead policy per entry
  2023-11-20  6:04   ` Huang, Ying
@ 2023-11-20 11:17     ` Kairui Song
  2023-11-21  1:10       ` Huang, Ying
  2023-11-21  5:13       ` Chris Li
  0 siblings, 2 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-20 11:17 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel

On Mon, Nov 20, 2023 at 2:07 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently VMA readahead is globally disabled when any rotate disk is
> > used as swap backend. So multiple swap devices are enabled, if a slower
> > hard disk is set as a low priority fallback, and a high performance SSD
> > is used and high priority swap device, vma readahead is disabled globally.
> > The SSD swap device performance will drop by a lot.
> >
> > Check readahead policy per entry to avoid such problem.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swap_state.c | 12 +++++++-----
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index ff6756f2e8e4..fb78f7f18ed7 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
> >       return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> >  }
> >
> > -static inline bool swap_use_vma_readahead(void)
> > +static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> >  {
> > -     return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> > +     return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);
> >  }
> >
> >  /*
> > @@ -341,7 +341,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
> >
> >       folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
> >       if (!IS_ERR(folio)) {
> > -             bool vma_ra = swap_use_vma_readahead();
> > +             bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
> >               bool readahead;
> >
> >               /*
> > @@ -920,16 +920,18 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >                             struct vm_fault *vmf, bool *swapcached)
> >  {
> > +     struct swap_info_struct *si;
> >       struct mempolicy *mpol;
> >       struct page *page;
> >       pgoff_t ilx;
> >       bool cached;
> >
> > +     si = swp_swap_info(entry);
> >       mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > -     if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> > +     if (swap_use_no_readahead(si, entry)) {
> >               page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> >               cached = false;
> > -     } else if (swap_use_vma_readahead()) {
> > +     } else if (swap_use_vma_readahead(si)) {
>
> It's possible that some pages are swapped out to SSD while others are
> swapped out to HDD in a readahead window.
>
> I suspect that there are practical requirements to use swap on SSD and
> HDD at the same time.

Hi Ying,

Thanks for the review!

For the first issue, the "fragmented readahead window", I was planning to
do an extra check in the readahead path to skip readahead entries that are
on different swap devices (see the sketch below), which is not hard to do,
but this series is growing too long so I thought it would be better done later.
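
A rough sketch of that check (illustrative only, not part of this series),
inside the pte scan loop of swap_vma_readahead(), could be:

	pentry = ptep_get_lockless(pte);
	if (!is_swap_pte(pentry))
		continue;
	entry = pte_to_swp_entry(pentry);
	if (unlikely(non_swap_entry(entry)))
		continue;
	/*
	 * Hypothetical extra check: don't read ahead across swap
	 * devices; a different swp_type() means a different device,
	 * which may have a different readahead policy.
	 */
	if (swp_type(entry) != swp_type(targ_entry))
		continue;

where targ_entry stands for the entry that triggered the fault.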

For the second issue, "is there any practical use for multiple swap
devices", I think there actually is. For example, we are trying to use
multi-layer swap for offloading memory of different hotness on servers. We
also tried to implement a mechanism that migrates long-sleeping swap
entries from a high performance SSD/RAMDISK swap to a cheap HDD swap
device, with more than two layers of swap. It worked, except for the
upstream issue that the readahead policy no longer works as expected.


>
> >               page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> >               cached = true;
> >       } else {
>
> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 19/24] shmem, swap: refactor error check on OOM or race
  2023-11-20  7:04   ` Chris Li
@ 2023-11-20 11:17     ` Kairui Song
  0 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-20 11:17 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Mon, Nov 20, 2023 at 3:12 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > It should always check if a swap entry is already swaped in on error,
> > fix potential false error issue.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/shmem.c | 10 ++++------
> >  1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 81d129aa66d1..6154b5b68b8f 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1857,13 +1857,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >         page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
> >         mpol_cond_put(mpol);
> >
> > -       if (PTR_ERR(page) == -EBUSY) {
> > -               if (!shmem_confirm_swap(mapping, index, swap))
> > -                       return -EEXIST;
>
> Do you intentionally remove checking shmem_confirm_swap()?
> I am not sure I am following.

Hi, Chris, thanks for the review.

This is also called under the failed: label, so I think there is no need
to keep it here.

>
> > +       if (IS_ERR_OR_NULL(page)) {
> > +               if (!page)
> > +                       error = -ENOMEM;
> >                 else
> > -                       return -EINVAL;
> > -       } else if (!page) {
> > -               error = -ENOMEM;
> > +                       error = -EINVAL;
>
> The resulting code is a bit hard to read in the diff. In plain source it looks like:
>
>         if (IS_ERR_OR_NULL(page)) {
>                 if (!page)
>                         error = -ENOMEM;
>                 else
>                         error = -EINVAL;
>                 goto failed;
>         } else {
>                 folio = page_folio(page);
>                 if (fault_type && result != SWAP_CACHE_HIT) {
>                         *fault_type |= VM_FAULT_MAJOR;
>                         count_vm_event(PGMAJFAULT);
>                         count_memcg_event_mm(fault_mm, PGMAJFAULT);
>                 }
>         }
>
> First of all, if the first branch always does "goto failed", the second else is not needed.
> The whole else block can be flattened out by one level of indentation.

Yes, thanks for the suggestion.

>
>  if (IS_ERR_OR_NULL(page)) {
>                 if (!page)
>                         error = -ENOMEM;
>                 else
>                         error = -EINVAL;
>                 goto failed;
>   } else {
>
> It can be rewritten as follows with less indentation:
>
> if (!page) {
>      error = -ENOMEM;
>      goto failed;
> }
> if (IS_ERR(page)) {
>      error = -EINVAL;
>      goto failed;
> }
> /* else block */
>
> Am I missing something and misreading your code?

Your code looks cleaner :)

This patch is trying to clean up the error path after the helper
refactor, and your suggestions are very helpful, thanks!

>
> Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 23/24] swap: fix multiple swap leak when after cgroup migrate
  2023-11-20  7:35   ` Huang, Ying
@ 2023-11-20 11:17     ` Kairui Song
  2023-11-22  5:34       ` Chris Li
  0 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-20 11:17 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel,
	Tejun Heo

On Mon, Nov 20, 2023 at 3:37 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > When a process which previously swapped some memory was moved to
> > another cgroup, and the cgroup it previous in is dead, then swapped in
> > pages will be leaked into rootcg. Previous commits fixed the bug for
> > no readahead path, this commit fix the same issue for readahead path.
> >
> > This can be easily reproduced by:
> > - Setup a SSD or HDD swap.
> > - Create memory cgroup A, B and C.
> > - Spawn process P1 in cgroup A and make it swap out some pages.
> > - Move process P1 to memory cgroup B.
> > - Destroy cgroup A.
> > - Do a swapoff in cgroup C
> > - Swapped in pages is accounted into cgroup C.
> >
> > This patch will fix it make the swapped in pages accounted in cgroup B.
>
> According to the "Memory Ownership" section of
> Documentation/admin-guide/cgroup-v2.rst,
>
> "
> A memory area is charged to the cgroup which instantiated it and stays
> charged to the cgroup until the area is released.  Migrating a process
> to a different cgroup doesn't move the memory usages that it
> instantiated while in the previous cgroup to the new cgroup.
> "
>
> Because we don't move the charge when we move a task from one cgroup to
> another, it's controversial which cgroup the memory should be charged to.
> According to the above document, it's acceptable to charge it to cgroup C
> (the cgroup where swapoff happens).

Hi Ying, thank you very much for the info!

It is controversial indeed; it's just that the original behavior is kind of
counter-intuitive.

Imagine there is a cgroup P1 with child cgroups C1 and C2. If a process
swapped out some memory in C1, then moved to C2, and C1 was destroyed,
then on swapoff the charge will be moved out of P1...

And swapoff often happens in some unlimited cgroup, or in a cgroup for a
management agent.

If P1 has a memory limit, it can breach the limit easily: we will see
a process that never left P1 having a much higher RSS than the limit of
P1/C1/C2.
And if there is a limit on the management agent's cgroup, the agent
will be OOM killed instead of the OOM happening in P1.

Simply moving a process between child cgroups of the same parent
cgroup won't cause such an issue; things only get weird when swapoff is
involved.

Or maybe we should try to stay compatible, and introduce a sysctl or
cmdline option for this?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper
  2023-11-19 19:47 ` [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper Kairui Song
  2023-11-19 21:00   ` Matthew Wilcox
@ 2023-11-20 14:55   ` Dan Carpenter
  2023-11-21  5:34   ` Chris Li
  2 siblings, 0 replies; 93+ messages in thread
From: Dan Carpenter @ 2023-11-20 14:55 UTC (permalink / raw)
  To: oe-kbuild, Kairui Song, linux-mm
  Cc: lkp, oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Huang, Ying, David Hildenbrand, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, linux-kernel, Kairui Song

Hi Kairui,

kernel test robot noticed the following build warnings:

[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-fix-a-potential-undefined-behavior-issue/20231120-035926
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20231119194740.94101-4-ryncsn%40gmail.com
patch subject: [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper
config: um-randconfig-r081-20231120 (https://download.01.org/0day-ci/archive/20231120/202311201135.iG2KLShW-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce: (https://download.01.org/0day-ci/archive/20231120/202311201135.iG2KLShW-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <error27@gmail.com>
| Closes: https://lore.kernel.org/r/202311201135.iG2KLShW-lkp@intel.com/

smatch warnings:
mm/swap_state.c:881 swapin_no_readahead() warn: use 'gfp_mask' here instead of GFP_KERNEL?

vim +/gfp_mask +881 mm/swap_state.c

19f582d2684e47 Kairui Song 2023-11-20  856  /**
19f582d2684e47 Kairui Song 2023-11-20  857   * swapin_no_readahead - swap in pages skipping swap cache and readahead
19f582d2684e47 Kairui Song 2023-11-20  858   * @entry: swap entry of this memory
19f582d2684e47 Kairui Song 2023-11-20  859   * @gfp_mask: memory allocation flags
19f582d2684e47 Kairui Song 2023-11-20  860   * @vmf: fault information
19f582d2684e47 Kairui Song 2023-11-20  861   *
19f582d2684e47 Kairui Song 2023-11-20  862   * Returns the struct page for entry and addr after the swap entry is read
19f582d2684e47 Kairui Song 2023-11-20  863   * in.
19f582d2684e47 Kairui Song 2023-11-20  864   */
19f582d2684e47 Kairui Song 2023-11-20  865  struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
19f582d2684e47 Kairui Song 2023-11-20  866  				 struct vm_fault *vmf)
19f582d2684e47 Kairui Song 2023-11-20  867  {
19f582d2684e47 Kairui Song 2023-11-20  868  	struct vm_area_struct *vma = vmf->vma;
19f582d2684e47 Kairui Song 2023-11-20  869  	struct page *page = NULL;
19f582d2684e47 Kairui Song 2023-11-20  870  	struct folio *folio;
19f582d2684e47 Kairui Song 2023-11-20  871  	void *shadow = NULL;
19f582d2684e47 Kairui Song 2023-11-20  872  
19f582d2684e47 Kairui Song 2023-11-20  873  	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
19f582d2684e47 Kairui Song 2023-11-20  874  				vma, vmf->address, false);
19f582d2684e47 Kairui Song 2023-11-20  875  	if (folio) {
19f582d2684e47 Kairui Song 2023-11-20  876  		__folio_set_locked(folio);
19f582d2684e47 Kairui Song 2023-11-20  877  		__folio_set_swapbacked(folio);
19f582d2684e47 Kairui Song 2023-11-20  878  
19f582d2684e47 Kairui Song 2023-11-20  879  		if (mem_cgroup_swapin_charge_folio(folio,
19f582d2684e47 Kairui Song 2023-11-20  880  					vma->vm_mm, GFP_KERNEL,

s/GFP_KERNEL/gfp_mask/?

19f582d2684e47 Kairui Song 2023-11-20 @881  					entry)) {
19f582d2684e47 Kairui Song 2023-11-20  882  			folio_unlock(folio);
19f582d2684e47 Kairui Song 2023-11-20  883  			folio_put(folio);
19f582d2684e47 Kairui Song 2023-11-20  884  			return NULL;
19f582d2684e47 Kairui Song 2023-11-20  885  		}
19f582d2684e47 Kairui Song 2023-11-20  886  		mem_cgroup_swapin_uncharge_swap(entry);
19f582d2684e47 Kairui Song 2023-11-20  887  
19f582d2684e47 Kairui Song 2023-11-20  888  		shadow = get_shadow_from_swap_cache(entry);
19f582d2684e47 Kairui Song 2023-11-20  889  		if (shadow)
19f582d2684e47 Kairui Song 2023-11-20  890  			workingset_refault(folio, shadow);
19f582d2684e47 Kairui Song 2023-11-20  891  
19f582d2684e47 Kairui Song 2023-11-20  892  		folio_add_lru(folio);
19f582d2684e47 Kairui Song 2023-11-20  893  
19f582d2684e47 Kairui Song 2023-11-20  894  		/* To provide entry to swap_readpage() */
19f582d2684e47 Kairui Song 2023-11-20  895  		folio->swap = entry;
19f582d2684e47 Kairui Song 2023-11-20  896  		page = &folio->page;
19f582d2684e47 Kairui Song 2023-11-20  897  		swap_readpage(page, true, NULL);
19f582d2684e47 Kairui Song 2023-11-20  898  		folio->private = NULL;
19f582d2684e47 Kairui Song 2023-11-20  899  	}
19f582d2684e47 Kairui Song 2023-11-20  900  
19f582d2684e47 Kairui Song 2023-11-20  901  	return page;
19f582d2684e47 Kairui Song 2023-11-20  902  }

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 01/24] mm/swap: fix a potential undefined behavior issue
  2023-11-20 11:14       ` Kairui Song
@ 2023-11-20 17:34         ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-20 17:34 UTC (permalink / raw)
  To: Kairui Song
  Cc: Matthew Wilcox, linux-mm, Andrew Morton, Huang, Ying,
	David Hildenbrand, Hugh Dickins, Johannes Weiner, Michal Hocko,
	LKML

Hi Kairui,

On Mon, Nov 20, 2023 at 3:15 AM Kairui Song <ryncsn@gmail.com> wrote:
> > Chris
>
> Hi, Chris and Matthew.
>
> Thanks for the comments.
>
> Right, it's just a language syntax level thing: since "->" has a
> higher precedence, at the syntax level it is doing a member access
> first, then taking the address. By the C definition a member access should
> not happen if the object is invalid (NULL). Only a hypothetical problem
> on paper...

The dereference only shows up at the abstract syntax tree level.
According to the C standard there are expansion and evaluation phases
after that, and at the evaluation phase the dereference turns into
pointer arithmetic. Per my understanding, the dereference never
actually happens, not even in theory, because of those evaluation rules.
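
As an aside, here is a small userspace illustration of that lowering
(added for clarity, not part of the original exchange); it is the same
idea the classic offsetof() trick relies on, since taking the address of
a member only adds the member offset and never loads through the pointer:

	#include <stddef.h>
	#include <stdio.h>

	struct foo {
		int a;
		int b;
	};

	int main(void)
	{
		struct foo *p = NULL;

		/*
		 * &p->b does not read *p; the compiler lowers it to
		 * "p plus the byte offset of b". On paper it is exactly
		 * the construct debated above; in practice it prints the
		 * same value as offsetof(struct foo, b).
		 */
		printf("%zu\n", (size_t)((char *)&p->b - (char *)p));
		printf("%zu\n", offsetof(struct foo, b));
		return 0;
	}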

> This is indeed not needed since in reality it's just pointer
> arithmetic. I'm OK dropping this.

Thanks

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check
  2023-11-20 11:15     ` Kairui Song
@ 2023-11-20 17:44       ` Chris Li
  2023-11-22 17:32         ` Kairui Song
  0 siblings, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-20 17:44 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Mon, Nov 20, 2023 at 3:15 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > > index ac4fa404eaa7..45dd8b7c195d 100644
> > > --- a/mm/swap_state.c
> > > +++ b/mm/swap_state.c
> > > @@ -458,6 +458,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >
> > You move the mem_cgroup_swapin_charge_folio() inside the for loop:
> >
> >
> >         for (;;) {
> >                 int err;
> >                 /*
> >                  * First check the swap cache.  Since this is normally
> >                  * called after swap_cache_get_folio() failed, re-calling
> >                  * that would confuse statistics.
> >                  */
> >                 folio = filemap_get_folio(swap_address_space(entry),
> >                                                 swp_offset(entry));
> >
> >
> > >                                                 mpol, ilx, numa_node_id());
> > >                 if (!folio)
> > >                          goto fail_put_swap;
> > > +               if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
> > > +                       goto fail_put_folio;
> >
> > Wouldn't it cause repeat charging of the folio when it is racing
> > against others in the for loop?
>
> The race loser will call folio_put and discharge it?

There are two different charges: memcg charging and memcg swapin charging.
folio_put() will do the memcg uncharge; the corresponding memcg charge
happens at folio allocation.
The memcg swapin charge does things differently: it also needs to modify
the swap-related accounting.
So the memcg uncharge is not the pair of the memcg swapin charge.

> > >                 /*
> > >                  * Swap entry may have been freed since our caller observed it.
> > > @@ -483,13 +485,9 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > >         /*
> > >          * The swap entry is ours to swap in. Prepare the new page.
> > >          */
> > > -
> > >         __folio_set_locked(folio);
> > >         __folio_set_swapbacked(folio);
> > >
> > > -       if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
> > > -               goto fail_unlock;
> > > -
> >
> > The original code makes the charge outside of the for loop. Only the
> > winner can charge once.
>
> Right, this patch may make the charge/dis-charge path more complex for
> race swapin, I'll re-check this part.

It is more than just added complexity; it seems to change the behavior of this code.

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 00/24] Swapin path refactor for optimization and bugfix
  2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
                   ` (23 preceding siblings ...)
  2023-11-19 19:47 ` [PATCH 24/24] mm/swap: change swapin_readahead to swapin_page_fault Kairui Song
@ 2023-11-20 19:09 ` Yosry Ahmed
  2023-11-20 20:22   ` Chris Li
  2023-11-22  6:43   ` Kairui Song
  24 siblings, 2 replies; 93+ messages in thread
From: Yosry Ahmed @ 2023-11-20 19:09 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This series tries to unify and clean up the swapin path, fixing a few
> issues with optimizations:
>
> 1. Memcg leak issue: when a process that previously swapped out some
>    migrated to another cgroup, and the origianl cgroup is dead. If we
>    do a swapoff, swapped in pages will be accounted into the process
>    doing swapoff instead of the new cgroup. This will allow the process
>    to use more memory than expect easily.
>
>    This can be easily reproduced by:
>    - Setup a swap.
>    - Create memory cgroup A, B and C.
>    - Spawn process P1 in cgroup A and make it swap out some pages.
>    - Move process P1 to memory cgroup B.
>    - Destroy cgroup A.
>    - Do a swapoff in cgroup C
>    - Swapped in pages is accounted into cgroup C.
>
>    This patch will fix it make the swapped in pages accounted in cgroup B.
>

I guess this only works for anonymous memory and not shmem, right?

I think tying memcg charges to a process is not something we usually
do. Charging the pages to the memcg of the faulting process if the
previous owner is dead makes sense, it's essentially recharging the
memory to the new owner. Swapoff is indeed a special case, since the
faulting process is not the new owner, but an admin process or so. I
am guessing charging to the new memcg of the previous owner might make
sense in this case, but it is a change of behavior.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 00/24] Swapin path refactor for optimization and bugfix
  2023-11-20 19:09 ` [PATCH 00/24] Swapin path refactor for optimization and bugfix Yosry Ahmed
@ 2023-11-20 20:22   ` Chris Li
  2023-11-22  6:46     ` Kairui Song
  2023-11-22  6:43   ` Kairui Song
  1 sibling, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-20 20:22 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Kairui Song, linux-mm, Andrew Morton, Huang, Ying,
	David Hildenbrand, Hugh Dickins, Johannes Weiner, Matthew Wilcox,
	Michal Hocko, linux-kernel

Hi Kairui,

On Mon, Nov 20, 2023 at 11:10 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > This series tries to unify and clean up the swapin path, fixing a few
> > issues with optimizations:
> >
> > 1. Memcg leak issue: when a process that previously swapped out some
> >    migrated to another cgroup, and the origianl cgroup is dead. If we
> >    do a swapoff, swapped in pages will be accounted into the process
> >    doing swapoff instead of the new cgroup. This will allow the process
> >    to use more memory than expect easily.
> >
> >    This can be easily reproduced by:
> >    - Setup a swap.
> >    - Create memory cgroup A, B and C.
> >    - Spawn process P1 in cgroup A and make it swap out some pages.
> >    - Move process P1 to memory cgroup B.
> >    - Destroy cgroup A.
> >    - Do a swapoff in cgroup C
> >    - Swapped in pages is accounted into cgroup C.
> >
> >    This patch will fix it make the swapped in pages accounted in cgroup B.
> >
>
> I guess this only works for anonymous memory and not shmem, right?
>
> I think tying memcg charges to a process is not something we usually
> do. Charging the pages to the memcg of the faulting process if the
> previous owner is dead makes sense, it's essentially recharging the
> memory to the new owner. Swapoff is indeed a special case, since the
> faulting process is not the new owner, but an admin process or so. I
> am guessing charging to the new memcg of the previous owner might make
> sense in this case, but it is a change of behavior.
>

I was looking at this in patch 23 as well. I will ask more questions in
the patch thread.
I would suggest separating these two behavior-change patches (patch 5 and
patch 23) out from the cleanup series, to give them more exposure and
proper discussion.

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 08/24] mm/swap: check readahead policy per entry
  2023-11-20 11:17     ` Kairui Song
@ 2023-11-21  1:10       ` Huang, Ying
  2023-11-21  5:20         ` Chris Li
  2023-11-21  5:13       ` Chris Li
  1 sibling, 1 reply; 93+ messages in thread
From: Huang, Ying @ 2023-11-21  1:10 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> Huang, Ying <ying.huang@intel.com> wrote on Mon, Nov 20, 2023 at 14:07:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > Currently VMA readahead is globally disabled when any rotate disk is
>> > used as swap backend. So multiple swap devices are enabled, if a slower
>> > hard disk is set as a low priority fallback, and a high performance SSD
>> > is used and high priority swap device, vma readahead is disabled globally.
>> > The SSD swap device performance will drop by a lot.
>> >
>> > Check readahead policy per entry to avoid such problem.
>> >
>> > Signed-off-by: Kairui Song <kasong@tencent.com>
>> > ---
>> >  mm/swap_state.c | 12 +++++++-----
>> >  1 file changed, 7 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> > index ff6756f2e8e4..fb78f7f18ed7 100644
>> > --- a/mm/swap_state.c
>> > +++ b/mm/swap_state.c
>> > @@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
>> >       return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
>> >  }
>> >
>> > -static inline bool swap_use_vma_readahead(void)
>> > +static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
>> >  {
>> > -     return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
>> > +     return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);
>> >  }
>> >
>> >  /*
>> > @@ -341,7 +341,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
>> >
>> >       folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
>> >       if (!IS_ERR(folio)) {
>> > -             bool vma_ra = swap_use_vma_readahead();
>> > +             bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
>> >               bool readahead;
>> >
>> >               /*
>> > @@ -920,16 +920,18 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>> >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>> >                             struct vm_fault *vmf, bool *swapcached)
>> >  {
>> > +     struct swap_info_struct *si;
>> >       struct mempolicy *mpol;
>> >       struct page *page;
>> >       pgoff_t ilx;
>> >       bool cached;
>> >
>> > +     si = swp_swap_info(entry);
>> >       mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>> > -     if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
>> > +     if (swap_use_no_readahead(si, entry)) {
>> >               page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
>> >               cached = false;
>> > -     } else if (swap_use_vma_readahead()) {
>> > +     } else if (swap_use_vma_readahead(si)) {
>>
>> It's possible that some pages are swapped out to SSD while others are
>> swapped out to HDD in a readahead window.
>>
>> I suspect that there are practical requirements to use swap on SSD and
>> HDD at the same time.
>
> Hi Ying,
>
> Thanks for the review!
>
> For the first issue "fragmented readahead window", I was planning to
> do an extra check in readahead path to skip readahead entries that are
> on different swap devices, which is not hard to do,

This is a possible solution.

> but this series is growing too long so I thought it will be better
> done later.

You don't need to keep everything in one series.  Just use multiple
series, even if they are all swap-related.  They are in fact dealing with
different problems.

> For the second issue, "is there any practical use for multiple swap",
> I think actually there are. For example we are trying to use multi
> layer swap for offloading memory of different hotness on servers. And
> we also tried to implement a mechanism to migrate long sleep swap
> entries from high performance SSD/RAMDISK swap to cheap HDD swap
> device, with more than two layers of swap, which worked except the
> upstream issue, that readahead policy will no longer work as expected.

Thanks for your information.

>> >               page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
>> >               cached = true;
>> >       } else {

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 08/24] mm/swap: check readahead policy per entry
  2023-11-20 11:17     ` Kairui Song
  2023-11-21  1:10       ` Huang, Ying
@ 2023-11-21  5:13       ` Chris Li
  1 sibling, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-21  5:13 UTC (permalink / raw)
  To: Kairui Song
  Cc: Huang, Ying, linux-mm, Andrew Morton, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Mon, Nov 20, 2023 at 3:17 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Hi Ying,
>
> Thanks for the review!
>
> For the first issue "fragmented readahead window", I was planning to
> do an extra check in readahead path to skip readahead entries that are

That makes sense. The readahead is an optional speed optimization; if the
readahead window crosses a swap device boundary, the readahead portion can
be capped there.
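
As an illustrative sketch (hypothetical helper, not part of this series),
such a cap could compare the swap type of each readahead candidate
against the faulting entry and stop at the device boundary:

	/* Hypothetical: a candidate on another swap device ends the window. */
	static inline bool swap_entry_same_device(swp_entry_t a, swp_entry_t b)
	{
		return swp_type(a) == swp_type(b);
	}

	/* ... inside the readahead loop, with targ_entry the faulting entry ... */
		if (!swap_entry_same_device(targ_entry, entry))
			break;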

> on different swap devices, which is not hard to do, but this series is
> growing too long so I thought it will be better done later.
>
> For the second issue, "is there any practical use for multiple swap",
> I think actually there are. For example we are trying to use multi
> layer swap for offloading memory of different hotness on servers. And
> we also tried to implement a mechanism to migrate long sleep swap
> entries from high performance SSD/RAMDISK swap to cheap HDD swap
> device, with more than two layers of swap, which worked except the
> upstream issue, that readahead policy will no longer work as expected.

Thank you very much for sharing your use case. I am proposing
"memory.swap.tiers" in this email thread:
https://lore.kernel.org/linux-mm/CAF8kJuOD6zq2VPcVdoZGvkzYX8iXn1akuYhNDJx-LUdS+Sx3GA@mail.gmail.com/
It allows a memcg to select which swap devices/tiers it wants to opt in to.
Your SSD and HDD swap combination is what I have in mind as well.

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 08/24] mm/swap: check readahead policy per entry
  2023-11-21  1:10       ` Huang, Ying
@ 2023-11-21  5:20         ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-21  5:20 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Kairui Song, linux-mm, Andrew Morton, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Mon, Nov 20, 2023 at 5:12 PM Huang, Ying <ying.huang@intel.com> wrote:
> > but this series is growing too long so I thought it will be better
> > done later.
>
> You don't need to keep everything in one series.  Just use multiple
> series, even if they are all swap-related.  They are in fact dealing with
> different problems.

I second that. Actually having multiple smaller series is *preferred*
over one long series.
Shorter series are easier to review.

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper
  2023-11-19 19:47 ` [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper Kairui Song
  2023-11-19 21:00   ` Matthew Wilcox
  2023-11-20 14:55   ` Dan Carpenter
@ 2023-11-21  5:34   ` Chris Li
  2023-11-22 17:33     ` Kairui Song
  2 siblings, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-21  5:34 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> No feature change, simply move the routine to a standalone function to
> be used later. The error path handling is copied from the "out_page"
> label, to make the code change minimized for easier reviewing.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c     | 33 +++++----------------------------
>  mm/swap.h       |  2 ++
>  mm/swap_state.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 55 insertions(+), 28 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 70ffa867b1be..fba4a5229163 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3794,7 +3794,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         swp_entry_t entry;
>         pte_t pte;
>         vm_fault_t ret = 0;
> -       void *shadow = NULL;
>
>         if (!pte_unmap_same(vmf))
>                 goto out;
> @@ -3858,33 +3857,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (!folio) {
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>                     __swap_count(entry) == 1) {
> -                       /* skip swapcache */
> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -                                               vma, vmf->address, false);
> -                       if (folio) {
> -                               __folio_set_locked(folio);
> -                               __folio_set_swapbacked(folio);
> -
> -                               if (mem_cgroup_swapin_charge_folio(folio,
> -                                                       vma->vm_mm, GFP_KERNEL,
> -                                                       entry)) {
> -                                       ret = VM_FAULT_OOM;
> -                                       goto out_page;
> -                               }
> -                               mem_cgroup_swapin_uncharge_swap(entry);
> -
> -                               shadow = get_shadow_from_swap_cache(entry);
> -                               if (shadow)
> -                                       workingset_refault(folio, shadow);
> -
> -                               folio_add_lru(folio);
> -                               page = &folio->page;
> -
> -                               /* To provide entry to swap_readpage() */
> -                               folio->swap = entry;
> -                               swap_readpage(page, true, NULL);
> -                               folio->private = NULL;
> -                       }
> +                       /* skip swapcache and readahead */
> +                       page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +                                               vmf);

A minor point: swapin_no_readahead() is named in negative terms.

It is better to use positive words that express what the function does
rather than what it does not do.
I am terrible at naming functions myself, but I can think of something
along the lines of:
swapin_direct (no cache).
swapin_minimal?
swapin_entry_only?

Please suggest better names along the lines of "basic" or "bare minimum".

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead
  2023-11-19 19:47 ` [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead Kairui Song
@ 2023-11-21  6:15   ` Chris Li
  2023-11-21  6:35     ` Kairui Song
  0 siblings, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-21  6:15 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This makes swapin_readahead a main entry for swapin pages,
> prepare for optimizations in later commits.
>
> This also makes swapoff able to make use of readahead checking
> based on entry. Swapping off a 10G ZRAM (lzo-rle) is faster:
>
> Before:
> time swapoff /dev/zram0
> real    0m12.337s
> user    0m0.001s
> sys     0m12.329s
>
> After:
> time swapoff /dev/zram0
> real    0m9.728s
> user    0m0.001s
> sys     0m9.719s
>
> And what's more, because now swapoff will also make use of no-readahead
> swapin helper, this also fixed a bug for no-readahead case (eg. ZRAM):
> when a process that swapped out some memory previously was moved to a new
> cgroup, and the original cgroup is dead, swapoff the swap device will
> make the swapped in pages accounted into the process doing the swapoff
> instead of the new cgroup the process was moved to.
>
> This can be easily reproduced by:
> - Setup a ramdisk (eg. ZRAM) swap.
> - Create memory cgroup A, B and C.
> - Spawn process P1 in cgroup A and make it swap out some pages.
> - Move process P1 to memory cgroup B.
> - Destroy cgroup A.
> - Do a swapoff in cgroup C.
> - Swapped in pages is accounted into cgroup C.
>
> This patch will fix it make the swapped in pages accounted in cgroup B.

Can you help me understand where the code makes this behavior change?
As far as I can tell, the patch just allows swapoff to use the code path
that avoids readahead and the swap cache. I haven't figured out where the
code makes the swapin get charged to B.
Does it need to be ZRAM? Will zswap or SSD work in your example?

>
> The same bug exists for readahead path too, we'll fix it in later
> commits.

As discussed in another email, this behavior change is against the
current memcg memory charging model.
Better to separate this behavior change out from the code cleanup.

>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c     | 22 +++++++---------------
>  mm/swap.h       |  6 ++----
>  mm/swap_state.c | 33 ++++++++++++++++++++++++++-------
>  mm/swapfile.c   |  2 +-
>  4 files changed, 36 insertions(+), 27 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index fba4a5229163..f4237a2e3b93 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3792,6 +3792,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         rmap_t rmap_flags = RMAP_NONE;
>         bool exclusive = false;
>         swp_entry_t entry;
> +       bool swapcached;
>         pte_t pte;
>         vm_fault_t ret = 0;
>
> @@ -3855,22 +3856,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         swapcache = folio;
>
>         if (!folio) {
> -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> -                   __swap_count(entry) == 1) {
> -                       /* skip swapcache and readahead */
> -                       page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -                                               vmf);
> -                       if (page)
> -                               folio = page_folio(page);
> +               page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +                                       vmf, &swapcached);
> +               if (page) {
> +                       folio = page_folio(page);
> +                       if (swapcached)
> +                               swapcache = folio;
>                 } else {
> -                       page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -                                               vmf);
> -                       if (page)
> -                               folio = page_folio(page);
> -                       swapcache = folio;
> -               }
> -
> -               if (!folio) {
>                         /*
>                          * Back out if somebody else faulted in this pte
>                          * while we released the pte lock.
> diff --git a/mm/swap.h b/mm/swap.h
> index ea4be4791394..f82d43d7b52a 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -55,9 +55,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>                                     struct mempolicy *mpol, pgoff_t ilx);
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> -                             struct vm_fault *vmf);
> -struct page *swapin_no_readahead(swp_entry_t entry, gfp_t flag,
> -                                struct vm_fault *vmf);
> +                             struct vm_fault *vmf, bool *swapcached);
>
>  static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
> @@ -89,7 +87,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
>  }
>
>  static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> -                       struct vm_fault *vmf)
> +                       struct vm_fault *vmf, bool *swapcached)
>  {
>         return NULL;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 45dd8b7c195d..fd0047ae324e 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -316,6 +316,11 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
>         release_pages(pages, nr);
>  }
>
> +static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
> +{
> +       return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> +}
> +
>  static inline bool swap_use_vma_readahead(void)
>  {
>         return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> @@ -861,8 +866,8 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>   * Returns the struct page for entry and addr after the swap entry is read
>   * in.
>   */
> -struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> -                                struct vm_fault *vmf)
> +static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> +                                       struct vm_fault *vmf)
>  {
>         struct vm_area_struct *vma = vmf->vma;
>         struct page *page = NULL;
> @@ -904,6 +909,8 @@ struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>   * @entry: swap entry of this memory
>   * @gfp_mask: memory allocation flags
>   * @vmf: fault information
> + * @swapcached: pointer to a bool used as indicator if the
> + *              page is swapped in through swapcache.
>   *
>   * Returns the struct page for entry and addr, after queueing swapin.
>   *
> @@ -912,17 +919,29 @@ struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>   * or vma-based(ie, virtual address based on faulty address) readahead.
>   */
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> -                               struct vm_fault *vmf)
> +                             struct vm_fault *vmf, bool *swapcached)

At this point the function name "swapin_readahead" does not match what
it does any more, because readahead is just one of its behaviors; it can
also swap in without readahead. It needs a better name; this function
becomes a generic swapin_entry.

>  {
>         struct mempolicy *mpol;
> -       pgoff_t ilx;
>         struct page *page;
> +       pgoff_t ilx;
> +       bool cached;
>
>         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> -       page = swap_use_vma_readahead() ?
> -               swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
> -               swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> +       if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> +               page = swapin_no_readahead(entry, gfp_mask, vmf);
> +               cached = false;
> +       } else if (swap_use_vma_readahead()) {
> +               page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> +               cached = true;
> +       } else {

Notice that which flavor of swapin to use is actually a property of the
swap device: for a given device, it always calls the same swapin_entry
function.

One idea is to have a VFS-like API for swap devices. This can be a
swapin_entry function callback in a swap_ops struct; different swap
devices just register different swapin_entry hooks. That way we don't
need to look at the device flags to decide what to do, we can just call
the VFS-like swap_ops->swapin_entry(), and the function pointer will
point to the right implementation. Then we don't need these three levels
of if/else blocks. It is also more flexible for registering custom swap
layouts. Something to consider for the future; you don't have to
implement it in this series.
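
A rough sketch of that idea (purely hypothetical names, no such API
exists in the current kernel):

	/* Hypothetical interface, for illustration only. */
	struct swap_ops {
		struct page *(*swapin_entry)(swp_entry_t entry, gfp_t gfp_mask,
					     struct vm_fault *vmf, bool *swapcached);
	};

	/* A device like ZRAM could register a no-readahead flavor at swapon time: */
	static const struct swap_ops zram_like_swap_ops = {
		.swapin_entry = zram_swapin_entry,	/* hypothetical callback */
	};

	/* ... and the fault path would simply do: */
		page = si->ops->swapin_entry(entry, gfp_mask, vmf, &swapcached);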

> +               page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> +               cached = true;
> +       }
>         mpol_cond_put(mpol);
> +
> +       if (swapcached)
> +               *swapcached = cached;
> +
>         return page;
>  }
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 756104ebd585..0142bfc71b81 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1874,7 +1874,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>                         };
>
>                         page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -                                               &vmf);
> +                                               &vmf, NULL);
>                         if (page)
>                                 folio = page_folio(page);
>                 }
> --
> 2.42.0
>
>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead
  2023-11-21  6:15   ` Chris Li
@ 2023-11-21  6:35     ` Kairui Song
  2023-11-21  7:41       ` Chris Li
  0 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-21  6:35 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Chris Li <chrisl@kernel.org> wrote on Tue, Nov 21, 2023 at 14:18:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > This makes swapin_readahead a main entry for swapin pages,
> > prepare for optimizations in later commits.
> >
> > This also makes swapoff able to make use of readahead checking
> > based on entry. Swapping off a 10G ZRAM (lzo-rle) is faster:
> >
> > Before:
> > time swapoff /dev/zram0
> > real    0m12.337s
> > user    0m0.001s
> > sys     0m12.329s
> >
> > After:
> > time swapoff /dev/zram0
> > real    0m9.728s
> > user    0m0.001s
> > sys     0m9.719s
> >
> > And what's more, because now swapoff will also make use of no-readahead
> > swapin helper, this also fixed a bug for no-readahead case (eg. ZRAM):
> > when a process that swapped out some memory previously was moved to a new
> > cgroup, and the original cgroup is dead, swapoff the swap device will
> > make the swapped in pages accounted into the process doing the swapoff
> > instead of the new cgroup the process was moved to.
> >
> > This can be easily reproduced by:
> > - Setup a ramdisk (eg. ZRAM) swap.
> > - Create memory cgroup A, B and C.
> > - Spawn process P1 in cgroup A and make it swap out some pages.
> > - Move process P1 to memory cgroup B.
> > - Destroy cgroup A.
> > - Do a swapoff in cgroup C.
> > - Swapped in pages is accounted into cgroup C.
> >
> > This patch will fix it make the swapped in pages accounted in cgroup B.
>
> Can you help me understand where the code makes this behavior change?
> As far as I can tell, the patch just allows swapoff to use the code path
> that avoids readahead and the swap cache. I haven't figured out where
> the code makes the swapin get charged to B.

Yes, swapoff always calls swapin_readahead to swap in pages. Before this
patch, the page allocation/charge path is like this:
(Here we assume there is only a ZRAM device, so VMA readahead is used)
swapoff
  swapin_readahead
    swap_vma_readahead
      __read_swap_cache_async
        alloc_pages_mpol
        // *** Page charge happens here, and
        // note the second argument is NULL
        mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry)

After the patch:
swapoff
  swapin_readahead
    swapin_no_readahead
      vma_alloc_folio
        // *** Page charge happens here, and
        // note the second argument is vma->mm
      mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL, entry)

In the previous callpath (swap_vma_readahead), the mm struct info is
not passed to the final allocation/charge.
But now swapin_no_readahead can simply pass the mm struct to the
allocation/charge.

mem_cgroup_swapin_charge_folio will take the memcg of the mm_struct as
the charge target when the entry memcg is not online.
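
For reference, that fallback works roughly like this inside
mem_cgroup_swapin_charge_folio (a simplified sketch of the mainline
logic, error handling omitted):

	id = lookup_swap_cgroup_id(entry);
	rcu_read_lock();
	memcg = mem_cgroup_from_id(id);
	if (!memcg || !css_tryget_online(&memcg->css))
		/* recorded memcg is gone or offline: fall back to mm's memcg */
		memcg = get_mem_cgroup_from_mm(mm);
	rcu_read_unlock();
	/* ... charge the folio to memcg ... */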

> Does it need to be ZRAM? Will zswap or SSD work in your example?

The "swapoff moves memory charge out of a dying cgroup" issue exists
for all swap devices; this patch only changes the behavior for ZRAM
(which bypasses the swapcache and uses swapin_no_readahead).

>
> >
> > The same bug exists for readahead path too, we'll fix it in later
> > commits.
>
> As discussed in another email, this behavior change is against the
> current memcg memory charging model.
> Better to separate this behavior change out from the code cleanup.

Good suggestion.

>
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/memory.c     | 22 +++++++---------------
> >  mm/swap.h       |  6 ++----
> >  mm/swap_state.c | 33 ++++++++++++++++++++++++++-------
> >  mm/swapfile.c   |  2 +-
> >  4 files changed, 36 insertions(+), 27 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index fba4a5229163..f4237a2e3b93 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3792,6 +3792,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         rmap_t rmap_flags = RMAP_NONE;
> >         bool exclusive = false;
> >         swp_entry_t entry;
> > +       bool swapcached;
> >         pte_t pte;
> >         vm_fault_t ret = 0;
> >
> > @@ -3855,22 +3856,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         swapcache = folio;
> >
> >         if (!folio) {
> > -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > -                   __swap_count(entry) == 1) {
> > -                       /* skip swapcache and readahead */
> > -                       page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > -                                               vmf);
> > -                       if (page)
> > -                               folio = page_folio(page);
> > +               page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > +                                       vmf, &swapcached);
> > +               if (page) {
> > +                       folio = page_folio(page);
> > +                       if (swapcached)
> > +                               swapcache = folio;
> >                 } else {
> > -                       page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > -                                               vmf);
> > -                       if (page)
> > -                               folio = page_folio(page);
> > -                       swapcache = folio;
> > -               }
> > -
> > -               if (!folio) {
> >                         /*
> >                          * Back out if somebody else faulted in this pte
> >                          * while we released the pte lock.
> > diff --git a/mm/swap.h b/mm/swap.h
> > index ea4be4791394..f82d43d7b52a 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -55,9 +55,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> >                                     struct mempolicy *mpol, pgoff_t ilx);
> >  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> > -                             struct vm_fault *vmf);
> > -struct page *swapin_no_readahead(swp_entry_t entry, gfp_t flag,
> > -                                struct vm_fault *vmf);
> > +                             struct vm_fault *vmf, bool *swapcached);
> >
> >  static inline unsigned int folio_swap_flags(struct folio *folio)
> >  {
> > @@ -89,7 +87,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
> >  }
> >
> >  static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> > -                       struct vm_fault *vmf)
> > +                       struct vm_fault *vmf, bool *swapcached)
> >  {
> >         return NULL;
> >  }
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 45dd8b7c195d..fd0047ae324e 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -316,6 +316,11 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> >         release_pages(pages, nr);
> >  }
> >
> > +static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
> > +{
> > +       return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> > +}
> > +
> >  static inline bool swap_use_vma_readahead(void)
> >  {
> >         return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> > @@ -861,8 +866,8 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> >   * Returns the struct page for entry and addr after the swap entry is read
> >   * in.
> >   */
> > -struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > -                                struct vm_fault *vmf)
> > +static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > +                                       struct vm_fault *vmf)
> >  {
> >         struct vm_area_struct *vma = vmf->vma;
> >         struct page *page = NULL;
> > @@ -904,6 +909,8 @@ struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >   * @entry: swap entry of this memory
> >   * @gfp_mask: memory allocation flags
> >   * @vmf: fault information
> > + * @swapcached: pointer to a bool used as indicator if the
> > + *              page is swapped in through swapcache.
> >   *
> >   * Returns the struct page for entry and addr, after queueing swapin.
> >   *
> > @@ -912,17 +919,29 @@ struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >   * or vma-based(ie, virtual address based on faulty address) readahead.
> >   */
> >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > -                               struct vm_fault *vmf)
> > +                             struct vm_fault *vmf, bool *swapcached)
>
> At this point the function name "swapin_readahead" does not match what
> it does any more, because readahead is just one of its behaviors; it can
> also swap in without readahead. It needs a better name; this function
> becomes a generic swapin_entry.

I renamed this function in later commits; I can rename it here to
avoid confusion.

>
> >  {
> >         struct mempolicy *mpol;
> > -       pgoff_t ilx;
> >         struct page *page;
> > +       pgoff_t ilx;
> > +       bool cached;
> >
> >         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > -       page = swap_use_vma_readahead() ?
> > -               swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
> > -               swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > +       if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> > +               page = swapin_no_readahead(entry, gfp_mask, vmf);
> > +               cached = false;
> > +       } else if (swap_use_vma_readahead()) {
> > +               page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> > +               cached = true;
> > +       } else {
>
> Notice that which flavor of swapin to use is actually a property of the
> swap device: for a given device, it always calls the same swapin_entry
> function.
>
> One idea is to have a VFS-like API for swap devices. This can be a
> swapin_entry function callback in a swap_ops struct; different swap
> devices just register different swapin_entry hooks. That way we don't
> need to look at the device flags to decide what to do, we can just call
> the VFS-like swap_ops->swapin_entry(), and the function pointer will
> point to the right implementation. Then we don't need these three levels
> of if/else blocks. It is also more flexible for registering custom swap
> layouts. Something to consider for the future; you don't have to
> implement it in this series.

Interesting idea, we may look into this later.

>
> > +               page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > +               cached = true;
> > +       }
> >         mpol_cond_put(mpol);
> > +
> > +       if (swapcached)
> > +               *swapcached = cached;
> > +
> >         return page;
> >  }
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 756104ebd585..0142bfc71b81 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1874,7 +1874,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >                         };
> >
> >                         page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > -                                               &vmf);
> > +                                               &vmf, NULL);
> >                         if (page)
> >                                 folio = page_folio(page);
> >                 }
> > --
> > 2.42.0
> >
> >

Thanks!

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 06/24] swap: rework swapin_no_readahead arguments
  2023-11-19 19:47 ` [PATCH 06/24] swap: rework swapin_no_readahead arguments Kairui Song
  2023-11-20  0:20   ` kernel test robot
@ 2023-11-21  6:44   ` Chris Li
  2023-11-23 10:51     ` Kairui Song
  1 sibling, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-21  6:44 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Make it use alloc_pages_mpol instead of vma_alloc_folio, and accept
> mm_struct directly as an argument instead of taking a vmf as argument.
> Make its arguments similar to swap_{cluster,vma}_readahead, to
> make the code more aligned.

It is unclear to me what the value is of making this more aligned with
swap_{cluster,vma}_readahead here.

> Also prepare for following commits which will skip vmf for certain
> swapin paths.

The following patch 07/24 does not seem to use this patch.
Can it be merged with the other patch that uses it?

As it is, this patch does not have enough standalone value to justify it.

Chris

>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap_state.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index fd0047ae324e..ff6756f2e8e4 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -867,17 +867,17 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>   * in.
>   */
>  static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> -                                       struct vm_fault *vmf)
> +                                       struct mempolicy *mpol, pgoff_t ilx,
> +                                       struct mm_struct *mm)
>  {
> -       struct vm_area_struct *vma = vmf->vma;
> -       struct page *page = NULL;
>         struct folio *folio;
> +       struct page *page;
>         void *shadow = NULL;
>
> -       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -                               vma, vmf->address, false);
> +       page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
> +       folio = (struct folio *)page;
>         if (folio) {
> -               if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> +               if (mem_cgroup_swapin_charge_folio(folio, mm,
>                                                    GFP_KERNEL, entry)) {
>                         folio_put(folio);
>                         return NULL;
> @@ -896,7 +896,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>
>                 /* To provide entry to swap_readpage() */
>                 folio->swap = entry;
> -               page = &folio->page;
>                 swap_readpage(page, true, NULL);
>                 folio->private = NULL;
>         }
> @@ -928,7 +927,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>
>         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>         if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> -               page = swapin_no_readahead(entry, gfp_mask, vmf);
> +               page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
>                 cached = false;
>         } else if (swap_use_vma_readahead()) {
>                 page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> --
> 2.42.0
>
>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 07/24] mm/swap: move swap_count to header to be shared
  2023-11-19 19:47 ` [PATCH 07/24] mm/swap: move swap_count to header to be shared Kairui Song
@ 2023-11-21  6:51   ` Chris Li
  2023-11-21  7:03     ` Kairui Song
  0 siblings, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-21  6:51 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Hi Kairui,

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> No feature change, prepare for later commits.

Again, I don't see the value of having this as a stand alone patch.
If one of the later patches needs to use this function as external
rather than static, move it with the patch that uses it. From the
reviewing point of view, it is unnecessary overhead to cross reference
different patches in order to figure out why it is moved.

Chris

>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap.h     | 5 +++++
>  mm/swapfile.c | 5 -----
>  2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index f82d43d7b52a..a9a654af791e 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -61,6 +61,11 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
>         return page_swap_info(&folio->page)->flags;
>  }
> +
> +static inline unsigned char swap_count(unsigned char ent)
> +{
> +       return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
> +}
>  #else /* CONFIG_SWAP */
>  struct swap_iocb;
>  static inline void swap_readpage(struct page *page, bool do_poll,
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 0142bfc71b81..a8ae472ed2b6 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -114,11 +114,6 @@ static struct swap_info_struct *swap_type_to_swap_info(int type)
>         return READ_ONCE(swap_info[type]); /* rcu_dereference() */
>  }
>
> -static inline unsigned char swap_count(unsigned char ent)
> -{
> -       return ent & ~SWAP_HAS_CACHE;   /* may include COUNT_CONTINUED flag */
> -}
> -
>  /* Reclaim the swap entry anyway if possible */
>  #define TTRS_ANYWAY            0x1
>  /*
> --
> 2.42.0
>
>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 07/24] mm/swap: move swap_count to header to be shared
  2023-11-21  6:51   ` Chris Li
@ 2023-11-21  7:03     ` Kairui Song
  0 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-21  7:03 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Chris Li <chrisl@kernel.org> wrote on Tue, Nov 21, 2023 at 14:52:
>
> Hi Kairui,
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > No feature change, prepare for later commits.
>
> Again, I don't see the value of having this as a stand alone patch.
> If one of the later patches needs to use this function as external
> rather than static, move it with the patch that uses it. From the
> reviewing point of view, it is unnecessary overhead to cross reference
> different patches in order to figure out why it is moved.

Good suggestion. I do this locally for easier rebasing and conflict
resolution; I will get rid of these little preparatory parts next time.

>
> Chris
>
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swap.h     | 5 +++++
> >  mm/swapfile.c | 5 -----
> >  2 files changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index f82d43d7b52a..a9a654af791e 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -61,6 +61,11 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
> >  {
> >         return page_swap_info(&folio->page)->flags;
> >  }
> > +
> > +static inline unsigned char swap_count(unsigned char ent)
> > +{
> > +       return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
> > +}
> >  #else /* CONFIG_SWAP */
> >  struct swap_iocb;
> >  static inline void swap_readpage(struct page *page, bool do_poll,
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 0142bfc71b81..a8ae472ed2b6 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -114,11 +114,6 @@ static struct swap_info_struct *swap_type_to_swap_info(int type)
> >         return READ_ONCE(swap_info[type]); /* rcu_dereference() */
> >  }
> >
> > -static inline unsigned char swap_count(unsigned char ent)
> > -{
> > -       return ent & ~SWAP_HAS_CACHE;   /* may include COUNT_CONTINUED flag */
> > -}
> > -
> >  /* Reclaim the swap entry anyway if possible */
> >  #define TTRS_ANYWAY            0x1
> >  /*
> > --
> > 2.42.0
> >
> >

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead
  2023-11-21  6:35     ` Kairui Song
@ 2023-11-21  7:41       ` Chris Li
  2023-11-21  8:32         ` Kairui Song
  0 siblings, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-21  7:41 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Mon, Nov 20, 2023 at 10:35 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> Chris Li <chrisl@kernel.org> wrote on Tue, Nov 21, 2023 at 14:18:
> >
> > On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > This makes swapin_readahead a main entry for swapin pages,
> > > prepare for optimizations in later commits.
> > >
> > > This also makes swapoff able to make use of readahead checking
> > > based on entry. Swapping off a 10G ZRAM (lzo-rle) is faster:
> > >
> > > Before:
> > > time swapoff /dev/zram0
> > > real    0m12.337s
> > > user    0m0.001s
> > > sys     0m12.329s
> > >
> > > After:
> > > time swapoff /dev/zram0
> > > real    0m9.728s
> > > user    0m0.001s
> > > sys     0m9.719s
> > >
> > > And what's more, because now swapoff will also make use of no-readahead
> > > swapin helper, this also fixed a bug for no-readahead case (eg. ZRAM):
> > > when a process that swapped out some memory previously was moved to a new
> > > cgroup, and the original cgroup is dead, swapoff the swap device will
> > > make the swapped in pages accounted into the process doing the swapoff
> > > instead of the new cgroup the process was moved to.
> > >
> > > This can be easily reproduced by:
> > > - Setup a ramdisk (eg. ZRAM) swap.
> > > - Create memory cgroup A, B and C.
> > > - Spawn process P1 in cgroup A and make it swap out some pages.
> > > - Move process P1 to memory cgroup B.
> > > - Destroy cgroup A.
> > > - Do a swapoff in cgroup C.
> > > - Swapped in pages is accounted into cgroup C.

In a strange way it makes sense to charge to C.
Swap out == free up memory.
Swap in == consume memory.
C turns off swap, and effectively this behavior will consume a lot of memory.
C gets charged, so if C runs out of memory, C gets punished.
C will not be able to continue swapping in memory, and the problem stays under control.

> > >
> > > This patch will fix it make the swapped in pages accounted in cgroup B.

One problem I can see with your fix is that C is the one doing the
bad deed, causing consumption of system memory. Punishing B is not going
to stop C from continuing the bad deed. If you let C run against B's
limit, things get complicated very quickly.

> > Can you help me understand where the code makes this behavior change?
> > As far as I can tell, the patch just allows swapoff to use the code path
> > that avoids readahead and the swap cache. I haven't figured out where
> > the code makes the swapin get charged to B.
>
> Yes, swapoff always calls swapin_readahead to swap in pages. Before this
> patch, the page allocation/charge path is like this:
> (Here we assume there is only a ZRAM device, so VMA readahead is used)
> swapoff
>   swapin_readahead
>     swap_vma_readahead
>       __read_swap_cache_async
>         alloc_pages_mpol
>         // *** Page charge happens here, and
>         // note the second argument is NULL
>         mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry)
>
> After the patch:
> swapoff
>   swapin_readahead
>     swapin_no_readahead
>       vma_alloc_folio
>         // *** Page charge happens here, and
>         // note the second argument is vma->mm
>       mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL, entry)

I see. Thanks for the detailed explanation. With that information, let
me go over your patch again.

> In the previous callpath (swap_vma_readahead), the mm struct info is
> not passed to the final allocation/charge.
> But now swapin_no_readahead can simply pass the mm struct to the
> allocation/charge.
>
> mem_cgroup_swapin_charge_folio will take the memcg of the mm_struct as
> the charge target when the entry memcg is not online.
>
> > Does it need to be ZRAM? Will zswap or SSD work in your example?
>
> The "swapoff moves memory charge out of a dying cgroup" issue exists

There is a whole class of zombie memcg issues the current memory
cgroup model can't handle well.

> for all swap devices; this patch only changes the behavior for ZRAM
> (which bypasses the swapcache and uses swapin_no_readahead).

Thanks for the clarification.

Chris

>
> >
> > >
> > > The same bug exists for readahead path too, we'll fix it in later
> > > commits.
> >
> > As discussed in another email, this behavior change is against the
> > current memcg memory charging model.
> > Better to separate this behavior change out from the code cleanup.
>
> Good suggestion.
>
> >
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  mm/memory.c     | 22 +++++++---------------
> > >  mm/swap.h       |  6 ++----
> > >  mm/swap_state.c | 33 ++++++++++++++++++++++++++-------
> > >  mm/swapfile.c   |  2 +-
> > >  4 files changed, 36 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index fba4a5229163..f4237a2e3b93 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3792,6 +3792,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >         rmap_t rmap_flags = RMAP_NONE;
> > >         bool exclusive = false;
> > >         swp_entry_t entry;
> > > +       bool swapcached;
> > >         pte_t pte;
> > >         vm_fault_t ret = 0;
> > >
> > > @@ -3855,22 +3856,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >         swapcache = folio;
> > >
> > >         if (!folio) {
> > > -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > > -                   __swap_count(entry) == 1) {
> > > -                       /* skip swapcache and readahead */
> > > -                       page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > > -                                               vmf);
> > > -                       if (page)
> > > -                               folio = page_folio(page);
> > > +               page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > > +                                       vmf, &swapcached);
> > > +               if (page) {
> > > +                       folio = page_folio(page);
> > > +                       if (swapcached)
> > > +                               swapcache = folio;
> > >                 } else {
> > > -                       page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > > -                                               vmf);
> > > -                       if (page)
> > > -                               folio = page_folio(page);
> > > -                       swapcache = folio;
> > > -               }
> > > -
> > > -               if (!folio) {
> > >                         /*
> > >                          * Back out if somebody else faulted in this pte
> > >                          * while we released the pte lock.
> > > diff --git a/mm/swap.h b/mm/swap.h
> > > index ea4be4791394..f82d43d7b52a 100644
> > > --- a/mm/swap.h
> > > +++ b/mm/swap.h
> > > @@ -55,9 +55,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > >  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> > >                                     struct mempolicy *mpol, pgoff_t ilx);
> > >  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> > > -                             struct vm_fault *vmf);
> > > -struct page *swapin_no_readahead(swp_entry_t entry, gfp_t flag,
> > > -                                struct vm_fault *vmf);
> > > +                             struct vm_fault *vmf, bool *swapcached);
> > >
> > >  static inline unsigned int folio_swap_flags(struct folio *folio)
> > >  {
> > > @@ -89,7 +87,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
> > >  }
> > >
> > >  static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> > > -                       struct vm_fault *vmf)
> > > +                       struct vm_fault *vmf, bool *swapcached)
> > >  {
> > >         return NULL;
> > >  }
> > > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > > index 45dd8b7c195d..fd0047ae324e 100644
> > > --- a/mm/swap_state.c
> > > +++ b/mm/swap_state.c
> > > @@ -316,6 +316,11 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> > >         release_pages(pages, nr);
> > >  }
> > >
> > > +static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
> > > +{
> > > +       return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> > > +}
> > > +
> > >  static inline bool swap_use_vma_readahead(void)
> > >  {
> > >         return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> > > @@ -861,8 +866,8 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> > >   * Returns the struct page for entry and addr after the swap entry is read
> > >   * in.
> > >   */
> > > -struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > > -                                struct vm_fault *vmf)
> > > +static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > > +                                       struct vm_fault *vmf)
> > >  {
> > >         struct vm_area_struct *vma = vmf->vma;
> > >         struct page *page = NULL;
> > > @@ -904,6 +909,8 @@ struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > >   * @entry: swap entry of this memory
> > >   * @gfp_mask: memory allocation flags
> > >   * @vmf: fault information
> > > + * @swapcached: pointer to a bool used as indicator if the
> > > + *              page is swapped in through swapcache.
> > >   *
> > >   * Returns the struct page for entry and addr, after queueing swapin.
> > >   *
> > > @@ -912,17 +919,29 @@ struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > >   * or vma-based(ie, virtual address based on faulty address) readahead.
> > >   */
> > >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > > -                               struct vm_fault *vmf)
> > > +                             struct vm_fault *vmf, bool *swapcached)
> >
> > At this point the function name "swapin_readahead" does not match what
> > it does any more, because readahead is just one of the behaviors it
> > performs; it can also skip readahead entirely. It needs a better name.
> > This function becomes a generic swapin_entry.
>
> I renamed this function in later commits; I can rename it here to
> avoid confusion.
>
> >
> > >  {
> > >         struct mempolicy *mpol;
> > > -       pgoff_t ilx;
> > >         struct page *page;
> > > +       pgoff_t ilx;
> > > +       bool cached;
> > >
> > >         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > > -       page = swap_use_vma_readahead() ?
> > > -               swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
> > > -               swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > > +       if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> > > +               page = swapin_no_readahead(entry, gfp_mask, vmf);
> > > +               cached = false;
> > > +       } else if (swap_use_vma_readahead()) {
> > > +               page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> > > +               cached = true;
> > > +       } else {
> >
> > Notice that which flavor of swapin readahead to use is actually a
> > property of the swap device.
> > For a given device, it always calls the same swapin_entry function.
> >
> > One idea is to have a VFS-like API for swap devices. This could be a
> > swapin_entry function callback in a swap_ops struct. Different
> > swap devices just register different swapin_entry hooks. That way we
> > don't need to look at the device flags to decide what to do; we can
> > just call the VFS-like swap_ops->swapin_entry(), and the function
> > pointer will point to the right implementation. Then we don't need
> > this three-level if/else block. It is also more flexible for
> > registering custom implementations of swap layouts. Something to
> > consider for the future; you don't have to implement it in this series.
>
> Interesting idea, we may look into this later.
>
> >
> > > +               page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > > +               cached = true;
> > > +       }
> > >         mpol_cond_put(mpol);
> > > +
> > > +       if (swapcached)
> > > +               *swapcached = cached;
> > > +
> > >         return page;
> > >  }
> > >
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 756104ebd585..0142bfc71b81 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1874,7 +1874,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > >                         };
> > >
> > >                         page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > > -                                               &vmf);
> > > +                                               &vmf, NULL);
> > >                         if (page)
> > >                                 folio = page_folio(page);
> > >                 }
> > > --
> > > 2.42.0
> > >
> > >
>
> Thanks!
>


* Re: [PATCH 08/24] mm/swap: check readahead policy per entry
  2023-11-19 19:47 ` [PATCH 08/24] mm/swap: check readahead policy per entry Kairui Song
  2023-11-20  6:04   ` Huang, Ying
@ 2023-11-21  7:54   ` Chris Li
  2023-11-23 10:52     ` Kairui Song
  1 sibling, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-21  7:54 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently VMA readahead is globally disabled when any rotating disk is
> used as a swap backend. So when multiple swap devices are enabled, if a
> slower hard disk is set as a low priority fallback and a high performance
> SSD is used as the high priority swap device, VMA readahead is still
> disabled globally, and the SSD swap device's performance drops by a lot.
>
> Check the readahead policy per entry to avoid such a problem.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap_state.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ff6756f2e8e4..fb78f7f18ed7 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
>         return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
>  }
>
> -static inline bool swap_use_vma_readahead(void)
> +static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
>  {
> -       return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> +       return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);

A very minor point:
I notice you moved the enable_vma_readahead check to the last position.
Normally if enable_vma_readahead == 0, there is no need to check si->flags at all.
The si->flags check is more expensive than a simple memory load.
You might want to check enable_vma_readahead first so you can short-circuit
the more expensive part, e.g. as sketched below.
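
A rough sketch of what I mean (same two checks as in your hunk above, just
reordered so the cheap READ_ONCE() is evaluated first):

    static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
    {
        return READ_ONCE(enable_vma_readahead) &&
               data_race(si->flags & SWP_SOLIDSTATE);
    }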

Chris


* Re: [PATCH 09/24] mm/swap: inline __swap_count
  2023-11-20  7:41   ` Huang, Ying
@ 2023-11-21  8:02     ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-21  8:02 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Kairui Song, linux-mm, Kairui Song, Andrew Morton,
	David Hildenbrand, Hugh Dickins, Johannes Weiner, Matthew Wilcox,
	Michal Hocko, linux-kernel

On Sun, Nov 19, 2023 at 11:43 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > There is only one caller in the swap subsystem now, where it can be inlined
> > smoothly, avoiding the memory access and function call overheads.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  include/linux/swap.h | 6 ------
> >  mm/swap_state.c      | 6 +++---
> >  mm/swapfile.c        | 8 --------
> >  3 files changed, 3 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2401990d954d..64a37819a9b3 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -485,7 +485,6 @@ int swap_type_of(dev_t device, sector_t offset);
> >  int find_first_swap(dev_t *device);
> >  extern unsigned int count_swap_pages(int, int);
> >  extern sector_t swapdev_block(int, pgoff_t);
> > -extern int __swap_count(swp_entry_t entry);
> >  extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
> >  extern int swp_swapcount(swp_entry_t entry);
> >  extern struct swap_info_struct *page_swap_info(struct page *);
> > @@ -559,11 +558,6 @@ static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >  {
> >  }
> >
> > -static inline int __swap_count(swp_entry_t entry)
> > -{
> > -     return 0;
> > -}
> > -
> >  static inline int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
> >  {
> >       return 0;
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index fb78f7f18ed7..d87c20f9f7ec 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -316,9 +316,9 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> >       release_pages(pages, nr);
> >  }
> >
> > -static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
> > +static inline bool swap_use_no_readahead(struct swap_info_struct *si, pgoff_t offset)
> >  {
> > -     return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> > +     return data_race(si->flags & SWP_SYNCHRONOUS_IO) && swap_count(si->swap_map[offset]) == 1;
> >  }
> >
> >  static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> > @@ -928,7 +928,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >
> >       si = swp_swap_info(entry);
> >       mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > -     if (swap_use_no_readahead(si, entry)) {
> > +     if (swap_use_no_readahead(si, swp_offset(entry))) {
> >               page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> >               cached = false;
> >       } else if (swap_use_vma_readahead(si)) {
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index a8ae472ed2b6..e15a6c464a38 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1431,14 +1431,6 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
> >               spin_unlock(&p->lock);
> >  }
> >
> > -int __swap_count(swp_entry_t entry)
> > -{
> > -     struct swap_info_struct *si = swp_swap_info(entry);
> > -     pgoff_t offset = swp_offset(entry);
> > -
> > -     return swap_count(si->swap_map[offset]);
> > -}
> > -
>
> I'd rather keep __swap_count() in the original place together with other
> swap count related functions.  And si->swap_map[] was hidden in
> swapfile.c before.  I don't think the change will have any real
> performance improvement.

I agree with Ying here. This does not seem to have enough value to
justify a patch by itself.

Chris


* Re: [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead
  2023-11-21  7:41       ` Chris Li
@ 2023-11-21  8:32         ` Kairui Song
  2023-11-21 15:24           ` Chris Li
  0 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-21  8:32 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Tue, Nov 21, 2023 at 15:41, Chris Li <chrisl@kernel.org> wrote:
>
> On Mon, Nov 20, 2023 at 10:35 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Tue, Nov 21, 2023 at 14:18, Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > This makes swapin_readahead the main entry for swapin pages,
> > > > preparing for optimizations in later commits.
> > > >
> > > > This also makes swapoff able to make use of readahead checking
> > > > based on entry. Swapping off a 10G ZRAM (lzo-rle) is faster:
> > > >
> > > > Before:
> > > > time swapoff /dev/zram0
> > > > real    0m12.337s
> > > > user    0m0.001s
> > > > sys     0m12.329s
> > > >
> > > > After:
> > > > time swapoff /dev/zram0
> > > > real    0m9.728s
> > > > user    0m0.001s
> > > > sys     0m9.719s
> > > >
> > > > And what's more, because swapoff now also makes use of the no-readahead
> > > > swapin helper, this also fixes a bug for the no-readahead case (eg. ZRAM):
> > > > when a process that previously swapped out some memory was moved to a new
> > > > cgroup, and the original cgroup is dead, swapping off the swap device will
> > > > make the swapped-in pages accounted to the process doing the swapoff
> > > > instead of to the new cgroup the process was moved to.
> > > >
> > > > This can be easily reproduced by:
> > > > - Setup a ramdisk (eg. ZRAM) swap.
> > > > - Create memory cgroup A, B and C.
> > > > - Spawn process P1 in cgroup A and make it swap out some pages.
> > > > - Move process P1 to memory cgroup B.
> > > > - Destroy cgroup A.
> > > > - Do a swapoff in cgroup C.
> > > > - Swapped in pages is accounted into cgroup C.
>
> In a strange way it makes sense to charge to C.
> Swap out == free up memory.
> Swap in == consume memory.
> C turn off swap, effectively this behavior will consume a lot of memory.
> C gets charged, so if the C is out of memory, it will punish C.
> C will not be able to continue swap in memory. The problem gets under control.

Yes, I think charging either C or B makes sense in its own way. To
me the current behavior is kind of counter-intuitive.

Imagine there is a cgroup PC1 with child cgroups CC1 and CC2. A process
swapped out some memory in CC1, then moved to CC2, and CC1 is dying.
On swapoff the charge will be moved out of PC1...

And swapoff often happens in some unlimited admin cgroup or in a
cgroup for management agents.

If PC1 has a memory limit, the process in it can easily breach that limit:
we will see a process that never left PC1 having a much higher RSS
than PC1/CC1/CC2's limit.

And if there is a limit on the management agent's cgroup, the agent
will OOM instead of PC1.

Simply moving a process between child cgroups of the same parent
cgroup won't cause a similar issue; things only get weird when swapoff is
involved.

And actually, with multiple layers of swap, it's less risky to swapoff
a device since other swap devices can catch the over-committed memory.

Oh, and there is one more case I forgot to cover in this series:
moving a process is indeed not something that happens very frequently,
but a process running in a cgroup, exiting, and leaving some shmem
swapped out could be a common case.
The current behavior on swapoff will move these charges out of the
original parent cgroup too.

So maybe a more ideal solution for swapoff is: simply always charge the
dying cgroup's parent cgroup?

Maybe a sysctl/cmdline option could be introduced to control the behavior.
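
By "charge the dying cgroup's parent" I mean roughly the sketch below. This
is hypothetical, not what the kernel does today; it only reuses existing
helpers (parent_mem_cgroup(), mem_cgroup_online(), css_tryget_online(),
get_mem_cgroup_from_mm()), and the fallback to the mm's memcg stays as a
last resort:

    /* in the swapin charge path, after looking up the entry's memcg */
    while (memcg && !mem_cgroup_online(memcg))
        memcg = parent_mem_cgroup(memcg);  /* nearest online ancestor */
    if (!memcg || !css_tryget_online(&memcg->css))
        memcg = get_mem_cgroup_from_mm(mm);  /* last resort: current behavior */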


* Re: [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead
  2023-11-21  8:32         ` Kairui Song
@ 2023-11-21 15:24           ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-21 15:24 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Tue, Nov 21, 2023 at 12:33 AM Kairui Song <ryncsn@gmail.com> wrote:
> > In a strange way it makes sense to charge to C.
> > Swap out == free up memory.
> > Swap in == consume memory.
> > C turn off swap, effectively this behavior will consume a lot of memory.
> > C gets charged, so if the C is out of memory, it will punish C.
> > C will not be able to continue swap in memory. The problem gets under control.
>
> Yes, I think charging either C or B makes sense in their own way. To
> me I think current behavior is kind of counter-intuitive.
>
> Image if there are cgroup PC1, and its child cgroup CC1, CC2. If a process
> swapped out some memory in CC1 then moved to CC2, and CC1 is dying.
> On swapoff the charge will be moved out of PC1...

Yes. In the spirit of punishing the one that is actively doing bad
deeds, moving the swapoff charge out of PC1 makes sense, because the
memory effectively is not used in PC1 any more.

The question is: why do you want to move the process from CC1 to CC2?
Moving processes between cgroups is not something supported very well in
the current cgroup model. The memory already charged to the current
cgroup will not move with the process.

> And swapoff often happens in some unlimited admin cgroup or some
> cgroup for management agents.
>
> If PC1 has a memory limit, the process in it can breach the limit easily,
> we will see a process that never left PC1 having a much higher RSS
> than PC1/CC1/CC2's limit.

Hmm, how do you set the limit on PC1? Do you set a hard limit
"memory.max"? What does the "PC1/memory.stat" show?
If you have a script to reproduce this behavior, I would love to learn more.

> And if there is a limit for the management agent cgroup, the agent
> will be OOM instead of OOM in PC1.

It seems what you want is a way for the admin agent to turn
off swapping for PC1 and let PC1 OOM, right?
In other words, you want the admin agent to turn off swapping in PC1's context.
How about starting a bash script that adds itself to PC1/cgroup.procs, then
calls swapoff from within PC1?
Will that solve your problem?

I am still not sure why you want to turn off swap for PC1. If you know
PC1 is going to OOM, why not just kill it? A little bit more context on
the problem will help me understand your use case.

> Simply moving a process between the child cgroup of the same parent
> cgroup won't cause a similar issue, things get weird when swapoff is
> involved.

Again, having a reproducing script is very helpful for understanding
what is going on. It can serve as a regression testing tool.

> And actually with multiple layers of swap, it's less risky to swapoff
> a device since other swap devices can catch over committed memory.

Sure. Do you have any feedback on whether having "memory.swap.tiers" would
address your need to control swap usage for an individual cgroup?
>
> Oh, and there is one more case I forgot to cover in this series:
> Moving a process is indeed something not happening very frequently,
> but a process run in cgroup then exit, and leave some shmem swapped
> out could be a common case.

That is the zombie memcg problem with shared memory. The current
cgroup model does not work well with shared memory objects. It is another
can of worms.

> Current behavior on swapoff will move these charges out of the
> original parent cgroup too.
>
> So maybe a more ideal solution for swapoff is: simply always charge a
> dying cgroup parent cgroup?

That is the recharging idea. In this year's LSFMM there is a
presentation about zombie memcgs. Recharging/reparenting is one of the
proposals. If I recall correctly, someone from SUSE made some comment
that they tried it in their kernel at one point. But it had a lot of
issues and the feature was removed from the kernel eventually.

I also made a proposal for a shared memory cgroup model. I can
dig it up if you are interested.

> Maybe a sysctl/cmdline could be introduced to control the behavior.

I am against this kind of one-off behavior change without a consistent
model of charging the memory. Even behind a sysctl, it is code we
need to maintain and test to make sure it functions properly.
If you want to change the memory charging model, I would like to see
what high level principle it follows, so we can reason about it in
other usage cases consistently. The last thing I want to see is making
up rules as we go; then we might end up with contradicting rules
and back ourselves into a corner in the future.
I do care about your usage case; I just want to address it consistently.
My understanding of the current memory cgroup model is: charge the
first user. Punish the trouble maker.

BTW, I am still reviewing your 24-patch series, halfway through. It
will take me a bit of time to write up the reply.

Chris


* Re: [PATCH 10/24] mm/swap: remove nr_rotate_swap and related code
  2023-11-19 19:47 ` [PATCH 10/24] mm/swap: remove nr_rotate_swap and related code Kairui Song
@ 2023-11-21 15:45   ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-21 15:45 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Hi Kairui,

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> No longer needed after we switched to a per-entry swap readahead policy.

This is a behavior change patch; better to separate it out from the cleanup
series for exposure and discussion.

I think the idea is reasonable. The policy should be made at the device
level rather than the system level. I saw Ying has some great feedback
regarding readahead crossing device boundaries.

Chris


* Re: [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead
  2023-11-19 19:47 ` [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead Kairui Song
  2023-11-20  0:47   ` kernel test robot
@ 2023-11-21 16:06   ` Chris Li
  2023-11-24  8:42     ` Kairui Song
  1 sibling, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-21 16:06 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> No feature change, just prepare for later commits.

You need to have a proper commit message explaining why this change needs
to happen. "Preparing" is too generic; it does not give any real information.
For example, it seems you want to drop one swap cache lookup because
swapin_readahead() already does it?

I am a bit puzzled by this patch. It shuffles a lot of sensitive code,
but I do not get the value.
It seems like this patch should be merged with the later patch that
depends on it, so they can be judged together.

>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c     | 61 +++++++++++++++++++++++--------------------------
>  mm/swap.h       | 10 ++++++--
>  mm/swap_state.c | 26 +++++++++++++--------
>  mm/swapfile.c   | 30 +++++++++++-------------
>  4 files changed, 66 insertions(+), 61 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f4237a2e3b93..22af9f3e8c75 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3786,13 +3786,13 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>  vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>         struct vm_area_struct *vma = vmf->vma;
> -       struct folio *swapcache, *folio = NULL;
> +       struct folio *swapcache = NULL, *folio = NULL;
> +       enum swap_cache_result cache_result;
>         struct page *page;
>         struct swap_info_struct *si = NULL;
>         rmap_t rmap_flags = RMAP_NONE;
>         bool exclusive = false;
>         swp_entry_t entry;
> -       bool swapcached;
>         pte_t pte;
>         vm_fault_t ret = 0;
>
> @@ -3850,42 +3850,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (unlikely(!si))
>                 goto out;
>
> -       folio = swap_cache_get_folio(entry, vma, vmf->address);
> -       if (folio)
> -               page = folio_file_page(folio, swp_offset(entry));
> -       swapcache = folio;

Is the motivation that swapin_readahead() already does a swap cache lookup,
so you remove the lookup here?


> -
> -       if (!folio) {
> -               page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -                                       vmf, &swapcached);
> -               if (page) {
> -                       folio = page_folio(page);
> -                       if (swapcached)
> -                               swapcache = folio;
> -               } else {
> +       page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +                               vmf, &cache_result);
> +       if (page) {
> +               folio = page_folio(page);
> +               if (cache_result != SWAP_CACHE_HIT) {
> +                       /* Had to read the page from swap area: Major fault */
> +                       ret = VM_FAULT_MAJOR;
> +                       count_vm_event(PGMAJFAULT);
> +                       count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> +               }
> +               if (cache_result != SWAP_CACHE_BYPASS)
> +                       swapcache = folio;
> +               if (PageHWPoison(page)) {

There is a lot of code shuffling here. From the diff it is hard to tell
whether it is doing the same thing as before.

>                         /*
> -                        * Back out if somebody else faulted in this pte
> -                        * while we released the pte lock.
> +                        * hwpoisoned dirty swapcache pages are kept for killing
> +                        * owner processes (which may be unknown at hwpoison time)
>                          */
> -                       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> -                                       vmf->address, &vmf->ptl);
> -                       if (likely(vmf->pte &&
> -                                  pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> -                               ret = VM_FAULT_OOM;
> -                       goto unlock;
> +                       ret = VM_FAULT_HWPOISON;
> +                       goto out_release;
>                 }
> -
> -               /* Had to read the page from swap area: Major fault */
> -               ret = VM_FAULT_MAJOR;
> -               count_vm_event(PGMAJFAULT);
> -               count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> -       } else if (PageHWPoison(page)) {
> +       } else {
>                 /*
> -                * hwpoisoned dirty swapcache pages are kept for killing
> -                * owner processes (which may be unknown at hwpoison time)
> +                * Back out if somebody else faulted in this pte
> +                * while we released the pte lock.
>                  */
> -               ret = VM_FAULT_HWPOISON;
> -               goto out_release;
> +               vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> +                               vmf->address, &vmf->ptl);
> +               if (likely(vmf->pte &&
> +                          pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> +                       ret = VM_FAULT_OOM;
> +               goto unlock;
>         }
>
>         ret |= folio_lock_or_retry(folio, vmf);
> diff --git a/mm/swap.h b/mm/swap.h
> index a9a654af791e..ac9136eee690 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -30,6 +30,12 @@ extern struct address_space *swapper_spaces[];
>         (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
>                 >> SWAP_ADDRESS_SPACE_SHIFT])
>
> +enum swap_cache_result {
> +       SWAP_CACHE_HIT,
> +       SWAP_CACHE_MISS,
> +       SWAP_CACHE_BYPASS,
> +};

Does any function later care about CACHE_BYPASS?

Again, better to introduce it together with the function that uses it. Don't
introduce it "just in case I might use it later".

> +
>  void show_swap_cache_info(void);
>  bool add_to_swap(struct folio *folio);
>  void *get_shadow_from_swap_cache(swp_entry_t entry);
> @@ -55,7 +61,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>                                     struct mempolicy *mpol, pgoff_t ilx);
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> -                             struct vm_fault *vmf, bool *swapcached);
> +                             struct vm_fault *vmf, enum swap_cache_result *result);
>
>  static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
> @@ -92,7 +98,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
>  }
>
>  static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> -                       struct vm_fault *vmf, bool *swapcached)
> +                       struct vm_fault *vmf, enum swap_cache_result *result)
>  {
>         return NULL;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index d87c20f9f7ec..e96d63bf8a22 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -908,8 +908,7 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>   * @entry: swap entry of this memory
>   * @gfp_mask: memory allocation flags
>   * @vmf: fault information
> - * @swapcached: pointer to a bool used as indicator if the
> - *              page is swapped in through swapcache.
> + * @result: a return value to indicate swap cache usage.
>   *
>   * Returns the struct page for entry and addr, after queueing swapin.
>   *
> @@ -918,30 +917,39 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>   * or vma-based(ie, virtual address based on faulty address) readahead.
>   */
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> -                             struct vm_fault *vmf, bool *swapcached)
> +                             struct vm_fault *vmf, enum swap_cache_result *result)
>  {
> +       enum swap_cache_result cache_result;
>         struct swap_info_struct *si;
>         struct mempolicy *mpol;
> +       struct folio *folio;
>         struct page *page;
>         pgoff_t ilx;
> -       bool cached;
> +
> +       folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
> +       if (folio) {
> +               page = folio_file_page(folio, swp_offset(entry));
> +               cache_result = SWAP_CACHE_HIT;
> +               goto done;
> +       }
>
>         si = swp_swap_info(entry);
>         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>         if (swap_use_no_readahead(si, swp_offset(entry))) {
>                 page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> -               cached = false;
> +               cache_result = SWAP_CACHE_BYPASS;
>         } else if (swap_use_vma_readahead(si)) {
>                 page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> -               cached = true;
> +               cache_result = SWAP_CACHE_MISS;
>         } else {
>                 page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> -               cached = true;
> +               cache_result = SWAP_CACHE_MISS;
>         }
>         mpol_cond_put(mpol);
>
> -       if (swapcached)
> -               *swapcached = cached;
> +done:
> +       if (result)
> +               *result = cache_result;
>
>         return page;
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 01c3f53b6521..b6d57fff5e21 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1822,13 +1822,21 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>
>         si = swap_info[type];
>         do {
> -               struct folio *folio;
> +               struct page *page;
>                 unsigned long offset;
>                 unsigned char swp_count;
> +               struct folio *folio = NULL;
>                 swp_entry_t entry;
>                 int ret;
>                 pte_t ptent;
>
> +               struct vm_fault vmf = {
> +                       .vma = vma,
> +                       .address = addr,
> +                       .real_address = addr,
> +                       .pmd = pmd,
> +               };

Is this code move caused by skipping the swap cache lookup here?

This is very sensitive code related to swap cache racing. It needs
very careful review. Better not to shuffle it without a good reason.

Chris

> +
>                 if (!pte++) {
>                         pte = pte_offset_map(pmd, addr);
>                         if (!pte)
> @@ -1847,22 +1855,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>                 offset = swp_offset(entry);
>                 pte_unmap(pte);
>                 pte = NULL;
> -
> -               folio = swap_cache_get_folio(entry, vma, addr);
> -               if (!folio) {
> -                       struct page *page;
> -                       struct vm_fault vmf = {
> -                               .vma = vma,
> -                               .address = addr,
> -                               .real_address = addr,
> -                               .pmd = pmd,
> -                       };
> -
> -                       page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -                                               &vmf, NULL);
> -                       if (page)
> -                               folio = page_folio(page);
> -               }
> +               page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +                                       &vmf, NULL);
> +               if (page)
> +                       folio = page_folio(page);
>                 if (!folio) {
>                         /*
>                          * The entry could have been freed, and will not
> --
> 2.42.0
>
>


* Re: [PATCH 12/24] mm/swap: simplify arguments for swap_cache_get_folio
  2023-11-19 19:47 ` [PATCH 12/24] mm/swap: simplify arguments for swap_cache_get_folio Kairui Song
@ 2023-11-21 16:36   ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-21 16:36 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> There are only two callers now, so simplify the arguments.

I don't think this patch is needed. It will not have a real impact on
the resulting kernel.

>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/shmem.c      |  2 +-
>  mm/swap.h       |  2 +-
>  mm/swap_state.c | 15 +++++++--------
>  3 files changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0d1ce70bce38..72239061c655 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1875,7 +1875,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>         }
>
>         /* Look it up and read it in.. */
> -       folio = swap_cache_get_folio(swap, NULL, 0);
> +       folio = swap_cache_get_folio(swap, NULL);
>         if (!folio) {
>                 /* Or update major stats only when swapin succeeds?? */
>                 if (fault_type) {
> diff --git a/mm/swap.h b/mm/swap.h
> index ac9136eee690..e43e965f123f 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -47,7 +47,7 @@ void delete_from_swap_cache(struct folio *folio);
>  void clear_shadow_from_swap_cache(int type, unsigned long begin,
>                                   unsigned long end);
>  struct folio *swap_cache_get_folio(swp_entry_t entry,
> -               struct vm_area_struct *vma, unsigned long addr);
> +                                  struct vm_fault *vmf);
>  struct folio *filemap_get_incore_folio(struct address_space *mapping,
>                 pgoff_t index);
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index e96d63bf8a22..91461e26a8cc 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -334,8 +334,7 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
>   *
>   * Caller must lock the swap device or hold a reference to keep it valid.
>   */
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> -               struct vm_area_struct *vma, unsigned long addr)
> +struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)

I actually prefer the original code. The vm_fault is a much heavier object;
vma and address are really the minimum this function needs. I consider
vma and address, even though they are two arguments, simpler than a
vm_fault struct.

It is possible some other non-fault path wants to look up the swap cache,
and constructing a vmf just to carry the vma and address is kind of unnatural.

Chris

>  {
>         struct folio *folio;
>
> @@ -352,22 +351,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
>                         return folio;
>
>                 readahead = folio_test_clear_readahead(folio);
> -               if (vma && vma_ra) {
> +               if (vmf && vma_ra) {
>                         unsigned long ra_val;
>                         int win, hits;
>
> -                       ra_val = GET_SWAP_RA_VAL(vma);
> +                       ra_val = GET_SWAP_RA_VAL(vmf->vma);
>                         win = SWAP_RA_WIN(ra_val);
>                         hits = SWAP_RA_HITS(ra_val);
>                         if (readahead)
>                                 hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> -                       atomic_long_set(&vma->swap_readahead_info,
> -                                       SWAP_RA_VAL(addr, win, hits));
> +                       atomic_long_set(&vmf->vma->swap_readahead_info,
> +                                       SWAP_RA_VAL(vmf->address, win, hits));
>                 }
>
>                 if (readahead) {
>                         count_vm_event(SWAP_RA_HIT);
> -                       if (!vma || !vma_ra)
> +                       if (!vmf || !vma_ra)
>                                 atomic_inc(&swapin_readahead_hits);
>                 }
>         } else {
> @@ -926,7 +925,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>         struct page *page;
>         pgoff_t ilx;
>
> -       folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
> +       folio = swap_cache_get_folio(entry, vmf);
>         if (folio) {
>                 page = folio_file_page(folio, swp_offset(entry));
>                 cache_result = SWAP_CACHE_HIT;
> --
> 2.42.0
>
>


* Re: [PATCH 13/24] swap: simplify swap_cache_get_folio
  2023-11-19 19:47 ` [PATCH 13/24] swap: simplify swap_cache_get_folio Kairui Song
@ 2023-11-21 16:50   ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-21 16:50 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Hi Kairui,

I agree the resulting code is marginally better.

However, this change does not bring enough value to justify a standalone patch.
It has no impact on the resulting kernel.

It would be much better if the code had been checked in like this in the
first place, or to rewrite it better when you are modifying this function
anyway. In my opinion, doing very trivial code shuffling for the sake of
cleaning up is not justified by the value it brings.
For one, it will make git blame less clear about who actually changed
that code and for what reason. I am against trivial code shuffling.

Chris

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Rearrange the if statement, reduce the code indent, no feature change.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap_state.c | 58 ++++++++++++++++++++++++-------------------------
>  1 file changed, 28 insertions(+), 30 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 91461e26a8cc..3b5a34f47192 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -336,41 +336,39 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
>   */
>  struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)
>  {
> +       bool vma_ra, readahead;
>         struct folio *folio;
>
>         folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
> -       if (!IS_ERR(folio)) {
> -               bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
> -               bool readahead;
> +       if (IS_ERR(folio))
> +               return NULL;
>
> -               /*
> -                * At the moment, we don't support PG_readahead for anon THP
> -                * so let's bail out rather than confusing the readahead stat.
> -                */
> -               if (unlikely(folio_test_large(folio)))
> -                       return folio;
> -
> -               readahead = folio_test_clear_readahead(folio);
> -               if (vmf && vma_ra) {
> -                       unsigned long ra_val;
> -                       int win, hits;
> -
> -                       ra_val = GET_SWAP_RA_VAL(vmf->vma);
> -                       win = SWAP_RA_WIN(ra_val);
> -                       hits = SWAP_RA_HITS(ra_val);
> -                       if (readahead)
> -                               hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> -                       atomic_long_set(&vmf->vma->swap_readahead_info,
> -                                       SWAP_RA_VAL(vmf->address, win, hits));
> -               }
> +       /*
> +        * At the moment, we don't support PG_readahead for anon THP
> +        * so let's bail out rather than confusing the readahead stat.
> +        */
> +       if (unlikely(folio_test_large(folio)))
> +               return folio;
>
> -               if (readahead) {
> -                       count_vm_event(SWAP_RA_HIT);
> -                       if (!vmf || !vma_ra)
> -                               atomic_inc(&swapin_readahead_hits);
> -               }
> -       } else {
> -               folio = NULL;
> +       vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
> +       readahead = folio_test_clear_readahead(folio);
> +       if (vmf && vma_ra) {
> +               unsigned long ra_val;
> +               int win, hits;
> +
> +               ra_val = GET_SWAP_RA_VAL(vmf->vma);
> +               win = SWAP_RA_WIN(ra_val);
> +               hits = SWAP_RA_HITS(ra_val);
> +               if (readahead)
> +                       hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> +               atomic_long_set(&vmf->vma->swap_readahead_info,
> +                               SWAP_RA_VAL(vmf->address, win, hits));
> +       }
> +
> +       if (readahead) {
> +               count_vm_event(SWAP_RA_HIT);
> +               if (!vmf || !vma_ra)
> +                       atomic_inc(&swapin_readahead_hits);
>         }
>
>         return folio;
> --
> 2.42.0
>
>


* Re: [PATCH 14/24] mm/swap: do shadow lookup as well when doing swap cache lookup
  2023-11-19 19:47 ` [PATCH 14/24] mm/swap: do shadow lookup as well when doing swap cache lookup Kairui Song
@ 2023-11-21 16:55   ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-21 16:55 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Hi Kairui,

Too trivial to stand alone as a patch. Merge it with the patch that needs
to use that "*shadow".

Chris

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Make swap_cache_get_folio capable of returning the shadow value when the
> xarray contains a shadow instead of a valid folio. Just extend the
> arguments to be used later.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/shmem.c      |  2 +-
>  mm/swap.h       |  2 +-
>  mm/swap_state.c | 11 +++++++----
>  3 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 72239061c655..f9ce4067c742 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1875,7 +1875,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>         }
>
>         /* Look it up and read it in.. */
> -       folio = swap_cache_get_folio(swap, NULL);
> +       folio = swap_cache_get_folio(swap, NULL, NULL);
>         if (!folio) {
>                 /* Or update major stats only when swapin succeeds?? */
>                 if (fault_type) {
> diff --git a/mm/swap.h b/mm/swap.h
> index e43e965f123f..da9deb5ba37d 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -47,7 +47,7 @@ void delete_from_swap_cache(struct folio *folio);
>  void clear_shadow_from_swap_cache(int type, unsigned long begin,
>                                   unsigned long end);
>  struct folio *swap_cache_get_folio(swp_entry_t entry,
> -                                  struct vm_fault *vmf);
> +                                  struct vm_fault *vmf, void **shadowp);
>  struct folio *filemap_get_incore_folio(struct address_space *mapping,
>                 pgoff_t index);
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3b5a34f47192..e057c79fb06f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -334,14 +334,17 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
>   *
>   * Caller must lock the swap device or hold a reference to keep it valid.
>   */
> -struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)
> +struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
>  {
>         bool vma_ra, readahead;
>         struct folio *folio;
>
> -       folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
> -       if (IS_ERR(folio))
> +       folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
> +       if (xa_is_value(folio)) {
> +               if (shadowp)
> +                       *shadowp = folio;
>                 return NULL;
> +       }
>
>         /*
>          * At the moment, we don't support PG_readahead for anon THP
> @@ -923,7 +926,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>         struct page *page;
>         pgoff_t ilx;
>
> -       folio = swap_cache_get_folio(entry, vmf);
> +       folio = swap_cache_get_folio(entry, vmf, NULL);
>         if (folio) {
>                 page = folio_file_page(folio, swp_offset(entry));
>                 cache_result = SWAP_CACHE_HIT;
> --
> 2.42.0
>
>


* Re: [PATCH 15/24] mm/swap: avoid an duplicated swap cache lookup for SYNCHRONOUS_IO device
  2023-11-19 19:47 ` [PATCH 15/24] mm/swap: avoid an duplicated swap cache lookup for SYNCHRONOUS_IO device Kairui Song
@ 2023-11-21 17:15   ` Chris Li
  2023-11-22 18:08     ` Kairui Song
  0 siblings, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-21 17:15 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> When an xa_value is returned by the cache lookup, keep it to be used
> later for the workingset refault check instead of doing the lookup again
> in swapin_no_readahead.
>
> This does have a side effect of making swapoff also trigger the workingset
> check, but that should be fine since swapoff already affects the workload
> in many ways.

I need to sleep on it a bit to see if this will create another problem or not.

>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap_state.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index e057c79fb06f..51de2a0412df 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -872,7 +872,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  {
>         struct folio *folio;
>         struct page *page;
> -       void *shadow = NULL;
>
>         page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
>         folio = (struct folio *)page;
> @@ -888,10 +887,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>
>                 mem_cgroup_swapin_uncharge_swap(entry);
>
> -               shadow = get_shadow_from_swap_cache(entry);
> -               if (shadow)
> -                       workingset_refault(folio, shadow);
> -
>                 folio_add_lru(folio);
>
>                 /* To provide entry to swap_readpage() */
> @@ -922,11 +917,12 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>         enum swap_cache_result cache_result;
>         struct swap_info_struct *si;
>         struct mempolicy *mpol;
> +       void *shadow = NULL;
>         struct folio *folio;
>         struct page *page;
>         pgoff_t ilx;
>
> -       folio = swap_cache_get_folio(entry, vmf, NULL);
> +       folio = swap_cache_get_folio(entry, vmf, &shadow);
>         if (folio) {
>                 page = folio_file_page(folio, swp_offset(entry));
>                 cache_result = SWAP_CACHE_HIT;
> @@ -938,6 +934,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>         if (swap_use_no_readahead(si, swp_offset(entry))) {
>                 page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
>                 cache_result = SWAP_CACHE_BYPASS;
> +               if (shadow)
> +                       workingset_refault(page_folio(page), shadow);

It is inconsistent that the other flavors of readahead do not do the
workingset_refault here.
I suggest keeping the workingset_refault in swapin_no_readahead() and
passing the shadow argument in, e.g. as sketched below.
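
Something along these lines (sketch only; the exact signature is up to you,
and the extra "shadow" parameter here is hypothetical):

    static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                            struct mempolicy *mpol, pgoff_t ilx,
                                            struct mm_struct *mm, void *shadow)
    {
        ...
        if (shadow)
            workingset_refault(folio, shadow);
        folio_add_lru(folio);
        ...
    }

and swapin_readahead() just passes down the shadow it already got back from
swap_cache_get_folio(), instead of calling workingset_refault() itself.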

Chris

>         } else if (swap_use_vma_readahead(si)) {
>                 page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
>                 cache_result = SWAP_CACHE_MISS;
> --
> 2.42.0
>
>


* Re: [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path
  2023-11-19 19:47 ` [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path Kairui Song
  2023-11-19 21:12   ` Matthew Wilcox
@ 2023-11-21 17:25   ` Chris Li
  2023-11-22  0:36   ` Huang, Ying
  2 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-21 17:25 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Move get_swap_device into swapin_readahead, simplify the code
> and prepare for follow up commits.
>
> For the later part in do_swap_page, using swp_swap_info directly is fine
> since in that context, the swap device is pinned by swapcache reference.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c     | 16 ++++------------
>  mm/swap_state.c |  8 ++++++--
>  mm/swapfile.c   |  4 +++-
>  3 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 22af9f3e8c75..e399b37ef395 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3789,7 +3789,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         struct folio *swapcache = NULL, *folio = NULL;
>         enum swap_cache_result cache_result;
>         struct page *page;
> -       struct swap_info_struct *si = NULL;
>         rmap_t rmap_flags = RMAP_NONE;
>         bool exclusive = false;
>         swp_entry_t entry;
> @@ -3845,14 +3844,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 goto out;
>         }
>
> -       /* Prevent swapoff from happening to us. */
> -       si = get_swap_device(entry);
> -       if (unlikely(!si))
> -               goto out;
> -
>         page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
>                                 vmf, &cache_result);
> -       if (page) {
> +       if (PTR_ERR(page) == -EBUSY) {
> +               goto out;
> +       } else if (page) {

As Matthew suggested, use ERR_PTR(-EBUSY) for the comparison.
You also don't need the else here; the goto already terminates the flow:

if (page == ERR_PTR(-EBUSY))
   goto out;

if (page) {

Chris


* Re: [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path
  2023-11-19 19:47 ` [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path Kairui Song
  2023-11-19 21:12   ` Matthew Wilcox
  2023-11-21 17:25   ` Chris Li
@ 2023-11-22  0:36   ` Huang, Ying
  2023-11-23 11:13     ` Kairui Song
  2 siblings, 1 reply; 93+ messages in thread
From: Huang, Ying @ 2023-11-22  0:36 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Move get_swap_device into swapin_readahead, simplify the code
> and prepare for follow up commits.

No.  Please don't do this.  Please check the get/put_swap_device() usage
rule in the comments of get_swap_device().

"
 * When we get a swap entry, if there aren't some other ways to
 * prevent swapoff, such as the folio in swap cache is locked, page
 * table lock is held, etc., the swap entry may become invalid because
 * of swapoff.  Then, we need to enclose all swap related functions
 * with get_swap_device() and put_swap_device(), unless the swap
 * functions call get/put_swap_device() by themselves.
"

This is to simplify the reasoning about swapoff and swap entry.
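
For reference, the usual pattern that comment describes looks roughly like
this (sketch):

    /* Prevent swapoff from invalidating the entry under us. */
    si = get_swap_device(entry);
    if (unlikely(!si))
        goto out;  /* the swap entry is no longer valid */
    /* ... everything that assumes the swap entry/device stays valid ... */
    put_swap_device(si);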

Why does it bother you?

> For the later part in do_swap_page, using swp_swap_info directly is fine
> since in that context, the swap device is pinned by swapcache reference.

[snip]

--
Best Regards,
Huang, Ying


* Re: [PATCH 18/24] mm/swap: introduce a helper non fault swapin
  2023-11-19 19:47 ` [PATCH 18/24] mm/swap: introduce a helper non fault swapin Kairui Song
  2023-11-20  1:07   ` kernel test robot
@ 2023-11-22  4:40   ` Chris Li
  2023-11-28 11:22     ` Kairui Song
  1 sibling, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-22  4:40 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> There are two places where swapin is not direct caused by page fault:
> shmem swapin is invoked through shmem mapping, swapoff cause swapin by
> walking the page table. They used to construct a pseudo vmfault struct
> for swapin function.
>
> Shmem has dropped the pseudo vmfault recently in commit ddc1a5cbc05d
> ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). Swapoff
> path is still using a pseudo vmfault.
>
> Introduce a helper for them both; this helps save stack usage for the swapoff
> path, and helps apply a unified swapin cache and readahead policy check.
>
> Also prepare for follow up commits.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/shmem.c      | 51 ++++++++++++++++---------------------------------
>  mm/swap.h       | 11 +++++++++++
>  mm/swap_state.c | 38 ++++++++++++++++++++++++++++++++++++
>  mm/swapfile.c   | 23 +++++++++++-----------
>  4 files changed, 76 insertions(+), 47 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f9ce4067c742..81d129aa66d1 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1565,22 +1565,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
>  static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
>                         pgoff_t index, unsigned int order, pgoff_t *ilx);
>
> -static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
> -                       struct shmem_inode_info *info, pgoff_t index)
> -{
> -       struct mempolicy *mpol;
> -       pgoff_t ilx;
> -       struct page *page;
> -
> -       mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> -       page = swap_cluster_readahead(swap, gfp, mpol, ilx);
> -       mpol_cond_put(mpol);
> -
> -       if (!page)
> -               return NULL;
> -       return page_folio(page);
> -}
> -

Nice. Thank you.

>  /*
>   * Make sure huge_gfp is always more limited than limit_gfp.
>   * Some of the flags set permissions, while others set limitations.
> @@ -1854,9 +1838,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  {
>         struct address_space *mapping = inode->i_mapping;
>         struct shmem_inode_info *info = SHMEM_I(inode);
> -       struct swap_info_struct *si;
> +       enum swap_cache_result result;
>         struct folio *folio = NULL;
> +       struct mempolicy *mpol;
> +       struct page *page;
>         swp_entry_t swap;
> +       pgoff_t ilx;
>         int error;
>
>         VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> @@ -1866,34 +1853,30 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>         if (is_poisoned_swp_entry(swap))
>                 return -EIO;
>
> -       si = get_swap_device(swap);
> -       if (!si) {
> +       mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> +       page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);

Notice this "result" CAN be outdated. e.g. after this call, the swap
cache can be changed by another thread generating the swap page fault
and installing the folio into the swap cache or removing it.

> +       mpol_cond_put(mpol);
> +
> +       if (PTR_ERR(page) == -EBUSY) {
>                 if (!shmem_confirm_swap(mapping, index, swap))
>                         return -EEXIST;
Not your fault. The if statement already returned.
>                 else
This is not needed; the following return -EINVAL can be one indent level less.
>                         return -EINVAL;
> -       }
> -
> -       /* Look it up and read it in.. */
> -       folio = swap_cache_get_folio(swap, NULL, NULL);
> -       if (!folio) {
> -               /* Or update major stats only when swapin succeeds?? */
> -               if (fault_type) {
> +       } else if (!page) {
The else isn't needed here because the previous if statement always returns.

> +               error = -ENOMEM;
> +               goto failed;
> +       } else {

The else isn't needed here either; the previous goto terminates the flow.

> +               folio = page_folio(page);
> +               if (fault_type && result != SWAP_CACHE_HIT) {
>                         *fault_type |= VM_FAULT_MAJOR;
>                         count_vm_event(PGMAJFAULT);
>                         count_memcg_event_mm(fault_mm, PGMAJFAULT);
>                 }
> -               /* Here we actually start the io */
> -               folio = shmem_swapin_cluster(swap, gfp, info, index);
> -               if (!folio) {
> -                       error = -ENOMEM;
> -                       goto failed;
> -               }
>         }
>
>         /* We have to do this with folio locked to prevent races */
>         folio_lock(folio);
> -       if (!folio_test_swapcache(folio) ||
> +       if ((result != SWAP_CACHE_BYPASS && !folio_test_swapcache(folio)) ||

I think there is a possible racing bug here. Because the result can be outdated.

>             folio->swap.val != swap.val ||
>             !shmem_confirm_swap(mapping, index, swap)) {
>                 error = -EEXIST;
> @@ -1930,7 +1913,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>         delete_from_swap_cache(folio);
>         folio_mark_dirty(folio);
>         swap_free(swap);
> -       put_swap_device(si);
>
>         *foliop = folio;
>         return 0;
> @@ -1944,7 +1926,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>                 folio_unlock(folio);
>                 folio_put(folio);
>         }
> -       put_swap_device(si);
>
>         return error;
>  }
> diff --git a/mm/swap.h b/mm/swap.h
> index da9deb5ba37d..b073c29c9790 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -62,6 +62,10 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>                                     struct mempolicy *mpol, pgoff_t ilx);
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
>                               struct vm_fault *vmf, enum swap_cache_result *result);
> +struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
> +                                  struct mempolicy *mpol, pgoff_t ilx,
> +                                  struct mm_struct *mm,
> +                                  enum swap_cache_result *result);
>
>  static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
> @@ -103,6 +107,13 @@ static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
>         return NULL;
>  }
>
> +static inline struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
> +               struct mempolicy *mpol, pgoff_t ilx, struct mm_struct *mm,
> +               enum swap_cache_result *result)
> +{
> +       return NULL;
> +}
> +
>  static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>  {
>         return 0;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ff8a166603d0..eef66757c615 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -956,6 +956,44 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>         return page;
>  }
>
> +struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
> +                                  struct mempolicy *mpol, pgoff_t ilx,
> +                                  struct mm_struct *mm, enum swap_cache_result *result)

Can you pick a better function name? e.g. avoid negative words. The
function should be named after what it does, not after who calls it; the
caller usage might change over time.
I see that swapin_page_non_fault() and swapin_readahead() do very
similar things with a similar structure. Can you unify these two
somehow?

Chris

> +{
> +       enum swap_cache_result cache_result;
> +       struct swap_info_struct *si;
> +       void *shadow = NULL;
> +       struct folio *folio;
> +       struct page *page;
> +
> +       /* Prevent swapoff from happening to us */
> +       si = get_swap_device(entry);
> +       if (unlikely(!si))
> +               return ERR_PTR(-EBUSY);
> +
> +       folio = swap_cache_get_folio(entry, NULL, &shadow);
> +       if (folio) {
> +               page = folio_file_page(folio, swp_offset(entry));
> +               cache_result = SWAP_CACHE_HIT;
> +               goto done;
> +       }
> +
> +       if (swap_use_no_readahead(si, swp_offset(entry))) {
> +               page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, mm);
> +               if (shadow)
> +                       workingset_refault(page_folio(page), shadow);
> +               cache_result = SWAP_CACHE_BYPASS;
> +       } else {
> +               page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> +               cache_result = SWAP_CACHE_MISS;
> +       }
> +done:
> +       put_swap_device(si);
> +       if (result)
> +               *result = cache_result;
> +       return page;
> +}
> +
>  #ifdef CONFIG_SYSFS
>  static ssize_t vma_ra_enabled_show(struct kobject *kobj,
>                                      struct kobj_attribute *attr, char *buf)
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 925ad92486a4..f8c5096fe0f0 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1822,20 +1822,15 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>
>         si = swap_info[type];
>         do {
> +               int ret;
> +               pte_t ptent;
> +               pgoff_t ilx;
> +               swp_entry_t entry;
>                 struct page *page;
>                 unsigned long offset;
> +               struct mempolicy *mpol;
>                 unsigned char swp_count;
>                 struct folio *folio = NULL;
> -               swp_entry_t entry;
> -               int ret;
> -               pte_t ptent;
> -
> -               struct vm_fault vmf = {
> -                       .vma = vma,
> -                       .address = addr,
> -                       .real_address = addr,
> -                       .pmd = pmd,
> -               };
>
>                 if (!pte++) {
>                         pte = pte_offset_map(pmd, addr);
> @@ -1855,8 +1850,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>                 offset = swp_offset(entry);
>                 pte_unmap(pte);
>                 pte = NULL;
> -               page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -                                       &vmf, NULL);
> +
> +               mpol = get_vma_policy(vma, addr, 0, &ilx);
> +               page = swapin_page_non_fault(entry, GFP_HIGHUSER_MOVABLE,
> +                                            mpol, ilx, vma->vm_mm, NULL);
> +               mpol_cond_put(mpol);
> +
>                 if (IS_ERR(page))
>                         return PTR_ERR(page);
>                 else if (page)
> --
> 2.42.0
>
>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 20/24] swap: simplify and make swap_find_cache static
  2023-11-19 19:47 ` [PATCH 20/24] swap: simplify and make swap_find_cache static Kairui Song
@ 2023-11-22  5:01   ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-22  5:01 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> There are only two callers within the same file now, make it static and
> drop the redundant if check on the shadow variable.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap.h       | 2 --
>  mm/swap_state.c | 5 ++---
>  2 files changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index b073c29c9790..4402970547e7 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -46,8 +46,6 @@ void __delete_from_swap_cache(struct folio *folio,
>  void delete_from_swap_cache(struct folio *folio);
>  void clear_shadow_from_swap_cache(int type, unsigned long begin,
>                                   unsigned long end);
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> -                                  struct vm_fault *vmf, void **shadowp);
>  struct folio *filemap_get_incore_folio(struct address_space *mapping,
>                 pgoff_t index);
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index eef66757c615..6f39aa8394f1 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -334,15 +334,14 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
>   *
>   * Caller must lock the swap device or hold a reference to keep it valid.
>   */
> -struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
> +static struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
>  {
>         bool vma_ra, readahead;
>         struct folio *folio;
>
>         folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
>         if (xa_is_value(folio)) {
> -               if (shadowp)
> -                       *shadowp = folio;
> +               *shadowp = folio;

I prefer to keep the original code, which performs the check before
assigning to the pointer. The caller usage might change, and it is more
defensive.

It also does not seem to justify a standalone patch.

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 21/24] swap: make swapin_readahead result checking argument mandatory
  2023-11-19 19:47 ` [PATCH 21/24] swap: make swapin_readahead result checking argument mandatory Kairui Song
@ 2023-11-22  5:15   ` Chris Li
  2023-11-24  8:14     ` Kairui Song
  0 siblings, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-22  5:15 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This is only one caller now in page fault path, make the result return
> argument mandatory.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap_state.c | 17 +++++++----------
>  1 file changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 6f39aa8394f1..0433a2586c6d 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -913,7 +913,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>                               struct vm_fault *vmf, enum swap_cache_result *result)
>  {
> -       enum swap_cache_result cache_result;
>         struct swap_info_struct *si;
>         struct mempolicy *mpol;
>         void *shadow = NULL;
> @@ -928,29 +927,27 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>
>         folio = swap_cache_get_folio(entry, vmf, &shadow);
>         if (folio) {
> +               *result = SWAP_CACHE_HIT;
>                 page = folio_file_page(folio, swp_offset(entry));
> -               cache_result = SWAP_CACHE_HIT;
>                 goto done;
>         }
>
>         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>         if (swap_use_no_readahead(si, swp_offset(entry))) {
> +               *result = SWAP_CACHE_BYPASS;

Each of this "*result" will compile into memory store instructions.
The compiler most likely can't optimize and combine them together
because the store can cause segfault from the compiler's point of
view. The multiple local variable assignment can be compiled into a
few registers assignment so it does not cost as much as multiple
memory stores.
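
To illustrate (a made-up, standalone sketch, not from the patch; work()
just stands in for the opaque readahead/swapin calls between the
assignments, and the enum only mirrors the one in the series):

enum swap_cache_result { SWAP_CACHE_HIT, SWAP_CACHE_MISS, SWAP_CACHE_BYPASS };

struct page *work(void);	/* opaque call the compiler can't see through */

/* Store through the pointer in each branch: every store has to be
 * emitted before work(), because work() might fault or read *result. */
struct page *early_stores(int hit, enum swap_cache_result *result)
{
	if (hit) {
		*result = SWAP_CACHE_HIT;
		return work();
	}
	*result = SWAP_CACHE_MISS;
	return work();
}

/* Track the result in a local (likely a register) and store once at the
 * end, after the opaque calls have been made: */
struct page *single_store(int hit, enum swap_cache_result *result)
{
	enum swap_cache_result r;
	struct page *page;

	if (hit) {
		r = SWAP_CACHE_HIT;
		page = work();
	} else {
		r = SWAP_CACHE_MISS;
		page = work();
	}
	*result = r;
	return page;
}
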

>                 page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> -               cache_result = SWAP_CACHE_BYPASS;
>                 if (shadow)
>                         workingset_refault(page_folio(page), shadow);
> -       } else if (swap_use_vma_readahead(si)) {
> -               page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> -               cache_result = SWAP_CACHE_MISS;
>         } else {
> -               page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> -               cache_result = SWAP_CACHE_MISS;
> +               *result = SWAP_CACHE_MISS;
> +               if (swap_use_vma_readahead(si))
> +                       page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> +               else
> +                       page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);

I recall you introduced or heavily modified this function in a previous patch.
Consider combining some of the patches and presenting the final version sooner.
From a reviewing point of view, there is no need to review so many
internal versions which get overwritten anyway.

>         }
>         mpol_cond_put(mpol);
>  done:
>         put_swap_device(si);
> -       if (result)
> -               *result = cache_result;

The original version, which checks and assigns in one place, is better:
it is safer and produces better code.

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 22/24] swap: make swap_cluster_readahead static
  2023-11-19 19:47 ` [PATCH 22/24] swap: make swap_cluster_readahead static Kairui Song
@ 2023-11-22  5:20   ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-22  5:20 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Hi Kairui,

On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Now there is no caller outside the same file, make it static.

This seems too trivial / low-value to me to justify a standalone patch.

Chris

>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap.h       | 8 --------
>  mm/swap_state.c | 4 ++--
>  2 files changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 4402970547e7..795a25df87da 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -56,8 +56,6 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>                                      struct mempolicy *mpol, pgoff_t ilx,
>                                      bool *new_page_allocated);
> -struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> -                                   struct mempolicy *mpol, pgoff_t ilx);
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
>                               struct vm_fault *vmf, enum swap_cache_result *result);
>  struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
> @@ -93,12 +91,6 @@ static inline void show_swap_cache_info(void)
>  {
>  }
>
> -static inline struct page *swap_cluster_readahead(swp_entry_t entry,
> -                       gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx)
> -{
> -       return NULL;
> -}
> -
>  static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
>                         struct vm_fault *vmf, enum swap_cache_result *result)
>  {
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 0433a2586c6d..b377e55cb850 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -627,8 +627,8 @@ static unsigned long swapin_nr_pages(unsigned long offset)
>   * are used for every page of the readahead: neighbouring pages on swap
>   * are fairly likely to have been swapped out from the same node.
>   */
> -struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> -                                   struct mempolicy *mpol, pgoff_t ilx)
> +static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> +                                          struct mempolicy *mpol, pgoff_t ilx)
>  {
>         struct page *page;
>         unsigned long entry_offset = swp_offset(entry);
> --
> 2.42.0
>
>

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 23/24] swap: fix multiple swap leak when after cgroup migrate
  2023-11-20 11:17     ` Kairui Song
@ 2023-11-22  5:34       ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-22  5:34 UTC (permalink / raw)
  To: Kairui Song
  Cc: Huang, Ying, linux-mm, Andrew Morton, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel, Tejun Heo

Hi Kairui,

On Mon, Nov 20, 2023 at 3:17 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Huang, Ying <ying.huang@intel.com> wrote on Mon, Nov 20, 2023 at 15:37:
> >
> > Kairui Song <ryncsn@gmail.com> writes:
> >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > When a process which previously swapped out some memory is moved to
> > > another cgroup, and the cgroup it was previously in is dead, swapped-in
> > > pages will be leaked into the root cgroup. Previous commits fixed the bug
> > > for the no-readahead path; this commit fixes the same issue for the readahead path.
> > >
> > > This can be easily reproduced by:
> > > - Setup a SSD or HDD swap.
> > > - Create memory cgroup A, B and C.
> > > - Spawn process P1 in cgroup A and make it swap out some pages.
> > > - Move process P1 to memory cgroup B.
> > > - Destroy cgroup A.
> > > - Do a swapoff in cgroup C
> > > - Swapped in pages is accounted into cgroup C.
> > >
> > > This patch fixes it by making the swapped-in pages accounted to cgroup B.
> >
> > According to the "Memory Ownership" section of
> > Documentation/admin-guide/cgroup-v2.rst,
> >
> > "

> > A memory area is charged to the cgroup which instantiated it and stays
> > charged to the cgroup until the area is released.  Migrating a process
> > to a different cgroup doesn't move the memory usages that it
> > instantiated while in the previous cgroup to the new cgroup.
> > "
> >
> > Because we don't move the charge when we move a task from one cgroup to
> > another.  It's controversial which cgroup should be charged to.
> > According to the above document, it's acceptable to charge to the cgroup
> > C (cgroup where swapoff happens).
>
> Hi Ying, thank you very much for the info!
>
> It is controversial indeed; it's just that the original behavior is kind of
> counter-intuitive.
>
> Imagine there is a cgroup P1 with child cgroups C1 and C2, and a process
> swapped out some memory in C1, then moved to C2, and C1 is dead.
> On swapoff the charge will be moved out of P1...
>
> And swapoff often happens in some unlimited cgroup, or in a cgroup for a
> management agent.
>
> If P1 has a memory limit, it can breach the limit easily: we will see
> a process that never left P1 having a much higher RSS than P1/C1/C2's
> limit.
> And if there is a limit on the management agent's cgroup, the agent
> will be OOM killed instead of the OOM happening in P1.

I think I will reply to another similar email.

If you want the OOM to happen in P1, you can have an admin program fork
and execute a new process, add the new process into P1, then run swapoff
from that new process.
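
A rough sketch of that idea in C (the cgroup mount path, the "P1" name and
the missing error handling are assumptions for illustration, not a tested
tool):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		char buf[32];
		/* move the child itself into P1 ... */
		int fd = open("/sys/fs/cgroup/P1/cgroup.procs", O_WRONLY);

		snprintf(buf, sizeof(buf), "%d\n", getpid());
		write(fd, buf, strlen(buf));
		close(fd);
		/* ... then do the swapoff from inside P1, so the swapins it
		 * triggers are charged there */
		execlp("swapoff", "swapoff", "-a", (char *)NULL);
		return 1;	/* exec failed */
	}
	waitpid(pid, NULL, 0);
	return 0;
}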

>
> Simply moving a process between child cgroups of the same parent
> cgroup won't cause such an issue; things only get weird when swapoff is
> involved.
>
> Or maybe we should try to be compatible, and introduce a sysctl or
> cmdline for this?

If the above suggestion works, then you don't need to change swapoff, right?

If you still want to change the charging model, I would like to see the
bigger picture: what rules it follows and how it works in other
situations.

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 00/24] Swapin path refactor for optimization and bugfix
  2023-11-20 19:09 ` [PATCH 00/24] Swapin path refactor for optimization and bugfix Yosry Ahmed
  2023-11-20 20:22   ` Chris Li
@ 2023-11-22  6:43   ` Kairui Song
  1 sibling, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-22  6:43 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Yosry Ahmed <yosryahmed@google.com> wrote on Tue, Nov 21, 2023 at 03:18:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > This series tries to unify and clean up the swapin path, fixing a few
> > issues with optimizations:
> >
> > 1. Memcg leak issue: when a process that previously swapped out some
> >    migrated to another cgroup, and the origianl cgroup is dead. If we
> >    do a swapoff, swapped in pages will be accounted into the process
> >    doing swapoff instead of the new cgroup. This will allow the process
> >    to use more memory than expect easily.
> >
> >    This can be easily reproduced by:
> >    - Setup a swap.
> >    - Create memory cgroup A, B and C.
> >    - Spawn process P1 in cgroup A and make it swap out some pages.
> >    - Move process P1 to memory cgroup B.
> >    - Destroy cgroup A.
> >    - Do a swapoff in cgroup C
> >    - Swapped in pages is accounted into cgroup C.
> >
> >    This patch will fix it make the swapped in pages accounted in cgroup B.
> >
>
> I guess this only works for anonymous memory and not shmem, right?

Hi Yosry,

Yes, this patch only changes the charge target for anon pages.

>
> I think tying memcg charges to a process is not something we usually
> do. Charging the pages to the memcg of the faulting process if the
> previous owner is dead makes sense, it's essentially recharging the
> memory to the new owner. Swapoff is indeed a special case, since the
> faulting process is not the new owner, but an admin process or so. I
> am guessing charging to the new memcg of the previous owner might make
> sense in this case, but it is a change of behavior.

After discussing with others I also found this is a controversial issue;
I will send a separate series for it, since I think it needs more discussion.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 00/24] Swapin path refactor for optimization and bugfix
  2023-11-20 20:22   ` Chris Li
@ 2023-11-22  6:46     ` Kairui Song
  0 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-22  6:46 UTC (permalink / raw)
  To: Chris Li
  Cc: Yosry Ahmed, linux-mm, Andrew Morton, Huang, Ying,
	David Hildenbrand, Hugh Dickins, Johannes Weiner, Matthew Wilcox,
	Michal Hocko, linux-kernel

Chris Li <chrisl@kernel.org> wrote on Tue, Nov 21, 2023 at 04:23:
>
> Hi Kairui,
>
> On Mon, Nov 20, 2023 at 11:10 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > This series tries to unify and clean up the swapin path, fixing a few
> > > issues with optimizations:
> > >
> > > 1. Memcg leak issue: when a process that previously swapped out some
> > >    migrated to another cgroup, and the origianl cgroup is dead. If we
> > >    do a swapoff, swapped in pages will be accounted into the process
> > >    doing swapoff instead of the new cgroup. This will allow the process
> > >    to use more memory than expect easily.
> > >
> > >    This can be easily reproduced by:
> > >    - Setup a swap.
> > >    - Create memory cgroup A, B and C.
> > >    - Spawn process P1 in cgroup A and make it swap out some pages.
> > >    - Move process P1 to memory cgroup B.
> > >    - Destroy cgroup A.
> > >    - Do a swapoff in cgroup C
> > >    - Swapped in pages is accounted into cgroup C.
> > >
> > >    This patch will fix it make the swapped in pages accounted in cgroup B.
> > >
> >
> > I guess this only works for anonymous memory and not shmem, right?
> >
> > I think tying memcg charges to a process is not something we usually
> > do. Charging the pages to the memcg of the faulting process if the
> > previous owner is dead makes sense, it's essentially recharging the
> > memory to the new owner. Swapoff is indeed a special case, since the
> > faulting process is not the new owner, but an admin process or so. I
> > am guessing charging to the new memcg of the previous owner might make
> > sense in this case, but it is a change of behavior.
> >
>
> I was looking at this in patch 23 as well. I will ask more questions in
> the patch thread.
> I would suggest separating these two behavior-change patches out
> from the cleanup series to give them more exposure and proper
> discussion.
> Patch 5 and patch 23.
>
> Chris
>

Hi Chris,

Thank you very much for reviewing these details; it's really helpful.

I'll send out new series after checking your suggestions on these patches.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check
  2023-11-20 17:44       ` Chris Li
@ 2023-11-22 17:32         ` Kairui Song
  2023-11-22 20:57           ` Chris Li
  0 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-22 17:32 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Chris Li <chrisl@kernel.org> wrote on Tue, Nov 21, 2023 at 01:44:
>
> On Mon, Nov 20, 2023 at 3:15 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > > > index ac4fa404eaa7..45dd8b7c195d 100644
> > > > --- a/mm/swap_state.c
> > > > +++ b/mm/swap_state.c
> > > > @@ -458,6 +458,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > >
> > > You move the mem_cgroup_swapin_charge_folio() inside the for loop:
> > >
> > >
> > >         for (;;) {
> > >                 int err;
> > >                 /*
> > >                  * First check the swap cache.  Since this is normally
> > >                  * called after swap_cache_get_folio() failed, re-calling
> > >                  * that would confuse statistics.
> > >                  */
> > >                 folio = filemap_get_folio(swap_address_space(entry),
> > >                                                 swp_offset(entry));
> > >
> > >
> > > >                                                 mpol, ilx, numa_node_id());
> > > >                 if (!folio)
> > > >                          goto fail_put_swap;
> > > > +               if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
> > > > +                       goto fail_put_folio;
> > >
> > > Wouldn't it cause repeat charging of the folio when it is racing
> > > against others in the for loop?
> >
> > The race loser will call folio_put and discharge it?
>
> There are two different charges. Memcg charging and memcg swapin charging.
> The folio_put will do the memcg discharge, the corresponding memcg
> > charge is in folio allocation.

Hi Chris,

I didn't get your idea here... By "memcg swapin charge", do you mean
"memory.swap.*"? And "memcg charging" means "memory.*"? There is no
memcg charge related code in folio allocation (alloc_pages_mpol);
actually the mem_cgroup_swapin_charge_folio here is doing the memcg charge,
not the memcg swapin charge. The swapin path actually needs to uncharge
"memory.swap" via mem_cgroup_swapin_uncharge_swap in a later part of this
function.


> Memcg swapin charge does things differently, it needs to modify the
> swap relately accounting.
> The memcg uncharge is not a pair for memcg swapin charge.
>
> > > >                 /*
> > > >                  * Swap entry may have been freed since our caller observed it.
> > > > @@ -483,13 +485,9 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > > >         /*
> > > >          * The swap entry is ours to swap in. Prepare the new page.
> > > >          */
> > > > -
> > > >         __folio_set_locked(folio);
> > > >         __folio_set_swapbacked(folio);
> > > >
> > > > -       if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
> > > > -               goto fail_unlock;
> > > > -
> > >
> > > The original code makes the charge outside of the for loop. Only the
> > > winner can charge once.
> >
> > Right, this patch may make the charge/dis-charge path more complex for
> > race swapin, I'll re-check this part.
>
> It is more than just complex, it seems to change the behavior of this code.
>
> Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper
  2023-11-21  5:34   ` Chris Li
@ 2023-11-22 17:33     ` Kairui Song
  0 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-22 17:33 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Chris Li <chrisl@kernel.org> wrote on Tue, Nov 21, 2023 at 13:35:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > No feature change, simply move the routine to a standalone function to
> > be used later. The error path handling is copied from the "out_page"
> > label, to make the code change minimized for easier reviewing.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/memory.c     | 33 +++++----------------------------
> >  mm/swap.h       |  2 ++
> >  mm/swap_state.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 55 insertions(+), 28 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 70ffa867b1be..fba4a5229163 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3794,7 +3794,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         swp_entry_t entry;
> >         pte_t pte;
> >         vm_fault_t ret = 0;
> > -       void *shadow = NULL;
> >
> >         if (!pte_unmap_same(vmf))
> >                 goto out;
> > @@ -3858,33 +3857,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         if (!folio) {
> >                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >                     __swap_count(entry) == 1) {
> > -                       /* skip swapcache */
> > -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -                                               vma, vmf->address, false);
> > -                       if (folio) {
> > -                               __folio_set_locked(folio);
> > -                               __folio_set_swapbacked(folio);
> > -
> > -                               if (mem_cgroup_swapin_charge_folio(folio,
> > -                                                       vma->vm_mm, GFP_KERNEL,
> > -                                                       entry)) {
> > -                                       ret = VM_FAULT_OOM;
> > -                                       goto out_page;
> > -                               }
> > -                               mem_cgroup_swapin_uncharge_swap(entry);
> > -
> > -                               shadow = get_shadow_from_swap_cache(entry);
> > -                               if (shadow)
> > -                                       workingset_refault(folio, shadow);
> > -
> > -                               folio_add_lru(folio);
> > -                               page = &folio->page;
> > -
> > -                               /* To provide entry to swap_readpage() */
> > -                               folio->swap = entry;
> > -                               swap_readpage(page, true, NULL);
> > -                               folio->private = NULL;
> > -                       }
> > +                       /* skip swapcache and readahead */
> > +                       page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > +                                               vmf);
>
> A minor point,  swapin_no_readahead() is expressed in negative words.
>
> Better use positive words to express what the function does rather
> than what the function does not do.
> I am terrible at naming functions myself. I can think of something
> along the lines of:
> swapin_direct (no cache).
> swapin_minimal?
> swapin_entry_only?
>
> Please suggest better names for basic, bare minimum.

Thanks for the suggestions. I prefer swapin_direct here; I will update
the patch with this name and also make it return a folio directly.
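
i.e. something along the lines of (just a signature sketch, assuming the
arguments stay as in this patch):

static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
				   struct vm_fault *vmf);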

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 15/24] mm/swap: avoid an duplicated swap cache lookup for SYNCHRONOUS_IO device
  2023-11-21 17:15   ` Chris Li
@ 2023-11-22 18:08     ` Kairui Song
  0 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-22 18:08 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Chris Li <chrisl@kernel.org> wrote on Wed, Nov 22, 2023 at 01:18:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > When an xa_value is returned by the cache lookup, keep it to be used
> > later for the workingset refault check instead of doing the lookup again
> > in swapin_no_readahead.
> >
> > This does have a side effect of making swapoff also trigger the workingset
> > check, but that should be fine since swapoff already affects the workload
> > in many ways.
>
> I need to sleep on it a bit to see if this will create another problem or not.
>
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swap_state.c | 10 ++++------
> >  1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index e057c79fb06f..51de2a0412df 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -872,7 +872,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >  {
> >         struct folio *folio;
> >         struct page *page;
> > -       void *shadow = NULL;
> >
> >         page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
> >         folio = (struct folio *)page;
> > @@ -888,10 +887,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >
> >                 mem_cgroup_swapin_uncharge_swap(entry);
> >
> > -               shadow = get_shadow_from_swap_cache(entry);
> > -               if (shadow)
> > -                       workingset_refault(folio, shadow);
> > -
> >                 folio_add_lru(folio);
> >
> >                 /* To provide entry to swap_readpage() */
> > @@ -922,11 +917,12 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >         enum swap_cache_result cache_result;
> >         struct swap_info_struct *si;
> >         struct mempolicy *mpol;
> > +       void *shadow = NULL;
> >         struct folio *folio;
> >         struct page *page;
> >         pgoff_t ilx;
> >
> > -       folio = swap_cache_get_folio(entry, vmf, NULL);
> > +       folio = swap_cache_get_folio(entry, vmf, &shadow);
> >         if (folio) {
> >                 page = folio_file_page(folio, swp_offset(entry));
> >                 cache_result = SWAP_CACHE_HIT;
> > @@ -938,6 +934,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >         if (swap_use_no_readahead(si, swp_offset(entry))) {
> >                 page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> >                 cache_result = SWAP_CACHE_BYPASS;
> > +               if (shadow)
> > +                       workingset_refault(page_folio(page), shadow);
>
> It is inconsistent that the other flavors of readahead do not do the
> workingset_refault here.

Because of readahead and the swap cache. Every readahead page needs to
be checked by workingset_refault with a different shadow (and so a
different xarray entry search is needed). And since the other swapin paths
need to insert the page into the swap cache, they will do an extra xarray
search/insert anyway, so this optimization won't help them.

> I suggest keeping the workingset_refault in swapin_no_readahead() and
> pass the shadow argument in.

That sounds good to me.
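
Roughly something like this (an illustrative fragment only, not the final
code; the surrounding error handling is elided):

static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
					struct mempolicy *mpol, pgoff_t ilx,
					struct mm_struct *mm, void *shadow);

/* and the caller passes in the shadow it already looked up: */
folio = swap_cache_get_folio(entry, vmf, &shadow);
if (!folio && swap_use_no_readahead(si, swp_offset(entry)))
	page = swapin_no_readahead(entry, gfp_mask, mpol, ilx,
				   vmf->vma->vm_mm, shadow);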

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check
  2023-11-22 17:32         ` Kairui Song
@ 2023-11-22 20:57           ` Chris Li
  2023-11-24  8:14             ` Kairui Song
  0 siblings, 1 reply; 93+ messages in thread
From: Chris Li @ 2023-11-22 20:57 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Hi Kairui,

On Wed, Nov 22, 2023 at 9:33 AM Kairui Song <ryncsn@gmail.com> wrote:

> > There are two different charges. Memcg charging and memcg swapin charging.
> > The folio_put will do the memcg discharge, the corresponding memcg
> > charge is in folio allocation.
>
> Hi Chris,
>
> I didn't get your idea here... By "memcg swapin charge", do you mean
> "memory.swap.*"? And "memcg charging" means "memory.*"?. There is no

Sorry, I should have used the function name; then there is no ambiguity.
By "memcg swapin charge" I mean the function mem_cgroup_swapin_charge_folio().
This function will look up the swap entry, find the memcg by the swap entry,
and then charge to that memcg.

> memcg charge related code in folio allocation (alloc_pages_mpol),
> actually the mem_cgroup_swapin_charge_folio here is doing memcg charge
> not memcg swapin charge. Swapin path actually need to uncharge
> "memory.swap" by mem_cgroup_swapin_uncharge_swap in later part of this
> function.

I still think you have a bug there.

Take this made-up example:
let's say the for loop runs 3 times and the 3rd iteration breaks out of the for loop.
The original code will call:
filemap_get_folio() 3 times
folio_put() 2 times
mem_cgroup_swapin_charge_folio() 1 time.

With your patch, it will call:
filemap_get_folio() 3 times
folio_put() 2 times
mem_cgroup_swapin_charge_folio() 3 times.

Do you see the behavior difference there?

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 06/24] swap: rework swapin_no_readahead arguments
  2023-11-21  6:44   ` Chris Li
@ 2023-11-23 10:51     ` Kairui Song
  0 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-23 10:51 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Chris Li <chrisl@kernel.org> wrote on Tue, Nov 21, 2023 at 14:44:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Make it use alloc_pages_mpol instead of vma_alloc_folio, and accept
> > mm_struct directly as an argument instead of taking a vmf as argument.
> > Make its arguments similar to swap_{cluster,vma}_readahead, to
> > make the code more aligned.
>
> It is unclear to me what the value is of making
> swap_{cluster,vma}_readahead more aligned here.
>
> > Also prepare for following commits which will skip vmf for certain
> > swapin paths.
>
> The following patch 07/24 does not seem to use this patch.
> Can it be merged with the other patch that uses it?
>
> As it is, this patch does not have enough stand-alone value to justify it.

Yes, I'll rearrange this.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 08/24] mm/swap: check readahead policy per entry
  2023-11-21  7:54   ` Chris Li
@ 2023-11-23 10:52     ` Kairui Song
  0 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-23 10:52 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Chris Li <chrisl@kernel.org> wrote on Tue, Nov 21, 2023 at 15:54:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently VMA readahead is globally disabled when any rotating disk is
> > used as a swap backend. So when multiple swap devices are enabled, if a
> > slower hard disk is set as a low-priority fallback and a high-performance
> > SSD is used as the high-priority swap device, VMA readahead is still
> > disabled globally, and the SSD swap device's performance will drop by a lot.
> >
> > Check the readahead policy per entry to avoid this problem.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swap_state.c | 12 +++++++-----
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index ff6756f2e8e4..fb78f7f18ed7 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
> >         return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> >  }
> >
> > -static inline bool swap_use_vma_readahead(void)
> > +static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> >  {
> > -       return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> > +       return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);
>
> A very minor point:
> I notice you changed the order so that enable_vma_readahead is checked last.
> Normally, if enable_vma_readahead == 0, there is no need to check si->flags.
> The si->flags check is more expensive than a simple memory load.
> You might want to check enable_vma_readahead first; then you can
> short-circuit the more expensive part.

Thanks, I'll improve this part.
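
For example, the check could be reordered along these lines (a sketch of
the idea only):

static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
{
	/* cheap global knob first; only look at si->flags when it is enabled */
	return READ_ONCE(enable_vma_readahead) &&
	       data_race(si->flags & SWP_SOLIDSTATE);
}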

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path
  2023-11-22  0:36   ` Huang, Ying
@ 2023-11-23 11:13     ` Kairui Song
  2023-11-24  0:40       ` Huang, Ying
  0 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-23 11:13 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel

Huang, Ying <ying.huang@intel.com> wrote on Wed, Nov 22, 2023 at 08:38:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > Move get_swap_device into swapin_readahead, simplify the code
> > and prepare for follow up commits.
>
> No.  Please don't do this.  Please check the get/put_swap_device() usage
> rule in the comments of get_swap_device().
>
> "
>  * When we get a swap entry, if there aren't some other ways to
>  * prevent swapoff, such as the folio in swap cache is locked, page
>  * table lock is held, etc., the swap entry may become invalid because
>  * of swapoff.  Then, we need to enclose all swap related functions
>  * with get_swap_device() and put_swap_device(), unless the swap
>  * functions call get/put_swap_device() by themselves.
> "
>
> This is to simplify the reasoning about swapoff and swap entry.
>
> Why does it bother you?

Hi Ying,

This is trying to reduce LOC, avoid a trivial si read, and make the error
checking logic easier to refactor in later commits.

Besides, there is one trivial change I forgot to include in this
commit: get_swap_device can be moved after swap_cache_get_folio in
swapin_readahead, since swap_cache_get_folio doesn't need to hold the
swap device, so in the cache-hit case the get/put_swap_device() pair can
be skipped entirely.

The comment also mentioned:

"Then, we need to enclose all swap related functions with
get_swap_device() and put_swap_device(), unless the swap functions
call get/put_swap_device() by themselves"

So I think it should be OK to do this.
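
Something along these lines (a rough sketch of the ordering I mean, details
omitted):

folio = swap_cache_get_folio(entry, vmf, &shadow);
if (folio) {
	page = folio_file_page(folio, swp_offset(entry));
	goto done;			/* cache hit: no get/put_swap_device() */
}

si = get_swap_device(entry);		/* only the miss path pins the device */
if (unlikely(!si))
	return ERR_PTR(-EBUSY);
/* readahead or direct swapin, then put_swap_device(si) */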

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path
  2023-11-23 11:13     ` Kairui Song
@ 2023-11-24  0:40       ` Huang, Ying
  0 siblings, 0 replies; 93+ messages in thread
From: Huang, Ying @ 2023-11-24  0:40 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> Huang, Ying <ying.huang@intel.com> wrote on Wed, Nov 22, 2023 at 08:38:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > Move get_swap_device into swapin_readahead, simplify the code
>> > and prepare for follow up commits.
>>
>> No.  Please don't do this.  Please check the get/put_swap_device() usage
>> rule in the comments of get_swap_device().
>>
>> "
>>  * When we get a swap entry, if there aren't some other ways to
>>  * prevent swapoff, such as the folio in swap cache is locked, page
>>  * table lock is held, etc., the swap entry may become invalid because
>>  * of swapoff.  Then, we need to enclose all swap related functions
>>  * with get_swap_device() and put_swap_device(), unless the swap
>>  * functions call get/put_swap_device() by themselves.
>> "
>>
>> This is to simplify the reasoning about swapoff and swap entry.
>>
>> Why does it bother you?
>
> Hi Ying,
>
> This is trying to reduce LOC, avoid a trivial si read, and make error
> checking logic easier to refactor in later commits.

The race with swapoff usually isn't considered by many developers.  So,
we should use a simple rule as much as possible.  For example, if you
get a swap entry, use get/put_swap_device() to enclose all code that
operates on the swap entry.  This makes the code easier to maintain in
the long run.  Yes, sometimes we break the rule a little, but only if
we have enough benefit, such as improving performance in some practical
use cases.

> And besides there is one trivial change I forgot to include in this
> commit, get_swap_device can be put after swap_cache_get_folio in
> swapin_readahead, since swap_cache_get_folio doesn't need to hold the
> swap device, so in cache hit case this get/put_swap_device() can be
> saved.

swapoff is a rare operation; it's OK to delay it a little to make the code
easier to understand.

> The comment also mentioned:
>
> "Then, we need to enclose all swap related functions with
> get_swap_device() and put_swap_device(), unless the swap functions
> call get/put_swap_device() by themselves"
>
> So I think it should be OK to do this.

No.  You should only change the code with a good enough reason.  Compared
with the complexity it introduces, the benefit isn't enough for me so far.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check
  2023-11-22 20:57           ` Chris Li
@ 2023-11-24  8:14             ` Kairui Song
  2023-11-24  8:37               ` Christopher Li
  0 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-24  8:14 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Chris Li <chrisl@kernel.org> wrote on Thu, Nov 23, 2023 at 04:57:
>
> Hi Kairui,
>
> On Wed, Nov 22, 2023 at 9:33 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> > > There are two different charges. Memcg charging and memcg swapin charging.
> > > The folio_put will do the memcg discharge, the corresponding memcg
> > > charge is in follio allocation.
> >
> > Hi Chris,
> >
> > I didn't get your idea here... By "memcg swapin charge", do you mean
> > "memory.swap.*"? And "memcg charging" means "memory.*"?. There is no
>
> Sorry I should have used the function name then there is no ambiguity.
> "memcg swapin charge" I mean function mem_cgroup_swapin_charge_folio().
> This function will look up the swap entry and find the memcg by swap entry then
> charge to that memcg.
>
> > memcg charge related code in folio allocation (alloc_pages_mpol),
> > actually the mem_cgroup_swapin_charge_folio here is doing memcg charge
> > not memcg swapin charge. Swapin path actually need to uncharge
> > "memory.swap" by mem_cgroup_swapin_uncharge_swap in later part of this
> > function.
>
> I still think you have a bug there.
>
> Take this make up example:
> Let say the for loop runs 3 times and the 3rd time breaks out the for loop.
> The original code will call:
> filemap_get_folio() 3 times
> folio_put() 2 times
> mem_cgroup_swapin_charge_folio() 1 time.
>
> With your patch, it will call:
> filemap_get_folio() 3 times
> folio_put() 2 times
> mem_cgroup_swapin_charge_folio() 3 times.
>
> Do you see the behavior difference there?

Hi Chris,

folio_put will uncharge a page if it's charged. In the original code the
2 folio_put calls simply free the page, since it isn't charged yet. But
with this patch, folio_put will cancel the previous
mem_cgroup_swapin_charge_folio call, so the 3
mem_cgroup_swapin_charge_folio calls will actually only charge once
(2 of the charges are cancelled by folio_put).

I agree this is indeed confusing and causes more trouble in the error
path (the uncharge could be more expensive than the unlock check);
I will rework this part.
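
For reference, the behavior I described is roughly (an illustrative
fragment mirroring the discussion, with a made-up lost_race condition;
not the actual patch):

page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
folio = (struct folio *)page;
if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
	goto fail_put_folio;		/* folio is charged on success */
/* ... race to install the folio into the swap cache ... */
if (lost_race)
	folio_put(folio);		/* frees the charged folio, which also
					 * cancels the charge taken above */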

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 21/24] swap: make swapin_readahead result checking argument mandatory
  2023-11-22  5:15   ` Chris Li
@ 2023-11-24  8:14     ` Kairui Song
  0 siblings, 0 replies; 93+ messages in thread
From: Kairui Song @ 2023-11-24  8:14 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Chris Li <chrisl@kernel.org> wrote on Wed, Nov 22, 2023 at 13:18:
>
> On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > This is only one caller now in page fault path, make the result return
> > argument mandatory.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swap_state.c | 17 +++++++----------
> >  1 file changed, 7 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 6f39aa8394f1..0433a2586c6d 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -913,7 +913,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >                               struct vm_fault *vmf, enum swap_cache_result *result)
> >  {
> > -       enum swap_cache_result cache_result;
> >         struct swap_info_struct *si;
> >         struct mempolicy *mpol;
> >         void *shadow = NULL;
> > @@ -928,29 +927,27 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >
> >         folio = swap_cache_get_folio(entry, vmf, &shadow);
> >         if (folio) {
> > +               *result = SWAP_CACHE_HIT;
> >                 page = folio_file_page(folio, swp_offset(entry));
> > -               cache_result = SWAP_CACHE_HIT;
> >                 goto done;
> >         }
> >
> >         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> >         if (swap_use_no_readahead(si, swp_offset(entry))) {
> > +               *result = SWAP_CACHE_BYPASS;
>
> Each of this "*result" will compile into memory store instructions.
> The compiler most likely can't optimize and combine them together
> because the store can cause segfault from the compiler's point of
> view. The multiple local variable assignment can be compiled into a
> few registers assignment so it does not cost as much as multiple
> memory stores.
>
> >                 page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> > -               cache_result = SWAP_CACHE_BYPASS;
> >                 if (shadow)
> >                         workingset_refault(page_folio(page), shadow);
> > -       } else if (swap_use_vma_readahead(si)) {
> > -               page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> > -               cache_result = SWAP_CACHE_MISS;
> >         } else {
> > -               page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > -               cache_result = SWAP_CACHE_MISS;
> > +               *result = SWAP_CACHE_MISS;
> > +               if (swap_use_vma_readahead(si))
> > +                       page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> > +               else
> > +                       page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>
> I recall you introduce or heavy modify this function in previous patch before.
> Consider combine some of the patch and present the final version sooner.
> From the reviewing point of view, don't need to review so many
> internal version which get over written any way.
>
> >         }
> >         mpol_cond_put(mpol);
> >  done:
> >         put_swap_device(si);
> > -       if (result)
> > -               *result = cache_result;
>
> The original version with check and assign it at one place is better.
> Safer and produce better code.

Yes, that's indeed less error-prone; saving an "if" doesn't seem worth
all the potential trouble. I will drop this.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check
  2023-11-24  8:14             ` Kairui Song
@ 2023-11-24  8:37               ` Christopher Li
  0 siblings, 0 replies; 93+ messages in thread
From: Christopher Li @ 2023-11-24  8:37 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Fri, Nov 24, 2023 at 12:15 AM Kairui Song <ryncsn@gmail.com> wrote:
>
>
> folio_put will uncharge a page if it is charged. In the original code the
> two folio_put calls simply free the page, since it is not charged. But in
> this patch, folio_put will cancel the previous
> mem_cgroup_swapin_charge_folio call, so the three
> mem_cgroup_swapin_charge_folio calls actually only charge once (two
> calls are cancelled by folio_put).

You are saying that in the original code folio_put() will free without
an uncharge, but with your patch folio_put() will free and also cancel
the mem_cgroup_swapin_charge_folio() charge. That is because
folio->memcg_data was not set in the first case and is set in the
second case?
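
(A minimal sketch of the two cases, assuming the usual
vma_alloc_folio()/mem_cgroup_swapin_charge_folio() calling pattern; the helper
name and surrounding structure here are illustrative, not the actual patch:)

static struct folio *swapin_alloc_and_charge(gfp_t gfp, struct vm_area_struct *vma,
                                             unsigned long addr, swp_entry_t entry)
{
        struct folio *folio = vma_alloc_folio(gfp, 0, vma, addr, false);

        if (!folio)
                return NULL;

        if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, gfp, entry)) {
                /* Charge failed: folio->memcg_data was never set, so this
                 * folio_put() only frees the folio. */
                folio_put(folio);
                return NULL;
        }

        /*
         * From here on folio->memcg_data is set, so any later error path
         * that drops the reference with folio_put() also uncharges the
         * folio, which is more expensive than the unlock check it replaced.
         */
        return folio;
}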

>
> I think this is indeed making it confusing and causing more trouble in
> the error path (the uncharge could be more expensive than the unlock
> check); I will rework this part.

Agree. Thanks.

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead
  2023-11-21 16:06   ` Chris Li
@ 2023-11-24  8:42     ` Kairui Song
  2023-11-24  9:10       ` Chris Li
  0 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-24  8:42 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Chris Li <chrisl@kernel.org> wrote on Wed, Nov 22, 2023 at 00:07:


>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > No feature change, just prepare for later commits.
>
> You need a proper commit message explaining why this change needs to happen.
> "Preparing" is too generic; it does not give any real information.
> For example, it seems you want to reduce one swap cache lookup because
> swap_readahead already has it?
>
> I am a bit puzzled by this patch. It shuffles a lot of sensitive code,
> but I do not see the value.
> It seems like this patch should be merged with the later patch that
> depends on it, so they can be judged together.
>
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/memory.c     | 61 +++++++++++++++++++++++--------------------------
> >  mm/swap.h       | 10 ++++++--
> >  mm/swap_state.c | 26 +++++++++++++--------
> >  mm/swapfile.c   | 30 +++++++++++-------------
> >  4 files changed, 66 insertions(+), 61 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f4237a2e3b93..22af9f3e8c75 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3786,13 +3786,13 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >  vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  {
> >         struct vm_area_struct *vma = vmf->vma;
> > -       struct folio *swapcache, *folio = NULL;
> > +       struct folio *swapcache = NULL, *folio = NULL;
> > +       enum swap_cache_result cache_result;
> >         struct page *page;
> >         struct swap_info_struct *si = NULL;
> >         rmap_t rmap_flags = RMAP_NONE;
> >         bool exclusive = false;
> >         swp_entry_t entry;
> > -       bool swapcached;
> >         pte_t pte;
> >         vm_fault_t ret = 0;
> >
> > @@ -3850,42 +3850,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         if (unlikely(!si))
> >                 goto out;
> >
> > -       folio = swap_cache_get_folio(entry, vma, vmf->address);
> > -       if (folio)
> > -               page = folio_file_page(folio, swp_offset(entry));
> > -       swapcache = folio;
>
> Is the motivation that swap_readahead() already has a swap cache look up so you
> remove this look up here?

Yes, the cache lookup is moved into and shared by swapin_readahead,
and this also makes it possible to use that lookup to return a shadow
when the entry has no page in the swap cache, so another shadow lookup
can be saved for the sync (ZRAM) swapin path. This helps improve ZRAM
performance by ~4% for a 10G ZRAM, and should improve more as the swap
cache tree grows larger.
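
(A condensed sketch of what the shared lookup enables, along the lines of the
swapin_readahead() hunk quoted near the top of this thread; the readahead
branches are omitted:)

        folio = swap_cache_get_folio(entry, vmf, &shadow);
        if (folio) {
                /* Swap cache hit: no readahead or shadow handling needed. */
                page = folio_file_page(folio, swp_offset(entry));
                cache_result = SWAP_CACHE_HIT;
        } else if (swap_use_no_readahead(si, swp_offset(entry))) {
                /*
                 * Miss on a SYNCHRONOUS_IO device (e.g. ZRAM): bypass the
                 * swap cache and reuse the shadow already returned by the
                 * lookup above instead of doing a second xarray walk.
                 */
                page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
                if (shadow)
                        workingset_refault(page_folio(page), shadow);
                cache_result = SWAP_CACHE_BYPASS;
        }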

>
> > -
> > -       if (!folio) {
> > -               page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > -                                       vmf, &swapcached);
> > -               if (page) {
> > -                       folio = page_folio(page);
> > -                       if (swapcached)
> > -                               swapcache = folio;
> > -               } else {
> > +       page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > +                               vmf, &cache_result);
> > +       if (page) {
> > +               folio = page_folio(page);
> > +               if (cache_result != SWAP_CACHE_HIT) {
> > +                       /* Had to read the page from swap area: Major fault */
> > +                       ret = VM_FAULT_MAJOR;
> > +                       count_vm_event(PGMAJFAULT);
> > +                       count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> > +               }
> > +               if (cache_result != SWAP_CACHE_BYPASS)
> > +                       swapcache = folio;
> > +               if (PageHWPoison(page)) {
>
> There is a lot of code shuffle here. From the diff it is hard to tell
> if they are doing the same thing as before.
>
> >                         /*
> > -                        * Back out if somebody else faulted in this pte
> > -                        * while we released the pte lock.
> > +                        * hwpoisoned dirty swapcache pages are kept for killing
> > +                        * owner processes (which may be unknown at hwpoison time)
> >                          */
> > -                       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > -                                       vmf->address, &vmf->ptl);
> > -                       if (likely(vmf->pte &&
> > -                                  pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > -                               ret = VM_FAULT_OOM;
> > -                       goto unlock;
> > +                       ret = VM_FAULT_HWPOISON;
> > +                       goto out_release;
> >                 }
> > -
> > -               /* Had to read the page from swap area: Major fault */
> > -               ret = VM_FAULT_MAJOR;
> > -               count_vm_event(PGMAJFAULT);
> > -               count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> > -       } else if (PageHWPoison(page)) {
> > +       } else {
> >                 /*
> > -                * hwpoisoned dirty swapcache pages are kept for killing
> > -                * owner processes (which may be unknown at hwpoison time)
> > +                * Back out if somebody else faulted in this pte
> > +                * while we released the pte lock.
> >                  */
> > -               ret = VM_FAULT_HWPOISON;
> > -               goto out_release;
> > +               vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > +                               vmf->address, &vmf->ptl);
> > +               if (likely(vmf->pte &&
> > +                          pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > +                       ret = VM_FAULT_OOM;
> > +               goto unlock;
> >         }
> >
> >         ret |= folio_lock_or_retry(folio, vmf);
> > diff --git a/mm/swap.h b/mm/swap.h
> > index a9a654af791e..ac9136eee690 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -30,6 +30,12 @@ extern struct address_space *swapper_spaces[];
> >         (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
> >                 >> SWAP_ADDRESS_SPACE_SHIFT])
> >
> > +enum swap_cache_result {
> > +       SWAP_CACHE_HIT,
> > +       SWAP_CACHE_MISS,
> > +       SWAP_CACHE_BYPASS,
> > +};
>
> Does any function later care about CACHE_BYPASS?
>
> Again, better introduce it with the function that uses it. Don't
> introduce it for "just in case I might use it later".

Yes, callers in shmem will also need to know whether the page came
from the swap cache, and need a value to indicate the bypass case. I
can add some comments here to describe the usage.
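
(For context, the consumer-side pattern is already visible in the
do_swap_page() hunk quoted above; a shmem caller would presumably make a
similar distinction. Condensed sketch:)

        page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf, &cache_result);
        if (page) {
                folio = page_folio(page);
                if (cache_result != SWAP_CACHE_HIT) {
                        /* Had to read the page from the swap area: major fault. */
                        ret = VM_FAULT_MAJOR;
                        count_vm_event(PGMAJFAULT);
                        count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
                }
                if (cache_result != SWAP_CACHE_BYPASS)
                        /* Only folios that actually sit in the swap cache need
                         * the swapcache-specific handling later on. */
                        swapcache = folio;
        }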

>
> > +
> >  void show_swap_cache_info(void);
> >  bool add_to_swap(struct folio *folio);
> >  void *get_shadow_from_swap_cache(swp_entry_t entry);
> > @@ -55,7 +61,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> >                                     struct mempolicy *mpol, pgoff_t ilx);
> >  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> > -                             struct vm_fault *vmf, bool *swapcached);
> > +                             struct vm_fault *vmf, enum swap_cache_result *result);
> >
> >  static inline unsigned int folio_swap_flags(struct folio *folio)
> >  {
> > @@ -92,7 +98,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
> >  }
> >
> >  static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> > -                       struct vm_fault *vmf, bool *swapcached)
> > +                       struct vm_fault *vmf, enum swap_cache_result *result)
> >  {
> >         return NULL;
> >  }
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index d87c20f9f7ec..e96d63bf8a22 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -908,8 +908,7 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >   * @entry: swap entry of this memory
> >   * @gfp_mask: memory allocation flags
> >   * @vmf: fault information
> > - * @swapcached: pointer to a bool used as indicator if the
> > - *              page is swapped in through swapcache.
> > + * @result: a return value to indicate swap cache usage.
> >   *
> >   * Returns the struct page for entry and addr, after queueing swapin.
> >   *
> > @@ -918,30 +917,39 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >   * or vma-based(ie, virtual address based on faulty address) readahead.
> >   */
> >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > -                             struct vm_fault *vmf, bool *swapcached)
> > +                             struct vm_fault *vmf, enum swap_cache_result *result)
> >  {
> > +       enum swap_cache_result cache_result;
> >         struct swap_info_struct *si;
> >         struct mempolicy *mpol;
> > +       struct folio *folio;
> >         struct page *page;
> >         pgoff_t ilx;
> > -       bool cached;
> > +
> > +       folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
> > +       if (folio) {
> > +               page = folio_file_page(folio, swp_offset(entry));
> > +               cache_result = SWAP_CACHE_HIT;
> > +               goto done;
> > +       }
> >
> >         si = swp_swap_info(entry);
> >         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> >         if (swap_use_no_readahead(si, swp_offset(entry))) {
> >                 page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> > -               cached = false;
> > +               cache_result = SWAP_CACHE_BYPASS;
> >         } else if (swap_use_vma_readahead(si)) {
> >                 page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> > -               cached = true;
> > +               cache_result = SWAP_CACHE_MISS;
> >         } else {
> >                 page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > -               cached = true;
> > +               cache_result = SWAP_CACHE_MISS;
> >         }
> >         mpol_cond_put(mpol);
> >
> > -       if (swapcached)
> > -               *swapcached = cached;
> > +done:
> > +       if (result)
> > +               *result = cache_result;
> >
> >         return page;
> >  }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 01c3f53b6521..b6d57fff5e21 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1822,13 +1822,21 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >
> >         si = swap_info[type];
> >         do {
> > -               struct folio *folio;
> > +               struct page *page;
> >                 unsigned long offset;
> >                 unsigned char swp_count;
> > +               struct folio *folio = NULL;
> >                 swp_entry_t entry;
> >                 int ret;
> >                 pte_t ptent;
> >
> > +               struct vm_fault vmf = {
> > +                       .vma = vma,
> > +                       .address = addr,
> > +                       .real_address = addr,
> > +                       .pmd = pmd,
> > +               };
>
> Is this code move caused by skipping the swap cache look up here?

Yes.

>
> This is very sensitive code related to swap cache racing. It needs
> very careful reviewing. Better not shuffle it for no good reason.

Thanks for the suggestion, I'll try to avoid this kind of shuffling,
but the cache lookup is moved into swapin_readahead so some changes in
the original caller are unavoidable...

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead
  2023-11-24  8:42     ` Kairui Song
@ 2023-11-24  9:10       ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-11-24  9:10 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Fri, Nov 24, 2023 at 12:42 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Chris Li <chrisl@kernel.org> wrote on Wed, Nov 22, 2023 at 00:07:
>
>
> >
> > On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > No feature change, just prepare for later commits.
> >
> > You need a proper commit message explaining why this change needs to happen.
> > "Preparing" is too generic; it does not give any real information.
> > For example, it seems you want to reduce one swap cache lookup because
> > swap_readahead already has it?
> >
> > I am a bit puzzled by this patch. It shuffles a lot of sensitive code,
> > but I do not see the value.
> > It seems like this patch should be merged with the later patch that
> > depends on it, so they can be judged together.
> >
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  mm/memory.c     | 61 +++++++++++++++++++++++--------------------------
> > >  mm/swap.h       | 10 ++++++--
> > >  mm/swap_state.c | 26 +++++++++++++--------
> > >  mm/swapfile.c   | 30 +++++++++++-------------
> > >  4 files changed, 66 insertions(+), 61 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index f4237a2e3b93..22af9f3e8c75 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3786,13 +3786,13 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> > >  vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >  {
> > >         struct vm_area_struct *vma = vmf->vma;
> > > -       struct folio *swapcache, *folio = NULL;
> > > +       struct folio *swapcache = NULL, *folio = NULL;
> > > +       enum swap_cache_result cache_result;
> > >         struct page *page;
> > >         struct swap_info_struct *si = NULL;
> > >         rmap_t rmap_flags = RMAP_NONE;
> > >         bool exclusive = false;
> > >         swp_entry_t entry;
> > > -       bool swapcached;
> > >         pte_t pte;
> > >         vm_fault_t ret = 0;
> > >
> > > @@ -3850,42 +3850,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >         if (unlikely(!si))
> > >                 goto out;
> > >
> > > -       folio = swap_cache_get_folio(entry, vma, vmf->address);
> > > -       if (folio)
> > > -               page = folio_file_page(folio, swp_offset(entry));
> > > -       swapcache = folio;
> >
> > Is the motivation that swap_readahead() already has a swap cache look up so you
> > remove this look up here?
>
> Yes, the cache lookup is moved into and shared by swapin_readahead,
> and this also makes it possible to use that lookup to return a shadow
> when the entry has no page in the swap cache, so another shadow lookup
> can be saved for the sync (ZRAM) swapin path. This helps improve ZRAM
> performance by ~4% for a 10G ZRAM, and should improve more as the swap
> cache tree grows larger.
>
> >
> > > -
> > > -       if (!folio) {
> > > -               page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > > -                                       vmf, &swapcached);
> > > -               if (page) {
> > > -                       folio = page_folio(page);
> > > -                       if (swapcached)
> > > -                               swapcache = folio;
> > > -               } else {
> > > +       page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > > +                               vmf, &cache_result);
> > > +       if (page) {
> > > +               folio = page_folio(page);
> > > +               if (cache_result != SWAP_CACHE_HIT) {
> > > +                       /* Had to read the page from swap area: Major fault */
> > > +                       ret = VM_FAULT_MAJOR;
> > > +                       count_vm_event(PGMAJFAULT);
> > > +                       count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> > > +               }
> > > +               if (cache_result != SWAP_CACHE_BYPASS)
> > > +                       swapcache = folio;
> > > +               if (PageHWPoison(page)) {
> >
> > There is a lot of code shuffle here. From the diff it is hard to tell
> > if they are doing the same thing as before.
> >
> > >                         /*
> > > -                        * Back out if somebody else faulted in this pte
> > > -                        * while we released the pte lock.
> > > +                        * hwpoisoned dirty swapcache pages are kept for killing
> > > +                        * owner processes (which may be unknown at hwpoison time)
> > >                          */
> > > -                       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > > -                                       vmf->address, &vmf->ptl);
> > > -                       if (likely(vmf->pte &&
> > > -                                  pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > > -                               ret = VM_FAULT_OOM;
> > > -                       goto unlock;
> > > +                       ret = VM_FAULT_HWPOISON;
> > > +                       goto out_release;
> > >                 }
> > > -
> > > -               /* Had to read the page from swap area: Major fault */
> > > -               ret = VM_FAULT_MAJOR;
> > > -               count_vm_event(PGMAJFAULT);
> > > -               count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> > > -       } else if (PageHWPoison(page)) {
> > > +       } else {
> > >                 /*
> > > -                * hwpoisoned dirty swapcache pages are kept for killing
> > > -                * owner processes (which may be unknown at hwpoison time)
> > > +                * Back out if somebody else faulted in this pte
> > > +                * while we released the pte lock.
> > >                  */
> > > -               ret = VM_FAULT_HWPOISON;
> > > -               goto out_release;
> > > +               vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > > +                               vmf->address, &vmf->ptl);
> > > +               if (likely(vmf->pte &&
> > > +                          pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > > +                       ret = VM_FAULT_OOM;
> > > +               goto unlock;
> > >         }
> > >
> > >         ret |= folio_lock_or_retry(folio, vmf);
> > > diff --git a/mm/swap.h b/mm/swap.h
> > > index a9a654af791e..ac9136eee690 100644
> > > --- a/mm/swap.h
> > > +++ b/mm/swap.h
> > > @@ -30,6 +30,12 @@ extern struct address_space *swapper_spaces[];
> > >         (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
> > >                 >> SWAP_ADDRESS_SPACE_SHIFT])
> > >
> > > +enum swap_cache_result {
> > > +       SWAP_CACHE_HIT,
> > > +       SWAP_CACHE_MISS,
> > > +       SWAP_CACHE_BYPASS,
> > > +};
> >
> > Does any function later care about CACHE_BYPASS?
> >
> > Again, better introduce it with the function that uses it. Don't
> > introduce it for "just in case I might use it later".
>
> Yes, callers in shmem will also need to know whether the page came
> from the swap cache, and need a value to indicate the bypass case. I
> can add some comments here to describe the usage.

I also commented on the later patch. Because you do the lookup without
the folio locked, the swap cache can change by the time you use the
"*result". I suspect some of the swap cache lookups will need to be
added back because of the race before locking. It would be better to
introduce this field together with the user side; that makes it easier
to reason about the usage in one patch.
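
(For reference, the kind of re-validation under the folio lock being referred
to, roughly as do_swap_page() does it in mainline; abbreviated:)

        ret |= folio_lock_or_retry(folio, vmf);
        if (ret & VM_FAULT_RETRY)
                goto out_release;

        if (swapcache) {
                /*
                 * Make sure folio_free_swap() or swapoff did not release the
                 * swapcache from under us; the page pin and the later
                 * pte_same() check are not enough to exclude that.
                 */
                if (unlikely(!folio_test_swapcache(folio) ||
                             page_swap_entry(page).val != entry.val))
                        goto out_page;
        }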

BTW, one way to flatten the development history of the series is to
flatten the branch into one big patch, then copy/paste from the big
patch to introduce each sub-patch step by step. That way the sub-patches
are closer to the latest version of the code. Just something for you to
consider.

>
> >
> > > +
> > >  void show_swap_cache_info(void);
> > >  bool add_to_swap(struct folio *folio);
> > >  void *get_shadow_from_swap_cache(swp_entry_t entry);
> > > @@ -55,7 +61,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > >  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> > >                                     struct mempolicy *mpol, pgoff_t ilx);
> > >  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> > > -                             struct vm_fault *vmf, bool *swapcached);
> > > +                             struct vm_fault *vmf, enum swap_cache_result *result);
> > >
> > >  static inline unsigned int folio_swap_flags(struct folio *folio)
> > >  {
> > > @@ -92,7 +98,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
> > >  }
> > >
> > >  static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> > > -                       struct vm_fault *vmf, bool *swapcached)
> > > +                       struct vm_fault *vmf, enum swap_cache_result *result)
> > >  {
> > >         return NULL;
> > >  }
> > > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > > index d87c20f9f7ec..e96d63bf8a22 100644
> > > --- a/mm/swap_state.c
> > > +++ b/mm/swap_state.c
> > > @@ -908,8 +908,7 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > >   * @entry: swap entry of this memory
> > >   * @gfp_mask: memory allocation flags
> > >   * @vmf: fault information
> > > - * @swapcached: pointer to a bool used as indicator if the
> > > - *              page is swapped in through swapcache.
> > > + * @result: a return value to indicate swap cache usage.
> > >   *
> > >   * Returns the struct page for entry and addr, after queueing swapin.
> > >   *
> > > @@ -918,30 +917,39 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > >   * or vma-based(ie, virtual address based on faulty address) readahead.
> > >   */
> > >  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > > -                             struct vm_fault *vmf, bool *swapcached)
> > > +                             struct vm_fault *vmf, enum swap_cache_result *result)
> > >  {
> > > +       enum swap_cache_result cache_result;
> > >         struct swap_info_struct *si;
> > >         struct mempolicy *mpol;
> > > +       struct folio *folio;
> > >         struct page *page;
> > >         pgoff_t ilx;
> > > -       bool cached;
> > > +
> > > +       folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
> > > +       if (folio) {
> > > +               page = folio_file_page(folio, swp_offset(entry));
> > > +               cache_result = SWAP_CACHE_HIT;
> > > +               goto done;
> > > +       }
> > >
> > >         si = swp_swap_info(entry);
> > >         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > >         if (swap_use_no_readahead(si, swp_offset(entry))) {
> > >                 page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> > > -               cached = false;
> > > +               cache_result = SWAP_CACHE_BYPASS;
> > >         } else if (swap_use_vma_readahead(si)) {
> > >                 page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> > > -               cached = true;
> > > +               cache_result = SWAP_CACHE_MISS;
> > >         } else {
> > >                 page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > > -               cached = true;
> > > +               cache_result = SWAP_CACHE_MISS;
> > >         }
> > >         mpol_cond_put(mpol);
> > >
> > > -       if (swapcached)
> > > -               *swapcached = cached;
> > > +done:
> > > +       if (result)
> > > +               *result = cache_result;
> > >
> > >         return page;
> > >  }
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 01c3f53b6521..b6d57fff5e21 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1822,13 +1822,21 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > >
> > >         si = swap_info[type];
> > >         do {
> > > -               struct folio *folio;
> > > +               struct page *page;
> > >                 unsigned long offset;
> > >                 unsigned char swp_count;
> > > +               struct folio *folio = NULL;
> > >                 swp_entry_t entry;
> > >                 int ret;
> > >                 pte_t ptent;
> > >
> > > +               struct vm_fault vmf = {
> > > +                       .vma = vma,
> > > +                       .address = addr,
> > > +                       .real_address = addr,
> > > +                       .pmd = pmd,
> > > +               };
> >
> > Is this code move caused by skipping the swap cache look up here?
>
> Yes.
>
> >
> > This is very sensitive code related to swap cache racing. It needs
> > very careful reviewing. Better not shuffle it for no good reason.
>
> Thanks for the suggestion, I'll try to avoid this kind of shuffling,
> but the cache lookup is moved into swapin_readahead so some changes in
> the original caller are unavoidable...
Yes, I agree it is sometimes unavoidable.

Chris

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 18/24] mm/swap: introduce a helper non fault swapin
  2023-11-22  4:40   ` Chris Li
@ 2023-11-28 11:22     ` Kairui Song
  2023-12-13  2:22       ` Chris Li
  0 siblings, 1 reply; 93+ messages in thread
From: Kairui Song @ 2023-11-28 11:22 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

Chris Li <chrisl@kernel.org> wrote on Wed, Nov 22, 2023 at 12:41:
>
> On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > There are two places where swapin is not directly caused by a page fault:
> > shmem swapin is invoked through the shmem mapping, and swapoff causes
> > swapin by walking the page table. Both used to construct a pseudo vmfault
> > struct for the swapin function.
> >
> > Shmem recently dropped the pseudo vmfault in commit ddc1a5cbc05d
> > ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). The
> > swapoff path is still using a pseudo vmfault.
> >
> > Introduce a helper for them both; this helps save stack usage for the
> > swapoff path, and helps apply a unified swapin cache and readahead policy check.
> >
> > Also prepare for follow up commits.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/shmem.c      | 51 ++++++++++++++++---------------------------------
> >  mm/swap.h       | 11 +++++++++++
> >  mm/swap_state.c | 38 ++++++++++++++++++++++++++++++++++++
> >  mm/swapfile.c   | 23 +++++++++++-----------
> >  4 files changed, 76 insertions(+), 47 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index f9ce4067c742..81d129aa66d1 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1565,22 +1565,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
> >  static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
> >                         pgoff_t index, unsigned int order, pgoff_t *ilx);
> >
> > -static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
> > -                       struct shmem_inode_info *info, pgoff_t index)
> > -{
> > -       struct mempolicy *mpol;
> > -       pgoff_t ilx;
> > -       struct page *page;
> > -
> > -       mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> > -       page = swap_cluster_readahead(swap, gfp, mpol, ilx);
> > -       mpol_cond_put(mpol);
> > -
> > -       if (!page)
> > -               return NULL;
> > -       return page_folio(page);
> > -}
> > -
>
> Nice. Thank you.
>
> >  /*
> >   * Make sure huge_gfp is always more limited than limit_gfp.
> >   * Some of the flags set permissions, while others set limitations.
> > @@ -1854,9 +1838,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >  {
> >         struct address_space *mapping = inode->i_mapping;
> >         struct shmem_inode_info *info = SHMEM_I(inode);
> > -       struct swap_info_struct *si;
> > +       enum swap_cache_result result;
> >         struct folio *folio = NULL;
> > +       struct mempolicy *mpol;
> > +       struct page *page;
> >         swp_entry_t swap;
> > +       pgoff_t ilx;
> >         int error;
> >
> >         VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> > @@ -1866,34 +1853,30 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >         if (is_poisoned_swp_entry(swap))
> >                 return -EIO;
> >
> > -       si = get_swap_device(swap);
> > -       if (!si) {
> > +       mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> > +       page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);

Hi Chris,

I've been trying to address these issues in V2. Most issues in the
other patches have a straightforward solution, some could be discussed
in separate series, but I came up with some thoughts here:

>
> Notice this "result" CAN be outdated. e.g. after this call, the swap
> cache can be changed by another thread generating the swap page fault
> and installing the folio into the swap cache or removing it.

This is true, and it seems a potential race also exists before this
series for the direct (no swapcache) swapin path (do_swap_page), if I
understand it correctly:

In the do_swap_page path, multiple processes could swap in the page at
the same time (a page mapped only once can still be shared by sub
threads), and they could get different folios. The later pte lock and
pte_same check are not enough, because while one process is not holding
the pte lock, another process could read the page in, swap_free the
entry, then swap the page out again reusing the same entry: an ABA
problem. The race is not likely to happen in reality, but it is
possible in theory.
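
(A rough timeline of the theoretical ABA case described above; this is only a
sketch of the interleaving, not an observed reproduction:)

        /*
         * Task A (do_swap_page)            Task B (shares the mapping)
         * ---------------------            ---------------------------
         * reads pte, sees swap entry E
         * starts direct (no swapcache)
         * swapin of E without pte lock
         *                                  faults E in, swap_free(E)
         *                                  page is reclaimed and swapped
         *                                  out again, and the allocator
         *                                  hands out entry E once more
         * takes pte lock, pte_same() on E
         * still passes -> installs a page
         * that no longer matches the data
         */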

The same issue exists for shmem: there are
shmem_confirm_swap/shmem_add_to_page_cache checks later to prevent
re-installing into the shmem mapping for direct swapin, but they are
also not enough. Another process could read the page in and swap it out
again using the same entry, so the mapping entry appears unchanged
during that time window. Again, very unlikely in reality, but not
impossible.

When the swapcache is used there is no such issue, since the swap lock
and swap_map are used to synchronize all readers, and while one reader
is still holding the folio, the entry is locked through the swapcache;
or, if a folio is removed from the swapcache, folio_test_swapcache will
fail and the reader can retry.

I'm trying to come up with better locking for direct swapin; am I
missing anything here? Correct me if I got it wrong...

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH 18/24] mm/swap: introduce a helper non fault swapin
  2023-11-28 11:22     ` Kairui Song
@ 2023-12-13  2:22       ` Chris Li
  0 siblings, 0 replies; 93+ messages in thread
From: Chris Li @ 2023-12-13  2:22 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Huang, Ying, David Hildenbrand,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	linux-kernel

On Tue, Nov 28, 2023 at 3:22 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> > >  /*
> > >   * Make sure huge_gfp is always more limited than limit_gfp.
> > >   * Some of the flags set permissions, while others set limitations.
> > > @@ -1854,9 +1838,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > >  {
> > >         struct address_space *mapping = inode->i_mapping;
> > >         struct shmem_inode_info *info = SHMEM_I(inode);
> > > -       struct swap_info_struct *si;
> > > +       enum swap_cache_result result;
> > >         struct folio *folio = NULL;
> > > +       struct mempolicy *mpol;
> > > +       struct page *page;
> > >         swp_entry_t swap;
> > > +       pgoff_t ilx;
> > >         int error;
> > >
> > >         VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> > > @@ -1866,34 +1853,30 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > >         if (is_poisoned_swp_entry(swap))
> > >                 return -EIO;
> > >
> > > -       si = get_swap_device(swap);
> > > -       if (!si) {
> > > +       mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> > > +       page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
>
> Hi Chris,
>
> I've been trying to address these issues in V2. Most issues in the
> other patches have a straightforward solution, some could be discussed
> in separate series, but I came up with some thoughts here:
>
> >
> > Notice this "result" CAN be outdated. e.g. after this call, the swap
> > cache can be changed by another thread generating the swap page fault
> > and installing the folio into the swap cache or removing it.
>
> This is true, and it seems a potential race also exists before this
> series for the direct (no swapcache) swapin path (do_swap_page), if I
> understand it correctly:

I just noticed I missed this email while I was cleaning up my email
archive. Sorry for the late reply. Traveling does not help either.

I am not aware of swap-in race bugs in the existing code. Races, yes.
If you discover a code path where a race causes a bug, please report
it.
>
> In the do_swap_page path, multiple processes could swap in the page at
> the same time (a page mapped only once can still be shared by sub
> threads), and they could get different folios. The later pte lock and
> pte_same check are not enough, because while one process is not holding
> the pte lock, another process could read the page in, swap_free the
> entry, then swap the page out again reusing the same entry: an ABA
> problem. The race is not likely to happen in reality, but it is
> possible in theory.

Have you taken into account that as long as the page is locked, its
swap cache state cannot change? I think the swap cache find-and-get
function will return the page locked, so the swapcache will not be
able to change the mapping as long as the page is still locked.

>
> The same issue exists for shmem: there are
> shmem_confirm_swap/shmem_add_to_page_cache checks later to prevent
> re-installing into the shmem mapping for direct swapin, but they are
> also not enough. Another process could read the page in and swap it out
> again using the same entry, so the mapping entry appears unchanged
> during that time window. Again, very unlikely in reality, but not
> impossible.

Please take another look with the page lock in mind, and report back
if you still think there is a race bug in the existing code. We can
then take a closer look at the concurrent call stacks that would
trigger the bug.

Chris

>
> When the swapcache is used there is no such issue, since the swap lock
> and swap_map are used to synchronize all readers, and while one reader
> is still holding the folio, the entry is locked through the swapcache;
> or, if a folio is removed from the swapcache, folio_test_swapcache will
> fail and the reader can retry.
>
> I'm trying to come up with better locking for direct swapin; am I
> missing anything here? Correct me if I got it wrong...
>

^ permalink raw reply	[flat|nested] 93+ messages in thread

end of thread, other threads:[~2023-12-13  2:23 UTC | newest]

Thread overview: 93+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-19 19:47 [PATCH 00/24] Swapin path refactor for optimization and bugfix Kairui Song
2023-11-19 19:47 ` [PATCH 01/24] mm/swap: fix a potential undefined behavior issue Kairui Song
2023-11-19 20:55   ` Matthew Wilcox
2023-11-20  3:35     ` Chris Li
2023-11-20 11:14       ` Kairui Song
2023-11-20 17:34         ` Chris Li
2023-11-19 19:47 ` [PATCH 02/24] mm/swapfile.c: add back some comment Kairui Song
2023-11-19 19:47 ` [PATCH 03/24] mm/swap: move no readahead swapin code to a stand alone helper Kairui Song
2023-11-19 21:00   ` Matthew Wilcox
2023-11-20 11:14     ` Kairui Song
2023-11-20 14:55   ` Dan Carpenter
2023-11-21  5:34   ` Chris Li
2023-11-22 17:33     ` Kairui Song
2023-11-19 19:47 ` [PATCH 04/24] mm/swap: avoid setting page lock bit and doing extra unlock check Kairui Song
2023-11-20  4:17   ` Chris Li
2023-11-20 11:15     ` Kairui Song
2023-11-20 17:44       ` Chris Li
2023-11-22 17:32         ` Kairui Song
2023-11-22 20:57           ` Chris Li
2023-11-24  8:14             ` Kairui Song
2023-11-24  8:37               ` Christopher Li
2023-11-19 19:47 ` [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead Kairui Song
2023-11-21  6:15   ` Chris Li
2023-11-21  6:35     ` Kairui Song
2023-11-21  7:41       ` Chris Li
2023-11-21  8:32         ` Kairui Song
2023-11-21 15:24           ` Chris Li
2023-11-19 19:47 ` [PATCH 06/24] swap: rework swapin_no_readahead arguments Kairui Song
2023-11-20  0:20   ` kernel test robot
2023-11-21  6:44   ` Chris Li
2023-11-23 10:51     ` Kairui Song
2023-11-19 19:47 ` [PATCH 07/24] mm/swap: move swap_count to header to be shared Kairui Song
2023-11-21  6:51   ` Chris Li
2023-11-21  7:03     ` Kairui Song
2023-11-19 19:47 ` [PATCH 08/24] mm/swap: check readahead policy per entry Kairui Song
2023-11-20  6:04   ` Huang, Ying
2023-11-20 11:17     ` Kairui Song
2023-11-21  1:10       ` Huang, Ying
2023-11-21  5:20         ` Chris Li
2023-11-21  5:13       ` Chris Li
2023-11-21  7:54   ` Chris Li
2023-11-23 10:52     ` Kairui Song
2023-11-19 19:47 ` [PATCH 09/24] mm/swap: inline __swap_count Kairui Song
2023-11-20  7:41   ` Huang, Ying
2023-11-21  8:02     ` Chris Li
2023-11-19 19:47 ` [PATCH 10/24] mm/swap: remove nr_rotate_swap and related code Kairui Song
2023-11-21 15:45   ` Chris Li
2023-11-19 19:47 ` [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead Kairui Song
2023-11-20  0:47   ` kernel test robot
2023-11-21 16:06   ` Chris Li
2023-11-24  8:42     ` Kairui Song
2023-11-24  9:10       ` Chris Li
2023-11-19 19:47 ` [PATCH 12/24] mm/swap: simplify arguments for swap_cache_get_folio Kairui Song
2023-11-21 16:36   ` Chris Li
2023-11-19 19:47 ` [PATCH 13/24] swap: simplify swap_cache_get_folio Kairui Song
2023-11-21 16:50   ` Chris Li
2023-11-19 19:47 ` [PATCH 14/24] mm/swap: do shadow lookup as well when doing swap cache lookup Kairui Song
2023-11-21 16:55   ` Chris Li
2023-11-19 19:47 ` [PATCH 15/24] mm/swap: avoid an duplicated swap cache lookup for SYNCHRONOUS_IO device Kairui Song
2023-11-21 17:15   ` Chris Li
2023-11-22 18:08     ` Kairui Song
2023-11-19 19:47 ` [PATCH 16/24] mm/swap: reduce scope of get_swap_device in swapin path Kairui Song
2023-11-19 21:12   ` Matthew Wilcox
2023-11-20 11:14     ` Kairui Song
2023-11-21 17:25   ` Chris Li
2023-11-22  0:36   ` Huang, Ying
2023-11-23 11:13     ` Kairui Song
2023-11-24  0:40       ` Huang, Ying
2023-11-19 19:47 ` [PATCH 17/24] mm/swap: fix false error when swapoff race with swapin Kairui Song
2023-11-19 19:47 ` [PATCH 18/24] mm/swap: introduce a helper non fault swapin Kairui Song
2023-11-20  1:07   ` kernel test robot
2023-11-22  4:40   ` Chris Li
2023-11-28 11:22     ` Kairui Song
2023-12-13  2:22       ` Chris Li
2023-11-19 19:47 ` [PATCH 19/24] shmem, swap: refactor error check on OOM or race Kairui Song
2023-11-20  7:04   ` Chris Li
2023-11-20 11:17     ` Kairui Song
2023-11-19 19:47 ` [PATCH 20/24] swap: simplify and make swap_find_cache static Kairui Song
2023-11-22  5:01   ` Chris Li
2023-11-19 19:47 ` [PATCH 21/24] swap: make swapin_readahead result checking argument mandatory Kairui Song
2023-11-22  5:15   ` Chris Li
2023-11-24  8:14     ` Kairui Song
2023-11-19 19:47 ` [PATCH 22/24] swap: make swap_cluster_readahead static Kairui Song
2023-11-22  5:20   ` Chris Li
2023-11-19 19:47 ` [PATCH 23/24] swap: fix multiple swap leak when after cgroup migrate Kairui Song
2023-11-20  7:35   ` Huang, Ying
2023-11-20 11:17     ` Kairui Song
2023-11-22  5:34       ` Chris Li
2023-11-19 19:47 ` [PATCH 24/24] mm/swap: change swapin_readahead to swapin_page_fault Kairui Song
2023-11-20 19:09 ` [PATCH 00/24] Swapin path refactor for optimization and bugfix Yosry Ahmed
2023-11-20 20:22   ` Chris Li
2023-11-22  6:46     ` Kairui Song
2023-11-22  6:43   ` Kairui Song
