* [PATCH v3 0/7] swapin refactor for optimization and unified readahead
@ 2024-01-29 17:54 Kairui Song
  2024-01-29 17:54 ` [PATCH v3 1/7] mm/swapfile.c: add back some comment Kairui Song
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Kairui Song @ 2024-01-29 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

This series tries to unify and clean up the swapin path, introduce minor
optimizations, and make both shmem and swapoff make use of the
SWP_SYNCHRONOUS_IO flag to skip readahead and the swap cache for better
performance.

Test results:
- swap out 10G of zero-filled data to ZRAM, then read it back in:
  Before: 11143285 us
  After:  10692644 us (+4.1%)

- swapping off a 10G ZRAM (lzo-rle) after the same workload:
  Before:
  time swapoff /dev/zram0
  real    0m12.337s
  user    0m0.001s
  sys     0m12.329s

  After:
  time swapoff /dev/zram0
  real    0m9.728s
  user    0m0.001s
  sys     0m9.719s

- shmem FIO test 1 on a Ryzen 5900HX:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs --size=960m \
    --ioengine=mmap --rw=randread --random_distribution=zipf:0.5 \
    --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

  Before:
    bw (  MiB/s): min= 1167, max= 1732, per=100.00%, avg=1460.82, stdev= 4.38, samples=9536
    iops        : min=298938, max=443557, avg=373964.41, stdev=1121.27, samples=9536
  After (+3.5%):
    bw (  MiB/s): min= 1285, max= 1738, per=100.00%, avg=1512.88, stdev= 4.34, samples=9456
    iops        : min=328957, max=445105, avg=387294.21, stdev=1111.15, samples=9456

- shmem FIO test 2 on a Ryzen 5900HX:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs --size=960m \
    --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
    --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

  Before:
    bw (  MiB/s): min= 5296, max= 7112, per=100.00%, avg=6131.93, stdev=17.09, samples=9536
    iops        : min=1355934, max=1820833, avg=1569769.11, stdev=4375.93, samples=9536
  After (+3.1%):
    bw (  MiB/s): min= 5466, max= 7173, per=100.00%, avg=6324.51, stdev=16.66, samples=9521
    iops        : min=1399355, max=1836435, avg=1619068.90, stdev=4263.94, samples=9521

- Some built objects are very slightly smaller (gcc 13.2.1):
./scripts/bloat-o-meter ./vmlinux ./vmlinux.new
add/remove: 4/2 grow/shrink: 1/10 up/down: 818/-983 (-165)
Function                                     old     new   delta
swapin_entry                                   -     482    +482
mm_counter                                     -     248    +248
shmem_swapin_folio                          1412    1468     +56
__pfx_swapin_entry                             -      16     +16
__pfx_mm_counter                               -      16     +16
__read_swap_cache_async                      738     736      -2
copy_present_pte                            1258    1249      -9
mem_cgroup_swapin_charge_folio               297     285     -12
__pfx_swapin_readahead                        16       -     -16
swap_cache_get_folio                         364     345     -19
do_anonymous_page                           1488    1458     -30
unuse_pte_range                              889     833     -56
free_p4d_range                               524     446     -78
restore_exclusive_pte                        937     822    -115
do_swap_page                                2969    2817    -152
swapin_readahead                             239       -    -239
copy_nonpresent_pte                         1478    1223    -255
Total: Before=26056243, After=26056078, chg -0.00%

V2: https://lore.kernel.org/linux-mm/20240102175338.62012-1-ryncsn@gmail.com/
Update from V2:
  - Many code path cleanups (merge swapin_entry with swapin_entry_mpol,
    drop the second param of mem_cgroup_swapin_charge_folio, and have
    swapin_entry take a pointer to folio instead of a pointer to boolean
    as its output value, to reduce LOC and logic), thanks to Huang, Ying.
  - Don't use cluster readahead for swapoff; the performance is worse
    than VMA readahead for NVMe.
  - Add a refactor patch for swap_cache_get_folio.

V1: https://lore.kernel.org/linux-mm/20231119194740.94101-1-ryncsn@gmail.com/T/
Update from V1:
  - Rebased on mm-unstable.
  - Remove behaviour-changing patches; they will be submitted in a
    separate series later.
  - Code style, naming and comment updates.
  - Thanks to Chris Li for the very detailed and helpful review of V1.
    Thanks to Matthew Wilcox and Huang Ying for helpful suggestions.

Kairui Song (7):
  mm/swapfile.c: add back some comment
  mm/swap: move no readahead swapin code to a stand-alone helper
  mm/swap: always account swapped in page into current memcg
  mm/swap: introduce swapin_entry for unified readahead policy
  mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO
  mm/swap, shmem: use unified swapin helper for shmem
  mm/swap: refactor swap_cache_get_folio

 include/linux/memcontrol.h |   4 +-
 mm/memcontrol.c            |   5 +-
 mm/memory.c                |  45 ++--------
 mm/shmem.c                 |  50 +++++++----
 mm/swap.h                  |  23 ++---
 mm/swap_state.c            | 176 ++++++++++++++++++++++++++-----------
 mm/swapfile.c              |  20 +++--
 7 files changed, 190 insertions(+), 133 deletions(-)

-- 
2.43.0



* [PATCH v3 1/7] mm/swapfile.c: add back some comment
  2024-01-29 17:54 [PATCH v3 0/7] swapin refactor for optimization and unified readahead Kairui Song
@ 2024-01-29 17:54 ` Kairui Song
  2024-01-29 17:54 ` [PATCH v3 2/7] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Kairui Song @ 2024-01-29 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Some useful comments were dropped by commit b56a2d8af914 ("mm: rid
swapoff of quadratic complexity"); add them back.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0008cd39af42..606d95b56304 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1881,6 +1881,17 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				folio = page_folio(page);
 		}
 		if (!folio) {
+			/*
+			 * The entry could have been freed, and will not
+			 * be reused since swapoff() already disabled
+			 * allocation from here, or alloc_page() failed.
+			 *
+			 * We don't hold the lock here, so the swap entry could
+			 * be SWAP_MAP_BAD (when the cluster is discarding).
+			 * Instead of failing out, we can just skip the swap
+			 * entry because swapoff will wait for the discard to
+			 * finish anyway.
+			 */
 			swp_count = READ_ONCE(si->swap_map[offset]);
 			if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
 				continue;
-- 
2.43.0



* [PATCH v3 2/7] mm/swap: move no readahead swapin code to a stand-alone helper
  2024-01-29 17:54 [PATCH v3 0/7] swapin refactor for optimization and unified readahead Kairui Song
  2024-01-29 17:54 ` [PATCH v3 1/7] mm/swapfile.c: add back some comment Kairui Song
@ 2024-01-29 17:54 ` Kairui Song
  2024-01-30  5:38   ` Huang, Ying
  2024-01-29 17:54 ` [PATCH v3 3/7] mm/swap: always account swapped in page into current memcg Kairui Song
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Kairui Song @ 2024-01-29 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

No feature change; simply move the routine to a standalone function to
be reused later. The error path handling is copied from the "out_page"
label to keep the code change minimal for easier reviewing.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 32 ++++----------------------------
 mm/swap.h       |  8 ++++++++
 mm/swap_state.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 28 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e1f4849463a..81dc9d467f4e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3803,7 +3803,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swp_entry_t entry;
 	pte_t pte;
 	vm_fault_t ret = 0;
-	void *shadow = NULL;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -3867,33 +3866,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/* skip swapcache */
-			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-						vma, vmf->address, false);
-			page = &folio->page;
-			if (folio) {
-				__folio_set_locked(folio);
-				__folio_set_swapbacked(folio);
-
-				if (mem_cgroup_swapin_charge_folio(folio,
-							vma->vm_mm, GFP_KERNEL,
-							entry)) {
-					ret = VM_FAULT_OOM;
-					goto out_page;
-				}
-				mem_cgroup_swapin_uncharge_swap(entry);
-
-				shadow = get_shadow_from_swap_cache(entry);
-				if (shadow)
-					workingset_refault(folio, shadow);
-
-				folio_add_lru(folio);
-
-				/* To provide entry to swap_read_folio() */
-				folio->swap = entry;
-				swap_read_folio(folio, true, NULL);
-				folio->private = NULL;
-			}
+			/* skip swapcache and readahead */
+			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
+			if (folio)
+				page = &folio->page;
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						vmf);
diff --git a/mm/swap.h b/mm/swap.h
index 758c46ca671e..83eab7b67e77 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -56,6 +56,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
 			      struct vm_fault *vmf);
+struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
+			    struct vm_fault *vmf);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -86,6 +88,12 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
 	return NULL;
 }
 
+struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
+			struct vm_fault *vmf)
+{
+	return NULL;
+}
+
 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 			struct vm_fault *vmf)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e671266ad772..645f5bcad123 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -861,6 +861,53 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	return folio;
 }
 
+/**
+ * swapin_direct - swap in a folio skipping swap cache and readahead
+ * @entry: swap entry of this memory
+ * @gfp_mask: memory allocation flags
+ * @vmf: fault information
+ *
+ * Returns the struct folio for entry and addr after the swap entry is read
+ * in.
+ */
+struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
+			    struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct folio *folio;
+	void *shadow = NULL;
+
+	/* skip swapcache */
+	folio = vma_alloc_folio(gfp_mask, 0,
+				vma, vmf->address, false);
+	if (folio) {
+		__folio_set_locked(folio);
+		__folio_set_swapbacked(folio);
+
+		if (mem_cgroup_swapin_charge_folio(folio,
+					vma->vm_mm, GFP_KERNEL,
+					entry)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			return NULL;
+		}
+		mem_cgroup_swapin_uncharge_swap(entry);
+
+		shadow = get_shadow_from_swap_cache(entry);
+		if (shadow)
+			workingset_refault(folio, shadow);
+
+		folio_add_lru(folio);
+
+		/* To provide entry to swap_read_folio() */
+		folio->swap = entry;
+		swap_read_folio(folio, true, NULL);
+		folio->private = NULL;
+	}
+
+	return folio;
+}
+
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
-- 
2.43.0



* [PATCH v3 3/7] mm/swap: always account swapped in page into current memcg
  2024-01-29 17:54 [PATCH v3 0/7] swapin refactor for optimization and unified readahead Kairui Song
  2024-01-29 17:54 ` [PATCH v3 1/7] mm/swapfile.c: add back some comment Kairui Song
  2024-01-29 17:54 ` [PATCH v3 2/7] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
@ 2024-01-29 17:54 ` Kairui Song
  2024-01-30  6:12   ` Huang, Ying
  2024-01-29 17:54 ` [PATCH v3 4/7] mm/swap: introduce swapin_entry for unified readahead policy Kairui Song
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Kairui Song @ 2024-01-29 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently, mem_cgroup_swapin_charge_folio is always called with
mm == NULL, except in swapin_direct.

swapin_direct is only used when swapin should skip readahead and
the swapcache (SWP_SYNCHRONOUS_IO). All other callers of
mem_cgroup_swapin_charge_folio are for swapin that should not skip
readahead and the cache.

This could cause swapin charging to behave differently depending
on the swap device, which is unexpected.

This is currently not happening because the only caller of
swapin_direct is the direct anon page fault path, where mm always
equals current->mm, but that will no longer be true if swapin_direct
is shared and has other callers (e.g. swapoff) that reuse the
readahead-skipping logic.

So make swapin_direct also pass NULL for mm, so swapin charging
will behave consistently and not be affected by the type of swap
device or the readahead policy.

After this, the second param of mem_cgroup_swapin_charge_folio is
never used, so it can be safely dropped.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/memcontrol.h | 4 ++--
 mm/memcontrol.c            | 5 ++---
 mm/swap_state.c            | 7 +++----
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20ff87f8e001..540590d80958 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -693,7 +693,7 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
 int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
 		long nr_pages);
 
-int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
+int mem_cgroup_swapin_charge_folio(struct folio *folio,
 				  gfp_t gfp, swp_entry_t entry);
 void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
 
@@ -1281,7 +1281,7 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
 }
 
 static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
-			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
+		gfp_t gfp, swp_entry_t entry)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4c8735e7c85..5852742df958 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7306,8 +7306,7 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
  *
  * Returns 0 on success. Otherwise, an error code is returned.
  */
-int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
-				  gfp_t gfp, swp_entry_t entry)
+int mem_cgroup_swapin_charge_folio(struct folio *folio, gfp_t gfp, swp_entry_t entry)
 {
 	struct mem_cgroup *memcg;
 	unsigned short id;
@@ -7320,7 +7319,7 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (!memcg || !css_tryget_online(&memcg->css))
-		memcg = get_mem_cgroup_from_mm(mm);
+		memcg = get_mem_cgroup_from_current();
 	rcu_read_unlock();
 
 	ret = charge_memcg(folio, memcg, gfp);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 645f5bcad123..a450d09fc0db 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -495,7 +495,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	__folio_set_locked(folio);
 	__folio_set_swapbacked(folio);
 
-	if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
+	if (mem_cgroup_swapin_charge_folio(folio, gfp_mask, entry))
 		goto fail_unlock;
 
 	/* May fail (-ENOMEM) if XArray node allocation failed. */
@@ -884,9 +884,8 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
 		__folio_set_locked(folio);
 		__folio_set_swapbacked(folio);
 
-		if (mem_cgroup_swapin_charge_folio(folio,
-					vma->vm_mm, GFP_KERNEL,
-					entry)) {
+		if (mem_cgroup_swapin_charge_folio(folio, GFP_KERNEL,
+						   entry)) {
 			folio_unlock(folio);
 			folio_put(folio);
 			return NULL;
-- 
2.43.0



* [PATCH v3 4/7] mm/swap: introduce swapin_entry for unified readahead policy
  2024-01-29 17:54 [PATCH v3 0/7] swapin refactor for optimization and unified readahead Kairui Song
                   ` (2 preceding siblings ...)
  2024-01-29 17:54 ` [PATCH v3 3/7] mm/swap: always account swapped in page into current memcg Kairui Song
@ 2024-01-29 17:54 ` Kairui Song
  2024-01-30  6:29   ` Huang, Ying
  2024-01-29 17:54 ` [PATCH v3 5/7] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO Kairui Song
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Kairui Song @ 2024-01-29 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Introduce swapin_entry, which merges swapin_readahead and swapin_direct,
making it the main entry point for swapping in pages, and use a unified
swapin readahead policy.

This commit makes swapoff use the new helper and skip readahead for
SWP_SYNCHRONOUS_IO devices, since readahead is not helpful there. Now
swapping off a 10G ZRAM (lzo-rle) after the same workload is faster,
since readahead is skipped and overhead is reduced.

Before:
time swapoff /dev/zram0
real    0m12.337s
user    0m0.001s
sys     0m12.329s

After:
time swapoff /dev/zram0
real    0m9.728s
user    0m0.001s
sys     0m9.719s
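
For reference, the ZRAM swap device measured above can be set up with
something like the following sketch (the exact disksize and the workload
that pushes ~10G of anonymous memory to swap are assumptions):

  modprobe zram num_devices=1
  echo lzo-rle > /sys/block/zram0/comp_algorithm
  echo 10G > /sys/block/zram0/disksize
  mkswap /dev/zram0
  swapon /dev/zram0
  # run the ~10G anonymous-memory workload so it gets swapped out, then:
  time swapoff /dev/zram0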

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 18 +++---------------
 mm/swap.h       | 16 ++++------------
 mm/swap_state.c | 40 ++++++++++++++++++++++++----------------
 mm/swapfile.c   |  7 ++-----
 4 files changed, 33 insertions(+), 48 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 81dc9d467f4e..8711f8a07039 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3864,20 +3864,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swapcache = folio;
 
 	if (!folio) {
-		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
-		    __swap_count(entry) == 1) {
-			/* skip swapcache and readahead */
-			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
-			if (folio)
-				page = &folio->page;
-		} else {
-			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						vmf);
-			if (page)
-				folio = page_folio(page);
-			swapcache = folio;
-		}
-
+		folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
+				     vmf, &swapcache);
 		if (!folio) {
 			/*
 			 * Back out if somebody else faulted in this pte
@@ -3890,11 +3878,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				ret = VM_FAULT_OOM;
 			goto unlock;
 		}
-
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+		page = folio_file_page(folio, swp_offset(entry));
 	} else if (PageHWPoison(page)) {
 		/*
 		 * hwpoisoned dirty swapcache pages are kept for killing
diff --git a/mm/swap.h b/mm/swap.h
index 83eab7b67e77..8f8185d3865c 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -54,10 +54,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
 		bool skip_if_exists);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
-struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
-			      struct vm_fault *vmf);
-struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
-			    struct vm_fault *vmf);
+struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
+			   struct vm_fault *vmf, struct folio **swapcached);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -88,14 +86,8 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
 	return NULL;
 }
 
-struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
-			struct vm_fault *vmf)
-{
-	return NULL;
-}
-
-static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf)
+static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
+			struct vm_fault *vmf, struct folio **swapcached)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index a450d09fc0db..5e06b2e140d4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -870,8 +870,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * Returns the struct folio for entry and addr after the swap entry is read
  * in.
  */
-struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
-			    struct vm_fault *vmf)
+static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
+				  struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
@@ -908,33 +908,41 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
 }
 
 /**
- * swapin_readahead - swap in pages in hope we need them soon
+ * swapin_entry - swap in a folio from swap entry
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @vmf: fault information
+ * @swapcache: set to the swapcache folio if swapcache is used
  *
  * Returns the struct page for entry and addr, after queueing swapin.
  *
- * It's a main entry function for swap readahead. By the configuration,
+ * It's the main entry function for swap in. By the configuration,
  * it will read ahead blocks by cluster-based(ie, physical disk based)
- * or vma-based(ie, virtual address based on faulty address) readahead.
+ * or vma-based(ie, virtual address based on faulty address) readahead,
+ * or skip the readahead(ie, ramdisk based swap device).
  */
-struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				struct vm_fault *vmf)
+struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
+			   struct vm_fault *vmf, struct folio **swapcache)
 {
 	struct mempolicy *mpol;
-	pgoff_t ilx;
 	struct folio *folio;
+	pgoff_t ilx;
 
-	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
-	folio = swap_use_vma_readahead() ?
-		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
-		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
-	mpol_cond_put(mpol);
+	if (data_race(swp_swap_info(entry)->flags & SWP_SYNCHRONOUS_IO) &&
+	    __swap_count(entry) == 1) {
+		folio = swapin_direct(entry, gfp_mask, vmf);
+	} else {
+		mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
+		if (swap_use_vma_readahead())
+			folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
+		else
+			folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+		mpol_cond_put(mpol);
+		if (swapcache)
+			*swapcache = folio;
+	}
 
-	if (!folio)
-		return NULL;
-	return folio_file_page(folio, swp_offset(entry));
+	return folio;
 }
 
 #ifdef CONFIG_SYSFS
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 606d95b56304..1cf7e72e19e3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1867,7 +1867,6 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 		folio = swap_cache_get_folio(entry, vma, addr);
 		if (!folio) {
-			struct page *page;
 			struct vm_fault vmf = {
 				.vma = vma,
 				.address = addr,
@@ -1875,10 +1874,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				.pmd = pmd,
 			};
 
-			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						&vmf);
-			if (page)
-				folio = page_folio(page);
+			folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
+					    &vmf, NULL);
 		}
 		if (!folio) {
 			/*
-- 
2.43.0



* [PATCH v3 5/7] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO
  2024-01-29 17:54 [PATCH v3 0/7] swapin refactor for optimization and unified readahead Kairui Song
                   ` (3 preceding siblings ...)
  2024-01-29 17:54 ` [PATCH v3 4/7] mm/swap: introduce swapin_entry for unified readahead policy Kairui Song
@ 2024-01-29 17:54 ` Kairui Song
  2024-01-30  6:51   ` Huang, Ying
  2024-01-29 17:54 ` [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem Kairui Song
  2024-01-29 17:54 ` [PATCH v3 7/7] mm/swap: refactor swap_cache_get_folio Kairui Song
  6 siblings, 1 reply; 20+ messages in thread
From: Kairui Song @ 2024-01-29 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

When an xa_value is returned by the cache lookup, keep it to be used
later for the workingset refault check instead of looking it up again
in swapin_direct.

Shadow lookup and the workingset check are skipped for swapoff to reduce
overhead; workingset checking for anon pages upon swapoff is not
helpful, and simply considering all pages inactive makes more sense,
since swapoff doesn't mean the pages are being accessed.

After this commit, swapin is about 4% faster for ZRAM. Micro benchmark
result, using madvise to swap out 10G of zero-filled data to ZRAM and
then read it back in:

Before: 11143285 us
After:  10692644 us (+4.1%)

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     |  5 +++--
 mm/shmem.c      |  2 +-
 mm/swap.h       | 11 ++++++-----
 mm/swap_state.c | 23 +++++++++++++----------
 mm/swapfile.c   |  4 ++--
 5 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8711f8a07039..349946899f8d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3800,6 +3800,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
 	bool exclusive = false;
+	void *shadow = NULL;
 	swp_entry_t entry;
 	pte_t pte;
 	vm_fault_t ret = 0;
@@ -3858,14 +3859,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(!si))
 		goto out;
 
-	folio = swap_cache_get_folio(entry, vma, vmf->address);
+	folio = swap_cache_get_folio(entry, vma, vmf->address, &shadow);
 	if (folio)
 		page = folio_file_page(folio, swp_offset(entry));
 	swapcache = folio;
 
 	if (!folio) {
 		folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
-				     vmf, &swapcache);
+				     vmf, &swapcache, shadow);
 		if (!folio) {
 			/*
 			 * Back out if somebody else faulted in this pte
diff --git a/mm/shmem.c b/mm/shmem.c
index d7c84ff62186..698a31bf7baa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1873,7 +1873,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 
 	/* Look it up and read it in.. */
-	folio = swap_cache_get_folio(swap, NULL, 0);
+	folio = swap_cache_get_folio(swap, NULL, 0, NULL);
 	if (!folio) {
 		/* Or update major stats only when swapin succeeds?? */
 		if (fault_type) {
diff --git a/mm/swap.h b/mm/swap.h
index 8f8185d3865c..ca9cb472a263 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -42,7 +42,8 @@ void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
 struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr);
+		struct vm_area_struct *vma, unsigned long addr,
+		void **shadowp);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
 		pgoff_t index);
 
@@ -54,8 +55,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
 		bool skip_if_exists);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
-struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
-			   struct vm_fault *vmf, struct folio **swapcached);
+struct folio *swapin_entry(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf,
+		struct folio **swapcached, void *shadow);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -87,7 +88,7 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
 }
 
 static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf, struct folio **swapcached)
+			struct vm_fault *vmf, struct folio **swapcached, void *shadow)
 {
 	return NULL;
 }
@@ -98,7 +99,7 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
 }
 
 static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr)
+		struct vm_area_struct *vma, unsigned long addr, void **shadowp)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 5e06b2e140d4..e41a137a6123 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -330,12 +330,18 @@ static inline bool swap_use_vma_readahead(void)
  * Caller must lock the swap device or hold a reference to keep it valid.
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr)
+		struct vm_area_struct *vma, unsigned long addr, void **shadowp)
 {
 	struct folio *folio;
 
-	folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
-	if (!IS_ERR(folio)) {
+	folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
+	if (xa_is_value(folio)) {
+		if (shadowp)
+			*shadowp = folio;
+		return NULL;
+	}
+
+	if (folio) {
 		bool vma_ra = swap_use_vma_readahead();
 		bool readahead;
 
@@ -365,8 +371,6 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
 			if (!vma || !vma_ra)
 				atomic_inc(&swapin_readahead_hits);
 		}
-	} else {
-		folio = NULL;
 	}
 
 	return folio;
@@ -866,16 +870,16 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @vmf: fault information
+ * @shadow: workingset shadow corresponding to entry
  *
  * Returns the struct folio for entry and addr after the swap entry is read
  * in.
  */
 static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
-				  struct vm_fault *vmf)
+				  struct vm_fault *vmf, void *shadow)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
-	void *shadow = NULL;
 
 	/* skip swapcache */
 	folio = vma_alloc_folio(gfp_mask, 0,
@@ -892,7 +896,6 @@ static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
 		}
 		mem_cgroup_swapin_uncharge_swap(entry);
 
-		shadow = get_shadow_from_swap_cache(entry);
 		if (shadow)
 			workingset_refault(folio, shadow);
 
@@ -922,7 +925,7 @@ static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
  * or skip the readahead(ie, ramdisk based swap device).
  */
 struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
-			   struct vm_fault *vmf, struct folio **swapcache)
+			   struct vm_fault *vmf, struct folio **swapcache, void *shadow)
 {
 	struct mempolicy *mpol;
 	struct folio *folio;
@@ -930,7 +933,7 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
 
 	if (data_race(swp_swap_info(entry)->flags & SWP_SYNCHRONOUS_IO) &&
 	    __swap_count(entry) == 1) {
-		folio = swapin_direct(entry, gfp_mask, vmf);
+		folio = swapin_direct(entry, gfp_mask, vmf, shadow);
 	} else {
 		mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 		if (swap_use_vma_readahead())
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1cf7e72e19e3..aac26f5a6cec 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1865,7 +1865,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		pte_unmap(pte);
 		pte = NULL;
 
-		folio = swap_cache_get_folio(entry, vma, addr);
+		folio = swap_cache_get_folio(entry, vma, addr, NULL);
 		if (!folio) {
 			struct vm_fault vmf = {
 				.vma = vma,
@@ -1875,7 +1875,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			};
 
 			folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
-					    &vmf, NULL);
+					    &vmf, NULL, NULL);
 		}
 		if (!folio) {
 			/*
-- 
2.43.0



* [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem
  2024-01-29 17:54 [PATCH v3 0/7] swapin refactor for optimization and unified readahead Kairui Song
                   ` (4 preceding siblings ...)
  2024-01-29 17:54 ` [PATCH v3 5/7] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO Kairui Song
@ 2024-01-29 17:54 ` Kairui Song
  2024-01-31  2:51   ` Whether is the race for SWP_SYNCHRONOUS_IO possible? (was Re: [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem) Huang, Ying
  2024-01-29 17:54 ` [PATCH v3 7/7] mm/swap: refactor swap_cache_get_folio Kairui Song
  6 siblings, 1 reply; 20+ messages in thread
From: Kairui Song @ 2024-01-29 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently, shmem uses cluster readahead for all swap backends. Cluster
readahead is not a good fit for ramdisk-based devices (e.g. ZRAM), and
it's better to skip it.

After switching to the new helper, most benchmarks showed good results:

- Single file sequential read on ramdisk:
  perf stat --repeat 20 dd if=/tmpfs/test of=/dev/null bs=1M count=8192
  (/tmpfs/test is a zero-filled file, using brd as swap, 4G memcg limit)
  Before: 22.248 +- 0.549
  After:  22.021 +- 0.684 (-1.1%)

- shmem FIO test 1 on a Ryzen 5900HX:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs --size=960m \
    --ioengine=mmap --rw=randread --random_distribution=zipf:0.5 \
    --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

Before:
  bw (  MiB/s): min= 1167, max= 1732, per=100.00%, avg=1460.82, stdev= 4.38, samples=9536
  iops        : min=298938, max=443557, avg=373964.41, stdev=1121.27, samples=9536
After (+3.5%):
  bw (  MiB/s): min= 1285, max= 1738, per=100.00%, avg=1512.88, stdev= 4.34, samples=9456
  iops        : min=328957, max=445105, avg=387294.21, stdev=1111.15, samples=9456

- shmem FIO test 2 on a Ryzen 5900HX:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs --size=960m \
    --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
    --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

Before:
  bw (  MiB/s): min= 5296, max= 7112, per=100.00%, avg=6131.93, stdev=17.09, samples=9536
  iops        : min=1355934, max=1820833, avg=1569769.11, stdev=4375.93, samples=9536
After (+3.1%):
  bw (  MiB/s): min= 5466, max= 7173, per=100.00%, avg=6324.51, stdev=16.66, samples=9521
  iops        : min=1399355, max=1836435, avg=1619068.90, stdev=4263.94, samples=9521

So cluster readahead doesn't help much even for a single sequential
read, and for the random stress tests the performance is better without
it.

Considering that both memory and swap devices will slowly get more
fragmented, and that the commonly used ZRAM consumes much more CPU than
a plain ramdisk, false readahead could occur more frequently and waste
more CPU. Direct swapin is cheaper, so use the new helper and skip
readahead for SWP_SYNCHRONOUS_IO devices.
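
For reference, the brd-as-swap, memcg-limited environment assumed by the
numbers above can be reproduced roughly as follows (ramdisk size, mount
point and cgroup paths are assumptions; cgroup v2 is assumed):

  # brd ramdisk used as swap; rd_size is in KiB (~10G here)
  modprobe brd rd_nr=1 rd_size=10485760
  mkswap /dev/ram0
  swapon /dev/ram0
  # tmpfs under a 2G (or 4G for the dd test) memory cgroup limit
  mount -t tmpfs tmpfs /tmpfs
  mkdir /sys/fs/cgroup/test
  echo 2G > /sys/fs/cgroup/test/memory.max
  echo $$ > /sys/fs/cgroup/test/cgroup.procs
  # then run the dd / fio commands above from this shell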

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     |  2 +-
 mm/shmem.c      | 50 +++++++++++++++++++++++++++++++----------------
 mm/swap.h       | 14 ++++---------
 mm/swap_state.c | 52 +++++++++++++++++++++++++++++++++----------------
 mm/swapfile.c   |  2 +-
 5 files changed, 74 insertions(+), 46 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 349946899f8d..51962126a79c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3866,7 +3866,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	if (!folio) {
 		folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
-				     vmf, &swapcache, shadow);
+				     vmf, NULL, 0, &swapcache, shadow);
 		if (!folio) {
 			/*
 			 * Back out if somebody else faulted in this pte
diff --git a/mm/shmem.c b/mm/shmem.c
index 698a31bf7baa..d3722e25cb32 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1565,15 +1565,16 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
 static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
 			pgoff_t index, unsigned int order, pgoff_t *ilx);
 
-static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
-			struct shmem_inode_info *info, pgoff_t index)
+static struct folio *shmem_swapin(swp_entry_t swap, gfp_t gfp,
+			struct shmem_inode_info *info, pgoff_t index,
+			struct folio **swapcache, void *shadow)
 {
 	struct mempolicy *mpol;
 	pgoff_t ilx;
 	struct folio *folio;
 
 	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
-	folio = swap_cluster_readahead(swap, gfp, mpol, ilx);
+	folio = swapin_entry(swap, gfp, NULL, mpol, ilx, swapcache, shadow);
 	mpol_cond_put(mpol);
 
 	return folio;
@@ -1852,8 +1853,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct folio *swapcache = NULL, *folio;
 	struct swap_info_struct *si;
-	struct folio *folio = NULL;
+	void *shadow = NULL;
 	swp_entry_t swap;
 	int error;
 
@@ -1873,8 +1875,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 
 	/* Look it up and read it in.. */
-	folio = swap_cache_get_folio(swap, NULL, 0, NULL);
-	if (!folio) {
+	folio = swap_cache_get_folio(swap, NULL, 0, &shadow);
+	if (folio) {
+		swapcache = folio;
+	} else {
 		/* Or update major stats only when swapin succeeds?? */
 		if (fault_type) {
 			*fault_type |= VM_FAULT_MAJOR;
@@ -1882,7 +1886,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			count_memcg_event_mm(fault_mm, PGMAJFAULT);
 		}
 		/* Here we actually start the io */
-		folio = shmem_swapin_cluster(swap, gfp, info, index);
+		folio = shmem_swapin(swap, gfp, info, index, &swapcache, shadow);
 		if (!folio) {
 			error = -ENOMEM;
 			goto failed;
@@ -1891,17 +1895,21 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
-	if (!folio_test_swapcache(folio) ||
-	    folio->swap.val != swap.val ||
-	    !shmem_confirm_swap(mapping, index, swap)) {
+	if (swapcache) {
+		if (!folio_test_swapcache(folio) || folio->swap.val != swap.val) {
+			error = -EEXIST;
+			goto unlock;
+		}
+		if (!folio_test_uptodate(folio)) {
+			error = -EIO;
+			goto failed;
+		}
+		folio_wait_writeback(folio);
+	}
+	if (!shmem_confirm_swap(mapping, index, swap)) {
 		error = -EEXIST;
 		goto unlock;
 	}
-	if (!folio_test_uptodate(folio)) {
-		error = -EIO;
-		goto failed;
-	}
-	folio_wait_writeback(folio);
 
 	/*
 	 * Some architectures may have to restore extra metadata to the
@@ -1909,12 +1917,19 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	 */
 	arch_swap_restore(swap, folio);
 
-	if (shmem_should_replace_folio(folio, gfp)) {
+	/* If swapcache is bypassed, the folio is newly allocated and respects gfp flags */
+	if (swapcache && shmem_should_replace_folio(folio, gfp)) {
 		error = shmem_replace_folio(&folio, gfp, info, index);
 		if (error)
 			goto failed;
 	}
 
+	/*
+	 * The expected value checking below should be enough to ensure
+	 * only one up-to-date swapin success. swap_free() is called after
+	 * this, so the entry can't be reused. As long as the mapping still
+	 * has the old entry value, it's never swapped in or modified.
+	 */
 	error = shmem_add_to_page_cache(folio, mapping, index,
 					swp_to_radix_entry(swap), gfp);
 	if (error)
@@ -1925,7 +1940,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (sgp == SGP_WRITE)
 		folio_mark_accessed(folio);
 
-	delete_from_swap_cache(folio);
+	if (swapcache)
+		delete_from_swap_cache(folio);
 	folio_mark_dirty(folio);
 	swap_free(swap);
 	put_swap_device(si);
diff --git a/mm/swap.h b/mm/swap.h
index ca9cb472a263..597a56c7fb02 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -53,10 +53,9 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
 		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
 		bool skip_if_exists);
-struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
-		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_entry(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf,
-		struct folio **swapcached, void *shadow);
+		struct mempolicy *mpol, pgoff_t ilx,
+		struct folio **swapcache, void *shadow);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -81,14 +80,9 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
-			gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx)
-{
-	return NULL;
-}
-
 static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf, struct folio **swapcached, void *shadow)
+			struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx,
+			struct folio **swapcache, void *shadow)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e41a137a6123..20c206149be4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -316,6 +316,18 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 	release_pages(pages, nr);
 }
 
+static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
+{
+	int count;
+
+	if (!data_race(si->flags & SWP_SYNCHRONOUS_IO))
+		return false;
+
+	count = __swap_count(entry);
+
+	return (count == 1 || count == SWAP_MAP_SHMEM);
+}
+
 static inline bool swap_use_vma_readahead(void)
 {
 	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
@@ -635,8 +647,8 @@ static unsigned long swapin_nr_pages(unsigned long offset)
  * are used for every page of the readahead: neighbouring pages on swap
  * are fairly likely to have been swapped out from the same node.
  */
-struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				    struct mempolicy *mpol, pgoff_t ilx)
+static struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
+					    struct mempolicy *mpol, pgoff_t ilx)
 {
 	struct folio *folio;
 	unsigned long entry_offset = swp_offset(entry);
@@ -876,14 +888,13 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * in.
  */
 static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
-				  struct vm_fault *vmf, void *shadow)
+				   struct mempolicy *mpol, pgoff_t ilx,
+				   void *shadow)
 {
-	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
 
-	/* skip swapcache */
-	folio = vma_alloc_folio(gfp_mask, 0,
-				vma, vmf->address, false);
+	folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0,
+			mpol, ilx, numa_node_id());
 	if (folio) {
 		__folio_set_locked(folio);
 		__folio_set_swapbacked(folio);
@@ -916,6 +927,10 @@ static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
  * @gfp_mask: memory allocation flags
  * @vmf: fault information
  * @swapcache: set to the swapcache folio if swapcache is used
+ * @mpol: NUMA memory alloc policy to be applied,
+ *        not needed if vmf is not NULL
+ * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE,
+ *            not needed if vmf is not NULL
  *
  * Returns the struct page for entry and addr, after queueing swapin.
  *
@@ -924,26 +939,29 @@ static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
  * or vma-based(ie, virtual address based on faulty address) readahead,
  * or skip the readahead(ie, ramdisk based swap device).
  */
-struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
-			   struct vm_fault *vmf, struct folio **swapcache, void *shadow)
+struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask, struct vm_fault *vmf,
+			   struct mempolicy *mpol, pgoff_t ilx,
+			   struct folio **swapcache, void *shadow)
 {
-	struct mempolicy *mpol;
+	bool mpol_put = false;
 	struct folio *folio;
-	pgoff_t ilx;
 
-	if (data_race(swp_swap_info(entry)->flags & SWP_SYNCHRONOUS_IO) &&
-	    __swap_count(entry) == 1) {
-		folio = swapin_direct(entry, gfp_mask, vmf, shadow);
-	} else {
+	if (!mpol) {
 		mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
-		if (swap_use_vma_readahead())
+		mpol_put = true;
+	}
+	if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
+		folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
+	} else {
+		if (vmf && swap_use_vma_readahead())
 			folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
 		else
 			folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
-		mpol_cond_put(mpol);
 		if (swapcache)
 			*swapcache = folio;
 	}
+	if (mpol_put)
+		mpol_cond_put(mpol);
 
 	return folio;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index aac26f5a6cec..7ff05aaf6925 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1875,7 +1875,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			};
 
 			folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
-					    &vmf, NULL, NULL);
+					    &vmf, NULL, 0, NULL, NULL);
 		}
 		if (!folio) {
 			/*
-- 
2.43.0



* [PATCH v3 7/7] mm/swap: refactor swap_cache_get_folio
  2024-01-29 17:54 [PATCH v3 0/7] swapin refactor for optimization and unified readahead Kairui Song
                   ` (5 preceding siblings ...)
  2024-01-29 17:54 ` [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem Kairui Song
@ 2024-01-29 17:54 ` Kairui Song
  6 siblings, 0 replies; 20+ messages in thread
From: Kairui Song @ 2024-01-29 17:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

No feature change; rework the code layout to reduce object size and
remove redundant indentation.

With gcc 13.2.1:

./scripts/bloat-o-meter mm/swap_state.o.old mm/swap_state.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-35 (-35)
Function                                     old     new   delta
swap_cache_get_folio                         380     345     -35
Total: Before=8785, After=8750, chg -0.40%
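
The comparison above can be reproduced with something like the following
sketch (the build and patch-application steps are assumptions):

  make mm/swap_state.o
  cp mm/swap_state.o mm/swap_state.o.old
  # apply this patch, then rebuild the same object
  make mm/swap_state.o
  ./scripts/bloat-o-meter mm/swap_state.o.old mm/swap_state.o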

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 59 ++++++++++++++++++++++++-------------------------
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 20c206149be4..2f809b69b65a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -341,9 +341,10 @@ static inline bool swap_use_vma_readahead(void)
  *
  * Caller must lock the swap device or hold a reference to keep it valid.
  */
-struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr, void **shadowp)
+struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_area_struct *vma,
+				   unsigned long addr, void **shadowp)
 {
+	bool vma_ra, readahead;
 	struct folio *folio;
 
 	folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
@@ -352,37 +353,35 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
 			*shadowp = folio;
 		return NULL;
 	}
+	if (!folio)
+		return NULL;
 
-	if (folio) {
-		bool vma_ra = swap_use_vma_readahead();
-		bool readahead;
+	/*
+	 * At the moment, we don't support PG_readahead for anon THP
+	 * so let's bail out rather than confusing the readahead stat.
+	 */
+	if (unlikely(folio_test_large(folio)))
+		return folio;
 
-		/*
-		 * At the moment, we don't support PG_readahead for anon THP
-		 * so let's bail out rather than confusing the readahead stat.
-		 */
-		if (unlikely(folio_test_large(folio)))
-			return folio;
-
-		readahead = folio_test_clear_readahead(folio);
-		if (vma && vma_ra) {
-			unsigned long ra_val;
-			int win, hits;
-
-			ra_val = GET_SWAP_RA_VAL(vma);
-			win = SWAP_RA_WIN(ra_val);
-			hits = SWAP_RA_HITS(ra_val);
-			if (readahead)
-				hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
-			atomic_long_set(&vma->swap_readahead_info,
-					SWAP_RA_VAL(addr, win, hits));
-		}
+	vma_ra = swap_use_vma_readahead();
+	readahead = folio_test_clear_readahead(folio);
+	if (vma && vma_ra) {
+		unsigned long ra_val;
+		int win, hits;
+
+		ra_val = GET_SWAP_RA_VAL(vma);
+		win = SWAP_RA_WIN(ra_val);
+		hits = SWAP_RA_HITS(ra_val);
+		if (readahead)
+			hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
+		atomic_long_set(&vma->swap_readahead_info,
+				SWAP_RA_VAL(addr, win, hits));
+	}
 
-		if (readahead) {
-			count_vm_event(SWAP_RA_HIT);
-			if (!vma || !vma_ra)
-				atomic_inc(&swapin_readahead_hits);
-		}
+	if (readahead) {
+		count_vm_event(SWAP_RA_HIT);
+		if (!vma || !vma_ra)
+			atomic_inc(&swapin_readahead_hits);
 	}
 
 	return folio;
-- 
2.43.0



* Re: [PATCH v3 2/7] mm/swap: move no readahead swapin code to a stand-alone helper
  2024-01-29 17:54 ` [PATCH v3 2/7] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
@ 2024-01-30  5:38   ` Huang, Ying
  2024-01-30  5:55     ` Kairui Song
  0 siblings, 1 reply; 20+ messages in thread
From: Huang, Ying @ 2024-01-30  5:38 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> No feature change, simply move the routine to a standalone function to
> be re-used later. The error path handling is copied from the "out_page"
> label, to make the code change minimized for easier reviewing.

The error processing for mem_cgroup_swapin_charge_folio() failure is
changed a little.  That looks OK to me.  But you need to make it
explicit in the change log.  In particular, it's not strictly "no
feature change".

--
Best Regards,
Huang, Ying

> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c     | 32 ++++----------------------------
>  mm/swap.h       |  8 ++++++++
>  mm/swap_state.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 59 insertions(+), 28 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e1f4849463a..81dc9d467f4e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3803,7 +3803,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	swp_entry_t entry;
>  	pte_t pte;
>  	vm_fault_t ret = 0;
> -	void *shadow = NULL;
>  
>  	if (!pte_unmap_same(vmf))
>  		goto out;
> @@ -3867,33 +3866,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!folio) {
>  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>  		    __swap_count(entry) == 1) {
> -			/* skip swapcache */
> -			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -						vma, vmf->address, false);
> -			page = &folio->page;
> -			if (folio) {
> -				__folio_set_locked(folio);
> -				__folio_set_swapbacked(folio);
> -
> -				if (mem_cgroup_swapin_charge_folio(folio,
> -							vma->vm_mm, GFP_KERNEL,
> -							entry)) {
> -					ret = VM_FAULT_OOM;
> -					goto out_page;
> -				}
> -				mem_cgroup_swapin_uncharge_swap(entry);
> -
> -				shadow = get_shadow_from_swap_cache(entry);
> -				if (shadow)
> -					workingset_refault(folio, shadow);
> -
> -				folio_add_lru(folio);
> -
> -				/* To provide entry to swap_read_folio() */
> -				folio->swap = entry;
> -				swap_read_folio(folio, true, NULL);
> -				folio->private = NULL;
> -			}
> +			/* skip swapcache and readahead */
> +			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
> +			if (folio)
> +				page = &folio->page;
>  		} else {
>  			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
>  						vmf);
> diff --git a/mm/swap.h b/mm/swap.h
> index 758c46ca671e..83eab7b67e77 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -56,6 +56,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>  		struct mempolicy *mpol, pgoff_t ilx);
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
>  			      struct vm_fault *vmf);
> +struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> +			    struct vm_fault *vmf);
>  
>  static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
> @@ -86,6 +88,12 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
>  	return NULL;
>  }
>  
> +struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> +			struct vm_fault *vmf)
> +{
> +	return NULL;
> +}
> +
>  static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
>  			struct vm_fault *vmf)
>  {
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index e671266ad772..645f5bcad123 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -861,6 +861,53 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>  	return folio;
>  }
>  
> +/**
> + * swapin_direct - swap in a folio skipping swap cache and readahead
> + * @entry: swap entry of this memory
> + * @gfp_mask: memory allocation flags
> + * @vmf: fault information
> + *
> + * Returns the struct folio for entry and addr after the swap entry is read
> + * in.
> + */
> +struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> +			    struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct folio *folio;
> +	void *shadow = NULL;
> +
> +	/* skip swapcache */
> +	folio = vma_alloc_folio(gfp_mask, 0,
> +				vma, vmf->address, false);
> +	if (folio) {
> +		__folio_set_locked(folio);
> +		__folio_set_swapbacked(folio);
> +
> +		if (mem_cgroup_swapin_charge_folio(folio,
> +					vma->vm_mm, GFP_KERNEL,
> +					entry)) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			return NULL;
> +		}
> +		mem_cgroup_swapin_uncharge_swap(entry);
> +
> +		shadow = get_shadow_from_swap_cache(entry);
> +		if (shadow)
> +			workingset_refault(folio, shadow);
> +
> +		folio_add_lru(folio);
> +
> +		/* To provide entry to swap_read_folio() */
> +		folio->swap = entry;
> +		swap_read_folio(folio, true, NULL);
> +		folio->private = NULL;
> +	}
> +
> +	return folio;
> +}
> +
>  /**
>   * swapin_readahead - swap in pages in hope we need them soon
>   * @entry: swap entry of this memory


* Re: [PATCH v3 2/7] mm/swap: move no readahead swapin code to a stand-alone helper
  2024-01-30  5:38   ` Huang, Ying
@ 2024-01-30  5:55     ` Kairui Song
  0 siblings, 0 replies; 20+ messages in thread
From: Kairui Song @ 2024-01-30  5:55 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Tue, Jan 30, 2024 at 1:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > No feature change, simply move the routine to a standalone function to
> > be re-used later. The error path handling is copied from the "out_page"
> > label, to make the code change minimized for easier reviewing.
>
> The error processing for mem_cgroup_swapin_charge_folio() failure is
> changed a little.  That looks OK for me.  But you need to make it
> explicit in change log.  Especially, it's not "no feature change"
> strictly.

Yes, you are correct. I thought it was hardly observable by users, so I
ignored that; let me fix the commit message then. Thanks for the
suggestion.

>
> --
> Best Regards,
> Huang, Ying


* Re: [PATCH v3 3/7] mm/swap: always account swapped in page into current memcg
  2024-01-29 17:54 ` [PATCH v3 3/7] mm/swap: always account swapped in page into current memcg Kairui Song
@ 2024-01-30  6:12   ` Huang, Ying
  2024-01-30  7:01     ` Kairui Song
  0 siblings, 1 reply; 20+ messages in thread
From: Huang, Ying @ 2024-01-30  6:12 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Currently, mem_cgroup_swapin_charge_folio is always called with
> mm == NULL, except in swapin_direct.
>
> swapin_direct is only used when swapin should skip readahead
> and swapcache (SWP_SYNCHRONOUS_IO). All other callers of
> mem_cgroup_swapin_charge_folio are for swapin that should
> not skip readahead and cache.
>
> This could cause swapin charging to behave differently depending
> on swap device, which is unexpected.
>
> This is currently not happening because the only caller of
> swapin_direct is the direct anon page fault path, where mm always
> equals current->mm, but this will no longer be true if swapin_direct
> is shared and has other callers (e.g. swapoff) sharing the
> readahead skipping logic.
>
> So make swapin_direct also pass NULL for mm, so swapin charging
> will behave consistently and not be affected by the type of swap
> device or readahead policy.
>
> After this, the second param of mem_cgroup_swapin_charge_folio is
> never used, so it can be safely dropped.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/memcontrol.h | 4 ++--
>  mm/memcontrol.c            | 5 ++---
>  mm/swap_state.c            | 7 +++----
>  3 files changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20ff87f8e001..540590d80958 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -693,7 +693,7 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>  		long nr_pages);
>  
> -int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> +int mem_cgroup_swapin_charge_folio(struct folio *folio,
>  				  gfp_t gfp, swp_entry_t entry);
>  void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
>  
> @@ -1281,7 +1281,7 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
>  }
>  
>  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> -			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> +		gfp_t gfp, swp_entry_t entry)
>  {
>  	return 0;
>  }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e4c8735e7c85..5852742df958 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7306,8 +7306,7 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>   *
>   * Returns 0 on success. Otherwise, an error code is returned.
>   */
> -int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> -				  gfp_t gfp, swp_entry_t entry)
> +int mem_cgroup_swapin_charge_folio(struct folio *folio, gfp_t gfp, swp_entry_t entry)
>  {
>  	struct mem_cgroup *memcg;
>  	unsigned short id;
> @@ -7320,7 +7319,7 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>  	rcu_read_lock();
>  	memcg = mem_cgroup_from_id(id);
>  	if (!memcg || !css_tryget_online(&memcg->css))
> -		memcg = get_mem_cgroup_from_mm(mm);
> +		memcg = get_mem_cgroup_from_current();

The behavior of get_mem_cgroup_from_mm(NULL) and
get_mem_cgroup_from_current() isn't exactly the same.  Are you sure that
this is OK?

--
Best Regards,
Huang, Ying


>  	rcu_read_unlock();
>  
>  	ret = charge_memcg(folio, memcg, gfp);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 645f5bcad123..a450d09fc0db 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -495,7 +495,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  	__folio_set_locked(folio);
>  	__folio_set_swapbacked(folio);
>  
> -	if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
> +	if (mem_cgroup_swapin_charge_folio(folio, gfp_mask, entry))
>  		goto fail_unlock;
>  
>  	/* May fail (-ENOMEM) if XArray node allocation failed. */
> @@ -884,9 +884,8 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>  		__folio_set_locked(folio);
>  		__folio_set_swapbacked(folio);
>  
> -		if (mem_cgroup_swapin_charge_folio(folio,
> -					vma->vm_mm, GFP_KERNEL,
> -					entry)) {
> +		if (mem_cgroup_swapin_charge_folio(folio, GFP_KERNEL,
> +						   entry)) {
>  			folio_unlock(folio);
>  			folio_put(folio);
>  			return NULL;

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/7] mm/swap: introduce swapin_entry for unified readahead policy
  2024-01-29 17:54 ` [PATCH v3 4/7] mm/swap: introduce swapin_entry for unified readahead policy Kairui Song
@ 2024-01-30  6:29   ` Huang, Ying
  0 siblings, 0 replies; 20+ messages in thread
From: Huang, Ying @ 2024-01-30  6:29 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Introduce swapin_entry, which merges swapin_readahead and swapin_direct,
> making it the main entry point for swapping in pages, and use a unified
> swapin readahead policy.
>
> This commit makes swapoff use this new helper and skip readahead for
> SWP_SYNCHRONOUS_IO devices, since readahead is not helpful there. Swapping
> off a 10G ZRAM (lzo-rle) after the same workload is now faster since
> readahead is skipped and the overhead is reduced.
>
> Before:
> time swapoff /dev/zram0
> real    0m12.337s
> user    0m0.001s
> sys     0m12.329s
>
> After:
> time swapoff /dev/zram0
> real    0m9.728s
> user    0m0.001s
> sys     0m9.719s
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c     | 18 +++---------------
>  mm/swap.h       | 16 ++++------------
>  mm/swap_state.c | 40 ++++++++++++++++++++++++----------------
>  mm/swapfile.c   |  7 ++-----
>  4 files changed, 33 insertions(+), 48 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 81dc9d467f4e..8711f8a07039 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3864,20 +3864,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	swapcache = folio;
>  
>  	if (!folio) {
> -		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> -		    __swap_count(entry) == 1) {
> -			/* skip swapcache and readahead */
> -			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
> -			if (folio)
> -				page = &folio->page;
> -		} else {
> -			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -						vmf);
> -			if (page)
> -				folio = page_folio(page);
> -			swapcache = folio;
> -		}
> -
> +		folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> +				     vmf, &swapcache);
>  		if (!folio) {
>  			/*
>  			 * Back out if somebody else faulted in this pte
> @@ -3890,11 +3878,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  				ret = VM_FAULT_OOM;
>  			goto unlock;
>  		}
> -

Changed by accident?

>  		/* Had to read the page from swap area: Major fault */
>  		ret = VM_FAULT_MAJOR;
>  		count_vm_event(PGMAJFAULT);
>  		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> +		page = folio_file_page(folio, swp_offset(entry));

Better to move this line to just after the !folio check.  This makes it a
little easier to associate the page with the folio.

>  	} else if (PageHWPoison(page)) {
>  		/*
>  		 * hwpoisoned dirty swapcache pages are kept for killing
> diff --git a/mm/swap.h b/mm/swap.h
> index 83eab7b67e77..8f8185d3865c 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -54,10 +54,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
>  		bool skip_if_exists);
>  struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>  		struct mempolicy *mpol, pgoff_t ilx);
> -struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> -			      struct vm_fault *vmf);
> -struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> -			    struct vm_fault *vmf);
> +struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
> +			   struct vm_fault *vmf, struct folio **swapcached);
>  
>  static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
> @@ -88,14 +86,8 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
>  	return NULL;
>  }
>  
> -struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> -			struct vm_fault *vmf)
> -{
> -	return NULL;
> -}
> -
> -static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> -			struct vm_fault *vmf)
> +static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
> +			struct vm_fault *vmf, struct folio **swapcached)
>  {
>  	return NULL;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index a450d09fc0db..5e06b2e140d4 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -870,8 +870,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>   * Returns the struct folio for entry and addr after the swap entry is read
>   * in.
>   */
> -struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> -			    struct vm_fault *vmf)
> +static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> +				  struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct folio *folio;
> @@ -908,33 +908,41 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>  }
>  
>  /**
> - * swapin_readahead - swap in pages in hope we need them soon
> + * swapin_entry - swap in a folio from swap entry
>   * @entry: swap entry of this memory
>   * @gfp_mask: memory allocation flags
>   * @vmf: fault information
> + * @swapcache: set to the swapcache folio if swapcache is used
>   *
>   * Returns the struct page for entry and addr, after queueing swapin.
>   *
> - * It's a main entry function for swap readahead. By the configuration,
> + * It's the main entry function for swap in. By the configuration,
>   * it will read ahead blocks by cluster-based(ie, physical disk based)
> - * or vma-based(ie, virtual address based on faulty address) readahead.
> + * or vma-based(ie, virtual address based on faulty address) readahead,
> + * or skip the readahead(ie, ramdisk based swap device).
>   */
> -struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> -				struct vm_fault *vmf)
> +struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> +			   struct vm_fault *vmf, struct folio **swapcache)
>  {
>  	struct mempolicy *mpol;
> -	pgoff_t ilx;
>  	struct folio *folio;
> +	pgoff_t ilx;

ditto.

Otherwise, looks good to me.  Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

>  
> -	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> -	folio = swap_use_vma_readahead() ?
> -		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
> -		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> -	mpol_cond_put(mpol);
> +	if (data_race(swp_swap_info(entry)->flags & SWP_SYNCHRONOUS_IO) &&
> +	    __swap_count(entry) == 1) {
> +		folio = swapin_direct(entry, gfp_mask, vmf);
> +	} else {
> +		mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> +		if (swap_use_vma_readahead())
> +			folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> +		else
> +			folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> +		mpol_cond_put(mpol);
> +		if (swapcache)
> +			*swapcache = folio;
> +	}
>  
> -	if (!folio)
> -		return NULL;
> -	return folio_file_page(folio, swp_offset(entry));
> +	return folio;
>  }
>  
>  #ifdef CONFIG_SYSFS
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 606d95b56304..1cf7e72e19e3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1867,7 +1867,6 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  
>  		folio = swap_cache_get_folio(entry, vma, addr);
>  		if (!folio) {
> -			struct page *page;
>  			struct vm_fault vmf = {
>  				.vma = vma,
>  				.address = addr,
> @@ -1875,10 +1874,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				.pmd = pmd,
>  			};
>  
> -			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -						&vmf);
> -			if (page)
> -				folio = page_folio(page);
> +			folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> +					    &vmf, NULL);
>  		}
>  		if (!folio) {
>  			/*

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 5/7] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO
  2024-01-29 17:54 ` [PATCH v3 5/7] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO Kairui Song
@ 2024-01-30  6:51   ` Huang, Ying
  0 siblings, 0 replies; 20+ messages in thread
From: Huang, Ying @ 2024-01-30  6:51 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> When an xa_value is returned by the cache lookup, keep it to be used
> later for the workingset refault check instead of doing the lookup again
> in swapin_no_readahead.
>
> Shadow lookup and the workingset check are skipped for swapoff to reduce
> overhead; workingset checking for anon pages upon swapoff is not
> helpful, and simply considering all pages as inactive makes more sense
> since swapoff doesn't mean the pages are being accessed.
>
> After this commit, swapin is about 4% faster for ZRAM in a micro benchmark
> which uses madvise to swap out 10G of zero-filled data to ZRAM, then
> reads it back in:
>
> Before: 11143285 us
> After:  10692644 us (+4.1%)
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
>  mm/memory.c     |  5 +++--
>  mm/shmem.c      |  2 +-
>  mm/swap.h       | 11 ++++++-----
>  mm/swap_state.c | 23 +++++++++++++----------
>  mm/swapfile.c   |  4 ++--
>  5 files changed, 25 insertions(+), 20 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 8711f8a07039..349946899f8d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3800,6 +3800,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	struct swap_info_struct *si = NULL;
>  	rmap_t rmap_flags = RMAP_NONE;
>  	bool exclusive = false;
> +	void *shadow = NULL;
>  	swp_entry_t entry;
>  	pte_t pte;
>  	vm_fault_t ret = 0;
> @@ -3858,14 +3859,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (unlikely(!si))
>  		goto out;
>  
> -	folio = swap_cache_get_folio(entry, vma, vmf->address);
> +	folio = swap_cache_get_folio(entry, vma, vmf->address, &shadow);
>  	if (folio)
>  		page = folio_file_page(folio, swp_offset(entry));
>  	swapcache = folio;
>  
>  	if (!folio) {
>  		folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> -				     vmf, &swapcache);
> +				     vmf, &swapcache, shadow);
>  		if (!folio) {
>  			/*
>  			 * Back out if somebody else faulted in this pte
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d7c84ff62186..698a31bf7baa 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1873,7 +1873,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	}
>  
>  	/* Look it up and read it in.. */
> -	folio = swap_cache_get_folio(swap, NULL, 0);
> +	folio = swap_cache_get_folio(swap, NULL, 0, NULL);
>  	if (!folio) {
>  		/* Or update major stats only when swapin succeeds?? */
>  		if (fault_type) {
> diff --git a/mm/swap.h b/mm/swap.h
> index 8f8185d3865c..ca9cb472a263 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -42,7 +42,8 @@ void delete_from_swap_cache(struct folio *folio);
>  void clear_shadow_from_swap_cache(int type, unsigned long begin,
>  				  unsigned long end);
>  struct folio *swap_cache_get_folio(swp_entry_t entry,
> -		struct vm_area_struct *vma, unsigned long addr);
> +		struct vm_area_struct *vma, unsigned long addr,
> +		void **shadowp);
>  struct folio *filemap_get_incore_folio(struct address_space *mapping,
>  		pgoff_t index);
>  
> @@ -54,8 +55,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
>  		bool skip_if_exists);
>  struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>  		struct mempolicy *mpol, pgoff_t ilx);
> -struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
> -			   struct vm_fault *vmf, struct folio **swapcached);
> +struct folio *swapin_entry(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf,
> +		struct folio **swapcached, void *shadow);
>  
>  static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
> @@ -87,7 +88,7 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
>  }
>  
>  static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
> -			struct vm_fault *vmf, struct folio **swapcached)
> +			struct vm_fault *vmf, struct folio **swapcached, void *shadow)
>  {
>  	return NULL;
>  }
> @@ -98,7 +99,7 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>  }
>  
>  static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> -		struct vm_area_struct *vma, unsigned long addr)
> +		struct vm_area_struct *vma, unsigned long addr, void **shadowp)
>  {
>  	return NULL;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 5e06b2e140d4..e41a137a6123 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -330,12 +330,18 @@ static inline bool swap_use_vma_readahead(void)
>   * Caller must lock the swap device or hold a reference to keep it valid.
>   */
>  struct folio *swap_cache_get_folio(swp_entry_t entry,
> -		struct vm_area_struct *vma, unsigned long addr)
> +		struct vm_area_struct *vma, unsigned long addr, void **shadowp)
>  {
>  	struct folio *folio;
>  
> -	folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
> -	if (!IS_ERR(folio)) {
> +	folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
> +	if (xa_is_value(folio)) {
> +		if (shadowp)
> +			*shadowp = folio;
> +		return NULL;
> +	}
> +
> +	if (folio) {
>  		bool vma_ra = swap_use_vma_readahead();
>  		bool readahead;
>  
> @@ -365,8 +371,6 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
>  			if (!vma || !vma_ra)
>  				atomic_inc(&swapin_readahead_hits);
>  		}
> -	} else {
> -		folio = NULL;
>  	}
>  
>  	return folio;
> @@ -866,16 +870,16 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>   * @entry: swap entry of this memory
>   * @gfp_mask: memory allocation flags
>   * @vmf: fault information
> + * @shadow: workingset shadow corresponding to entry
>   *
>   * Returns the struct folio for entry and addr after the swap entry is read
>   * in.
>   */
>  static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> -				  struct vm_fault *vmf)
> +				  struct vm_fault *vmf, void *shadow)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct folio *folio;
> -	void *shadow = NULL;
>  
>  	/* skip swapcache */
>  	folio = vma_alloc_folio(gfp_mask, 0,
> @@ -892,7 +896,6 @@ static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>  		}
>  		mem_cgroup_swapin_uncharge_swap(entry);
>  
> -		shadow = get_shadow_from_swap_cache(entry);
>  		if (shadow)
>  			workingset_refault(folio, shadow);
>  
> @@ -922,7 +925,7 @@ static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>   * or skip the readahead(ie, ramdisk based swap device).
>   */
>  struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> -			   struct vm_fault *vmf, struct folio **swapcache)
> +			   struct vm_fault *vmf, struct folio **swapcache, void *shadow)
>  {
>  	struct mempolicy *mpol;
>  	struct folio *folio;
> @@ -930,7 +933,7 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
>  
>  	if (data_race(swp_swap_info(entry)->flags & SWP_SYNCHRONOUS_IO) &&
>  	    __swap_count(entry) == 1) {
> -		folio = swapin_direct(entry, gfp_mask, vmf);
> +		folio = swapin_direct(entry, gfp_mask, vmf, shadow);
>  	} else {
>  		mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>  		if (swap_use_vma_readahead())
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1cf7e72e19e3..aac26f5a6cec 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1865,7 +1865,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  		pte_unmap(pte);
>  		pte = NULL;
>  
> -		folio = swap_cache_get_folio(entry, vma, addr);
> +		folio = swap_cache_get_folio(entry, vma, addr, NULL);
>  		if (!folio) {
>  			struct vm_fault vmf = {
>  				.vma = vma,
> @@ -1875,7 +1875,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  			};
>  
>  			folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> -					    &vmf, NULL);
> +					    &vmf, NULL, NULL);
>  		}
>  		if (!folio) {
>  			/*

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/7] mm/swap: always account swapped in page into current memcg
  2024-01-30  6:12   ` Huang, Ying
@ 2024-01-30  7:01     ` Kairui Song
  2024-01-30  7:03       ` Kairui Song
  0 siblings, 1 reply; 20+ messages in thread
From: Kairui Song @ 2024-01-30  7:01 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Tue, Jan 30, 2024 at 2:14 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently, mem_cgroup_swapin_charge_folio is always called with
> > mm == NULL, except in swapin_direct.
> >
> > swapin_direct is only used when swapin should skip readahead
> > and swapcache (SWP_SYNCHRONOUS_IO). All other callers of
> > mem_cgroup_swapin_charge_folio are for swapin that should
> > not skip readahead and cache.
> >
> > This could cause swapin charging to behave differently depending
> > on swap device, which is unexpected.
> >
> > This is currently not happening because the only caller of
> > swapin_direct is the direct anon page fault path, where mm always
> > equals current->mm, but this will no longer be true if swapin_direct
> > is shared and has other callers (e.g. swapoff) sharing the
> > readahead skipping logic.
> >
> > So make swapin_direct also pass NULL for mm, so swapin charging
> > will behave consistently and not be affected by the type of swap
> > device or readahead policy.
> >
> > After this, the second param of mem_cgroup_swapin_charge_folio is
> > never used, so it can be safely dropped.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  include/linux/memcontrol.h | 4 ++--
> >  mm/memcontrol.c            | 5 ++---
> >  mm/swap_state.c            | 7 +++----
> >  3 files changed, 7 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 20ff87f8e001..540590d80958 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -693,7 +693,7 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> >  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >               long nr_pages);
> >
> > -int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> > +int mem_cgroup_swapin_charge_folio(struct folio *folio,
> >                                 gfp_t gfp, swp_entry_t entry);
> >  void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> >
> > @@ -1281,7 +1281,7 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> >  }
> >
> >  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> > -                     struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> > +             gfp_t gfp, swp_entry_t entry)
> >  {
> >       return 0;
> >  }
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e4c8735e7c85..5852742df958 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7306,8 +7306,7 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> >   *
> >   * Returns 0 on success. Otherwise, an error code is returned.
> >   */
> > -int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> > -                               gfp_t gfp, swp_entry_t entry)
> > +int mem_cgroup_swapin_charge_folio(struct folio *folio, gfp_t gfp, swp_entry_t entry)
> >  {
> >       struct mem_cgroup *memcg;
> >       unsigned short id;
> > @@ -7320,7 +7319,7 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> >       rcu_read_lock();
> >       memcg = mem_cgroup_from_id(id);
> >       if (!memcg || !css_tryget_online(&memcg->css))
> > -             memcg = get_mem_cgroup_from_mm(mm);
> > +             memcg = get_mem_cgroup_from_current();
>
> The behavior of get_mem_cgroup_from_mm(NULL) and
> get_mem_cgroup_from_current() isn't exactly the same.  Are you sure that
> this is OK?

Hi Ying, thank you very much for the careful review.

IIUC, get_mem_cgroup_from_mm(NULL) is usually for allocations without an
mm context (after set_active_memcg), so the remote charging cgroup is
used first.
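
Just to illustrate what I mean by remote charging, the usual pattern
around set_active_memcg() looks roughly like the sketch below (an
illustration only, not code from this series):

    struct mem_cgroup *old_memcg;

    /*
     * Remote charging: a kernel path that wants its allocations charged
     * to a specific memcg (instead of the current task's) brackets them
     * with set_active_memcg(), and get_mem_cgroup_from_mm(NULL) picks
     * that memcg up first.
     */
    old_memcg = set_active_memcg(memcg);
    /* ... allocations / charges issued here ... */
    set_active_memcg(old_memcg);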

But the swap case is a bit special: all swapins are issued from
userspace, so remote charging isn't useful. I'm not sure whether it may
even lead to charging the wrong cgroup.

And this callsite is reached only when `if (!memcg ||
!css_tryget_online(&memcg->css))` is true; the only case I know of is
swapoff (where the memcg is dead), or there are some leaks. The behaviour
of the swapoff case has been discussed previously, so currently we just
charge it to the current task's memcg.

This is indeed a potential behaviour change though; I can change it
back to get_mem_cgroup_from_mm(NULL) and post another patch later for
this to discuss it in more detail.

>
> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/7] mm/swap: always account swapped in page into current memcg
  2024-01-30  7:01     ` Kairui Song
@ 2024-01-30  7:03       ` Kairui Song
  0 siblings, 0 replies; 20+ messages in thread
From: Kairui Song @ 2024-01-30  7:03 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Tue, Jan 30, 2024 at 3:01 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Jan 30, 2024 at 2:14 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Kairui Song <ryncsn@gmail.com> writes:
> >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Currently, mem_cgroup_swapin_charge_folio is always called with
> > > mm == NULL, except in swapin_direct.
> > >
> > > swapin_direct is only used when swapin should skip readahead
> > > and swapcache (SWP_SYNCHRONOUS_IO). All other callers of
> > > mem_cgroup_swapin_charge_folio are for swapin that should
> > > not skip readahead and cache.
> > >
> > > This could cause swapin charging to behave differently depending
> > > on swap device, which is unexpected.
> > >
> > > This is currently not happening because the only caller of
> > > swapin_direct is the direct anon page fault path, where mm always
> > > equals current->mm, but this will no longer be true if swapin_direct
> > > is shared and has other callers (e.g. swapoff) sharing the
> > > readahead skipping logic.
> > >
> > > So make swapin_direct also pass NULL for mm, so swapin charging
> > > will behave consistently and not be affected by the type of swap
> > > device or readahead policy.
> > >
> > > After this, the second param of mem_cgroup_swapin_charge_folio is
> > > never used, so it can be safely dropped.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  include/linux/memcontrol.h | 4 ++--
> > >  mm/memcontrol.c            | 5 ++---
> > >  mm/swap_state.c            | 7 +++----
> > >  3 files changed, 7 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 20ff87f8e001..540590d80958 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -693,7 +693,7 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> > >  int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> > >               long nr_pages);
> > >
> > > -int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> > > +int mem_cgroup_swapin_charge_folio(struct folio *folio,
> > >                                 gfp_t gfp, swp_entry_t entry);
> > >  void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> > >
> > > @@ -1281,7 +1281,7 @@ static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
> > >  }
> > >
> > >  static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> > > -                     struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> > > +             gfp_t gfp, swp_entry_t entry)
> > >  {
> > >       return 0;
> > >  }
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index e4c8735e7c85..5852742df958 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -7306,8 +7306,7 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
> > >   *
> > >   * Returns 0 on success. Otherwise, an error code is returned.
> > >   */
> > > -int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> > > -                               gfp_t gfp, swp_entry_t entry)
> > > +int mem_cgroup_swapin_charge_folio(struct folio *folio, gfp_t gfp, swp_entry_t entry)
> > >  {
> > >       struct mem_cgroup *memcg;
> > >       unsigned short id;
> > > @@ -7320,7 +7319,7 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> > >       rcu_read_lock();
> > >       memcg = mem_cgroup_from_id(id);
> > >       if (!memcg || !css_tryget_online(&memcg->css))
> > > -             memcg = get_mem_cgroup_from_mm(mm);
> > > +             memcg = get_mem_cgroup_from_current();
> >
> > The behavior of get_mem_cgroup_from_mm(NULL) and
> > get_mem_cgroup_from_current() isn't exactly the same.  Are you sure that
> > this is OK?
>
> Hi Ying, thank you very much for the careful review.
>
> IIUC, get_mem_cgroup_from_mm(NULL) is usually for allocations without an
> mm context (after set_active_memcg), so the remote charging cgroup is
> used first.
>
> But the swap case is a bit special: all swapins are issued from
> userspace, so remote charging isn't useful. I'm not sure whether it may
> even lead to charging the wrong cgroup.
>
> And this callsite is reached only when `if (!memcg ||
> !css_tryget_online(&memcg->css))` is true; the only case I know of is
> swapoff (where the memcg is dead), or there are some leaks. The behaviour

Oh, actually shmem may also have the zombie cgroup issue. The
conclusion is still the same though: the task that accessed the shmem
owns the charge.

> of the swapoff case has been discussed previously, so currently we just
> charge it to the current task's memcg.
>
> This is indeed a potential behaviour change though; I can change it
> back to get_mem_cgroup_from_mm(NULL) and post another patch later for
> this to discuss it in more detail.
>
> >
> > --
> > Best Regards,
> > Huang, Ying

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Whether is the race for SWP_SYNCHRONOUS_IO possible? (was Re: [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem)
  2024-01-29 17:54 ` [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem Kairui Song
@ 2024-01-31  2:51   ` Huang, Ying
  2024-01-31  3:58     ` Kairui Song
  2024-01-31 23:38     ` Chris Li
  0 siblings, 2 replies; 20+ messages in thread
From: Huang, Ying @ 2024-01-31  2:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Kairui Song, linux-mm, Kairui Song, Andrew Morton, Chris Li,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	Yosry Ahmed, David Hildenbrand, linux-kernel, Yu Zhao

Hi, Minchan,

When reviewing the patchset from Kairui, I checked the code that skips the swap
cache in do_swap_page() for swap devices with SWP_SYNCHRONOUS_IO.  Is the
following race possible?  A page is swapped out to a swap device
with SWP_SYNCHRONOUS_IO and the swap count is 1.  Then 2 threads of the
process run on CPU0 and CPU1 as below.  CPU0 is running do_swap_page().

CPU0				CPU1
----				----
swap_cache_get_folio()
check sync io and swap count
alloc folio
swap_readpage()
folio_lock_or_retry()
				swap in the swap entry
				write page
				swap out to same swap entry
pte_offset_map_lock()
check pte_same()
swap_free()   <-- new content lost!
set_pte_at()  <-- stale page!
folio_unlock()
pte_unmap_unlock()


Do I miss anything?

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Whether is the race for SWP_SYNCHRONOUS_IO possible? (was Re: [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem)
  2024-01-31  2:51   ` Whether is the race for SWP_SYNCHRONOUS_IO possible? (was Re: [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem) Huang, Ying
@ 2024-01-31  3:58     ` Kairui Song
  2024-01-31 23:45       ` Chris Li
  2024-01-31 23:38     ` Chris Li
  1 sibling, 1 reply; 20+ messages in thread
From: Kairui Song @ 2024-01-31  3:58 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Minchan Kim, linux-mm, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Yu Zhao

Hi Ying,

On Wed, Jan 31, 2024 at 10:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Minchan,
>
> When reviewing the patchset from Kairui, I checked the code that skips the swap
> cache in do_swap_page() for swap devices with SWP_SYNCHRONOUS_IO.  Is the
> following race possible?  A page is swapped out to a swap device
> with SWP_SYNCHRONOUS_IO and the swap count is 1.  Then 2 threads of the
> process run on CPU0 and CPU1 as below.  CPU0 is running do_swap_page().

Chris raised a similar issue about the shmem path, and I was worrying
about the same issue in previous discussions about do_swap_page:
https://lore.kernel.org/linux-mm/CAMgjq7AwFiDb7cAMkWMWb3vkccie1-tocmZfT7m4WRb_UKPghg@mail.gmail.com/

"""
In the do_swap_page path, multiple processes could swap in the page at the
same time (a page mapped only once can still be shared by sub threads),
and they could get different folios. The later pte lock and pte_same check
is not enough, because while one process is not holding the pte lock,
another process could read the page in, swap_free the entry, then swap the
page out again using the same entry, an ABA problem. The race is not likely
to happen in reality but is possible in theory.
"""

>
> CPU0                            CPU1
> ----                            ----
> swap_cache_get_folio()
> check sync io and swap count
> alloc folio
> swap_readpage()
> folio_lock_or_retry()
>                                 swap in the swap entry
>                                 write page
>                                 swap out to same swap entry
> pte_offset_map_lock()
> check pte_same()
> swap_free()   <-- new content lost!
> set_pte_at()  <-- stale page!
> folio_unlock()
> pte_unmap_unlock()

Thank you very much for highlighting this!

My concern previously was the same as yours (swapping out using the
same entry is an ABA-like issue, where pte_same fails to detect the
page table change). Later, when working on V3, I mistakenly thought
that was impossible, as the entry should be pinned until swap_free on
CPU0, but I was wrong: CPU1 can also just call swap_free, then the swap
count drops to 0 and it can swap out using the same entry. Now I think
my patch 6/7 is also affected by this potential race. It seems nothing
can stop it from doing this.

Actually I was trying to make a reproducer locally; due to the swap slot
cache, the swap allocation algorithm, and the short race window, this is
very unlikely to happen though.

How about we just increase the swap count temporarily in the direct
swap in path (after alloc folio), then drop the count after pte_same
(or shmem_add_to_page_cache in shmem path)? That seems enough to
prevent the entry reuse issue.
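
Roughly something like the sketch below (untested, just to illustrate
the idea; the real patch would need proper error handling, and the
existing swap_free() for the mapping itself stays as it is):

    /* In swapin_direct(), after the folio is allocated (sketch only): */
    folio = vma_alloc_folio(gfp_mask, 0, vma, vmf->address, false);
    if (folio) {
            /* Pin the entry so it can't be freed and reused under us. */
            if (swap_duplicate(entry)) {
                    folio_put(folio);
                    return NULL;
            }
            /* ... charge, workingset_refault(), folio_add_lru() ... */
            folio->swap = entry;
            swap_read_folio(folio, true, NULL);
    }

    /* And in do_swap_page(), once the pte_same() check has passed: */
    swap_free(entry);       /* drop the extra reference taken above */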

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Whether is the race for SWP_SYNCHRONOUS_IO possible? (was Re: [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem)
  2024-01-31  2:51   ` Whether is the race for SWP_SYNCHRONOUS_IO possible? (was Re: [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem) Huang, Ying
  2024-01-31  3:58     ` Kairui Song
@ 2024-01-31 23:38     ` Chris Li
  1 sibling, 0 replies; 20+ messages in thread
From: Chris Li @ 2024-01-31 23:38 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Minchan Kim, Kairui Song, linux-mm, Kairui Song, Andrew Morton,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	Yosry Ahmed, David Hildenbrand, linux-kernel, Yu Zhao

On Tue, Jan 30, 2024 at 6:53 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Minchan,
>
> When reviewing the patchset from Kairui, I checked the code that skips the swap
> cache in do_swap_page() for swap devices with SWP_SYNCHRONOUS_IO.  Is the
> following race possible?  A page is swapped out to a swap device
> with SWP_SYNCHRONOUS_IO and the swap count is 1.  Then 2 threads of the
> process run on CPU0 and CPU1 as below.  CPU0 is running do_swap_page().
>
> CPU0                            CPU1
> ----                            ----
> swap_cache_get_folio()
> check sync io and swap count
> alloc folio
> swap_readpage()
> folio_lock_or_retry()
>                                 swap in the swap entry
>                                 write page
>                                 swap out to same swap entry
> pte_offset_map_lock()
> check pte_same()
> swap_free()   <-- new content lost!
> set_pte_at()  <-- stale page!
> folio_unlock()
> pte_unmap_unlock()

Yes, that path looks possible but hard to hit, since it requires a
swap in and a swap out within a short window.
I had a similar question in the previous zswap rb tree to xarray
discussion, regarding deleting an entry where the entry might change
due to a swap in followed by a swap out.

Chris

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Whether is the race for SWP_SYNCHRONOUS_IO possible? (was Re: [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem)
  2024-01-31  3:58     ` Kairui Song
@ 2024-01-31 23:45       ` Chris Li
  2024-02-01  0:52         ` Huang, Ying
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Li @ 2024-01-31 23:45 UTC (permalink / raw)
  To: Kairui Song
  Cc: Huang, Ying, Minchan Kim, linux-mm, Andrew Morton, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Yu Zhao

On Tue, Jan 30, 2024 at 7:58 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> Hi Ying,
>
> On Wed, Jan 31, 2024 at 10:53 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Hi, Minchan,
> >
> > When reviewing the patchset from Kairui, I checked the code that skips the swap
> > cache in do_swap_page() for swap devices with SWP_SYNCHRONOUS_IO.  Is the
> > following race possible?  A page is swapped out to a swap device
> > with SWP_SYNCHRONOUS_IO and the swap count is 1.  Then 2 threads of the
> > process run on CPU0 and CPU1 as below.  CPU0 is running do_swap_page().
>
> Chris raised a similar issue about the shmem path, and I was worrying
> about the same issue in previous discussions about do_swap_page:
> https://lore.kernel.org/linux-mm/CAMgjq7AwFiDb7cAMkWMWb3vkccie1-tocmZfT7m4WRb_UKPghg@mail.gmail.com/

Ha thanks for remembering that.

>
> """
> In the do_swap_page path, multiple processes could swap in the page at the
> same time (a page mapped only once can still be shared by sub threads),
> and they could get different folios. The later pte lock and pte_same check
> is not enough, because while one process is not holding the pte lock,
> another process could read the page in, swap_free the entry, then swap the
> page out again using the same entry, an ABA problem. The race is not likely
> to happen in reality but is possible in theory.
> """
>
> >
> > CPU0                            CPU1
> > ----                            ----
> > swap_cache_get_folio()
> > check sync io and swap count
> > alloc folio
> > swap_readpage()
> > folio_lock_or_retry()
> >                                 swap in the swap entry
> >                                 write page
> >                                 swap out to same swap entry
> > pte_offset_map_lock()
> > check pte_same()
> > swap_free()   <-- new content lost!
> > set_pte_at()  <-- stale page!
> > folio_unlock()
> > pte_unmap_unlock()
>
> Thank you very much for highlighting this!
>
> My concern previously was the same as yours (swapping out using the
> same entry is an ABA-like issue, where pte_same fails to detect the
> page table change). Later, when working on V3, I mistakenly thought
> that was impossible, as the entry should be pinned until swap_free on
> CPU0, but I was wrong: CPU1 can also just call swap_free, then the swap
> count drops to 0 and it can swap out using the same entry. Now I think
> my patch 6/7 is also affected by this potential race. It seems nothing
> can stop it from doing this.
>
> Actually I was trying to make a reproducer locally; due to the swap slot
> cache, the swap allocation algorithm, and the short race window, this is
> very unlikely to happen though.

You can put some sleep in the CPU0 path where you expect the other race
to happen, to help trigger it manually. Yes, it sounds hard to trigger
in real life since the swap out comes from reclaim.
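
Something as crude as this in swapin_direct() should widen the window
enough for a manual test (a debug hack only, obviously, not something
for the series):

    /* Debug hack: widen the race window after the cache-bypassing read. */
    swap_read_folio(folio, true, NULL);
    mdelay(100);    /* give the other thread time to swap the entry
                     * back out to the same slot */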

>
> How about we just increase the swap count temporarily in the direct
> swap in path (after alloc folio), then drop the count after pte_same
> (or shmem_add_to_page_cache in shmem path)? That seems enough to
> prevent the entry reuse issue.

Sounds like a good solution.

Chris

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Whether is the race for SWP_SYNCHRONOUS_IO possible? (was Re: [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem)
  2024-01-31 23:45       ` Chris Li
@ 2024-02-01  0:52         ` Huang, Ying
  0 siblings, 0 replies; 20+ messages in thread
From: Huang, Ying @ 2024-02-01  0:52 UTC (permalink / raw)
  To: Chris Li
  Cc: Kairui Song, Minchan Kim, linux-mm, Andrew Morton, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Yu Zhao

Chris Li <chrisl@kernel.org> writes:

> On Tue, Jan 30, 2024 at 7:58 PM Kairui Song <ryncsn@gmail.com> wrote:
>>
>> Hi Ying,
>>
>> On Wed, Jan 31, 2024 at 10:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >
>> > Hi, Minchan,
>> >
>> > When reviewing the patchset from Kairui, I checked the code that skips the swap
>> > cache in do_swap_page() for swap devices with SWP_SYNCHRONOUS_IO.  Is the
>> > following race possible?  A page is swapped out to a swap device
>> > with SWP_SYNCHRONOUS_IO and the swap count is 1.  Then 2 threads of the
>> > process run on CPU0 and CPU1 as below.  CPU0 is running do_swap_page().
>>
>> Chris raised a similar issue about the shmem path, and I was worrying
>> about the same issue in previous discussions about do_swap_page:
>> https://lore.kernel.org/linux-mm/CAMgjq7AwFiDb7cAMkWMWb3vkccie1-tocmZfT7m4WRb_UKPghg@mail.gmail.com/
>
> Ha thanks for remembering that.
>
>>
>> """
>> In the do_swap_page path, multiple processes could swap in the page at the
>> same time (a page mapped only once can still be shared by sub threads),
>> and they could get different folios. The later pte lock and pte_same check
>> is not enough, because while one process is not holding the pte lock,
>> another process could read the page in, swap_free the entry, then swap the
>> page out again using the same entry, an ABA problem. The race is not likely
>> to happen in reality but is possible in theory.
>> """
>>
>> >
>> > CPU0                            CPU1
>> > ----                            ----
>> > swap_cache_get_folio()
>> > check sync io and swap count
>> > alloc folio
>> > swap_readpage()
>> > folio_lock_or_retry()
>> >                                 swap in the swap entry
>> >                                 write page
>> >                                 swap out to same swap entry
>> > pte_offset_map_lock()
>> > check pte_same()
>> > swap_free()   <-- new content lost!
>> > set_pte_at()  <-- stale page!
>> > folio_unlock()
>> > pte_unmap_unlock()
>>
>> Thank you very much for highlighting this!
>>
>> My concern previously was the same as yours (swapping out using the
>> same entry is an ABA-like issue, where pte_same fails to detect the
>> page table change). Later, when working on V3, I mistakenly thought
>> that was impossible, as the entry should be pinned until swap_free on
>> CPU0, but I was wrong: CPU1 can also just call swap_free, then the swap
>> count drops to 0 and it can swap out using the same entry. Now I think
>> my patch 6/7 is also affected by this potential race. It seems nothing
>> can stop it from doing this.
>>
>> Actually I was trying to make a reproducer locally; due to the swap slot
>> cache, the swap allocation algorithm, and the short race window, this is
>> very unlikely to happen though.
>
> You can put some sleep in the CPU0 path where you expect the other race
> to happen, to help trigger it manually. Yes, it sounds hard to trigger
> in real life since the swap out comes from reclaim.
>
>>
>> How about we just increase the swap count temporarily in the direct
>> swap in path (after alloc folio), then drop the count after pte_same
>> (or shmem_add_to_page_cache in shmem path)? That seems enough to
>> prevent the entry reuse issue.
>
> Sounds like a good solution.

Yes.  It seems that this can solve the race.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-02-01  0:54 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-29 17:54 [PATCH v3 0/7] swapin refactor for optimization and unified readahead Kairui Song
2024-01-29 17:54 ` [PATCH v3 1/7] mm/swapfile.c: add back some comment Kairui Song
2024-01-29 17:54 ` [PATCH v3 2/7] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
2024-01-30  5:38   ` Huang, Ying
2024-01-30  5:55     ` Kairui Song
2024-01-29 17:54 ` [PATCH v3 3/7] mm/swap: always account swapped in page into current memcg Kairui Song
2024-01-30  6:12   ` Huang, Ying
2024-01-30  7:01     ` Kairui Song
2024-01-30  7:03       ` Kairui Song
2024-01-29 17:54 ` [PATCH v3 4/7] mm/swap: introduce swapin_entry for unified readahead policy Kairui Song
2024-01-30  6:29   ` Huang, Ying
2024-01-29 17:54 ` [PATCH v3 5/7] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO Kairui Song
2024-01-30  6:51   ` Huang, Ying
2024-01-29 17:54 ` [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem Kairui Song
2024-01-31  2:51   ` Whether is the race for SWP_SYNCHRONOUS_IO possible? (was Re: [PATCH v3 6/7] mm/swap, shmem: use unified swapin helper for shmem) Huang, Ying
2024-01-31  3:58     ` Kairui Song
2024-01-31 23:45       ` Chris Li
2024-02-01  0:52         ` Huang, Ying
2024-01-31 23:38     ` Chris Li
2024-01-29 17:54 ` [PATCH v3 7/7] mm/swap: refactor swap_cache_get_folio Kairui Song
