* [PATCH v2 0/9] swapin refactor for optimization and unified readahead
@ 2024-01-02 17:53 Kairui Song
  2024-01-02 17:53 ` [PATCH v2 1/9] mm/swapfile.c: add back some comment Kairui Song
                   ` (8 more replies)
  0 siblings, 9 replies; 38+ messages in thread
From: Kairui Song @ 2024-01-02 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

This series is rebased on the latest mm-stable to avoid conflicts.

This series tries to unify and clean up the swapin path, introduce minor
optimizations, and make both shmem and swapoff make use of the
SWP_SYNCHRONOUS_IO flag to skip readahead and the swap cache for better
performance.

1. Some benchmarks for dropping readahead and swapcache for shmem with ZRAM:

- Single file sequential read:
  perf stat --repeat 20 dd if=/tmpfs/test of=/dev/null bs=1M count=8192
  (/tmpfs/test is a zero filled file, using brd as swap, 4G memcg limit)
  Before: 22.248 +- 0.549
  After:  22.021 +- 0.684 (-1.1%)

- Random read stress test:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
  --size=256m --ioengine=mmap --rw=randread --random_distribution=random \
  --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

  Before: 1818MiB/s
  After:  1888MiB/s (+3.85%)

- Zipf biased random read stress test:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
  --size=256m --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
  --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

  Before: 31.1GiB/s
  After:  32.3GiB/s (+3.86%)

Previously, shmem always used cluster readahead. It doesn't help much even
for single sequential reads, and for random stress tests the performance is
better without it. In reality, due to memory and swap fragmentation, cluster
readahead is even less helpful for ZRAM.

2. A micro benchmark which uses madvise to swap out 10G of zero-filled data
   to ZRAM and then reads it back in shows a performance gain for the swapin
   path (a rough sketch of such a benchmark is shown below the numbers):

Before: 11143285 us
After:  10692644 us (+4.1%)
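
A minimal userspace sketch of such a benchmark (an illustration only, not
the exact program used; the 10G mapping size, MADV_PAGEOUT and the 4K page
size are assumptions, and error handling is omitted):

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/time.h>

  #ifndef MADV_PAGEOUT
  #define MADV_PAGEOUT 21			/* reclaim these pages */
  #endif

  #define SIZE (10UL << 30)			/* 10G of anonymous memory */

  int main(void)
  {
  	struct timeval start, end;
  	unsigned long i, usec;
  	char *buf;

  	buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
  		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  	memset(buf, 0, SIZE);			/* populate every page with zeros */
  	madvise(buf, SIZE, MADV_PAGEOUT);	/* push the pages out to swap (ZRAM) */

  	gettimeofday(&start, NULL);
  	for (i = 0; i < SIZE; i += 4096)	/* fault every page back in */
  		*(volatile char *)(buf + i);
  	gettimeofday(&end, NULL);

  	usec = (end.tv_sec - start.tv_sec) * 1000000UL +
  	       (end.tv_usec - start.tv_usec);
  	printf("swapin took %lu us\n", usec);
  	return 0;
  }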

3. Swapping off a 10G ZRAM:

Before:
time swapoff /dev/zram0
real    0m12.337s
user    0m0.001s
sys     0m12.329s

After:
time swapoff /dev/zram0
real    0m9.728s
user    0m0.001s
sys     0m9.719s

This also cleans up the code path, making it possible to apply a per swap
device readahead policy to all swapin paths.

V1: https://lkml.org/lkml/2023/11/19/296
Update from V1:
  - Rebased on mm-unstable.
  - Removed behaviour-changing patches; they will be submitted in a separate
    series later.
  - Code style, naming and comment updates.
  - Thanks to Chris Li for the very detailed and helpful review of V1. Thanks
    to Matthew Wilcox and Huang Ying for helpful suggestions.

Kairui Song (9):
  mm/swapfile.c: add back some comment
  mm/swap: move no readahead swapin code to a stand-alone helper
  mm/swap: avoid doing extra unlock error checks for direct swapin
  mm/swap: always account swapped in page into current memcg
  mm/swap: introduce swapin_entry for unified readahead policy
  mm/swap: also handle swapcache lookup in swapin_entry
  mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO
  mm/swap: introduce a helper for swapin without vmfault
  swap, shmem: use new swapin helper and skip readahead conditionally

 mm/memory.c     |  74 +++++++-------------------
 mm/shmem.c      |  67 +++++++++++------------
 mm/swap.h       |  39 ++++++++++----
 mm/swap_state.c | 138 +++++++++++++++++++++++++++++++++++++++++-------
 mm/swapfile.c   |  32 +++++------
 5 files changed, 218 insertions(+), 132 deletions(-)

-- 
2.43.0



* [PATCH v2 1/9] mm/swapfile.c: add back some comment
  2024-01-02 17:53 [PATCH v2 0/9] swapin refactor for optimization and unified readahead Kairui Song
@ 2024-01-02 17:53 ` Kairui Song
  2024-01-02 17:53 ` [PATCH v2 2/9] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Kairui Song @ 2024-01-02 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Some useful comments were dropped in commit b56a2d8af914 ("mm: rid
swapoff of quadratic complexity"); add them back.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3eec686484ef..f7271504aa0a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1880,6 +1880,17 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				folio = page_folio(page);
 		}
 		if (!folio) {
+			/*
+			 * The entry could have been freed, and will not
+			 * be reused since swapoff() already disabled
+			 * allocation from here, or alloc_page() failed.
+			 *
+			 * We don't hold the lock here, so the swap entry could be
+			 * SWAP_MAP_BAD (when the cluster is discarding).
+			 * Instead of failing out, we can just skip the swap
+			 * entry because swapoff will wait for the discard to
+			 * finish anyway.
+			 */
 			swp_count = READ_ONCE(si->swap_map[offset]);
 			if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
 				continue;
-- 
2.43.0



* [PATCH v2 2/9] mm/swap: move no readahead swapin code to a stand-alone helper
  2024-01-02 17:53 [PATCH v2 0/9] swapin refactor for optimization and unified readahead Kairui Song
  2024-01-02 17:53 ` [PATCH v2 1/9] mm/swapfile.c: add back some comment Kairui Song
@ 2024-01-02 17:53 ` Kairui Song
  2024-01-04  7:28   ` Huang, Ying
  2024-01-02 17:53 ` [PATCH v2 3/9] mm/swap: avoid doing extra unlock error checks for direct swapin Kairui Song
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-02 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

No functional change; simply move the routine into a standalone function
so it can be re-used later. The error path handling is copied from the
"out_page" label to keep the code change minimal for easier reviewing.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 32 ++++----------------------------
 mm/swap.h       |  8 ++++++++
 mm/swap_state.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 28 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index a0a50d3754f0..0165c8cad489 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3803,7 +3803,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swp_entry_t entry;
 	pte_t pte;
 	vm_fault_t ret = 0;
-	void *shadow = NULL;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -3867,33 +3866,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/* skip swapcache */
-			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-						vma, vmf->address, false);
-			page = &folio->page;
-			if (folio) {
-				__folio_set_locked(folio);
-				__folio_set_swapbacked(folio);
-
-				if (mem_cgroup_swapin_charge_folio(folio,
-							vma->vm_mm, GFP_KERNEL,
-							entry)) {
-					ret = VM_FAULT_OOM;
-					goto out_page;
-				}
-				mem_cgroup_swapin_uncharge_swap(entry);
-
-				shadow = get_shadow_from_swap_cache(entry);
-				if (shadow)
-					workingset_refault(folio, shadow);
-
-				folio_add_lru(folio);
-
-				/* To provide entry to swap_read_folio() */
-				folio->swap = entry;
-				swap_read_folio(folio, true, NULL);
-				folio->private = NULL;
-			}
+			/* skip swapcache and readahead */
+			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
+			if (folio)
+				page = &folio->page;
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						vmf);
diff --git a/mm/swap.h b/mm/swap.h
index 758c46ca671e..83eab7b67e77 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -56,6 +56,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
 			      struct vm_fault *vmf);
+struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
+			    struct vm_fault *vmf);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -86,6 +88,12 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
 	return NULL;
 }
 
+static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
+			struct vm_fault *vmf)
+{
+	return NULL;
+}
+
 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 			struct vm_fault *vmf)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e671266ad772..24cb93ed5081 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -861,6 +861,53 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	return folio;
 }
 
+/**
+ * swapin_direct - swap in folios skipping swap cache and readahead
+ * @entry: swap entry of this memory
+ * @gfp_mask: memory allocation flags
+ * @vmf: fault information
+ *
+ * Returns the struct folio for entry and addr after the swap entry is read
+ * in.
+ */
+struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
+			    struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct folio *folio;
+	void *shadow = NULL;
+
+	/* skip swapcache */
+	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
+				vma, vmf->address, false);
+	if (folio) {
+		__folio_set_locked(folio);
+		__folio_set_swapbacked(folio);
+
+		if (mem_cgroup_swapin_charge_folio(folio,
+					vma->vm_mm, GFP_KERNEL,
+					entry)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			return NULL;
+		}
+		mem_cgroup_swapin_uncharge_swap(entry);
+
+		shadow = get_shadow_from_swap_cache(entry);
+		if (shadow)
+			workingset_refault(folio, shadow);
+
+		folio_add_lru(folio);
+
+		/* To provide entry to swap_read_folio() */
+		folio->swap = entry;
+		swap_read_folio(folio, true, NULL);
+		folio->private = NULL;
+	}
+
+	return folio;
+}
+
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
-- 
2.43.0



* [PATCH v2 3/9] mm/swap: avoid doing extra unlock error checks for direct swapin
  2024-01-02 17:53 [PATCH v2 0/9] swapin refactor for optimization and unified readahead Kairui Song
  2024-01-02 17:53 ` [PATCH v2 1/9] mm/swapfile.c: add back some comment Kairui Song
  2024-01-02 17:53 ` [PATCH v2 2/9] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
@ 2024-01-02 17:53 ` Kairui Song
  2024-01-04  8:10   ` Huang, Ying
  2024-01-02 17:53 ` [PATCH v2 4/9] mm/swap: always account swapped in page into current memcg Kairui Song
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-02 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

When swapping in a page, mem_cgroup_swapin_charge_folio is called on a
newly allocated folio. Nothing else is referencing the folio yet, so
there is no need to set the lock bit that early. This avoids doing the
extra unlock check on the error path.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 24cb93ed5081..6130de8d5226 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -881,16 +881,15 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
 	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
 				vma, vmf->address, false);
 	if (folio) {
-		__folio_set_locked(folio);
-		__folio_set_swapbacked(folio);
-
-		if (mem_cgroup_swapin_charge_folio(folio,
-					vma->vm_mm, GFP_KERNEL,
-					entry)) {
-			folio_unlock(folio);
+		if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
+						   GFP_KERNEL, entry)) {
 			folio_put(folio);
 			return NULL;
 		}
+
+		__folio_set_locked(folio);
+		__folio_set_swapbacked(folio);
+
 		mem_cgroup_swapin_uncharge_swap(entry);
 
 		shadow = get_shadow_from_swap_cache(entry);
-- 
2.43.0



* [PATCH v2 4/9] mm/swap: always account swapped in page into current memcg
  2024-01-02 17:53 [PATCH v2 0/9] swapin refactor for optimization and unified readahead Kairui Song
                   ` (2 preceding siblings ...)
  2024-01-02 17:53 ` [PATCH v2 3/9] mm/swap: avoid doing extra unlock error checks for direct swapin Kairui Song
@ 2024-01-02 17:53 ` Kairui Song
  2024-01-05  7:14   ` Huang, Ying
  2024-01-02 17:53 ` [PATCH v2 5/9] mm/swap: introduce swapin_entry for unified readahead policy Kairui Song
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-02 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently, mem_cgroup_swapin_charge_folio is always called with the
mm argument set to NULL, except in swapin_direct.

swapin_direct is used when swapin should skip readahead and the
swapcache (SWP_SYNCHRONOUS_IO). The other callers of
mem_cgroup_swapin_charge_folio are on swapin paths that should not
skip readahead and the cache.

This could cause swapin charging to behave differently depending on
the swap device. That doesn't happen currently because the only call
path of swapin_direct is the direct anon page fault path, where mm
equals current->mm, but it will no longer hold once swapin_direct is
shared and has other callers (e.g. swapoff).

So make swapin_direct also pass NULL for mm; no functional change.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6130de8d5226..d39c5369da21 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -881,7 +881,7 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
 	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
 				vma, vmf->address, false);
 	if (folio) {
-		if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
+		if (mem_cgroup_swapin_charge_folio(folio, NULL,
 						   GFP_KERNEL, entry)) {
 			folio_put(folio);
 			return NULL;
-- 
2.43.0



* [PATCH v2 5/9] mm/swap: introduce swapin_entry for unified readahead policy
  2024-01-02 17:53 [PATCH v2 0/9] swapin refactor for optimization and unified readahead Kairui Song
                   ` (3 preceding siblings ...)
  2024-01-02 17:53 ` [PATCH v2 4/9] mm/swap: always account swapped in page into current memcg Kairui Song
@ 2024-01-02 17:53 ` Kairui Song
  2024-01-05  7:28   ` Huang, Ying
  2024-01-02 17:53 ` [PATCH v2 6/9] mm/swap: handle swapcache lookup in swapin_entry Kairui Song
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-02 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Introduce swapin_entry, which merges swapin_readahead and swapin_direct,
making it the main entry point for swapping in pages, with a unified swapin
policy.

This commit also makes swapoff use the new helper, so swapping off a
10G ZRAM (lzo-rle) is now faster since readahead is skipped.

Before:
time swapoff /dev/zram0
real    0m12.337s
user    0m0.001s
sys     0m12.329s

After:
time swapoff /dev/zram0
real    0m9.728s
user    0m0.001s
sys     0m9.719s

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 21 +++++++--------------
 mm/swap.h       | 16 ++++------------
 mm/swap_state.c | 49 +++++++++++++++++++++++++++++++++----------------
 mm/swapfile.c   |  7 ++-----
 4 files changed, 46 insertions(+), 47 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0165c8cad489..b56254a875f8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3801,6 +3801,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	rmap_t rmap_flags = RMAP_NONE;
 	bool exclusive = false;
 	swp_entry_t entry;
+	bool swapcached;
 	pte_t pte;
 	vm_fault_t ret = 0;
 
@@ -3864,21 +3865,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swapcache = folio;
 
 	if (!folio) {
-		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
-		    __swap_count(entry) == 1) {
-			/* skip swapcache and readahead */
-			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
-			if (folio)
-				page = &folio->page;
+		folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
+				     vmf, &swapcached);
+		if (folio) {
+			page = folio_file_page(folio, swp_offset(entry));
+			if (swapcached)
+				swapcache = folio;
 		} else {
-			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						vmf);
-			if (page)
-				folio = page_folio(page);
-			swapcache = folio;
-		}
-
-		if (!folio) {
 			/*
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
diff --git a/mm/swap.h b/mm/swap.h
index 83eab7b67e77..502a2801f817 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -54,10 +54,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
 		bool skip_if_exists);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
-struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
-			      struct vm_fault *vmf);
-struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
-			    struct vm_fault *vmf);
+struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
+			    struct vm_fault *vmf, bool *swapcached);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -88,14 +86,8 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
 	return NULL;
 }
 
-static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
-			struct vm_fault *vmf)
-{
-	return NULL;
-}
-
-static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf)
+static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
+			struct vm_fault *vmf, bool *swapcached)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d39c5369da21..66ff187aa5d3 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -316,6 +316,11 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 	release_pages(pages, nr);
 }
 
+static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
+{
+	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
+}
+
 static inline bool swap_use_vma_readahead(void)
 {
 	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
@@ -870,8 +875,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * Returns the struct folio for entry and addr after the swap entry is read
  * in.
  */
-struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
-			    struct vm_fault *vmf)
+static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
+				  struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
@@ -908,33 +913,45 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
 }
 
 /**
- * swapin_readahead - swap in pages in hope we need them soon
+ * swapin_entry - swap in a page from swap entry
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @vmf: fault information
+ * @swapcached: pointer to a bool used as indicator if the
+ *              page is swapped in through swapcache.
  *
  * Returns the struct page for entry and addr, after queueing swapin.
  *
- * It's a main entry function for swap readahead. By the configuration,
+ * It's a main entry function for swap in. By the configuration,
  * it will read ahead blocks by cluster-based(ie, physical disk based)
- * or vma-based(ie, virtual address based on faulty address) readahead.
+ * or vma-based(ie, virtual address based on faulty address) readahead,
+ * or skip the readahead (ie, ramdisk based swap device).
  */
-struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				struct vm_fault *vmf)
+struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
+			   struct vm_fault *vmf, bool *swapcached)
 {
 	struct mempolicy *mpol;
-	pgoff_t ilx;
 	struct folio *folio;
+	pgoff_t ilx;
+	bool cached;
 
-	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
-	folio = swap_use_vma_readahead() ?
-		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
-		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
-	mpol_cond_put(mpol);
+	if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
+		folio = swapin_direct(entry, gfp_mask, vmf);
+		cached = false;
+	} else {
+		mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
+		if (swap_use_vma_readahead())
+			folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
+		else
+			folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+		mpol_cond_put(mpol);
+		cached = true;
+	}
 
-	if (!folio)
-		return NULL;
-	return folio_file_page(folio, swp_offset(entry));
+	if (swapcached)
+		*swapcached = cached;
+
+	return folio;
 }
 
 #ifdef CONFIG_SYSFS
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f7271504aa0a..ce4e6c10dce7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1866,7 +1866,6 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 		folio = swap_cache_get_folio(entry, vma, addr);
 		if (!folio) {
-			struct page *page;
 			struct vm_fault vmf = {
 				.vma = vma,
 				.address = addr,
@@ -1874,10 +1873,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				.pmd = pmd,
 			};
 
-			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						&vmf);
-			if (page)
-				folio = page_folio(page);
+			folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
+					    &vmf, NULL);
 		}
 		if (!folio) {
 			/*
-- 
2.43.0



* [PATCH v2 6/9] mm/swap: handle swapcache lookup in swapin_entry
  2024-01-02 17:53 [PATCH v2 0/9] swapin refactor for optimization and unified readahead Kairui Song
                   ` (4 preceding siblings ...)
  2024-01-02 17:53 ` [PATCH v2 5/9] mm/swap: introduce swapin_entry for unified readahead policy Kairui Song
@ 2024-01-02 17:53 ` Kairui Song
  2024-01-08  8:26   ` Huang, Ying
  2024-01-02 17:53 ` [PATCH v2 7/9] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO Kairui Song
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-02 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Since all callers of swapin_entry need to check the swap cache first, we
can merge this common routine into swapin_entry, so it can be shared and
optimized later.

Also introduce an enum to better represent the possible swap cache usage,
and add some comments about it, making the usage of the swap cache easier
to understand.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 45 ++++++++++++++++++++-------------------------
 mm/swap.h       | 20 ++++++++++++++++++--
 mm/swap_state.c | 22 ++++++++++++++--------
 mm/swapfile.c   | 21 +++++++++------------
 4 files changed, 61 insertions(+), 47 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b56254a875f8..ab6e76c95632 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3795,13 +3795,13 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct folio *swapcache, *folio = NULL;
+	struct folio *swapcache = NULL, *folio;
+	enum swap_cache_result cache_result;
 	struct page *page;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
 	bool exclusive = false;
 	swp_entry_t entry;
-	bool swapcached;
 	pte_t pte;
 	vm_fault_t ret = 0;
 
@@ -3859,31 +3859,26 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(!si))
 		goto out;
 
-	folio = swap_cache_get_folio(entry, vma, vmf->address);
-	if (folio)
+	folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
+			     vmf, &cache_result);
+	if (folio) {
 		page = folio_file_page(folio, swp_offset(entry));
-	swapcache = folio;
-
-	if (!folio) {
-		folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
-				     vmf, &swapcached);
-		if (folio) {
-			page = folio_file_page(folio, swp_offset(entry));
-			if (swapcached)
-				swapcache = folio;
-		} else {
-			/*
-			 * Back out if somebody else faulted in this pte
-			 * while we released the pte lock.
-			 */
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
-			if (likely(vmf->pte &&
-				   pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
-				ret = VM_FAULT_OOM;
-			goto unlock;
-		}
+		if (cache_result != SWAP_CACHE_BYPASS)
+			swapcache = folio;
+	} else {
+		/*
+		 * Back out if somebody else faulted in this pte
+		 * while we released the pte lock.
+		 */
+		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+				vmf->address, &vmf->ptl);
+		if (likely(vmf->pte &&
+			   pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
+			ret = VM_FAULT_OOM;
+		goto unlock;
+	}
 
+	if (cache_result != SWAP_CACHE_HIT) {
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
diff --git a/mm/swap.h b/mm/swap.h
index 502a2801f817..1f4cdb324bf0 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -4,6 +4,22 @@
 
 struct mempolicy;
 
+/*
+ * Caller of swapin_entry may need to know the cache lookup result:
+ *
+ * SWAP_CACHE_HIT: cache hit, the cached folio is returned.
+ * SWAP_CACHE_MISS: cache miss, folio is allocated, read from swap device
+ *                  and added to the swap cache, but may still return a cached
+ *                  folio if raced (check __read_swap_cache_async).
+ * SWAP_CACHE_BYPASS: cache miss, folio is newly allocated and read
+ *                    from swap device bypassing the cache.
+ */
+enum swap_cache_result {
+	SWAP_CACHE_HIT,
+	SWAP_CACHE_MISS,
+	SWAP_CACHE_BYPASS,
+};
+
 #ifdef CONFIG_SWAP
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
@@ -55,7 +71,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
-			    struct vm_fault *vmf, bool *swapcached);
+			    struct vm_fault *vmf, enum swap_cache_result *result);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -87,7 +103,7 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
 }
 
 static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf, bool *swapcached)
+			struct vm_fault *vmf, enum swap_cache_result *result)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 66ff187aa5d3..f6f1e6f5d782 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -917,8 +917,7 @@ static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @vmf: fault information
- * @swapcached: pointer to a bool used as indicator if the
- *              page is swapped in through swapcache.
+ * @result: a return value to indicate swap cache usage.
  *
  * Returns the struct page for entry and addr, after queueing swapin.
  *
@@ -928,16 +927,22 @@ static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
  * or skip the readahead (ie, ramdisk based swap device).
  */
 struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
-			   struct vm_fault *vmf, bool *swapcached)
+			   struct vm_fault *vmf, enum swap_cache_result *result)
 {
+	enum swap_cache_result cache_result;
 	struct mempolicy *mpol;
 	struct folio *folio;
 	pgoff_t ilx;
-	bool cached;
+
+	folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
+	if (folio) {
+		cache_result = SWAP_CACHE_HIT;
+		goto done;
+	}
 
 	if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
 		folio = swapin_direct(entry, gfp_mask, vmf);
-		cached = false;
+		cache_result = SWAP_CACHE_BYPASS;
 	} else {
 		mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 		if (swap_use_vma_readahead())
@@ -945,11 +950,12 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
 		else
 			folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
 		mpol_cond_put(mpol);
-		cached = true;
+		cache_result = SWAP_CACHE_MISS;
 	}
 
-	if (swapcached)
-		*swapcached = cached;
+done:
+	if (result)
+		*result = cache_result;
 
 	return folio;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ce4e6c10dce7..5aa44de11edc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1845,6 +1845,13 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		int ret;
 		pte_t ptent;
 
+		struct vm_fault vmf = {
+			.vma = vma,
+			.address = addr,
+			.real_address = addr,
+			.pmd = pmd,
+		};
+
 		if (!pte++) {
 			pte = pte_offset_map(pmd, addr);
 			if (!pte)
@@ -1864,18 +1871,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		pte_unmap(pte);
 		pte = NULL;
 
-		folio = swap_cache_get_folio(entry, vma, addr);
-		if (!folio) {
-			struct vm_fault vmf = {
-				.vma = vma,
-				.address = addr,
-				.real_address = addr,
-				.pmd = pmd,
-			};
-
-			folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
-					    &vmf, NULL);
-		}
+		folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
+				     &vmf, NULL);
 		if (!folio) {
 			/*
 			 * The entry could have been freed, and will not
-- 
2.43.0



* [PATCH v2 7/9] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO
  2024-01-02 17:53 [PATCH v2 0/9] swapin refactor for optimization and unified readahead Kairui Song
                   ` (5 preceding siblings ...)
  2024-01-02 17:53 ` [PATCH v2 6/9] mm/swap: handle swapcache lookup in swapin_entry Kairui Song
@ 2024-01-02 17:53 ` Kairui Song
  2024-01-03 12:50   ` kernel test robot
  2024-01-02 17:53 ` [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault Kairui Song
  2024-01-02 17:53 ` [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally Kairui Song
  8 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-02 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

When an xa_value is returned by the cache lookup, keep it to be used
later for the workingset refault check instead of looking it up again
in swapin_direct.

This does have a side effect of making swapoff also trigger the workingset
check, but that should be fine since swapoff already affects the workload
in many ways.

After this commit, swapin is about 4% faster for ZRAM. Result of a micro
benchmark which uses madvise to swap out 10G of zero-filled data to ZRAM
and then reads it back in:

Before: 11143285 us
After:  10692644 us (+4.1%)

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c      |  2 +-
 mm/swap.h       |  3 ++-
 mm/swap_state.c | 24 +++++++++++++-----------
 3 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 928aa2304932..9da9f7a0e620 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1872,7 +1872,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 
 	/* Look it up and read it in.. */
-	folio = swap_cache_get_folio(swap, NULL, 0);
+	folio = swap_cache_get_folio(swap, NULL, 0, NULL);
 	if (!folio) {
 		/* Or update major stats only when swapin succeeds?? */
 		if (fault_type) {
diff --git a/mm/swap.h b/mm/swap.h
index 1f4cdb324bf0..9180411afcfe 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -58,7 +58,8 @@ void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
 struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr);
+		struct vm_area_struct *vma, unsigned long addr,
+		void **shadowp);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
 		pgoff_t index);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index f6f1e6f5d782..21badd4f0fc7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -335,12 +335,18 @@ static inline bool swap_use_vma_readahead(void)
  * Caller must lock the swap device or hold a reference to keep it valid.
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr)
+		struct vm_area_struct *vma, unsigned long addr, void **shadowp)
 {
 	struct folio *folio;
 
-	folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
-	if (!IS_ERR(folio)) {
+	folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
+	if (xa_is_value(folio)) {
+		if (shadowp)
+			*shadowp = folio;
+		return NULL;
+	}
+
+	if (folio) {
 		bool vma_ra = swap_use_vma_readahead();
 		bool readahead;
 
@@ -370,8 +376,6 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
 			if (!vma || !vma_ra)
 				atomic_inc(&swapin_readahead_hits);
 		}
-	} else {
-		folio = NULL;
 	}
 
 	return folio;
@@ -876,11 +880,10 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * in.
  */
 static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
-				  struct vm_fault *vmf)
+				  struct vm_fault *vmf, void *shadow)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
-	void *shadow = NULL;
 
 	/* skip swapcache */
 	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
@@ -897,7 +900,6 @@ static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
 
 		mem_cgroup_swapin_uncharge_swap(entry);
 
-		shadow = get_shadow_from_swap_cache(entry);
 		if (shadow)
 			workingset_refault(folio, shadow);
 
@@ -931,17 +933,18 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
 {
 	enum swap_cache_result cache_result;
 	struct mempolicy *mpol;
+	void *shadow = NULL;
 	struct folio *folio;
 	pgoff_t ilx;
 
-	folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
+	folio = swap_cache_get_folio(entry, vmf->vma, vmf->address, &shadow);
 	if (folio) {
 		cache_result = SWAP_CACHE_HIT;
 		goto done;
 	}
 
 	if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
-		folio = swapin_direct(entry, gfp_mask, vmf);
+		folio = swapin_direct(entry, gfp_mask, vmf, shadow);
 		cache_result = SWAP_CACHE_BYPASS;
 	} else {
 		mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
@@ -952,7 +955,6 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
 		mpol_cond_put(mpol);
 		cache_result = SWAP_CACHE_MISS;
 	}
-
 done:
 	if (result)
 		*result = cache_result;
-- 
2.43.0



* [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault
  2024-01-02 17:53 [PATCH v2 0/9] swapin refactor for optimization and unified readahead Kairui Song
                   ` (6 preceding siblings ...)
  2024-01-02 17:53 ` [PATCH v2 7/9] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO Kairui Song
@ 2024-01-02 17:53 ` Kairui Song
  2024-01-09  1:08   ` Huang, Ying
  2024-01-02 17:53 ` [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally Kairui Song
  8 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-02 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

There are two places where swapin is not caused by a direct anon page fault:
- shmem swapin, invoked indirectly through the shmem mapping
- swapoff

They used to construct a pseudo vmfault struct for the swapin function.
Shmem has dropped the pseudo vmfault recently in commit ddc1a5cbc05d
("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). The swapoff
path is still using one.

Introduce a helper for them both. This helps save stack usage for the
swapoff path and helps apply a unified swapin cache and readahead policy
check.

Due to the missing vmfault info, the caller has to pass in the mempolicy
explicitly, which makes it different from swapin_entry, so name it
swapin_entry_mpol.

This commit converts swapoff to use this helper; follow-up commits will
convert shmem to use it too.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h       |  9 +++++++++
 mm/swap_state.c | 40 ++++++++++++++++++++++++++++++++--------
 mm/swapfile.c   | 15 ++++++---------
 3 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 9180411afcfe..8f790a67b948 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -73,6 +73,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
 			    struct vm_fault *vmf, enum swap_cache_result *result);
+struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
+				struct mempolicy *mpol, pgoff_t ilx,
+				enum swap_cache_result *result);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -109,6 +112,12 @@ static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
 	return NULL;
 }
 
+static inline struct page *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
+		struct mempolicy *mpol, pgoff_t ilx, enum swap_cache_result *result)
+{
+	return NULL;
+}
+
 static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
 {
 	return 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 21badd4f0fc7..3edf4b63158d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -880,14 +880,13 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * in.
  */
 static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
-				  struct vm_fault *vmf, void *shadow)
+				   struct mempolicy *mpol, pgoff_t ilx,
+				   void *shadow)
 {
-	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
 
-	/* skip swapcache */
-	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-				vma, vmf->address, false);
+	folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0,
+			mpol, ilx, numa_node_id());
 	if (folio) {
 		if (mem_cgroup_swapin_charge_folio(folio, NULL,
 						   GFP_KERNEL, entry)) {
@@ -943,18 +942,18 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
 		goto done;
 	}
 
+	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 	if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
-		folio = swapin_direct(entry, gfp_mask, vmf, shadow);
+		folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
 		cache_result = SWAP_CACHE_BYPASS;
 	} else {
-		mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 		if (swap_use_vma_readahead())
 			folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
 		else
 			folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
-		mpol_cond_put(mpol);
 		cache_result = SWAP_CACHE_MISS;
 	}
+	mpol_cond_put(mpol);
 done:
 	if (result)
 		*result = cache_result;
@@ -962,6 +961,31 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
 	return folio;
 }
 
+struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
+				struct mempolicy *mpol, pgoff_t ilx,
+				enum swap_cache_result *result)
+{
+	enum swap_cache_result cache_result;
+	void *shadow = NULL;
+	struct folio *folio;
+
+	folio = swap_cache_get_folio(entry, NULL, 0, &shadow);
+	if (folio) {
+		cache_result = SWAP_CACHE_HIT;
+	} else if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
+		folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
+		cache_result = SWAP_CACHE_BYPASS;
+	} else {
+		folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+		cache_result = SWAP_CACHE_MISS;
+	}
+
+	if (result)
+		*result = cache_result;
+
+	return folio;
+}
+
 #ifdef CONFIG_SYSFS
 static ssize_t vma_ra_enabled_show(struct kobject *kobj,
 				     struct kobj_attribute *attr, char *buf)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5aa44de11edc..2f77bf143af8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1840,18 +1840,13 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	do {
 		struct folio *folio;
 		unsigned long offset;
+		struct mempolicy *mpol;
 		unsigned char swp_count;
 		swp_entry_t entry;
+		pgoff_t ilx;
 		int ret;
 		pte_t ptent;
 
-		struct vm_fault vmf = {
-			.vma = vma,
-			.address = addr,
-			.real_address = addr,
-			.pmd = pmd,
-		};
-
 		if (!pte++) {
 			pte = pte_offset_map(pmd, addr);
 			if (!pte)
@@ -1871,8 +1866,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		pte_unmap(pte);
 		pte = NULL;
 
-		folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
-				     &vmf, NULL);
+		mpol = get_vma_policy(vma, addr, 0, &ilx);
+		folio = swapin_entry_mpol(entry, GFP_HIGHUSER_MOVABLE,
+					  mpol, ilx, NULL);
+		mpol_cond_put(mpol);
 		if (!folio) {
 			/*
 			 * The entry could have been freed, and will not
-- 
2.43.0



* [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally
  2024-01-02 17:53 [PATCH v2 0/9] swapin refactor for optimization and unified readahead Kairui Song
                   ` (7 preceding siblings ...)
  2024-01-02 17:53 ` [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault Kairui Song
@ 2024-01-02 17:53 ` Kairui Song
  2024-01-03 11:56   ` kernel test robot
                     ` (2 more replies)
  8 siblings, 3 replies; 38+ messages in thread
From: Kairui Song @ 2024-01-02 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Huang, Ying, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently, shmem uses cluster readahead for all swap backends. Cluster
readahead is not a good solution for ramdisk based devices (e.g. ZRAM) at all.

After switching to the new helper, most benchmarks showed a good result:

- Single file sequential read:
  perf stat --repeat 20 dd if=/tmpfs/test of=/dev/null bs=1M count=8192
  (/tmpfs/test is a zero filled file, using brd as swap, 4G memcg limit)
  Before: 22.248 +- 0.549
  After:  22.021 +- 0.684 (-1.1%)

- Random read stress test:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
  --size=256m --ioengine=mmap --rw=randread --random_distribution=random \
  --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

  Before: 1818MiB/s
  After:  1888MiB/s (+3.85%)

- Zipf biased random read stress test:
  fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
  --size=256m --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
  --time_based --ramp_time=1m --runtime=5m --group_reporting
  (using brd as swap, 2G memcg limit)

  Before: 31.1GiB/s
  After:  32.3GiB/s (+3.86%)

So cluster readahead doesn't help much even for single sequential reads,
and for random stress tests the performance is better without it.

Considering that both memory and the swap device will slowly get more
fragmented, and that the commonly used ZRAM consumes much more CPU than a
plain ramdisk, false readahead could occur more frequently and waste more
CPU. Direct swapin is cheaper, so use the new helper and skip readahead
for SWP_SYNCHRONOUS_IO devices.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c      | 67 +++++++++++++++++++++++++------------------------
 mm/swap.h       |  9 -------
 mm/swap_state.c | 11 ++++++--
 3 files changed, 43 insertions(+), 44 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 9da9f7a0e620..3c0729fe934d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1564,20 +1564,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
 static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
 			pgoff_t index, unsigned int order, pgoff_t *ilx);
 
-static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
-			struct shmem_inode_info *info, pgoff_t index)
-{
-	struct mempolicy *mpol;
-	pgoff_t ilx;
-	struct folio *folio;
-
-	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
-	folio = swap_cluster_readahead(swap, gfp, mpol, ilx);
-	mpol_cond_put(mpol);
-
-	return folio;
-}
-
 /*
  * Make sure huge_gfp is always more limited than limit_gfp.
  * Some of the flags set permissions, while others set limitations.
@@ -1851,9 +1837,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	enum swap_cache_result cache_result;
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
+	struct mempolicy *mpol;
 	swp_entry_t swap;
+	pgoff_t ilx;
 	int error;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
@@ -1871,36 +1860,40 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			return -EINVAL;
 	}
 
-	/* Look it up and read it in.. */
-	folio = swap_cache_get_folio(swap, NULL, 0, NULL);
+	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
+	folio = swapin_entry_mpol(swap, gfp, mpol, ilx, &cache_result);
+	mpol_cond_put(mpol);
+
 	if (!folio) {
-		/* Or update major stats only when swapin succeeds?? */
+		error = -ENOMEM;
+		goto failed;
+	}
+	if (cache_result != SWAP_CACHE_HIT) {
 		if (fault_type) {
 			*fault_type |= VM_FAULT_MAJOR;
 			count_vm_event(PGMAJFAULT);
 			count_memcg_event_mm(fault_mm, PGMAJFAULT);
 		}
-		/* Here we actually start the io */
-		folio = shmem_swapin_cluster(swap, gfp, info, index);
-		if (!folio) {
-			error = -ENOMEM;
-			goto failed;
-		}
 	}
 
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
-	if (!folio_test_swapcache(folio) ||
-	    folio->swap.val != swap.val ||
-	    !shmem_confirm_swap(mapping, index, swap)) {
+	if (cache_result != SWAP_CACHE_BYPASS) {
+		/* With cache bypass, folio is new allocated, sync, and not in cache */
+		if (!folio_test_swapcache(folio) || folio->swap.val != swap.val) {
+			error = -EEXIST;
+			goto unlock;
+		}
+		if (!folio_test_uptodate(folio)) {
+			error = -EIO;
+			goto failed;
+		}
+		folio_wait_writeback(folio);
+	}
+	if (!shmem_confirm_swap(mapping, index, swap)) {
 		error = -EEXIST;
 		goto unlock;
 	}
-	if (!folio_test_uptodate(folio)) {
-		error = -EIO;
-		goto failed;
-	}
-	folio_wait_writeback(folio);
 
 	/*
 	 * Some architectures may have to restore extra metadata to the
@@ -1908,12 +1901,19 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	 */
 	arch_swap_restore(swap, folio);
 
-	if (shmem_should_replace_folio(folio, gfp)) {
+	/* With cache bypass, folio is new allocated and always respect gfp flags */
+	if (cache_result != SWAP_CACHE_BYPASS && shmem_should_replace_folio(folio, gfp)) {
 		error = shmem_replace_folio(&folio, gfp, info, index);
 		if (error)
 			goto failed;
 	}
 
+	/*
+	 * The expected value checking below should be enough to ensure
+	 * only one up-to-date swapin success. swap_free() is called after
+	 * this, so the entry can't be reused. As long as the mapping still
+	 * has the old entry value, it's never swapped in or modified.
+	 */
 	error = shmem_add_to_page_cache(folio, mapping, index,
 					swp_to_radix_entry(swap), gfp);
 	if (error)
@@ -1924,7 +1924,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (sgp == SGP_WRITE)
 		folio_mark_accessed(folio);
 
-	delete_from_swap_cache(folio);
+	if (cache_result != SWAP_CACHE_BYPASS)
+		delete_from_swap_cache(folio);
 	folio_mark_dirty(folio);
 	swap_free(swap);
 	put_swap_device(si);
diff --git a/mm/swap.h b/mm/swap.h
index 8f790a67b948..20f4048c971c 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -57,9 +57,6 @@ void __delete_from_swap_cache(struct folio *folio,
 void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
-struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr,
-		void **shadowp);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
 		pgoff_t index);
 
@@ -123,12 +120,6 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
 	return 0;
 }
 
-static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr)
-{
-	return NULL;
-}
-
 static inline
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
 		pgoff_t index)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3edf4b63158d..10eec68475dd 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -318,7 +318,14 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 
 static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
 {
-	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
+	int count;
+
+	if (!data_race(si->flags & SWP_SYNCHRONOUS_IO))
+		return false;
+
+	count = __swap_count(entry);
+
+	return (count == 1 || count == SWAP_MAP_SHMEM);
 }
 
 static inline bool swap_use_vma_readahead(void)
@@ -334,7 +341,7 @@ static inline bool swap_use_vma_readahead(void)
  *
  * Caller must lock the swap device or hold a reference to keep it valid.
  */
-struct folio *swap_cache_get_folio(swp_entry_t entry,
+static struct folio *swap_cache_get_folio(swp_entry_t entry,
 		struct vm_area_struct *vma, unsigned long addr, void **shadowp)
 {
 	struct folio *folio;
-- 
2.43.0



* Re: [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally
  2024-01-02 17:53 ` [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally Kairui Song
@ 2024-01-03 11:56   ` kernel test robot
  2024-01-03 13:45   ` kernel test robot
  2024-01-09  2:03   ` Huang, Ying
  2 siblings, 0 replies; 38+ messages in thread
From: kernel test robot @ 2024-01-03 11:56 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: llvm, oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Chris Li, Huang, Ying, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel, Kairui Song

Hi Kairui,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20240103]
[cannot apply to linus/master v6.7-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swapfile-c-add-back-some-comment/20240103-015650
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240102175338.62012-10-ryncsn%40gmail.com
patch subject: [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally
config: arm-randconfig-004-20240103 (https://download.01.org/0day-ci/archive/20240103/202401031953.WBMZtIe7-lkp@intel.com/config)
compiler: clang version 18.0.0git (https://github.com/llvm/llvm-project baf8a39aaf8b61a38b5b2b5591deb765e42eb00b)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240103/202401031953.WBMZtIe7-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401031953.WBMZtIe7-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/shmem.c:1864:8: error: incompatible pointer types assigning to 'struct folio *' from 'struct page *' [-Werror,-Wincompatible-pointer-types]
    1864 |         folio = swapin_entry_mpol(swap, gfp, mpol, ilx, &cache_result);
         |               ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   1 error generated.


vim +1864 mm/shmem.c

  1826	
  1827	/*
  1828	 * Swap in the folio pointed to by *foliop.
  1829	 * Caller has to make sure that *foliop contains a valid swapped folio.
  1830	 * Returns 0 and the folio in foliop if success. On failure, returns the
  1831	 * error code and NULL in *foliop.
  1832	 */
  1833	static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
  1834				     struct folio **foliop, enum sgp_type sgp,
  1835				     gfp_t gfp, struct mm_struct *fault_mm,
  1836				     vm_fault_t *fault_type)
  1837	{
  1838		struct address_space *mapping = inode->i_mapping;
  1839		struct shmem_inode_info *info = SHMEM_I(inode);
  1840		enum swap_cache_result cache_result;
  1841		struct swap_info_struct *si;
  1842		struct folio *folio = NULL;
  1843		struct mempolicy *mpol;
  1844		swp_entry_t swap;
  1845		pgoff_t ilx;
  1846		int error;
  1847	
  1848		VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
  1849		swap = radix_to_swp_entry(*foliop);
  1850		*foliop = NULL;
  1851	
  1852		if (is_poisoned_swp_entry(swap))
  1853			return -EIO;
  1854	
  1855		si = get_swap_device(swap);
  1856		if (!si) {
  1857			if (!shmem_confirm_swap(mapping, index, swap))
  1858				return -EEXIST;
  1859			else
  1860				return -EINVAL;
  1861		}
  1862	
  1863		mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> 1864		folio = swapin_entry_mpol(swap, gfp, mpol, ilx, &cache_result);
  1865		mpol_cond_put(mpol);
  1866	
  1867		if (!folio) {
  1868			error = -ENOMEM;
  1869			goto failed;
  1870		}
  1871		if (cache_result != SWAP_CACHE_HIT) {
  1872			if (fault_type) {
  1873				*fault_type |= VM_FAULT_MAJOR;
  1874				count_vm_event(PGMAJFAULT);
  1875				count_memcg_event_mm(fault_mm, PGMAJFAULT);
  1876			}
  1877		}
  1878	
  1879		/* We have to do this with folio locked to prevent races */
  1880		folio_lock(folio);
  1881		if (cache_result != SWAP_CACHE_BYPASS) {
  1882			/* With cache bypass, folio is new allocated, sync, and not in cache */
  1883			if (!folio_test_swapcache(folio) || folio->swap.val != swap.val) {
  1884				error = -EEXIST;
  1885				goto unlock;
  1886			}
  1887			if (!folio_test_uptodate(folio)) {
  1888				error = -EIO;
  1889				goto failed;
  1890			}
  1891			folio_wait_writeback(folio);
  1892		}
  1893		if (!shmem_confirm_swap(mapping, index, swap)) {
  1894			error = -EEXIST;
  1895			goto unlock;
  1896		}
  1897	
  1898		/*
  1899		 * Some architectures may have to restore extra metadata to the
  1900		 * folio after reading from swap.
  1901		 */
  1902		arch_swap_restore(swap, folio);
  1903	
  1904		/* With cache bypass, folio is new allocated and always respect gfp flags */
  1905		if (cache_result != SWAP_CACHE_BYPASS && shmem_should_replace_folio(folio, gfp)) {
  1906			error = shmem_replace_folio(&folio, gfp, info, index);
  1907			if (error)
  1908				goto failed;
  1909		}
  1910	
  1911		/*
  1912		 * The expected value checking below should be enough to ensure
  1913		 * only one up-to-date swapin success. swap_free() is called after
  1914		 * this, so the entry can't be reused. As long as the mapping still
  1915		 * has the old entry value, it's never swapped in or modified.
  1916		 */
  1917		error = shmem_add_to_page_cache(folio, mapping, index,
  1918						swp_to_radix_entry(swap), gfp);
  1919		if (error)
  1920			goto failed;
  1921	
  1922		shmem_recalc_inode(inode, 0, -1);
  1923	
  1924		if (sgp == SGP_WRITE)
  1925			folio_mark_accessed(folio);
  1926	
  1927		if (cache_result != SWAP_CACHE_BYPASS)
  1928			delete_from_swap_cache(folio);
  1929		folio_mark_dirty(folio);
  1930		swap_free(swap);
  1931		put_swap_device(si);
  1932	
  1933		*foliop = folio;
  1934		return 0;
  1935	failed:
  1936		if (!shmem_confirm_swap(mapping, index, swap))
  1937			error = -EEXIST;
  1938		if (error == -EIO)
  1939			shmem_set_folio_swapin_error(inode, index, folio, swap);
  1940	unlock:
  1941		if (folio) {
  1942			folio_unlock(folio);
  1943			folio_put(folio);
  1944		}
  1945		put_swap_device(si);
  1946	
  1947		return error;
  1948	}
  1949	
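
The mismatch most likely comes from the !CONFIG_SWAP stub of
swapin_entry_mpol() added to mm/swap.h by patch 8/9 (quoted later in this
thread), which is declared to return struct page * while the real helper
returns struct folio *. A minimal sketch of a possible fix, assuming that
stub is indeed the culprit:

	/* mm/swap.h, CONFIG_SWAP=n stub: return type must match the real helper */
	static inline struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
			struct mempolicy *mpol, pgoff_t ilx,
			enum swap_cache_result *result)
	{
		return NULL;
	}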

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 7/9] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO
  2024-01-02 17:53 ` [PATCH v2 7/9] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO Kairui Song
@ 2024-01-03 12:50   ` kernel test robot
  0 siblings, 0 replies; 38+ messages in thread
From: kernel test robot @ 2024-01-03 12:50 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Chris Li, Huang, Ying, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel, Kairui Song

Hi Kairui,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20240103]
[cannot apply to linus/master v6.7-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swapfile-c-add-back-some-comment/20240103-015650
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240102175338.62012-8-ryncsn%40gmail.com
patch subject: [PATCH v2 7/9] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO
config: arc-vdk_hs38_smp_defconfig (https://download.01.org/0day-ci/archive/20240103/202401032010.yrIDf885-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240103/202401032010.yrIDf885-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401032010.yrIDf885-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/swap_state.c:884: warning: Function parameter or member 'shadow' not described in 'swapin_direct'


vim +884 mm/swap_state.c

d9bfcfdc41e8e5 Huang Ying  2017-09-06  872  
b16a5db0ccd159 Kairui Song 2024-01-03  873  /**
b16a5db0ccd159 Kairui Song 2024-01-03  874   * swapin_direct - swap in folios skipping swap cache and readahead
b16a5db0ccd159 Kairui Song 2024-01-03  875   * @entry: swap entry of this memory
b16a5db0ccd159 Kairui Song 2024-01-03  876   * @gfp_mask: memory allocation flags
b16a5db0ccd159 Kairui Song 2024-01-03  877   * @vmf: fault information
b16a5db0ccd159 Kairui Song 2024-01-03  878   *
b16a5db0ccd159 Kairui Song 2024-01-03  879   * Returns the struct folio for entry and addr after the swap entry is read
b16a5db0ccd159 Kairui Song 2024-01-03  880   * in.
b16a5db0ccd159 Kairui Song 2024-01-03  881   */
983c0b807f7eda Kairui Song 2024-01-03  882  static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
cd81b9fd3de376 Kairui Song 2024-01-03  883  				  struct vm_fault *vmf, void *shadow)
b16a5db0ccd159 Kairui Song 2024-01-03 @884  {
b16a5db0ccd159 Kairui Song 2024-01-03  885  	struct vm_area_struct *vma = vmf->vma;
b16a5db0ccd159 Kairui Song 2024-01-03  886  	struct folio *folio;
b16a5db0ccd159 Kairui Song 2024-01-03  887  
b16a5db0ccd159 Kairui Song 2024-01-03  888  	/* skip swapcache */
b16a5db0ccd159 Kairui Song 2024-01-03  889  	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
b16a5db0ccd159 Kairui Song 2024-01-03  890  				vma, vmf->address, false);
b16a5db0ccd159 Kairui Song 2024-01-03  891  	if (folio) {
9e22e4254bdb8c Kairui Song 2024-01-03  892  		if (mem_cgroup_swapin_charge_folio(folio, NULL,
64ae20cbed3891 Kairui Song 2024-01-03  893  						   GFP_KERNEL, entry)) {
b16a5db0ccd159 Kairui Song 2024-01-03  894  			folio_put(folio);
b16a5db0ccd159 Kairui Song 2024-01-03  895  			return NULL;
b16a5db0ccd159 Kairui Song 2024-01-03  896  		}
64ae20cbed3891 Kairui Song 2024-01-03  897  
64ae20cbed3891 Kairui Song 2024-01-03  898  		__folio_set_locked(folio);
64ae20cbed3891 Kairui Song 2024-01-03  899  		__folio_set_swapbacked(folio);
64ae20cbed3891 Kairui Song 2024-01-03  900  
b16a5db0ccd159 Kairui Song 2024-01-03  901  		mem_cgroup_swapin_uncharge_swap(entry);
b16a5db0ccd159 Kairui Song 2024-01-03  902  
b16a5db0ccd159 Kairui Song 2024-01-03  903  		if (shadow)
b16a5db0ccd159 Kairui Song 2024-01-03  904  			workingset_refault(folio, shadow);
b16a5db0ccd159 Kairui Song 2024-01-03  905  
b16a5db0ccd159 Kairui Song 2024-01-03  906  		folio_add_lru(folio);
b16a5db0ccd159 Kairui Song 2024-01-03  907  
b16a5db0ccd159 Kairui Song 2024-01-03  908  		/* To provide entry to swap_read_folio() */
b16a5db0ccd159 Kairui Song 2024-01-03  909  		folio->swap = entry;
b16a5db0ccd159 Kairui Song 2024-01-03  910  		swap_read_folio(folio, true, NULL);
b16a5db0ccd159 Kairui Song 2024-01-03  911  		folio->private = NULL;
b16a5db0ccd159 Kairui Song 2024-01-03  912  	}
b16a5db0ccd159 Kairui Song 2024-01-03  913  
b16a5db0ccd159 Kairui Song 2024-01-03  914  	return folio;
b16a5db0ccd159 Kairui Song 2024-01-03  915  }
b16a5db0ccd159 Kairui Song 2024-01-03  916  
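
The warning only asks for a kernel-doc description of the new @shadow
parameter; a minimal sketch of the fixed comment block, with the wording of
the added line being an assumption rather than a posted fix:

	/**
	 * swapin_direct - swap in folios skipping swap cache and readahead
	 * @entry: swap entry of this memory
	 * @gfp_mask: memory allocation flags
	 * @vmf: fault information
	 * @shadow: workingset shadow entry found by an earlier swap cache lookup
	 *
	 * Returns the struct folio for entry and addr after the swap entry is read
	 * in.
	 */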

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally
  2024-01-02 17:53 ` [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally Kairui Song
  2024-01-03 11:56   ` kernel test robot
@ 2024-01-03 13:45   ` kernel test robot
  2024-01-09  2:03   ` Huang, Ying
  2 siblings, 0 replies; 38+ messages in thread
From: kernel test robot @ 2024-01-03 13:45 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Chris Li, Huang, Ying, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel, Kairui Song

Hi Kairui,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20240103]
[cannot apply to linus/master v6.7-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swapfile-c-add-back-some-comment/20240103-015650
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240102175338.62012-10-ryncsn%40gmail.com
patch subject: [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally
config: i386-randconfig-014-20240103 (https://download.01.org/0day-ci/archive/20240103/202401032120.uWKybc56-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240103/202401032120.uWKybc56-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401032120.uWKybc56-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/shmem.c: In function 'shmem_swapin_folio':
>> mm/shmem.c:1864:8: error: assignment from incompatible pointer type [-Werror=incompatible-pointer-types]
     folio = swapin_entry_mpol(swap, gfp, mpol, ilx, &cache_result);
           ^
   cc1: some warnings being treated as errors


vim +1864 mm/shmem.c

  1826	
  1827	/*
  1828	 * Swap in the folio pointed to by *foliop.
  1829	 * Caller has to make sure that *foliop contains a valid swapped folio.
  1830	 * Returns 0 and the folio in foliop if success. On failure, returns the
  1831	 * error code and NULL in *foliop.
  1832	 */
  1833	static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
  1834				     struct folio **foliop, enum sgp_type sgp,
  1835				     gfp_t gfp, struct mm_struct *fault_mm,
  1836				     vm_fault_t *fault_type)
  1837	{
  1838		struct address_space *mapping = inode->i_mapping;
  1839		struct shmem_inode_info *info = SHMEM_I(inode);
  1840		enum swap_cache_result cache_result;
  1841		struct swap_info_struct *si;
  1842		struct folio *folio = NULL;
  1843		struct mempolicy *mpol;
  1844		swp_entry_t swap;
  1845		pgoff_t ilx;
  1846		int error;
  1847	
  1848		VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
  1849		swap = radix_to_swp_entry(*foliop);
  1850		*foliop = NULL;
  1851	
  1852		if (is_poisoned_swp_entry(swap))
  1853			return -EIO;
  1854	
  1855		si = get_swap_device(swap);
  1856		if (!si) {
  1857			if (!shmem_confirm_swap(mapping, index, swap))
  1858				return -EEXIST;
  1859			else
  1860				return -EINVAL;
  1861		}
  1862	
  1863		mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> 1864		folio = swapin_entry_mpol(swap, gfp, mpol, ilx, &cache_result);
  1865		mpol_cond_put(mpol);
  1866	
  1867		if (!folio) {
  1868			error = -ENOMEM;
  1869			goto failed;
  1870		}
  1871		if (cache_result != SWAP_CACHE_HIT) {
  1872			if (fault_type) {
  1873				*fault_type |= VM_FAULT_MAJOR;
  1874				count_vm_event(PGMAJFAULT);
  1875				count_memcg_event_mm(fault_mm, PGMAJFAULT);
  1876			}
  1877		}
  1878	
  1879		/* We have to do this with folio locked to prevent races */
  1880		folio_lock(folio);
  1881		if (cache_result != SWAP_CACHE_BYPASS) {
  1882			/* With cache bypass, folio is new allocated, sync, and not in cache */
  1883			if (!folio_test_swapcache(folio) || folio->swap.val != swap.val) {
  1884				error = -EEXIST;
  1885				goto unlock;
  1886			}
  1887			if (!folio_test_uptodate(folio)) {
  1888				error = -EIO;
  1889				goto failed;
  1890			}
  1891			folio_wait_writeback(folio);
  1892		}
  1893		if (!shmem_confirm_swap(mapping, index, swap)) {
  1894			error = -EEXIST;
  1895			goto unlock;
  1896		}
  1897	
  1898		/*
  1899		 * Some architectures may have to restore extra metadata to the
  1900		 * folio after reading from swap.
  1901		 */
  1902		arch_swap_restore(swap, folio);
  1903	
  1904		/* With cache bypass, folio is new allocated and always respect gfp flags */
  1905		if (cache_result != SWAP_CACHE_BYPASS && shmem_should_replace_folio(folio, gfp)) {
  1906			error = shmem_replace_folio(&folio, gfp, info, index);
  1907			if (error)
  1908				goto failed;
  1909		}
  1910	
  1911		/*
  1912		 * The expected value checking below should be enough to ensure
  1913		 * only one up-to-date swapin success. swap_free() is called after
  1914		 * this, so the entry can't be reused. As long as the mapping still
  1915		 * has the old entry value, it's never swapped in or modified.
  1916		 */
  1917		error = shmem_add_to_page_cache(folio, mapping, index,
  1918						swp_to_radix_entry(swap), gfp);
  1919		if (error)
  1920			goto failed;
  1921	
  1922		shmem_recalc_inode(inode, 0, -1);
  1923	
  1924		if (sgp == SGP_WRITE)
  1925			folio_mark_accessed(folio);
  1926	
  1927		if (cache_result != SWAP_CACHE_BYPASS)
  1928			delete_from_swap_cache(folio);
  1929		folio_mark_dirty(folio);
  1930		swap_free(swap);
  1931		put_swap_device(si);
  1932	
  1933		*foliop = folio;
  1934		return 0;
  1935	failed:
  1936		if (!shmem_confirm_swap(mapping, index, swap))
  1937			error = -EEXIST;
  1938		if (error == -EIO)
  1939			shmem_set_folio_swapin_error(inode, index, folio, swap);
  1940	unlock:
  1941		if (folio) {
  1942			folio_unlock(folio);
  1943			folio_put(folio);
  1944		}
  1945		put_swap_device(si);
  1946	
  1947		return error;
  1948	}
  1949	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 2/9] mm/swap: move no readahead swapin code to a stand-alone helper
  2024-01-02 17:53 ` [PATCH v2 2/9] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
@ 2024-01-04  7:28   ` Huang, Ying
  2024-01-05  7:43     ` Kairui Song
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Ying @ 2024-01-04  7:28 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> No feature change, simply move the routine to a standalone function to
> be re-used later. The error path handling is copied from the "out_page"
> label, to make the code change minimized for easier reviewing.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c     | 32 ++++----------------------------
>  mm/swap.h       |  8 ++++++++
>  mm/swap_state.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 59 insertions(+), 28 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a0a50d3754f0..0165c8cad489 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3803,7 +3803,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	swp_entry_t entry;
>  	pte_t pte;
>  	vm_fault_t ret = 0;
> -	void *shadow = NULL;
>  
>  	if (!pte_unmap_same(vmf))
>  		goto out;
> @@ -3867,33 +3866,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!folio) {
>  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>  		    __swap_count(entry) == 1) {
> -			/* skip swapcache */
> -			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -						vma, vmf->address, false);
> -			page = &folio->page;
> -			if (folio) {
> -				__folio_set_locked(folio);
> -				__folio_set_swapbacked(folio);
> -
> -				if (mem_cgroup_swapin_charge_folio(folio,
> -							vma->vm_mm, GFP_KERNEL,
> -							entry)) {
> -					ret = VM_FAULT_OOM;
> -					goto out_page;
> -				}
> -				mem_cgroup_swapin_uncharge_swap(entry);
> -
> -				shadow = get_shadow_from_swap_cache(entry);
> -				if (shadow)
> -					workingset_refault(folio, shadow);
> -
> -				folio_add_lru(folio);
> -
> -				/* To provide entry to swap_read_folio() */
> -				folio->swap = entry;
> -				swap_read_folio(folio, true, NULL);
> -				folio->private = NULL;
> -			}
> +			/* skip swapcache and readahead */
> +			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
> +			if (folio)
> +				page = &folio->page;
>  		} else {
>  			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
>  						vmf);
> diff --git a/mm/swap.h b/mm/swap.h
> index 758c46ca671e..83eab7b67e77 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -56,6 +56,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>  		struct mempolicy *mpol, pgoff_t ilx);
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
>  			      struct vm_fault *vmf);
> +struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> +			    struct vm_fault *vmf);
>  
>  static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
> @@ -86,6 +88,12 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
>  	return NULL;
>  }
>  
> +struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> +			struct vm_fault *vmf)
> +{
> +	return NULL;
> +}
> +
>  static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
>  			struct vm_fault *vmf)
>  {
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index e671266ad772..24cb93ed5081 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -861,6 +861,53 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>  	return folio;
>  }
>  
> +/**
> + * swapin_direct - swap in folios skipping swap cache and readahead

swap in a folio ...

> + * @entry: swap entry of this memory
> + * @gfp_mask: memory allocation flags
> + * @vmf: fault information
> + *
> + * Returns the struct folio for entry and addr after the swap entry is read
> + * in.
> + */
> +struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> +			    struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct folio *folio;
> +	void *shadow = NULL;
> +
> +	/* skip swapcache */
> +	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,

You passed gfp_mask in, but don't use it.

> +				vma, vmf->address, false);
> +	if (folio) {
> +		__folio_set_locked(folio);
> +		__folio_set_swapbacked(folio);
> +
> +		if (mem_cgroup_swapin_charge_folio(folio,
> +					vma->vm_mm, GFP_KERNEL,
> +					entry)) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			return NULL;
> +		}
> +		mem_cgroup_swapin_uncharge_swap(entry);
> +
> +		shadow = get_shadow_from_swap_cache(entry);
> +		if (shadow)
> +			workingset_refault(folio, shadow);
> +
> +		folio_add_lru(folio);
> +
> +		/* To provide entry to swap_read_folio() */
> +		folio->swap = entry;
> +		swap_read_folio(folio, true, NULL);
> +		folio->private = NULL;
> +	}
> +
> +	return folio;
> +}
> +
>  /**
>   * swapin_readahead - swap in pages in hope we need them soon
>   * @entry: swap entry of this memory

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 3/9] mm/swap: avoid doing extra unlock error checks for direct swapin
  2024-01-02 17:53 ` [PATCH v2 3/9] mm/swap: avoid doing extra unlock error checks for direct swapin Kairui Song
@ 2024-01-04  8:10   ` Huang, Ying
  2024-01-09  9:38     ` Kairui Song
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Ying @ 2024-01-04  8:10 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> When swapping in a page, mem_cgroup_swapin_charge_folio is called for
> the newly allocated folio; nothing else is referencing the folio, so there
> is no need to set the lock bit early. This avoids doing extra unlock
> checks on the error path.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap_state.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 24cb93ed5081..6130de8d5226 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -881,16 +881,15 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>  	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>  				vma, vmf->address, false);
>  	if (folio) {
> -		__folio_set_locked(folio);
> -		__folio_set_swapbacked(folio);
> -
> -		if (mem_cgroup_swapin_charge_folio(folio,
> -					vma->vm_mm, GFP_KERNEL,
> -					entry)) {
> -			folio_unlock(folio);
> +		if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> +						   GFP_KERNEL, entry)) {
>  			folio_put(folio);
>  			return NULL;
>  		}
> +
> +		__folio_set_locked(folio);
> +		__folio_set_swapbacked(folio);
> +
>  		mem_cgroup_swapin_uncharge_swap(entry);
>  
>  		shadow = get_shadow_from_swap_cache(entry);

I don't find any issue with the patch.  But another caller of
mem_cgroup_swapin_charge_folio() in __read_swap_cache_async() sets up
the newly allocated folio in the same way this code did before the change.
Better to keep them the same?  The benefit of the change is small anyway.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 4/9] mm/swap: always account swapped in page into current memcg
  2024-01-02 17:53 ` [PATCH v2 4/9] mm/swap: always account swapped in page into current memcg Kairui Song
@ 2024-01-05  7:14   ` Huang, Ying
  2024-01-05  7:33     ` Kairui Song
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Ying @ 2024-01-05  7:14 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Currently, mem_cgroup_swapin_charge_folio is always called with
> mm argument as NULL, except in swapin_direct.
>
> swapin_direct is used when swapin should skip readahead and
> swapcache (SWP_SYNCHRONOUS_IO). Other caller paths of
> mem_cgroup_swapin_charge_folio are for swapin that should
> not skip readahead and cache.
>
> This could cause swapin charging to behave differently depending
> on swap device. This currently didn't happen because the only call
> path of swapin_direct is the direct anon page fault path, where mm
> equals to current->mm, but will no longer be true if swapin_direct
> is shared and have other callers (eg, swapoff).
>
> So make swapin_direct also passes NULL for mm, no feature change.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap_state.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 6130de8d5226..d39c5369da21 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -881,7 +881,7 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>  	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>  				vma, vmf->address, false);
>  	if (folio) {
> -		if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> +		if (mem_cgroup_swapin_charge_folio(folio, NULL,
>  						   GFP_KERNEL, entry)) {
>  			folio_put(folio);
>  			return NULL;

I think: why not provide "mm" when it's available?  For
swapin_direct() called by do_swap_page(), mm can be provided, while
for swapin_direct() called by shmem swapin, mm will be NULL.  We can
even provide "mm" for __read_swap_cache_async() for VMA based swapin and
for the fault address for cluster swapin.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 5/9] mm/swap: introduce swapin_entry for unified readahead policy
  2024-01-02 17:53 ` [PATCH v2 5/9] mm/swap: introduce swapin_entry for unified readahead policy Kairui Song
@ 2024-01-05  7:28   ` Huang, Ying
  2024-01-10  2:42     ` Kairui Song
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Ying @ 2024-01-05  7:28 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Introduce swapin_entry which merges swapin_readahead and swapin_direct
> making it the main entry for swapin pages, and use a unified swapin
> policy.
>
> This commit makes swapoff make use of this new helper and now swapping
> off a 10G ZRAM (lzo-rle) is faster since readahead is skipped.
>
> Before:
> time swapoff /dev/zram0
> real    0m12.337s
> user    0m0.001s
> sys     0m12.329s
>
> After:
> time swapoff /dev/zram0
> real    0m9.728s
> user    0m0.001s
> sys     0m9.719s
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c     | 21 +++++++--------------
>  mm/swap.h       | 16 ++++------------
>  mm/swap_state.c | 49 +++++++++++++++++++++++++++++++++----------------
>  mm/swapfile.c   |  7 ++-----
>  4 files changed, 46 insertions(+), 47 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0165c8cad489..b56254a875f8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3801,6 +3801,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	rmap_t rmap_flags = RMAP_NONE;
>  	bool exclusive = false;
>  	swp_entry_t entry;
> +	bool swapcached;
>  	pte_t pte;
>  	vm_fault_t ret = 0;
>  
> @@ -3864,21 +3865,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	swapcache = folio;
>  
>  	if (!folio) {
> -		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> -		    __swap_count(entry) == 1) {
> -			/* skip swapcache and readahead */
> -			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
> -			if (folio)
> -				page = &folio->page;
> +		folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> +				     vmf, &swapcached);
> +		if (folio) {
> +			page = folio_file_page(folio, swp_offset(entry));
> +			if (swapcached)
> +				swapcache = folio;
>  		} else {
> -			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -						vmf);
> -			if (page)
> -				folio = page_folio(page);
> -			swapcache = folio;
> -		}
> -
> -		if (!folio) {
>  			/*
>  			 * Back out if somebody else faulted in this pte
>  			 * while we released the pte lock.
> diff --git a/mm/swap.h b/mm/swap.h
> index 83eab7b67e77..502a2801f817 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -54,10 +54,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
>  		bool skip_if_exists);
>  struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>  		struct mempolicy *mpol, pgoff_t ilx);
> -struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> -			      struct vm_fault *vmf);
> -struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> -			    struct vm_fault *vmf);
> +struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
> +			    struct vm_fault *vmf, bool *swapcached);
>  
>  static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
> @@ -88,14 +86,8 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
>  	return NULL;
>  }
>  
> -struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> -			struct vm_fault *vmf)
> -{
> -	return NULL;
> -}
> -
> -static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> -			struct vm_fault *vmf)
> +static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
> +			struct vm_fault *vmf, bool *swapcached)
>  {
>  	return NULL;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index d39c5369da21..66ff187aa5d3 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -316,6 +316,11 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
>  	release_pages(pages, nr);
>  }
>  
> +static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
> +{
> +	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> +}
> +

It appears that there's only one caller of the function in the same
file?  Why add a function?

>  static inline bool swap_use_vma_readahead(void)
>  {
>  	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> @@ -870,8 +875,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>   * Returns the struct folio for entry and addr after the swap entry is read
>   * in.
>   */
> -struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> -			    struct vm_fault *vmf)
> +static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> +				  struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct folio *folio;
> @@ -908,33 +913,45 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>  }
>  
>  /**
> - * swapin_readahead - swap in pages in hope we need them soon
> + * swapin_entry - swap in a page from swap entry
>   * @entry: swap entry of this memory
>   * @gfp_mask: memory allocation flags
>   * @vmf: fault information
> + * @swapcached: pointer to a bool used as indicator if the
> + *              page is swapped in through swapcache.
>   *
>   * Returns the struct page for entry and addr, after queueing swapin.
>   *
> - * It's a main entry function for swap readahead. By the configuration,
> + * It's a main entry function for swap in. By the configuration,
>   * it will read ahead blocks by cluster-based(ie, physical disk based)
> - * or vma-based(ie, virtual address based on faulty address) readahead.
> + * or vma-based(ie, virtual address based on faulty address) readahead,
> + * or skip the readahead (ie, ramdisk based swap device).
>   */
> -struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> -				struct vm_fault *vmf)
> +struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> +			   struct vm_fault *vmf, bool *swapcached)

It may be better to use

struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
			   struct vm_fault *vmf, struct folio **swapcache)

In this way, we can reduce the number of source lines in the caller (a
sketch of the resulting call site follows at the end of this message).

>  {
>  	struct mempolicy *mpol;
> -	pgoff_t ilx;
>  	struct folio *folio;
> +	pgoff_t ilx;
> +	bool cached;
>  
> -	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> -	folio = swap_use_vma_readahead() ?
> -		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
> -		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> -	mpol_cond_put(mpol);
> +	if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> +		folio = swapin_direct(entry, gfp_mask, vmf);
> +		cached = false;
> +	} else {
> +		mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> +		if (swap_use_vma_readahead())
> +			folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> +		else
> +			folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> +		mpol_cond_put(mpol);
> +		cached = true;
> +	}
>  
> -	if (!folio)
> -		return NULL;
> -	return folio_file_page(folio, swp_offset(entry));
> +	if (swapcached)
> +		*swapcached = cached;
> +
> +	return folio;
>  }
>  
>  #ifdef CONFIG_SYSFS
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f7271504aa0a..ce4e6c10dce7 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1866,7 +1866,6 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  
>  		folio = swap_cache_get_folio(entry, vma, addr);
>  		if (!folio) {
> -			struct page *page;
>  			struct vm_fault vmf = {
>  				.vma = vma,
>  				.address = addr,
> @@ -1874,10 +1873,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				.pmd = pmd,
>  			};
>  
> -			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -						&vmf);
> -			if (page)
> -				folio = page_folio(page);
> +			folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> +					    &vmf, NULL);
>  		}
>  		if (!folio) {
>  			/*

--
Best Regards,
Huang, Ying
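
A sketch of what the do_swap_page() call site could look like with the
suggested "struct folio **swapcache" parameter (an assumption about the
suggestion, not code from this series):

	/*
	 * swapin_entry() fills *swapcache only when the folio came from or
	 * through the swap cache, so the separate "swapcached" bool and the
	 * extra assignment in the caller go away.
	 */
	folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE, vmf, &swapcache);
	if (folio)
		page = folio_file_page(folio, swp_offset(entry));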

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 4/9] mm/swap: always account swapped in page into current memcg
  2024-01-05  7:14   ` Huang, Ying
@ 2024-01-05  7:33     ` Kairui Song
  2024-01-08  7:44       ` Huang, Ying
  0 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-05  7:33 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

Huang, Ying <ying.huang@intel.com> wrote on Fri, Jan 5, 2024 at 15:16:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently, mem_cgroup_swapin_charge_folio is always called with
> > mm argument as NULL, except in swapin_direct.
> >
> > swapin_direct is used when swapin should skip readahead and
> > swapcache (SWP_SYNCHRONOUS_IO). Other caller paths of
> > mem_cgroup_swapin_charge_folio are for swapin that should
> > not skip readahead and cache.
> >
> > This could cause swapin charging to behave differently depending
> > on swap device. This currently didn't happen because the only call
> > path of swapin_direct is the direct anon page fault path, where mm
> > equals to current->mm, but will no longer be true if swapin_direct
> > is shared and have other callers (eg, swapoff).
> >
> > So make swapin_direct also passes NULL for mm, no feature change.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swap_state.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 6130de8d5226..d39c5369da21 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -881,7 +881,7 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> >       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >                               vma, vmf->address, false);
> >       if (folio) {
> > -             if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> > +             if (mem_cgroup_swapin_charge_folio(folio, NULL,
> >                                                  GFP_KERNEL, entry)) {
> >                       folio_put(folio);
> >                       return NULL;
>
> I think that why not provide "mm" when it's available?  For
> swapin_direct() called by do_swap_page(), mm can be provided.  While,
> for swapin_direct() called by shmem swapin, mm will be NULL.  We can
> even provide "mm" for __read_swap_cache_async() for VMA based swapin and
> for the fault address for cluster swapin.

Hi, Ying.

Thanks for the comment.

Without modifying too much code, providing mm here will change swapin
charge behaviour on swapoff; we discussed it previously:
https://lkml.org/lkml/2023/11/19/320

If we want to avoid the behavior change, we have to extend all direct
and indirect callers of mem_cgroup_swapin_charge_folio to accept a
standalone mm argument (including swapin_direct, swap_vma_readahead,
swap_cluster_readahead, __read_swap_cache_async, swapin_entry_mpol,
read_swap_cache_async, and some other paths that may need more audit), and
sort things out case by case. I'm not sure if that's a good idea...

Simply dropping it here seems the easiest way to avoid such a change.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 2/9] mm/swap: move no readahead swapin code to a stand-alone helper
  2024-01-04  7:28   ` Huang, Ying
@ 2024-01-05  7:43     ` Kairui Song
  0 siblings, 0 replies; 38+ messages in thread
From: Kairui Song @ 2024-01-05  7:43 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

Huang, Ying <ying.huang@intel.com> wrote on Thu, Jan 4, 2024 at 15:31:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > No feature change, simply move the routine to a standalone function to
> > be re-used later. The error path handling is copied from the "out_page"
> > label, to make the code change minimized for easier reviewing.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/memory.c     | 32 ++++----------------------------
> >  mm/swap.h       |  8 ++++++++
> >  mm/swap_state.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 59 insertions(+), 28 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index a0a50d3754f0..0165c8cad489 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3803,7 +3803,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       swp_entry_t entry;
> >       pte_t pte;
> >       vm_fault_t ret = 0;
> > -     void *shadow = NULL;
> >
> >       if (!pte_unmap_same(vmf))
> >               goto out;
> > @@ -3867,33 +3866,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       if (!folio) {
> >               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >                   __swap_count(entry) == 1) {
> > -                     /* skip swapcache */
> > -                     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -                                             vma, vmf->address, false);
> > -                     page = &folio->page;
> > -                     if (folio) {
> > -                             __folio_set_locked(folio);
> > -                             __folio_set_swapbacked(folio);
> > -
> > -                             if (mem_cgroup_swapin_charge_folio(folio,
> > -                                                     vma->vm_mm, GFP_KERNEL,
> > -                                                     entry)) {
> > -                                     ret = VM_FAULT_OOM;
> > -                                     goto out_page;
> > -                             }
> > -                             mem_cgroup_swapin_uncharge_swap(entry);
> > -
> > -                             shadow = get_shadow_from_swap_cache(entry);
> > -                             if (shadow)
> > -                                     workingset_refault(folio, shadow);
> > -
> > -                             folio_add_lru(folio);
> > -
> > -                             /* To provide entry to swap_read_folio() */
> > -                             folio->swap = entry;
> > -                             swap_read_folio(folio, true, NULL);
> > -                             folio->private = NULL;
> > -                     }
> > +                     /* skip swapcache and readahead */
> > +                     folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
> > +                     if (folio)
> > +                             page = &folio->page;
> >               } else {
> >                       page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> >                                               vmf);
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 758c46ca671e..83eab7b67e77 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -56,6 +56,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> >               struct mempolicy *mpol, pgoff_t ilx);
> >  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> >                             struct vm_fault *vmf);
> > +struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> > +                         struct vm_fault *vmf);
> >
> >  static inline unsigned int folio_swap_flags(struct folio *folio)
> >  {
> > @@ -86,6 +88,12 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
> >       return NULL;
> >  }
> >
> > +struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> > +                     struct vm_fault *vmf)
> > +{
> > +     return NULL;
> > +}
> > +
> >  static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> >                       struct vm_fault *vmf)
> >  {
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index e671266ad772..24cb93ed5081 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -861,6 +861,53 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> >       return folio;
> >  }
> >
> > +/**
> > + * swapin_direct - swap in folios skipping swap cache and readahead
>
> swap in a folio ...
>
> > + * @entry: swap entry of this memory
> > + * @gfp_mask: memory allocation flags
> > + * @vmf: fault information
> > + *
> > + * Returns the struct folio for entry and addr after the swap entry is read
> > + * in.
> > + */
> > +struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> > +                         struct vm_fault *vmf)
> > +{
> > +     struct vm_area_struct *vma = vmf->vma;
> > +     struct folio *folio;
> > +     void *shadow = NULL;
> > +
> > +     /* skip swapcache */
> > +     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>
> You passed gfp_mask in, but don't use it.
>

Thanks, will fix these.

> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 4/9] mm/swap: always account swapped in page into current memcg
  2024-01-05  7:33     ` Kairui Song
@ 2024-01-08  7:44       ` Huang, Ying
  2024-01-09  9:42         ` Kairui Song
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Ying @ 2024-01-08  7:44 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> Huang, Ying <ying.huang@intel.com> wrote on Fri, Jan 5, 2024 at 15:16:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > Currently, mem_cgroup_swapin_charge_folio is always called with
>> > mm argument as NULL, except in swapin_direct.
>> >
>> > swapin_direct is used when swapin should skip readahead and
>> > swapcache (SWP_SYNCHRONOUS_IO). Other caller paths of
>> > mem_cgroup_swapin_charge_folio are for swapin that should
>> > not skip readahead and cache.
>> >
>> > This could cause swapin charging to behave differently depending
>> > on swap device. This currently didn't happen because the only call
>> > path of swapin_direct is the direct anon page fault path, where mm
>> > equals to current->mm, but will no longer be true if swapin_direct
>> > is shared and have other callers (eg, swapoff).
>> >
>> > So make swapin_direct also passes NULL for mm, no feature change.
>> >
>> > Signed-off-by: Kairui Song <kasong@tencent.com>
>> > ---
>> >  mm/swap_state.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> > index 6130de8d5226..d39c5369da21 100644
>> > --- a/mm/swap_state.c
>> > +++ b/mm/swap_state.c
>> > @@ -881,7 +881,7 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>> >       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>> >                               vma, vmf->address, false);
>> >       if (folio) {
>> > -             if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
>> > +             if (mem_cgroup_swapin_charge_folio(folio, NULL,
>> >                                                  GFP_KERNEL, entry)) {
>> >                       folio_put(folio);
>> >                       return NULL;
>>
>> I think that why not provide "mm" when it's available?  For
>> swapin_direct() called by do_swap_page(), mm can be provided.  While,
>> for swapin_direct() called by shmem swapin, mm will be NULL.  We can
>> even provide "mm" for __read_swap_cache_async() for VMA based swapin and
>> for the fault address for cluster swapin.
>
> Hi, Ying.
>
> Thanks for the comment.
>
> Without modifying too much code, providing mm here will change swapin
> charge behaviour on swapoff, we discussed it previously:
> https://lkml.org/lkml/2023/11/19/320

It's better to use "lore" for kernel email links, for example,

https://lore.kernel.org/lkml/20231119194740.94101-24-ryncsn@gmail.com/

This is more readable than the above link.

> If we want to avoid the behavior change, we have to extend all direct
> and indirect callers of mem_cgroup_swapin_charge_folio to accept a
> standalone mm argument (including swapin_direct, swap_vma_readahead,
> swap_cluster_readahead, __read_swap_cache_async, swapin_entry_mpol,
> read_swap_cache_async, and some other path may need more audit), and
> sort things our case by case. I'm not sure if that's a good idea...
>
> Simply dropping it here seems the easiest way to avoid such change.

OK.  Didn't realize that they are related directly.  It appears that we
always pass NULL as mm to mem_cgroup_swapin_charge_folio().  If so,
please just remove the parameter.
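
If every remaining caller really does pass NULL, the cleanup could be as
simple as dropping the argument from the prototype; a sketch, assuming no
caller still needs the mm:

	/* Before: an mm is accepted but, per the discussion above, always NULL */
	int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
					   gfp_t gfp, swp_entry_t entry);

	/* After: the unused mm argument is dropped */
	int mem_cgroup_swapin_charge_folio(struct folio *folio, gfp_t gfp,
					   swp_entry_t entry);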

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 6/9] mm/swap: handle swapcache lookup in swapin_entry
  2024-01-02 17:53 ` [PATCH v2 6/9] mm/swap: handle swapcache lookup in swapin_entry Kairui Song
@ 2024-01-08  8:26   ` Huang, Ying
  2024-01-10  2:53     ` Kairui Song
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Ying @ 2024-01-08  8:26 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Since all callers of swapin_entry need to check the swap cache first, we
> can merge this common routine into swapin_entry, so it can be shared and
> optimized later.
>
> Also introduce an enum to better represent possible swap cache usage, and
> add some comments about it, making the usage of swap cache easier to
> understand.

I don't find any benefit in doing this.  The number of code lines isn't
reduced.  The concept of swap cache isn't hidden either.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault
  2024-01-02 17:53 ` [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault Kairui Song
@ 2024-01-09  1:08   ` Huang, Ying
  2024-01-10  3:32     ` Kairui Song
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Ying @ 2024-01-09  1:08 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> There are two places where swapin is not caused by direct anon page fault:
> - shmem swapin, invoked indirectly through shmem mapping
> - swapoff
>
> They used to construct a pseudo vmfault struct for the swapin function.
> Shmem has dropped the pseudo vmfault recently in commit ddc1a5cbc05d
> ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). The swapoff
> path is still using one.
>
> Introduce a helper for them both; this helps save stack usage for the
> swapoff path, and helps apply a unified swapin cache and readahead policy
> check.
>
> Due to the missing vmfault info, the caller has to pass in the mempolicy
> explicitly; make it different from swapin_entry and name it
> swapin_entry_mpol.
>
> This commit converts swapoff to use this helper; follow-up commits will
> convert shmem to use it too.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap.h       |  9 +++++++++
>  mm/swap_state.c | 40 ++++++++++++++++++++++++++++++++--------
>  mm/swapfile.c   | 15 ++++++---------
>  3 files changed, 47 insertions(+), 17 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 9180411afcfe..8f790a67b948 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -73,6 +73,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>  		struct mempolicy *mpol, pgoff_t ilx);
>  struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
>  			    struct vm_fault *vmf, enum swap_cache_result *result);
> +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> +				struct mempolicy *mpol, pgoff_t ilx,
> +				enum swap_cache_result *result);
>  
>  static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
> @@ -109,6 +112,12 @@ static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
>  	return NULL;
>  }
>  
> +static inline struct page *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> +		struct mempolicy *mpol, pgoff_t ilx, enum swap_cache_result *result)
> +{
> +	return NULL;
> +}
> +
>  static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>  {
>  	return 0;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 21badd4f0fc7..3edf4b63158d 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -880,14 +880,13 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>   * in.
>   */
>  static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> -				  struct vm_fault *vmf, void *shadow)
> +				   struct mempolicy *mpol, pgoff_t ilx,
> +				   void *shadow)
>  {
> -	struct vm_area_struct *vma = vmf->vma;
>  	struct folio *folio;
>  
> -	/* skip swapcache */
> -	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -				vma, vmf->address, false);
> +	folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0,
> +			mpol, ilx, numa_node_id());
>  	if (folio) {
>  		if (mem_cgroup_swapin_charge_folio(folio, NULL,
>  						   GFP_KERNEL, entry)) {
> @@ -943,18 +942,18 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
>  		goto done;
>  	}
>  
> +	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>  	if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> -		folio = swapin_direct(entry, gfp_mask, vmf, shadow);
> +		folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
>  		cache_result = SWAP_CACHE_BYPASS;
>  	} else {
> -		mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>  		if (swap_use_vma_readahead())
>  			folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
>  		else
>  			folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> -		mpol_cond_put(mpol);
>  		cache_result = SWAP_CACHE_MISS;
>  	}
> +	mpol_cond_put(mpol);
>  done:
>  	if (result)
>  		*result = cache_result;
> @@ -962,6 +961,31 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
>  	return folio;
>  }
>  
> +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> +				struct mempolicy *mpol, pgoff_t ilx,
> +				enum swap_cache_result *result)
> +{
> +	enum swap_cache_result cache_result;
> +	void *shadow = NULL;
> +	struct folio *folio;
> +
> +	folio = swap_cache_get_folio(entry, NULL, 0, &shadow);
> +	if (folio) {
> +		cache_result = SWAP_CACHE_HIT;
> +	} else if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> +		folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
> +		cache_result = SWAP_CACHE_BYPASS;
> +	} else {
> +		folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> +		cache_result = SWAP_CACHE_MISS;
> +	}
> +
> +	if (result)
> +		*result = cache_result;
> +
> +	return folio;
> +}
> +
>  #ifdef CONFIG_SYSFS
>  static ssize_t vma_ra_enabled_show(struct kobject *kobj,
>  				     struct kobj_attribute *attr, char *buf)
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5aa44de11edc..2f77bf143af8 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1840,18 +1840,13 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	do {
>  		struct folio *folio;
>  		unsigned long offset;
> +		struct mempolicy *mpol;
>  		unsigned char swp_count;
>  		swp_entry_t entry;
> +		pgoff_t ilx;
>  		int ret;
>  		pte_t ptent;
>  
> -		struct vm_fault vmf = {
> -			.vma = vma,
> -			.address = addr,
> -			.real_address = addr,
> -			.pmd = pmd,
> -		};
> -
>  		if (!pte++) {
>  			pte = pte_offset_map(pmd, addr);
>  			if (!pte)
> @@ -1871,8 +1866,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  		pte_unmap(pte);
>  		pte = NULL;
>  
> -		folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> -				     &vmf, NULL);
> +		mpol = get_vma_policy(vma, addr, 0, &ilx);
> +		folio = swapin_entry_mpol(entry, GFP_HIGHUSER_MOVABLE,
> +					  mpol, ilx, NULL);
> +		mpol_cond_put(mpol);
>  		if (!folio) {
>  			/*
>  			 * The entry could have been freed, and will not

IIUC, after the change, we will always use cluster readahead for
swapoff.  This may be OK.  But at least we need some test results which
show that this behavior change will not cause any issues.  And
the behavior change should be described explicitly in the patch description.

And I don't think it's a good abstraction to make swapin_entry_mpol()
always use cluster swapin, while swapin_entry() will try to use vma
swapin.  I think we can add "struct mempolicy *mpol" and "pgoff_t ilx"
to swapin_entry() as parameters, and use them if vmf == NULL (a rough
sketch follows below).  If we want to enforce cluster swapin in the
swapoff path, it would be better to add some comments describing why.
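
A rough sketch of that suggestion, taking swapin_entry() from this series
as the base (an assumption about how the merge could look, not code from a
posted patch):

	struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
				   struct vm_fault *vmf, struct mempolicy *mpol,
				   pgoff_t ilx, enum swap_cache_result *result)
	{
		enum swap_cache_result cache_result;
		void *shadow = NULL;
		struct folio *folio;

		folio = swap_cache_get_folio(entry, vmf ? vmf->vma : NULL,
					     vmf ? vmf->address : 0, &shadow);
		if (folio) {
			cache_result = SWAP_CACHE_HIT;
			goto done;
		}

		/* Without a fault context, the caller-supplied policy is used as-is */
		if (vmf)
			mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);

		if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
			folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
			cache_result = SWAP_CACHE_BYPASS;
		} else if (vmf && swap_use_vma_readahead()) {
			folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
			cache_result = SWAP_CACHE_MISS;
		} else {
			folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
			cache_result = SWAP_CACHE_MISS;
		}

		if (vmf)
			mpol_cond_put(mpol);
	done:
		if (result)
			*result = cache_result;
		return folio;
	}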

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally
  2024-01-02 17:53 ` [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally Kairui Song
  2024-01-03 11:56   ` kernel test robot
  2024-01-03 13:45   ` kernel test robot
@ 2024-01-09  2:03   ` Huang, Ying
  2024-01-10  3:35     ` Kairui Song
  2 siblings, 1 reply; 38+ messages in thread
From: Huang, Ying @ 2024-01-09  2:03 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Andrew Morton, Chris Li, Hugh Dickins,
	Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
	David Hildenbrand, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Currently, shmem uses cluster readahead for all swap backends. Cluster
> readahead is not a good solution for ramdisk based device (ZRAM) at all.
>
> After switching to the new helper, most benchmarks showed a good result:
>
> - Single file sequence read:
>   perf stat --repeat 20 dd if=/tmpfs/test of=/dev/null bs=1M count=8192
>   (/tmpfs/test is a zero filled file, using brd as swap, 4G memcg limit)
>   Before: 22.248 +- 0.549
>   After:  22.021 +- 0.684 (-1.1%)
>
> - Random read stress test:
>   fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
>   --size=256m --ioengine=mmap --rw=randread --random_distribution=random \
>   --time_based --ramp_time=1m --runtime=5m --group_reporting
>   (using brd as swap, 2G memcg limit)
>
>   Before: 1818MiB/s
>   After:  1888MiB/s (+3.85%)
>
> - Zipf biased random read stress test:
>   fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
>   --size=256m --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
>   --time_based --ramp_time=1m --runtime=5m --group_reporting
>   (using brd as swap, 2G memcg limit)
>
>   Before: 31.1GiB/s
>   After:  32.3GiB/s (+3.86%)
>
> So cluster readahead doesn't help much even for single sequence read,
> and for random stress test, the performance is better without it.
>
> Considering both memory and the swap device will slowly get more
> fragmented, and the commonly used ZRAM consumes much more CPU than a
> plain ramdisk, false readahead could occur more frequently and waste
> more CPU. Direct swapin is cheaper, so use the new helper and skip
> readahead for SWP_SYNCHRONOUS_IO devices.

It's good to take advantage of swapin_direct (no readahead).  I also hope
we can take advantage of VMA based swapin if shmem is accessed via mmap.
That appears possible.

--
Best Regards,
Huang, Ying

> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/shmem.c      | 67 +++++++++++++++++++++++++------------------------
>  mm/swap.h       |  9 -------
>  mm/swap_state.c | 11 ++++++--
>  3 files changed, 43 insertions(+), 44 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9da9f7a0e620..3c0729fe934d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1564,20 +1564,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
>  static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
>  			pgoff_t index, unsigned int order, pgoff_t *ilx);
>  
> -static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
> -			struct shmem_inode_info *info, pgoff_t index)
> -{
> -	struct mempolicy *mpol;
> -	pgoff_t ilx;
> -	struct folio *folio;
> -
> -	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> -	folio = swap_cluster_readahead(swap, gfp, mpol, ilx);
> -	mpol_cond_put(mpol);
> -
> -	return folio;
> -}
> -
>  /*
>   * Make sure huge_gfp is always more limited than limit_gfp.
>   * Some of the flags set permissions, while others set limitations.
> @@ -1851,9 +1837,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  {
>  	struct address_space *mapping = inode->i_mapping;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
> +	enum swap_cache_result cache_result;
>  	struct swap_info_struct *si;
>  	struct folio *folio = NULL;
> +	struct mempolicy *mpol;
>  	swp_entry_t swap;
> +	pgoff_t ilx;
>  	int error;
>  
>  	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> @@ -1871,36 +1860,40 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  			return -EINVAL;
>  	}
>  
> -	/* Look it up and read it in.. */
> -	folio = swap_cache_get_folio(swap, NULL, 0, NULL);
> +	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> +	folio = swapin_entry_mpol(swap, gfp, mpol, ilx, &cache_result);
> +	mpol_cond_put(mpol);
> +
>  	if (!folio) {
> -		/* Or update major stats only when swapin succeeds?? */
> +		error = -ENOMEM;
> +		goto failed;
> +	}
> +	if (cache_result != SWAP_CACHE_HIT) {
>  		if (fault_type) {
>  			*fault_type |= VM_FAULT_MAJOR;
>  			count_vm_event(PGMAJFAULT);
>  			count_memcg_event_mm(fault_mm, PGMAJFAULT);
>  		}
> -		/* Here we actually start the io */
> -		folio = shmem_swapin_cluster(swap, gfp, info, index);
> -		if (!folio) {
> -			error = -ENOMEM;
> -			goto failed;
> -		}
>  	}
>  
>  	/* We have to do this with folio locked to prevent races */
>  	folio_lock(folio);
> -	if (!folio_test_swapcache(folio) ||
> -	    folio->swap.val != swap.val ||
> -	    !shmem_confirm_swap(mapping, index, swap)) {
> +	if (cache_result != SWAP_CACHE_BYPASS) {
> +		/* With cache bypass, folio is new allocated, sync, and not in cache */
> +		if (!folio_test_swapcache(folio) || folio->swap.val != swap.val) {
> +			error = -EEXIST;
> +			goto unlock;
> +		}
> +		if (!folio_test_uptodate(folio)) {
> +			error = -EIO;
> +			goto failed;
> +		}
> +		folio_wait_writeback(folio);
> +	}
> +	if (!shmem_confirm_swap(mapping, index, swap)) {
>  		error = -EEXIST;
>  		goto unlock;
>  	}
> -	if (!folio_test_uptodate(folio)) {
> -		error = -EIO;
> -		goto failed;
> -	}
> -	folio_wait_writeback(folio);
>  
>  	/*
>  	 * Some architectures may have to restore extra metadata to the
> @@ -1908,12 +1901,19 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	 */
>  	arch_swap_restore(swap, folio);
>  
> -	if (shmem_should_replace_folio(folio, gfp)) {
> +	/* With cache bypass, folio is new allocated and always respect gfp flags */
> +	if (cache_result != SWAP_CACHE_BYPASS && shmem_should_replace_folio(folio, gfp)) {
>  		error = shmem_replace_folio(&folio, gfp, info, index);
>  		if (error)
>  			goto failed;
>  	}
>  
> +	/*
> +	 * The expected value checking below should be enough to ensure
> +	 * only one up-to-date swapin success. swap_free() is called after
> +	 * this, so the entry can't be reused. As long as the mapping still
> +	 * has the old entry value, it's never swapped in or modified.
> +	 */
>  	error = shmem_add_to_page_cache(folio, mapping, index,
>  					swp_to_radix_entry(swap), gfp);
>  	if (error)
> @@ -1924,7 +1924,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	if (sgp == SGP_WRITE)
>  		folio_mark_accessed(folio);
>  
> -	delete_from_swap_cache(folio);
> +	if (cache_result != SWAP_CACHE_BYPASS)
> +		delete_from_swap_cache(folio);
>  	folio_mark_dirty(folio);
>  	swap_free(swap);
>  	put_swap_device(si);
> diff --git a/mm/swap.h b/mm/swap.h
> index 8f790a67b948..20f4048c971c 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -57,9 +57,6 @@ void __delete_from_swap_cache(struct folio *folio,
>  void delete_from_swap_cache(struct folio *folio);
>  void clear_shadow_from_swap_cache(int type, unsigned long begin,
>  				  unsigned long end);
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> -		struct vm_area_struct *vma, unsigned long addr,
> -		void **shadowp);
>  struct folio *filemap_get_incore_folio(struct address_space *mapping,
>  		pgoff_t index);
>  
> @@ -123,12 +120,6 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>  	return 0;
>  }
>  
> -static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> -		struct vm_area_struct *vma, unsigned long addr)
> -{
> -	return NULL;
> -}
> -
>  static inline
>  struct folio *filemap_get_incore_folio(struct address_space *mapping,
>  		pgoff_t index)
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3edf4b63158d..10eec68475dd 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -318,7 +318,14 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
>  
>  static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
>  {
> -	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> +	int count;
> +
> +	if (!data_race(si->flags & SWP_SYNCHRONOUS_IO))
> +		return false;
> +
> +	count = __swap_count(entry);
> +
> +	return (count == 1 || count == SWAP_MAP_SHMEM);
>  }
>  
>  static inline bool swap_use_vma_readahead(void)
> @@ -334,7 +341,7 @@ static inline bool swap_use_vma_readahead(void)
>   *
>   * Caller must lock the swap device or hold a reference to keep it valid.
>   */
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> +static struct folio *swap_cache_get_folio(swp_entry_t entry,
>  		struct vm_area_struct *vma, unsigned long addr, void **shadowp)
>  {
>  	struct folio *folio;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 3/9] mm/swap: avoid doing extra unlock error checks for direct swapin
  2024-01-04  8:10   ` Huang, Ying
@ 2024-01-09  9:38     ` Kairui Song
  0 siblings, 0 replies; 38+ messages in thread
From: Kairui Song @ 2024-01-09  9:38 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Thu, Jan 4, 2024 at 4:12 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > When swapping in a page, mem_cgroup_swapin_charge_folio is called for
> > new allocated folio, nothing else is referencing the folio so no need
> > to set the lock bit early. This avoided doing extra unlock checks
> > on the error path.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swap_state.c | 13 ++++++-------
> >  1 file changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 24cb93ed5081..6130de8d5226 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -881,16 +881,15 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> >       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >                               vma, vmf->address, false);
> >       if (folio) {
> > -             __folio_set_locked(folio);
> > -             __folio_set_swapbacked(folio);
> > -
> > -             if (mem_cgroup_swapin_charge_folio(folio,
> > -                                     vma->vm_mm, GFP_KERNEL,
> > -                                     entry)) {
> > -                     folio_unlock(folio);
> > +             if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> > +                                                GFP_KERNEL, entry)) {
> >                       folio_put(folio);
> >                       return NULL;
> >               }
> > +
> > +             __folio_set_locked(folio);
> > +             __folio_set_swapbacked(folio);
> > +
> >               mem_cgroup_swapin_uncharge_swap(entry);
> >
> >               shadow = get_shadow_from_swap_cache(entry);
>
> I don't find any issue with the patch.  But another caller of
> mem_cgroup_swapin_charge_folio() in __read_swap_cache_async() setups
> newly allocated folio in the same way before the change.  Better to keep
> them same?  Because the benefit of change is small too.

OK, this is just a trivial optimization, I can drop it.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 4/9] mm/swap: always account swapped in page into current memcg
  2024-01-08  7:44       ` Huang, Ying
@ 2024-01-09  9:42         ` Kairui Song
  0 siblings, 0 replies; 38+ messages in thread
From: Kairui Song @ 2024-01-09  9:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Mon, Jan 8, 2024 at 3:46 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Fri, Jan 5, 2024 at 3:16 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Kairui Song <ryncsn@gmail.com> writes:
> >>
> >> > From: Kairui Song <kasong@tencent.com>
> >> >
> >> > Currently, mem_cgroup_swapin_charge_folio is always called with
> >> > mm argument as NULL, except in swapin_direct.
> >> >
> >> > swapin_direct is used when swapin should skip readahead and
> >> > swapcache (SWP_SYNCHRONOUS_IO). Other caller paths of
> >> > mem_cgroup_swapin_charge_folio are for swapin that should
> >> > not skip readahead and cache.
> >> >
> >> > This could cause swapin charging to behave differently depending
> >> > on swap device. This currently didn't happen because the only call
> >> > path of swapin_direct is the direct anon page fault path, where mm
> >> > equals to current->mm, but will no longer be true if swapin_direct
> >> > is shared and have other callers (eg, swapoff).
> >> >
> >> > So make swapin_direct also passes NULL for mm, no feature change.
> >> >
> >> > Signed-off-by: Kairui Song <kasong@tencent.com>
> >> > ---
> >> >  mm/swap_state.c | 2 +-
> >> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >> >
> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> >> > index 6130de8d5226..d39c5369da21 100644
> >> > --- a/mm/swap_state.c
> >> > +++ b/mm/swap_state.c
> >> > @@ -881,7 +881,7 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> >> >       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >> >                               vma, vmf->address, false);
> >> >       if (folio) {
> >> > -             if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> >> > +             if (mem_cgroup_swapin_charge_folio(folio, NULL,
> >> >                                                  GFP_KERNEL, entry)) {
> >> >                       folio_put(folio);
> >> >                       return NULL;
> >>
> >> I think that why not provide "mm" when it's available?  For
> >> swapin_direct() called by do_swap_page(), mm can be provided.  While,
> >> for swapin_direct() called by shmem swapin, mm will be NULL.  We can
> >> even provide "mm" for __read_swap_cache_async() for VMA based swapin and
> >> for the fault address for cluster swapin.
> >
> > Hi, Ying.
> >
> > Thanks for the comment.
> >
> > Without modifying too much code, providing mm here will change swapin
> > charge behaviour on swapoff, we discussed it previously:
> > https://lkml.org/lkml/2023/11/19/320
>
> It's better to use "lore" for kernel email link, for example,
>
> https://lore.kernel.org/lkml/20231119194740.94101-24-ryncsn@gmail.com/
>
> This is more readable than the above link.

Hi Ying,

Thanks for your advice.

>
> > If we want to avoid the behavior change, we have to extend all direct
> > and indirect callers of mem_cgroup_swapin_charge_folio to accept a
> > standalone mm argument (including swapin_direct, swap_vma_readahead,
> > swap_cluster_readahead, __read_swap_cache_async, swapin_entry_mpol,
> > read_swap_cache_async, and some other path may need more audit), and
> > sort things our case by case. I'm not sure if that's a good idea...
> >
> > Simply dropping it here seems the easiest way to avoid such change.
>
> OK.  Didn't realize that they are related directly.  It appears that we
> always pass NULL as mm to mem_cgroup_swapin_charge_folio().  If so,
> please just remove the parameter.

Yes, that's also what I wanted.

>
> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 5/9] mm/swap: introduce swapin_entry for unified readahead policy
  2024-01-05  7:28   ` Huang, Ying
@ 2024-01-10  2:42     ` Kairui Song
  0 siblings, 0 replies; 38+ messages in thread
From: Kairui Song @ 2024-01-10  2:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Fri, Jan 5, 2024 at 3:30 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > Introduce swapin_entry which merges swapin_readahead and swapin_direct
> > making it the main entry for swapin pages, and use a unified swapin
> > policy.
> >
> > This commit makes swapoff make use of this new helper and now swapping
> > off a 10G ZRAM (lzo-rle) is faster since readahead is skipped.
> >
> > Before:
> > time swapoff /dev/zram0
> > real    0m12.337s
> > user    0m0.001s
> > sys     0m12.329s
> >
> > After:
> > time swapoff /dev/zram0
> > real    0m9.728s
> > user    0m0.001s
> > sys     0m9.719s
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/memory.c     | 21 +++++++--------------
> >  mm/swap.h       | 16 ++++------------
> >  mm/swap_state.c | 49 +++++++++++++++++++++++++++++++++----------------
> >  mm/swapfile.c   |  7 ++-----
> >  4 files changed, 46 insertions(+), 47 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0165c8cad489..b56254a875f8 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3801,6 +3801,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       rmap_t rmap_flags = RMAP_NONE;
> >       bool exclusive = false;
> >       swp_entry_t entry;
> > +     bool swapcached;
> >       pte_t pte;
> >       vm_fault_t ret = 0;
> >
> > @@ -3864,21 +3865,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       swapcache = folio;
> >
> >       if (!folio) {
> > -             if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > -                 __swap_count(entry) == 1) {
> > -                     /* skip swapcache and readahead */
> > -                     folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
> > -                     if (folio)
> > -                             page = &folio->page;
> > +             folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> > +                                  vmf, &swapcached);
> > +             if (folio) {
> > +                     page = folio_file_page(folio, swp_offset(entry));
> > +                     if (swapcached)
> > +                             swapcache = folio;
> >               } else {
> > -                     page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > -                                             vmf);
> > -                     if (page)
> > -                             folio = page_folio(page);
> > -                     swapcache = folio;
> > -             }
> > -
> > -             if (!folio) {
> >                       /*
> >                        * Back out if somebody else faulted in this pte
> >                        * while we released the pte lock.
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 83eab7b67e77..502a2801f817 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -54,10 +54,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
> >               bool skip_if_exists);
> >  struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> >               struct mempolicy *mpol, pgoff_t ilx);
> > -struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> > -                           struct vm_fault *vmf);
> > -struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> > -                         struct vm_fault *vmf);
> > +struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
> > +                         struct vm_fault *vmf, bool *swapcached);
> >
> >  static inline unsigned int folio_swap_flags(struct folio *folio)
> >  {
> > @@ -88,14 +86,8 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
> >       return NULL;
> >  }
> >
> > -struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> > -                     struct vm_fault *vmf)
> > -{
> > -     return NULL;
> > -}
> > -
> > -static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> > -                     struct vm_fault *vmf)
> > +static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
> > +                     struct vm_fault *vmf, bool *swapcached)
> >  {
> >       return NULL;
> >  }
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index d39c5369da21..66ff187aa5d3 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -316,6 +316,11 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> >       release_pages(pages, nr);
> >  }
> >
> > +static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
> > +{
> > +     return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> > +}
> > +

Hi Ying,

Thanks for the review.

>
> It appears that there's only one caller of the function in the same
> file?  Why add a function?

A later patch will extend the checker function.
I can defer this change so it won't cause confusion for reviewers.

>
> >  static inline bool swap_use_vma_readahead(void)
> >  {
> >       return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> > @@ -870,8 +875,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> >   * Returns the struct folio for entry and addr after the swap entry is read
> >   * in.
> >   */
> > -struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> > -                         struct vm_fault *vmf)
> > +static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> > +                               struct vm_fault *vmf)
> >  {
> >       struct vm_area_struct *vma = vmf->vma;
> >       struct folio *folio;
> > @@ -908,33 +913,45 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> >  }
> >
> >  /**
> > - * swapin_readahead - swap in pages in hope we need them soon
> > + * swapin_entry - swap in a page from swap entry
> >   * @entry: swap entry of this memory
> >   * @gfp_mask: memory allocation flags
> >   * @vmf: fault information
> > + * @swapcached: pointer to a bool used as indicator if the
> > + *              page is swapped in through swapcache.
> >   *
> >   * Returns the struct page for entry and addr, after queueing swapin.
> >   *
> > - * It's a main entry function for swap readahead. By the configuration,
> > + * It's a main entry function for swap in. By the configuration,
> >   * it will read ahead blocks by cluster-based(ie, physical disk based)
> > - * or vma-based(ie, virtual address based on faulty address) readahead.
> > + * or vma-based(ie, virtual address based on faulty address) readahead,
> > + * or skip the readahead (ie, ramdisk based swap device).
> >   */
> > -struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > -                             struct vm_fault *vmf)
> > +struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> > +                        struct vm_fault *vmf, bool *swapcached)
>
> May be better to use
>
> struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
>                            struct vm_fault *vmf, struct folio **swapcache)
>
> In this way, we can reduce the number of source lines in the caller.

The following commit will rewrite this part to return an enum instead
of a bool, so this is just an intermediate change. And do_swap_page is
the only caller that can benefit from this; it's not helpful for
swapoff/shmem. I think we can just keep it this way here.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 6/9] mm/swap: handle swapcache lookup in swapin_entry
  2024-01-08  8:26   ` Huang, Ying
@ 2024-01-10  2:53     ` Kairui Song
  2024-01-15  1:45       ` Huang, Ying
  0 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-10  2:53 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Mon, Jan 8, 2024 at 4:28 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > Since all callers of swapin_entry need to check the swap cache first, we
> > can merge this common routine into swapin_entry, so it can be shared and
> > optimized later.
> >
> > Also introduce a enum to better represent possible swap cache usage, and
> > add some comments about it, make the usage of swap cache easier to
> > understand.
>
> I don't find any benefit to do this.  The code line number isn't
> reduced.  The concept of swap cache isn't hided either.

Hi Ying

Thanks for the comments.

Maybe I should squash this with the following commit? The following
commit wants to do the cache lookup in swapin_entry to avoid a
redundant shadow lookup, which improves the performance of the swapin
path by about 4%. So it needs to return an enum to represent the cache
status.

Furthermore, note the comments added here:

+/*
+ * Caller of swapin_entry may need to know the cache lookup result:
+ *
+ * SWAP_CACHE_HIT: cache hit, cached folio is returned.
+ * SWAP_CACHE_MISS: cache miss, folio is allocated, read from swap device
+ *                  and added to swap cache, but still may return a cached
+ *                  folio if raced (check __read_swap_cache_async).
+ * SWAP_CACHE_BYPASS: cache miss, folio is new allocated and read
+ *                    from swap device bypassing the cache.
+ */

SWAP_CACHE_MISS might be inaccurate. This is not an issue introduced
by this commit, just better exposed by it, and may be worth a fix
later. So far I can see two benefits of fixing it:
- More accurate maj/min page fault counts.
- The PageHWPoison check in do_swap_page ignores the race case: if a
page getting poisoned races with the swap cache, it may not work as
expected.

These are all minor issues indeed, and some other optimizations might
also be doable, but at least a comment might be helpful.
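
For reference, here is a condensed caller-side sketch of how the
result is meant to be consumed (adapted from the shmem hunk in patch
9/9, with locking, error handling and the page cache insertion
dropped, so treat it as illustration only):

    enum swap_cache_result cache_result;
    struct folio *folio;

    folio = swapin_entry_mpol(swap, gfp, mpol, ilx, &cache_result);
    if (folio && cache_result != SWAP_CACHE_HIT) {
            /* folio was actually read from the device: a major fault */
            count_vm_event(PGMAJFAULT);
    }
    if (folio && cache_result != SWAP_CACHE_BYPASS)
            /* folio went through the swap cache, remove it when done */
            delete_from_swap_cache(folio);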

> --
> Best Regards,
> Huang, Ying
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault
  2024-01-09  1:08   ` Huang, Ying
@ 2024-01-10  3:32     ` Kairui Song
  2024-01-15  1:52       ` Huang, Ying
  0 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-10  3:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Tue, Jan 9, 2024 at 9:11 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > There are two places where swapin is not caused by direct anon page fault:
> > - shmem swapin, invoked indirectly through shmem mapping
> > - swapoff
> >
> > They used to construct a pseudo vmfault struct for swapin function.
> > Shmem has dropped the pseudo vmfault recently in commit ddc1a5cbc05d
> > ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). Swapoff
> > path is still using one.
> >
> > Introduce a helper for them both, this help save stack usage for swapoff
> > path, and help apply a unified swapin cache and readahead policy check.
> >
> > Due to missing vmfault info, the caller have to pass in mempolicy
> > explicitly, make it different from swapin_entry and name it
> > swapin_entry_mpol.
> >
> > This commit convert swapoff to use this helper, follow-up commits will
> > convert shmem to use it too.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swap.h       |  9 +++++++++
> >  mm/swap_state.c | 40 ++++++++++++++++++++++++++++++++--------
> >  mm/swapfile.c   | 15 ++++++---------
> >  3 files changed, 47 insertions(+), 17 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 9180411afcfe..8f790a67b948 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -73,6 +73,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> >               struct mempolicy *mpol, pgoff_t ilx);
> >  struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
> >                           struct vm_fault *vmf, enum swap_cache_result *result);
> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> > +                             struct mempolicy *mpol, pgoff_t ilx,
> > +                             enum swap_cache_result *result);
> >
> >  static inline unsigned int folio_swap_flags(struct folio *folio)
> >  {
> > @@ -109,6 +112,12 @@ static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
> >       return NULL;
> >  }
> >
> > +static inline struct page *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> > +             struct mempolicy *mpol, pgoff_t ilx, enum swap_cache_result *result)
> > +{
> > +     return NULL;
> > +}
> > +
> >  static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> >  {
> >       return 0;
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 21badd4f0fc7..3edf4b63158d 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -880,14 +880,13 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> >   * in.
> >   */
> >  static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> > -                               struct vm_fault *vmf, void *shadow)
> > +                                struct mempolicy *mpol, pgoff_t ilx,
> > +                                void *shadow)
> >  {
> > -     struct vm_area_struct *vma = vmf->vma;
> >       struct folio *folio;
> >
> > -     /* skip swapcache */
> > -     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -                             vma, vmf->address, false);
> > +     folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0,
> > +                     mpol, ilx, numa_node_id());
> >       if (folio) {
> >               if (mem_cgroup_swapin_charge_folio(folio, NULL,
> >                                                  GFP_KERNEL, entry)) {
> > @@ -943,18 +942,18 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> >               goto done;
> >       }
> >
> > +     mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> >       if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> > -             folio = swapin_direct(entry, gfp_mask, vmf, shadow);
> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
> >               cache_result = SWAP_CACHE_BYPASS;
> >       } else {
> > -             mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> >               if (swap_use_vma_readahead())
> >                       folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> >               else
> >                       folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > -             mpol_cond_put(mpol);
> >               cache_result = SWAP_CACHE_MISS;
> >       }
> > +     mpol_cond_put(mpol);
> >  done:
> >       if (result)
> >               *result = cache_result;
> > @@ -962,6 +961,31 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> >       return folio;
> >  }
> >
> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> > +                             struct mempolicy *mpol, pgoff_t ilx,
> > +                             enum swap_cache_result *result)
> > +{
> > +     enum swap_cache_result cache_result;
> > +     void *shadow = NULL;
> > +     struct folio *folio;
> > +
> > +     folio = swap_cache_get_folio(entry, NULL, 0, &shadow);
> > +     if (folio) {
> > +             cache_result = SWAP_CACHE_HIT;
> > +     } else if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
> > +             cache_result = SWAP_CACHE_BYPASS;
> > +     } else {
> > +             folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > +             cache_result = SWAP_CACHE_MISS;
> > +     }
> > +
> > +     if (result)
> > +             *result = cache_result;
> > +
> > +     return folio;
> > +}
> > +
> >  #ifdef CONFIG_SYSFS
> >  static ssize_t vma_ra_enabled_show(struct kobject *kobj,
> >                                    struct kobj_attribute *attr, char *buf)
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 5aa44de11edc..2f77bf143af8 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1840,18 +1840,13 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >       do {
> >               struct folio *folio;
> >               unsigned long offset;
> > +             struct mempolicy *mpol;
> >               unsigned char swp_count;
> >               swp_entry_t entry;
> > +             pgoff_t ilx;
> >               int ret;
> >               pte_t ptent;
> >
> > -             struct vm_fault vmf = {
> > -                     .vma = vma,
> > -                     .address = addr,
> > -                     .real_address = addr,
> > -                     .pmd = pmd,
> > -             };
> > -
> >               if (!pte++) {
> >                       pte = pte_offset_map(pmd, addr);
> >                       if (!pte)
> > @@ -1871,8 +1866,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >               pte_unmap(pte);
> >               pte = NULL;
> >
> > -             folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> > -                                  &vmf, NULL);
> > +             mpol = get_vma_policy(vma, addr, 0, &ilx);
> > +             folio = swapin_entry_mpol(entry, GFP_HIGHUSER_MOVABLE,
> > +                                       mpol, ilx, NULL);
> > +             mpol_cond_put(mpol);
> >               if (!folio) {
> >                       /*
> >                        * The entry could have been freed, and will not
>
> IIUC, after the change, we will always use cluster readahead for
> swapoff.  This may be OK.  But, at least we need some test results which
> show that this will not cause any issue for this behavior change.  And
> the behavior change should be described explicitly in patch description.

Hi Ying

Actually there is a swap_use_no_readahead check in swapin_entry_mpol,
so when readahead is not needed (SWP_SYNCHRONOUS_IO), it's just
skipped.

And I think VMA readahead is not helpful for swapoff: swapoff is
already walking the VMA, mostly uninterrupted in kernel space, so with
or without VMA readahead it will issue IO page by page.
The benchmark result I posted before is actually VMA readahead vs
no-readahead for ZRAM, sorry I didn't make that clear. No-readahead is
clearly faster.

For an actual block device, cluster readahead might be a good choice
for swapoff: all pages will be read during swapoff, there has to be
enough memory for all swapcached pages to stay in memory or swapoff
will fail anyway, and cluster reads are faster for block devices.

>
> And I don't think it's a good abstraction to make swapin_entry_mpol()
> always use cluster swapin, while swapin_entry() will try to use vma
> swapin.  I think we can add "struct mempolicy *mpol" and "pgoff_t ilx"
> to swapin_entry() as parameters, and use them if vmf == NULL.  If we
> want to enforce cluster swapin in swapoff path, it will be better to add
> some comments to describe why.

Good suggestion. I thought extending swapin_entry might make its
argument list too long, but it seems mpol and ilx are the only things
needed here. I'll update it.
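
Something like the following prototype is what I have in mind (purely
illustrative, this is not what was posted in v2):

    struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
                               struct vm_fault *vmf, struct mempolicy *mpol,
                               pgoff_t ilx, enum swap_cache_result *result);

Here mpol and ilx would only be consulted when vmf == NULL (the
swapoff path), and otherwise be derived from vmf as before.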

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally
  2024-01-09  2:03   ` Huang, Ying
@ 2024-01-10  3:35     ` Kairui Song
  2024-01-30  0:39       ` Kairui Song
  0 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-10  3:35 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Tue, Jan 9, 2024 at 10:05 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently, shmem uses cluster readahead for all swap backends. Cluster
> > readahead is not a good solution for ramdisk based device (ZRAM) at all.
> >
> > After switching to the new helper, most benchmarks showed a good result:
> >
> > - Single file sequence read:
> >   perf stat --repeat 20 dd if=/tmpfs/test of=/dev/null bs=1M count=8192
> >   (/tmpfs/test is a zero filled file, using brd as swap, 4G memcg limit)
> >   Before: 22.248 +- 0.549
> >   After:  22.021 +- 0.684 (-1.1%)
> >
> > - Random read stress test:
> >   fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
> >   --size=256m --ioengine=mmap --rw=randread --random_distribution=random \
> >   --time_based --ramp_time=1m --runtime=5m --group_reporting
> >   (using brd as swap, 2G memcg limit)
> >
> >   Before: 1818MiB/s
> >   After:  1888MiB/s (+3.85%)
> >
> > - Zipf biased random read stress test:
> >   fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
> >   --size=256m --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
> >   --time_based --ramp_time=1m --runtime=5m --group_reporting
> >   (using brd as swap, 2G memcg limit)
> >
> >   Before: 31.1GiB/s
> >   After:  32.3GiB/s (+3.86%)
> >
> > So cluster readahead doesn't help much even for single sequence read,
> > and for random stress test, the performance is better without it.
> >
> > Considering both memory and swap device will get more fragmented
> > slowly, and commonly used ZRAM consumes much more CPU than plain
> > ramdisk, false readahead could occur more frequently and waste
> > more CPU. Direct SWAP is cheaper, so use the new helper and skip
> > read ahead for SWP_SYNCHRONOUS_IO device.
>
> It's good to take advantage of swap_direct (no readahead).  I also hopes
> we can take advantage of VMA based swapin if shmem is accessed via mmap.
> That appears possible.

Good idea, that should be doable, will update the series.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 6/9] mm/swap: handle swapcache lookup in swapin_entry
  2024-01-10  2:53     ` Kairui Song
@ 2024-01-15  1:45       ` Huang, Ying
  2024-01-15 17:11         ` Kairui Song
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Ying @ 2024-01-15  1:45 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> On Mon, Jan 8, 2024 at 4:28 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > Since all callers of swapin_entry need to check the swap cache first, we
>> > can merge this common routine into swapin_entry, so it can be shared and
>> > optimized later.
>> >
>> > Also introduce a enum to better represent possible swap cache usage, and
>> > add some comments about it, make the usage of swap cache easier to
>> > understand.
>>
>> I don't find any benefit to do this.  The code line number isn't
>> reduced.  The concept of swap cache isn't hided either.
>
> Hi Ying
>
> Thanks for the comments.
>
> Maybe I should squash this with the following commit? The following
> commit want to do cache lookup in swapin_entry to avoid a redundant
> shadow lookup, it can help to improve the performance by about 4% for
> swapin path.

It's good to improve performance.  But I don't think we must put
swap_cache_get_folio() in swapin_entry() to do that.  We can just get
"shadow" from swap_cache_get_folio() and pass it to swapin_entry().

> So it need to return a enum to represent cache status.

I don't think we are talking about the same thing here.

> Further more, note the comments added here:
>
> +/*
> + * Caller of swapin_entry may need to know the cache lookup result:
> + *
> + * SWAP_CACHE_HIT: cache hit, cached folio is retured.
> + * SWAP_CACHE_MISS: cache miss, folio is allocated, read from swap device
> + *                  and adde to swap cache, but still may return a cached
> + *                  folio if raced (check __read_swap_cache_async).
> + * SWAP_CACHE_BYPASS: cache miss, folio is new allocated and read
> + *                    from swap device bypassing the cache.
> + */
>
> SWAP_CACHE_MISS might be inaccurate, this is not an issue introduced
> by this commit, but better exposed. May worth a fix later. So far I
> can see two benefits fixing it:
> - More accurate maj/min page fault count.
> - Note the PageHWPoison check in do_swap_page, it ignored the race
> case, if a page getting poisoned is raced with swapcache then it may
> not work as expected.
>
> These are all minor issue indeed, some other optimization might also
> be doable, but at least a comment might be helpful.
>

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault
  2024-01-10  3:32     ` Kairui Song
@ 2024-01-15  1:52       ` Huang, Ying
  2024-01-21 18:40         ` Kairui Song
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Ying @ 2024-01-15  1:52 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> On Tue, Jan 9, 2024 at 9:11 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > There are two places where swapin is not caused by direct anon page fault:
>> > - shmem swapin, invoked indirectly through shmem mapping
>> > - swapoff
>> >
>> > They used to construct a pseudo vmfault struct for swapin function.
>> > Shmem has dropped the pseudo vmfault recently in commit ddc1a5cbc05d
>> > ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). Swapoff
>> > path is still using one.
>> >
>> > Introduce a helper for them both, this help save stack usage for swapoff
>> > path, and help apply a unified swapin cache and readahead policy check.
>> >
>> > Due to missing vmfault info, the caller have to pass in mempolicy
>> > explicitly, make it different from swapin_entry and name it
>> > swapin_entry_mpol.
>> >
>> > This commit convert swapoff to use this helper, follow-up commits will
>> > convert shmem to use it too.
>> >
>> > Signed-off-by: Kairui Song <kasong@tencent.com>
>> > ---
>> >  mm/swap.h       |  9 +++++++++
>> >  mm/swap_state.c | 40 ++++++++++++++++++++++++++++++++--------
>> >  mm/swapfile.c   | 15 ++++++---------
>> >  3 files changed, 47 insertions(+), 17 deletions(-)
>> >
>> > diff --git a/mm/swap.h b/mm/swap.h
>> > index 9180411afcfe..8f790a67b948 100644
>> > --- a/mm/swap.h
>> > +++ b/mm/swap.h
>> > @@ -73,6 +73,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>> >               struct mempolicy *mpol, pgoff_t ilx);
>> >  struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
>> >                           struct vm_fault *vmf, enum swap_cache_result *result);
>> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
>> > +                             struct mempolicy *mpol, pgoff_t ilx,
>> > +                             enum swap_cache_result *result);
>> >
>> >  static inline unsigned int folio_swap_flags(struct folio *folio)
>> >  {
>> > @@ -109,6 +112,12 @@ static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
>> >       return NULL;
>> >  }
>> >
>> > +static inline struct page *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
>> > +             struct mempolicy *mpol, pgoff_t ilx, enum swap_cache_result *result)
>> > +{
>> > +     return NULL;
>> > +}
>> > +
>> >  static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>> >  {
>> >       return 0;
>> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> > index 21badd4f0fc7..3edf4b63158d 100644
>> > --- a/mm/swap_state.c
>> > +++ b/mm/swap_state.c
>> > @@ -880,14 +880,13 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>> >   * in.
>> >   */
>> >  static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>> > -                               struct vm_fault *vmf, void *shadow)
>> > +                                struct mempolicy *mpol, pgoff_t ilx,
>> > +                                void *shadow)
>> >  {
>> > -     struct vm_area_struct *vma = vmf->vma;
>> >       struct folio *folio;
>> >
>> > -     /* skip swapcache */
>> > -     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>> > -                             vma, vmf->address, false);
>> > +     folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0,
>> > +                     mpol, ilx, numa_node_id());
>> >       if (folio) {
>> >               if (mem_cgroup_swapin_charge_folio(folio, NULL,
>> >                                                  GFP_KERNEL, entry)) {
>> > @@ -943,18 +942,18 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
>> >               goto done;
>> >       }
>> >
>> > +     mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>> >       if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
>> > -             folio = swapin_direct(entry, gfp_mask, vmf, shadow);
>> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
>> >               cache_result = SWAP_CACHE_BYPASS;
>> >       } else {
>> > -             mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>> >               if (swap_use_vma_readahead())
>> >                       folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
>> >               else
>> >                       folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>> > -             mpol_cond_put(mpol);
>> >               cache_result = SWAP_CACHE_MISS;
>> >       }
>> > +     mpol_cond_put(mpol);
>> >  done:
>> >       if (result)
>> >               *result = cache_result;
>> > @@ -962,6 +961,31 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
>> >       return folio;
>> >  }
>> >
>> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
>> > +                             struct mempolicy *mpol, pgoff_t ilx,
>> > +                             enum swap_cache_result *result)
>> > +{
>> > +     enum swap_cache_result cache_result;
>> > +     void *shadow = NULL;
>> > +     struct folio *folio;
>> > +
>> > +     folio = swap_cache_get_folio(entry, NULL, 0, &shadow);
>> > +     if (folio) {
>> > +             cache_result = SWAP_CACHE_HIT;
>> > +     } else if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
>> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
>> > +             cache_result = SWAP_CACHE_BYPASS;
>> > +     } else {
>> > +             folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>> > +             cache_result = SWAP_CACHE_MISS;
>> > +     }
>> > +
>> > +     if (result)
>> > +             *result = cache_result;
>> > +
>> > +     return folio;
>> > +}
>> > +
>> >  #ifdef CONFIG_SYSFS
>> >  static ssize_t vma_ra_enabled_show(struct kobject *kobj,
>> >                                    struct kobj_attribute *attr, char *buf)
>> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > index 5aa44de11edc..2f77bf143af8 100644
>> > --- a/mm/swapfile.c
>> > +++ b/mm/swapfile.c
>> > @@ -1840,18 +1840,13 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>> >       do {
>> >               struct folio *folio;
>> >               unsigned long offset;
>> > +             struct mempolicy *mpol;
>> >               unsigned char swp_count;
>> >               swp_entry_t entry;
>> > +             pgoff_t ilx;
>> >               int ret;
>> >               pte_t ptent;
>> >
>> > -             struct vm_fault vmf = {
>> > -                     .vma = vma,
>> > -                     .address = addr,
>> > -                     .real_address = addr,
>> > -                     .pmd = pmd,
>> > -             };
>> > -
>> >               if (!pte++) {
>> >                       pte = pte_offset_map(pmd, addr);
>> >                       if (!pte)
>> > @@ -1871,8 +1866,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>> >               pte_unmap(pte);
>> >               pte = NULL;
>> >
>> > -             folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
>> > -                                  &vmf, NULL);
>> > +             mpol = get_vma_policy(vma, addr, 0, &ilx);
>> > +             folio = swapin_entry_mpol(entry, GFP_HIGHUSER_MOVABLE,
>> > +                                       mpol, ilx, NULL);
>> > +             mpol_cond_put(mpol);
>> >               if (!folio) {
>> >                       /*
>> >                        * The entry could have been freed, and will not
>>
>> IIUC, after the change, we will always use cluster readahead for
>> swapoff.  This may be OK.  But, at least we need some test results which
>> show that this will not cause any issue for this behavior change.  And
>> the behavior change should be described explicitly in patch description.
>
> Hi Ying
>
> Actually there is a swap_use_no_readahead check in swapin_entry_mpol,
> so when readahaed is not needed (SYNC_IO), it's just skipped.
>
> And I think VMA readahead is not helpful swapoff, swapoff is already
> walking the VMA, mostly uninterrupted in kernel space. With VMA
> readahead or not, it will issue IO page by page.
> The benchmark result I posted before is actually VMA readhead vs
> no-readahead for ZRAM, sorry I didn't make it clear. It's obvious
> no-readahead is faster.
>
> For actual block device, cluster readahead might be a good choice for
> swapoff, since all pages will be read for swapoff, there has to be
> enough memory for all swapcached page to stay in memory or swapoff
> will fail anyway, and cluster read is faster for block devices.

It's possible.  But please run the tests on some actual block devices
and show your results.  A random memory access pattern should be
tested, and swap space usage should be > 50% to cover a
not-so-friendly situation.

>>
>> And I don't think it's a good abstraction to make swapin_entry_mpol()
>> always use cluster swapin, while swapin_entry() will try to use vma
>> swapin.  I think we can add "struct mempolicy *mpol" and "pgoff_t ilx"
>> to swapin_entry() as parameters, and use them if vmf == NULL.  If we
>> want to enforce cluster swapin in swapoff path, it will be better to add
>> some comments to describe why.
>
> Good suggestion, I thought extending swapin_entry may make its
> arguments list get too long, but seems mpol and ilx are the only thing
> needed here. I'll update it.

Thanks!

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 6/9] mm/swap: handle swapcache lookup in swapin_entry
  2024-01-15  1:45       ` Huang, Ying
@ 2024-01-15 17:11         ` Kairui Song
  0 siblings, 0 replies; 38+ messages in thread
From: Kairui Song @ 2024-01-15 17:11 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Mon, Jan 15, 2024 at 9:47 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Mon, Jan 8, 2024 at 4:28 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Kairui Song <ryncsn@gmail.com> writes:
> >>
> >> > From: Kairui Song <kasong@tencent.com>
> >> >
> >> > Since all callers of swapin_entry need to check the swap cache first, we
> >> > can merge this common routine into swapin_entry, so it can be shared and
> >> > optimized later.
> >> >
> >> > Also introduce a enum to better represent possible swap cache usage, and
> >> > add some comments about it, make the usage of swap cache easier to
> >> > understand.
> >>
> >> I don't find any benefit to do this.  The code line number isn't
> >> reduced.  The concept of swap cache isn't hided either.
> >
> > Hi Ying
> >
> > Thanks for the comments.
> >
> > Maybe I should squash this with the following commit? The following
> > commit want to do cache lookup in swapin_entry to avoid a redundant
> > shadow lookup, it can help to improve the performance by about 4% for
> > swapin path.
>
> It's good to improve performance.  But I don't think we must put
> swap_cache_get_folio() in swapin_entry() to do that.  We can just get
> "shadow" from swap_cache_get_folio() and pass it to swapin_entry().

Good idea. My only concern is that the argument list is getting long
(7 args if we add mpol, ilx and shadow here); will that cause any
issue for readability or performance?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault
  2024-01-15  1:52       ` Huang, Ying
@ 2024-01-21 18:40         ` Kairui Song
  2024-01-22  6:38           ` Huang, Ying
  0 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-21 18:40 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Mon, Jan 15, 2024 at 9:54 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Tue, Jan 9, 2024 at 9:11 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Kairui Song <ryncsn@gmail.com> writes:
> >>
> >> > From: Kairui Song <kasong@tencent.com>
> >> >
> >> > There are two places where swapin is not caused by direct anon page fault:
> >> > - shmem swapin, invoked indirectly through shmem mapping
> >> > - swapoff
> >> >
> >> > They used to construct a pseudo vmfault struct for swapin function.
> >> > Shmem has dropped the pseudo vmfault recently in commit ddc1a5cbc05d
> >> > ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). Swapoff
> >> > path is still using one.
> >> >
> >> > Introduce a helper for them both, this help save stack usage for swapoff
> >> > path, and help apply a unified swapin cache and readahead policy check.
> >> >
> >> > Due to missing vmfault info, the caller have to pass in mempolicy
> >> > explicitly, make it different from swapin_entry and name it
> >> > swapin_entry_mpol.
> >> >
> >> > This commit convert swapoff to use this helper, follow-up commits will
> >> > convert shmem to use it too.
> >> >
> >> > Signed-off-by: Kairui Song <kasong@tencent.com>
> >> > ---
> >> >  mm/swap.h       |  9 +++++++++
> >> >  mm/swap_state.c | 40 ++++++++++++++++++++++++++++++++--------
> >> >  mm/swapfile.c   | 15 ++++++---------
> >> >  3 files changed, 47 insertions(+), 17 deletions(-)
> >> >
> >> > diff --git a/mm/swap.h b/mm/swap.h
> >> > index 9180411afcfe..8f790a67b948 100644
> >> > --- a/mm/swap.h
> >> > +++ b/mm/swap.h
> >> > @@ -73,6 +73,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> >> >               struct mempolicy *mpol, pgoff_t ilx);
> >> >  struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
> >> >                           struct vm_fault *vmf, enum swap_cache_result *result);
> >> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> >> > +                             struct mempolicy *mpol, pgoff_t ilx,
> >> > +                             enum swap_cache_result *result);
> >> >
> >> >  static inline unsigned int folio_swap_flags(struct folio *folio)
> >> >  {
> >> > @@ -109,6 +112,12 @@ static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
> >> >       return NULL;
> >> >  }
> >> >
> >> > +static inline struct page *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> >> > +             struct mempolicy *mpol, pgoff_t ilx, enum swap_cache_result *result)
> >> > +{
> >> > +     return NULL;
> >> > +}
> >> > +
> >> >  static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> >> >  {
> >> >       return 0;
> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> >> > index 21badd4f0fc7..3edf4b63158d 100644
> >> > --- a/mm/swap_state.c
> >> > +++ b/mm/swap_state.c
> >> > @@ -880,14 +880,13 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> >> >   * in.
> >> >   */
> >> >  static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> >> > -                               struct vm_fault *vmf, void *shadow)
> >> > +                                struct mempolicy *mpol, pgoff_t ilx,
> >> > +                                void *shadow)
> >> >  {
> >> > -     struct vm_area_struct *vma = vmf->vma;
> >> >       struct folio *folio;
> >> >
> >> > -     /* skip swapcache */
> >> > -     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >> > -                             vma, vmf->address, false);
> >> > +     folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0,
> >> > +                     mpol, ilx, numa_node_id());
> >> >       if (folio) {
> >> >               if (mem_cgroup_swapin_charge_folio(folio, NULL,
> >> >                                                  GFP_KERNEL, entry)) {
> >> > @@ -943,18 +942,18 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> >> >               goto done;
> >> >       }
> >> >
> >> > +     mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> >> >       if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> >> > -             folio = swapin_direct(entry, gfp_mask, vmf, shadow);
> >> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
> >> >               cache_result = SWAP_CACHE_BYPASS;
> >> >       } else {
> >> > -             mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> >> >               if (swap_use_vma_readahead())
> >> >                       folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> >> >               else
> >> >                       folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> >> > -             mpol_cond_put(mpol);
> >> >               cache_result = SWAP_CACHE_MISS;
> >> >       }
> >> > +     mpol_cond_put(mpol);
> >> >  done:
> >> >       if (result)
> >> >               *result = cache_result;
> >> > @@ -962,6 +961,31 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> >> >       return folio;
> >> >  }
> >> >
> >> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> >> > +                             struct mempolicy *mpol, pgoff_t ilx,
> >> > +                             enum swap_cache_result *result)
> >> > +{
> >> > +     enum swap_cache_result cache_result;
> >> > +     void *shadow = NULL;
> >> > +     struct folio *folio;
> >> > +
> >> > +     folio = swap_cache_get_folio(entry, NULL, 0, &shadow);
> >> > +     if (folio) {
> >> > +             cache_result = SWAP_CACHE_HIT;
> >> > +     } else if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> >> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
> >> > +             cache_result = SWAP_CACHE_BYPASS;
> >> > +     } else {
> >> > +             folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> >> > +             cache_result = SWAP_CACHE_MISS;
> >> > +     }
> >> > +
> >> > +     if (result)
> >> > +             *result = cache_result;
> >> > +
> >> > +     return folio;
> >> > +}
> >> > +
> >> >  #ifdef CONFIG_SYSFS
> >> >  static ssize_t vma_ra_enabled_show(struct kobject *kobj,
> >> >                                    struct kobj_attribute *attr, char *buf)
> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> > index 5aa44de11edc..2f77bf143af8 100644
> >> > --- a/mm/swapfile.c
> >> > +++ b/mm/swapfile.c
> >> > @@ -1840,18 +1840,13 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >> >       do {
> >> >               struct folio *folio;
> >> >               unsigned long offset;
> >> > +             struct mempolicy *mpol;
> >> >               unsigned char swp_count;
> >> >               swp_entry_t entry;
> >> > +             pgoff_t ilx;
> >> >               int ret;
> >> >               pte_t ptent;
> >> >
> >> > -             struct vm_fault vmf = {
> >> > -                     .vma = vma,
> >> > -                     .address = addr,
> >> > -                     .real_address = addr,
> >> > -                     .pmd = pmd,
> >> > -             };
> >> > -
> >> >               if (!pte++) {
> >> >                       pte = pte_offset_map(pmd, addr);
> >> >                       if (!pte)
> >> > @@ -1871,8 +1866,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >> >               pte_unmap(pte);
> >> >               pte = NULL;
> >> >
> >> > -             folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> >> > -                                  &vmf, NULL);
> >> > +             mpol = get_vma_policy(vma, addr, 0, &ilx);
> >> > +             folio = swapin_entry_mpol(entry, GFP_HIGHUSER_MOVABLE,
> >> > +                                       mpol, ilx, NULL);
> >> > +             mpol_cond_put(mpol);
> >> >               if (!folio) {
> >> >                       /*
> >> >                        * The entry could have been freed, and will not
> >>
> >> IIUC, after the change, we will always use cluster readahead for
> >> swapoff.  This may be OK.  But, at least we need some test results which
> >> show that this will not cause any issue for this behavior change.  And
> >> the behavior change should be described explicitly in patch description.
> >
> > Hi Ying
> >
> > Actually there is a swap_use_no_readahead check in swapin_entry_mpol,
> > so when readahaed is not needed (SYNC_IO), it's just skipped.
> >
> > And I think VMA readahead is not helpful swapoff, swapoff is already
> > walking the VMA, mostly uninterrupted in kernel space. With VMA
> > readahead or not, it will issue IO page by page.
> > The benchmark result I posted before is actually VMA readhead vs
> > no-readahead for ZRAM, sorry I didn't make it clear. It's obvious
> > no-readahead is faster.
> >
> > For actual block device, cluster readahead might be a good choice for
> > swapoff, since all pages will be read for swapoff, there has to be
> > enough memory for all swapcached page to stay in memory or swapoff
> > will fail anyway, and cluster read is faster for block devices.
>
> It's possible.  But please run the tests on some actual block devices
> and show your results.  Random memory accessing pattern should be
> tested, and the swap space usage should be > 50% to show some not so
> friendly situation.
>

Hi Ying,

I set up a test environment and did the following test, and found that
cluster readahead for swapoff is actually much worse in the default setup:

1. Set up a MySQL server in a 2G memcg, with a 28G buffer pool and 24G NVMe swap
2. Stress test with sysbench for 15 minutes.
3. Remove the 2G memcg limit and swapoff.

Before this patch, swapoff takes about 9m.
After this patch, swapoff takes about 30m.

After some analysis I found the reason is that cluster readahead is
almost disabled (window == 1 or 2) during swapoff, because it detects a
very low hit rate on the fragmented swap. But VMA readahead is much
more aggressive here since swapoff is walking the VMA, with a very
high hit rate.

But if I force cluster readahead to use a large window for swapoff,
swapoff performance improves by a lot, by adding the following change
in swap_cluster_readahead:

if (unlikely(!(si->flags & SWP_WRITEOK)))
    mask = max_t(unsigned long, 1 << READ_ONCE(page_cluster),
                 PMD_SIZE / PAGE_SIZE) - 1;

With that, swapoff takes only 40s to finish, more than 10x faster than
the VMA readahead path (9m), because VMA readahead is still doing 4K
random IO, just with a longer queue due to async readahead, while
cluster readahead is now doing 2M IO. I think a PMD-sized window is
good here since it keeps a balance: IO performance is good, the swapoff
progress can still be interrupted, and the system stays responsive. In
most cases we expect swapoff to succeed; if it fails, the window should
still keep the side effect of generating extra swap cache acceptable.
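For reference, this is roughly how that change would sit in context. It
is only an untested sketch against the current swap_cluster_readahead(),
assuming it goes right after the existing swapin_nr_pages() window
calculation:

    mask = swapin_nr_pages(offset) - 1;
    /*
     * Sketch only: when the device is being swapped off (swapoff has
     * already cleared SWP_WRITEOK), force at least a PMD-sized window so
     * swapoff issues large sequential reads instead of 4K random IO.
     */
    if (unlikely(!(si->flags & SWP_WRITEOK)))
        mask = max_t(unsigned long, 1 << READ_ONCE(page_cluster),
                     PMD_SIZE / PAGE_SIZE) - 1;
    if (!mask)
        goto skip;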

But this showed a bad effect of ignoring the mempolicy. Actually this
is not a new issue: cluster readahead already violates the VMA's
mempolicy, since it does readahead based only on the entry value, not
the VMA; the entry being swapped in is not aware of which VMA it
belongs to.

And I was thinking, maybe we can just drop the mpol all the way and
use the nid from the page's shadow entry to allocate pages (rough
sketch below). That may save a lot of effort, make cluster readahead
more usable in general, and might also simplify a lot of code. What do
you think? If this is acceptable, I think I can send out a new series
first and then rework this one later.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault
  2024-01-21 18:40         ` Kairui Song
@ 2024-01-22  6:38           ` Huang, Ying
  2024-01-22 11:35             ` Kairui Song
  0 siblings, 1 reply; 38+ messages in thread
From: Huang, Ying @ 2024-01-22  6:38 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> On Mon, Jan 15, 2024 at 9:54 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > Huang, Ying <ying.huang@intel.com> wrote on Tue, Jan 9, 2024 at 09:11:
>> >>
>> >> Kairui Song <ryncsn@gmail.com> writes:
>> >>
>> >> > From: Kairui Song <kasong@tencent.com>
>> >> >
>> >> > There are two places where swapin is not caused by direct anon page fault:
>> >> > - shmem swapin, invoked indirectly through shmem mapping
>> >> > - swapoff
>> >> >
>> >> > They used to construct a pseudo vmfault struct for swapin function.
>> >> > Shmem has dropped the pseudo vmfault recently in commit ddc1a5cbc05d
>> >> > ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). Swapoff
>> >> > path is still using one.
>> >> >
>> >> > Introduce a helper for them both, this help save stack usage for swapoff
>> >> > path, and help apply a unified swapin cache and readahead policy check.
>> >> >
>> >> > Due to missing vmfault info, the caller have to pass in mempolicy
>> >> > explicitly, make it different from swapin_entry and name it
>> >> > swapin_entry_mpol.
>> >> >
>> >> > This commit convert swapoff to use this helper, follow-up commits will
>> >> > convert shmem to use it too.
>> >> >
>> >> > Signed-off-by: Kairui Song <kasong@tencent.com>
>> >> > ---
>> >> >  mm/swap.h       |  9 +++++++++
>> >> >  mm/swap_state.c | 40 ++++++++++++++++++++++++++++++++--------
>> >> >  mm/swapfile.c   | 15 ++++++---------
>> >> >  3 files changed, 47 insertions(+), 17 deletions(-)
>> >> >
>> >> > diff --git a/mm/swap.h b/mm/swap.h
>> >> > index 9180411afcfe..8f790a67b948 100644
>> >> > --- a/mm/swap.h
>> >> > +++ b/mm/swap.h
>> >> > @@ -73,6 +73,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>> >> >               struct mempolicy *mpol, pgoff_t ilx);
>> >> >  struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
>> >> >                           struct vm_fault *vmf, enum swap_cache_result *result);
>> >> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
>> >> > +                             struct mempolicy *mpol, pgoff_t ilx,
>> >> > +                             enum swap_cache_result *result);
>> >> >
>> >> >  static inline unsigned int folio_swap_flags(struct folio *folio)
>> >> >  {
>> >> > @@ -109,6 +112,12 @@ static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
>> >> >       return NULL;
>> >> >  }
>> >> >
>> >> > +static inline struct page *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
>> >> > +             struct mempolicy *mpol, pgoff_t ilx, enum swap_cache_result *result)
>> >> > +{
>> >> > +     return NULL;
>> >> > +}
>> >> > +
>> >> >  static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>> >> >  {
>> >> >       return 0;
>> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> >> > index 21badd4f0fc7..3edf4b63158d 100644
>> >> > --- a/mm/swap_state.c
>> >> > +++ b/mm/swap_state.c
>> >> > @@ -880,14 +880,13 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>> >> >   * in.
>> >> >   */
>> >> >  static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>> >> > -                               struct vm_fault *vmf, void *shadow)
>> >> > +                                struct mempolicy *mpol, pgoff_t ilx,
>> >> > +                                void *shadow)
>> >> >  {
>> >> > -     struct vm_area_struct *vma = vmf->vma;
>> >> >       struct folio *folio;
>> >> >
>> >> > -     /* skip swapcache */
>> >> > -     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>> >> > -                             vma, vmf->address, false);
>> >> > +     folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0,
>> >> > +                     mpol, ilx, numa_node_id());
>> >> >       if (folio) {
>> >> >               if (mem_cgroup_swapin_charge_folio(folio, NULL,
>> >> >                                                  GFP_KERNEL, entry)) {
>> >> > @@ -943,18 +942,18 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
>> >> >               goto done;
>> >> >       }
>> >> >
>> >> > +     mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>> >> >       if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
>> >> > -             folio = swapin_direct(entry, gfp_mask, vmf, shadow);
>> >> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
>> >> >               cache_result = SWAP_CACHE_BYPASS;
>> >> >       } else {
>> >> > -             mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>> >> >               if (swap_use_vma_readahead())
>> >> >                       folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
>> >> >               else
>> >> >                       folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>> >> > -             mpol_cond_put(mpol);
>> >> >               cache_result = SWAP_CACHE_MISS;
>> >> >       }
>> >> > +     mpol_cond_put(mpol);
>> >> >  done:
>> >> >       if (result)
>> >> >               *result = cache_result;
>> >> > @@ -962,6 +961,31 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
>> >> >       return folio;
>> >> >  }
>> >> >
>> >> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
>> >> > +                             struct mempolicy *mpol, pgoff_t ilx,
>> >> > +                             enum swap_cache_result *result)
>> >> > +{
>> >> > +     enum swap_cache_result cache_result;
>> >> > +     void *shadow = NULL;
>> >> > +     struct folio *folio;
>> >> > +
>> >> > +     folio = swap_cache_get_folio(entry, NULL, 0, &shadow);
>> >> > +     if (folio) {
>> >> > +             cache_result = SWAP_CACHE_HIT;
>> >> > +     } else if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
>> >> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
>> >> > +             cache_result = SWAP_CACHE_BYPASS;
>> >> > +     } else {
>> >> > +             folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>> >> > +             cache_result = SWAP_CACHE_MISS;
>> >> > +     }
>> >> > +
>> >> > +     if (result)
>> >> > +             *result = cache_result;
>> >> > +
>> >> > +     return folio;
>> >> > +}
>> >> > +
>> >> >  #ifdef CONFIG_SYSFS
>> >> >  static ssize_t vma_ra_enabled_show(struct kobject *kobj,
>> >> >                                    struct kobj_attribute *attr, char *buf)
>> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> > index 5aa44de11edc..2f77bf143af8 100644
>> >> > --- a/mm/swapfile.c
>> >> > +++ b/mm/swapfile.c
>> >> > @@ -1840,18 +1840,13 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>> >> >       do {
>> >> >               struct folio *folio;
>> >> >               unsigned long offset;
>> >> > +             struct mempolicy *mpol;
>> >> >               unsigned char swp_count;
>> >> >               swp_entry_t entry;
>> >> > +             pgoff_t ilx;
>> >> >               int ret;
>> >> >               pte_t ptent;
>> >> >
>> >> > -             struct vm_fault vmf = {
>> >> > -                     .vma = vma,
>> >> > -                     .address = addr,
>> >> > -                     .real_address = addr,
>> >> > -                     .pmd = pmd,
>> >> > -             };
>> >> > -
>> >> >               if (!pte++) {
>> >> >                       pte = pte_offset_map(pmd, addr);
>> >> >                       if (!pte)
>> >> > @@ -1871,8 +1866,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>> >> >               pte_unmap(pte);
>> >> >               pte = NULL;
>> >> >
>> >> > -             folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
>> >> > -                                  &vmf, NULL);
>> >> > +             mpol = get_vma_policy(vma, addr, 0, &ilx);
>> >> > +             folio = swapin_entry_mpol(entry, GFP_HIGHUSER_MOVABLE,
>> >> > +                                       mpol, ilx, NULL);
>> >> > +             mpol_cond_put(mpol);
>> >> >               if (!folio) {
>> >> >                       /*
>> >> >                        * The entry could have been freed, and will not
>> >>
>> >> IIUC, after the change, we will always use cluster readahead for
>> >> swapoff.  This may be OK.  But, at least we need some test results which
>> >> show that this will not cause any issue for this behavior change.  And
>> >> the behavior change should be described explicitly in patch description.
>> >
>> > Hi Ying
>> >
>> > Actually there is a swap_use_no_readahead check in swapin_entry_mpol,
>> > so when readahaed is not needed (SYNC_IO), it's just skipped.
>> >
>> > And I think VMA readahead is not helpful swapoff, swapoff is already
>> > walking the VMA, mostly uninterrupted in kernel space. With VMA
>> > readahead or not, it will issue IO page by page.
>> > The benchmark result I posted before is actually VMA readhead vs
>> > no-readahead for ZRAM, sorry I didn't make it clear. It's obvious
>> > no-readahead is faster.
>> >
>> > For actual block device, cluster readahead might be a good choice for
>> > swapoff, since all pages will be read for swapoff, there has to be
>> > enough memory for all swapcached page to stay in memory or swapoff
>> > will fail anyway, and cluster read is faster for block devices.
>>
>> It's possible.  But please run the tests on some actual block devices
>> and show your results.  Random memory accessing pattern should be
>> tested, and the swap space usage should be > 50% to show some not so
>> friendly situation.
>>
>
> Hi Ying,
>
> I setup a test environment and did following test, and found that
> cluster readahaed for swapoff is actually much worse in default setup:
>
> 1. Setup MySQL server using 2G memcg, with 28G buffer pool, and 24G NVME swap
> 2. Stress test with sysbench for 15min.
> 3. Remove the 2G memcg limit and swapoff.
>
> Before this patch, swapoff will take about 9m.
> After this patch, swapoff will take about 30m.

Thanks for the data!

> After some analysis I found the reason is that cluster readahead is
> almost disabled (window == 1 or 2) during swapoff, because it will
> detect a very low hit rate on fragmented swap. But VMA readhead is
> much more aggressive here since swapoff is walking the VMA, with a
> very high hit rate.
>
> But If I force cluster readahead to use a large window for swapoff,
> the swapoff performance boost by a lot:
> By adding following change in swap_cluster_readahead:
>
> if (unlikely(!(si->flags & SWP_WRITEOK)))
>     mask = max_t(unsigned long, 1 << READ_ONCE(page_cluster),
>                  PMD_SIZE / PAGE_SIZE) - 1;
>
> The swapoff will take only 40s to finish, more than 10x faster than
> the VMA readahead path (9m), because VMA readhead is still doing 4K
> random IO just with a longer queue due to async readahead. But cluster
> readhead will be doing 2M IO now.
> I think PMD size window is good here since it still keep a balance
> between good IO performance and the swapoff progress can still be
> interrupted, and the system is responsible. And in most cases we
> expect swapoff to success, if it fail, the RA windows should still
> keep the side effect of extra swapcache being generated acceptable.

swapoff performance isn't very important because swapoff is a very rare
operation.  It's OK to optimize it if the change is simple and doesn't
compromise other stuff.  But, as you said below, using a large readahead
window makes the mempolicy issue more serious.  Why isn't the original
swapoff performance good enough for you?

> But this showed a bad effect of ignoring mem policy. Actually this is
> not a new issue, cluster readhead is already violating VMA's mem
> policy since it does readhead only based on entry value not VMA, the
> entry being swapped in is not aware of which VMA it belongs.
>
> And I was thinking, maybe we can just drop the mpol all the way, and
> use the nid from page shadow to alloc pages, that may save a lot of
> effort, and make cluster readhead more usable in general, also might
> simplify a lot of code. How do you think? If this is acceptable, I
> think I can send out a new series first and then rework this one
> later.

The "shadow" node can be reclaimed, please take a look at
scan_shadow_nodes().  Although this hasn't been implemented, it may be
implemented someday.

And, shadow nid may not record the correct memory policy for the memory
allocation.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault
  2024-01-22  6:38           ` Huang, Ying
@ 2024-01-22 11:35             ` Kairui Song
  2024-01-24  3:31               ` Huang, Ying
  0 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-22 11:35 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

On Mon, Jan 22, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Mon, Jan 15, 2024 at 9:54 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Kairui Song <ryncsn@gmail.com> writes:
> >>
> >> > Huang, Ying <ying.huang@intel.com> wrote on Tue, Jan 9, 2024 at 09:11:
> >> >>
> >> >> Kairui Song <ryncsn@gmail.com> writes:
> >> >>
> >> >> > From: Kairui Song <kasong@tencent.com>
> >> >> >
> >> >> > There are two places where swapin is not caused by direct anon page fault:
> >> >> > - shmem swapin, invoked indirectly through shmem mapping
> >> >> > - swapoff
> >> >> >
> >> >> > They used to construct a pseudo vmfault struct for swapin function.
> >> >> > Shmem has dropped the pseudo vmfault recently in commit ddc1a5cbc05d
> >> >> > ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). Swapoff
> >> >> > path is still using one.
> >> >> >
> >> >> > Introduce a helper for them both, this help save stack usage for swapoff
> >> >> > path, and help apply a unified swapin cache and readahead policy check.
> >> >> >
> >> >> > Due to missing vmfault info, the caller have to pass in mempolicy
> >> >> > explicitly, make it different from swapin_entry and name it
> >> >> > swapin_entry_mpol.
> >> >> >
> >> >> > This commit convert swapoff to use this helper, follow-up commits will
> >> >> > convert shmem to use it too.
> >> >> >
> >> >> > Signed-off-by: Kairui Song <kasong@tencent.com>
> >> >> > ---
> >> >> >  mm/swap.h       |  9 +++++++++
> >> >> >  mm/swap_state.c | 40 ++++++++++++++++++++++++++++++++--------
> >> >> >  mm/swapfile.c   | 15 ++++++---------
> >> >> >  3 files changed, 47 insertions(+), 17 deletions(-)
> >> >> >
> >> >> > diff --git a/mm/swap.h b/mm/swap.h
> >> >> > index 9180411afcfe..8f790a67b948 100644
> >> >> > --- a/mm/swap.h
> >> >> > +++ b/mm/swap.h
> >> >> > @@ -73,6 +73,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> >> >> >               struct mempolicy *mpol, pgoff_t ilx);
> >> >> >  struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
> >> >> >                           struct vm_fault *vmf, enum swap_cache_result *result);
> >> >> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> >> >> > +                             struct mempolicy *mpol, pgoff_t ilx,
> >> >> > +                             enum swap_cache_result *result);
> >> >> >
> >> >> >  static inline unsigned int folio_swap_flags(struct folio *folio)
> >> >> >  {
> >> >> > @@ -109,6 +112,12 @@ static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
> >> >> >       return NULL;
> >> >> >  }
> >> >> >
> >> >> > +static inline struct page *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> >> >> > +             struct mempolicy *mpol, pgoff_t ilx, enum swap_cache_result *result)
> >> >> > +{
> >> >> > +     return NULL;
> >> >> > +}
> >> >> > +
> >> >> >  static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> >> >> >  {
> >> >> >       return 0;
> >> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> >> >> > index 21badd4f0fc7..3edf4b63158d 100644
> >> >> > --- a/mm/swap_state.c
> >> >> > +++ b/mm/swap_state.c
> >> >> > @@ -880,14 +880,13 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> >> >> >   * in.
> >> >> >   */
> >> >> >  static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> >> >> > -                               struct vm_fault *vmf, void *shadow)
> >> >> > +                                struct mempolicy *mpol, pgoff_t ilx,
> >> >> > +                                void *shadow)
> >> >> >  {
> >> >> > -     struct vm_area_struct *vma = vmf->vma;
> >> >> >       struct folio *folio;
> >> >> >
> >> >> > -     /* skip swapcache */
> >> >> > -     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >> >> > -                             vma, vmf->address, false);
> >> >> > +     folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0,
> >> >> > +                     mpol, ilx, numa_node_id());
> >> >> >       if (folio) {
> >> >> >               if (mem_cgroup_swapin_charge_folio(folio, NULL,
> >> >> >                                                  GFP_KERNEL, entry)) {
> >> >> > @@ -943,18 +942,18 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> >> >> >               goto done;
> >> >> >       }
> >> >> >
> >> >> > +     mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> >> >> >       if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> >> >> > -             folio = swapin_direct(entry, gfp_mask, vmf, shadow);
> >> >> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
> >> >> >               cache_result = SWAP_CACHE_BYPASS;
> >> >> >       } else {
> >> >> > -             mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> >> >> >               if (swap_use_vma_readahead())
> >> >> >                       folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> >> >> >               else
> >> >> >                       folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> >> >> > -             mpol_cond_put(mpol);
> >> >> >               cache_result = SWAP_CACHE_MISS;
> >> >> >       }
> >> >> > +     mpol_cond_put(mpol);
> >> >> >  done:
> >> >> >       if (result)
> >> >> >               *result = cache_result;
> >> >> > @@ -962,6 +961,31 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
> >> >> >       return folio;
> >> >> >  }
> >> >> >
> >> >> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
> >> >> > +                             struct mempolicy *mpol, pgoff_t ilx,
> >> >> > +                             enum swap_cache_result *result)
> >> >> > +{
> >> >> > +     enum swap_cache_result cache_result;
> >> >> > +     void *shadow = NULL;
> >> >> > +     struct folio *folio;
> >> >> > +
> >> >> > +     folio = swap_cache_get_folio(entry, NULL, 0, &shadow);
> >> >> > +     if (folio) {
> >> >> > +             cache_result = SWAP_CACHE_HIT;
> >> >> > +     } else if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> >> >> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
> >> >> > +             cache_result = SWAP_CACHE_BYPASS;
> >> >> > +     } else {
> >> >> > +             folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> >> >> > +             cache_result = SWAP_CACHE_MISS;
> >> >> > +     }
> >> >> > +
> >> >> > +     if (result)
> >> >> > +             *result = cache_result;
> >> >> > +
> >> >> > +     return folio;
> >> >> > +}
> >> >> > +
> >> >> >  #ifdef CONFIG_SYSFS
> >> >> >  static ssize_t vma_ra_enabled_show(struct kobject *kobj,
> >> >> >                                    struct kobj_attribute *attr, char *buf)
> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> > index 5aa44de11edc..2f77bf143af8 100644
> >> >> > --- a/mm/swapfile.c
> >> >> > +++ b/mm/swapfile.c
> >> >> > @@ -1840,18 +1840,13 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >> >> >       do {
> >> >> >               struct folio *folio;
> >> >> >               unsigned long offset;
> >> >> > +             struct mempolicy *mpol;
> >> >> >               unsigned char swp_count;
> >> >> >               swp_entry_t entry;
> >> >> > +             pgoff_t ilx;
> >> >> >               int ret;
> >> >> >               pte_t ptent;
> >> >> >
> >> >> > -             struct vm_fault vmf = {
> >> >> > -                     .vma = vma,
> >> >> > -                     .address = addr,
> >> >> > -                     .real_address = addr,
> >> >> > -                     .pmd = pmd,
> >> >> > -             };
> >> >> > -
> >> >> >               if (!pte++) {
> >> >> >                       pte = pte_offset_map(pmd, addr);
> >> >> >                       if (!pte)
> >> >> > @@ -1871,8 +1866,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >> >> >               pte_unmap(pte);
> >> >> >               pte = NULL;
> >> >> >
> >> >> > -             folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
> >> >> > -                                  &vmf, NULL);
> >> >> > +             mpol = get_vma_policy(vma, addr, 0, &ilx);
> >> >> > +             folio = swapin_entry_mpol(entry, GFP_HIGHUSER_MOVABLE,
> >> >> > +                                       mpol, ilx, NULL);
> >> >> > +             mpol_cond_put(mpol);
> >> >> >               if (!folio) {
> >> >> >                       /*
> >> >> >                        * The entry could have been freed, and will not
> >> >>
> >> >> IIUC, after the change, we will always use cluster readahead for
> >> >> swapoff.  This may be OK.  But, at least we need some test results which
> >> >> show that this will not cause any issue for this behavior change.  And
> >> >> the behavior change should be described explicitly in patch description.
> >> >
> >> > Hi Ying
> >> >
> >> > Actually there is a swap_use_no_readahead check in swapin_entry_mpol,
> >> > so when readahaed is not needed (SYNC_IO), it's just skipped.
> >> >
> >> > And I think VMA readahead is not helpful swapoff, swapoff is already
> >> > walking the VMA, mostly uninterrupted in kernel space. With VMA
> >> > readahead or not, it will issue IO page by page.
> >> > The benchmark result I posted before is actually VMA readhead vs
> >> > no-readahead for ZRAM, sorry I didn't make it clear. It's obvious
> >> > no-readahead is faster.
> >> >
> >> > For actual block device, cluster readahead might be a good choice for
> >> > swapoff, since all pages will be read for swapoff, there has to be
> >> > enough memory for all swapcached page to stay in memory or swapoff
> >> > will fail anyway, and cluster read is faster for block devices.
> >>
> >> It's possible.  But please run the tests on some actual block devices
> >> and show your results.  Random memory accessing pattern should be
> >> tested, and the swap space usage should be > 50% to show some not so
> >> friendly situation.
> >>
> >
> > Hi Ying,
> >
> > I setup a test environment and did following test, and found that
> > cluster readahaed for swapoff is actually much worse in default setup:
> >
> > 1. Setup MySQL server using 2G memcg, with 28G buffer pool, and 24G NVME swap
> > 2. Stress test with sysbench for 15min.
> > 3. Remove the 2G memcg limit and swapoff.
> >
> > Before this patch, swapoff will take about 9m.
> > After this patch, swapoff will take about 30m.
>
> Thanks for data!
>
> > After some analysis I found the reason is that cluster readahead is
> > almost disabled (window == 1 or 2) during swapoff, because it will
> > detect a very low hit rate on fragmented swap. But VMA readhead is
> > much more aggressive here since swapoff is walking the VMA, with a
> > very high hit rate.
> >
> > But If I force cluster readahead to use a large window for swapoff,
> > the swapoff performance boost by a lot:
> > By adding following change in swap_cluster_readahead:
> >
> > if (unlikely(!(si->flags & SWP_WRITEOK)))
> >     mask = max_t(unsigned long, 1 << READ_ONCE(page_cluster),
> >                  PMD_SIZE / PAGE_SIZE) - 1;
> >
> > The swapoff will take only 40s to finish, more than 10x faster than
> > the VMA readahead path (9m), because VMA readhead is still doing 4K
> > random IO just with a longer queue due to async readahead. But cluster
> > readhead will be doing 2M IO now.
> > I think PMD size window is good here since it still keep a balance
> > between good IO performance and the swapoff progress can still be
> > interrupted, and the system is responsible. And in most cases we
> > expect swapoff to success, if it fail, the RA windows should still
> > keep the side effect of extra swapcache being generated acceptable.
>
> swapoff performance isn't very important because swapoff is a very rare
> operation.  It's OK to optimize it if the change is simple and doesn't
> compromise other stuff.  But, as you said below, using large readahead
> window makes mempolicy issue more serious.  Why isn't the original
> swapoff performance good enough for you?

Thanks for the reply.

I think I'll just keep the original VMA readahead policy here then.
I just noticed that VMA readahead will also violate ranged memory
policy... That's a different issue though, and it looks trivial.

>
> > But this showed a bad effect of ignoring mem policy. Actually this is
> > not a new issue, cluster readhead is already violating VMA's mem
> > policy since it does readhead only based on entry value not VMA, the
> > entry being swapped in is not aware of which VMA it belongs.
> >
> > And I was thinking, maybe we can just drop the mpol all the way, and
> > use the nid from page shadow to alloc pages, that may save a lot of
> > effort, and make cluster readhead more usable in general, also might
> > simplify a lot of code. How do you think? If this is acceptable, I
> > think I can send out a new series first and then rework this one
> > later.
>
> The "shadow" node can be reclaimed, please take a look at
> scan_shadow_nodes().  Although this hasn't been implemented, it may be
> implemented someday.

Right, I noticed upstream commit 5649d113ffce ("swap_state: update
shadow_nodes for anonymous page") started reclaiming anon page
shadows now, thanks for the info.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault
  2024-01-22 11:35             ` Kairui Song
@ 2024-01-24  3:31               ` Huang, Ying
  0 siblings, 0 replies; 38+ messages in thread
From: Huang, Ying @ 2024-01-24  3:31 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> On Mon, Jan 22, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > On Mon, Jan 15, 2024 at 9:54 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Kairui Song <ryncsn@gmail.com> writes:
>> >>
>> >> > Huang, Ying <ying.huang@intel.com> wrote on Tue, Jan 9, 2024 at 09:11:
>> >> >>
>> >> >> Kairui Song <ryncsn@gmail.com> writes:
>> >> >>
>> >> >> > From: Kairui Song <kasong@tencent.com>
>> >> >> >
>> >> >> > There are two places where swapin is not caused by direct anon page fault:
>> >> >> > - shmem swapin, invoked indirectly through shmem mapping
>> >> >> > - swapoff
>> >> >> >
>> >> >> > They used to construct a pseudo vmfault struct for swapin function.
>> >> >> > Shmem has dropped the pseudo vmfault recently in commit ddc1a5cbc05d
>> >> >> > ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). Swapoff
>> >> >> > path is still using one.
>> >> >> >
>> >> >> > Introduce a helper for them both, this help save stack usage for swapoff
>> >> >> > path, and help apply a unified swapin cache and readahead policy check.
>> >> >> >
>> >> >> > Due to missing vmfault info, the caller have to pass in mempolicy
>> >> >> > explicitly, make it different from swapin_entry and name it
>> >> >> > swapin_entry_mpol.
>> >> >> >
>> >> >> > This commit convert swapoff to use this helper, follow-up commits will
>> >> >> > convert shmem to use it too.
>> >> >> >
>> >> >> > Signed-off-by: Kairui Song <kasong@tencent.com>
>> >> >> > ---
>> >> >> >  mm/swap.h       |  9 +++++++++
>> >> >> >  mm/swap_state.c | 40 ++++++++++++++++++++++++++++++++--------
>> >> >> >  mm/swapfile.c   | 15 ++++++---------
>> >> >> >  3 files changed, 47 insertions(+), 17 deletions(-)
>> >> >> >
>> >> >> > diff --git a/mm/swap.h b/mm/swap.h
>> >> >> > index 9180411afcfe..8f790a67b948 100644
>> >> >> > --- a/mm/swap.h
>> >> >> > +++ b/mm/swap.h
>> >> >> > @@ -73,6 +73,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>> >> >> >               struct mempolicy *mpol, pgoff_t ilx);
>> >> >> >  struct folio *swapin_entry(swp_entry_t entry, gfp_t flag,
>> >> >> >                           struct vm_fault *vmf, enum swap_cache_result *result);
>> >> >> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
>> >> >> > +                             struct mempolicy *mpol, pgoff_t ilx,
>> >> >> > +                             enum swap_cache_result *result);
>> >> >> >
>> >> >> >  static inline unsigned int folio_swap_flags(struct folio *folio)
>> >> >> >  {
>> >> >> > @@ -109,6 +112,12 @@ static inline struct folio *swapin_entry(swp_entry_t swp, gfp_t gfp_mask,
>> >> >> >       return NULL;
>> >> >> >  }
>> >> >> >
>> >> >> > +static inline struct page *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
>> >> >> > +             struct mempolicy *mpol, pgoff_t ilx, enum swap_cache_result *result)
>> >> >> > +{
>> >> >> > +     return NULL;
>> >> >> > +}
>> >> >> > +
>> >> >> >  static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>> >> >> >  {
>> >> >> >       return 0;
>> >> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> >> >> > index 21badd4f0fc7..3edf4b63158d 100644
>> >> >> > --- a/mm/swap_state.c
>> >> >> > +++ b/mm/swap_state.c
>> >> >> > @@ -880,14 +880,13 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>> >> >> >   * in.
>> >> >> >   */
>> >> >> >  static struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>> >> >> > -                               struct vm_fault *vmf, void *shadow)
>> >> >> > +                                struct mempolicy *mpol, pgoff_t ilx,
>> >> >> > +                                void *shadow)
>> >> >> >  {
>> >> >> > -     struct vm_area_struct *vma = vmf->vma;
>> >> >> >       struct folio *folio;
>> >> >> >
>> >> >> > -     /* skip swapcache */
>> >> >> > -     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>> >> >> > -                             vma, vmf->address, false);
>> >> >> > +     folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0,
>> >> >> > +                     mpol, ilx, numa_node_id());
>> >> >> >       if (folio) {
>> >> >> >               if (mem_cgroup_swapin_charge_folio(folio, NULL,
>> >> >> >                                                  GFP_KERNEL, entry)) {
>> >> >> > @@ -943,18 +942,18 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
>> >> >> >               goto done;
>> >> >> >       }
>> >> >> >
>> >> >> > +     mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>> >> >> >       if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
>> >> >> > -             folio = swapin_direct(entry, gfp_mask, vmf, shadow);
>> >> >> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
>> >> >> >               cache_result = SWAP_CACHE_BYPASS;
>> >> >> >       } else {
>> >> >> > -             mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>> >> >> >               if (swap_use_vma_readahead())
>> >> >> >                       folio = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
>> >> >> >               else
>> >> >> >                       folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>> >> >> > -             mpol_cond_put(mpol);
>> >> >> >               cache_result = SWAP_CACHE_MISS;
>> >> >> >       }
>> >> >> > +     mpol_cond_put(mpol);
>> >> >> >  done:
>> >> >> >       if (result)
>> >> >> >               *result = cache_result;
>> >> >> > @@ -962,6 +961,31 @@ struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
>> >> >> >       return folio;
>> >> >> >  }
>> >> >> >
>> >> >> > +struct folio *swapin_entry_mpol(swp_entry_t entry, gfp_t gfp_mask,
>> >> >> > +                             struct mempolicy *mpol, pgoff_t ilx,
>> >> >> > +                             enum swap_cache_result *result)
>> >> >> > +{
>> >> >> > +     enum swap_cache_result cache_result;
>> >> >> > +     void *shadow = NULL;
>> >> >> > +     struct folio *folio;
>> >> >> > +
>> >> >> > +     folio = swap_cache_get_folio(entry, NULL, 0, &shadow);
>> >> >> > +     if (folio) {
>> >> >> > +             cache_result = SWAP_CACHE_HIT;
>> >> >> > +     } else if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
>> >> >> > +             folio = swapin_direct(entry, gfp_mask, mpol, ilx, shadow);
>> >> >> > +             cache_result = SWAP_CACHE_BYPASS;
>> >> >> > +     } else {
>> >> >> > +             folio = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>> >> >> > +             cache_result = SWAP_CACHE_MISS;
>> >> >> > +     }
>> >> >> > +
>> >> >> > +     if (result)
>> >> >> > +             *result = cache_result;
>> >> >> > +
>> >> >> > +     return folio;
>> >> >> > +}
>> >> >> > +
>> >> >> >  #ifdef CONFIG_SYSFS
>> >> >> >  static ssize_t vma_ra_enabled_show(struct kobject *kobj,
>> >> >> >                                    struct kobj_attribute *attr, char *buf)
>> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> > index 5aa44de11edc..2f77bf143af8 100644
>> >> >> > --- a/mm/swapfile.c
>> >> >> > +++ b/mm/swapfile.c
>> >> >> > @@ -1840,18 +1840,13 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>> >> >> >       do {
>> >> >> >               struct folio *folio;
>> >> >> >               unsigned long offset;
>> >> >> > +             struct mempolicy *mpol;
>> >> >> >               unsigned char swp_count;
>> >> >> >               swp_entry_t entry;
>> >> >> > +             pgoff_t ilx;
>> >> >> >               int ret;
>> >> >> >               pte_t ptent;
>> >> >> >
>> >> >> > -             struct vm_fault vmf = {
>> >> >> > -                     .vma = vma,
>> >> >> > -                     .address = addr,
>> >> >> > -                     .real_address = addr,
>> >> >> > -                     .pmd = pmd,
>> >> >> > -             };
>> >> >> > -
>> >> >> >               if (!pte++) {
>> >> >> >                       pte = pte_offset_map(pmd, addr);
>> >> >> >                       if (!pte)
>> >> >> > @@ -1871,8 +1866,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>> >> >> >               pte_unmap(pte);
>> >> >> >               pte = NULL;
>> >> >> >
>> >> >> > -             folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
>> >> >> > -                                  &vmf, NULL);
>> >> >> > +             mpol = get_vma_policy(vma, addr, 0, &ilx);
>> >> >> > +             folio = swapin_entry_mpol(entry, GFP_HIGHUSER_MOVABLE,
>> >> >> > +                                       mpol, ilx, NULL);
>> >> >> > +             mpol_cond_put(mpol);
>> >> >> >               if (!folio) {
>> >> >> >                       /*
>> >> >> >                        * The entry could have been freed, and will not
>> >> >>
>> >> >> IIUC, after the change, we will always use cluster readahead for
>> >> >> swapoff.  This may be OK.  But, at least we need some test results which
>> >> >> show that this will not cause any issue for this behavior change.  And
>> >> >> the behavior change should be described explicitly in patch description.
>> >> >
>> >> > Hi Ying
>> >> >
>> >> > Actually there is a swap_use_no_readahead check in swapin_entry_mpol,
>> >> > so when readahaed is not needed (SYNC_IO), it's just skipped.
>> >> >
>> >> > And I think VMA readahead is not helpful swapoff, swapoff is already
>> >> > walking the VMA, mostly uninterrupted in kernel space. With VMA
>> >> > readahead or not, it will issue IO page by page.
>> >> > The benchmark result I posted before is actually VMA readhead vs
>> >> > no-readahead for ZRAM, sorry I didn't make it clear. It's obvious
>> >> > no-readahead is faster.
>> >> >
>> >> > For actual block device, cluster readahead might be a good choice for
>> >> > swapoff, since all pages will be read for swapoff, there has to be
>> >> > enough memory for all swapcached page to stay in memory or swapoff
>> >> > will fail anyway, and cluster read is faster for block devices.
>> >>
>> >> It's possible.  But please run the tests on some actual block devices
>> >> and show your results.  Random memory accessing pattern should be
>> >> tested, and the swap space usage should be > 50% to show some not so
>> >> friendly situation.
>> >>
>> >
>> > Hi Ying,
>> >
>> > I setup a test environment and did following test, and found that
>> > cluster readahaed for swapoff is actually much worse in default setup:
>> >
>> > 1. Setup MySQL server using 2G memcg, with 28G buffer pool, and 24G NVME swap
>> > 2. Stress test with sysbench for 15min.
>> > 3. Remove the 2G memcg limit and swapoff.
>> >
>> > Before this patch, swapoff will take about 9m.
>> > After this patch, swapoff will take about 30m.
>>
>> Thanks for data!
>>
>> > After some analysis I found the reason is that cluster readahead is
>> > almost disabled (window == 1 or 2) during swapoff, because it will
>> > detect a very low hit rate on fragmented swap. But VMA readhead is
>> > much more aggressive here since swapoff is walking the VMA, with a
>> > very high hit rate.
>> >
>> > But If I force cluster readahead to use a large window for swapoff,
>> > the swapoff performance boost by a lot:
>> > By adding following change in swap_cluster_readahead:
>> >
>> > if (unlikely(!(si->flags & SWP_WRITEOK)))
>> >     mask = max_t(unsigned long, 1 << READ_ONCE(page_cluster),
>> >                  PMD_SIZE / PAGE_SIZE) - 1;
>> >
>> > The swapoff will take only 40s to finish, more than 10x faster than
>> > the VMA readahead path (9m), because VMA readhead is still doing 4K
>> > random IO just with a longer queue due to async readahead. But cluster
>> > readhead will be doing 2M IO now.
>> > I think PMD size window is good here since it still keep a balance
>> > between good IO performance and the swapoff progress can still be
>> > interrupted, and the system is responsible. And in most cases we
>> > expect swapoff to success, if it fail, the RA windows should still
>> > keep the side effect of extra swapcache being generated acceptable.
>>
>> swapoff performance isn't very important because swapoff is a very rare
>> operation.  It's OK to optimize it if the change is simple and doesn't
>> compromise other stuff.  But, as you said below, using large readahead
>> window makes mempolicy issue more serious.  Why isn't the original
>> swapoff performance good enough for you?
>
> Thanks for the reply.
>
> I think I'll just keep the original VMA readahead policy here then.
> Just I noticed that VMA readhead will also violate ranged memory
> policy too... That's some different issue, looks trivial though.

While reviewing your patch, I found that too.  I think they can be
fixed because we have enough information.

--
Best Regards,
Huang, Ying

>>
>> > But this showed a bad effect of ignoring mem policy. Actually this is
>> > not a new issue, cluster readhead is already violating VMA's mem
>> > policy since it does readhead only based on entry value not VMA, the
>> > entry being swapped in is not aware of which VMA it belongs.
>> >
>> > And I was thinking, maybe we can just drop the mpol all the way, and
>> > use the nid from page shadow to alloc pages, that may save a lot of
>> > effort, and make cluster readhead more usable in general, also might
>> > simplify a lot of code. How do you think? If this is acceptable, I
>> > think I can send out a new series first and then rework this one
>> > later.
>>
>> The "shadow" node can be reclaimed, please take a look at
>> scan_shadow_nodes().  Although this hasn't been implemented, it may be
>> implemented someday.
>
> Right, I noticed upstream commit 5649d113ffce ("swap_state: update
> shadow_nodes for anonymous page") started reclaiming anon pages
> shadows now, thanks for the info.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally
  2024-01-10  3:35     ` Kairui Song
@ 2024-01-30  0:39       ` Kairui Song
  2024-01-30  2:01         ` Huang, Ying
  0 siblings, 1 reply; 38+ messages in thread
From: Kairui Song @ 2024-01-30  0:39 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	LKML


On Wed, Jan 10, 2024 at 11:35 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Huang, Ying <ying.huang@intel.com> wrote on Tue, Jan 9, 2024 at 10:05:
> >
> > Kairui Song <ryncsn@gmail.com> writes:
> >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Currently, shmem uses cluster readahead for all swap backends. Cluster
> > > readahead is not a good solution for ramdisk based device (ZRAM) at all.
> > >
> > > After switching to the new helper, most benchmarks showed a good result:
> > >
> > > - Single file sequence read:
> > >   perf stat --repeat 20 dd if=/tmpfs/test of=/dev/null bs=1M count=8192
> > >   (/tmpfs/test is a zero filled file, using brd as swap, 4G memcg limit)
> > >   Before: 22.248 +- 0.549
> > >   After:  22.021 +- 0.684 (-1.1%)
> > >
> > > - Random read stress test:
> > >   fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
> > >   --size=256m --ioengine=mmap --rw=randread --random_distribution=random \
> > >   --time_based --ramp_time=1m --runtime=5m --group_reporting
> > >   (using brd as swap, 2G memcg limit)
> > >
> > >   Before: 1818MiB/s
> > >   After:  1888MiB/s (+3.85%)
> > >
> > > - Zipf biased random read stress test:
> > >   fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
> > >   --size=256m --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
> > >   --time_based --ramp_time=1m --runtime=5m --group_reporting
> > >   (using brd as swap, 2G memcg limit)
> > >
> > >   Before: 31.1GiB/s
> > >   After:  32.3GiB/s (+3.86%)
> > >
> > > So cluster readahead doesn't help much even for single sequence read,
> > > and for random stress test, the performance is better without it.
> > >
> > > Considering both memory and swap device will get more fragmented
> > > slowly, and commonly used ZRAM consumes much more CPU than plain
> > > ramdisk, false readahead could occur more frequently and waste
> > > more CPU. Direct SWAP is cheaper, so use the new helper and skip
> > > read ahead for SWP_SYNCHRONOUS_IO device.
> >
> > It's good to take advantage of swap_direct (no readahead).  I also hopes
> > we can take advantage of VMA based swapin if shmem is accessed via mmap.
> > That appears possible.
>
> Good idea, that should be doable, will update the series.

Hi Ying,

Turns out it's quite complex to do VMA based swapin readahead for shmem:
the VMA address space / page tables don't contain swap entries for shmem.
For anon pages, simply reading nearby page table entries is easy and good
enough, but for shmem the entries are stored in the inode mapping, so the
readahead needs to walk the inode mapping instead. That's doable, but it
requires more work to make it actually usable. I've sent V3 without this
feature; this readahead extension is worth another series.
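Just to sketch what I mean by walking the mapping (untested;
shmem_swapin_one() is a made-up placeholder for whatever would actually
bring one entry into the swap cache):

    /*
     * Shmem keeps swapped-out pages as value entries in the inode's
     * mapping, so a mapping-based readahead has to scan nearby xarray
     * indices instead of nearby page table entries.
     */
    static void shmem_mapping_readahead(struct address_space *mapping,
                                        pgoff_t index, unsigned int win)
    {
        pgoff_t i, start = index > win ? index - win : 0;

        for (i = start; i <= index + win; i++) {
            void *entry = xa_load(&mapping->i_pages, i);

            /* Only value entries represent swapped-out pages. */
            if (!xa_is_value(entry))
                continue;
            shmem_swapin_one(radix_to_swp_entry(entry)); /* placeholder */
        }
    }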


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally
  2024-01-30  0:39       ` Kairui Song
@ 2024-01-30  2:01         ` Huang, Ying
  0 siblings, 0 replies; 38+ messages in thread
From: Huang, Ying @ 2024-01-30  2:01 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Hugh Dickins, Johannes Weiner,
	Matthew Wilcox, Michal Hocko, Yosry Ahmed, David Hildenbrand,
	LKML

Kairui Song <ryncsn@gmail.com> writes:

> On Wed, Jan 10, 2024 at 11:35 AM Kairui Song <ryncsn@gmail.com> wrote:
>>
>> Huang, Ying <ying.huang@intel.com> wrote on Tue, Jan 9, 2024 at 10:05:
>> >
>> > Kairui Song <ryncsn@gmail.com> writes:
>> >
>> > > From: Kairui Song <kasong@tencent.com>
>> > >
>> > > Currently, shmem uses cluster readahead for all swap backends. Cluster
>> > > readahead is not a good solution for ramdisk based device (ZRAM) at all.
>> > >
>> > > After switching to the new helper, most benchmarks showed a good result:
>> > >
>> > > - Single file sequence read:
>> > >   perf stat --repeat 20 dd if=/tmpfs/test of=/dev/null bs=1M count=8192
>> > >   (/tmpfs/test is a zero filled file, using brd as swap, 4G memcg limit)
>> > >   Before: 22.248 +- 0.549
>> > >   After:  22.021 +- 0.684 (-1.1%)
>> > >
>> > > - Random read stress test:
>> > >   fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
>> > >   --size=256m --ioengine=mmap --rw=randread --random_distribution=random \
>> > >   --time_based --ramp_time=1m --runtime=5m --group_reporting
>> > >   (using brd as swap, 2G memcg limit)
>> > >
>> > >   Before: 1818MiB/s
>> > >   After:  1888MiB/s (+3.85%)
>> > >
>> > > - Zipf biased random read stress test:
>> > >   fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
>> > >   --size=256m --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
>> > >   --time_based --ramp_time=1m --runtime=5m --group_reporting
>> > >   (using brd as swap, 2G memcg limit)
>> > >
>> > >   Before: 31.1GiB/s
>> > >   After:  32.3GiB/s (+3.86%)
>> > >
>> > > So cluster readahead doesn't help much even for single sequence read,
>> > > and for random stress test, the performance is better without it.
>> > >
>> > > Considering both memory and swap device will get more fragmented
>> > > slowly, and commonly used ZRAM consumes much more CPU than plain
>> > > ramdisk, false readahead could occur more frequently and waste
>> > > more CPU. Direct SWAP is cheaper, so use the new helper and skip
>> > > read ahead for SWP_SYNCHRONOUS_IO device.
>> >
>> > It's good to take advantage of swap_direct (no readahead).  I also hopes
>> > we can take advantage of VMA based swapin if shmem is accessed via mmap.
>> > That appears possible.
>>
>> Good idea, that should be doable, will update the series.
>
> Hi Ying,
>
> Turns out it's quite complex to do VMA bases swapin readhead for shmem: VMA
> address / Page Tables doesn't contain swapin entry for shmem. For anon page
> simply read nearby page table is easy and good enough, but for shmem, it's
> stored in the inode mapping so the readahead needs to walk the inode
> mapping instead. That's doable but requires more work to make it actually
> usable. I've sent V3 without this feature, worth another series for this
> readahead extension.

Got it.  Thanks for looking at this.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2024-01-30  2:04 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-02 17:53 [PATCH v2 0/9] swapin refactor for optimization and unified readahead Kairui Song
2024-01-02 17:53 ` [PATCH v2 1/9] mm/swapfile.c: add back some comment Kairui Song
2024-01-02 17:53 ` [PATCH v2 2/9] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
2024-01-04  7:28   ` Huang, Ying
2024-01-05  7:43     ` Kairui Song
2024-01-02 17:53 ` [PATCH v2 3/9] mm/swap: avoid doing extra unlock error checks for direct swapin Kairui Song
2024-01-04  8:10   ` Huang, Ying
2024-01-09  9:38     ` Kairui Song
2024-01-02 17:53 ` [PATCH v2 4/9] mm/swap: always account swapped in page into current memcg Kairui Song
2024-01-05  7:14   ` Huang, Ying
2024-01-05  7:33     ` Kairui Song
2024-01-08  7:44       ` Huang, Ying
2024-01-09  9:42         ` Kairui Song
2024-01-02 17:53 ` [PATCH v2 5/9] mm/swap: introduce swapin_entry for unified readahead policy Kairui Song
2024-01-05  7:28   ` Huang, Ying
2024-01-10  2:42     ` Kairui Song
2024-01-02 17:53 ` [PATCH v2 6/9] mm/swap: handle swapcache lookup in swapin_entry Kairui Song
2024-01-08  8:26   ` Huang, Ying
2024-01-10  2:53     ` Kairui Song
2024-01-15  1:45       ` Huang, Ying
2024-01-15 17:11         ` Kairui Song
2024-01-02 17:53 ` [PATCH v2 7/9] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO Kairui Song
2024-01-03 12:50   ` kernel test robot
2024-01-02 17:53 ` [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault Kairui Song
2024-01-09  1:08   ` Huang, Ying
2024-01-10  3:32     ` Kairui Song
2024-01-15  1:52       ` Huang, Ying
2024-01-21 18:40         ` Kairui Song
2024-01-22  6:38           ` Huang, Ying
2024-01-22 11:35             ` Kairui Song
2024-01-24  3:31               ` Huang, Ying
2024-01-02 17:53 ` [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally Kairui Song
2024-01-03 11:56   ` kernel test robot
2024-01-03 13:45   ` kernel test robot
2024-01-09  2:03   ` Huang, Ying
2024-01-10  3:35     ` Kairui Song
2024-01-30  0:39       ` Kairui Song
2024-01-30  2:01         ` Huang, Ying
