* [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization
@ 2024-03-26 18:50 Kairui Song
  2024-03-26 18:50 ` [RFC PATCH 01/10] mm/filemap: split filemap storing logic into a standalone helper Kairui Song
                   ` (10 more replies)
  0 siblings, 11 replies; 28+ messages in thread
From: Kairui Song @ 2024-03-26 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Huang, Ying, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

A month ago a bug was fixed for SWP_SYNCHRONOUS_IO swapin (swap cache
bypass swapin):
https://lore.kernel.org/linux-mm/20240219082040.7495-1-ryncsn@gmail.com/

Because we have to spin on the swap map on race, and the swap map is too
small to contain more usable info, an ugly schedule_timeout_uninterruptible(1)
was added. It's not the first hackish workaround added for the cache bypass
swapin path, and it likely won't be the last. I did many experiments locally
to see if the swap cache bypass path can be dropped while keeping the
performance comparable, and it seems doable.
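
For reference, the workaround in question is the swapcache_prepare()
spin currently in do_swap_page() (patch 2 moves it into swapin_direct());
trimmed for brevity:

  /*
   * Prevent parallel swapin from proceeding with the cache flag.
   * Otherwise, another thread may finish swapin first, free the
   * entry, and swapout reusing the same entry.
   */
  if (swapcache_prepare(entry)) {
          /* Relax a bit to prevent rapid repeated page faults */
          schedule_timeout_uninterruptible(1);
          goto out;
  }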

This series does the following things:
1. Remove swap cache bypass completely.
2. Apply multiple optimizations after that; these optimizations are
   either impossible or very difficult to do without dropping the cache
   bypass swapin path.
3. Use the swap cache as a synchronization layer, and unify some code
   with the page cache (filemap).

As a result, we have:
1. Comparable performance; some tests are even faster.
2. Multi-index support for the swap cache.
3. Many hackish workarounds removed, including the long-tail issue
   mentioned above.

Sending this as an RFC to collect discussion, suggestions, or rejections
early. This probably needs to be split into multiple series, but the
performance is not good until the last patch, so splitting it up front
may make the approach look less convincing. There are also still some
(maybe follow-up) TODO items and room for optimization if we are OK with
this approach.

This is based on another series of mine, which reuses filemap code for the swapcache:
[PATCH v2 0/4] mm/filemap: optimize folio adding and splitting
https://lore.kernel.org/linux-mm/20240325171405.99971-1-ryncsn@gmail.com/

Patch 1/10 introduces a helper on the filemap side to be reused later.
Patches 2/10 and 3/10 are cleanups and preparation for removing the
  swap cache bypass swapin path.
Patch 4/10 removes the swap cache bypass swapin path, and performance
  drops heavily (-28%).
Patch 5/10 applies the first optimization after the removal: since all
  folios go through the swap cache now, explicit shadow clearing is no
  longer needed.
Patch 6/10 applies another optimization after cleaning up the shadow
  clearing routines. The swap cache is now very similar to the page
  cache, so we can simply reuse the page cache code and gain multi-index
  support. Shadow memory usage drops a lot.
Patch 7/10 just renames __read_swap_cache_async; it will be refactored
  into a key part of this series, and the current name is very confusing
  to me.
Patch 8/10 makes the swap cache a synchronization layer and introduces
  two helpers for adding folios to the swap cache; the caller will
  either succeed or get a folio to wait on.
Patch 9/10 applies another optimization. With the above two helpers,
  swap cache lookups can be optimized to avoid false lookups, which
  helps improve performance.
Patch 10/10 applies a major optimization for SWP_SYNCHRONOUS_IO devices;
  after this commit, performance for simple swapin/swapout is basically
  the same as before.

Test 1: sequential swapin/out of 30G of zero pages on ZRAM:

               Before (us)        After (us)
Swapout:       33619409           33886008
Swapin:        32393771           32465441 (- 0.2%)
Swapout (THP): 7817909            6899938  (+11.8%)
Swapin (THP) : 32452387           33193479 (- 2.2%)

And after swapping out 30G with THP, the radix node usage dropped by a
lot:

Before: radix_tree_node 73728K
After:  radix_tree_node  7056K (-94%)

Test 2:
Mysql (16g buffer pool, 32G ZRAM SWAP, 4G memcg, Zswap disabled, THP never)
  sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
  --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
  --threads=48 --time=300 --report-interval=10 run

Before: transactions: 4849.25 per sec
After:  transactions: 4849.40 per sec

Test 3:
Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP never)
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo 100 > /sys/module/zswap/parameters/max_pool_percent
  echo 1 > /sys/module/zswap/parameters/enabled
  echo y > /sys/module/zswap/parameters/shrinker_enabled

  sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
  --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
  --threads=48 --time=600 --report-interval=10 run

Before: transactions: 1662.90 per sec
After:  transactions: 1726.52 per sec

Test 4:
Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP always)
  echo always > /sys/kernel/mm/transparent_hugepage/enabled
  echo 100 > /sys/module/zswap/parameters/max_pool_percent
  echo 1 > /sys/module/zswap/parameters/enabled
  echo y > /sys/module/zswap/parameters/shrinker_enabled

  sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
  --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
  --threads=48 --time=600 --report-interval=10 run

Before: transactions: 2860.90 per sec.
After:  transactions: 2802.55 per sec.

Test 5:
Memtier / memcached (16G brd SWAP, 8G memcg, THP never):

  memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &

  memtier_benchmark -S /tmp/memcached.socket \
    -P memcache_binary -n allkeys --key-minimum=1 \
    --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
    --ratio 1:0 --pipeline 8 -d 1000

Before: 106730.31 Ops/sec
After:  106360.11 Ops/sec

Test 6:
Memtier / memcached (16G brd SWAP, 8G memcg, THP always):

  memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &

  memtier_benchmark -S /tmp/memcached.socket \
    -P memcache_binary -n allkeys --key-minimum=1 \
    --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
    --ratio 1:0 --pipeline 8 -d 1000

Before: 83193.11 Ops/sec
After:  82504.89 Ops/sec

These tests were run under heavy memory stress, and the performance
seems basically the same as before, very slightly better or worse in
certain cases. The benefits of multi-index are basically erased by
fragmentation, but workingset node usage is slightly lower.

Some (possibly follow-up) TODO items if we are OK with this approach:

- I see a slight performance regression in the THP tests and could
  identify a clear hotspot with perf; my guess is that contention on
  the xa_lock is the issue (we have one xa_lock for every 64M of swap
  cache space), and THP handling needs to hold the lock longer than
  usual. Splitting the xa_lock into something more fine-grained seems
  like a good solution. We have SWAP_ADDRESS_SPACE_SHIFT = 14, which
  is not an optimal value: considering XA_CHUNK_SHIFT is 6, we end up
  with three layers of XArray just for 2 extra bits. 12 would be
  better, as it always makes use of the whole XA chunk and needs two
  layers at most, but the duplicated address_space structs also waste
  more memory and cachelines. I saw an observable performance drop
  (~3%) after changing SWAP_ADDRESS_SPACE_SHIFT to 12 (see the rough
  calculation after this list). It might be a good idea to decouple
  the swap cache xarray from address_space (the swap cache
  address_space doesn't have that many users, so the change shouldn't
  come out too dirty).

- Actually, after Patch 4/10 the performance is much better for tests
  limited by a memory cgroup, until 10/10 applies the direct swap cache
  freeing logic for SWP_SYNCHRONOUS_IO swapin. This is because, if the
  swap device is not nearly full, swapin doesn't clear up the swap
  cache, so repeated swapout doesn't need to re-allocate a swap entry,
  making things faster. This may indicate that lazy freeing of swap
  cache could benefit certain workloads and may be worth looking into
  later.

- Now SWP_SYNCHRONOUS_IO swapin will bypass readahead and force-drop
  the swap cache after swapin is done, which can be cleaned up and
  optimized further after this patch. The device type will only
  determine the readahead logic, and the swap cache drop check can be
  based purely on the swap count.

- The recent mTHP swapin/swapout series should have no fundamental
  conflict with this.
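
A rough back-of-the-envelope sketch of the SWAP_ADDRESS_SPACE_SHIFT
trade-off mentioned in the first TODO item above (my own numbers,
assuming 4K pages and XA_CHUNK_SHIFT = 6, i.e. 64 slots per XArray
node):

  SWAP_ADDRESS_SPACE_SHIFT = 14:
    1 << 14 = 16384 slots per address_space -> 16384 * 4K = 64M of swap
    XArray depth for 2^14 indices = DIV_ROUND_UP(14, 6) = 3 levels
  SWAP_ADDRESS_SPACE_SHIFT = 12:
    1 << 12 =  4096 slots per address_space ->  4096 * 4K = 16M of swap
    XArray depth for 2^12 indices = DIV_ROUND_UP(12, 6) = 2 levels,
    but 4x as many address_space structs (and xa_locks) for the same
    amount of swap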

Kairui Song (10):
  mm/filemap: split filemap storing logic into a standalone helper
  mm/swap: move no readahead swapin code to a stand-alone helper
  mm/swap: convert swapin_readahead to return a folio
  mm/swap: remove cache bypass swapin
  mm/swap: clean shadow only in unmap path
  mm/swap: switch to use multi index entries
  mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get
  mm/swap: use swap cache as a synchronization layer
  mm/swap: delay the swap cache look up for swapin
  mm/swap: optimize synchronous swapin

 include/linux/swapops.h |   5 +-
 mm/filemap.c            | 161 +++++++++-----
 mm/huge_memory.c        |  78 +++----
 mm/internal.h           |   2 +
 mm/memory.c             | 133 ++++-------
 mm/shmem.c              |  44 ++--
 mm/swap.h               |  71 ++++--
 mm/swap_state.c         | 478 +++++++++++++++++++++-------------------
 mm/swapfile.c           |  64 +++---
 mm/vmscan.c             |   8 +-
 mm/workingset.c         |   2 +-
 mm/zswap.c              |   4 +-
 12 files changed, 540 insertions(+), 510 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH 01/10] mm/filemap: split filemap storing logic into a standalone helper
  2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
@ 2024-03-26 18:50 ` Kairui Song
  2024-03-26 18:50 ` [RFC PATCH 02/10] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Kairui Song @ 2024-03-26 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Huang, Ying, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The swap cache can reuse this part for multi-index support. No
performance change on the page cache side except noise:

Tested in an 8G memory cgroup with a 16G brd ramdisk.

  echo 3 > /proc/sys/vm/drop_caches

  fio -name=cached --numjobs=16 --filename=/mnt/test.img \
    --buffered=1 --ioengine=mmap --rw=randread --time_based \
    --ramp_time=30s --runtime=5m --group_reporting

Before:
bw (  MiB/s): min=  493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651
iops        : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651

After:
bw (  MiB/s): min=  298, max= 3840, per=100.00%, avg=2614.34, stdev=23.77, samples=8689
iops        : min=76464, max=983045, avg=669270.35, stdev=6084.31, samples=8689

Test result with THP (do a THP randread, then switch to 4K pages, in
the hope that this issues a lot of splitting):

  echo 3 > /proc/sys/vm/drop_caches

  fio -name=cached --numjobs=16 --filename=/mnt/test.img \
      --buffered=1 --ioengine=mmap -thp=1 --readonly \
      --rw=randread --time_based --ramp_time=30s --runtime=10m \
      --group_reporting

  fio -name=cached --numjobs=16 --filename=/mnt/test.img \
      --buffered=1 --ioengine=mmap \
      --rw=randread --time_based --runtime=5s --group_reporting

Before:
bw (  KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146
iops        : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146

READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec

After:
bw (  KiB/s): min= 4691, max=15666, per=100.00%, avg=8890.30, stdev=104.53, samples=19056
iops        : min= 1167, max= 3913, avg=2218.68, stdev=26.15, samples=19056

READ: bw=4590B/s (4590B/s), 4590B/s-4590B/s (4590B/s-4590B/s), io=64.0KiB (65.5kB), run=14275-14275msec
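
For context, patch 6 later reuses the helper introduced here from the
swap cache side; a trimmed sketch of that caller (__filemap_add_swapcache
as it appears later in this series):

  XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
  long nr = folio_nr_pages(folio);

  folio_ref_add(folio, nr);
  ret = __filemap_lock_store(&xas, folio, index, gfp, shadowp);
  if (likely(!ret)) {
          mapping->nrpages += nr;
          __node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
          __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
          xas_unlock_irq(&xas);
  } else {
          folio_put_refs(folio, nr);
  }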

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/filemap.c | 124 +++++++++++++++++++++++++++------------------------
 1 file changed, 65 insertions(+), 59 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 90b86f22a9df..0ccdc9e92764 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -848,38 +848,23 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_folio);
 
-noinline int __filemap_add_folio(struct address_space *mapping,
-		struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
+static int __filemap_lock_store(struct xa_state *xas, struct folio *folio,
+				  pgoff_t index, gfp_t gfp, void **shadowp)
 {
-	XA_STATE(xas, &mapping->i_pages, index);
-	void *alloced_shadow = NULL;
-	int alloced_order = 0;
-	bool huge;
-	long nr;
-
-	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
-	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
-	mapping_set_update(&xas, mapping);
-
-	VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
-	xas_set_order(&xas, index, folio_order(folio));
-	huge = folio_test_hugetlb(folio);
-	nr = folio_nr_pages(folio);
-
+	void *entry, *old, *alloced_shadow = NULL;
+	int order, split_order, alloced_order = 0;
 	gfp &= GFP_RECLAIM_MASK;
-	folio_ref_add(folio, nr);
-	folio->mapping = mapping;
-	folio->index = xas.xa_index;
 
 	for (;;) {
-		int order = -1, split_order = 0;
-		void *entry, *old = NULL;
+		order = -1;
+		split_order = 0;
+		old = NULL;
 
-		xas_lock_irq(&xas);
-		xas_for_each_conflict(&xas, entry) {
+		xas_lock_irq(xas);
+		xas_for_each_conflict(xas, entry) {
 			old = entry;
 			if (!xa_is_value(entry)) {
-				xas_set_err(&xas, -EEXIST);
+				xas_set_err(xas, -EEXIST);
 				goto unlock;
 			}
 			/*
@@ -887,72 +872,93 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 			 * it will be the first and only entry iterated.
 			 */
 			if (order == -1)
-				order = xas_get_order(&xas);
+				order = xas_get_order(xas);
 		}
 
 		/* entry may have changed before we re-acquire the lock */
 		if (alloced_order && (old != alloced_shadow || order != alloced_order)) {
-			xas_destroy(&xas);
+			xas_destroy(xas);
 			alloced_order = 0;
 		}
 
 		if (old) {
 			if (order > 0 && order > folio_order(folio)) {
-				/* How to handle large swap entries? */
-				BUG_ON(shmem_mapping(mapping));
 				if (!alloced_order) {
 					split_order = order;
 					goto unlock;
 				}
-				xas_split(&xas, old, order);
-				xas_reset(&xas);
+				xas_split(xas, old, order);
+				xas_reset(xas);
 			}
 			if (shadowp)
 				*shadowp = old;
 		}
 
-		xas_store(&xas, folio);
-		if (xas_error(&xas))
-			goto unlock;
-
-		mapping->nrpages += nr;
-
-		/* hugetlb pages do not participate in page cache accounting */
-		if (!huge) {
-			__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
-			if (folio_test_pmd_mappable(folio))
-				__lruvec_stat_mod_folio(folio,
-						NR_FILE_THPS, nr);
-		}
-
+		xas_store(xas, folio);
+		if (!xas_error(xas))
+			return 0;
 unlock:
-		xas_unlock_irq(&xas);
+		xas_unlock_irq(xas);
 
 		/* split needed, alloc here and retry. */
 		if (split_order) {
-			xas_split_alloc(&xas, old, split_order, gfp);
-			if (xas_error(&xas))
+			xas_split_alloc(xas, old, split_order, gfp);
+			if (xas_error(xas))
 				goto error;
 			alloced_shadow = old;
 			alloced_order = split_order;
-			xas_reset(&xas);
+			xas_reset(xas);
 			continue;
 		}
 
-		if (!xas_nomem(&xas, gfp))
+		if (!xas_nomem(xas, gfp))
 			break;
 	}
 
-	if (xas_error(&xas))
-		goto error;
-
-	trace_mm_filemap_add_to_page_cache(folio);
-	return 0;
 error:
-	folio->mapping = NULL;
-	/* Leave page->index set: truncation relies upon it */
-	folio_put_refs(folio, nr);
-	return xas_error(&xas);
+	return xas_error(xas);
+}
+
+noinline int __filemap_add_folio(struct address_space *mapping,
+		struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
+{
+	XA_STATE(xas, &mapping->i_pages, index);
+	bool huge;
+	long nr;
+	int ret;
+
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
+	mapping_set_update(&xas, mapping);
+
+	VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
+	xas_set_order(&xas, index, folio_order(folio));
+	huge = folio_test_hugetlb(folio);
+	nr = folio_nr_pages(folio);
+
+	folio_ref_add(folio, nr);
+	folio->mapping = mapping;
+	folio->index = xas.xa_index;
+
+	ret = __filemap_lock_store(&xas, folio, index, gfp, shadowp);
+	if (!ret) {
+		mapping->nrpages += nr;
+		/* hugetlb pages do not participate in page cache accounting */
+		if (!huge) {
+			__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
+			if (folio_test_pmd_mappable(folio))
+				__lruvec_stat_mod_folio(folio,
+						NR_FILE_THPS, nr);
+		}
+		xas_unlock_irq(&xas);
+		trace_mm_filemap_add_to_page_cache(folio);
+	} else {
+		folio->mapping = NULL;
+		/* Leave page->index set: truncation relies upon it */
+		folio_put_refs(folio, nr);
+	}
+
+	return ret;
 }
 ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 02/10] mm/swap: move no readahead swapin code to a stand-alone helper
  2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
  2024-03-26 18:50 ` [RFC PATCH 01/10] mm/filemap: split filemap storing logic into a standalone helper Kairui Song
@ 2024-03-26 18:50 ` Kairui Song
  2024-03-26 18:50 ` [RFC PATCH 03/10] mm/swap: convert swapin_readahead to return a folio Kairui Song
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Kairui Song @ 2024-03-26 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Huang, Ying, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Simply move the routine to a standalone function, for a cleaner split
and to avoid helpers being referenced across multiple files.

Basically no functional change, but the error path is very slightly
different. Previously a mem_cgroup_swapin_charge_folio() failure would
cause a direct OOM; now we go through the error checking path in
do_swap_page(), and if the page is already there, we just return as if
the page fault has been handled.
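
For reference, the fallback it now relies on is the existing !folio
error path in do_swap_page(); roughly (sketched from the current code,
details may differ slightly):

  if (!folio) {
          /*
           * Back out if somebody else faulted in this pte while we
           * released the pte lock.
           */
          vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                         vmf->address, &vmf->ptl);
          if (likely(vmf->pte &&
                     pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                  ret = VM_FAULT_OOM;
          goto unlock;
  }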

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 42 +++-------------------------------
 mm/swap.h       |  8 +++++++
 mm/swap_state.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 71 insertions(+), 39 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..e42fadc25268 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3937,7 +3937,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swp_entry_t entry;
 	pte_t pte;
 	vm_fault_t ret = 0;
-	void *shadow = NULL;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -4001,47 +4000,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/*
-			 * Prevent parallel swapin from proceeding with
-			 * the cache flag. Otherwise, another thread may
-			 * finish swapin first, free the entry, and swapout
-			 * reusing the same entry. It's undetectable as
-			 * pte_same() returns true due to entry reuse.
-			 */
-			if (swapcache_prepare(entry)) {
-				/* Relax a bit to prevent rapid repeated page faults */
-				schedule_timeout_uninterruptible(1);
+			/* skip swapcache and readahead */
+			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
+			if (PTR_ERR(folio) == -EBUSY)
 				goto out;
-			}
 			need_clear_cache = true;
-
-			/* skip swapcache */
-			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-						vma, vmf->address, false);
 			page = &folio->page;
-			if (folio) {
-				__folio_set_locked(folio);
-				__folio_set_swapbacked(folio);
-
-				if (mem_cgroup_swapin_charge_folio(folio,
-							vma->vm_mm, GFP_KERNEL,
-							entry)) {
-					ret = VM_FAULT_OOM;
-					goto out_page;
-				}
-				mem_cgroup_swapin_uncharge_swap(entry);
-
-				shadow = get_shadow_from_swap_cache(entry);
-				if (shadow)
-					workingset_refault(folio, shadow);
-
-				folio_add_lru(folio);
-
-				/* To provide entry to swap_read_folio() */
-				folio->swap = entry;
-				swap_read_folio(folio, true, NULL);
-				folio->private = NULL;
-			}
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						vmf);
diff --git a/mm/swap.h b/mm/swap.h
index fc2f6ade7f80..40e902812cc5 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -55,6 +55,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
 		bool skip_if_exists);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
+struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
+			    struct vm_fault *vmf);
 struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
 			      struct vm_fault *vmf);
 
@@ -87,6 +89,12 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
 	return NULL;
 }
 
+static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
+			struct vm_fault *vmf)
+{
+	return NULL;
+}
+
 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 			struct vm_fault *vmf)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index bfc7e8c58a6d..0a3fa48b3893 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -879,6 +879,66 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	return folio;
 }
 
+/**
+ * swapin_direct - swap in folios skipping swap cache and readahead
+ * @entry: swap entry of this memory
+ * @gfp_mask: memory allocation flags
+ * @vmf: fault information
+ *
+ * Returns the struct folio for entry and addr after the swap entry is read
+ * in.
+ */
+struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
+			    struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct folio *folio;
+	void *shadow = NULL;
+
+	/*
+	 * Prevent parallel swapin from proceeding with
+	 * the cache flag. Otherwise, another thread may
+	 * finish swapin first, free the entry, and swapout
+	 * reusing the same entry. It's undetectable as
+	 * pte_same() returns true due to entry reuse.
+	 */
+	if (swapcache_prepare(entry)) {
+		/* Relax a bit to prevent rapid repeated page faults */
+		schedule_timeout_uninterruptible(1);
+		return ERR_PTR(-EBUSY);
+	}
+
+	/* skip swapcache */
+	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
+				vma, vmf->address, false);
+	if (folio) {
+		__folio_set_locked(folio);
+		__folio_set_swapbacked(folio);
+
+		if (mem_cgroup_swapin_charge_folio(folio,
+					vma->vm_mm, GFP_KERNEL,
+					entry)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			return NULL;
+		}
+		mem_cgroup_swapin_uncharge_swap(entry);
+
+		shadow = get_shadow_from_swap_cache(entry);
+		if (shadow)
+			workingset_refault(folio, shadow);
+
+		folio_add_lru(folio);
+
+		/* To provide entry to swap_read_folio() */
+		folio->swap = entry;
+		swap_read_folio(folio, true, NULL);
+		folio->private = NULL;
+	}
+
+	return folio;
+}
+
 /**
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 03/10] mm/swap: convert swapin_readahead to return a folio
  2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
  2024-03-26 18:50 ` [RFC PATCH 01/10] mm/filemap: split filemap storing logic into a standalone helper Kairui Song
  2024-03-26 18:50 ` [RFC PATCH 02/10] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
@ 2024-03-26 18:50 ` Kairui Song
  2024-03-26 20:03   ` Matthew Wilcox
  2024-03-26 18:50 ` [RFC PATCH 04/10] mm/swap: remove cache bypass swapin Kairui Song
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Kairui Song @ 2024-03-26 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Huang, Ying, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Simplify the caller code logic.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 8 +++-----
 mm/swap.h       | 4 ++--
 mm/swap_state.c | 6 ++----
 mm/swapfile.c   | 5 +----
 4 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e42fadc25268..dfdb620a9123 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4005,12 +4005,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			if (PTR_ERR(folio) == -EBUSY)
 				goto out;
 			need_clear_cache = true;
-			page = &folio->page;
 		} else {
-			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						vmf);
-			if (page)
-				folio = page_folio(page);
+			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
 			swapcache = folio;
 		}
 
@@ -4027,6 +4023,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			goto unlock;
 		}
 
+		page = folio_file_page(folio, swp_offset(entry));
+
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
diff --git a/mm/swap.h b/mm/swap.h
index 40e902812cc5..aee134907a70 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -57,7 +57,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
 			    struct vm_fault *vmf);
-struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
+struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
 			      struct vm_fault *vmf);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
@@ -95,7 +95,7 @@ static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
 	return NULL;
 }
 
-static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
+static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 			struct vm_fault *vmf)
 {
 	return NULL;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0a3fa48b3893..2a9c6bdff5ea 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -951,7 +951,7 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
  * it will read ahead blocks by cluster-based(ie, physical disk based)
  * or vma-based(ie, virtual address based on faulty address) readahead.
  */
-struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
+struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 				struct vm_fault *vmf)
 {
 	struct mempolicy *mpol;
@@ -964,9 +964,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
 	mpol_cond_put(mpol);
 
-	if (!folio)
-		return NULL;
-	return folio_file_page(folio, swp_offset(entry));
+	return folio;
 }
 
 #ifdef CONFIG_SYSFS
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4919423cce76..4dd894395a0f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1883,7 +1883,6 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 		folio = swap_cache_get_folio(entry, vma, addr);
 		if (!folio) {
-			struct page *page;
 			struct vm_fault vmf = {
 				.vma = vma,
 				.address = addr,
@@ -1891,10 +1890,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				.pmd = pmd,
 			};
 
-			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						&vmf);
-			if (page)
-				folio = page_folio(page);
 		}
 		if (!folio) {
 			swp_count = READ_ONCE(si->swap_map[offset]);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 04/10] mm/swap: remove cache bypass swapin
  2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
                   ` (2 preceding siblings ...)
  2024-03-26 18:50 ` [RFC PATCH 03/10] mm/swap: convert swapin_readahead to return a folio Kairui Song
@ 2024-03-26 18:50 ` Kairui Song
  2024-03-27  6:30   ` Huang, Ying
  2024-03-26 18:50 ` [RFC PATCH 05/10] mm/swap: clean shadow only in unmap path Kairui Song
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Kairui Song @ 2024-03-26 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Huang, Ying, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

We used to have the cache bypass swapin path for better performance,
but by removing it, more optimizations can be applied, giving even
better overall performance with less hackish code.

These optimizations are not easily doable, or not doable at all,
without this.

This patch simply removes it, and performance drops heavily for simple
swapin. Things won't get this bad for real workloads, but the drop is
still observable. Following commits will fix this and achieve better
performance.

Swapout/in of 30G of zero pages from ZRAM (this mostly measures the
overhead of the swap path itself, because zero pages are not compressed
but simply recorded in ZRAM, and performance drops more as the SWAP
device gets full):

Test result of sequential swapin/out:

               Before (us)        After (us)
Swapout:       33619409           33624641
Swapin:        32393771           41614858 (-28.4%)
Swapout (THP): 7817909            7795530
Swapin (THP) : 32452387           41708471 (-28.4%)

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     | 18 ++++-------------
 mm/swap.h       | 10 +++++-----
 mm/swap_state.c | 53 ++++++++++---------------------------------------
 mm/swapfile.c   | 13 ------------
 4 files changed, 19 insertions(+), 75 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index dfdb620a9123..357d239ee2f6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3932,7 +3932,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct page *page;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
-	bool need_clear_cache = false;
 	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
@@ -4000,14 +3999,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/* skip swapcache and readahead */
 			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
-			if (PTR_ERR(folio) == -EBUSY)
-				goto out;
-			need_clear_cache = true;
 		} else {
 			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
-			swapcache = folio;
 		}
 
 		if (!folio) {
@@ -4023,6 +4017,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			goto unlock;
 		}
 
+		swapcache = folio;
 		page = folio_file_page(folio, swp_offset(entry));
 
 		/* Had to read the page from swap area: Major fault */
@@ -4187,7 +4182,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	vmf->orig_pte = pte;
 
 	/* ksm created a completely new copy */
-	if (unlikely(folio != swapcache && swapcache)) {
+	if (unlikely(folio != swapcache)) {
 		folio_add_new_anon_rmap(folio, vma, vmf->address);
 		folio_add_lru_vma(folio, vma);
 	} else {
@@ -4201,7 +4196,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 
 	folio_unlock(folio);
-	if (folio != swapcache && swapcache) {
+	if (folio != swapcache) {
 		/*
 		 * Hold the lock to avoid the swap entry to be reused
 		 * until we take the PT lock for the pte_same() check
@@ -4227,9 +4222,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
-	/* Clear the swap cache pin for direct swapin after PTL unlock */
-	if (need_clear_cache)
-		swapcache_clear(si, entry);
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4240,12 +4232,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	folio_unlock(folio);
 out_release:
 	folio_put(folio);
-	if (folio != swapcache && swapcache) {
+	if (folio != swapcache) {
 		folio_unlock(swapcache);
 		folio_put(swapcache);
 	}
-	if (need_clear_cache)
-		swapcache_clear(si, entry);
 	if (si)
 		put_swap_device(si);
 	return ret;
diff --git a/mm/swap.h b/mm/swap.h
index aee134907a70..ac9573b03432 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -41,7 +41,6 @@ void __delete_from_swap_cache(struct folio *folio,
 void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
 struct folio *swap_cache_get_folio(swp_entry_t entry,
 		struct vm_area_struct *vma, unsigned long addr);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
@@ -100,14 +99,15 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 {
 	return NULL;
 }
-
-static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
+static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
+			struct vm_fault *vmf)
 {
-	return 0;
+	return NULL;
 }
 
-static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
 {
+	return 0;
 }
 
 static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2a9c6bdff5ea..49ef6250f676 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -880,61 +880,28 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 }
 
 /**
- * swapin_direct - swap in folios skipping swap cache and readahead
+ * swapin_direct - swap in folios skipping readahead
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @vmf: fault information
  *
- * Returns the struct folio for entry and addr after the swap entry is read
- * in.
+ * Returns the folio for entry after it is read in.
  */
 struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
 			    struct vm_fault *vmf)
 {
-	struct vm_area_struct *vma = vmf->vma;
+	struct mempolicy *mpol;
 	struct folio *folio;
-	void *shadow = NULL;
-
-	/*
-	 * Prevent parallel swapin from proceeding with
-	 * the cache flag. Otherwise, another thread may
-	 * finish swapin first, free the entry, and swapout
-	 * reusing the same entry. It's undetectable as
-	 * pte_same() returns true due to entry reuse.
-	 */
-	if (swapcache_prepare(entry)) {
-		/* Relax a bit to prevent rapid repeated page faults */
-		schedule_timeout_uninterruptible(1);
-		return ERR_PTR(-EBUSY);
-	}
-
-	/* skip swapcache */
-	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-				vma, vmf->address, false);
-	if (folio) {
-		__folio_set_locked(folio);
-		__folio_set_swapbacked(folio);
-
-		if (mem_cgroup_swapin_charge_folio(folio,
-					vma->vm_mm, GFP_KERNEL,
-					entry)) {
-			folio_unlock(folio);
-			folio_put(folio);
-			return NULL;
-		}
-		mem_cgroup_swapin_uncharge_swap(entry);
-
-		shadow = get_shadow_from_swap_cache(entry);
-		if (shadow)
-			workingset_refault(folio, shadow);
+	bool page_allocated;
+	pgoff_t ilx;
 
-		folio_add_lru(folio);
+	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
+	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+					&page_allocated, false);
+	mpol_cond_put(mpol);
 
-		/* To provide entry to swap_read_folio() */
-		folio->swap = entry;
+	if (page_allocated)
 		swap_read_folio(folio, true, NULL);
-		folio->private = NULL;
-	}
 
 	return folio;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4dd894395a0f..ae8d3aa05df7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3389,19 +3389,6 @@ int swapcache_prepare(swp_entry_t entry)
 	return __swap_duplicate(entry, SWAP_HAS_CACHE);
 }
 
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
-{
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-	unsigned char usage;
-
-	ci = lock_cluster_or_swap_info(si, offset);
-	usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
-	unlock_cluster_or_swap_info(si, ci);
-	if (!usage)
-		free_swap_slot(entry);
-}
-
 struct swap_info_struct *swp_swap_info(swp_entry_t entry)
 {
 	return swap_type_to_swap_info(swp_type(entry));
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 05/10] mm/swap: clean shadow only in unmap path
  2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
                   ` (3 preceding siblings ...)
  2024-03-26 18:50 ` [RFC PATCH 04/10] mm/swap: remove cache bypass swapin Kairui Song
@ 2024-03-26 18:50 ` Kairui Song
  2024-03-26 18:50 ` [RFC PATCH 06/10] mm/swap: switch to use multi index entries Kairui Song
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Kairui Song @ 2024-03-26 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Huang, Ying, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

After removing the cache bypass swapin, the first thing that can go is
all the clear_shadow_from_swap_cache calls. Currently
clear_shadow_from_swap_cache is called in many paths: it is invoked by
swap_range_free, which has two direct callers:

- swap_free_cluster, which is only called by put_swap_folio to free up
  the shadow of a slot cluster.
- swap_entry_free, which is only called by swapcache_free_entries to
  free up shadow of a slot.

And these two are very commonly used everywhere in SWAP codes.

Notice the shadow is only written by __delete_from_swap_cache after a
successful swap out, so clearly we only want to clear the shadow after
swap in (the shadow has been used and is no longer needed) or on
unmap/MADV_FREE.

Now that all swapin uses the cached swapin path,
clear_shadow_from_swap_cache is no longer needed for swapin, because we
have to insert the folio first, and inserting it already removes the
shadow. So we only need to clear the shadow for unmap/MADV_FREE.

All direct/indirect caller of swap_free_cluster and swap_entry_free
are listed below:

- swap_free_cluster:
  -> put_swap_folio (Clean the cache flag and try delete shadow, after
                     removing the cache or error handling)
    -> delete_from_swap_cache
    -> __remove_mapping
    -> shmem_writepage
    -> folio_alloc_swap
    -> add_to_swap
    -> __read_swap_cache_async
- swap_entry_free
  -> swapcache_free_entries
    -> drain_slots_cache_cpu
    -> free_swap_slot
      -> put_swap_folio (Already covered above)
      -> __swap_entry_free / swap_free
        -> free_swap_and_cache (Called by Unmap/Zap/MADV_FREE)
          -> madvise_free_single_vma
          -> unmap_page_range
          -> shmem_undo_range
        -> swap_free (Called by swapin path)
          -> do_swap_page (Swapin path)
          -> alloc_swapdev_block/free_all_swap_pages ()
          -> try_to_unmap_one (Error handling, no shadow)
          -> shmem_set_folio_swapin_error (Shadow just gone)
          -> shmem_swapin_folio (Shmem's do_swap_page)
          -> unuse_pte (Swapoff, which always use swapcache)

So now we only need to call clear_shadow_from_swap_cache in
free_swap_and_cache, because all swapin/out goes through the swap cache
now. Previously, all of the above functions could invoke
clear_shadow_from_swap_cache in case a cache bypass swapin left an
entry with an uncleared shadow.

Also make clear_shadow_from_swap_cache only clear one entry for
simplicity.

Test result of sequential swapin/out:

               Before (us)        After (us)
Swapout:       33624641           33648529
Swapin:        41614858           40667696 (+2.3%)
Swapout (THP): 7795530            7658664
Swapin (THP) : 41708471           40602278 (+2.7%)

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h       |  6 ++----
 mm/swap_state.c | 33 ++++++++-------------------------
 mm/swapfile.c   |  6 ++++--
 3 files changed, 14 insertions(+), 31 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index ac9573b03432..7721ddb3bdbc 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -39,8 +39,7 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
 void __delete_from_swap_cache(struct folio *folio,
 			      swp_entry_t entry, void *shadow);
 void delete_from_swap_cache(struct folio *folio);
-void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				  unsigned long end);
+void clear_shadow_from_swap_cache(swp_entry_t entry);
 struct folio *swap_cache_get_folio(swp_entry_t entry,
 		struct vm_area_struct *vma, unsigned long addr);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
@@ -148,8 +147,7 @@ static inline void delete_from_swap_cache(struct folio *folio)
 {
 }
 
-static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				unsigned long end)
+static inline void clear_shadow_from_swap_cache(swp_entry_t entry)
 {
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 49ef6250f676..b84e7b0ea4a5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -245,34 +245,17 @@ void delete_from_swap_cache(struct folio *folio)
 	folio_ref_sub(folio, folio_nr_pages(folio));
 }
 
-void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				unsigned long end)
+void clear_shadow_from_swap_cache(swp_entry_t entry)
 {
-	unsigned long curr = begin;
-	void *old;
-
-	for (;;) {
-		swp_entry_t entry = swp_entry(type, curr);
-		struct address_space *address_space = swap_address_space(entry);
-		XA_STATE(xas, &address_space->i_pages, curr);
-
-		xas_set_update(&xas, workingset_update_node);
+	struct address_space *address_space = swap_address_space(entry);
+	XA_STATE(xas, &address_space->i_pages, swp_offset(entry));
 
-		xa_lock_irq(&address_space->i_pages);
-		xas_for_each(&xas, old, end) {
-			if (!xa_is_value(old))
-				continue;
-			xas_store(&xas, NULL);
-		}
-		xa_unlock_irq(&address_space->i_pages);
+	xas_set_update(&xas, workingset_update_node);
 
-		/* search the next swapcache until we meet end */
-		curr >>= SWAP_ADDRESS_SPACE_SHIFT;
-		curr++;
-		curr <<= SWAP_ADDRESS_SPACE_SHIFT;
-		if (curr > end)
-			break;
-	}
+	xa_lock_irq(&address_space->i_pages);
+	if (xa_is_value(xas_load(&xas)))
+		xas_store(&xas, NULL);
+	xa_unlock_irq(&address_space->i_pages);
 }
 
 /*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ae8d3aa05df7..bafae23c0f26 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -724,7 +724,6 @@ static void add_to_avail_list(struct swap_info_struct *p)
 static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 			    unsigned int nr_entries)
 {
-	unsigned long begin = offset;
 	unsigned long end = offset + nr_entries - 1;
 	void (*swap_slot_free_notify)(struct block_device *, unsigned long);
 
@@ -748,7 +747,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 			swap_slot_free_notify(si->bdev, offset);
 		offset++;
 	}
-	clear_shadow_from_swap_cache(si->type, begin, end);
 
 	/*
 	 * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1605,6 +1603,8 @@ bool folio_free_swap(struct folio *folio)
 /*
  * Free the swap entry like above, but also try to
  * free the page cache entry if it is the last user.
+ * Useful when clearing the swap map and swap cache
+ * without reading swap content (eg. unmap, MADV_FREE)
  */
 int free_swap_and_cache(swp_entry_t entry)
 {
@@ -1626,6 +1626,8 @@ int free_swap_and_cache(swp_entry_t entry)
 		    !swap_page_trans_huge_swapped(p, entry))
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
+		if (!count)
+			clear_shadow_from_swap_cache(entry);
 		put_swap_device(p);
 	}
 	return p != NULL;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 06/10] mm/swap: switch to use multi index entries
  2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
                   ` (4 preceding siblings ...)
  2024-03-26 18:50 ` [RFC PATCH 05/10] mm/swap: clean shadow only in unmap path Kairui Song
@ 2024-03-26 18:50 ` Kairui Song
  2024-03-26 18:50 ` [RFC PATCH 07/10] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get Kairui Song
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Kairui Song @ 2024-03-26 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Huang, Ying, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Since all explicit shadow clearing is now gone and every swapin/swapout
path uses the swap cache, switch the swap cache to use multi-index
entries so that swapping out THP will be faster and use less memory.

Test result of sequential swapin/out of 30G zero page on ZRAM:

               Before (us)        After (us)
Swapout:       33648529           33713283
Swapin:        40667696           40954646
Swapout (THP): 7658664            6921176  (+9.7%)
Swapin (THP) : 40602278           40891953

And after swapping out 30G with THP, the radix node usage dropped by a
lot:

Before: radix_tree_node 73728K
After:  radix_tree_node  7056K (-94%)

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/filemap.c     | 27 +++++++++++++++++
 mm/huge_memory.c | 77 +++++++++++++++++++-----------------------------
 mm/internal.h    |  2 ++
 mm/swap_state.c  | 54 ++++++++++-----------------------
 4 files changed, 75 insertions(+), 85 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 0ccdc9e92764..5e8e3fd26b8d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -919,6 +919,33 @@ static int __filemap_lock_store(struct xa_state *xas, struct folio *folio,
 	return xas_error(xas);
 }
 
+int __filemap_add_swapcache(struct address_space *mapping, struct folio *folio,
+			    pgoff_t index, gfp_t gfp, void **shadowp)
+{
+	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
+	long nr;
+	int ret;
+
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+	mapping_set_update(&xas, mapping);
+
+	nr = folio_nr_pages(folio);
+	folio_ref_add(folio, nr);
+
+	ret = __filemap_lock_store(&xas, folio, index, gfp, shadowp);
+	if (likely(!ret)) {
+		mapping->nrpages += nr;
+		__node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
+		__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
+		xas_unlock_irq(&xas);
+	} else {
+		folio_put_refs(folio, nr);
+	}
+
+	return ret;
+}
+
 noinline int __filemap_add_folio(struct address_space *mapping,
 		struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..4fd2f74b94a9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2886,14 +2886,12 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
 	lru_add_page_tail(head, page_tail, lruvec, list);
 }
 
-static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned int new_order)
+static void __split_huge_page(struct address_space *mapping, struct page *page,
+			      struct list_head *list, pgoff_t end, unsigned int new_order)
 {
 	struct folio *folio = page_folio(page);
 	struct page *head = &folio->page;
 	struct lruvec *lruvec;
-	struct address_space *swap_cache = NULL;
-	unsigned long offset = 0;
 	int i, nr_dropped = 0;
 	unsigned int new_nr = 1 << new_order;
 	int order = folio_order(folio);
@@ -2902,12 +2900,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	/* complete memcg works before add pages to LRU */
 	split_page_memcg(head, order, new_order);
 
-	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
-		offset = swp_offset(folio->swap);
-		swap_cache = swap_address_space(folio->swap);
-		xa_lock(&swap_cache->i_pages);
-	}
-
 	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
 	lruvec = folio_lruvec_lock(folio);
 
@@ -2919,18 +2911,18 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		if (head[i].index >= end) {
 			struct folio *tail = page_folio(head + i);
 
-			if (shmem_mapping(folio->mapping))
+			if (shmem_mapping(mapping))
 				nr_dropped++;
 			else if (folio_test_clear_dirty(tail))
 				folio_account_cleaned(tail,
-					inode_to_wb(folio->mapping->host));
+					inode_to_wb(mapping->host));
 			__filemap_remove_folio(tail, NULL);
 			folio_put(tail);
 		} else if (!PageAnon(page)) {
-			__xa_store(&folio->mapping->i_pages, head[i].index,
+			__xa_store(&mapping->i_pages, head[i].index,
 					head + i, 0);
-		} else if (swap_cache) {
-			__xa_store(&swap_cache->i_pages, offset + i,
+		} else if (folio_test_swapcache(folio)) {
+			__xa_store(&mapping->i_pages, swp_offset(folio->swap) + i,
 					head + i, 0);
 		}
 	}
@@ -2948,23 +2940,17 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	split_page_owner(head, order, new_order);
 
 	/* See comment in __split_huge_page_tail() */
-	if (folio_test_anon(folio)) {
+	if (mapping) {
 		/* Additional pin to swap cache */
-		if (folio_test_swapcache(folio)) {
-			folio_ref_add(folio, 1 + new_nr);
-			xa_unlock(&swap_cache->i_pages);
-		} else {
-			folio_ref_inc(folio);
-		}
-	} else {
-		/* Additional pin to page cache */
 		folio_ref_add(folio, 1 + new_nr);
-		xa_unlock(&folio->mapping->i_pages);
+		xa_unlock(&mapping->i_pages);
+	} else {
+		folio_ref_inc(folio);
 	}
 	local_irq_enable();
 
 	if (nr_dropped)
-		shmem_uncharge(folio->mapping->host, nr_dropped);
+		shmem_uncharge(mapping->host, nr_dropped);
 	remap_page(folio, nr);
 
 	if (folio_test_swapcache(folio))
@@ -3043,11 +3029,12 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	/* reset xarray order to new order after split */
 	XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
+	struct address_space *mapping = folio_mapping(folio);
 	struct anon_vma *anon_vma = NULL;
-	struct address_space *mapping = NULL;
 	int extra_pins, ret;
 	pgoff_t end;
 	bool is_hzp;
+	gfp_t gfp;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -3079,7 +3066,6 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		}
 	}
 
-
 	is_hzp = is_huge_zero_page(&folio->page);
 	if (is_hzp) {
 		pr_warn_ratelimited("Called split_huge_page for huge zero page\n");
@@ -3089,6 +3075,17 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	if (folio_test_writeback(folio))
 		return -EBUSY;
 
+	if (mapping) {
+		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
+					  GFP_RECLAIM_MASK);
+
+		xas_split_alloc(&xas, folio, folio_order(folio), gfp);
+		if (xas_error(&xas)) {
+			ret = xas_error(&xas);
+			goto out;
+		}
+	}
+
 	if (folio_test_anon(folio)) {
 		/*
 		 * The caller does not necessarily hold an mmap_lock that would
@@ -3104,33 +3101,19 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			goto out;
 		}
 		end = -1;
-		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
-		gfp_t gfp;
-
-		mapping = folio->mapping;
-
 		/* Truncated ? */
 		if (!mapping) {
 			ret = -EBUSY;
 			goto out;
 		}
 
-		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
-							GFP_RECLAIM_MASK);
-
 		if (!filemap_release_folio(folio, gfp)) {
 			ret = -EBUSY;
 			goto out;
 		}
 
-		xas_split_alloc(&xas, folio, folio_order(folio), gfp);
-		if (xas_error(&xas)) {
-			ret = xas_error(&xas);
-			goto out;
-		}
-
 		anon_vma = NULL;
 		i_mmap_lock_read(mapping);
 
@@ -3189,7 +3172,9 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			int nr = folio_nr_pages(folio);
 
 			xas_split(&xas, folio, folio_order(folio));
-			if (folio_test_pmd_mappable(folio) &&
+
+			if (!folio_test_anon(folio) &&
+			    folio_test_pmd_mappable(folio) &&
 			    new_order < HPAGE_PMD_ORDER) {
 				if (folio_test_swapbacked(folio)) {
 					__lruvec_stat_mod_folio(folio,
@@ -3202,7 +3187,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			}
 		}
 
-		__split_huge_page(page, list, end, new_order);
+		__split_huge_page(mapping, page, list, end, new_order);
 		ret = 0;
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
@@ -3218,9 +3203,9 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	if (anon_vma) {
 		anon_vma_unlock_write(anon_vma);
 		put_anon_vma(anon_vma);
-	}
-	if (mapping)
+	} else {
 		i_mmap_unlock_read(mapping);
+	}
 out:
 	xas_destroy(&xas);
 	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
diff --git a/mm/internal.h b/mm/internal.h
index 7e486f2c502c..b2bbfd3c2b50 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1059,6 +1059,8 @@ struct migration_target_control {
  */
 size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
 			      struct folio *folio, loff_t fpos, size_t size);
+int __filemap_add_swapcache(struct address_space *mapping, struct folio *folio,
+			    pgoff_t index, gfp_t gfp, void **shadowp);
 
 /*
  * mm/vmalloc.c
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b84e7b0ea4a5..caf69696f47c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -90,48 +90,22 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
 {
 	struct address_space *address_space = swap_address_space(entry);
 	pgoff_t idx = swp_offset(entry);
-	XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
-	unsigned long i, nr = folio_nr_pages(folio);
-	void *old;
-
-	xas_set_update(&xas, workingset_update_node);
+	int ret;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
 
-	folio_ref_add(folio, nr);
 	folio_set_swapcache(folio);
 	folio->swap = entry;
 
-	do {
-		xas_lock_irq(&xas);
-		xas_create_range(&xas);
-		if (xas_error(&xas))
-			goto unlock;
-		for (i = 0; i < nr; i++) {
-			VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
-			if (shadowp) {
-				old = xas_load(&xas);
-				if (xa_is_value(old))
-					*shadowp = old;
-			}
-			xas_store(&xas, folio);
-			xas_next(&xas);
-		}
-		address_space->nrpages += nr;
-		__node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
-		__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
-unlock:
-		xas_unlock_irq(&xas);
-	} while (xas_nomem(&xas, gfp));
-
-	if (!xas_error(&xas))
-		return 0;
+	ret = __filemap_add_swapcache(address_space, folio, idx, gfp, shadowp);
+	if (ret) {
+		folio_clear_swapcache(folio);
+		folio->swap.val = 0;
+	}
 
-	folio_clear_swapcache(folio);
-	folio_ref_sub(folio, nr);
-	return xas_error(&xas);
+	return ret;
 }
 
 /*
@@ -142,7 +116,6 @@ void __delete_from_swap_cache(struct folio *folio,
 			swp_entry_t entry, void *shadow)
 {
 	struct address_space *address_space = swap_address_space(entry);
-	int i;
 	long nr = folio_nr_pages(folio);
 	pgoff_t idx = swp_offset(entry);
 	XA_STATE(xas, &address_space->i_pages, idx);
@@ -153,11 +126,9 @@ void __delete_from_swap_cache(struct folio *folio,
 	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
 
-	for (i = 0; i < nr; i++) {
-		void *entry = xas_store(&xas, shadow);
-		VM_BUG_ON_PAGE(entry != folio, entry);
-		xas_next(&xas);
-	}
+	xas_set_order(&xas, idx, folio_order(folio));
+	xas_store(&xas, shadow);
+
 	folio->swap.val = 0;
 	folio_clear_swapcache(folio);
 	address_space->nrpages -= nr;
@@ -252,6 +223,11 @@ void clear_shadow_from_swap_cache(swp_entry_t entry)
 
 	xas_set_update(&xas, workingset_update_node);
 
+	/*
+	 * On unmap, it may delete a larger order shadow here. It's mostly
+	 * fine since folios that are not entirely mapped get split on swap
+	 * out and leave shadows with order 0.
+	 */
 	xa_lock_irq(&address_space->i_pages);
 	if (xa_is_value(xas_load(&xas)))
 		xas_store(&xas, NULL);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 07/10] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get
  2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
                   ` (5 preceding siblings ...)
  2024-03-26 18:50 ` [RFC PATCH 06/10] mm/swap: switch to use multi index entries Kairui Song
@ 2024-03-26 18:50 ` Kairui Song
  2024-03-26 18:50 ` [RFC PATCH 08/10] mm/swap: use swap cache as a synchronization layer Kairui Song
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Kairui Song @ 2024-03-26 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Huang, Ying, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

__read_swap_cache_async is widely used to allocate a folio and ensure
it is in the swap cache, or to get the existing folio if one is already
there. Rename it to better reflect its usage.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h       |  2 +-
 mm/swap_state.c | 22 +++++++++++-----------
 mm/swapfile.c   |  2 +-
 mm/zswap.c      |  2 +-
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 7721ddb3bdbc..5fbbc4a42787 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -48,7 +48,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
 		struct swap_iocb **plug);
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
+struct folio *swap_cache_alloc_or_get(swp_entry_t entry, gfp_t gfp_flags,
 		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
 		bool skip_if_exists);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index caf69696f47c..cd1a16afcd9f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -385,7 +385,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 	return folio;
 }
 
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
+struct folio *swap_cache_alloc_or_get(swp_entry_t entry, gfp_t gfp_mask,
 		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
 		bool skip_if_exists)
 {
@@ -443,12 +443,12 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			goto fail_put_swap;
 
 		/*
-		 * Protect against a recursive call to __read_swap_cache_async()
+		 * Protect against a recursive call to swap_cache_alloc_or_get()
 		 * on the same entry waiting forever here because SWAP_HAS_CACHE
 		 * is set but the folio is not the swap cache yet. This can
 		 * happen today if mem_cgroup_swapin_charge_folio() below
 		 * triggers reclaim through zswap, which may call
-		 * __read_swap_cache_async() in the writeback path.
+		 * swap_cache_alloc_or_get() in the writeback path.
 		 */
 		if (skip_if_exists)
 			goto fail_put_swap;
@@ -457,7 +457,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * We might race against __delete_from_swap_cache(), and
 		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
 		 * has not yet been cleared.  Or race against another
-		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
+		 * swap_cache_alloc_or_get(), which has set SWAP_HAS_CACHE
 		 * in swap_map, but not yet added its folio to swap cache.
 		 */
 		schedule_timeout_uninterruptible(1);
@@ -505,7 +505,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
  * the swap entry is no longer in use.
  *
  * get/put_swap_device() aren't needed to call this function, because
- * __read_swap_cache_async() call them and swap_read_folio() holds the
+ * swap_cache_alloc_or_get() call them and swap_read_folio() holds the
  * swap cache folio lock.
  */
 struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
@@ -518,7 +518,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	struct folio *folio;
 
 	mpol = get_vma_policy(vma, addr, 0, &ilx);
-	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+	folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
 	mpol_cond_put(mpol);
 
@@ -634,7 +634,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
-		folio = __read_swap_cache_async(
+		folio = swap_cache_alloc_or_get(
 				swp_entry(swp_type(entry), offset),
 				gfp_mask, mpol, ilx, &page_allocated, false);
 		if (!folio)
@@ -653,7 +653,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
 	/* The page was likely read above, so no need for plugging here */
-	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+	folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
 	if (unlikely(page_allocated)) {
 		zswap_folio_swapin(folio);
@@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 			continue;
 		pte_unmap(pte);
 		pte = NULL;
-		folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+		folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
 						&page_allocated, false);
 		if (!folio)
 			continue;
@@ -829,7 +829,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	lru_add_drain();
 skip:
 	/* The folio was likely read above, so no need for plugging here */
-	folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
+	folio = swap_cache_alloc_or_get(targ_entry, gfp_mask, mpol, targ_ilx,
 					&page_allocated, false);
 	if (unlikely(page_allocated)) {
 		zswap_folio_swapin(folio);
@@ -855,7 +855,7 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
 	pgoff_t ilx;
 
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
-	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+	folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
 	mpol_cond_put(mpol);
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bafae23c0f26..332ce4e578e8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1249,7 +1249,7 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
  *   CPU1				CPU2
  *   do_swap_page()
  *     ...				swapoff+swapon
- *     __read_swap_cache_async()
+ *     swap_cache_alloc_or_get()
  *       swapcache_prepare()
  *         __swap_duplicate()
  *           // check swap_map
diff --git a/mm/zswap.c b/mm/zswap.c
index 9dec853647c8..e4d96816be70 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1126,7 +1126,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 
 	/* try to allocate swap cache folio */
 	mpol = get_task_policy(current);
-	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
+	folio = swap_cache_alloc_or_get(swpentry, GFP_KERNEL, mpol,
 				NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
 	if (!folio)
 		return -ENOMEM;
-- 
2.43.0



* [RFC PATCH 08/10] mm/swap: use swap cache as a synchronization layer
  2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
                   ` (6 preceding siblings ...)
  2024-03-26 18:50 ` [RFC PATCH 07/10] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get Kairui Song
@ 2024-03-26 18:50 ` Kairui Song
  2024-03-26 18:50 ` [RFC PATCH 09/10] mm/swap: delay the swap cache lookup for swapin Kairui Song
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Kairui Song @ 2024-03-26 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Huang, Ying, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Rework how swapins are synchronized. Instead of spinning on the swap
map, simply use swap cache insertion as the synchronization point. The
winner inserts a new locked folio, the loser just gets that locked
folio and waits on unlock. This way we don't need any extra mechanism
and have a unified way to ensure all swapins are race free.

The swap map is not removed: the HAS_CACHE bit is still set, but it is
now updated in a particular order and stays eventually consistent with
the xarray state (this works fine with slot cache reservation). It is
mostly used for fast cache state lookups.

Two helpers can now be used to add a folio to the swap cache:
- swap_cache_add_or_get, for adding a folio with an in-use entry (swapin).
- swap_cache_add_wait, for adding a folio with a freed entry (swapout).

swap_cache_add_or_get adds a folio to the swap cache. It returns NULL
if the folio is already swapped in or we are hitting OOM, and it
follows these steps (see the condensed sketch after this list):
1. The caller must ensure the folio is newly allocated; this helper
   locks the folio.
2. Try to add the folio to the XArray (add_to_swap_cache).
3. If (2) succeeded, try to set SWAP_HAS_CACHE with swapcache_prepare.
   This step can now only fail if the entry has been freed, which
   indicates the folio was swapped in by someone else; if so, revert
   the above steps and return NULL.
4. If (2) failed, look up and return the folio already in the swap
   cache. If a folio is returned, the caller should try to lock it and
   check whether PG_swapcache is still set. If not, the racer hit OOM
   or the folio is already swapped in, which is easy to tell (by
   checking the page table). The caller can bail out or retry
   conditionally.
5. If (4) failed to get a folio, it should have been swapped in by
   someone else, or the racer hit OOM.
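
For readers following the diff below, here is a condensed, illustrative
sketch of the flow above. The helper name add_or_get_sketch is made up
for this sketch, and error handling plus the memcg charge / shadow
refault steps are trimmed; the authoritative version is
swap_cache_add_or_get() in the patch itself:

static struct folio *add_or_get_sketch(struct folio *new,
				       swp_entry_t entry, gfp_t gfp)
{
	void *shadow = NULL;
	struct folio *old;

	/* Step 1: the caller passed a freshly allocated folio, lock it. */
	__folio_set_locked(new);
	__folio_set_swapbacked(new);

	/* Step 2: try to insert the folio into the swap cache xarray. */
	if (add_to_swap_cache(new, entry, gfp, &shadow)) {
		/* Step 4: lost the race, return whatever folio won it. */
		__folio_clear_locked(new);
		old = filemap_get_folio(swap_address_space(entry),
					swp_offset(entry));
		/* Step 5: no folio means swapped in elsewhere or OOM. */
		return IS_ERR(old) ? NULL : old;
	}

	/* Step 3: set SWAP_HAS_CACHE; this fails only for a freed entry. */
	if (swapcache_prepare(entry)) {
		/*
		 * The entry was swapped in and freed by someone else:
		 * revert step 2 (see fail_delete_cache in the real code)
		 * and return NULL.
		 */
		return NULL;
	}

	/* Winner: the new folio is locked and in the swap cache. */
	return new;
}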

swap_cache_add_wait is for adding a folio with a freed entry to the
swap cache (the swapout path). Because swap_cache_add_or_get reverts
quickly if it accidentally added a folio with a freed entry to the
swap cache, swap_cache_add_wait can simply wait on such a race and
retry.

To remove a folio from the swap cache, one has to follow these steps:
1. Start by acquiring the folio lock.
2. Check if PG_swapcache is still set; if not, the folio has already
   been removed.
3. Call put_swap_folio() to clear the SWAP_HAS_CACHE flag in the swap
   map first. Do this before removing the folio from the XArray so
   that insertions can successfully update the swap map.
4. Remove the folio from the XArray with __delete_from_swap_cache.
5. Clear the PG_swapcache folio flag, then unlock and put the folio.

Or just call delete_from_swap_cache after checking that the folio
still has PG_swapcache set. A sketch of this sequence follows the
note below.

Note that between step 3 and step 4 the entry may get loaded into the
swap slot cache, but this is OK because swapout uses
swap_cache_add_wait, which waits for step 4.
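
A sketch of the removal sequence above, with the steps annotated. It
mirrors the reworked delete_from_swap_cache() in the diff below and
assumes the caller already holds a reference on the folio:

folio_lock(folio);					/* step 1 */
if (folio_test_swapcache(folio)) {			/* step 2 */
	swp_entry_t entry = folio->swap;
	struct address_space *mapping = swap_address_space(entry);

	put_swap_folio(folio, entry);			/* step 3 */

	xa_lock_irq(&mapping->i_pages);
	__delete_from_swap_cache(folio, entry, NULL);	/* step 4 */
	xa_unlock_irq(&mapping->i_pages);

	/* step 5: PG_swapcache was cleared inside step 4, just drop the
	 * references the swap cache held. */
	folio_ref_sub(folio, folio_nr_pages(folio));
}
folio_unlock(folio);
folio_put(folio);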

By using the swap cache as the synchronization point for swapin and
swapout, this helps remove a lot of hacks and workarounds that were
added for synchronization:

The schedule_timeout_uninterruptible(1) introduced by (we now just
wait on the folio):
- commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
- commit 029c4628b2eb ("mm: swap: get rid of livelock in swapin readahead")
The skip_if_exists flag introduced by (calls now always return, they
never wait inside):
- commit a65b0e7607cc ("zswap: make shrinking memcg-aware")
And the swapoff workaround introduced by (the swap map is now
consistent with the xarray and the slot cache is disabled, so the
check is only needed in swapoff now):
- commit ba81f8384254 ("mm/swap: skip readahead only when swap slot cache is enabled")

Test result of sequential swapin/out of 30G zero page on ZRAM:

               Before (us)        After (us)
Swapout:       33713283           33827215
Swapin:        40954646           39466754 (+3.7%)
Swapout (THP): 6921176            6917709
Swapin (THP) : 40891953           39566916 (+3.3%)

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c      |   5 +-
 mm/swap.h       |  18 ++--
 mm/swap_state.c | 217 +++++++++++++++++++++++++-----------------------
 mm/swapfile.c   |  13 ++-
 mm/vmscan.c     |   2 +-
 mm/zswap.c      |   2 +-
 6 files changed, 132 insertions(+), 125 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 0aad0d9a621b..51e4593f9e2e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1512,9 +1512,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (list_empty(&info->swaplist))
 		list_add(&info->swaplist, &shmem_swaplist);
 
-	if (add_to_swap_cache(folio, swap,
-			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
-			NULL) == 0) {
+	if (!swap_cache_add_wait(folio, swap,
+				 __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
 		shmem_recalc_inode(inode, 0, 1);
 		swap_shmem_alloc(swap);
 		shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap));
diff --git a/mm/swap.h b/mm/swap.h
index 5fbbc4a42787..be2d1642b5d9 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -34,23 +34,20 @@ extern struct address_space *swapper_spaces[];
 void show_swap_cache_info(void);
 bool add_to_swap(struct folio *folio);
 void *get_shadow_from_swap_cache(swp_entry_t entry);
-int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
-		      gfp_t gfp, void **shadowp);
 void __delete_from_swap_cache(struct folio *folio,
 			      swp_entry_t entry, void *shadow);
 void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(swp_entry_t entry);
+int swap_cache_add_wait(struct folio *folio, swp_entry_t entry, gfp_t gfp);
 struct folio *swap_cache_get_folio(swp_entry_t entry,
 		struct vm_area_struct *vma, unsigned long addr);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
 		pgoff_t index);
-
 struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
 		struct swap_iocb **plug);
 struct folio *swap_cache_alloc_or_get(swp_entry_t entry, gfp_t gfp_flags,
-		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
-		bool skip_if_exists);
+		struct mempolicy *mpol, pgoff_t ilx, bool *folio_allocated);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
@@ -109,6 +106,11 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
 	return 0;
 }
 
+static inline int swap_cache_add_wait(struct folio *folio, swp_entry_t entry, gfp_t gfp)
+{
+	return -1;
+}
+
 static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
 		struct vm_area_struct *vma, unsigned long addr)
 {
@@ -132,12 +134,6 @@ static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
 	return NULL;
 }
 
-static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
-					gfp_t gfp_mask, void **shadowp)
-{
-	return -1;
-}
-
 static inline void __delete_from_swap_cache(struct folio *folio,
 					swp_entry_t entry, void *shadow)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index cd1a16afcd9f..b5ea13295e17 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -85,8 +85,8 @@ void *get_shadow_from_swap_cache(swp_entry_t entry)
  * add_to_swap_cache resembles filemap_add_folio on swapper_space,
  * but sets SwapCache flag and private instead of mapping and index.
  */
-int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
-			gfp_t gfp, void **shadowp)
+static int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
+			     gfp_t gfp, void **shadowp)
 {
 	struct address_space *address_space = swap_address_space(entry);
 	pgoff_t idx = swp_offset(entry);
@@ -169,14 +169,16 @@ bool add_to_swap(struct folio *folio)
 	/*
 	 * Add it to the swap cache.
 	 */
-	err = add_to_swap_cache(folio, entry,
-			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
-	if (err)
+	err = swap_cache_add_wait(folio, entry,
+				  __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
+	if (err) {
 		/*
-		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
-		 * clear SWAP_HAS_CACHE flag.
+		 * swap_cache_add_wait() doesn't return -EEXIST, so we can
+		 * safely clear SWAP_HAS_CACHE flag.
 		 */
 		goto fail;
+	}
+
 	/*
 	 * Normally the folio will be dirtied in unmap because its
 	 * pte should be dirty. A special case is MADV_FREE page. The
@@ -208,11 +210,12 @@ void delete_from_swap_cache(struct folio *folio)
 	swp_entry_t entry = folio->swap;
 	struct address_space *address_space = swap_address_space(entry);
 
+	put_swap_folio(folio, entry);
+
 	xa_lock_irq(&address_space->i_pages);
 	__delete_from_swap_cache(folio, entry, NULL);
 	xa_unlock_irq(&address_space->i_pages);
 
-	put_swap_folio(folio, entry);
 	folio_ref_sub(folio, folio_nr_pages(folio));
 }
 
@@ -385,119 +388,123 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 	return folio;
 }
 
-struct folio *swap_cache_alloc_or_get(swp_entry_t entry, gfp_t gfp_mask,
-		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
-		bool skip_if_exists)
+/*
+ * Try to add a new folio, return NULL if the entry is swapped in by someone
+ * else or hitting OOM.
+ */
+static struct folio *swap_cache_add_or_get(struct folio *folio,
+		swp_entry_t entry, gfp_t gfp_mask)
 {
-	struct swap_info_struct *si;
-	struct folio *folio;
+	int ret = 0;
 	void *shadow = NULL;
+	struct address_space *address_space = swap_address_space(entry);
 
-	*new_page_allocated = false;
-	si = get_swap_device(entry);
-	if (!si)
-		return NULL;
-
-	for (;;) {
-		int err;
-		/*
-		 * First check the swap cache.  Since this is normally
-		 * called after swap_cache_get_folio() failed, re-calling
-		 * that would confuse statistics.
-		 */
-		folio = filemap_get_folio(swap_address_space(entry),
-						swp_offset(entry));
-		if (!IS_ERR(folio))
-			goto got_folio;
-
-		/*
-		 * Just skip read ahead for unused swap slot.
-		 * During swap_off when swap_slot_cache is disabled,
-		 * we have to handle the race between putting
-		 * swap entry in swap cache and marking swap slot
-		 * as SWAP_HAS_CACHE.  That's done in later part of code or
-		 * else swap_off will be aborted if we return NULL.
-		 */
-		if (!swap_swapcount(si, entry) && swap_slot_cache_enabled)
-			goto fail_put_swap;
-
-		/*
-		 * Get a new folio to read into from swap.  Allocate it now,
-		 * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will
-		 * cause any racers to loop around until we add it to cache.
-		 */
-		folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0,
-						mpol, ilx, numa_node_id());
-		if (!folio)
-                        goto fail_put_swap;
-
-		/*
-		 * Swap entry may have been freed since our caller observed it.
-		 */
-		err = swapcache_prepare(entry);
-		if (!err)
-			break;
-
-		folio_put(folio);
-		if (err != -EEXIST)
-			goto fail_put_swap;
-
-		/*
-		 * Protect against a recursive call to swap_cache_alloc_or_get()
-		 * on the same entry waiting forever here because SWAP_HAS_CACHE
-		 * is set but the folio is not the swap cache yet. This can
-		 * happen today if mem_cgroup_swapin_charge_folio() below
-		 * triggers reclaim through zswap, which may call
-		 * swap_cache_alloc_or_get() in the writeback path.
-		 */
-		if (skip_if_exists)
-			goto fail_put_swap;
+	/* If folio is NULL, simply go look up the swapcache */
+	if (folio) {
+		__folio_set_locked(folio);
+		__folio_set_swapbacked(folio);
+		ret = add_to_swap_cache(folio, entry, gfp_mask, &shadow);
+		if (ret)
+			__folio_clear_locked(folio);
+	}
 
-		/*
-		 * We might race against __delete_from_swap_cache(), and
-		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
-		 * has not yet been cleared.  Or race against another
-		 * swap_cache_alloc_or_get(), which has set SWAP_HAS_CACHE
-		 * in swap_map, but not yet added its folio to swap cache.
-		 */
-		schedule_timeout_uninterruptible(1);
+	if (!folio || ret) {
+		/* If the folio is already added, return it untouched. */
+		folio = filemap_get_folio(address_space, swp_offset(entry));
+		/* If not, either the entry has been freed or we are OOM. */
+		if (IS_ERR(folio))
+			return NULL;
+		return folio;
 	}
 
 	/*
-	 * The swap entry is ours to swap in. Prepare the new folio.
+	 * The folio is now added to the swap cache, try to update the swap
+	 * map to ensure the entry is still valid. If we accidentally added
+	 * a stale entry, undo the add.
 	 */
+	ret = swapcache_prepare(entry);
+	if (unlikely(ret))
+		goto fail_delete_cache;
 
-	__folio_set_locked(folio);
-	__folio_set_swapbacked(folio);
-
+	/* Charge and shadow check */
 	if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
-		goto fail_unlock;
-
-	/* May fail (-ENOMEM) if XArray node allocation failed. */
-	if (add_to_swap_cache(folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
-		goto fail_unlock;
-
+		goto fail_put_flag;
 	mem_cgroup_swapin_uncharge_swap(entry);
-
 	if (shadow)
 		workingset_refault(folio, shadow);
 
-	/* Caller will initiate read into locked folio */
+	/* Return the newly added folio locked */
 	folio_add_lru(folio);
-	*new_page_allocated = true;
-got_folio:
-	put_swap_device(si);
 	return folio;
 
-fail_unlock:
+fail_put_flag:
 	put_swap_folio(folio, entry);
+fail_delete_cache:
+	xa_lock_irq(&address_space->i_pages);
+	__delete_from_swap_cache(folio, entry, shadow);
+	xa_unlock_irq(&address_space->i_pages);
+	folio_ref_sub(folio, folio_nr_pages(folio));
 	folio_unlock(folio);
-	folio_put(folio);
-fail_put_swap:
-	put_swap_device(si);
+
 	return NULL;
 }
 
+/*
+ * Try to add a folio to swap cache, caller must ensure entry is freed.
+ * May block if swap_cache_alloc_or_get accidentally loaded a freed entry
+ * and it will be removed very soon, so just wait and retry.
+ */
+int swap_cache_add_wait(struct folio *folio, swp_entry_t entry, gfp_t gfp)
+{
+	int ret;
+	struct folio *wait_folio;
+
+	for (;;) {
+		ret = add_to_swap_cache(folio, entry, gfp, NULL);
+		if (ret != -EEXIST)
+			break;
+		wait_folio = filemap_get_folio(swap_address_space(entry),
+				swp_offset(entry));
+		if (!IS_ERR(wait_folio)) {
+			folio_wait_locked(wait_folio);
+			folio_put(wait_folio);
+		}
+	}
+
+	return ret;
+}
+
+struct folio *swap_cache_alloc_or_get(swp_entry_t entry, gfp_t gfp_mask,
+		struct mempolicy *mpol, pgoff_t ilx, bool *folio_allocated)
+{
+	struct folio *folio, *swapcache = NULL;
+	struct swap_info_struct *si;
+
+	/* Prevent swapoff from happening to us */
+	si = get_swap_device(entry);
+	if (!si)
+		goto out_no_device;
+
+	/* We are very likely the first user, alloc and try add to the swapcache. */
+	folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0, mpol, ilx,
+						 numa_node_id());
+	swapcache = swap_cache_add_or_get(folio, entry, gfp_mask);
+	if (swapcache != folio) {
+		folio_put(folio);
+		goto out_no_alloc;
+	}
+
+	put_swap_device(si);
+	*folio_allocated = true;
+	return swapcache;
+
+out_no_alloc:
+	put_swap_device(si);
+out_no_device:
+	*folio_allocated = false;
+	return swapcache;
+}
+
 /*
  * Locate a page of swap in physical memory, reserving swap cache space
  * and reading the disk if it is not already cached.
@@ -519,7 +526,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 	mpol = get_vma_policy(vma, addr, 0, &ilx);
 	folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
-					&page_allocated, false);
+					&page_allocated);
 	mpol_cond_put(mpol);
 
 	if (page_allocated)
@@ -636,7 +643,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		/* Ok, do the async read-ahead now */
 		folio = swap_cache_alloc_or_get(
 				swp_entry(swp_type(entry), offset),
-				gfp_mask, mpol, ilx, &page_allocated, false);
+				gfp_mask, mpol, ilx, &page_allocated);
 		if (!folio)
 			continue;
 		if (page_allocated) {
@@ -654,7 +661,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 skip:
 	/* The page was likely read above, so no need for plugging here */
 	folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
-					&page_allocated, false);
+					&page_allocated);
 	if (unlikely(page_allocated)) {
 		zswap_folio_swapin(folio);
 		swap_read_folio(folio, false, NULL);
@@ -810,7 +817,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 		pte_unmap(pte);
 		pte = NULL;
 		folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
-						&page_allocated, false);
+						&page_allocated);
 		if (!folio)
 			continue;
 		if (page_allocated) {
@@ -830,7 +837,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 skip:
 	/* The folio was likely read above, so no need for plugging here */
 	folio = swap_cache_alloc_or_get(targ_entry, gfp_mask, mpol, targ_ilx,
-					&page_allocated, false);
+					&page_allocated);
 	if (unlikely(page_allocated)) {
 		zswap_folio_swapin(folio);
 		swap_read_folio(folio, false, NULL);
@@ -856,7 +863,7 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
 
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 	folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
-					&page_allocated, false);
+					&page_allocated);
 	mpol_cond_put(mpol);
 
 	if (page_allocated)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 332ce4e578e8..8225091d42b6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -149,9 +149,10 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	 * in usual operations.
 	 */
 	if (folio_trylock(folio)) {
-		if ((flags & TTRS_ANYWAY) ||
+		if (folio_test_swapcache(folio) &&
+		    ((flags & TTRS_ANYWAY) ||
 		    ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
-		    ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)))
+		    ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio))))
 			ret = folio_free_swap(folio);
 		folio_unlock(folio);
 	}
@@ -1344,7 +1345,8 @@ void swap_free(swp_entry_t entry)
 }
 
 /*
- * Called after dropping swapcache to decrease refcnt to swap entries.
+ * Called before dropping the swapcache, free the entry and ensure
+ * a new insertion will succeed.
  */
 void put_swap_folio(struct folio *folio, swp_entry_t entry)
 {
@@ -1897,13 +1899,15 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		}
 		if (!folio) {
 			swp_count = READ_ONCE(si->swap_map[offset]);
-			if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
+			if (swap_count(swp_count) == 0 || swp_count == SWAP_MAP_BAD)
 				continue;
 			return -ENOMEM;
 		}
 
 		folio_lock(folio);
 		folio_wait_writeback(folio);
+		if (!folio_test_swapcache(folio))
+			goto free_folio;
 		ret = unuse_pte(vma, pmd, addr, entry, folio);
 		if (ret < 0) {
 			folio_unlock(folio);
@@ -1912,6 +1916,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		}
 
 		folio_free_swap(folio);
+free_folio:
 		folio_unlock(folio);
 		folio_put(folio);
 	} while (addr += PAGE_SIZE, addr != end);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ef654addd44..c3db39393428 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -732,10 +732,10 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(folio, target_memcg);
+		put_swap_folio(folio, swap);
 		__delete_from_swap_cache(folio, swap, shadow);
 		mem_cgroup_swapout(folio, swap);
 		xa_unlock_irq(&mapping->i_pages);
-		put_swap_folio(folio, swap);
 	} else {
 		void (*free_folio)(struct folio *);
 
diff --git a/mm/zswap.c b/mm/zswap.c
index e4d96816be70..c80e33c74235 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1127,7 +1127,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	/* try to allocate swap cache folio */
 	mpol = get_task_policy(current);
 	folio = swap_cache_alloc_or_get(swpentry, GFP_KERNEL, mpol,
-				NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+				NO_INTERLEAVE_INDEX, &folio_was_allocated);
 	if (!folio)
 		return -ENOMEM;
 
-- 
2.43.0



* [RFC PATCH 09/10] mm/swap: delay the swap cache lookup for swapin
  2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
                   ` (7 preceding siblings ...)
  2024-03-26 18:50 ` [RFC PATCH 08/10] mm/swap: use swap cache as a synchronization layer Kairui Song
@ 2024-03-26 18:50 ` Kairui Song
  2024-03-26 18:50 ` [RFC PATCH 10/10] mm/swap: optimize synchronous swapin Kairui Song
  2024-03-27  2:52 ` [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Huang, Ying
  10 siblings, 0 replies; 28+ messages in thread
From: Kairui Song @ 2024-03-26 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Huang, Ying, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently we do a swap cache lookup first, then call into the ordinary
swapin path. But all swapin paths call swap_cache_add_or_get, which
does a swap cache lookup again on race, because the first lookup is
racy and could miss the swap cache.

If the race happens (which can be frequent on a busy device), the
caller has no way of knowing it, cannot distinguish a minor from a
major fault, and the first lookup is redundant.

So do the swap cache lookup and readahead update late, deferring them
to swap_cache_alloc_or_get, and make the lookup faster by skipping it
when the HAS_CACHE flag is not set. This early check is less accurate,
but the later lookup always ensures we never miss an existing swap
cache entry. This gives callers 100% accurate swap cache usage info,
improves minor/major page fault accounting, and also improves
performance.
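
As a rough illustration of the caller-visible change (simplified from
the mm/memory.c hunk below), do_swap_page() can now account the fault
type exactly from the returned folio_allocated flag:

bool folio_allocated = false;

/* The helper now does the (delayed) swap cache lookup internally. */
folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf,
			 &folio_allocated);
if (folio_allocated) {
	/* Had to read the folio from the swap device: major fault. */
	ret = VM_FAULT_MAJOR;
	count_vm_event(PGMAJFAULT);
}
/* Otherwise the folio came from the swap cache: minor fault. */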

Test result of sequential swapin/out of 30G zero page on ZRAM:

               Before (us)        After (us)
Swapout:       33827215           33853883
Swapin:        39466754           38336519 (+2.9%)
Swapout (THP): 6917709            6814619
Swapin (THP) : 39566916           38383367 (+3.0%)

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     |  45 ++++++++----------
 mm/shmem.c      |  39 +++++++---------
 mm/swap.h       |  16 +++++--
 mm/swap_state.c | 122 +++++++++++++++++++++++++++++-------------------
 mm/swapfile.c   |  32 +++++++------
 5 files changed, 141 insertions(+), 113 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 357d239ee2f6..774a912eb46d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3932,6 +3932,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct page *page;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
+	bool folio_allocated = false;
 	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
@@ -3991,35 +3992,29 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(!si))
 		goto out;
 
-	folio = swap_cache_get_folio(entry, vma, vmf->address);
-	if (folio)
-		page = folio_file_page(folio, swp_offset(entry));
-	swapcache = folio;
+	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) {
+		folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf, &folio_allocated);
+	} else {
+		folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf, &folio_allocated);
+	}
 
 	if (!folio) {
-		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
-		    __swap_count(entry) == 1) {
-			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
-		} else {
-			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
-		}
-
-		if (!folio) {
-			/*
-			 * Back out if somebody else faulted in this pte
-			 * while we released the pte lock.
-			 */
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
-			if (likely(vmf->pte &&
-				   pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
-				ret = VM_FAULT_OOM;
-			goto unlock;
-		}
+		/*
+		 * Back out if somebody else faulted in this pte
+		 * while we released the pte lock.
+		 */
+		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+				vmf->address, &vmf->ptl);
+		if (likely(vmf->pte &&
+			   pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
+			ret = VM_FAULT_OOM;
+		goto unlock;
+	}
 
-		swapcache = folio;
-		page = folio_file_page(folio, swp_offset(entry));
+	swapcache = folio;
+	page = folio_file_page(folio, swp_offset(entry));
 
+	if (folio_allocated) {
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
diff --git a/mm/shmem.c b/mm/shmem.c
index 51e4593f9e2e..7884bbe28731 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1570,20 +1570,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
 static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
 			pgoff_t index, unsigned int order, pgoff_t *ilx);
 
-static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
-			struct shmem_inode_info *info, pgoff_t index)
-{
-	struct mempolicy *mpol;
-	pgoff_t ilx;
-	struct folio *folio;
-
-	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
-	folio = swap_cluster_readahead(swap, gfp, mpol, ilx);
-	mpol_cond_put(mpol);
-
-	return folio;
-}
-
 /*
  * Make sure huge_gfp is always more limited than limit_gfp.
  * Some of the flags set permissions, while others set limitations.
@@ -1857,9 +1843,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	bool folio_allocated = false;
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
+	struct mempolicy *mpol;
 	swp_entry_t swap;
+	pgoff_t ilx;
 	int error;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
@@ -1878,22 +1867,28 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 
 	/* Look it up and read it in.. */
-	folio = swap_cache_get_folio(swap, NULL, 0);
+	folio = swap_cache_try_get(swap);
 	if (!folio) {
-		/* Or update major stats only when swapin succeeds?? */
-		if (fault_type) {
-			*fault_type |= VM_FAULT_MAJOR;
-			count_vm_event(PGMAJFAULT);
-			count_memcg_event_mm(fault_mm, PGMAJFAULT);
-		}
 		/* Here we actually start the io */
-		folio = shmem_swapin_cluster(swap, gfp, info, index);
+		mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
+		folio = swap_cluster_readahead(swap, gfp, mpol, ilx, &folio_allocated);
+		mpol_cond_put(mpol);
 		if (!folio) {
 			error = -ENOMEM;
 			goto failed;
 		}
+
+		/* Update major stats only when swapin succeeds */
+		if (folio_allocated && fault_type) {
+			*fault_type |= VM_FAULT_MAJOR;
+			count_vm_event(PGMAJFAULT);
+			count_memcg_event_mm(fault_mm, PGMAJFAULT);
+		}
 	}
 
+	if (!folio_allocated)
+		swap_cache_update_ra(folio, NULL, 0);
+
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if (!folio_test_swapcache(folio) ||
diff --git a/mm/swap.h b/mm/swap.h
index be2d1642b5d9..bd872b157950 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -39,7 +39,8 @@ void __delete_from_swap_cache(struct folio *folio,
 void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(swp_entry_t entry);
 int swap_cache_add_wait(struct folio *folio, swp_entry_t entry, gfp_t gfp);
-struct folio *swap_cache_get_folio(swp_entry_t entry,
+struct folio *swap_cache_try_get(swp_entry_t entry);
+void swap_cache_update_ra(struct folio *folio,
 		struct vm_area_struct *vma, unsigned long addr);
 struct folio *filemap_get_incore_folio(struct address_space *mapping,
 		pgoff_t index);
@@ -49,16 +50,18 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 struct folio *swap_cache_alloc_or_get(swp_entry_t entry, gfp_t gfp_flags,
 		struct mempolicy *mpol, pgoff_t ilx, bool *folio_allocated);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
-		struct mempolicy *mpol, pgoff_t ilx);
+		struct mempolicy *mpol, pgoff_t ilx, bool *folio_allocated);
 struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
-			    struct vm_fault *vmf);
+			    struct vm_fault *vmf, bool *folio_allocated);
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
-			      struct vm_fault *vmf);
+			      struct vm_fault *vmf, bool *folio_allocated);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
 	return swp_swap_info(folio->swap)->flags;
 }
+
+bool __swap_has_cache(swp_entry_t entry);
 #else /* CONFIG_SWAP */
 struct swap_iocb;
 static inline void swap_read_folio(struct folio *folio, bool do_poll,
@@ -151,5 +154,10 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
 {
 	return 0;
 }
+
+static inline bool __swap_has_cache(swp_entry_t entry)
+{
+	return false;
+}
 #endif /* CONFIG_SWAP */
 #endif /* _MM_SWAP_H */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b5ea13295e17..cf178dd1131a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -300,54 +300,54 @@ static inline bool swap_use_vma_readahead(void)
 }
 
 /*
- * Lookup a swap entry in the swap cache. A found folio will be returned
- * unlocked and with its refcount incremented - we rely on the kernel
- * lock getting page table operations atomic even if we drop the folio
- * lock before returning.
- *
- * Caller must lock the swap device or hold a reference to keep it valid.
+ * Try to get the swap cache folio; bail out if the cache bit is not set.
  */
-struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr)
+struct folio *swap_cache_try_get(swp_entry_t entry)
 {
 	struct folio *folio;
 
-	folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
-	if (!IS_ERR(folio)) {
-		bool vma_ra = swap_use_vma_readahead();
-		bool readahead;
-
-		/*
-		 * At the moment, we don't support PG_readahead for anon THP
-		 * so let's bail out rather than confusing the readahead stat.
-		 */
-		if (unlikely(folio_test_large(folio)))
+	if (__swap_has_cache(entry)) {
+		folio = filemap_get_folio(swap_address_space(entry),
+				swp_offset(entry));
+		if (!IS_ERR(folio))
 			return folio;
+	}
 
-		readahead = folio_test_clear_readahead(folio);
-		if (vma && vma_ra) {
-			unsigned long ra_val;
-			int win, hits;
-
-			ra_val = GET_SWAP_RA_VAL(vma);
-			win = SWAP_RA_WIN(ra_val);
-			hits = SWAP_RA_HITS(ra_val);
-			if (readahead)
-				hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
-			atomic_long_set(&vma->swap_readahead_info,
-					SWAP_RA_VAL(addr, win, hits));
-		}
+	return NULL;
+}
 
-		if (readahead) {
-			count_vm_event(SWAP_RA_HIT);
-			if (!vma || !vma_ra)
-				atomic_inc(&swapin_readahead_hits);
-		}
-	} else {
-		folio = NULL;
+void swap_cache_update_ra(struct folio *folio, struct vm_area_struct *vma,
+			  unsigned long addr)
+{
+	bool vma_ra = swap_use_vma_readahead();
+	bool readahead;
+
+	/*
+	 * At the moment, we don't support PG_readahead for anon THP
+	 * so let's bail out rather than confusing the readahead stat.
+	 */
+	if (unlikely(folio_test_large(folio)))
+		return;
+
+	readahead = folio_test_clear_readahead(folio);
+	if (vma && vma_ra) {
+		unsigned long ra_val;
+		int win, hits;
+
+		ra_val = GET_SWAP_RA_VAL(vma);
+		win = SWAP_RA_WIN(ra_val);
+		hits = SWAP_RA_HITS(ra_val);
+		if (readahead)
+			hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
+		atomic_long_set(&vma->swap_readahead_info,
+				SWAP_RA_VAL(addr, win, hits));
 	}
 
-	return folio;
+	if (readahead) {
+		count_vm_event(SWAP_RA_HIT);
+		if (!vma || !vma_ra)
+			atomic_inc(&swapin_readahead_hits);
+	}
 }
 
 /**
@@ -485,6 +485,11 @@ struct folio *swap_cache_alloc_or_get(swp_entry_t entry, gfp_t gfp_mask,
 	if (!si)
 		goto out_no_device;
 
+	/* First do a racy check if cache is already loaded. */
+	swapcache = swap_cache_try_get(entry);
+	if (swapcache)
+		goto out_no_alloc;
+
 	/* We are very likely the first user, alloc and try add to the swapcache. */
 	folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0, mpol, ilx,
 						 numa_node_id());
@@ -614,7 +619,8 @@ static unsigned long swapin_nr_pages(unsigned long offset)
  * are fairly likely to have been swapped out from the same node.
  */
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				    struct mempolicy *mpol, pgoff_t ilx)
+				     struct mempolicy *mpol, pgoff_t ilx,
+				     bool *folio_allocated)
 {
 	struct folio *folio;
 	unsigned long entry_offset = swp_offset(entry);
@@ -644,6 +650,10 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		folio = swap_cache_alloc_or_get(
 				swp_entry(swp_type(entry), offset),
 				gfp_mask, mpol, ilx, &page_allocated);
+		if (offset == entry_offset) {
+			*folio_allocated = page_allocated;
+			folio_allocated = NULL;
+		}
 		if (!folio)
 			continue;
 		if (page_allocated) {
@@ -666,6 +676,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		zswap_folio_swapin(folio);
 		swap_read_folio(folio, false, NULL);
 	}
+	if (folio_allocated)
+		*folio_allocated = page_allocated;
 	return folio;
 }
 
@@ -779,7 +791,8 @@ static void swap_ra_info(struct vm_fault *vmf,
  *
  */
 static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
-		struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf)
+					struct mempolicy *mpol, pgoff_t targ_ilx,
+					struct vm_fault *vmf, bool *folio_allocated)
 {
 	struct blk_plug plug;
 	struct swap_iocb *splug = NULL;
@@ -818,6 +831,10 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 		pte = NULL;
 		folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
 						&page_allocated);
+		if (i == ra_info.offset) {
+			*folio_allocated = page_allocated;
+			folio_allocated = NULL;
+		}
 		if (!folio)
 			continue;
 		if (page_allocated) {
@@ -842,6 +859,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 		zswap_folio_swapin(folio);
 		swap_read_folio(folio, false, NULL);
 	}
+	if (folio_allocated)
+		*folio_allocated = page_allocated;
 	return folio;
 }
 
@@ -854,20 +873,21 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * Returns the folio for entry after it is read in.
  */
 struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
-			    struct vm_fault *vmf)
+			    struct vm_fault *vmf, bool *folio_allocated)
 {
 	struct mempolicy *mpol;
 	struct folio *folio;
-	bool page_allocated;
 	pgoff_t ilx;
 
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 	folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
-					&page_allocated);
+					folio_allocated);
 	mpol_cond_put(mpol);
 
-	if (page_allocated)
+	if (*folio_allocated)
 		swap_read_folio(folio, true, NULL);
+	else if (folio)
+		swap_cache_update_ra(folio, vmf->vma, vmf->address);
 
 	return folio;
 }
@@ -885,18 +905,22 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
  * or vma-based(ie, virtual address based on faulty address) readahead.
  */
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				struct vm_fault *vmf)
+			       struct vm_fault *vmf, bool *folio_allocated)
 {
 	struct mempolicy *mpol;
-	pgoff_t ilx;
 	struct folio *folio;
+	bool allocated;
+	pgoff_t ilx;
 
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 	folio = swap_use_vma_readahead() ?
-		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
-		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf, &allocated) :
+		swap_cluster_readahead(entry, gfp_mask, mpol, ilx, &allocated);
 	mpol_cond_put(mpol);
 
+	if (!*folio_allocated && folio)
+		swap_cache_update_ra(folio, vmf->vma, vmf->address);
+
 	return folio;
 }
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8225091d42b6..ddcf2ff91c39 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1455,6 +1455,15 @@ int __swap_count(swp_entry_t entry)
 	return swap_count(si->swap_map[offset]);
 }
 
+bool __swap_has_cache(swp_entry_t entry)
+{
+	pgoff_t offset = swp_offset(entry);
+	struct swap_info_struct *si = swp_swap_info(entry);
+	unsigned char count = READ_ONCE(si->swap_map[offset]);
+
+	return swap_count(count) && (count & SWAP_HAS_CACHE);
+}
+
 /*
  * How many references to @entry are currently swapped out?
  * This does not give an exact answer when swap count is continued,
@@ -1862,10 +1871,18 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		struct folio *folio;
 		unsigned long offset;
 		unsigned char swp_count;
+		bool folio_allocated;
 		swp_entry_t entry;
 		int ret;
 		pte_t ptent;
 
+		struct vm_fault vmf = {
+			.vma = vma,
+			.address = addr,
+			.real_address = addr,
+			.pmd = pmd,
+		};
+
 		if (!pte++) {
 			pte = pte_offset_map(pmd, addr);
 			if (!pte)
@@ -1884,19 +1901,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		offset = swp_offset(entry);
 		pte_unmap(pte);
 		pte = NULL;
-
-		folio = swap_cache_get_folio(entry, vma, addr);
-		if (!folio) {
-			struct vm_fault vmf = {
-				.vma = vma,
-				.address = addr,
-				.real_address = addr,
-				.pmd = pmd,
-			};
-
-			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						&vmf);
-		}
+		folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+				&vmf, &folio_allocated);
 		if (!folio) {
 			swp_count = READ_ONCE(si->swap_map[offset]);
 			if (swap_count(swp_count) == 0 || swp_count == SWAP_MAP_BAD)
-- 
2.43.0



* [RFC PATCH 10/10] mm/swap: optimize synchronous swapin
  2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
                   ` (8 preceding siblings ...)
  2024-03-26 18:50 ` [RFC PATCH 09/10] mm/swap: delay the swap cache lookup for swapin Kairui Song
@ 2024-03-26 18:50 ` Kairui Song
  2024-03-27  6:22   ` Huang, Ying
  2024-03-27  8:08   ` Barry Song
  2024-03-27  2:52 ` [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Huang, Ying
  10 siblings, 2 replies; 28+ messages in thread
From: Kairui Song @ 2024-03-26 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Huang, Ying, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Interestingly, the major performance overhead of synchronous swapin
actually comes from the workingset node updates. Synchronous swapin
keeps adding single folios into an xa_node, so the node is no longer a
shadow node and has to be removed from shadow_nodes; then the folio is
removed again very shortly after, making the node a shadow node again,
so it has to be added back to shadow_nodes.

Mark synchronously swapped-in folios with a special bit in the swap
entry embedded in folio->swap, as we still have some usable bits
there. Skip the workingset node update on insertion of such a folio
because it will be removed very quickly, and its removal will trigger
the update anyway, keeping the workingset info eventually consistent.
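
Schematically, the flag is carved out of the spare top bit of the
offset field in the xarray-value-sized swap entry (see the swapops.h
hunk below); the diagram is illustrative and not part of the patch:

/*
 * folio->swap.val layout after this patch (BITS_PER_XA_VALUE wide):
 *
 *   |   type   | S |              offset              |
 *                ^
 *                SWP_CACHE_SYNCHRONOUS: the topmost bit of the old
 *                offset field, so SWP_OFFSET_BITS shrinks by
 *                SWP_CACHE_FLAG_BITS (one bit).
 */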

Test result of sequential swapin/out of 30G zero page on ZRAM:

               Before (us)        After (us)
Swapout:       33853883           33886008
Swapin:        38336519           32465441 (+15.4%)
Swapout (THP): 6814619            6899938
Swapin (THP) : 38383367           33193479 (+13.6%)

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swapops.h |  5 +++-
 mm/filemap.c            | 16 +++++++++---
 mm/memory.c             | 34 ++++++++++++++----------
 mm/swap.h               | 15 +++++++++++
 mm/swap_state.c         | 57 ++++++++++++++++++++++++-----------------
 mm/vmscan.c             |  6 +++++
 mm/workingset.c         |  2 +-
 7 files changed, 92 insertions(+), 43 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 48b700ba1d18..ebc0c3e4668d 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -25,7 +25,10 @@
  * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
  */
 #define SWP_TYPE_SHIFT	(BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
-#define SWP_OFFSET_MASK	((1UL << SWP_TYPE_SHIFT) - 1)
+#define SWP_CACHE_FLAG_BITS	1
+#define SWP_CACHE_SYNCHRONOUS	BIT(SWP_TYPE_SHIFT - 1)
+#define SWP_OFFSET_BITS	(SWP_TYPE_SHIFT - SWP_CACHE_FLAG_BITS)
+#define SWP_OFFSET_MASK	(BIT(SWP_OFFSET_BITS) - 1)
 
 /*
  * Definitions only for PFN swap entries (see is_pfn_swap_entry()).  To
diff --git a/mm/filemap.c b/mm/filemap.c
index 5e8e3fd26b8d..ac24cc65d1da 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -923,12 +923,20 @@ int __filemap_add_swapcache(struct address_space *mapping, struct folio *folio,
 			    pgoff_t index, gfp_t gfp, void **shadowp)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
+	bool synchronous = swap_cache_test_synchronous(folio);
 	long nr;
 	int ret;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
-	mapping_set_update(&xas, mapping);
+
+	/*
+	 * Skip the node update for synchronous folio insertion; it will be
+	 * updated at folio deletion very soon, avoiding repeated LRU locking.
+	 */
+	if (!synchronous)
+		xas_set_update(&xas, workingset_update_node);
+	xas_set_lru(&xas, &shadow_nodes);
 
 	nr = folio_nr_pages(folio);
 	folio_ref_add(folio, nr);
@@ -936,8 +944,10 @@ int __filemap_add_swapcache(struct address_space *mapping, struct folio *folio,
 	ret = __filemap_lock_store(&xas, folio, index, gfp, shadowp);
 	if (likely(!ret)) {
 		mapping->nrpages += nr;
-		__node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
-		__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
+		if (!synchronous) {
+			__node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
+			__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
+		}
 		xas_unlock_irq(&xas);
 	} else {
 		folio_put_refs(folio, nr);
diff --git a/mm/memory.c b/mm/memory.c
index 774a912eb46d..bb40202b4f29 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3933,6 +3933,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
 	bool folio_allocated = false;
+	bool synchronous_io = false;
 	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
@@ -4032,18 +4033,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (ret & VM_FAULT_RETRY)
 		goto out_release;
 
-	if (swapcache) {
-		/*
-		 * Make sure folio_free_swap() or swapoff did not release the
-		 * swapcache from under us.  The page pin, and pte_same test
-		 * below, are not enough to exclude that.  Even if it is still
-		 * swapcache, we need to check that the page's swap has not
-		 * changed.
-		 */
-		if (unlikely(!folio_test_swapcache(folio) ||
-			     page_swap_entry(page).val != entry.val))
-			goto out_page;
+	/*
+	 * Make sure folio_free_swap() or swapoff did not release the
+	 * swapcache from under us.  The page pin, and pte_same test
+	 * below, are not enough to exclude that.  Even if it is still
+	 * swapcache, we need to check that the page's swap has not
+	 * changed.
+	 */
+	if (unlikely(!folio_test_swapcache(folio) ||
+		     (page_swap_entry(page).val & ~SWP_CACHE_SYNCHRONOUS) != entry.val))
+		goto out_page;
 
+	synchronous_io = swap_cache_test_synchronous(folio);
+	if (!synchronous_io) {
 		/*
 		 * KSM sometimes has to copy on read faults, for example, if
 		 * page->index of !PageKSM() pages would be nonlinear inside the
@@ -4105,9 +4107,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 */
 	if (!folio_test_ksm(folio)) {
 		exclusive = pte_swp_exclusive(vmf->orig_pte);
-		if (folio != swapcache) {
+		if (synchronous_io || folio != swapcache) {
 			/*
-			 * We have a fresh page that is not exposed to the
+			 * We have a fresh page that is not sharable through the
 			 * swapcache -> certainly exclusive.
 			 */
 			exclusive = true;
@@ -4148,7 +4150,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * yet.
 	 */
 	swap_free(entry);
-	if (should_try_to_free_swap(folio, vma, vmf->flags))
+	if (synchronous_io)
+		delete_from_swap_cache(folio);
+	else if (should_try_to_free_swap(folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
@@ -4223,6 +4227,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 out_nomap:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
+	if (synchronous_io)
+		delete_from_swap_cache(folio);
 out_page:
 	folio_unlock(folio);
 out_release:
diff --git a/mm/swap.h b/mm/swap.h
index bd872b157950..9d106eebddbd 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -31,6 +31,21 @@ extern struct address_space *swapper_spaces[];
 	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
 		>> SWAP_ADDRESS_SPACE_SHIFT])
 
+static inline void swap_cache_mark_synchronous(struct folio *folio)
+{
+	folio->swap.val |= SWP_CACHE_SYNCHRONOUS;
+}
+
+static inline bool swap_cache_test_synchronous(struct folio *folio)
+{
+	return folio->swap.val & SWP_CACHE_SYNCHRONOUS;
+}
+
+static inline void swap_cache_clear_synchronous(struct folio *folio)
+{
+	folio->swap.val &= ~SWP_CACHE_SYNCHRONOUS;
+}
+
 void show_swap_cache_info(void);
 bool add_to_swap(struct folio *folio);
 void *get_shadow_from_swap_cache(swp_entry_t entry);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index cf178dd1131a..b0b1b5391ac1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -86,7 +86,7 @@ void *get_shadow_from_swap_cache(swp_entry_t entry)
  * but sets SwapCache flag and private instead of mapping and index.
  */
 static int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
-			     gfp_t gfp, void **shadowp)
+			     gfp_t gfp, bool synchronous, void **shadowp)
 {
 	struct address_space *address_space = swap_address_space(entry);
 	pgoff_t idx = swp_offset(entry);
@@ -98,11 +98,12 @@ static int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
 
 	folio_set_swapcache(folio);
 	folio->swap = entry;
-
+	if (synchronous)
+		swap_cache_mark_synchronous(folio);
 	ret = __filemap_add_swapcache(address_space, folio, idx, gfp, shadowp);
 	if (ret) {
-		folio_clear_swapcache(folio);
 		folio->swap.val = 0;
+		folio_clear_swapcache(folio);
 	}
 
 	return ret;
@@ -129,11 +130,13 @@ void __delete_from_swap_cache(struct folio *folio,
 	xas_set_order(&xas, idx, folio_order(folio));
 	xas_store(&xas, shadow);
 
-	folio->swap.val = 0;
 	folio_clear_swapcache(folio);
 	address_space->nrpages -= nr;
-	__node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
-	__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
+	if (!swap_cache_test_synchronous(folio)) {
+		__node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
+		__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
+	}
+	folio->swap.val = 0;
 }
 
 /**
@@ -393,7 +396,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
  * else or hitting OOM.
  */
 static struct folio *swap_cache_add_or_get(struct folio *folio,
-		swp_entry_t entry, gfp_t gfp_mask)
+		swp_entry_t entry, gfp_t gfp_mask, bool synchronous)
 {
 	int ret = 0;
 	void *shadow = NULL;
@@ -403,7 +406,7 @@ static struct folio *swap_cache_add_or_get(struct folio *folio,
 	if (folio) {
 		__folio_set_locked(folio);
 		__folio_set_swapbacked(folio);
-		ret = add_to_swap_cache(folio, entry, gfp_mask, &shadow);
+		ret = add_to_swap_cache(folio, entry, gfp_mask, synchronous, &shadow);
 		if (ret)
 			__folio_clear_locked(folio);
 	}
@@ -460,7 +463,7 @@ int swap_cache_add_wait(struct folio *folio, swp_entry_t entry, gfp_t gfp)
 	struct folio *wait_folio;
 
 	for (;;) {
-		ret = add_to_swap_cache(folio, entry, gfp, NULL);
+		ret = add_to_swap_cache(folio, entry, gfp, false, NULL);
 		if (ret != -EEXIST)
 			break;
 		wait_folio = filemap_get_folio(swap_address_space(entry),
@@ -493,7 +496,7 @@ struct folio *swap_cache_alloc_or_get(swp_entry_t entry, gfp_t gfp_mask,
 	/* We are very likely the first user, alloc and try add to the swapcache. */
 	folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0, mpol, ilx,
 						 numa_node_id());
-	swapcache = swap_cache_add_or_get(folio, entry, gfp_mask);
+	swapcache = swap_cache_add_or_get(folio, entry, gfp_mask, false);
 	if (swapcache != folio) {
 		folio_put(folio);
 		goto out_no_alloc;
@@ -875,21 +878,27 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
 			    struct vm_fault *vmf, bool *folio_allocated)
 {
-	struct mempolicy *mpol;
-	struct folio *folio;
-	pgoff_t ilx;
-
-	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
-	folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
-					folio_allocated);
-	mpol_cond_put(mpol);
-
-	if (*folio_allocated)
+	struct folio *folio = NULL, *swapcache;
+	/* First do a racy check if cache is already loaded. */
+	swapcache = swap_cache_try_get(entry);
+	if (unlikely(swapcache))
+		goto out;
+	folio = vma_alloc_folio(gfp_mask, 0, vmf->vma, vmf->address, false);
+	swapcache = swap_cache_add_or_get(folio, entry, gfp_mask, true);
+	if (!swapcache)
+		goto out_nocache;
+	if (swapcache == folio) {
 		swap_read_folio(folio, true, NULL);
-	else if (folio)
-		swap_cache_update_ra(folio, vmf->vma, vmf->address);
-
-	return folio;
+		*folio_allocated = true;
+		return folio;
+	}
+out:
+	swap_cache_update_ra(swapcache, vmf->vma, vmf->address);
+out_nocache:
+	if (folio)
+		folio_put(folio);
+	*folio_allocated = false;
+	return swapcache;
 }
 
 /**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c3db39393428..e71b049fee01 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1228,6 +1228,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					if (!add_to_swap(folio))
 						goto activate_locked_split;
 				}
+			} else if (swap_cache_test_synchronous(folio)) {
+				/*
+				 * We see a folio being swapped in but not activated, either
+				 * due to a missing shadow or a too-short lifetime; activate it.
+				 */
+				goto activate_locked;
 			}
 		} else if (folio_test_swapbacked(folio) &&
 			   folio_test_large(folio)) {
diff --git a/mm/workingset.c b/mm/workingset.c
index f2a0ecaf708d..83a0b409be0f 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -753,7 +753,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	 */
 	if (WARN_ON_ONCE(!node->nr_values))
 		goto out_invalid;
-	if (WARN_ON_ONCE(node->count != node->nr_values))
+	if (WARN_ON_ONCE(node->count != node->nr_values && mapping->host != NULL))
 		goto out_invalid;
 	xa_delete_node(node, workingset_update_node);
 	__inc_lruvec_kmem_state(node, WORKINGSET_NODERECLAIM);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 03/10] mm/swap: convert swapin_readahead to return a folio
  2024-03-26 18:50 ` [RFC PATCH 03/10] mm/swap: convert swapin_readahead to return a folio Kairui Song
@ 2024-03-26 20:03   ` Matthew Wilcox
  0 siblings, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2024-03-26 20:03 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Huang, Ying, Chris Li, Minchan Kim, Barry Song,
	Ryan Roberts, Yu Zhao, SeongJae Park, David Hildenbrand,
	Yosry Ahmed, Johannes Weiner, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel

On Wed, Mar 27, 2024 at 02:50:25AM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Simplify the caller code logic.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

I have this patch in my tree which is basically identical, but I think
it has a slightly better commit message.  Feel free to merge the two.

From 0c17eefb4c1ca7733ca765ab2f28ee09c9110ec0 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Fri, 22 Mar 2024 13:28:33 -0400
Subject: [PATCH] mm: Return the folio from swapin_readahead

The unuse_pte_range() caller only wants the folio while do_swap_page()
wants both the page and the folio.  Since do_swap_page() already has
logic for handling both the folio and the page, move the folio-to-page
logic there.  This also lets us allocate larger folios in the
SWP_SYNCHRONOUS_IO path in future.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c     | 6 ++----
 mm/swap.h       | 6 +++---
 mm/swap_state.c | 8 +++-----
 mm/swapfile.c   | 5 +----
 4 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 40070ef01867..aedf0ee554d1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4020,7 +4020,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			/* skip swapcache */
 			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
 						vma, vmf->address, false);
-			page = &folio->page;
 			if (folio) {
 				__folio_set_locked(folio);
 				__folio_set_swapbacked(folio);
@@ -4045,10 +4044,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				folio->private = NULL;
 			}
 		} else {
-			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						vmf);
-			if (page)
-				folio = page_folio(page);
 			swapcache = folio;
 		}
 
@@ -4069,6 +4066,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+		page = folio_file_page(folio, swp_offset(entry));
 	} else if (PageHWPoison(page)) {
 		/*
 		 * hwpoisoned dirty swapcache pages are kept for killing
diff --git a/mm/swap.h b/mm/swap.h
index fc2f6ade7f80..6661b55b2c75 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -55,8 +55,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
 		bool skip_if_exists);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
-struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
-			      struct vm_fault *vmf);
+struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
+		struct vm_fault *vmf);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -87,7 +87,7 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
 	return NULL;
 }
 
-static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
+static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 			struct vm_fault *vmf)
 {
 	return NULL;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2deac23633cd..f3c379e93bc6 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -885,13 +885,13 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * @gfp_mask: memory allocation flags
  * @vmf: fault information
  *
- * Returns the struct page for entry and addr, after queueing swapin.
+ * Returns the struct folio for entry and addr, after queueing swapin.
  *
  * It's a main entry function for swap readahead. By the configuration,
  * it will read ahead blocks by cluster-based(ie, physical disk based)
  * or vma-based(ie, virtual address based on faulty address) readahead.
  */
-struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
+struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 				struct vm_fault *vmf)
 {
 	struct mempolicy *mpol;
@@ -904,9 +904,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
 	mpol_cond_put(mpol);
 
-	if (!folio)
-		return NULL;
-	return folio_file_page(folio, swp_offset(entry));
+	return folio;
 }
 
 #ifdef CONFIG_SYSFS
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5e6d2304a2a4..c9d041ad8df6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1883,7 +1883,6 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 		folio = swap_cache_get_folio(entry, vma, addr);
 		if (!folio) {
-			struct page *page;
 			struct vm_fault vmf = {
 				.vma = vma,
 				.address = addr,
@@ -1891,10 +1890,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				.pmd = pmd,
 			};
 
-			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
 						&vmf);
-			if (page)
-				folio = page_folio(page);
 		}
 		if (!folio) {
 			swp_count = READ_ONCE(si->swap_map[offset]);
-- 
2.43.0


>  mm/memory.c     | 8 +++-----
>  mm/swap.h       | 4 ++--
>  mm/swap_state.c | 6 ++----
>  mm/swapfile.c   | 5 +----
>  4 files changed, 8 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e42fadc25268..dfdb620a9123 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4005,12 +4005,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  			if (PTR_ERR(folio) == -EBUSY)
>  				goto out;
>  			need_clear_cache = true;
> -			page = &folio->page;
>  		} else {
> -			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -						vmf);
> -			if (page)
> -				folio = page_folio(page);
> +			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
>  			swapcache = folio;
>  		}
>  
> @@ -4027,6 +4023,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  			goto unlock;
>  		}
>  
> +		page = folio_file_page(folio, swp_offset(entry));
> +
>  		/* Had to read the page from swap area: Major fault */
>  		ret = VM_FAULT_MAJOR;
>  		count_vm_event(PGMAJFAULT);
> diff --git a/mm/swap.h b/mm/swap.h
> index 40e902812cc5..aee134907a70 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -57,7 +57,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>  		struct mempolicy *mpol, pgoff_t ilx);
>  struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
>  			    struct vm_fault *vmf);
> -struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> +struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
>  			      struct vm_fault *vmf);
>  
>  static inline unsigned int folio_swap_flags(struct folio *folio)
> @@ -95,7 +95,7 @@ static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
>  	return NULL;
>  }
>  
> -static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> +static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
>  			struct vm_fault *vmf)
>  {
>  	return NULL;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 0a3fa48b3893..2a9c6bdff5ea 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -951,7 +951,7 @@ struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>   * it will read ahead blocks by cluster-based(ie, physical disk based)
>   * or vma-based(ie, virtual address based on faulty address) readahead.
>   */
> -struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> +struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  				struct vm_fault *vmf)
>  {
>  	struct mempolicy *mpol;
> @@ -964,9 +964,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>  	mpol_cond_put(mpol);
>  
> -	if (!folio)
> -		return NULL;
> -	return folio_file_page(folio, swp_offset(entry));
> +	return folio;
>  }
>  
>  #ifdef CONFIG_SYSFS
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4919423cce76..4dd894395a0f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1883,7 +1883,6 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  
>  		folio = swap_cache_get_folio(entry, vma, addr);
>  		if (!folio) {
> -			struct page *page;
>  			struct vm_fault vmf = {
>  				.vma = vma,
>  				.address = addr,
> @@ -1891,10 +1890,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				.pmd = pmd,
>  			};
>  
> -			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
>  						&vmf);
> -			if (page)
> -				folio = page_folio(page);
>  		}
>  		if (!folio) {
>  			swp_count = READ_ONCE(si->swap_map[offset]);
> -- 
> 2.43.0
> 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization
  2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
                   ` (9 preceding siblings ...)
  2024-03-26 18:50 ` [RFC PATCH 10/10] mm/swap: optimize synchronous swapin Kairui Song
@ 2024-03-27  2:52 ` Huang, Ying
  2024-03-27  3:01   ` Kairui Song
  10 siblings, 1 reply; 28+ messages in thread
From: Huang, Ying @ 2024-03-27  2:52 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Chris Li, Minchan Kim, Barry Song,
	Ryan Roberts, Yu Zhao, SeongJae Park, David Hildenbrand,
	Yosry Ahmed, Johannes Weiner, Matthew Wilcox, Nhat Pham,
	Chengming Zhou, Andrew Morton, linux-kernel

Hi, Kairui,

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> A month ago a bug was fixed for SWP_SYNCHRONOUS_IO swapin (swap cache
> bypass swapin):
> https://lore.kernel.org/linux-mm/20240219082040.7495-1-ryncsn@gmail.com/
>
> Because we have to spin on the swap map on race, and swap map is too small
> to contain more usable info, an ugly schedule_timeout_uninterruptible(1)
> is added. It's not the first time a hackish workaround was added for cache
> bypass swapin and not the last time. I did many experiments locally to
> see if the swap cache bypass path can be dropped while keeping the
> performance still comparable. And it seems doable.
>

In general, I think that it's a good idea to unify cache bypass swapin
and normal swapin.  But I haven't dived into the implementation yet.

> This series does the following things:
> 1. Remove swap cache bypass completely.
> 2. Apply multiple optimizations after that, these optimizations are
>    either undoable or very difficult to do without dropping the cache
>    bypass swapin path.
> 3. Use swap cache as a synchronization layer, also unify some code
>    with page cache (filemap).
>
> As a result, we have:
> 1. A comparable performance, some tests are even faster.
> 2. Multi-index support for swap cache.
> 3. Removed many hackish workarounds including above long tailing
>    issue is gone.
>
> Sending this as RFC to collect some discussion, suggestion, or rejection
> early, this seems need to be split into multiple series, but the
> performance is not good until the last patch so I think start by
> seperating them may make this approach not very convincing. And there
> are still some (maybe further) TODO items and optimization space
> if we are OK with this approach.
>
> This is based on my another series, for reusing filemap code for swapcache:
> [PATCH v2 0/4] mm/filemap: optimize folio adding and splitting
> https://lore.kernel.org/linux-mm/20240325171405.99971-1-ryncsn@gmail.com/
>
> Patch 1/10, introduce a helper from filemap side to be used later.
> Patch 2/10, 3/10 are clean up and prepare for removing the swap cache
>   bypass swapin path.
> Patch 4/10, removed the swap cache bypass swapin path, and the
>   performance drop heavily (-28%).
> Patch 5/10, apply the first optimization after the removal, since all
>   folios goes through swap cache now, there is no need to explicit shadow
>   clearing any more.
> Patch 6/10, apply another optimization after clean up shadow clearing
>   routines. Now swapcache is very alike page cache, so just reuse page
>   cache code and we will have multi-index support. Shadow memory usage
>   dropped a lot.
> Patch 7/10, just rename __read_swap_cache_async, it will be refactored
>   and a key part of this series, and the naming is very confusing to me.
> Patch 8/10, make swap cache as a synchronization layer, introduce two
>   helpers for adding folios to swap cache, caller will either succeed or
>   get a folio to wait on.
> Patch 9/10, apply another optimization. With above two helpers, looking
>   up of swapcache can be optimized and avoid false looking up, which
>   helped improve the performance.
> Patch 10/10, apply a major optimization for SWP_SYNCHRONOUS_IO devices,
>   after this commit, performance for simple swapin/swapout is basically
>   same as before.
>
> Test 1, sequential swapin/out of 30G zero page on ZRAM:
>
>                Before (us)        After (us)
> Swapout:       33619409           33886008
> Swapin:        32393771           32465441 (- 0.2%)
> Swapout (THP): 7817909            6899938  (+11.8%)
> Swapin (THP) : 32452387           33193479 (- 2.2%)

If my understanding is correct, we don't have swapin (THP) support
yet.  Right?

> And after swapping out 30G with THP, the radix node usage dropped by a
> lot:
>
> Before: radix_tree_node 73728K
> After:  radix_tree_node  7056K (-94%)

Good!

> Test 2:
> Mysql (16g buffer pool, 32G ZRAM SWAP, 4G memcg, Zswap disabled, THP never)
>   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
>   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
>   --threads=48 --time=300 --report-interval=10 run
>
> Before: transactions: 4849.25 per sec
> After:  transactions: 4849.40 per sec
>
> Test 3:
> Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP never)
>   echo never > /sys/kernel/mm/transparent_hugepage/enabled
>   echo 100 > /sys/module/zswap/parameters/max_pool_percent
>   echo 1 > /sys/module/zswap/parameters/enabled
>   echo y > /sys/module/zswap/parameters/shrinker_enabled
>
>   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
>   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
>   --threads=48 --time=600 --report-interval=10 run
>
> Before: transactions: 1662.90 per sec
> After:  transactions: 1726.52 per sec

3.8% improvement.  Good!

> Test 4:
> Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP always)
>   echo always > /sys/kernel/mm/transparent_hugepage/enabled
>   echo 100 > /sys/module/zswap/parameters/max_pool_percent
>   echo 1 > /sys/module/zswap/parameters/enabled
>   echo y > /sys/module/zswap/parameters/shrinker_enabled
>
>   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
>   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
>   --threads=48 --time=600 --report-interval=10 run
>
> Before: transactions: 2860.90 per sec.
> After:  transactions: 2802.55 per sec.
>
> Test 5:
> Memtier / memcached (16G brd SWAP, 8G memcg, THP never):
>
>   memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
>
>   memtier_benchmark -S /tmp/memcached.socket \
>     -P memcache_binary -n allkeys --key-minimum=1 \
>     --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
>     --ratio 1:0 --pipeline 8 -d 1000
>
> Before: 106730.31 Ops/sec
> After:  106360.11 Ops/sec
>
> Test 5:
> Memtier / memcached (16G brd SWAP, 8G memcg, THP always):
>
>   memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
>
>   memtier_benchmark -S /tmp/memcached.socket \
>     -P memcache_binary -n allkeys --key-minimum=1 \
>     --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
>     --ratio 1:0 --pipeline 8 -d 1000
>
> Before: 83193.11 Ops/sec
> After:  82504.89 Ops/sec
>
> These tests are tested under heavy memory stress, and the performance
> seems basically same as before,very slightly better/worse for certain
> cases, the benefits of multi-index are basically erased by
> fragmentation and workingset nodes usage is slightly lower.
>
> Some (maybe further) TODO items if we are OK with this approach:
>
> - I see a slight performance regression for THP tests,
>   could identify a clear hotspot with perf, my guess is the
>   content on the xa_lock is an issue (we have a xa_lock for
>   every 64M swap cache space), THP handling needs to take the lock
>   longer than usual. splitting the xa_lock to be more
>   fine-grained seems a good solution. We have
>   SWAP_ADDRESS_SPACE_SHIFT = 14 which is not an optimal value.
>   Considering XA_CHUNK_SHIFT is 6, we will have three layer of Xarray
>   just for 2 extra bits. 12 should be better to always make use of
>   the whole XA chunk and having two layers at most. But duplicated
>   address_space struct also wastes more memory and cacheline.
>   I see an observable performance drop (~3%) after change
>   SWAP_ADDRESS_SPACE_SHIFT to 12. Might be a good idea to
>   decouple swap cache xarray from address_space (there are
>   too many user for swapcache, shouldn't come too dirty).
>
> - Actually after patch Patch 4/10, the performance is much better for
>   tests limited with memory cgroup, until 10/10 applied the direct swap
>   cache freeing logic for SWP_SYNCHRONOUS_IO swapin. Because if the swap
>   device is not near full, swapin doesn't clear up the swapcache, so
>   repeated swapout doesn't need to re-alloc a swap entry, make things
>   faster. This may indicate that lazy freeing of swap cache could benifit
>   certain workloads and may worth looking into later.
>
> - Now SWP_SYNCHRONOUS_IO swapin will bypass readahead and force drop
>   swap cache after swapin is done, which can be cleaned up and optimized
>   further after this patch. Device type will only determine the
>   readahead logic, and swap cache drop check can be based purely on swap
>   count.
>
> - Recent mTHP swapin/swapout series should have no fundamental
>   conflict with this.
>
> Kairui Song (10):
>   mm/filemap: split filemap storing logic into a standalone helper
>   mm/swap: move no readahead swapin code to a stand-alone helper
>   mm/swap: convert swapin_readahead to return a folio
>   mm/swap: remove cache bypass swapin
>   mm/swap: clean shadow only in unmap path
>   mm/swap: switch to use multi index entries
>   mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get
>   mm/swap: use swap cache as a synchronization layer
>   mm/swap: delay the swap cache look up for swapin
>   mm/swap: optimize synchronous swapin
>
>  include/linux/swapops.h |   5 +-
>  mm/filemap.c            | 161 +++++++++-----
>  mm/huge_memory.c        |  78 +++----
>  mm/internal.h           |   2 +
>  mm/memory.c             | 133 ++++-------
>  mm/shmem.c              |  44 ++--
>  mm/swap.h               |  71 ++++--
>  mm/swap_state.c         | 478 +++++++++++++++++++++-------------------
>  mm/swapfile.c           |  64 +++---
>  mm/vmscan.c             |   8 +-
>  mm/workingset.c         |   2 +-
>  mm/zswap.c              |   4 +-
>  12 files changed, 540 insertions(+), 510 deletions(-)

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization
  2024-03-27  2:52 ` [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Huang, Ying
@ 2024-03-27  3:01   ` Kairui Song
  2024-03-27  8:27     ` Ryan Roberts
  0 siblings, 1 reply; 28+ messages in thread
From: Kairui Song @ 2024-03-27  3:01 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel

On Wed, Mar 27, 2024 at 10:54 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Kairui,
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > A month ago a bug was fixed for SWP_SYNCHRONOUS_IO swapin (swap cache
> > bypass swapin):
> > https://lore.kernel.org/linux-mm/20240219082040.7495-1-ryncsn@gmail.com/
> >
> > Because we have to spin on the swap map on race, and swap map is too small
> > to contain more usable info, an ugly schedule_timeout_uninterruptible(1)
> > is added. It's not the first time a hackish workaround was added for cache
> > bypass swapin and not the last time. I did many experiments locally to
> > see if the swap cache bypass path can be dropped while keeping the
> > performance still comparable. And it seems doable.
> >
>
> In general, I think that it's a good idea to unify cache bypass swapin
> and normal swapin.  But I haven't dived into the implementation yet.

Thanks!

This series might be a bit too large; I can try to split it for easier
review later, if we are OK with this idea.

>
> > This series does the following things:
> > 1. Remove swap cache bypass completely.
> > 2. Apply multiple optimizations after that, these optimizations are
> >    either undoable or very difficult to do without dropping the cache
> >    bypass swapin path.
> > 3. Use swap cache as a synchronization layer, also unify some code
> >    with page cache (filemap).
> >
> > As a result, we have:
> > 1. A comparable performance, some tests are even faster.
> > 2. Multi-index support for swap cache.
> > 3. Removed many hackish workarounds including above long tailing
> >    issue is gone.
> >
> > Sending this as RFC to collect some discussion, suggestion, or rejection
> > early, this seems need to be split into multiple series, but the
> > performance is not good until the last patch so I think start by
> > seperating them may make this approach not very convincing. And there
> > are still some (maybe further) TODO items and optimization space
> > if we are OK with this approach.
> >
> > This is based on my another series, for reusing filemap code for swapcache:
> > [PATCH v2 0/4] mm/filemap: optimize folio adding and splitting
> > https://lore.kernel.org/linux-mm/20240325171405.99971-1-ryncsn@gmail.com/
> >
> > Patch 1/10, introduce a helper from filemap side to be used later.
> > Patch 2/10, 3/10 are clean up and prepare for removing the swap cache
> >   bypass swapin path.
> > Patch 4/10, removed the swap cache bypass swapin path, and the
> >   performance drop heavily (-28%).
> > Patch 5/10, apply the first optimization after the removal, since all
> >   folios goes through swap cache now, there is no need to explicit shadow
> >   clearing any more.
> > Patch 6/10, apply another optimization after clean up shadow clearing
> >   routines. Now swapcache is very alike page cache, so just reuse page
> >   cache code and we will have multi-index support. Shadow memory usage
> >   dropped a lot.
> > Patch 7/10, just rename __read_swap_cache_async, it will be refactored
> >   and a key part of this series, and the naming is very confusing to me.
> > Patch 8/10, make swap cache as a synchronization layer, introduce two
> >   helpers for adding folios to swap cache, caller will either succeed or
> >   get a folio to wait on.
> > Patch 9/10, apply another optimization. With above two helpers, looking
> >   up of swapcache can be optimized and avoid false looking up, which
> >   helped improve the performance.
> > Patch 10/10, apply a major optimization for SWP_SYNCHRONOUS_IO devices,
> >   after this commit, performance for simple swapin/swapout is basically
> >   same as before.
> >
> > Test 1, sequential swapin/out of 30G zero page on ZRAM:
> >
> >                Before (us)        After (us)
> > Swapout:       33619409           33886008
> > Swapin:        32393771           32465441 (- 0.2%)
> > Swapout (THP): 7817909            6899938  (+11.8%)
> > Swapin (THP) : 32452387           33193479 (- 2.2%)
>
> If my understanding is correct, we don't have swapin (THP) support
> yet.  Right?

Yes, this series doesn't change how swapin/swapout works with THP in
general, but THP swapout will now leave large-order shadows, which need
to be split upon swapin. That will slow later swapins down a little bit,
but I think it's worth it.

If we can do THP swapin in the future, this split on swapin can be
avoided, making the performance even better.
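
Just to illustrate the xarray operation involved, here is a rough sketch
(this is not code from the series; the helper name split_large_shadow()
and the old_order parameter are made up for the example, and it assumes
the shadow is stored as a single multi-index entry in the swap cache
xarray as described above):

static int split_large_shadow(struct address_space *address_space,
			      pgoff_t idx, void *shadow,
			      unsigned int old_order, gfp_t gfp)
{
	XA_STATE(xas, &address_space->i_pages, idx);

	/* Pre-allocate the nodes the split will need, outside the lock. */
	xas_split_alloc(&xas, shadow, old_order, gfp);
	if (xas_error(&xas))
		return xas_error(&xas);

	xas_lock_irq(&xas);
	/* Replace the single multi-index shadow with per-slot copies. */
	xas_split(&xas, shadow, old_order);
	xas_unlock_irq(&xas);
	return 0;
}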

>
> > And after swapping out 30G with THP, the radix node usage dropped by a
> > lot:
> >
> > Before: radix_tree_node 73728K
> > After:  radix_tree_node  7056K (-94%)
>
> Good!
>
> > Test 2:
> > Mysql (16g buffer pool, 32G ZRAM SWAP, 4G memcg, Zswap disabled, THP never)
> >   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> >   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> >   --threads=48 --time=300 --report-interval=10 run
> >
> > Before: transactions: 4849.25 per sec
> > After:  transactions: 4849.40 per sec
> >
> > Test 3:
> > Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP never)
> >   echo never > /sys/kernel/mm/transparent_hugepage/enabled
> >   echo 100 > /sys/module/zswap/parameters/max_pool_percent
> >   echo 1 > /sys/module/zswap/parameters/enabled
> >   echo y > /sys/module/zswap/parameters/shrinker_enabled
> >
> >   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> >   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> >   --threads=48 --time=600 --report-interval=10 run
> >
> > Before: transactions: 1662.90 per sec
> > After:  transactions: 1726.52 per sec
>
> 3.8% improvement.  Good!
>
> > Test 4:
> > Mysql (16g buffer pool, NVME SWAP, 4G memcg, Zswap enabled, THP always)
> >   echo always > /sys/kernel/mm/transparent_hugepage/enabled
> >   echo 100 > /sys/module/zswap/parameters/max_pool_percent
> >   echo 1 > /sys/module/zswap/parameters/enabled
> >   echo y > /sys/module/zswap/parameters/shrinker_enabled
> >
> >   sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-user=root \
> >   --mysql-password=1234 --mysql-db=sb --tables=36 --table-size=2000000 \
> >   --threads=48 --time=600 --report-interval=10 run
> >
> > Before: transactions: 2860.90 per sec.
> > After:  transactions: 2802.55 per sec.
> >
> > Test 5:
> > Memtier / memcached (16G brd SWAP, 8G memcg, THP never):
> >
> >   memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
> >
> >   memtier_benchmark -S /tmp/memcached.socket \
> >     -P memcache_binary -n allkeys --key-minimum=1 \
> >     --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
> >     --ratio 1:0 --pipeline 8 -d 1000
> >
> > Before: 106730.31 Ops/sec
> > After:  106360.11 Ops/sec
> >
> > Test 5:
> > Memtier / memcached (16G brd SWAP, 8G memcg, THP always):
> >
> >   memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 16 -B binary &
> >
> >   memtier_benchmark -S /tmp/memcached.socket \
> >     -P memcache_binary -n allkeys --key-minimum=1 \
> >     --key-maximum=24000000 --key-pattern=P:P -c 1 -t 16 \
> >     --ratio 1:0 --pipeline 8 -d 1000
> >
> > Before: 83193.11 Ops/sec
> > After:  82504.89 Ops/sec
> >
> > These tests are tested under heavy memory stress, and the performance
> > seems basically same as before,very slightly better/worse for certain
> > cases, the benefits of multi-index are basically erased by
> > fragmentation and workingset nodes usage is slightly lower.
> >
> > Some (maybe further) TODO items if we are OK with this approach:
> >
> > - I see a slight performance regression for THP tests,
> >   could identify a clear hotspot with perf, my guess is the
> >   content on the xa_lock is an issue (we have a xa_lock for
> >   every 64M swap cache space), THP handling needs to take the lock
> >   longer than usual. splitting the xa_lock to be more
> >   fine-grained seems a good solution. We have
> >   SWAP_ADDRESS_SPACE_SHIFT = 14 which is not an optimal value.
> >   Considering XA_CHUNK_SHIFT is 6, we will have three layer of Xarray
> >   just for 2 extra bits. 12 should be better to always make use of
> >   the whole XA chunk and having two layers at most. But duplicated
> >   address_space struct also wastes more memory and cacheline.
> >   I see an observable performance drop (~3%) after change
> >   SWAP_ADDRESS_SPACE_SHIFT to 12. Might be a good idea to
> >   decouple swap cache xarray from address_space (there are
> >   too many user for swapcache, shouldn't come too dirty).
> >
> > - Actually after patch Patch 4/10, the performance is much better for
> >   tests limited with memory cgroup, until 10/10 applied the direct swap
> >   cache freeing logic for SWP_SYNCHRONOUS_IO swapin. Because if the swap
> >   device is not near full, swapin doesn't clear up the swapcache, so
> >   repeated swapout doesn't need to re-alloc a swap entry, make things
> >   faster. This may indicate that lazy freeing of swap cache could benifit
> >   certain workloads and may worth looking into later.
> >
> > - Now SWP_SYNCHRONOUS_IO swapin will bypass readahead and force drop
> >   swap cache after swapin is done, which can be cleaned up and optimized
> >   further after this patch. Device type will only determine the
> >   readahead logic, and swap cache drop check can be based purely on swap
> >   count.
> >
> > - Recent mTHP swapin/swapout series should have no fundamental
> >   conflict with this.
> >
> > Kairui Song (10):
> >   mm/filemap: split filemap storing logic into a standalone helper
> >   mm/swap: move no readahead swapin code to a stand-alone helper
> >   mm/swap: convert swapin_readahead to return a folio
> >   mm/swap: remove cache bypass swapin
> >   mm/swap: clean shadow only in unmap path
> >   mm/swap: switch to use multi index entries
> >   mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get
> >   mm/swap: use swap cache as a synchronization layer
> >   mm/swap: delay the swap cache look up for swapin
> >   mm/swap: optimize synchronous swapin
> >
> >  include/linux/swapops.h |   5 +-
> >  mm/filemap.c            | 161 +++++++++-----
> >  mm/huge_memory.c        |  78 +++----
> >  mm/internal.h           |   2 +
> >  mm/memory.c             | 133 ++++-------
> >  mm/shmem.c              |  44 ++--
> >  mm/swap.h               |  71 ++++--
> >  mm/swap_state.c         | 478 +++++++++++++++++++++-------------------
> >  mm/swapfile.c           |  64 +++---
> >  mm/vmscan.c             |   8 +-
> >  mm/workingset.c         |   2 +-
> >  mm/zswap.c              |   4 +-
> >  12 files changed, 540 insertions(+), 510 deletions(-)
>
> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin
  2024-03-26 18:50 ` [RFC PATCH 10/10] mm/swap: optimize synchronous swapin Kairui Song
@ 2024-03-27  6:22   ` Huang, Ying
  2024-03-27  6:37     ` Kairui Song
  2024-03-27  8:08   ` Barry Song
  1 sibling, 1 reply; 28+ messages in thread
From: Huang, Ying @ 2024-03-27  6:22 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Chris Li, Minchan Kim, Barry Song,
	Ryan Roberts, Yu Zhao, SeongJae Park, David Hildenbrand,
	Yosry Ahmed, Johannes Weiner, Matthew Wilcox, Nhat Pham,
	Chengming Zhou, Andrew Morton, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Interestingly the major performance overhead of synchronous is actually
> from the workingset nodes update, that's because synchronous swap in

If it's the major overhead, why not make it the first optimization?

> keeps adding single folios into a xa_node, making the node no longer
> a shadow node and have to be removed from shadow_nodes, then remove
> the folio very shortly and making the node a shadow node again,
> so it has to add back to the shadow_nodes.

The folio is removed only if should_try_to_free_swap() returns true?

> Mark synchronous swapin folio with a special bit in swap entry embedded
> in folio->swap, as we still have some usable bits there. Skip workingset
> node update on insertion of such folio because it will be removed very
> quickly, and will trigger the update ensuring the workingset info is
> eventual consensus.

Is this safe?  Is it possible for the shadow node to be reclaimed after
the folio is added into the node and before it is removed?

If so, we may consider some other methods.  Make shadow_nodes per-cpu?

> Test result of sequential swapin/out of 30G zero page on ZRAM:
>
>                Before (us)        After (us)
> Swapout:       33853883           33886008
> Swapin:        38336519           32465441 (+15.4%)
> Swapout (THP): 6814619            6899938
> Swapin (THP) : 38383367           33193479 (+13.6%)
>

[snip]

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 04/10] mm/swap: remove cache bypass swapin
  2024-03-26 18:50 ` [RFC PATCH 04/10] mm/swap: remove cache bypass swapin Kairui Song
@ 2024-03-27  6:30   ` Huang, Ying
  2024-03-27  6:55     ` Kairui Song
  0 siblings, 1 reply; 28+ messages in thread
From: Huang, Ying @ 2024-03-27  6:30 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kairui Song, Chris Li, Minchan Kim, Barry Song,
	Ryan Roberts, Yu Zhao, SeongJae Park, David Hildenbrand,
	Yosry Ahmed, Johannes Weiner, Matthew Wilcox, Nhat Pham,
	Chengming Zhou, Andrew Morton, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> We used to have the cache bypass swapin path for better performance,
> but by removing it, more optimization can be applied and have
> an even better overall performance and less hackish.
>
> And these optimizations are not easily doable or not doable at all
> without this.
>
> This patch simply removes it, and the performance will drop heavily
> for simple swapin, things won't get this worse for real workloads
> but still observable. Following commits will fix this and archive a
> better performance.
>
> Swapout/in 30G zero pages from ZRAM (This mostly measures overhead
> of swap path itself, because zero pages are not compressed but simply
> recorded in ZRAM, and performance drops more as SWAP device is getting
> full):
>
> Test result of sequential swapin/out:
>
>                Before (us)        After (us)
> Swapout:       33619409           33624641
> Swapin:        32393771           41614858 (-28.4%)
> Swapout (THP): 7817909            7795530
> Swapin (THP) : 32452387           41708471 (-28.4%)
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c     | 18 ++++-------------
>  mm/swap.h       | 10 +++++-----
>  mm/swap_state.c | 53 ++++++++++---------------------------------------
>  mm/swapfile.c   | 13 ------------
>  4 files changed, 19 insertions(+), 75 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index dfdb620a9123..357d239ee2f6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3932,7 +3932,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	struct page *page;
>  	struct swap_info_struct *si = NULL;
>  	rmap_t rmap_flags = RMAP_NONE;
> -	bool need_clear_cache = false;
>  	bool exclusive = false;
>  	swp_entry_t entry;
>  	pte_t pte;
> @@ -4000,14 +3999,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!folio) {
>  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>  		    __swap_count(entry) == 1) {
> -			/* skip swapcache and readahead */
>  			folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
> -			if (PTR_ERR(folio) == -EBUSY)
> -				goto out;
> -			need_clear_cache = true;
>  		} else {
>  			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
> -			swapcache = folio;
>  		}
>  
>  		if (!folio) {
> @@ -4023,6 +4017,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  			goto unlock;
>  		}
>  
> +		swapcache = folio;
>  		page = folio_file_page(folio, swp_offset(entry));
>  
>  		/* Had to read the page from swap area: Major fault */
> @@ -4187,7 +4182,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	vmf->orig_pte = pte;
>  
>  	/* ksm created a completely new copy */
> -	if (unlikely(folio != swapcache && swapcache)) {
> +	if (unlikely(folio != swapcache)) {
>  		folio_add_new_anon_rmap(folio, vma, vmf->address);
>  		folio_add_lru_vma(folio, vma);
>  	} else {
> @@ -4201,7 +4196,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>  
>  	folio_unlock(folio);
> -	if (folio != swapcache && swapcache) {
> +	if (folio != swapcache) {
>  		/*
>  		 * Hold the lock to avoid the swap entry to be reused
>  		 * until we take the PT lock for the pte_same() check
> @@ -4227,9 +4222,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
> -	/* Clear the swap cache pin for direct swapin after PTL unlock */
> -	if (need_clear_cache)
> -		swapcache_clear(si, entry);
>  	if (si)
>  		put_swap_device(si);
>  	return ret;
> @@ -4240,12 +4232,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	folio_unlock(folio);
>  out_release:
>  	folio_put(folio);
> -	if (folio != swapcache && swapcache) {
> +	if (folio != swapcache) {
>  		folio_unlock(swapcache);
>  		folio_put(swapcache);
>  	}
> -	if (need_clear_cache)
> -		swapcache_clear(si, entry);
>  	if (si)
>  		put_swap_device(si);
>  	return ret;
> diff --git a/mm/swap.h b/mm/swap.h
> index aee134907a70..ac9573b03432 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -41,7 +41,6 @@ void __delete_from_swap_cache(struct folio *folio,
>  void delete_from_swap_cache(struct folio *folio);
>  void clear_shadow_from_swap_cache(int type, unsigned long begin,
>  				  unsigned long end);
> -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
>  struct folio *swap_cache_get_folio(swp_entry_t entry,
>  		struct vm_area_struct *vma, unsigned long addr);
>  struct folio *filemap_get_incore_folio(struct address_space *mapping,
> @@ -100,14 +99,15 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
>  {
>  	return NULL;
>  }
> -
> -static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> +static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> +			struct vm_fault *vmf);
>  {
> -	return 0;
> +	return NULL;
>  }
>  
> -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> +static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>  {
> +	return 0;
>  }
>  
>  static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 2a9c6bdff5ea..49ef6250f676 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -880,61 +880,28 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>  }
>  
>  /**
> - * swapin_direct - swap in folios skipping swap cache and readahead
> + * swapin_direct - swap in folios skipping readahead
>   * @entry: swap entry of this memory
>   * @gfp_mask: memory allocation flags
>   * @vmf: fault information
>   *
> - * Returns the struct folio for entry and addr after the swap entry is read
> - * in.
> + * Returns the folio for entry after it is read in.
>   */
>  struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>  			    struct vm_fault *vmf)
>  {
> -	struct vm_area_struct *vma = vmf->vma;
> +	struct mempolicy *mpol;
>  	struct folio *folio;
> -	void *shadow = NULL;
> -
> -	/*
> -	 * Prevent parallel swapin from proceeding with
> -	 * the cache flag. Otherwise, another thread may
> -	 * finish swapin first, free the entry, and swapout
> -	 * reusing the same entry. It's undetectable as
> -	 * pte_same() returns true due to entry reuse.
> -	 */
> -	if (swapcache_prepare(entry)) {
> -		/* Relax a bit to prevent rapid repeated page faults */
> -		schedule_timeout_uninterruptible(1);
> -		return ERR_PTR(-EBUSY);
> -	}
> -
> -	/* skip swapcache */
> -	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -				vma, vmf->address, false);
> -	if (folio) {
> -		__folio_set_locked(folio);
> -		__folio_set_swapbacked(folio);
> -
> -		if (mem_cgroup_swapin_charge_folio(folio,
> -					vma->vm_mm, GFP_KERNEL,
> -					entry)) {
> -			folio_unlock(folio);
> -			folio_put(folio);
> -			return NULL;
> -		}
> -		mem_cgroup_swapin_uncharge_swap(entry);
> -
> -		shadow = get_shadow_from_swap_cache(entry);
> -		if (shadow)
> -			workingset_refault(folio, shadow);
> +	bool page_allocated;
> +	pgoff_t ilx;
>  
> -		folio_add_lru(folio);
> +	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> +	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
> +					&page_allocated, false);
> +	mpol_cond_put(mpol);
>  
> -		/* To provide entry to swap_read_folio() */
> -		folio->swap = entry;
> +	if (page_allocated)
>  		swap_read_folio(folio, true, NULL);
> -		folio->private = NULL;
> -	}
>  
>  	return folio;
>  }

This looks similar to read_swap_cache_async().  Can we merge them?

And we should avoid readahead in swapin_readahead() or
swap_vma_readahead() for SWP_SYNCHRONOUS_IO anyway.  So it appears that
we can change swapin_readahead() and use it directly?
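
Something along the following lines, as a rough sketch of the idea only
(untested; the wrapper name swapin_entry() is made up here, while
swapin_direct()/swapin_readahead() are the helpers from this series):

struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp_mask,
			   struct vm_fault *vmf)
{
	struct swap_info_struct *si = swp_swap_info(entry);

	/* Readahead buys nothing for a synchronous device with a sole owner. */
	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1)
		return swapin_direct(entry, gfp_mask, vmf);

	return swapin_readahead(entry, gfp_mask, vmf);
}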

> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4dd894395a0f..ae8d3aa05df7 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3389,19 +3389,6 @@ int swapcache_prepare(swp_entry_t entry)
>  	return __swap_duplicate(entry, SWAP_HAS_CACHE);
>  }
>  
> -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> -{
> -	struct swap_cluster_info *ci;
> -	unsigned long offset = swp_offset(entry);
> -	unsigned char usage;
> -
> -	ci = lock_cluster_or_swap_info(si, offset);
> -	usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
> -	unlock_cluster_or_swap_info(si, ci);
> -	if (!usage)
> -		free_swap_slot(entry);
> -}
> -
>  struct swap_info_struct *swp_swap_info(swp_entry_t entry)
>  {
>  	return swap_type_to_swap_info(swp_type(entry));

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin
  2024-03-27  6:22   ` Huang, Ying
@ 2024-03-27  6:37     ` Kairui Song
  2024-03-27  6:47       ` Huang, Ying
  0 siblings, 1 reply; 28+ messages in thread
From: Kairui Song @ 2024-03-27  6:37 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel

On Wed, Mar 27, 2024 at 2:24 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > Interestingly the major performance overhead of synchronous is actually
> > from the workingset nodes update, that's because synchronous swap in
>
> If it's the major overhead, why not make it the first optimization?

This performance issue became much more obvious after the other
optimizations were done, and those optimizations are for general swapin,
not only for synchronous swapin. That's also how I optimized things step
by step, so I kept my patch order...

And it is easier to do this after Patch 8/10 which introduces the new
interface for swap cache.

>
> > keeps adding single folios into a xa_node, making the node no longer
> > a shadow node and have to be removed from shadow_nodes, then remove
> > the folio very shortly and making the node a shadow node again,
> > so it has to add back to the shadow_nodes.
>
> The folio is removed only if should_try_to_free_swap() returns true?
>
> > Mark synchronous swapin folio with a special bit in swap entry embedded
> > in folio->swap, as we still have some usable bits there. Skip workingset
> > node update on insertion of such folio because it will be removed very
> > quickly, and will trigger the update ensuring the workingset info is
> > eventual consensus.
>
> Is this safe?  Is it possible for the shadow node to be reclaimed after
> the folio is added into the node and before it is removed?

If an xa node contains any non-shadow entry, it can't be reclaimed;
shadow_lru_isolate() will check and skip such nodes in case of a race.

>
> If so, we may consider some other methods.  Make shadow_nodes per-cpu?

That's also an alternative solution if there are other risks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin
  2024-03-27  6:37     ` Kairui Song
@ 2024-03-27  6:47       ` Huang, Ying
  2024-03-27  7:14         ` Kairui Song
  0 siblings, 1 reply; 28+ messages in thread
From: Huang, Ying @ 2024-03-27  6:47 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> On Wed, Mar 27, 2024 at 2:24 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > Interestingly the major performance overhead of synchronous is actually
>> > from the workingset nodes update, that's because synchronous swap in
>>
>> If it's the major overhead, why not make it the first optimization?
>
> This performance issue became much more obvious after the other
> optimizations were done, and those optimizations are for general swapin,
> not only for synchronous swapin. That's also how I optimized things step
> by step, so I kept my patch order...
>
> And it is easier to do this after Patch 8/10 which introduces the new
> interface for swap cache.
>
>>
>> > keeps adding single folios into a xa_node, making the node no longer
>> > a shadow node and have to be removed from shadow_nodes, then remove
>> > the folio very shortly and making the node a shadow node again,
>> > so it has to add back to the shadow_nodes.
>>
>> The folio is removed only if should_try_to_free_swap() returns true?
>>
>> > Mark synchronous swapin folio with a special bit in swap entry embedded
>> > in folio->swap, as we still have some usable bits there. Skip workingset
>> > node update on insertion of such folio because it will be removed very
>> > quickly, and will trigger the update ensuring the workingset info is
>> > eventual consensus.
>>
>> Is this safe?  Is it possible for the shadow node to be reclaimed after
>> the folio is added into the node and before it is removed?
>
> If an xa node contains any non-shadow entry, it can't be reclaimed;
> shadow_lru_isolate() will check and skip such nodes in case of a race.

In shadow_lru_isolate(),

	/*
	 * The nodes should only contain one or more shadow entries,
	 * no pages, so we expect to be able to remove them all and
	 * delete and free the empty node afterwards.
	 */
	if (WARN_ON_ONCE(!node->nr_values))
		goto out_invalid;
	if (WARN_ON_ONCE(node->count != node->nr_values))
		goto out_invalid;

So, this isn't considered normal and will cause a warning now.

>>
>> If so, we may consider some other methods.  Make shadow_nodes per-cpu?
>
> That's also an alternative solution if there are other risks.

This appears to be a more general and cleaner optimization.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 04/10] mm/swap: remove cache bypass swapin
  2024-03-27  6:30   ` Huang, Ying
@ 2024-03-27  6:55     ` Kairui Song
  2024-03-27  7:29       ` Huang, Ying
  0 siblings, 1 reply; 28+ messages in thread
From: Kairui Song @ 2024-03-27  6:55 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel

On Wed, Mar 27, 2024 at 2:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > We used to have the cache bypass swapin path for better performance,
> > but by removing it, more optimization can be applied and have
> > an even better overall performance and less hackish.
> >
> > And these optimizations are not easily doable or not doable at all
> > without this.
> >
> > This patch simply removes it, and the performance will drop heavily
> > for simple swapin, things won't get this worse for real workloads
> > but still observable. Following commits will fix this and archive a
> > better performance.
> >
> > Swapout/in 30G zero pages from ZRAM (This mostly measures overhead
> > of swap path itself, because zero pages are not compressed but simply
> > recorded in ZRAM, and performance drops more as SWAP device is getting
> > full):
> >
> > Test result of sequential swapin/out:
> >
> >                Before (us)        After (us)
> > Swapout:       33619409           33624641
> > Swapin:        32393771           41614858 (-28.4%)
> > Swapout (THP): 7817909            7795530
> > Swapin (THP) : 32452387           41708471 (-28.4%)
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/memory.c     | 18 ++++-------------
> >  mm/swap.h       | 10 +++++-----
> >  mm/swap_state.c | 53 ++++++++++---------------------------------------
> >  mm/swapfile.c   | 13 ------------
> >  4 files changed, 19 insertions(+), 75 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index dfdb620a9123..357d239ee2f6 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3932,7 +3932,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       struct page *page;
> >       struct swap_info_struct *si = NULL;
> >       rmap_t rmap_flags = RMAP_NONE;
> > -     bool need_clear_cache = false;
> >       bool exclusive = false;
> >       swp_entry_t entry;
> >       pte_t pte;
> > @@ -4000,14 +3999,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       if (!folio) {
> >               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >                   __swap_count(entry) == 1) {
> > -                     /* skip swapcache and readahead */
> >                       folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
> > -                     if (PTR_ERR(folio) == -EBUSY)
> > -                             goto out;
> > -                     need_clear_cache = true;
> >               } else {
> >                       folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
> > -                     swapcache = folio;
> >               }
> >
> >               if (!folio) {
> > @@ -4023,6 +4017,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                       goto unlock;
> >               }
> >
> > +             swapcache = folio;
> >               page = folio_file_page(folio, swp_offset(entry));
> >
> >               /* Had to read the page from swap area: Major fault */
> > @@ -4187,7 +4182,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       vmf->orig_pte = pte;
> >
> >       /* ksm created a completely new copy */
> > -     if (unlikely(folio != swapcache && swapcache)) {
> > +     if (unlikely(folio != swapcache)) {
> >               folio_add_new_anon_rmap(folio, vma, vmf->address);
> >               folio_add_lru_vma(folio, vma);
> >       } else {
> > @@ -4201,7 +4196,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> >
> >       folio_unlock(folio);
> > -     if (folio != swapcache && swapcache) {
> > +     if (folio != swapcache) {
> >               /*
> >                * Hold the lock to avoid the swap entry to be reused
> >                * until we take the PT lock for the pte_same() check
> > @@ -4227,9 +4222,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       if (vmf->pte)
> >               pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out:
> > -     /* Clear the swap cache pin for direct swapin after PTL unlock */
> > -     if (need_clear_cache)
> > -             swapcache_clear(si, entry);
> >       if (si)
> >               put_swap_device(si);
> >       return ret;
> > @@ -4240,12 +4232,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       folio_unlock(folio);
> >  out_release:
> >       folio_put(folio);
> > -     if (folio != swapcache && swapcache) {
> > +     if (folio != swapcache) {
> >               folio_unlock(swapcache);
> >               folio_put(swapcache);
> >       }
> > -     if (need_clear_cache)
> > -             swapcache_clear(si, entry);
> >       if (si)
> >               put_swap_device(si);
> >       return ret;
> > diff --git a/mm/swap.h b/mm/swap.h
> > index aee134907a70..ac9573b03432 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -41,7 +41,6 @@ void __delete_from_swap_cache(struct folio *folio,
> >  void delete_from_swap_cache(struct folio *folio);
> >  void clear_shadow_from_swap_cache(int type, unsigned long begin,
> >                                 unsigned long end);
> > -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
> >  struct folio *swap_cache_get_folio(swp_entry_t entry,
> >               struct vm_area_struct *vma, unsigned long addr);
> >  struct folio *filemap_get_incore_folio(struct address_space *mapping,
> > @@ -100,14 +99,15 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> >  {
> >       return NULL;
> >  }
> > -
> > -static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> > +static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
> > +                     struct vm_fault *vmf);
> >  {
> > -     return 0;
> > +     return NULL;
> >  }
> >
> > -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
> > +static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> >  {
> > +     return 0;
> >  }
> >
> >  static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 2a9c6bdff5ea..49ef6250f676 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -880,61 +880,28 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> >  }
> >
> >  /**
> > - * swapin_direct - swap in folios skipping swap cache and readahead
> > + * swapin_direct - swap in folios skipping readahead
> >   * @entry: swap entry of this memory
> >   * @gfp_mask: memory allocation flags
> >   * @vmf: fault information
> >   *
> > - * Returns the struct folio for entry and addr after the swap entry is read
> > - * in.
> > + * Returns the folio for entry after it is read in.
> >   */
> >  struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
> >                           struct vm_fault *vmf)
> >  {
> > -     struct vm_area_struct *vma = vmf->vma;
> > +     struct mempolicy *mpol;
> >       struct folio *folio;
> > -     void *shadow = NULL;
> > -
> > -     /*
> > -      * Prevent parallel swapin from proceeding with
> > -      * the cache flag. Otherwise, another thread may
> > -      * finish swapin first, free the entry, and swapout
> > -      * reusing the same entry. It's undetectable as
> > -      * pte_same() returns true due to entry reuse.
> > -      */
> > -     if (swapcache_prepare(entry)) {
> > -             /* Relax a bit to prevent rapid repeated page faults */
> > -             schedule_timeout_uninterruptible(1);
> > -             return ERR_PTR(-EBUSY);
> > -     }
> > -
> > -     /* skip swapcache */
> > -     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -                             vma, vmf->address, false);
> > -     if (folio) {
> > -             __folio_set_locked(folio);
> > -             __folio_set_swapbacked(folio);
> > -
> > -             if (mem_cgroup_swapin_charge_folio(folio,
> > -                                     vma->vm_mm, GFP_KERNEL,
> > -                                     entry)) {
> > -                     folio_unlock(folio);
> > -                     folio_put(folio);
> > -                     return NULL;
> > -             }
> > -             mem_cgroup_swapin_uncharge_swap(entry);
> > -
> > -             shadow = get_shadow_from_swap_cache(entry);
> > -             if (shadow)
> > -                     workingset_refault(folio, shadow);
> > +     bool page_allocated;
> > +     pgoff_t ilx;
> >
> > -             folio_add_lru(folio);
> > +     mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > +     folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
> > +                                     &page_allocated, false);
> > +     mpol_cond_put(mpol);
> >
> > -             /* To provide entry to swap_read_folio() */
> > -             folio->swap = entry;
> > +     if (page_allocated)
> >               swap_read_folio(folio, true, NULL);
> > -             folio->private = NULL;
> > -     }
> >
> >       return folio;
> >  }
>
> This looks similar as read_swap_cache_async().  Can we merge them?

Yes, that's doable. But I may have to split it out again for later
optimizations though.

>
> And, we should avoid to readahead in swapin_readahead() or
> swap_vma_readahead() for SWP_SYNCHRONOUS_IO anyway.  So, it appears that
> we can change and use swapin_readahead() directly?

Good point, SWP_SYNCHRONOUS_IO check can be extended more after this
series, but readahead optimization could be another series (like the
previous one which tried to unify readahead for shmem/anon), so I
thought it's better to keep it untouched for now.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin
  2024-03-27  6:47       ` Huang, Ying
@ 2024-03-27  7:14         ` Kairui Song
  2024-03-27  8:16           ` Huang, Ying
  0 siblings, 1 reply; 28+ messages in thread
From: Kairui Song @ 2024-03-27  7:14 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel

On Wed, Mar 27, 2024 at 2:49 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Wed, Mar 27, 2024 at 2:24 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Kairui Song <ryncsn@gmail.com> writes:
> >>
> >> > From: Kairui Song <kasong@tencent.com>
> >> >
> >> > Interestingly the major performance overhead of synchronous is actually
> >> > from the workingset nodes update, that's because synchronous swap in
> >>
> >> If it's the major overhead, why not make it the first optimization?
> >
> > This performance issue became much more obvious after doing other
> > optimizations, and other optimizations are for general swapin not only
> > for synchronous swapin, that's also how I optimized things step by
> > step, so I kept my patch order...
> >
> > And it is easier to do this after Patch 8/10 which introduces the new
> > interface for swap cache.
> >
> >>
> >> > keeps adding single folios into a xa_node, making the node no longer
> >> > a shadow node and have to be removed from shadow_nodes, then remove
> >> > the folio very shortly and making the node a shadow node again,
> >> > so it has to add back to the shadow_nodes.
> >>
> >> The folio is removed only if should_try_to_free_swap() returns true?
> >>
> >> > Mark synchronous swapin folio with a special bit in swap entry embedded
> >> > in folio->swap, as we still have some usable bits there. Skip workingset
> >> > node update on insertion of such folio because it will be removed very
> >> > quickly, and will trigger the update ensuring the workingset info is
> >> > eventual consensus.
> >>
> >> Is this safe?  Is it possible for the shadow node to be reclaimed after
> >> the folio are added into node and before being removed?
> >
> > If a xa node contains any non-shadow entry, it can't be reclaimed,
> > shadow_lru_isolate will check and skip such nodes in case of race.
>
> In shadow_lru_isolate(),
>
>         /*
>          * The nodes should only contain one or more shadow entries,
>          * no pages, so we expect to be able to remove them all and
>          * delete and free the empty node afterwards.
>          */
>         if (WARN_ON_ONCE(!node->nr_values))
>                 goto out_invalid;
>         if (WARN_ON_ONCE(node->count != node->nr_values))
>                 goto out_invalid;
>
> So, this isn't considered normal and will cause warning now.

Yes, I added an exception in this patch:
-       if (WARN_ON_ONCE(node->count != node->nr_values))
+       if (WARN_ON_ONCE(node->count != node->nr_values &&
mapping->host != NULL))

The code is not a good final solution, but the idea might not be that
bad, list_lru provides many operations like LRU_ROTATE, we can even
lazy remove all the nodes as a general optimization, or add a
threshold for adding/removing a node from LRU.

>
> >>
> >> If so, we may consider some other methods.  Make shadow_nodes per-cpu?
> >
> > That's also an alternative solution if there are other risks.
>
> This appears a general optimization and more clean.

I'm not sure if synchronization between CPUs will make more burden,
because shadow nodes are globally shared, one node can be referenced
by multiple CPUs, I can have a try to see if this is doable. Maybe a
per-cpu batch is better but synchronization might still be an issue.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 04/10] mm/swap: remove cache bypass swapin
  2024-03-27  6:55     ` Kairui Song
@ 2024-03-27  7:29       ` Huang, Ying
  0 siblings, 0 replies; 28+ messages in thread
From: Huang, Ying @ 2024-03-27  7:29 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> On Wed, Mar 27, 2024 at 2:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > We used to have the cache bypass swapin path for better performance,
>> > but by removing it, more optimization can be applied and have
>> > an even better overall performance and less hackish.
>> >
>> > And these optimizations are not easily doable or not doable at all
>> > without this.
>> >
>> > This patch simply removes it, and the performance will drop heavily
>> > for simple swapin, things won't get this worse for real workloads
>> > but still observable. Following commits will fix this and archive a
>> > better performance.
>> >
>> > Swapout/in 30G zero pages from ZRAM (This mostly measures overhead
>> > of swap path itself, because zero pages are not compressed but simply
>> > recorded in ZRAM, and performance drops more as SWAP device is getting
>> > full):
>> >
>> > Test result of sequential swapin/out:
>> >
>> >                Before (us)        After (us)
>> > Swapout:       33619409           33624641
>> > Swapin:        32393771           41614858 (-28.4%)
>> > Swapout (THP): 7817909            7795530
>> > Swapin (THP) : 32452387           41708471 (-28.4%)
>> >
>> > Signed-off-by: Kairui Song <kasong@tencent.com>
>> > ---
>> >  mm/memory.c     | 18 ++++-------------
>> >  mm/swap.h       | 10 +++++-----
>> >  mm/swap_state.c | 53 ++++++++++---------------------------------------
>> >  mm/swapfile.c   | 13 ------------
>> >  4 files changed, 19 insertions(+), 75 deletions(-)
>> >
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index dfdb620a9123..357d239ee2f6 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -3932,7 +3932,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >       struct page *page;
>> >       struct swap_info_struct *si = NULL;
>> >       rmap_t rmap_flags = RMAP_NONE;
>> > -     bool need_clear_cache = false;
>> >       bool exclusive = false;
>> >       swp_entry_t entry;
>> >       pte_t pte;
>> > @@ -4000,14 +3999,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >       if (!folio) {
>> >               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>> >                   __swap_count(entry) == 1) {
>> > -                     /* skip swapcache and readahead */
>> >                       folio = swapin_direct(entry, GFP_HIGHUSER_MOVABLE, vmf);
>> > -                     if (PTR_ERR(folio) == -EBUSY)
>> > -                             goto out;
>> > -                     need_clear_cache = true;
>> >               } else {
>> >                       folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
>> > -                     swapcache = folio;
>> >               }
>> >
>> >               if (!folio) {
>> > @@ -4023,6 +4017,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >                       goto unlock;
>> >               }
>> >
>> > +             swapcache = folio;
>> >               page = folio_file_page(folio, swp_offset(entry));
>> >
>> >               /* Had to read the page from swap area: Major fault */
>> > @@ -4187,7 +4182,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >       vmf->orig_pte = pte;
>> >
>> >       /* ksm created a completely new copy */
>> > -     if (unlikely(folio != swapcache && swapcache)) {
>> > +     if (unlikely(folio != swapcache)) {
>> >               folio_add_new_anon_rmap(folio, vma, vmf->address);
>> >               folio_add_lru_vma(folio, vma);
>> >       } else {
>> > @@ -4201,7 +4196,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >       arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>> >
>> >       folio_unlock(folio);
>> > -     if (folio != swapcache && swapcache) {
>> > +     if (folio != swapcache) {
>> >               /*
>> >                * Hold the lock to avoid the swap entry to be reused
>> >                * until we take the PT lock for the pte_same() check
>> > @@ -4227,9 +4222,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >       if (vmf->pte)
>> >               pte_unmap_unlock(vmf->pte, vmf->ptl);
>> >  out:
>> > -     /* Clear the swap cache pin for direct swapin after PTL unlock */
>> > -     if (need_clear_cache)
>> > -             swapcache_clear(si, entry);
>> >       if (si)
>> >               put_swap_device(si);
>> >       return ret;
>> > @@ -4240,12 +4232,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >       folio_unlock(folio);
>> >  out_release:
>> >       folio_put(folio);
>> > -     if (folio != swapcache && swapcache) {
>> > +     if (folio != swapcache) {
>> >               folio_unlock(swapcache);
>> >               folio_put(swapcache);
>> >       }
>> > -     if (need_clear_cache)
>> > -             swapcache_clear(si, entry);
>> >       if (si)
>> >               put_swap_device(si);
>> >       return ret;
>> > diff --git a/mm/swap.h b/mm/swap.h
>> > index aee134907a70..ac9573b03432 100644
>> > --- a/mm/swap.h
>> > +++ b/mm/swap.h
>> > @@ -41,7 +41,6 @@ void __delete_from_swap_cache(struct folio *folio,
>> >  void delete_from_swap_cache(struct folio *folio);
>> >  void clear_shadow_from_swap_cache(int type, unsigned long begin,
>> >                                 unsigned long end);
>> > -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry);
>> >  struct folio *swap_cache_get_folio(swp_entry_t entry,
>> >               struct vm_area_struct *vma, unsigned long addr);
>> >  struct folio *filemap_get_incore_folio(struct address_space *mapping,
>> > @@ -100,14 +99,15 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
>> >  {
>> >       return NULL;
>> >  }
>> > -
>> > -static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>> > +static inline struct folio *swapin_direct(swp_entry_t entry, gfp_t flag,
>> > +                     struct vm_fault *vmf);
>> >  {
>> > -     return 0;
>> > +     return NULL;
>> >  }
>> >
>> > -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
>> > +static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>> >  {
>> > +     return 0;
>> >  }
>> >
>> >  static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
>> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> > index 2a9c6bdff5ea..49ef6250f676 100644
>> > --- a/mm/swap_state.c
>> > +++ b/mm/swap_state.c
>> > @@ -880,61 +880,28 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>> >  }
>> >
>> >  /**
>> > - * swapin_direct - swap in folios skipping swap cache and readahead
>> > + * swapin_direct - swap in folios skipping readahead
>> >   * @entry: swap entry of this memory
>> >   * @gfp_mask: memory allocation flags
>> >   * @vmf: fault information
>> >   *
>> > - * Returns the struct folio for entry and addr after the swap entry is read
>> > - * in.
>> > + * Returns the folio for entry after it is read in.
>> >   */
>> >  struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>> >                           struct vm_fault *vmf)
>> >  {
>> > -     struct vm_area_struct *vma = vmf->vma;
>> > +     struct mempolicy *mpol;
>> >       struct folio *folio;
>> > -     void *shadow = NULL;
>> > -
>> > -     /*
>> > -      * Prevent parallel swapin from proceeding with
>> > -      * the cache flag. Otherwise, another thread may
>> > -      * finish swapin first, free the entry, and swapout
>> > -      * reusing the same entry. It's undetectable as
>> > -      * pte_same() returns true due to entry reuse.
>> > -      */
>> > -     if (swapcache_prepare(entry)) {
>> > -             /* Relax a bit to prevent rapid repeated page faults */
>> > -             schedule_timeout_uninterruptible(1);
>> > -             return ERR_PTR(-EBUSY);
>> > -     }
>> > -
>> > -     /* skip swapcache */
>> > -     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>> > -                             vma, vmf->address, false);
>> > -     if (folio) {
>> > -             __folio_set_locked(folio);
>> > -             __folio_set_swapbacked(folio);
>> > -
>> > -             if (mem_cgroup_swapin_charge_folio(folio,
>> > -                                     vma->vm_mm, GFP_KERNEL,
>> > -                                     entry)) {
>> > -                     folio_unlock(folio);
>> > -                     folio_put(folio);
>> > -                     return NULL;
>> > -             }
>> > -             mem_cgroup_swapin_uncharge_swap(entry);
>> > -
>> > -             shadow = get_shadow_from_swap_cache(entry);
>> > -             if (shadow)
>> > -                     workingset_refault(folio, shadow);
>> > +     bool page_allocated;
>> > +     pgoff_t ilx;
>> >
>> > -             folio_add_lru(folio);
>> > +     mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>> > +     folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
>> > +                                     &page_allocated, false);
>> > +     mpol_cond_put(mpol);
>> >
>> > -             /* To provide entry to swap_read_folio() */
>> > -             folio->swap = entry;
>> > +     if (page_allocated)
>> >               swap_read_folio(folio, true, NULL);
>> > -             folio->private = NULL;
>> > -     }
>> >
>> >       return folio;
>> >  }
>>
>> This looks similar as read_swap_cache_async().  Can we merge them?
>
> Yes, that's doable. But I may have to split it out again for later
> optimizations though.
>
>>
>> And, we should avoid to readahead in swapin_readahead() or
>> swap_vma_readahead() for SWP_SYNCHRONOUS_IO anyway.  So, it appears that
>> we can change and use swapin_readahead() directly?
>
> Good point, SWP_SYNCHRONOUS_IO check can be extended more after this
> series, but readahead optimization could be another series (like the
> previous one which tried to unify readahead for shmem/anon), so I
> thought it's better to keep it untouched for now.

Just want to check whether we can reduce the special processing for
SWP_SYNCHRONOUS_IO as much as possible.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin
  2024-03-26 18:50 ` [RFC PATCH 10/10] mm/swap: optimize synchronous swapin Kairui Song
  2024-03-27  6:22   ` Huang, Ying
@ 2024-03-27  8:08   ` Barry Song
  2024-03-27  8:44     ` Kairui Song
  1 sibling, 1 reply; 28+ messages in thread
From: Barry Song @ 2024-03-27  8:08 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Huang, Ying, Chris Li, Minchan Kim, Barry Song,
	Ryan Roberts, Yu Zhao, SeongJae Park, David Hildenbrand,
	Yosry Ahmed, Johannes Weiner, Matthew Wilcox, Nhat Pham,
	Chengming Zhou, Andrew Morton, linux-kernel

On Wed, Mar 27, 2024 at 8:06 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Interestingly the major performance overhead of synchronous is actually
> from the workingset nodes update, that's because synchronous swap in
> keeps adding single folios into a xa_node, making the node no longer
> a shadow node and have to be removed from shadow_nodes, then remove
> the folio very shortly and making the node a shadow node again,
> so it has to add back to the shadow_nodes.

Hi Kairui,

Thank you for clarifying this. I'm unsure how it relates to SWP_SYNCHRONOUS_IO.
Does this observation apply universally to all instances where
__swap_count(entry)
== 1, even on devices not using SYNCHRONOUS_IO?


>
> Mark synchronous swapin folio with a special bit in swap entry embedded
> in folio->swap, as we still have some usable bits there. Skip workingset
> node update on insertion of such folio because it will be removed very
> quickly, and will trigger the update ensuring the workingset info is
> eventual consensus.
>
> Test result of sequential swapin/out of 30G zero page on ZRAM:
>
>                Before (us)        After (us)
> Swapout:       33853883           33886008
> Swapin:        38336519           32465441 (+15.4%)
> Swapout (THP): 6814619            6899938
> Swapin (THP) : 38383367           33193479 (+13.6%)
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swapops.h |  5 +++-
>  mm/filemap.c            | 16 +++++++++---
>  mm/memory.c             | 34 ++++++++++++++----------
>  mm/swap.h               | 15 +++++++++++
>  mm/swap_state.c         | 57 ++++++++++++++++++++++++-----------------
>  mm/vmscan.c             |  6 +++++
>  mm/workingset.c         |  2 +-
>  7 files changed, 92 insertions(+), 43 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 48b700ba1d18..ebc0c3e4668d 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -25,7 +25,10 @@
>   * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
>   */
>  #define SWP_TYPE_SHIFT (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
> -#define SWP_OFFSET_MASK        ((1UL << SWP_TYPE_SHIFT) - 1)
> +#define SWP_CACHE_FLAG_BITS    1
> +#define SWP_CACHE_SYNCHRONOUS  BIT(SWP_TYPE_SHIFT - 1)
> +#define SWP_OFFSET_BITS        (SWP_TYPE_SHIFT - SWP_CACHE_FLAG_BITS)
> +#define SWP_OFFSET_MASK        (BIT(SWP_OFFSET_BITS) - 1)
>
>  /*
>   * Definitions only for PFN swap entries (see is_pfn_swap_entry()).  To
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 5e8e3fd26b8d..ac24cc65d1da 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -923,12 +923,20 @@ int __filemap_add_swapcache(struct address_space *mapping, struct folio *folio,
>                             pgoff_t index, gfp_t gfp, void **shadowp)
>  {
>         XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
> +       bool synchronous = swap_cache_test_synchronous(folio);
>         long nr;
>         int ret;
>
>         VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>         VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
> -       mapping_set_update(&xas, mapping);
> +
> +       /*
> +        * Skip node update for synchronous folio insertion, it will be
> +        * updated on folio deletion very soon, avoid repeated LRU locking.
> +        */
> +       if (!synchronous)
> +               xas_set_update(&xas, workingset_update_node);
> +       xas_set_lru(&xas, &shadow_nodes);
>
>         nr = folio_nr_pages(folio);
>         folio_ref_add(folio, nr);
> @@ -936,8 +944,10 @@ int __filemap_add_swapcache(struct address_space *mapping, struct folio *folio,
>         ret = __filemap_lock_store(&xas, folio, index, gfp, shadowp);
>         if (likely(!ret)) {
>                 mapping->nrpages += nr;
> -               __node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
> -               __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
> +               if (!synchronous) {
> +                       __node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
> +                       __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
> +               }
>                 xas_unlock_irq(&xas);
>         } else {
>                 folio_put_refs(folio, nr);
> diff --git a/mm/memory.c b/mm/memory.c
> index 774a912eb46d..bb40202b4f29 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3933,6 +3933,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         struct swap_info_struct *si = NULL;
>         rmap_t rmap_flags = RMAP_NONE;
>         bool folio_allocated = false;
> +       bool synchronous_io = false;
>         bool exclusive = false;
>         swp_entry_t entry;
>         pte_t pte;
> @@ -4032,18 +4033,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (ret & VM_FAULT_RETRY)
>                 goto out_release;
>
> -       if (swapcache) {
> -               /*
> -                * Make sure folio_free_swap() or swapoff did not release the
> -                * swapcache from under us.  The page pin, and pte_same test
> -                * below, are not enough to exclude that.  Even if it is still
> -                * swapcache, we need to check that the page's swap has not
> -                * changed.
> -                */
> -               if (unlikely(!folio_test_swapcache(folio) ||
> -                            page_swap_entry(page).val != entry.val))
> -                       goto out_page;
> +       /*
> +        * Make sure folio_free_swap() or swapoff did not release the
> +        * swapcache from under us.  The page pin, and pte_same test
> +        * below, are not enough to exclude that.  Even if it is still
> +        * swapcache, we need to check that the page's swap has not
> +        * changed.
> +        */
> +       if (unlikely(!folio_test_swapcache(folio) ||
> +                    (page_swap_entry(page).val & ~SWP_CACHE_SYNCHRONOUS) != entry.val))
> +               goto out_page;
>
> +       synchronous_io = swap_cache_test_synchronous(folio);
> +       if (!synchronous_io) {
>                 /*
>                  * KSM sometimes has to copy on read faults, for example, if
>                  * page->index of !PageKSM() pages would be nonlinear inside the
> @@ -4105,9 +4107,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          */
>         if (!folio_test_ksm(folio)) {
>                 exclusive = pte_swp_exclusive(vmf->orig_pte);
> -               if (folio != swapcache) {
> +               if (synchronous_io || folio != swapcache) {
>                         /*
> -                        * We have a fresh page that is not exposed to the
> +                        * We have a fresh page that is not sharable through the
>                          * swapcache -> certainly exclusive.
>                          */
>                         exclusive = true;
> @@ -4148,7 +4150,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          * yet.
>          */
>         swap_free(entry);
> -       if (should_try_to_free_swap(folio, vma, vmf->flags))
> +       if (synchronous_io)
> +               delete_from_swap_cache(folio);
> +       else if (should_try_to_free_swap(folio, vma, vmf->flags))
>                 folio_free_swap(folio);
>
>         inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> @@ -4223,6 +4227,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  out_nomap:
>         if (vmf->pte)
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
> +       if (synchronous_io)
> +               delete_from_swap_cache(folio);
>  out_page:
>         folio_unlock(folio);
>  out_release:
> diff --git a/mm/swap.h b/mm/swap.h
> index bd872b157950..9d106eebddbd 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -31,6 +31,21 @@ extern struct address_space *swapper_spaces[];
>         (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
>                 >> SWAP_ADDRESS_SPACE_SHIFT])
>
> +static inline void swap_cache_mark_synchronous(struct folio *folio)
> +{
> +       folio->swap.val |= SWP_CACHE_SYNCHRONOUS;
> +}
> +
> +static inline bool swap_cache_test_synchronous(struct folio *folio)
> +{
> +       return folio->swap.val & SWP_CACHE_SYNCHRONOUS;
> +}
> +
> +static inline void swap_cache_clear_synchronous(struct folio *folio)
> +{
> +       folio->swap.val &= ~SWP_CACHE_SYNCHRONOUS;
> +}
> +
>  void show_swap_cache_info(void);
>  bool add_to_swap(struct folio *folio);
>  void *get_shadow_from_swap_cache(swp_entry_t entry);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index cf178dd1131a..b0b1b5391ac1 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -86,7 +86,7 @@ void *get_shadow_from_swap_cache(swp_entry_t entry)
>   * but sets SwapCache flag and private instead of mapping and index.
>   */
>  static int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> -                            gfp_t gfp, void **shadowp)
> +                            gfp_t gfp, bool synchronous, void **shadowp)
>  {
>         struct address_space *address_space = swap_address_space(entry);
>         pgoff_t idx = swp_offset(entry);
> @@ -98,11 +98,12 @@ static int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
>
>         folio_set_swapcache(folio);
>         folio->swap = entry;
> -
> +       if (synchronous)
> +               swap_cache_mark_synchronous(folio);
>         ret = __filemap_add_swapcache(address_space, folio, idx, gfp, shadowp);
>         if (ret) {
> -               folio_clear_swapcache(folio);
>                 folio->swap.val = 0;
> +               folio_clear_swapcache(folio);
>         }
>
>         return ret;
> @@ -129,11 +130,13 @@ void __delete_from_swap_cache(struct folio *folio,
>         xas_set_order(&xas, idx, folio_order(folio));
>         xas_store(&xas, shadow);
>
> -       folio->swap.val = 0;
>         folio_clear_swapcache(folio);
>         address_space->nrpages -= nr;
> -       __node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
> -       __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
> +       if (!swap_cache_test_synchronous(folio)) {
> +               __node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
> +               __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
> +       }
> +       folio->swap.val = 0;
>  }
>
>  /**
> @@ -393,7 +396,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
>   * else or hitting OOM.
>   */
>  static struct folio *swap_cache_add_or_get(struct folio *folio,
> -               swp_entry_t entry, gfp_t gfp_mask)
> +               swp_entry_t entry, gfp_t gfp_mask, bool synchronous)
>  {
>         int ret = 0;
>         void *shadow = NULL;
> @@ -403,7 +406,7 @@ static struct folio *swap_cache_add_or_get(struct folio *folio,
>         if (folio) {
>                 __folio_set_locked(folio);
>                 __folio_set_swapbacked(folio);
> -               ret = add_to_swap_cache(folio, entry, gfp_mask, &shadow);
> +               ret = add_to_swap_cache(folio, entry, gfp_mask, synchronous, &shadow);
>                 if (ret)
>                         __folio_clear_locked(folio);
>         }
> @@ -460,7 +463,7 @@ int swap_cache_add_wait(struct folio *folio, swp_entry_t entry, gfp_t gfp)
>         struct folio *wait_folio;
>
>         for (;;) {
> -               ret = add_to_swap_cache(folio, entry, gfp, NULL);
> +               ret = add_to_swap_cache(folio, entry, gfp, false, NULL);
>                 if (ret != -EEXIST)
>                         break;
>                 wait_folio = filemap_get_folio(swap_address_space(entry),
> @@ -493,7 +496,7 @@ struct folio *swap_cache_alloc_or_get(swp_entry_t entry, gfp_t gfp_mask,
>         /* We are very likely the first user, alloc and try add to the swapcache. */
>         folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0, mpol, ilx,
>                                                  numa_node_id());
> -       swapcache = swap_cache_add_or_get(folio, entry, gfp_mask);
> +       swapcache = swap_cache_add_or_get(folio, entry, gfp_mask, false);
>         if (swapcache != folio) {
>                 folio_put(folio);
>                 goto out_no_alloc;
> @@ -875,21 +878,27 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>  struct folio *swapin_direct(swp_entry_t entry, gfp_t gfp_mask,
>                             struct vm_fault *vmf, bool *folio_allocated)
>  {
> -       struct mempolicy *mpol;
> -       struct folio *folio;
> -       pgoff_t ilx;
> -
> -       mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> -       folio = swap_cache_alloc_or_get(entry, gfp_mask, mpol, ilx,
> -                                       folio_allocated);
> -       mpol_cond_put(mpol);
> -
> -       if (*folio_allocated)
> +       struct folio *folio = NULL, *swapcache;
> +       /* First do a racy check if cache is already loaded. */
> +       swapcache = swap_cache_try_get(entry);
> +       if (unlikely(swapcache))
> +               goto out;
> +       folio = vma_alloc_folio(gfp_mask, 0, vmf->vma, vmf->address, false);
> +       swapcache = swap_cache_add_or_get(folio, entry, gfp_mask, true);
> +       if (!swapcache)
> +               goto out_nocache;
> +       if (swapcache == folio) {
>                 swap_read_folio(folio, true, NULL);
> -       else if (folio)
> -               swap_cache_update_ra(folio, vmf->vma, vmf->address);
> -
> -       return folio;
> +               *folio_allocated = true;
> +               return folio;
> +       }
> +out:
> +       swap_cache_update_ra(swapcache, vmf->vma, vmf->address);
> +out_nocache:
> +       if (folio)
> +               folio_put(folio);
> +       *folio_allocated = false;
> +       return swapcache;
>  }
>
>  /**
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c3db39393428..e71b049fee01 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1228,6 +1228,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                                         if (!add_to_swap(folio))
>                                                 goto activate_locked_split;
>                                 }
> +                       } else if (swap_cache_test_synchronous(folio)) {
> +                               /*
> +                                * We see a folio being swapped in but not activated either
> +                                * due to missing shadow or lived too short, active it.
> +                                */
> +                               goto activate_locked;
>                         }
>                 } else if (folio_test_swapbacked(folio) &&
>                            folio_test_large(folio)) {
> diff --git a/mm/workingset.c b/mm/workingset.c
> index f2a0ecaf708d..83a0b409be0f 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -753,7 +753,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>          */
>         if (WARN_ON_ONCE(!node->nr_values))
>                 goto out_invalid;
> -       if (WARN_ON_ONCE(node->count != node->nr_values))
> +       if (WARN_ON_ONCE(node->count != node->nr_values && mapping->host != NULL))
>                 goto out_invalid;
>         xa_delete_node(node, workingset_update_node);
>         __inc_lruvec_kmem_state(node, WORKINGSET_NODERECLAIM);
> --
> 2.43.0
>
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin
  2024-03-27  7:14         ` Kairui Song
@ 2024-03-27  8:16           ` Huang, Ying
  0 siblings, 0 replies; 28+ messages in thread
From: Huang, Ying @ 2024-03-27  8:16 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Chris Li, Minchan Kim, Barry Song, Ryan Roberts,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel

Kairui Song <ryncsn@gmail.com> writes:

> On Wed, Mar 27, 2024 at 2:49 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > On Wed, Mar 27, 2024 at 2:24 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Kairui Song <ryncsn@gmail.com> writes:
>> >>
>> >> > From: Kairui Song <kasong@tencent.com>
>> >> >
>> >> > Interestingly the major performance overhead of synchronous is actually
>> >> > from the workingset nodes update, that's because synchronous swap in
>> >>
>> >> If it's the major overhead, why not make it the first optimization?
>> >
>> > This performance issue became much more obvious after doing other
>> > optimizations, and other optimizations are for general swapin not only
>> > for synchronous swapin, that's also how I optimized things step by
>> > step, so I kept my patch order...
>> >
>> > And it is easier to do this after Patch 8/10 which introduces the new
>> > interface for swap cache.
>> >
>> >>
>> >> > keeps adding single folios into a xa_node, making the node no longer
>> >> > a shadow node and have to be removed from shadow_nodes, then remove
>> >> > the folio very shortly and making the node a shadow node again,
>> >> > so it has to add back to the shadow_nodes.
>> >>
>> >> The folio is removed only if should_try_to_free_swap() returns true?
>> >>
>> >> > Mark synchronous swapin folio with a special bit in swap entry embedded
>> >> > in folio->swap, as we still have some usable bits there. Skip workingset
>> >> > node update on insertion of such folio because it will be removed very
>> >> > quickly, and will trigger the update ensuring the workingset info is
>> >> > eventual consensus.
>> >>
>> >> Is this safe?  Is it possible for the shadow node to be reclaimed after
>> >> the folio are added into node and before being removed?
>> >
>> > If a xa node contains any non-shadow entry, it can't be reclaimed,
>> > shadow_lru_isolate will check and skip such nodes in case of race.
>>
>> In shadow_lru_isolate(),
>>
>>         /*
>>          * The nodes should only contain one or more shadow entries,
>>          * no pages, so we expect to be able to remove them all and
>>          * delete and free the empty node afterwards.
>>          */
>>         if (WARN_ON_ONCE(!node->nr_values))
>>                 goto out_invalid;
>>         if (WARN_ON_ONCE(node->count != node->nr_values))
>>                 goto out_invalid;
>>
>> So, this isn't considered normal and will cause warning now.
>
> Yes, I added an exception in this patch:
> -       if (WARN_ON_ONCE(node->count != node->nr_values))
> +       if (WARN_ON_ONCE(node->count != node->nr_values &&
> mapping->host != NULL))
>
> The code is not a good final solution, but the idea might not be that
> bad, list_lru provides many operations like LRU_ROTATE, we can even
> lazy remove all the nodes as a general optimization, or add a
> threshold for adding/removing a node from LRU.

We can compare different solutions.  For this one, we still need to deal
with the cases where the folio isn't removed from the swap cache, that
is, should_try_to_free_swap() returns false.

>>
>> >>
>> >> If so, we may consider some other methods.  Make shadow_nodes per-cpu?
>> >
>> > That's also an alternative solution if there are other risks.
>>
>> This appears a general optimization and more clean.
>
> I'm not sure if synchronization between CPUs will make more burden,
> because shadow nodes are globally shared, one node can be referenced
> by multiple CPUs, I can have a try to see if this is doable. Maybe a
> per-cpu batch is better but synchronization might still be an issue.

Yes.  Per-CPU shadow_nodes would need to find the list from the shadow
node.  That has some overhead.

If lock contention on list_lru lock is the root cause, we can use hashed
shadow node lists.  That can reduce lock contention effectively.
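
Purely as an illustration of the hashing idea (hypothetical names and
layout, not existing kernel code), it could look roughly like this:

#include <linux/hash.h>
#include <linux/list_lru.h>
#include <linux/xarray.h>

/*
 * Hypothetical sketch: spread shadow xa_nodes over several list_lru
 * instances keyed by the node pointer, so that concurrent workingset
 * node updates contend on different locks instead of one global
 * shadow_nodes list_lru.
 */
#define SHADOW_LRU_HASH_BITS	4
static struct list_lru shadow_node_lrus[1 << SHADOW_LRU_HASH_BITS];

static struct list_lru *shadow_node_lru(struct xa_node *node)
{
	return &shadow_node_lrus[hash_ptr(node, SHADOW_LRU_HASH_BITS)];
}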

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization
  2024-03-27  3:01   ` Kairui Song
@ 2024-03-27  8:27     ` Ryan Roberts
  2024-03-27  8:32       ` Huang, Ying
  2024-03-27 11:04       ` Kairui Song
  0 siblings, 2 replies; 28+ messages in thread
From: Ryan Roberts @ 2024-03-27  8:27 UTC (permalink / raw)
  To: Kairui Song, Huang, Ying
  Cc: linux-mm, Chris Li, Minchan Kim, Barry Song, Yu Zhao,
	SeongJae Park, David Hildenbrand, Yosry Ahmed, Johannes Weiner,
	Matthew Wilcox, Nhat Pham, Chengming Zhou, Andrew Morton,
	linux-kernel

[...]

>>> Test 1, sequential swapin/out of 30G zero page on ZRAM:
>>>
>>>                Before (us)        After (us)
>>> Swapout:       33619409           33886008
>>> Swapin:        32393771           32465441 (- 0.2%)
>>> Swapout (THP): 7817909            6899938  (+11.8%)
>>> Swapin (THP) : 32452387           33193479 (- 2.2%)
>>
>> If my understanding were correct, we don't have swapin (THP) support,
>> yet.  Right?
> 
> Yes, this series doesn't change how swapin/swapout works with THP in
> general, but now THP swapout will leave shadows with large order, so
> it needs to be splitted upon swapin, that will slow down later swapin
> by a little bit but I think that's worth it.
> 
> If we can do THP swapin in the future, this split on swapin can be
> saved to make the performance even better.

I'm confused by this (clearly my understanding of how this works is incorrect).
Perhaps you can help me understand:

When you talk about "shadows" I assume you are referring to the swap cache? It
was my understanding that swapping out a THP would always leave the large folio
in the swap cache, so this is nothing new?

And on swap-in, if the target page is in the swap cache, even if part of a large
folio, why does it need to be split? I assumed the single page would just be
mapped? (and if all the other pages subsequently fault, then you end up with a
fully mapped large folio back in the process)?

Perhaps I'm misunderstanding what "shadows" are?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization
  2024-03-27  8:27     ` Ryan Roberts
@ 2024-03-27  8:32       ` Huang, Ying
  2024-03-27  9:39         ` Ryan Roberts
  2024-03-27 11:04       ` Kairui Song
  1 sibling, 1 reply; 28+ messages in thread
From: Huang, Ying @ 2024-03-27  8:32 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Kairui Song, linux-mm, Chris Li, Minchan Kim, Barry Song,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel

Ryan Roberts <ryan.roberts@arm.com> writes:

> [...]
>
>>>> Test 1, sequential swapin/out of 30G zero page on ZRAM:
>>>>
>>>>                Before (us)        After (us)
>>>> Swapout:       33619409           33886008
>>>> Swapin:        32393771           32465441 (- 0.2%)
>>>> Swapout (THP): 7817909            6899938  (+11.8%)
>>>> Swapin (THP) : 32452387           33193479 (- 2.2%)
>>>
>>> If my understanding were correct, we don't have swapin (THP) support,
>>> yet.  Right?
>> 
>> Yes, this series doesn't change how swapin/swapout works with THP in
>> general, but now THP swapout will leave shadows with large order, so
>> it needs to be splitted upon swapin, that will slow down later swapin
>> by a little bit but I think that's worth it.
>> 
>> If we can do THP swapin in the future, this split on swapin can be
>> saved to make the performance even better.
>
> I'm confused by this (clearly my understanding of how this works is incorrect).
> Perhaps you can help me understand:
>
> When you talk about "shadows" I assume you are referring to the swap cache? It
> was my understanding that swapping out a THP would always leave the large folio
> in the swap cache, so this is nothing new?
>
> And on swap-in, if the target page is in the swap cache, even if part of a large
> folio, why does it need to be split? I assumed the single page would just be
> mapped? (and if all the other pages subsequently fault, then you end up with a
> fully mapped large folio back in the process)?
>
> Perhaps I'm misunderstanding what "shadows" are?

Perhaps.  Shadow entries are used to support workingset
protection/detection on the anonymous LRU list, as in the following
patchset (merged).

https://lore.kernel.org/all/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com/T/#m962395eb5968c74b0c4c8e41d4b0dcdd3f28b2e6

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin
  2024-03-27  8:08   ` Barry Song
@ 2024-03-27  8:44     ` Kairui Song
  0 siblings, 0 replies; 28+ messages in thread
From: Kairui Song @ 2024-03-27  8:44 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-mm, Huang, Ying, Chris Li, Minchan Kim, Barry Song,
	Ryan Roberts, Yu Zhao, SeongJae Park, David Hildenbrand,
	Yosry Ahmed, Johannes Weiner, Matthew Wilcox, Nhat Pham,
	Chengming Zhou, Andrew Morton, linux-kernel

On Wed, Mar 27, 2024 at 4:09 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Mar 27, 2024 at 8:06 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Interestingly the major performance overhead of synchronous is actually
> > from the workingset nodes update, that's because synchronous swap in
> > keeps adding single folios into a xa_node, making the node no longer
> > a shadow node and have to be removed from shadow_nodes, then remove
> > the folio very shortly and making the node a shadow node again,
> > so it has to add back to the shadow_nodes.
>
> Hi Kairui,
>
> Thank you for clarifying this. I'm unsure how it relates to SWP_SYNCHRONOUS_IO.
> Does this observation apply universally to all instances where
> __swap_count(entry)
> == 1, even on devices not using SYNCHRONOUS_IO?

Hi Barry

I was testing with zero pages on ZRAM, so the performance issue is
much more obvious.

Non-SYNCHRONOUS_IO devices don't drop the swap cache immediately
unless swap is half full, so a shadow node will be removed from
shadow_nodes on the first swapin, but usually won't be added/removed
repeatedly.

I think the logic of "never drop the swap cache even if the swap count
is 1", then suddenly switching to "always drop the swap cache when the
swap count is 1" once swap is half full, is not a good solution...
Maybe some generic optimization can be applied to that part too.
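
For reference, the behaviour described above boils down to a check
like the following on the swapin path.  This is a simplified sketch of
the heuristic only, with a hypothetical helper name; the real
should_try_to_free_swap() also considers mlocked VMAs and write faults:

#include <linux/mm.h>
#include <linux/swap.h>

/*
 * Simplified sketch: for non-SWP_SYNCHRONOUS_IO devices the swap cache
 * copy is normally kept after swapin, and only once swap is more than
 * half full does the fault path start dropping it.
 */
static bool drop_swap_cache_after_swapin(struct folio *folio)
{
	if (!folio_test_swapcache(folio))
		return false;

	/* vm_swap_full() is true when over half of swap space is in use. */
	return vm_swap_full();
}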

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization
  2024-03-27  8:32       ` Huang, Ying
@ 2024-03-27  9:39         ` Ryan Roberts
  0 siblings, 0 replies; 28+ messages in thread
From: Ryan Roberts @ 2024-03-27  9:39 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Kairui Song, linux-mm, Chris Li, Minchan Kim, Barry Song,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel

On 27/03/2024 08:32, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> [...]
>>
>>>>> Test 1, sequential swapin/out of 30G zero page on ZRAM:
>>>>>
>>>>>                Before (us)        After (us)
>>>>> Swapout:       33619409           33886008
>>>>> Swapin:        32393771           32465441 (- 0.2%)
>>>>> Swapout (THP): 7817909            6899938  (+11.8%)
>>>>> Swapin (THP) : 32452387           33193479 (- 2.2%)
>>>>
>>>> If my understanding were correct, we don't have swapin (THP) support,
>>>> yet.  Right?
>>>
>>> Yes, this series doesn't change how swapin/swapout works with THP in
>>> general, but now THP swapout will leave shadows with large order, so
>>> it needs to be splitted upon swapin, that will slow down later swapin
>>> by a little bit but I think that's worth it.
>>>
>>> If we can do THP swapin in the future, this split on swapin can be
>>> saved to make the performance even better.
>>
>> I'm confused by this (clearly my understanding of how this works is incorrect).
>> Perhaps you can help me understand:
>>
>> When you talk about "shadows" I assume you are referring to the swap cache? It
>> was my understanding that swapping out a THP would always leave the large folio
>> in the swap cache, so this is nothing new?
>>
>> And on swap-in, if the target page is in the swap cache, even if part of a large
>> folio, why does it need to be split? I assumed the single page would just be
>> mapped? (and if all the other pages subsequently fault, then you end up with a
>> fully mapped large folio back in the process)?
>>
>> Perhaps I'm misunderstanding what "shadows" are?
> 
> Perhaps, shadow is used to support workingset protection/detection on
> the anonymous LRU list as in the following patchset (merged).
> 
> https://lore.kernel.org/all/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com/T/#m962395eb5968c74b0c4c8e41d4b0dcdd3f28b2e6

Thanks! Although after reading the cover letter I still don't really understand
the need for splitting. The LRU applies to whole folios.


> 
> --
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization
  2024-03-27  8:27     ` Ryan Roberts
  2024-03-27  8:32       ` Huang, Ying
@ 2024-03-27 11:04       ` Kairui Song
  1 sibling, 0 replies; 28+ messages in thread
From: Kairui Song @ 2024-03-27 11:04 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Huang, Ying, linux-mm, Chris Li, Minchan Kim, Barry Song,
	Yu Zhao, SeongJae Park, David Hildenbrand, Yosry Ahmed,
	Johannes Weiner, Matthew Wilcox, Nhat Pham, Chengming Zhou,
	Andrew Morton, linux-kernel

On Wed, Mar 27, 2024 at 4:27 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> [...]
>
> >>> Test 1, sequential swapin/out of 30G zero page on ZRAM:
> >>>
> >>>                Before (us)        After (us)
> >>> Swapout:       33619409           33886008
> >>> Swapin:        32393771           32465441 (- 0.2%)
> >>> Swapout (THP): 7817909            6899938  (+11.8%)
> >>> Swapin (THP) : 32452387           33193479 (- 2.2%)
> >>
> >> If my understanding were correct, we don't have swapin (THP) support,
> >> yet.  Right?
> >
> > Yes, this series doesn't change how swapin/swapout works with THP in
> > general, but now THP swapout will leave shadows with large order, so
> > it needs to be splitted upon swapin, that will slow down later swapin
> > by a little bit but I think that's worth it.
> >
> > If we can do THP swapin in the future, this split on swapin can be
> > saved to make the performance even better.
>
> I'm confused by this (clearly my understanding of how this works is incorrect).
> Perhaps you can help me understand:
>
> When you talk about "shadows" I assume you are referring to the swap cache? It
> was my understanding that swapping out a THP would always leave the large folio
> in the swap cache, so this is nothing new?
>
> And on swap-in, if the target page is in the swap cache, even if part of a large
> folio, why does it need to be split? I assumed the single page would just be
> mapped? (and if all the other pages subsequently fault, then you end up with a
> fully mapped large folio back in the process)?
>
> Perhaps I'm misunderstanding what "shadows" are?

Hi Ryan

My bad, I haven't made this clear.

Ying has posted the link to the commit that added "shadow" support
for anon pages; it has become a very important part of LRU activation /
workingset tracking.  Basically, when folios are removed from the
cache xarray (e.g. after swap writeback is done), instead of releasing
the xarray slot, an unsigned long / void * is stored in it, recording
some info that is used when a refault happens to decide how to handle
the folio on the LRU / workingset side.
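
A minimal sketch of that idea (illustration only, with hypothetical
helper names; the real encoding in mm/workingset.c packs memcg, node
and eviction-counter bits into the value):

#include <linux/xarray.h>

/*
 * Replace an evicted folio's slot with a shadow value instead of
 * erasing it; on refault, a value entry carries the workingset info.
 */
static void store_shadow(struct xa_state *xas, unsigned long eviction)
{
	xas_store(xas, xa_mk_value(eviction));
}

static bool entry_is_shadow(void *entry)
{
	return xa_is_value(entry);
}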

And about large folios in the swap cache: if you look at the current
version of add_to_swap_cache in mainline (it adds a folio of any order
into the swap cache), it calls xas_create_range(&xas), which fills all
the xarray slots in the entire range covered by the folio.  But the
xarray supports multi-index storing, making use of the nature of the
radix tree to save a lot of slots.  E.g. for a 2M THP, previously
8 + 512 slots (8 extra xa nodes) were needed to store it; after this
series only 8 slots are needed by using a multi-index store (not sure
if I did the math right).
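
A minimal sketch of what the multi-index store looks like (assuming an
order-9 folio and a hypothetical helper name; this leaves out the GFP
retry loop, refcounting and shadow handling of the real helpers):

#include <linux/pagemap.h>
#include <linux/xarray.h>

static int store_folio_multi_index(struct address_space *mapping,
				   struct folio *folio, pgoff_t index)
{
	/* One xa_state covering the whole range of the folio. */
	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));

	xas_lock_irq(&xas);
	/* A single multi-index store instead of 512 per-slot stores. */
	xas_store(&xas, folio);
	xas_unlock_irq(&xas);

	return xas_error(&xas);
}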

Same for shadows: when a folio is being deleted, __delete_from_swap_cache
currently walks the xarray with xas_next, updating all 8 + 512 slots
one by one; after this series only 8 stores are needed (ignoring
fragmentation).

And upon swapin, I was talking about swapping in one sub page of a THP
folio after the folio itself is gone, leaving a few multi-index shadow
slots.  The multi-index slots need to be split (a multi-index slot has
to be updated as a whole or split first; __filemap_add_folio handles
such a split).  I optimized and reused that routine from
__filemap_add_folio in this series, so without too much work it works
perfectly for the swap cache.
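
And a rough sketch of the split step (a hypothetical helper, loosely
following how __filemap_add_folio deals with a larger-order entry;
error handling omitted):

#include <linux/gfp.h>
#include <linux/xarray.h>

/*
 * Split an order-N shadow entry into order-0 entries so that a single
 * subpage can be stored over one of its indices.
 */
static void split_shadow_entry(struct xa_state *xas, void *shadow,
			       unsigned int order, gfp_t gfp)
{
	/* Pre-allocate the xa_nodes needed for the split. */
	xas_split_alloc(xas, shadow, order, gfp);

	xas_lock_irq(xas);
	xas_split(xas, shadow, order);
	xas_unlock_irq(xas);
}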

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-03-27 11:05 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-26 18:50 [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Kairui Song
2024-03-26 18:50 ` [RFC PATCH 01/10] mm/filemap: split filemap storing logic into a standalone helper Kairui Song
2024-03-26 18:50 ` [RFC PATCH 02/10] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
2024-03-26 18:50 ` [RFC PATCH 03/10] mm/swap: convert swapin_readahead to return a folio Kairui Song
2024-03-26 20:03   ` Matthew Wilcox
2024-03-26 18:50 ` [RFC PATCH 04/10] mm/swap: remove cache bypass swapin Kairui Song
2024-03-27  6:30   ` Huang, Ying
2024-03-27  6:55     ` Kairui Song
2024-03-27  7:29       ` Huang, Ying
2024-03-26 18:50 ` [RFC PATCH 05/10] mm/swap: clean shadow only in unmap path Kairui Song
2024-03-26 18:50 ` [RFC PATCH 06/10] mm/swap: switch to use multi index entries Kairui Song
2024-03-26 18:50 ` [RFC PATCH 07/10] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_or_get Kairui Song
2024-03-26 18:50 ` [RFC PATCH 08/10] mm/swap: use swap cache as a synchronization layer Kairui Song
2024-03-26 18:50 ` [RFC PATCH 09/10] mm/swap: delay the swap cache lookup for swapin Kairui Song
2024-03-26 18:50 ` [RFC PATCH 10/10] mm/swap: optimize synchronous swapin Kairui Song
2024-03-27  6:22   ` Huang, Ying
2024-03-27  6:37     ` Kairui Song
2024-03-27  6:47       ` Huang, Ying
2024-03-27  7:14         ` Kairui Song
2024-03-27  8:16           ` Huang, Ying
2024-03-27  8:08   ` Barry Song
2024-03-27  8:44     ` Kairui Song
2024-03-27  2:52 ` [RFC PATCH 00/10] mm/swap: always use swap cache for synchronization Huang, Ying
2024-03-27  3:01   ` Kairui Song
2024-03-27  8:27     ` Ryan Roberts
2024-03-27  8:32       ` Huang, Ying
2024-03-27  9:39         ` Ryan Roberts
2024-03-27 11:04       ` Kairui Song
