* [PATCH v2 0/6] mm/zswap: optimize zswap lru list
@ 2024-02-04  3:05 Chengming Zhou
  2024-02-04  3:05 ` [PATCH v2 1/6] mm/zswap: add more comments in shrink_memcg_cb() Chengming Zhou
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Chengming Zhou @ 2024-02-04  3:05 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed, Andrew Morton, Johannes Weiner
  Cc: linux-mm, Nhat Pham, Chengming Zhou, linux-kernel, Yosry Ahmed,
	Johannes Weiner

Changes in v2:
- Add comment above zswap_invalidate() to mention that large folio
  swap slot is not covered for now, per Yosry.
- Add comment about locking behaviour of LRU_STOP, per Yosry.
- Add the theory details and supporting test results on why we
  choose exclusive load as the default for zswap, per Johannes.
- Collect tags.
- Link to v1: https://lore.kernel.org/r/20240201-b4-zswap-invalidate-entry-v1-0-56ed496b6e55@bytedance.com

Hi all,

This series is motivated by observing the zswap lru list shrinking and
noting some unexpected failure cases in zswap_writeback_entry():

bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'

There are some -ENOMEM failures because when a swap entry is freed to
the per-cpu swap pool, the corresponding zswap entry is not
invalidated/dropped. When the shrinker later encounters these trashy
zswap entries, they can't be reclaimed and -ENOMEM is returned.

So move the invalidation ahead to the point where the swap entry is
freed to the per-cpu swap pool, since there is no benefit to leaving
trashy zswap entries on the zswap tree and lru list.

Another case is -EEXIST, which is seen more often with
!zswap_exclusive_loads_enabled, where the swapin folio leaves a
compressed copy on the tree and lru list. That copy can't be
reclaimed until the folio is removed from the swapcache.

Changing to zswap_exclusive_loads_enabled mode invalidates the entry
on folio swapin, which has its own drawback: if that folio is still
clean in the swapcache and is swapped out again, we need to compress
it again. Please see the commit for details on why we choose exclusive
load as the default for zswap.

Another optimization for -EEXIST is that we add LRU_STOP to support
terminating the shrinking process, to avoid evicting a warmer region.
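
For reference, a slightly expanded variant of the above one-liner,
which prints a periodic breakdown and annotates the two failure cases
discussed in this letter (the interval and the errno annotations are
illustrative additions, not part of the original test):

  bpftrace -e '
  kr:zswap_writeback_entry {
          /* -12 == -ENOMEM: trashy entry, swapcache_prepare() failed */
          /* -17 == -EEXIST: folio already found in the swap cache    */
          @ret[(int32)retval] = count();
  }
  interval:s:5 { print(@ret); }'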

Testing used a kernel build in tmpfs, one 50GB swapfile and zswap
shrinker_enabled, with memory.max set to 2GB.

                mm-unstable   zswap-optimize
real               63.90s       63.25s
user             1064.05s     1063.40s
sys               292.32s      270.94s

The main optimization is in sys cpu time, about a 7% improvement.
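
For reproducibility, here is a rough sketch of that setup (file
locations, cgroup name and the exact build commands are assumptions,
not the scripts actually used):

  # 50GB swapfile + zswap with the shrinker enabled
  fallocate -l 50G /swapfile && chmod 600 /swapfile
  mkswap /swapfile && swapon /swapfile
  echo Y > /sys/module/zswap/parameters/enabled
  echo Y > /sys/module/zswap/parameters/shrinker_enabled

  # 2GB memory limit for the build job (cgroup v2)
  mkdir /sys/fs/cgroup/zswap-test
  echo 2G > /sys/fs/cgroup/zswap-test/memory.max
  echo $$ > /sys/fs/cgroup/zswap-test/cgroup.procs

  # kernel build in tmpfs (assumes /tmp is tmpfs), inside the limited cgroup
  cp -a linux-src /tmp/build && cd /tmp/build
  make defconfig
  time make -j"$(nproc)"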

Thanks for review and comments!

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Chengming Zhou (6):
      mm/zswap: add more comments in shrink_memcg_cb()
      mm/zswap: invalidate zswap entry when swap entry free
      mm/zswap: stop lru list shrinking when encounter warm region
      mm/zswap: remove duplicate_entry debug value
      mm/zswap: only support zswap_exclusive_loads_enabled
      mm/zswap: zswap entry doesn't need refcount anymore

 include/linux/list_lru.h |   2 +
 include/linux/zswap.h    |   4 +-
 mm/Kconfig               |  16 ------
 mm/list_lru.c            |   3 ++
 mm/swap_slots.c          |   3 ++
 mm/swapfile.c            |   1 -
 mm/zswap.c               | 136 ++++++++++++++++-------------------------------
 7 files changed, 56 insertions(+), 109 deletions(-)
---
base-commit: 3a92c45e4ba694381c46994f3fde0d8544a2088b
change-id: 20240201-b4-zswap-invalidate-entry-b77dea670325

Best regards,
-- 
Chengming Zhou <zhouchengming@bytedance.com>

* [PATCH v2 1/6] mm/zswap: add more comments in shrink_memcg_cb()
  2024-02-04  3:05 [PATCH v2 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
@ 2024-02-04  3:05 ` Chengming Zhou
  2024-02-04  3:06 ` [PATCH v2 2/6] mm/zswap: invalidate zswap entry when swap entry free Chengming Zhou
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Chengming Zhou @ 2024-02-04  3:05 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed, Andrew Morton, Johannes Weiner
  Cc: linux-mm, Nhat Pham, Chengming Zhou, linux-kernel, Yosry Ahmed,
	Johannes Weiner

Add more comments in shrink_memcg_cb() to describe the deref dance
which is implemented to fix the race problem between lru writeback and
swapoff, and the reason why we rotate the entry at the beginning.

Also fix the stale comments in zswap_writeback_entry(), and add
more comments to state that we only deref the tree after we get
the swapcache reference.

Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 4aea03285532..735f1a6ef336 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1207,10 +1207,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 
 	/*
 	 * folio is locked, and the swapcache is now secured against
-	 * concurrent swapping to and from the slot. Verify that the
-	 * swap entry hasn't been invalidated and recycled behind our
-	 * backs (our zswap_entry reference doesn't prevent that), to
-	 * avoid overwriting a new swap folio with old compressed data.
+	 * concurrent swapping to and from the slot, and concurrent
+	 * swapoff so we can safely dereference the zswap tree here.
+	 * Verify that the swap entry hasn't been invalidated and recycled
+	 * behind our backs, to avoid overwriting a new swap folio with
+	 * old compressed data. Only when this is successful can the entry
+	 * be dereferenced.
 	 */
 	tree = swap_zswap_tree(swpentry);
 	spin_lock(&tree->lock);
@@ -1263,22 +1265,29 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	int writeback_result;
 
 	/*
-	 * Rotate the entry to the tail before unlocking the LRU,
-	 * so that in case of an invalidation race concurrent
-	 * reclaimers don't waste their time on it.
+	 * As soon as we drop the LRU lock, the entry can be freed by
+	 * a concurrent invalidation. This means the following:
 	 *
-	 * If writeback succeeds, or failure is due to the entry
-	 * being invalidated by the swap subsystem, the invalidation
-	 * will unlink and free it.
+	 * 1. We extract the swp_entry_t to the stack, allowing
+	 *    zswap_writeback_entry() to pin the swap entry and
+	 *    then validate the zswap entry against that swap entry's
+	 *    tree using pointer value comparison. Only when that
+	 *    is successful can the entry be dereferenced.
 	 *
-	 * Temporary failures, where the same entry should be tried
-	 * again immediately, almost never happen for this shrinker.
-	 * We don't do any trylocking; -ENOMEM comes closest,
-	 * but that's extremely rare and doesn't happen spuriously
-	 * either. Don't bother distinguishing this case.
+	 * 2. Usually, objects are taken off the LRU for reclaim. In
+	 *    this case this isn't possible, because if reclaim fails
+	 *    for whatever reason, we have no means of knowing if the
+	 *    entry is alive to put it back on the LRU.
 	 *
-	 * But since they do exist in theory, the entry cannot just
-	 * be unlinked, or we could leak it. Hence, rotate.
+	 *    So rotate it before dropping the lock. If the entry is
+	 *    written back or invalidated, the free path will unlink
+	 *    it. For failures, rotation is the right thing as well.
+	 *
+	 *    Temporary failures, where the same entry should be tried
+	 *    again immediately, almost never happen for this shrinker.
+	 *    We don't do any trylocking; -ENOMEM comes closest,
+	 *    but that's extremely rare and doesn't happen spuriously
+	 *    either. Don't bother distinguishing this case.
 	 */
 	list_move_tail(item, &l->list);
 

-- 
b4 0.10.1

* [PATCH v2 2/6] mm/zswap: invalidate zswap entry when swap entry free
  2024-02-04  3:05 [PATCH v2 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
  2024-02-04  3:05 ` [PATCH v2 1/6] mm/zswap: add more comments in shrink_memcg_cb() Chengming Zhou
@ 2024-02-04  3:06 ` Chengming Zhou
  2024-02-05 21:20   ` Yosry Ahmed
  2024-02-04  3:06 ` [PATCH v2 3/6] mm/zswap: stop lru list shrinking when encounter warm region Chengming Zhou
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 8+ messages in thread
From: Chengming Zhou @ 2024-02-04  3:06 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed, Andrew Morton, Johannes Weiner
  Cc: linux-mm, Nhat Pham, Chengming Zhou, linux-kernel, Yosry Ahmed,
	Johannes Weiner

During testing I found that zswap_writeback_entry() sometimes returns
-ENOMEM, which is not what we expected:

bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[-12]: 1563
@[0]: 277221

The reason is that __read_swap_cache_async() returns NULL because
swapcache_prepare() fails. That in turn happens because we don't
invalidate the zswap entry when the swap entry is freed to the per-cpu
pool, so these zswap entries are still on the zswap tree and lru list.

This patch moves the invalidation ahead to when the swap entry is
freed to the per-cpu pool, since there is no benefit to leaving trashy
zswap entries on the tree and lru list.

With this patch:
bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[0]: 259744

Note: a large folio can't have a zswap entry for now, so we don't
bother to add zswap entry invalidation in the large folio swap free path.

Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/zswap.h | 4 ++--
 mm/swap_slots.c       | 3 +++
 mm/swapfile.c         | 1 -
 mm/zswap.c            | 5 +++--
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 91895ce1fdbc..341aea490070 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -29,7 +29,7 @@ struct zswap_lruvec_state {
 
 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
-void zswap_invalidate(int type, pgoff_t offset);
+void zswap_invalidate(swp_entry_t swp);
 int zswap_swapon(int type, unsigned long nr_pages);
 void zswap_swapoff(int type);
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
@@ -50,7 +50,7 @@ static inline bool zswap_load(struct folio *folio)
 	return false;
 }
 
-static inline void zswap_invalidate(int type, pgoff_t offset) {}
+static inline void zswap_invalidate(swp_entry_t swp) {}
 static inline int zswap_swapon(int type, unsigned long nr_pages)
 {
 	return 0;
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..90973ce7881d 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -273,6 +273,9 @@ void free_swap_slot(swp_entry_t entry)
 {
 	struct swap_slots_cache *cache;
 
+	/* Large folio swap slot is not covered. */
+	zswap_invalidate(entry);
+
 	cache = raw_cpu_ptr(&swp_slots);
 	if (likely(use_swap_slot_cache && cache->slots_ret)) {
 		spin_lock_irq(&cache->free_lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0580bb3e34d7..65b49db89b36 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -744,7 +744,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 		swap_slot_free_notify = NULL;
 	while (offset <= end) {
 		arch_swap_invalidate_page(si->type, offset);
-		zswap_invalidate(si->type, offset);
 		if (swap_slot_free_notify)
 			swap_slot_free_notify(si->bdev, offset);
 		offset++;
diff --git a/mm/zswap.c b/mm/zswap.c
index 735f1a6ef336..d8bb0e06e2b0 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1738,9 +1738,10 @@ bool zswap_load(struct folio *folio)
 	return true;
 }
 
-void zswap_invalidate(int type, pgoff_t offset)
+void zswap_invalidate(swp_entry_t swp)
 {
-	struct zswap_tree *tree = swap_zswap_tree(swp_entry(type, offset));
+	pgoff_t offset = swp_offset(swp);
+	struct zswap_tree *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 
 	spin_lock(&tree->lock);

-- 
b4 0.10.1

* [PATCH v2 3/6] mm/zswap: stop lru list shrinking when encounter warm region
  2024-02-04  3:05 [PATCH v2 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
  2024-02-04  3:05 ` [PATCH v2 1/6] mm/zswap: add more comments in shrink_memcg_cb() Chengming Zhou
  2024-02-04  3:06 ` [PATCH v2 2/6] mm/zswap: invalidate zswap entry when swap entry free Chengming Zhou
@ 2024-02-04  3:06 ` Chengming Zhou
  2024-02-04  3:06 ` [PATCH v2 4/6] mm/zswap: remove duplicate_entry debug value Chengming Zhou
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Chengming Zhou @ 2024-02-04  3:06 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed, Andrew Morton, Johannes Weiner
  Cc: linux-mm, Nhat Pham, Chengming Zhou, linux-kernel, Yosry Ahmed,
	Johannes Weiner

When the shrinker encounters an existing folio in the swap cache, it
means we are shrinking into the warmer region. We should terminate
shrinking if we're in the dynamic shrinker context.

This patch adds LRU_STOP to support this, to avoid overshrinking.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/list_lru.h | 2 ++
 mm/list_lru.c            | 3 +++
 mm/zswap.c               | 4 +++-
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index f2882a820690..792b67ceb631 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -24,6 +24,8 @@ enum lru_status {
 	LRU_SKIP,		/* item cannot be locked, skip */
 	LRU_RETRY,		/* item not freeable. May drop the lock
 				   internally, but has to return locked. */
+	LRU_STOP,		/* stop lru list walking. May drop the lock
+				   internally, but has to return locked. */
 };
 
 struct list_lru_one {
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 61f3b6b1134f..3fd64736bc45 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -243,6 +243,9 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
 			 */
 			assert_spin_locked(&nlru->lock);
 			goto restart;
+		case LRU_STOP:
+			assert_spin_locked(&nlru->lock);
+			goto out;
 		default:
 			BUG();
 		}
diff --git a/mm/zswap.c b/mm/zswap.c
index d8bb0e06e2b0..4381b7a2d4d6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1315,8 +1315,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 		 * into the warmer region. We should terminate shrinking (if we're in the dynamic
 		 * shrinker context).
 		 */
-		if (writeback_result == -EEXIST && encountered_page_in_swapcache)
+		if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
+			ret = LRU_STOP;
 			*encountered_page_in_swapcache = true;
+		}
 	} else {
 		zswap_written_back_pages++;
 	}

-- 
b4 0.10.1

* [PATCH v2 4/6] mm/zswap: remove duplicate_entry debug value
  2024-02-04  3:05 [PATCH v2 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
                   ` (2 preceding siblings ...)
  2024-02-04  3:06 ` [PATCH v2 3/6] mm/zswap: stop lru list shrinking when encounter warm region Chengming Zhou
@ 2024-02-04  3:06 ` Chengming Zhou
  2024-02-04  3:06 ` [PATCH v2 5/6] mm/zswap: only support zswap_exclusive_loads_enabled Chengming Zhou
  2024-02-04  3:06 ` [PATCH v2 6/6] mm/zswap: zswap entry doesn't need refcount anymore Chengming Zhou
  5 siblings, 0 replies; 8+ messages in thread
From: Chengming Zhou @ 2024-02-04  3:06 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed, Andrew Morton, Johannes Weiner
  Cc: linux-mm, Nhat Pham, Chengming Zhou, linux-kernel, Yosry Ahmed,
	Johannes Weiner

cat /sys/kernel/debug/zswap/duplicate_entry
2086447

When testing, the duplicate_entry value is very high, but there is no
warning message in the kernel log. Judging from the comment on
duplicate_entry, "Duplicate store was encountered (rare)", it seems
something has gone wrong.

Actually it's incremented at the beginning of zswap_store(), when the
folio's zswap entry is found to be already on the tree. This is a
normal case, since the folio can leave a zswap entry on the tree after
swapin; if it's later dirtied and swapped out/zswap_store()d again, it
finds its original zswap entry.

So duplicate_entry should only be incremented in the real bug case,
which already has "WARN_ON(1)"; it looks redundant to count the bug
case, so this patch just removes it.
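
After this change, the remaining zswap debugfs counters can still be
checked in one go, e.g. (path as in the example above):

  grep . /sys/kernel/debug/zswap/*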

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 4381b7a2d4d6..3fbb7e2c8b8d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -71,8 +71,6 @@ static u64 zswap_reject_compress_poor;
 static u64 zswap_reject_alloc_fail;
 /* Store failed because the entry metadata could not be allocated (rare) */
 static u64 zswap_reject_kmemcache_fail;
-/* Duplicate store was encountered (rare) */
-static u64 zswap_duplicate_entry;
 
 /* Shrinker work queue */
 static struct workqueue_struct *shrink_wq;
@@ -1571,10 +1569,8 @@ bool zswap_store(struct folio *folio)
 	 */
 	spin_lock(&tree->lock);
 	entry = zswap_rb_search(&tree->rbroot, offset);
-	if (entry) {
+	if (entry)
 		zswap_invalidate_entry(tree, entry);
-		zswap_duplicate_entry++;
-	}
 	spin_unlock(&tree->lock);
 	objcg = get_obj_cgroup_from_folio(folio);
 	if (objcg && !obj_cgroup_may_zswap(objcg)) {
@@ -1661,7 +1657,6 @@ bool zswap_store(struct folio *folio)
 	 */
 	while (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
 		WARN_ON(1);
-		zswap_duplicate_entry++;
 		zswap_invalidate_entry(tree, dupentry);
 	}
 	if (entry->length) {
@@ -1822,8 +1817,6 @@ static int zswap_debugfs_init(void)
 			   zswap_debugfs_root, &zswap_reject_compress_poor);
 	debugfs_create_u64("written_back_pages", 0444,
 			   zswap_debugfs_root, &zswap_written_back_pages);
-	debugfs_create_u64("duplicate_entry", 0444,
-			   zswap_debugfs_root, &zswap_duplicate_entry);
 	debugfs_create_u64("pool_total_size", 0444,
 			   zswap_debugfs_root, &zswap_pool_total_size);
 	debugfs_create_atomic_t("stored_pages", 0444,

-- 
b4 0.10.1

* [PATCH v2 5/6] mm/zswap: only support zswap_exclusive_loads_enabled
  2024-02-04  3:05 [PATCH v2 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
                   ` (3 preceding siblings ...)
  2024-02-04  3:06 ` [PATCH v2 4/6] mm/zswap: remove duplicate_entry debug value Chengming Zhou
@ 2024-02-04  3:06 ` Chengming Zhou
  2024-02-04  3:06 ` [PATCH v2 6/6] mm/zswap: zswap entry doesn't need refcount anymore Chengming Zhou
  5 siblings, 0 replies; 8+ messages in thread
From: Chengming Zhou @ 2024-02-04  3:06 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed, Andrew Morton, Johannes Weiner
  Cc: linux-mm, Nhat Pham, Chengming Zhou, linux-kernel, Yosry Ahmed,
	Johannes Weiner

The !zswap_exclusive_loads_enabled mode will leave a compressed copy in
the zswap tree and lru list after the folio swapin.

There are some disadvantages in this mode:
1. It's a waste of memory since there are two copies of the data: one
   is the folio, the other is the compressed data in zswap. And it's
   unlikely the compressed data will be useful in the near future.

2. If that folio is dirtied, the compressed data must not be useful,
   but we don't know that and don't invalidate the trashy memory in zswap.

3. It's not reclaimable by the zswap shrinker, since zswap_writeback_entry()
   will always return -EEXIST and terminate the shrinking process.

On the other hand, the only downside of zswap_exclusive_loads_enabled
is a little more cpu usage/latency from re-compression, which only
happens if the folio is removed from the swapcache or dirtied and then
swapped out again.

More explanation by Johannes on why we should consider exclusive load
as the default for zswap:

  Caching "swapout work" is helpful when the system is thrashing. Then
  recently swapped in pages might get swapped out again very soon. It
  certainly makes sense with conventional swap, because keeping a clean
  copy on the disk saves IO work and doesn't cost any additional memory.

  But with zswap, it's different. It saves some compression work on a
  thrashing page. But the act of keeping compressed memory contributes
  to a higher rate of thrashing. And that can cause IO in other places
  like zswap writeback and file memory.

And the A/B test results of a kernel build in tmpfs with limited memory
support this theory:

			!exclusive	exclusive
real                       63.80         63.01
user                       1063.83       1061.32
sys                        290.31        266.15

workingset_refault_anon    2383084.40    1976397.40
workingset_refault_file    44134.00      45689.40
workingset_activate_anon   837878.00     728441.20
workingset_activate_file   4710.00       4085.20
workingset_restore_anon    732622.60     639428.40
workingset_restore_file    1007.00       926.80
workingset_nodereclaim     0.00          0.00
pgscan                     14343003.40   12409570.20
pgscan_kswapd              0.00          0.00
pgscan_direct              14343003.40   12409570.20
pgscan_khugepaged          0.00          0.00
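
The workingset_* and pgscan* counters above correspond to fields in the
cgroup v2 memory.stat file of the test cgroup; a sketch of how such an
A/B sample could be collected (the cgroup path is an assumption):

  grep -E '^(workingset_|pgscan)' /sys/fs/cgroup/zswap-test/memory.stat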

Acked-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/Kconfig | 16 ----------------
 mm/zswap.c | 14 +++-----------
 2 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index ffc3a2ba3a8c..673b35629074 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -45,22 +45,6 @@ config ZSWAP_DEFAULT_ON
 	  The selection made here can be overridden by using the kernel
 	  command line 'zswap.enabled=' option.
 
-config ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON
-	bool "Invalidate zswap entries when pages are loaded"
-	depends on ZSWAP
-	help
-	  If selected, exclusive loads for zswap will be enabled at boot,
-	  otherwise it will be disabled.
-
-	  If exclusive loads are enabled, when a page is loaded from zswap,
-	  the zswap entry is invalidated at once, as opposed to leaving it
-	  in zswap until the swap entry is freed.
-
-	  This avoids having two copies of the same page in memory
-	  (compressed and uncompressed) after faulting in a page from zswap.
-	  The cost is that if the page was never dirtied and needs to be
-	  swapped out again, it will be re-compressed.
-
 config ZSWAP_SHRINKER_DEFAULT_ON
 	bool "Shrink the zswap pool on memory pressure"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 3fbb7e2c8b8d..cbf379abb6c7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -139,10 +139,6 @@ static bool zswap_non_same_filled_pages_enabled = true;
 module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
 		   bool, 0644);
 
-static bool zswap_exclusive_loads_enabled = IS_ENABLED(
-		CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON);
-module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
-
 /* Number of zpools in zswap_pool (empirically determined for scalability) */
 #define ZSWAP_NR_ZPOOLS 32
 
@@ -1722,16 +1718,12 @@ bool zswap_load(struct folio *folio)
 		count_objcg_event(entry->objcg, ZSWPIN);
 
 	spin_lock(&tree->lock);
-	if (zswap_exclusive_loads_enabled) {
-		zswap_invalidate_entry(tree, entry);
-		folio_mark_dirty(folio);
-	} else if (entry->length) {
-		zswap_lru_del(&entry->pool->list_lru, entry);
-		zswap_lru_add(&entry->pool->list_lru, entry);
-	}
+	zswap_invalidate_entry(tree, entry);
 	zswap_entry_put(entry);
 	spin_unlock(&tree->lock);
 
+	folio_mark_dirty(folio);
+
 	return true;
 }
 

-- 
b4 0.10.1

* [PATCH v2 6/6] mm/zswap: zswap entry doesn't need refcount anymore
  2024-02-04  3:05 [PATCH v2 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
                   ` (4 preceding siblings ...)
  2024-02-04  3:06 ` [PATCH v2 5/6] mm/zswap: only support zswap_exclusive_loads_enabled Chengming Zhou
@ 2024-02-04  3:06 ` Chengming Zhou
  5 siblings, 0 replies; 8+ messages in thread
From: Chengming Zhou @ 2024-02-04  3:06 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed, Andrew Morton, Johannes Weiner
  Cc: linux-mm, Nhat Pham, Chengming Zhou, linux-kernel, Yosry Ahmed,
	Johannes Weiner

Since we no longer need to leave the zswap entry on the zswap tree,
we should remove it from the tree as soon as we find it there.

Then after using it, we can directly free it; no concurrent path can
find it on the tree. Only the shrinker can see it on the lru list,
and it will double-check under the tree lock, so there is no race
problem.

So we don't need the refcount in the zswap entry anymore, and don't
need to take the spinlock a second time to invalidate it.

The side effect is that zswap_entry_free() may not happen under the
tree spinlock, but that's ok since nothing needs to be protected by
the lock.

Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 63 +++++++++++---------------------------------------------------
 1 file changed, 11 insertions(+), 52 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index cbf379abb6c7..cd67f7f6b302 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -193,12 +193,6 @@ struct zswap_pool {
  *
  * rbnode - links the entry into red-black tree for the appropriate swap type
  * swpentry - associated swap entry, the offset indexes into the red-black tree
- * refcount - the number of outstanding reference to the entry. This is needed
- *            to protect against premature freeing of the entry by code
- *            concurrent calls to load, invalidate, and writeback.  The lock
- *            for the zswap_tree structure that contains the entry must
- *            be held while changing the refcount.  Since the lock must
- *            be held, there is no reason to also make refcount atomic.
  * length - the length in bytes of the compressed page data.  Needed during
  *          decompression. For a same value filled page length is 0, and both
  *          pool and lru are invalid and must be ignored.
@@ -211,7 +205,6 @@ struct zswap_pool {
 struct zswap_entry {
 	struct rb_node rbnode;
 	swp_entry_t swpentry;
-	int refcount;
 	unsigned int length;
 	struct zswap_pool *pool;
 	union {
@@ -222,11 +215,6 @@ struct zswap_entry {
 	struct list_head lru;
 };
 
-/*
- * The tree lock in the zswap_tree struct protects a few things:
- * - the rbtree
- * - the refcount field of each entry in the tree
- */
 struct zswap_tree {
 	struct rb_root rbroot;
 	spinlock_t lock;
@@ -890,14 +878,10 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
 	return 0;
 }
 
-static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
+static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
 {
-	if (!RB_EMPTY_NODE(&entry->rbnode)) {
-		rb_erase(&entry->rbnode, root);
-		RB_CLEAR_NODE(&entry->rbnode);
-		return true;
-	}
-	return false;
+	rb_erase(&entry->rbnode, root);
+	RB_CLEAR_NODE(&entry->rbnode);
 }
 
 /*********************************
@@ -911,7 +895,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
 	entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
 	if (!entry)
 		return NULL;
-	entry->refcount = 1;
 	RB_CLEAR_NODE(&entry->rbnode);
 	return entry;
 }
@@ -954,33 +937,15 @@ static void zswap_entry_free(struct zswap_entry *entry)
 	zswap_update_total_size();
 }
 
-/* caller must hold the tree lock */
-static void zswap_entry_get(struct zswap_entry *entry)
-{
-	WARN_ON_ONCE(!entry->refcount);
-	entry->refcount++;
-}
-
-/* caller must hold the tree lock */
-static void zswap_entry_put(struct zswap_entry *entry)
-{
-	WARN_ON_ONCE(!entry->refcount);
-	if (--entry->refcount == 0) {
-		WARN_ON_ONCE(!RB_EMPTY_NODE(&entry->rbnode));
-		zswap_entry_free(entry);
-	}
-}
-
 /*
- * If the entry is still valid in the tree, drop the initial ref and remove it
- * from the tree. This function must be called with an additional ref held,
- * otherwise it may race with another invalidation freeing the entry.
+ * The caller holds the tree lock and has found the entry on the tree,
+ * so it must still be on the tree; remove it from the tree and free it.
  */
 static void zswap_invalidate_entry(struct zswap_tree *tree,
 				   struct zswap_entry *entry)
 {
-	if (zswap_rb_erase(&tree->rbroot, entry))
-		zswap_entry_put(entry);
+	zswap_rb_erase(&tree->rbroot, entry);
+	zswap_entry_free(entry);
 }
 
 /*********************************
@@ -1219,7 +1184,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	}
 
 	/* Safe to deref entry after the entry is verified above. */
-	zswap_entry_get(entry);
+	zswap_rb_erase(&tree->rbroot, entry);
 	spin_unlock(&tree->lock);
 
 	zswap_decompress(entry, &folio->page);
@@ -1228,10 +1193,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPWB);
 
-	spin_lock(&tree->lock);
-	zswap_invalidate_entry(tree, entry);
-	zswap_entry_put(entry);
-	spin_unlock(&tree->lock);
+	zswap_entry_free(entry);
 
 	/* folio is up to date */
 	folio_mark_uptodate(folio);
@@ -1702,7 +1664,7 @@ bool zswap_load(struct folio *folio)
 		spin_unlock(&tree->lock);
 		return false;
 	}
-	zswap_entry_get(entry);
+	zswap_rb_erase(&tree->rbroot, entry);
 	spin_unlock(&tree->lock);
 
 	if (entry->length)
@@ -1717,10 +1679,7 @@ bool zswap_load(struct folio *folio)
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPIN);
 
-	spin_lock(&tree->lock);
-	zswap_invalidate_entry(tree, entry);
-	zswap_entry_put(entry);
-	spin_unlock(&tree->lock);
+	zswap_entry_free(entry);
 
 	folio_mark_dirty(folio);
 

-- 
b4 0.10.1

* Re: [PATCH v2 2/6] mm/zswap: invalidate zswap entry when swap entry free
  2024-02-04  3:06 ` [PATCH v2 2/6] mm/zswap: invalidate zswap entry when swap entry free Chengming Zhou
@ 2024-02-05 21:20   ` Yosry Ahmed
  0 siblings, 0 replies; 8+ messages in thread
From: Yosry Ahmed @ 2024-02-05 21:20 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Andrew Morton, Johannes Weiner, linux-mm, linux-kernel

On Sun, Feb 04, 2024 at 03:06:00AM +0000, Chengming Zhou wrote:
> During testing I found that zswap_writeback_entry() sometimes returns
> -ENOMEM, which is not what we expected:
> 
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[-12]: 1563
> @[0]: 277221
> 
> The reason is that __read_swap_cache_async() returns NULL because
> swapcache_prepare() fails. That in turn happens because we don't
> invalidate the zswap entry when the swap entry is freed to the per-cpu
> pool, so these zswap entries are still on the zswap tree and lru list.
> 
> This patch moves the invalidation ahead to when the swap entry is
> freed to the per-cpu pool, since there is no benefit to leaving trashy
> zswap entries on the tree and lru list.
> 
> With this patch:
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[0]: 259744
> 
> Note: a large folio can't have a zswap entry for now, so we don't
> bother to add zswap entry invalidation in the large folio swap free path.
> 
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Yosry Ahmed <yosryahmed@google.com>

