linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] mm/zswap: optimize zswap lru list
@ 2024-02-01 15:49 Chengming Zhou
  2024-02-01 15:49 ` [PATCH 1/6] mm/zswap: add more comments in shrink_memcg_cb() Chengming Zhou
                   ` (5 more replies)
  0 siblings, 6 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-01 15:49 UTC (permalink / raw)
  To: Nhat Pham, Johannes Weiner, Andrew Morton, Yosry Ahmed
  Cc: linux-kernel, Yosry Ahmed, Chengming Zhou, Johannes Weiner, linux-mm

Hi all,

This series was motivated by observing the zswap lru list shrinking and
noticing some unexpected return values from zswap_writeback_entry():

bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'

There are some -ENOMEM failures because freeing a swap entry to the
per-cpu swap pool does not invalidate/drop its zswap entry. When the
shrinker later encounters these trashy zswap entries, they can't be
reclaimed and it returns -ENOMEM.

So move the invalidation ahead to the point where the swap entry is
freed to the per-cpu swap pool, since there is no benefit in leaving
trashy zswap entries on the zswap tree and lru list.
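
The gist of that change (patch 2) is a single hook in the swap slot
free path. A rough sketch, assuming the zswap_invalidate() interface
is switched to take a swp_entry_t as done in that patch:

void free_swap_slot(swp_entry_t entry)
{
	/* Drop the stale zswap copy before the slot is cached for reuse. */
	zswap_invalidate(entry);

	/* ... the existing per-cpu slot caching path stays as is ... */
}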

Another case is -EEXIST, which is seen more often with
!zswap_exclusive_loads_enabled, where a swapped-in folio leaves its
compressed copy on the tree and lru list. That copy can't be reclaimed
until the folio is removed from the swapcache.

Changing to zswap_exclusive_loads_enabled mode invalidates the entry on
folio swapin. That has its own drawback: if the folio stays clean in the
swapcache and is swapped out again, we need to compress it again. But
that seems an unlikely case, so I'm just sending this out for
discussion. Please see the commit for details.

Another optimization for the -EEXIST case is the addition of LRU_STOP,
which lets the shrinker terminate the lru walk instead of evicting the
warmer region; a sketch of how a walk callback uses it follows.
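
A walk callback can use the new return value roughly like this
(should_stop_walk() is a made-up placeholder for whatever warm-region
check applies; like LRU_RETRY, LRU_STOP must be returned with the lru
lock still held):

static enum lru_status my_isolate(struct list_head *item,
				  struct list_lru_one *l,
				  spinlock_t *lock, void *cb_arg)
{
	if (should_stop_walk(cb_arg))
		return LRU_STOP;	/* end the walk early */

	return LRU_ROTATE;		/* otherwise let the walker rotate it */
}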

Testing was done with a kernel build in tmpfs, one 50GB swapfile, zswap
shrinker_enabled on, and memory.max set to 2GB.

                mm-unstable   zswap-optimize
real               63.90s       63.25s
user             1064.05s     1063.40s
sys               292.32s      270.94s

The main improvement is in sys CPU time, about 7% (292.32s -> 270.94s).

Thanks for review and comments!

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Chengming Zhou (6):
      mm/zswap: add more comments in shrink_memcg_cb()
      mm/zswap: invalidate zswap entry when swap entry free
      mm/zswap: stop lru list shrinking when encounter warm region
      mm/zswap: remove duplicate_entry debug value
      mm/zswap: only support zswap_exclusive_loads_enabled
      mm/zswap: zswap entry doesn't need refcount anymore

 include/linux/list_lru.h |   1 +
 include/linux/zswap.h    |   4 +-
 mm/Kconfig               |  16 ------
 mm/list_lru.c            |   3 ++
 mm/swap_slots.c          |   2 +
 mm/swapfile.c            |   1 -
 mm/zswap.c               | 136 ++++++++++++++++-------------------------------
 7 files changed, 54 insertions(+), 109 deletions(-)
---
base-commit: 3a92c45e4ba694381c46994f3fde0d8544a2088b
change-id: 20240201-b4-zswap-invalidate-entry-b77dea670325

Best regards,
-- 
Chengming Zhou <zhouchengming@bytedance.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/6] mm/zswap: add more comments in shrink_memcg_cb()
  2024-02-01 15:49 [PATCH 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
@ 2024-02-01 15:49 ` Chengming Zhou
  2024-02-01 17:45   ` Johannes Weiner
                     ` (2 more replies)
  2024-02-01 15:49 ` [PATCH 2/6] mm/zswap: invalidate zswap entry when swap entry free Chengming Zhou
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-01 15:49 UTC (permalink / raw)
  To: Nhat Pham, Johannes Weiner, Andrew Morton, Yosry Ahmed
  Cc: linux-kernel, Yosry Ahmed, Chengming Zhou, Johannes Weiner, linux-mm

Add more comments in shrink_memcg_cb() to describe the deref dance,
which is implemented to fix the race between lru writeback and swapoff,
and the reason why we rotate the entry at the beginning.

Also fix the stale comments in zswap_writeback_entry(), and add
more comments to state that we only deref the tree after we get
the swapcache reference.

Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 4aea03285532..735f1a6ef336 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1207,10 +1207,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 
 	/*
 	 * folio is locked, and the swapcache is now secured against
-	 * concurrent swapping to and from the slot. Verify that the
-	 * swap entry hasn't been invalidated and recycled behind our
-	 * backs (our zswap_entry reference doesn't prevent that), to
-	 * avoid overwriting a new swap folio with old compressed data.
+	 * concurrent swapping to and from the slot, and concurrent
+	 * swapoff so we can safely dereference the zswap tree here.
+	 * Verify that the swap entry hasn't been invalidated and recycled
+	 * behind our backs, to avoid overwriting a new swap folio with
+	 * old compressed data. Only when this is successful can the entry
+	 * be dereferenced.
 	 */
 	tree = swap_zswap_tree(swpentry);
 	spin_lock(&tree->lock);
@@ -1263,22 +1265,29 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	int writeback_result;
 
 	/*
-	 * Rotate the entry to the tail before unlocking the LRU,
-	 * so that in case of an invalidation race concurrent
-	 * reclaimers don't waste their time on it.
+	 * As soon as we drop the LRU lock, the entry can be freed by
+	 * a concurrent invalidation. This means the following:
 	 *
-	 * If writeback succeeds, or failure is due to the entry
-	 * being invalidated by the swap subsystem, the invalidation
-	 * will unlink and free it.
+	 * 1. We extract the swp_entry_t to the stack, allowing
+	 *    zswap_writeback_entry() to pin the swap entry and
+	 *    then validate the zswap entry against that swap entry's
+	 *    tree using pointer value comparison. Only when that
+	 *    is successful can the entry be dereferenced.
 	 *
-	 * Temporary failures, where the same entry should be tried
-	 * again immediately, almost never happen for this shrinker.
-	 * We don't do any trylocking; -ENOMEM comes closest,
-	 * but that's extremely rare and doesn't happen spuriously
-	 * either. Don't bother distinguishing this case.
+	 * 2. Usually, objects are taken off the LRU for reclaim. In
+	 *    this case this isn't possible, because if reclaim fails
+	 *    for whatever reason, we have no means of knowing if the
+	 *    entry is alive to put it back on the LRU.
 	 *
-	 * But since they do exist in theory, the entry cannot just
-	 * be unlinked, or we could leak it. Hence, rotate.
+	 *    So rotate it before dropping the lock. If the entry is
+	 *    written back or invalidated, the free path will unlink
+	 *    it. For failures, rotation is the right thing as well.
+	 *
+	 *    Temporary failures, where the same entry should be tried
+	 *    again immediately, almost never happen for this shrinker.
+	 *    We don't do any trylocking; -ENOMEM comes closest,
+	 *    but that's extremely rare and doesn't happen spuriously
+	 *    either. Don't bother distinguishing this case.
 	 */
 	list_move_tail(item, &l->list);
 

-- 
b4 0.10.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/6] mm/zswap: invalidate zswap entry when swap entry free
  2024-02-01 15:49 [PATCH 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
  2024-02-01 15:49 ` [PATCH 1/6] mm/zswap: add more comments in shrink_memcg_cb() Chengming Zhou
@ 2024-02-01 15:49 ` Chengming Zhou
  2024-02-01 17:49   ` Johannes Weiner
                     ` (2 more replies)
  2024-02-01 15:49 ` [PATCH 3/6] mm/zswap: stop lru list shrinking when encounter warm region Chengming Zhou
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-01 15:49 UTC (permalink / raw)
  To: Nhat Pham, Johannes Weiner, Andrew Morton, Yosry Ahmed
  Cc: linux-kernel, Yosry Ahmed, Chengming Zhou, Johannes Weiner, linux-mm

During testing I found that zswap_writeback_entry() sometimes returns
-ENOMEM, which is not what we expect:

bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[-12]: 1563
@[0]: 277221

The -ENOMEM comes from __read_swap_cache_async() returning NULL because
swapcache_prepare() failed. That in turn happens because we don't
invalidate the zswap entry when the swap entry is freed to the per-cpu
pool, so these zswap entries are still on the zswap tree and lru list.

This patch moves the invalidation ahead to the point where the swap
entry is freed to the per-cpu pool, since there is no benefit in leaving
a trashy zswap entry on the tree and lru list.

With this patch:
bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[0]: 259744

Note: a large folio can't have a zswap entry for now, so don't bother
adding zswap entry invalidation in the large folio swap free path.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/zswap.h | 4 ++--
 mm/swap_slots.c       | 2 ++
 mm/swapfile.c         | 1 -
 mm/zswap.c            | 5 +++--
 4 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 91895ce1fdbc..341aea490070 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -29,7 +29,7 @@ struct zswap_lruvec_state {
 
 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
-void zswap_invalidate(int type, pgoff_t offset);
+void zswap_invalidate(swp_entry_t swp);
 int zswap_swapon(int type, unsigned long nr_pages);
 void zswap_swapoff(int type);
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
@@ -50,7 +50,7 @@ static inline bool zswap_load(struct folio *folio)
 	return false;
 }
 
-static inline void zswap_invalidate(int type, pgoff_t offset) {}
+static inline void zswap_invalidate(swp_entry_t swp) {}
 static inline int zswap_swapon(int type, unsigned long nr_pages)
 {
 	return 0;
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..d24cdea26daa 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -273,6 +273,8 @@ void free_swap_slot(swp_entry_t entry)
 {
 	struct swap_slots_cache *cache;
 
+	zswap_invalidate(entry);
+
 	cache = raw_cpu_ptr(&swp_slots);
 	if (likely(use_swap_slot_cache && cache->slots_ret)) {
 		spin_lock_irq(&cache->free_lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0580bb3e34d7..65b49db89b36 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -744,7 +744,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 		swap_slot_free_notify = NULL;
 	while (offset <= end) {
 		arch_swap_invalidate_page(si->type, offset);
-		zswap_invalidate(si->type, offset);
 		if (swap_slot_free_notify)
 			swap_slot_free_notify(si->bdev, offset);
 		offset++;
diff --git a/mm/zswap.c b/mm/zswap.c
index 735f1a6ef336..d8bb0e06e2b0 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1738,9 +1738,10 @@ bool zswap_load(struct folio *folio)
 	return true;
 }
 
-void zswap_invalidate(int type, pgoff_t offset)
+void zswap_invalidate(swp_entry_t swp)
 {
-	struct zswap_tree *tree = swap_zswap_tree(swp_entry(type, offset));
+	pgoff_t offset = swp_offset(swp);
+	struct zswap_tree *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 
 	spin_lock(&tree->lock);

-- 
b4 0.10.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 3/6] mm/zswap: stop lru list shrinking when encounter warm region
  2024-02-01 15:49 [PATCH 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
  2024-02-01 15:49 ` [PATCH 1/6] mm/zswap: add more comments in shrink_memcg_cb() Chengming Zhou
  2024-02-01 15:49 ` [PATCH 2/6] mm/zswap: invalidate zswap entry when swap entry free Chengming Zhou
@ 2024-02-01 15:49 ` Chengming Zhou
  2024-02-01 17:51   ` Johannes Weiner
                     ` (2 more replies)
  2024-02-01 15:49 ` [PATCH 4/6] mm/zswap: remove duplicate_entry debug value Chengming Zhou
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-01 15:49 UTC (permalink / raw)
  To: Nhat Pham, Johannes Weiner, Andrew Morton, Yosry Ahmed
  Cc: linux-kernel, Yosry Ahmed, Chengming Zhou, Johannes Weiner, linux-mm

When the shrinker encounters an existing folio in the swap cache, it
means we are shrinking into the warmer region. We should terminate
shrinking if we're in the dynamic shrinker context.

This patch adds LRU_STOP to support this, to avoid overshrinking.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 include/linux/list_lru.h | 1 +
 mm/list_lru.c            | 3 +++
 mm/zswap.c               | 4 +++-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index f2882a820690..5633e970144b 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -24,6 +24,7 @@ enum lru_status {
 	LRU_SKIP,		/* item cannot be locked, skip */
 	LRU_RETRY,		/* item not freeable. May drop the lock
 				   internally, but has to return locked. */
+	LRU_STOP,		/* stop lru list walking */
 };
 
 struct list_lru_one {
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 61f3b6b1134f..3fd64736bc45 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -243,6 +243,9 @@ __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
 			 */
 			assert_spin_locked(&nlru->lock);
 			goto restart;
+		case LRU_STOP:
+			assert_spin_locked(&nlru->lock);
+			goto out;
 		default:
 			BUG();
 		}
diff --git a/mm/zswap.c b/mm/zswap.c
index d8bb0e06e2b0..4381b7a2d4d6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1315,8 +1315,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 		 * into the warmer region. We should terminate shrinking (if we're in the dynamic
 		 * shrinker context).
 		 */
-		if (writeback_result == -EEXIST && encountered_page_in_swapcache)
+		if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
+			ret = LRU_STOP;
 			*encountered_page_in_swapcache = true;
+		}
 	} else {
 		zswap_written_back_pages++;
 	}

-- 
b4 0.10.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 4/6] mm/zswap: remove duplicate_entry debug value
  2024-02-01 15:49 [PATCH 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
                   ` (2 preceding siblings ...)
  2024-02-01 15:49 ` [PATCH 3/6] mm/zswap: stop lru list shrinking when encounter warm region Chengming Zhou
@ 2024-02-01 15:49 ` Chengming Zhou
  2024-02-01 17:55   ` Johannes Weiner
                     ` (2 more replies)
  2024-02-01 15:49 ` [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled Chengming Zhou
  2024-02-01 15:49 ` [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore Chengming Zhou
  5 siblings, 3 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-01 15:49 UTC (permalink / raw)
  To: Nhat Pham, Johannes Weiner, Andrew Morton, Yosry Ahmed
  Cc: linux-kernel, Yosry Ahmed, Chengming Zhou, Johannes Weiner, linux-mm

cat /sys/kernel/debug/zswap/duplicate_entry
2086447

During testing, the duplicate_entry value is very high, but there is no
warning message in the kernel log. Going by the comment on
duplicate_entry, "Duplicate store was encountered (rare)", it looks like
something has gone wrong.

Actually it's incremented at the beginning of zswap_store(), when the
folio's zswap entry is found to be already on the tree. This is a normal
case: the folio can leave its zswap entry on the tree after swapin, then
get dirtied and swapped out/zswap_store'd again, at which point its
original zswap entry is found. (Maybe we can reuse it instead of
invalidating it?)

So duplicate_entry would only be incremented in the real bug case,
which already has a WARN_ON(1). It looks redundant to count a bug case,
so this patch just removes the counter.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 4381b7a2d4d6..3fbb7e2c8b8d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -71,8 +71,6 @@ static u64 zswap_reject_compress_poor;
 static u64 zswap_reject_alloc_fail;
 /* Store failed because the entry metadata could not be allocated (rare) */
 static u64 zswap_reject_kmemcache_fail;
-/* Duplicate store was encountered (rare) */
-static u64 zswap_duplicate_entry;
 
 /* Shrinker work queue */
 static struct workqueue_struct *shrink_wq;
@@ -1571,10 +1569,8 @@ bool zswap_store(struct folio *folio)
 	 */
 	spin_lock(&tree->lock);
 	entry = zswap_rb_search(&tree->rbroot, offset);
-	if (entry) {
+	if (entry)
 		zswap_invalidate_entry(tree, entry);
-		zswap_duplicate_entry++;
-	}
 	spin_unlock(&tree->lock);
 	objcg = get_obj_cgroup_from_folio(folio);
 	if (objcg && !obj_cgroup_may_zswap(objcg)) {
@@ -1661,7 +1657,6 @@ bool zswap_store(struct folio *folio)
 	 */
 	while (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
 		WARN_ON(1);
-		zswap_duplicate_entry++;
 		zswap_invalidate_entry(tree, dupentry);
 	}
 	if (entry->length) {
@@ -1822,8 +1817,6 @@ static int zswap_debugfs_init(void)
 			   zswap_debugfs_root, &zswap_reject_compress_poor);
 	debugfs_create_u64("written_back_pages", 0444,
 			   zswap_debugfs_root, &zswap_written_back_pages);
-	debugfs_create_u64("duplicate_entry", 0444,
-			   zswap_debugfs_root, &zswap_duplicate_entry);
 	debugfs_create_u64("pool_total_size", 0444,
 			   zswap_debugfs_root, &zswap_pool_total_size);
 	debugfs_create_atomic_t("stored_pages", 0444,

-- 
b4 0.10.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled
  2024-02-01 15:49 [PATCH 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
                   ` (3 preceding siblings ...)
  2024-02-01 15:49 ` [PATCH 4/6] mm/zswap: remove duplicate_entry debug value Chengming Zhou
@ 2024-02-01 15:49 ` Chengming Zhou
  2024-02-01 18:12   ` Johannes Weiner
  2024-02-01 15:49 ` [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore Chengming Zhou
  5 siblings, 1 reply; 37+ messages in thread
From: Chengming Zhou @ 2024-02-01 15:49 UTC (permalink / raw)
  To: Nhat Pham, Johannes Weiner, Andrew Morton, Yosry Ahmed
  Cc: linux-kernel, Yosry Ahmed, Chengming Zhou, Johannes Weiner, linux-mm

The !zswap_exclusive_loads_enabled mode leaves the compressed copy in
the zswap tree and on the lru list after the folio is swapped in.

There are some disadvantages to this mode:
1. It wastes memory, since there are two copies of the data: the folio
   and the compressed copy in zswap. And it's unlikely the compressed
   data will be useful in the near future.

2. If that folio is dirtied, the compressed data must not be useful
   anymore, but we don't know that and don't invalidate the trashy
   memory in zswap.

3. It's not reclaimable by the zswap shrinker, since
   zswap_writeback_entry() will always return -EEXIST and terminate the
   shrinking process.

On the other hand, the only downside of zswap_exclusive_loads_enabled
is a little more cpu usage/latency from having to compress again, and
the same applies if the folio is removed from the swapcache or dirtied.

Not sure if we should accept the above disadvantages of
!zswap_exclusive_loads_enabled, so I'm sending this out for discussion.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/Kconfig | 16 ----------------
 mm/zswap.c | 14 +++-----------
 2 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index ffc3a2ba3a8c..673b35629074 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -45,22 +45,6 @@ config ZSWAP_DEFAULT_ON
 	  The selection made here can be overridden by using the kernel
 	  command line 'zswap.enabled=' option.
 
-config ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON
-	bool "Invalidate zswap entries when pages are loaded"
-	depends on ZSWAP
-	help
-	  If selected, exclusive loads for zswap will be enabled at boot,
-	  otherwise it will be disabled.
-
-	  If exclusive loads are enabled, when a page is loaded from zswap,
-	  the zswap entry is invalidated at once, as opposed to leaving it
-	  in zswap until the swap entry is freed.
-
-	  This avoids having two copies of the same page in memory
-	  (compressed and uncompressed) after faulting in a page from zswap.
-	  The cost is that if the page was never dirtied and needs to be
-	  swapped out again, it will be re-compressed.
-
 config ZSWAP_SHRINKER_DEFAULT_ON
 	bool "Shrink the zswap pool on memory pressure"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 3fbb7e2c8b8d..cbf379abb6c7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -139,10 +139,6 @@ static bool zswap_non_same_filled_pages_enabled = true;
 module_param_named(non_same_filled_pages_enabled, zswap_non_same_filled_pages_enabled,
 		   bool, 0644);
 
-static bool zswap_exclusive_loads_enabled = IS_ENABLED(
-		CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON);
-module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
-
 /* Number of zpools in zswap_pool (empirically determined for scalability) */
 #define ZSWAP_NR_ZPOOLS 32
 
@@ -1722,16 +1718,12 @@ bool zswap_load(struct folio *folio)
 		count_objcg_event(entry->objcg, ZSWPIN);
 
 	spin_lock(&tree->lock);
-	if (zswap_exclusive_loads_enabled) {
-		zswap_invalidate_entry(tree, entry);
-		folio_mark_dirty(folio);
-	} else if (entry->length) {
-		zswap_lru_del(&entry->pool->list_lru, entry);
-		zswap_lru_add(&entry->pool->list_lru, entry);
-	}
+	zswap_invalidate_entry(tree, entry);
 	zswap_entry_put(entry);
 	spin_unlock(&tree->lock);
 
+	folio_mark_dirty(folio);
+
 	return true;
 }
 

-- 
b4 0.10.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore
  2024-02-01 15:49 [PATCH 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
                   ` (4 preceding siblings ...)
  2024-02-01 15:49 ` [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled Chengming Zhou
@ 2024-02-01 15:49 ` Chengming Zhou
  2024-02-02  1:11   ` Yosry Ahmed
                     ` (2 more replies)
  5 siblings, 3 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-01 15:49 UTC (permalink / raw)
  To: Nhat Pham, Johannes Weiner, Andrew Morton, Yosry Ahmed
  Cc: linux-kernel, Yosry Ahmed, Chengming Zhou, Johannes Weiner, linux-mm

Since we don't need to leave the zswap entry on the zswap tree anymore,
we should remove it from the tree as soon as we find it there.

Then, after using it, we can free it directly; no concurrent path can
find it in the tree. Only the shrinker can still see it on the lru list,
and it double-checks under the tree lock, so there is no race.

So the zswap entry doesn't need a refcount anymore, and we don't need to
take the spinlock a second time to invalidate it.

The side effect is that zswap_entry_free() may no longer happen under
the tree spinlock, but that's ok since nothing there needs to be
protected by the lock. A simplified sketch of the resulting entry
lifecycle follows.
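
Roughly, any path that finds the entry in the tree now takes it off the
tree under the lock and becomes its sole owner (sketch only; error
handling, same-filled entries and stats are omitted, see the diff for
the real code):

spin_lock(&tree->lock);
entry = zswap_rb_search(&tree->rbroot, offset);
if (entry)
	zswap_rb_erase(&tree->rbroot, entry);	/* now exclusively owned */
spin_unlock(&tree->lock);

if (entry) {
	zswap_decompress(entry, &folio->page);	/* use without a refcount */
	zswap_entry_free(entry);		/* free outside the tree lock */
}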

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 63 +++++++++++---------------------------------------------------
 1 file changed, 11 insertions(+), 52 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index cbf379abb6c7..cd67f7f6b302 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -193,12 +193,6 @@ struct zswap_pool {
  *
  * rbnode - links the entry into red-black tree for the appropriate swap type
  * swpentry - associated swap entry, the offset indexes into the red-black tree
- * refcount - the number of outstanding reference to the entry. This is needed
- *            to protect against premature freeing of the entry by code
- *            concurrent calls to load, invalidate, and writeback.  The lock
- *            for the zswap_tree structure that contains the entry must
- *            be held while changing the refcount.  Since the lock must
- *            be held, there is no reason to also make refcount atomic.
  * length - the length in bytes of the compressed page data.  Needed during
  *          decompression. For a same value filled page length is 0, and both
  *          pool and lru are invalid and must be ignored.
@@ -211,7 +205,6 @@ struct zswap_pool {
 struct zswap_entry {
 	struct rb_node rbnode;
 	swp_entry_t swpentry;
-	int refcount;
 	unsigned int length;
 	struct zswap_pool *pool;
 	union {
@@ -222,11 +215,6 @@ struct zswap_entry {
 	struct list_head lru;
 };
 
-/*
- * The tree lock in the zswap_tree struct protects a few things:
- * - the rbtree
- * - the refcount field of each entry in the tree
- */
 struct zswap_tree {
 	struct rb_root rbroot;
 	spinlock_t lock;
@@ -890,14 +878,10 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
 	return 0;
 }
 
-static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
+static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
 {
-	if (!RB_EMPTY_NODE(&entry->rbnode)) {
-		rb_erase(&entry->rbnode, root);
-		RB_CLEAR_NODE(&entry->rbnode);
-		return true;
-	}
-	return false;
+	rb_erase(&entry->rbnode, root);
+	RB_CLEAR_NODE(&entry->rbnode);
 }
 
 /*********************************
@@ -911,7 +895,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
 	entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
 	if (!entry)
 		return NULL;
-	entry->refcount = 1;
 	RB_CLEAR_NODE(&entry->rbnode);
 	return entry;
 }
@@ -954,33 +937,15 @@ static void zswap_entry_free(struct zswap_entry *entry)
 	zswap_update_total_size();
 }
 
-/* caller must hold the tree lock */
-static void zswap_entry_get(struct zswap_entry *entry)
-{
-	WARN_ON_ONCE(!entry->refcount);
-	entry->refcount++;
-}
-
-/* caller must hold the tree lock */
-static void zswap_entry_put(struct zswap_entry *entry)
-{
-	WARN_ON_ONCE(!entry->refcount);
-	if (--entry->refcount == 0) {
-		WARN_ON_ONCE(!RB_EMPTY_NODE(&entry->rbnode));
-		zswap_entry_free(entry);
-	}
-}
-
 /*
- * If the entry is still valid in the tree, drop the initial ref and remove it
- * from the tree. This function must be called with an additional ref held,
- * otherwise it may race with another invalidation freeing the entry.
+ * The caller hold the tree lock and search the entry from the tree,
+ * so it must be on the tree, remove it from the tree and free it.
  */
 static void zswap_invalidate_entry(struct zswap_tree *tree,
 				   struct zswap_entry *entry)
 {
-	if (zswap_rb_erase(&tree->rbroot, entry))
-		zswap_entry_put(entry);
+	zswap_rb_erase(&tree->rbroot, entry);
+	zswap_entry_free(entry);
 }
 
 /*********************************
@@ -1219,7 +1184,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	}
 
 	/* Safe to deref entry after the entry is verified above. */
-	zswap_entry_get(entry);
+	zswap_rb_erase(&tree->rbroot, entry);
 	spin_unlock(&tree->lock);
 
 	zswap_decompress(entry, &folio->page);
@@ -1228,10 +1193,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPWB);
 
-	spin_lock(&tree->lock);
-	zswap_invalidate_entry(tree, entry);
-	zswap_entry_put(entry);
-	spin_unlock(&tree->lock);
+	zswap_entry_free(entry);
 
 	/* folio is up to date */
 	folio_mark_uptodate(folio);
@@ -1702,7 +1664,7 @@ bool zswap_load(struct folio *folio)
 		spin_unlock(&tree->lock);
 		return false;
 	}
-	zswap_entry_get(entry);
+	zswap_rb_erase(&tree->rbroot, entry);
 	spin_unlock(&tree->lock);
 
 	if (entry->length)
@@ -1717,10 +1679,7 @@ bool zswap_load(struct folio *folio)
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPIN);
 
-	spin_lock(&tree->lock);
-	zswap_invalidate_entry(tree, entry);
-	zswap_entry_put(entry);
-	spin_unlock(&tree->lock);
+	zswap_entry_free(entry);
 
 	folio_mark_dirty(folio);
 

-- 
b4 0.10.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/6] mm/zswap: add more comments in shrink_memcg_cb()
  2024-02-01 15:49 ` [PATCH 1/6] mm/zswap: add more comments in shrink_memcg_cb() Chengming Zhou
@ 2024-02-01 17:45   ` Johannes Weiner
  2024-02-01 23:55   ` Yosry Ahmed
  2024-02-02 22:25   ` Nhat Pham
  2 siblings, 0 replies; 37+ messages in thread
From: Johannes Weiner @ 2024-02-01 17:45 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Feb 01, 2024 at 03:49:01PM +0000, Chengming Zhou wrote:
> Add more comments in shrink_memcg_cb() to describe the deref dance
> which is implemented to fix race problem between lru writeback and
> swapoff, and the reason why we rotate the entry at the beginning.
> 
> Also fix the stale comments in zswap_writeback_entry(), and add
> more comments to state that we only deref the tree after we get
> the swapcache reference.
> 
> Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/6] mm/zswap: invalidate zswap entry when swap entry free
  2024-02-01 15:49 ` [PATCH 2/6] mm/zswap: invalidate zswap entry when swap entry free Chengming Zhou
@ 2024-02-01 17:49   ` Johannes Weiner
  2024-02-01 20:56   ` Nhat Pham
  2024-02-02  0:11   ` Yosry Ahmed
  2 siblings, 0 replies; 37+ messages in thread
From: Johannes Weiner @ 2024-02-01 17:49 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Feb 01, 2024 at 03:49:02PM +0000, Chengming Zhou wrote:
> During testing I found there are some times the zswap_writeback_entry()
> return -ENOMEM, which is not we expected:
> 
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[-12]: 1563
> @[0]: 277221
> 
> The reason is that __read_swap_cache_async() return NULL because
> swapcache_prepare() failed. The reason is that we won't invalidate
> zswap entry when swap entry freed to the per-cpu pool, these zswap
> entries are still on the zswap tree and lru list.
> 
> This patch moves the invalidation ahead to when swap entry freed
> to the per-cpu pool, since there is no any benefit to leave trashy
> zswap entry on the tree and lru list.
> 
> With this patch:
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[0]: 259744
> 
> Note: large folio can't have zswap entry for now, so don't bother
> to add zswap entry invalidation in the large folio swap free path.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Great catch.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/6] mm/zswap: stop lru list shrinking when encounter warm region
  2024-02-01 15:49 ` [PATCH 3/6] mm/zswap: stop lru list shrinking when encounter warm region Chengming Zhou
@ 2024-02-01 17:51   ` Johannes Weiner
  2024-02-01 18:10   ` Nhat Pham
  2024-02-02  0:15   ` Yosry Ahmed
  2 siblings, 0 replies; 37+ messages in thread
From: Johannes Weiner @ 2024-02-01 17:51 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Feb 01, 2024 at 03:49:03PM +0000, Chengming Zhou wrote:
> When the shrinker encounter an existing folio in swap cache, it means
> we are shrinking into the warmer region. We should terminate shrinking
> if we're in the dynamic shrinker context.
> 
> This patch add LRU_STOP to support this, to avoid overshrinking.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

LGTM.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/6] mm/zswap: remove duplicate_entry debug value
  2024-02-01 15:49 ` [PATCH 4/6] mm/zswap: remove duplicate_entry debug value Chengming Zhou
@ 2024-02-01 17:55   ` Johannes Weiner
  2024-02-02  8:18     ` Chengming Zhou
  2024-02-02 22:17   ` Yosry Ahmed
  2024-02-02 22:28   ` Nhat Pham
  2 siblings, 1 reply; 37+ messages in thread
From: Johannes Weiner @ 2024-02-01 17:55 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Feb 01, 2024 at 03:49:04PM +0000, Chengming Zhou wrote:
> cat /sys/kernel/debug/zswap/duplicate_entry
> 2086447
> 
> When testing, the duplicate_entry value is very high, but no warning
> message in the kernel log. From the comment of duplicate_entry
> "Duplicate store was encountered (rare)", it seems something goes wrong.
> 
> Actually it's incremented in the beginning of zswap_store(), which found
> its zswap entry has already on the tree. And this is a normal case,
> since the folio could leave zswap entry on the tree after swapin,
> later it's dirtied and swapout/zswap_store again, found its original
> zswap entry. (Maybe we can reuse it instead of invalidating it?)

Probably not worth it, especially after the next patch.

> So duplicate_entry should be only incremented in the real bug case,
> which already have "WARN_ON(1)", it looks redundant to count bug case,
> so this patch just remove it.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Either way, I agree that the WARN_ON() is more useful to point out a
bug than a counter.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/6] mm/zswap: stop lru list shrinking when encounter warm region
  2024-02-01 15:49 ` [PATCH 3/6] mm/zswap: stop lru list shrinking when encounter warm region Chengming Zhou
  2024-02-01 17:51   ` Johannes Weiner
@ 2024-02-01 18:10   ` Nhat Pham
  2024-02-02  0:15   ` Yosry Ahmed
  2 siblings, 0 replies; 37+ messages in thread
From: Nhat Pham @ 2024-02-01 18:10 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Feb 1, 2024 at 7:50 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> When the shrinker encounter an existing folio in swap cache, it means
> we are shrinking into the warmer region. We should terminate shrinking
> if we're in the dynamic shrinker context.
>
> This patch add LRU_STOP to support this, to avoid overshrinking.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Nice!
Acked-by: Nhat Pham <nphamcs@gmail.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled
  2024-02-01 15:49 ` [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled Chengming Zhou
@ 2024-02-01 18:12   ` Johannes Weiner
  2024-02-02  1:04     ` Yosry Ahmed
  2024-02-02 12:57     ` Chengming Zhou
  0 siblings, 2 replies; 37+ messages in thread
From: Johannes Weiner @ 2024-02-01 18:12 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Feb 01, 2024 at 03:49:05PM +0000, Chengming Zhou wrote:
> The !zswap_exclusive_loads_enabled mode will leave compressed copy in
> the zswap tree and lru list after the folio swapin.
> 
> There are some disadvantages in this mode:
> 1. It's a waste of memory since there are two copies of data, one is
>    folio, the other one is compressed data in zswap. And it's unlikely
>    the compressed data is useful in the near future.
> 
> 2. If that folio is dirtied, the compressed data must be not useful,
>    but we don't know and don't invalidate the trashy memory in zswap.
> 
> 3. It's not reclaimable from zswap shrinker since zswap_writeback_entry()
>    will always return -EEXIST and terminate the shrinking process.
> 
> On the other hand, the only downside of zswap_exclusive_loads_enabled
> is a little more cpu usage/latency when compression, and the same if
> the folio is removed from swapcache or dirtied.
> 
> Not sure if we should accept the above disadvantages in the case of
> !zswap_exclusive_loads_enabled, so send this out for disscusion.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

This is interesting.

First, I will say that I never liked this config option, because it's
nearly impossible for a user to answer this question. Much better to
just pick a reasonable default.

What should the default be?

Caching "swapout work" is helpful when the system is thrashing. Then
recently swapped in pages might get swapped out again very soon. It
certainly makes sense with conventional swap, because keeping a clean
copy on the disk saves IO work and doesn't cost any additional memory.

But with zswap, it's different. It saves some compression work on a
thrashing page. But the act of keeping compressed memory contributes
to a higher rate of thrashing. And that can cause IO in other places
like zswap writeback and file memory.

It would be useful to have an A/B test to confirm that not caching is
better. Can you run your test with and without keeping the cache, and
in addition to the timings also compare the deltas for pgscan_anon,
pgscan_file, workingset_refault_anon, workingset_refault_file?
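
Not part of the series, but for collecting those deltas a throwaway
snapshot helper along these lines would do; the counter prefixes are
assumptions about what the test kernel exposes in /proc/vmstat or the
cgroup's memory.stat:

/*
 * Print selected "name value" counters from a stat file. Run it before
 * and after each build and subtract the two outputs.
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	static const char *want[] = {
		"pgscan", "pgsteal",
		"workingset_refault_anon", "workingset_refault_file",
	};
	char key[128];
	unsigned long long val;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <stat file>\n", argv[0]);
		return 1;
	}
	f = fopen(argv[1], "r");
	if (!f) {
		perror(argv[1]);
		return 1;
	}
	while (fscanf(f, "%127s %llu", key, &val) == 2) {
		for (size_t i = 0; i < sizeof(want) / sizeof(want[0]); i++) {
			if (!strncmp(key, want[i], strlen(want[i]))) {
				printf("%s %llu\n", key, val);
				break;
			}
		}
	}
	fclose(f);
	return 0;
}

e.g. run it against /proc/vmstat (or the test cgroup's memory.stat)
right before and right after each kernel build, then diff the outputs.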

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/6] mm/zswap: invalidate zswap entry when swap entry free
  2024-02-01 15:49 ` [PATCH 2/6] mm/zswap: invalidate zswap entry when swap entry free Chengming Zhou
  2024-02-01 17:49   ` Johannes Weiner
@ 2024-02-01 20:56   ` Nhat Pham
  2024-02-02  0:11   ` Yosry Ahmed
  2 siblings, 0 replies; 37+ messages in thread
From: Nhat Pham @ 2024-02-01 20:56 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Feb 1, 2024 at 7:50 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> During testing I found there are some times the zswap_writeback_entry()
> return -ENOMEM, which is not we expected:
>
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[-12]: 1563
> @[0]: 277221
>
> The reason is that __read_swap_cache_async() return NULL because
> swapcache_prepare() failed. The reason is that we won't invalidate
> zswap entry when swap entry freed to the per-cpu pool, these zswap
> entries are still on the zswap tree and lru list.
>
> This patch moves the invalidation ahead to when swap entry freed
> to the per-cpu pool, since there is no any benefit to leave trashy
> zswap entry on the tree and lru list.
>
> With this patch:
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[0]: 259744

Nice!
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/6] mm/zswap: add more comments in shrink_memcg_cb()
  2024-02-01 15:49 ` [PATCH 1/6] mm/zswap: add more comments in shrink_memcg_cb() Chengming Zhou
  2024-02-01 17:45   ` Johannes Weiner
@ 2024-02-01 23:55   ` Yosry Ahmed
  2024-02-02 22:25   ` Nhat Pham
  2 siblings, 0 replies; 37+ messages in thread
From: Yosry Ahmed @ 2024-02-01 23:55 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 01, 2024 at 03:49:01PM +0000, Chengming Zhou wrote:
> Add more comments in shrink_memcg_cb() to describe the deref dance
> which is implemented to fix race problem between lru writeback and
> swapoff, and the reason why we rotate the entry at the beginning.
> 
> Also fix the stale comments in zswap_writeback_entry(), and add
> more comments to state that we only deref the tree after we get
> the swapcache reference.
> 
> Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Yosry Ahmed <yosryahmed@google.com>

Thanks!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/6] mm/zswap: invalidate zswap entry when swap entry free
  2024-02-01 15:49 ` [PATCH 2/6] mm/zswap: invalidate zswap entry when swap entry free Chengming Zhou
  2024-02-01 17:49   ` Johannes Weiner
  2024-02-01 20:56   ` Nhat Pham
@ 2024-02-02  0:11   ` Yosry Ahmed
  2024-02-02  8:10     ` Chengming Zhou
  2 siblings, 1 reply; 37+ messages in thread
From: Yosry Ahmed @ 2024-02-02  0:11 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 01, 2024 at 03:49:02PM +0000, Chengming Zhou wrote:
> During testing I found there are some times the zswap_writeback_entry()
> return -ENOMEM, which is not we expected:
> 
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[-12]: 1563
> @[0]: 277221
> 
> The reason is that __read_swap_cache_async() return NULL because
> swapcache_prepare() failed. The reason is that we won't invalidate
> zswap entry when swap entry freed to the per-cpu pool, these zswap
> entries are still on the zswap tree and lru list.
> 
> This patch moves the invalidation ahead to when swap entry freed
> to the per-cpu pool, since there is no any benefit to leave trashy
> zswap entry on the tree and lru list.
> 
> With this patch:
> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
> @[0]: 259744
> 
> Note: large folio can't have zswap entry for now, so don't bother
> to add zswap entry invalidation in the large folio swap free path.

This makes me slightly nervous. Should we add a comment somewhere just
in case this is missed if someone adds large folio support?

Otherwise the patch itself LGTM.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/6] mm/zswap: stop lru list shrinking when encounter warm region
  2024-02-01 15:49 ` [PATCH 3/6] mm/zswap: stop lru list shrinking when encounter warm region Chengming Zhou
  2024-02-01 17:51   ` Johannes Weiner
  2024-02-01 18:10   ` Nhat Pham
@ 2024-02-02  0:15   ` Yosry Ahmed
  2024-02-02  8:12     ` Chengming Zhou
  2 siblings, 1 reply; 37+ messages in thread
From: Yosry Ahmed @ 2024-02-02  0:15 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 01, 2024 at 03:49:03PM +0000, Chengming Zhou wrote:
> When the shrinker encounter an existing folio in swap cache, it means
> we are shrinking into the warmer region. We should terminate shrinking
> if we're in the dynamic shrinker context.
> 
> This patch add LRU_STOP to support this, to avoid overshrinking.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

LGTM with one comment below.

Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

> ---
>  include/linux/list_lru.h | 1 +
>  mm/list_lru.c            | 3 +++
>  mm/zswap.c               | 4 +++-
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index f2882a820690..5633e970144b 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -24,6 +24,7 @@ enum lru_status {
>  	LRU_SKIP,		/* item cannot be locked, skip */
>  	LRU_RETRY,		/* item not freeable. May drop the lock
>  				   internally, but has to return locked. */
> +	LRU_STOP,		/* stop lru list walking */

nit: Should we add "May drop the lock internally, but has to return
locked" like LRU_RETRY and LRU_REMOVED_RETRY?

>  };
>  
>  struct list_lru_one {
[..]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled
  2024-02-01 18:12   ` Johannes Weiner
@ 2024-02-02  1:04     ` Yosry Ahmed
  2024-02-02 12:57     ` Chengming Zhou
  1 sibling, 0 replies; 37+ messages in thread
From: Yosry Ahmed @ 2024-02-02  1:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Chengming Zhou, Nhat Pham, Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 01, 2024 at 01:12:40PM -0500, Johannes Weiner wrote:
> On Thu, Feb 01, 2024 at 03:49:05PM +0000, Chengming Zhou wrote:
> > The !zswap_exclusive_loads_enabled mode will leave compressed copy in
> > the zswap tree and lru list after the folio swapin.
> > 
> > There are some disadvantages in this mode:
> > 1. It's a waste of memory since there are two copies of data, one is
> >    folio, the other one is compressed data in zswap. And it's unlikely
> >    the compressed data is useful in the near future.
> > 
> > 2. If that folio is dirtied, the compressed data must be not useful,
> >    but we don't know and don't invalidate the trashy memory in zswap.
> > 
> > 3. It's not reclaimable from zswap shrinker since zswap_writeback_entry()
> >    will always return -EEXIST and terminate the shrinking process.
> > 
> > On the other hand, the only downside of zswap_exclusive_loads_enabled
> > is a little more cpu usage/latency when compression, and the same if
> > the folio is removed from swapcache or dirtied.
> > 
> > Not sure if we should accept the above disadvantages in the case of
> > !zswap_exclusive_loads_enabled, so send this out for disscusion.
> > 
> > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> This is interesting.
> 
> First, I will say that I never liked this config option, because it's
> nearly impossible for a user to answer this question. Much better to
> just pick a reasonable default.
> 
> What should the default be?
> 
> Caching "swapout work" is helpful when the system is thrashing. Then
> recently swapped in pages might get swapped out again very soon. It
> certainly makes sense with conventional swap, because keeping a clean
> copy on the disk saves IO work and doesn't cost any additional memory.
> 
> But with zswap, it's different. It saves some compression work on a
> thrashing page. But the act of keeping compressed memory contributes
> to a higher rate of thrashing. And that can cause IO in other places
> like zswap writeback and file memory.

Agreed.

At Google, we have been using exclusive loads for a very long time in
production, so I have no objections to this. The user interface is also
relatively new, so I don't think it will have accumulated users.

> 
> It would be useful to have an A/B test to confirm that not caching is
> better. Can you run your test with and without keeping the cache, and
> in addition to the timings also compare the deltas for pgscan_anon,
> pgscan_file, workingset_refault_anon, workingset_refault_file?

That would be interesting :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore
  2024-02-01 15:49 ` [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore Chengming Zhou
@ 2024-02-02  1:11   ` Yosry Ahmed
  2024-02-02 13:00     ` Chengming Zhou
  2024-02-02 16:28   ` Johannes Weiner
  2024-02-02 22:33   ` Nhat Pham
  2 siblings, 1 reply; 37+ messages in thread
From: Yosry Ahmed @ 2024-02-02  1:11 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 01, 2024 at 03:49:06PM +0000, Chengming Zhou wrote:
> Since we don't need to leave zswap entry on the zswap tree anymore,
> we should remove it from tree once we find it from the tree.
> 
> Then after using it, we can directly free it, no concurrent path
> can find it from tree. Only the shrinker can see it from lru list,
> which will also double check under tree lock, so no race problem.
> 
> So we don't need refcount in zswap entry anymore and don't need to
> take the spinlock for the second time to invalidate it.
> 
> The side effect is that zswap_entry_free() maybe not happen in tree
> spinlock, but it's ok since nothing need to be protected by the lock.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

This looks like a great simplification, and a good motivation to only
support exclusive loads. Everything is more straightforward because
every tree lookup implies a removal and exclusive ownership.

Let's see if removing support for non-exclusive loads is agreeable first
though :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/6] mm/zswap: invalidate zswap entry when swap entry free
  2024-02-02  0:11   ` Yosry Ahmed
@ 2024-02-02  8:10     ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-02  8:10 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On 2024/2/2 08:11, Yosry Ahmed wrote:
> On Thu, Feb 01, 2024 at 03:49:02PM +0000, Chengming Zhou wrote:
>> During testing I found there are some times the zswap_writeback_entry()
>> return -ENOMEM, which is not we expected:
>>
>> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
>> @[-12]: 1563
>> @[0]: 277221
>>
>> The reason is that __read_swap_cache_async() return NULL because
>> swapcache_prepare() failed. The reason is that we won't invalidate
>> zswap entry when swap entry freed to the per-cpu pool, these zswap
>> entries are still on the zswap tree and lru list.
>>
>> This patch moves the invalidation ahead to when swap entry freed
>> to the per-cpu pool, since there is no any benefit to leave trashy
>> zswap entry on the tree and lru list.
>>
>> With this patch:
>> bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
>> @[0]: 259744
>>
>> Note: large folio can't have zswap entry for now, so don't bother
>> to add zswap entry invalidation in the large folio swap free path.
> 
> This makes me slightly nervous. Should we add a comment somewhere just
> in case this is missed if someone adds large folio support?

Ok, will add this comment:

+       /* Large folio swap slot is not covered. */
	zswap_invalidate(entry);

> 
> Otherwise the patch itself LGTM.

Thanks!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/6] mm/zswap: stop lru list shrinking when encounter warm region
  2024-02-02  0:15   ` Yosry Ahmed
@ 2024-02-02  8:12     ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-02  8:12 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On 2024/2/2 08:15, Yosry Ahmed wrote:
> On Thu, Feb 01, 2024 at 03:49:03PM +0000, Chengming Zhou wrote:
>> When the shrinker encounter an existing folio in swap cache, it means
>> we are shrinking into the warmer region. We should terminate shrinking
>> if we're in the dynamic shrinker context.
>>
>> This patch add LRU_STOP to support this, to avoid overshrinking.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> LGTM with one comment below.
> 
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> 
>> ---
>>  include/linux/list_lru.h | 1 +
>>  mm/list_lru.c            | 3 +++
>>  mm/zswap.c               | 4 +++-
>>  3 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>> index f2882a820690..5633e970144b 100644
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -24,6 +24,7 @@ enum lru_status {
>>  	LRU_SKIP,		/* item cannot be locked, skip */
>>  	LRU_RETRY,		/* item not freeable. May drop the lock
>>  				   internally, but has to return locked. */
>> +	LRU_STOP,		/* stop lru list walking */
> 
> nit: Should we add "May drop the lock internally, but has to return
> locked" like LRU_RETRY and LRU_REMOVED_RETRY?

Right, will add.

Thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/6] mm/zswap: remove duplicate_entry debug value
  2024-02-01 17:55   ` Johannes Weiner
@ 2024-02-02  8:18     ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-02  8:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Nhat Pham, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On 2024/2/2 01:55, Johannes Weiner wrote:
> On Thu, Feb 01, 2024 at 03:49:04PM +0000, Chengming Zhou wrote:
>> cat /sys/kernel/debug/zswap/duplicate_entry
>> 2086447
>>
>> When testing, the duplicate_entry value is very high, but no warning
>> message in the kernel log. From the comment of duplicate_entry
>> "Duplicate store was encountered (rare)", it seems something goes wrong.
>>
>> Actually it's incremented in the beginning of zswap_store(), which found
>> its zswap entry has already on the tree. And this is a normal case,
>> since the folio could leave zswap entry on the tree after swapin,
>> later it's dirtied and swapout/zswap_store again, found its original
>> zswap entry. (Maybe we can reuse it instead of invalidating it?)
> 
> Probably not worth it, especially after the next patch.

You are right, not worth it.

> 
>> So duplicate_entry should be only incremented in the real bug case,
>> which already have "WARN_ON(1)", it looks redundant to count bug case,
>> so this patch just remove it.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> Either way, I agree that the WARN_ON() is more useful to point out a
> bug than a counter.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled
  2024-02-01 18:12   ` Johannes Weiner
  2024-02-02  1:04     ` Yosry Ahmed
@ 2024-02-02 12:57     ` Chengming Zhou
  2024-02-02 16:26       ` Johannes Weiner
                         ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-02 12:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Nhat Pham, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On 2024/2/2 02:12, Johannes Weiner wrote:
> On Thu, Feb 01, 2024 at 03:49:05PM +0000, Chengming Zhou wrote:
>> The !zswap_exclusive_loads_enabled mode will leave compressed copy in
>> the zswap tree and lru list after the folio swapin.
>>
>> There are some disadvantages in this mode:
>> 1. It's a waste of memory since there are two copies of data, one is
>>    folio, the other one is compressed data in zswap. And it's unlikely
>>    the compressed data is useful in the near future.
>>
>> 2. If that folio is dirtied, the compressed data must be not useful,
>>    but we don't know and don't invalidate the trashy memory in zswap.
>>
>> 3. It's not reclaimable from zswap shrinker since zswap_writeback_entry()
>>    will always return -EEXIST and terminate the shrinking process.
>>
>> On the other hand, the only downside of zswap_exclusive_loads_enabled
>> is a little more cpu usage/latency when compression, and the same if
>> the folio is removed from swapcache or dirtied.
>>
>> Not sure if we should accept the above disadvantages in the case of
>> !zswap_exclusive_loads_enabled, so send this out for disscusion.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> This is interesting.
> 
> First, I will say that I never liked this config option, because it's
> nearly impossible for a user to answer this question. Much better to
> just pick a reasonable default.

Agree.

> 
> What should the default be?
> 
> Caching "swapout work" is helpful when the system is thrashing. Then
> recently swapped in pages might get swapped out again very soon. It
> certainly makes sense with conventional swap, because keeping a clean
> copy on the disk saves IO work and doesn't cost any additional memory.
> 
> But with zswap, it's different. It saves some compression work on a
> thrashing page. But the act of keeping compressed memory contributes
> to a higher rate of thrashing. And that can cause IO in other places
> like zswap writeback and file memory.
> 
> It would be useful to have an A/B test to confirm that not caching is
> better. Can you run your test with and without keeping the cache, and
> in addition to the timings also compare the deltas for pgscan_anon,
> pgscan_file, workingset_refault_anon, workingset_refault_file?

I just ran an A/B test of kernel building in a tmpfs directory with
memory.max=2GB (zswap writeback enabled and shrinker_enabled on, one
50GB swapfile).

From the results below, exclusive mode has fewer scans and refaults.

                              zswap-invalidate-entry        zswap-invalidate-entry-exclusive
real                          63.80                         63.01                         
user                          1063.83                       1061.32                       
sys                           290.31                        266.15                        
                              zswap-invalidate-entry        zswap-invalidate-entry-exclusive
workingset_refault_anon       2383084.40                    1976397.40                    
workingset_refault_file       44134.00                      45689.40                      
workingset_activate_anon      837878.00                     728441.20                     
workingset_activate_file      4710.00                       4085.20                       
workingset_restore_anon       732622.60                     639428.40                     
workingset_restore_file       1007.00                       926.80                        
workingset_nodereclaim        0.00                          0.00                          
pgscan                        14343003.40                   12409570.20                   
pgscan_kswapd                 0.00                          0.00                          
pgscan_direct                 14343003.40                   12409570.20                   
pgscan_khugepaged             0.00                          0.00                         

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore
  2024-02-02  1:11   ` Yosry Ahmed
@ 2024-02-02 13:00     ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-02 13:00 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On 2024/2/2 09:11, Yosry Ahmed wrote:
> On Thu, Feb 01, 2024 at 03:49:06PM +0000, Chengming Zhou wrote:
>> Since we no longer need to leave the zswap entry on the zswap tree,
>> we should remove it from the tree as soon as we find it there.
>>
>> Then after using it, we can free it directly; no concurrent path
>> can find it in the tree. Only the shrinker can still see it on the
>> lru list, and it double checks under the tree lock, so there is no race.
>>
>> So the zswap entry doesn't need a refcount anymore, and we don't need
>> to take the spinlock a second time to invalidate it.
>>
>> The side effect is that zswap_entry_free() may not happen under the
>> tree spinlock, but that's ok since nothing there needs to be protected
>> by the lock.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> This looks like a great simplification, and a good motivation to only
> support exclusive loads. Everything is more straightforward because
> every tree lookup implies a removal and exclusive ownership.

Right, much simpler!

> 
> Let's see if removing support for non-exclusive loads is agreeable first
> though :)

Ok, I have just posted some testing data for discussion.

Thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread
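
For readers following the race argument above, here is a minimal sketch of
the "double check under tree lock" that the quoted changelog refers to. It is
illustrative only: the function name, the offset parameter and the -ENOMEM
return are placeholders, and the real zswap_writeback_entry() also sets up
the swapcache folio around this step.

	/*
	 * The shrinker found @entry on the lru list, but the entry may have
	 * been invalidated and freed concurrently.  Look it up again under
	 * the tree lock before touching it; once verified, erase it from the
	 * tree so no other path can find it (exclusive ownership).
	 */
	static int writeback_check_sketch(struct zswap_tree *tree,
					  struct zswap_entry *entry,
					  pgoff_t offset)
	{
		spin_lock(&tree->lock);
		if (zswap_rb_search(&tree->rbroot, offset) != entry) {
			/* Raced with invalidation; the entry is gone. */
			spin_unlock(&tree->lock);
			return -ENOMEM;
		}
		zswap_rb_erase(&tree->rbroot, entry);
		spin_unlock(&tree->lock);
		return 0;
	}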

* Re: [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled
  2024-02-02 12:57     ` Chengming Zhou
@ 2024-02-02 16:26       ` Johannes Weiner
  2024-02-03  4:33         ` Chengming Zhou
  2024-02-02 22:15       ` Yosry Ahmed
  2024-02-02 22:31       ` Nhat Pham
  2 siblings, 1 reply; 37+ messages in thread
From: Johannes Weiner @ 2024-02-02 16:26 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Fri, Feb 02, 2024 at 08:57:38PM +0800, Chengming Zhou wrote:
> On 2024/2/2 02:12, Johannes Weiner wrote:
> > Caching "swapout work" is helpful when the system is thrashing. Then
> > recently swapped in pages might get swapped out again very soon. It
> > certainly makes sense with conventional swap, because keeping a clean
> > copy on the disk saves IO work and doesn't cost any additional memory.
> > 
> > But with zswap, it's different. It saves some compression work on a
> > thrashing page. But the act of keeping compressed memory contributes
> > to a higher rate of thrashing. And that can cause IO in other places
> > like zswap writeback and file memory.
> 
> [...] ran an A/B test of kernel building in a tmpfs directory, memory.max=2GB.
> (zswap writeback enabled and shrinker_enabled, one 50GB swapfile)
>
> From the results below, exclusive mode has fewer scans and refaults.
> 
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
> real                          63.80                         63.01                         
> user                          1063.83                       1061.32                       
> sys                           290.31                        266.15                        
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
> workingset_refault_anon       2383084.40                    1976397.40                    
> workingset_refault_file       44134.00                      45689.40                      
> workingset_activate_anon      837878.00                     728441.20                     
> workingset_activate_file      4710.00                       4085.20                       
> workingset_restore_anon       732622.60                     639428.40                     
> workingset_restore_file       1007.00                       926.80                        
> workingset_nodereclaim        0.00                          0.00                          
> pgscan                        14343003.40                   12409570.20                   
> pgscan_kswapd                 0.00                          0.00                          
> pgscan_direct                 14343003.40                   12409570.20                   
> pgscan_khugepaged             0.00                          0.00                         

That's perfect. Thanks!

Would you mind adding all of the above into the changelog?

With that,

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore
  2024-02-01 15:49 ` [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore Chengming Zhou
  2024-02-02  1:11   ` Yosry Ahmed
@ 2024-02-02 16:28   ` Johannes Weiner
  2024-02-02 22:33   ` Nhat Pham
  2 siblings, 0 replies; 37+ messages in thread
From: Johannes Weiner @ 2024-02-02 16:28 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Feb 01, 2024 at 03:49:06PM +0000, Chengming Zhou wrote:
> Since we no longer need to leave the zswap entry on the zswap tree,
> we should remove it from the tree as soon as we find it there.
> 
> Then after using it, we can free it directly; no concurrent path
> can find it in the tree. Only the shrinker can still see it on the
> lru list, and it double checks under the tree lock, so there is no race.
> 
> So the zswap entry doesn't need a refcount anymore, and we don't need
> to take the spinlock a second time to invalidate it.
> 
> The side effect is that zswap_entry_free() may not happen under the
> tree spinlock, but that's ok since nothing there needs to be protected
> by the lock.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled
  2024-02-02 12:57     ` Chengming Zhou
  2024-02-02 16:26       ` Johannes Weiner
@ 2024-02-02 22:15       ` Yosry Ahmed
  2024-02-02 22:31       ` Nhat Pham
  2 siblings, 0 replies; 37+ messages in thread
From: Yosry Ahmed @ 2024-02-02 22:15 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, Nhat Pham, Andrew Morton, linux-kernel, linux-mm

> I just ran an A/B test of kernel building in a tmpfs directory, memory.max=2GB.
> (zswap writeback enabled and shrinker_enabled, one 50GB swapfile)
>
> From the results below, exclusive mode has fewer scans and refaults.
>
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
> real                          63.80                         63.01
> user                          1063.83                       1061.32
> sys                           290.31                        266.15
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
> workingset_refault_anon       2383084.40                    1976397.40
> workingset_refault_file       44134.00                      45689.40
> workingset_activate_anon      837878.00                     728441.20
> workingset_activate_file      4710.00                       4085.20
> workingset_restore_anon       732622.60                     639428.40
> workingset_restore_file       1007.00                       926.80
> workingset_nodereclaim        0.00                          0.00
> pgscan                        14343003.40                   12409570.20
> pgscan_kswapd                 0.00                          0.00
> pgscan_direct                 14343003.40                   12409570.20
> pgscan_khugepaged             0.00                          0.00

I think the numbers look really good, and as I mentioned, we have been
doing this in production for many years now, so:

Acked-by: Yosry Ahmed <yosryahmed@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/6] mm/zswap: remove duplicate_entry debug value
  2024-02-01 15:49 ` [PATCH 4/6] mm/zswap: remove duplicate_entry debug value Chengming Zhou
  2024-02-01 17:55   ` Johannes Weiner
@ 2024-02-02 22:17   ` Yosry Ahmed
  2024-02-02 22:28   ` Nhat Pham
  2 siblings, 0 replies; 37+ messages in thread
From: Yosry Ahmed @ 2024-02-02 22:17 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 1, 2024 at 7:50 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> cat /sys/kernel/debug/zswap/duplicate_entry
> 2086447
>
> When testing, the duplicate_entry value is very high, but there is no
> warning message in the kernel log. From the comment on duplicate_entry,
> "Duplicate store was encountered (rare)", it seems something went wrong.
>
> Actually it's incremented at the beginning of zswap_store() when the
> folio's zswap entry is found already on the tree. And this is a normal
> case, since the folio could leave its zswap entry on the tree after
> swapin, later get dirtied and go through swapout/zswap_store() again,
> and find its original zswap entry. (Maybe we could reuse it instead of
> invalidating it?)
>
> So duplicate_entry should only be incremented in the real bug case,
> which already has "WARN_ON(1)"; it looks redundant to count the bug
> case, so this patch just removes it.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Acked-by: Yosry Ahmed <yosryahmed@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/6] mm/zswap: add more comments in shrink_memcg_cb()
  2024-02-01 15:49 ` [PATCH 1/6] mm/zswap: add more comments in shrink_memcg_cb() Chengming Zhou
  2024-02-01 17:45   ` Johannes Weiner
  2024-02-01 23:55   ` Yosry Ahmed
@ 2024-02-02 22:25   ` Nhat Pham
  2 siblings, 0 replies; 37+ messages in thread
From: Nhat Pham @ 2024-02-02 22:25 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Feb 1, 2024 at 7:50 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Add more comments in shrink_memcg_cb() to describe the deref dance
> implemented to fix the race between lru writeback and swapoff, and
> the reason why we rotate the entry at the beginning.
>
> Also fix the stale comments in zswap_writeback_entry(), and add
> more comments to state that we only deref the tree after we get
> the swapcache reference.
>
> Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---

Reviewed-by: Nhat Pham <nphamcs@gmail.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/6] mm/zswap: remove duplicate_entry debug value
  2024-02-01 15:49 ` [PATCH 4/6] mm/zswap: remove duplicate_entry debug value Chengming Zhou
  2024-02-01 17:55   ` Johannes Weiner
  2024-02-02 22:17   ` Yosry Ahmed
@ 2024-02-02 22:28   ` Nhat Pham
  2024-02-03  4:29     ` Chengming Zhou
  2 siblings, 1 reply; 37+ messages in thread
From: Nhat Pham @ 2024-02-02 22:28 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Feb 1, 2024 at 7:50 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> cat /sys/kernel/debug/zswap/duplicate_entry
> 2086447
>
> When testing, the duplicate_entry value is very high, but there is no
> warning message in the kernel log. From the comment on duplicate_entry,
> "Duplicate store was encountered (rare)", it seems something went wrong.
>
> Actually it's incremented at the beginning of zswap_store() when the
> folio's zswap entry is found already on the tree. And this is a normal
> case, since the folio could leave its zswap entry on the tree after
> swapin, later get dirtied and go through swapout/zswap_store() again,
> and find its original zswap entry. (Maybe we could reuse it instead of
> invalidating it?)

Interesting. So if we make invalidate load the only mode, this oddity
is gone as well?

>
> So duplicate_entry should only be incremented in the real bug case,
> which already has "WARN_ON(1)"; it looks redundant to count the bug
> case, so this patch just removes it.

But yeah, I have literally never checked this value (maybe I should
ha). I'm fine with removing it, unless someone has a strong case for
this counter?

For now:
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/zswap.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 4381b7a2d4d6..3fbb7e2c8b8d 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -71,8 +71,6 @@ static u64 zswap_reject_compress_poor;
>  static u64 zswap_reject_alloc_fail;
>  /* Store failed because the entry metadata could not be allocated (rare) */
>  static u64 zswap_reject_kmemcache_fail;
> -/* Duplicate store was encountered (rare) */
> -static u64 zswap_duplicate_entry;
>
>  /* Shrinker work queue */
>  static struct workqueue_struct *shrink_wq;
> @@ -1571,10 +1569,8 @@ bool zswap_store(struct folio *folio)
>          */
>         spin_lock(&tree->lock);
>         entry = zswap_rb_search(&tree->rbroot, offset);
> -       if (entry) {
> +       if (entry)
>                 zswap_invalidate_entry(tree, entry);
> -               zswap_duplicate_entry++;
> -       }
>         spin_unlock(&tree->lock);
>         objcg = get_obj_cgroup_from_folio(folio);
>         if (objcg && !obj_cgroup_may_zswap(objcg)) {
> @@ -1661,7 +1657,6 @@ bool zswap_store(struct folio *folio)
>          */
>         while (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
>                 WARN_ON(1);
> -               zswap_duplicate_entry++;
>                 zswap_invalidate_entry(tree, dupentry);
>         }
>         if (entry->length) {
> @@ -1822,8 +1817,6 @@ static int zswap_debugfs_init(void)
>                            zswap_debugfs_root, &zswap_reject_compress_poor);
>         debugfs_create_u64("written_back_pages", 0444,
>                            zswap_debugfs_root, &zswap_written_back_pages);
> -       debugfs_create_u64("duplicate_entry", 0444,
> -                          zswap_debugfs_root, &zswap_duplicate_entry);
>         debugfs_create_u64("pool_total_size", 0444,
>                            zswap_debugfs_root, &zswap_pool_total_size);
>         debugfs_create_atomic_t("stored_pages", 0444,
>
> --
> b4 0.10.1

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled
  2024-02-02 12:57     ` Chengming Zhou
  2024-02-02 16:26       ` Johannes Weiner
  2024-02-02 22:15       ` Yosry Ahmed
@ 2024-02-02 22:31       ` Nhat Pham
  2 siblings, 0 replies; 37+ messages in thread
From: Nhat Pham @ 2024-02-02 22:31 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Fri, Feb 2, 2024 at 4:57 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2024/2/2 02:12, Johannes Weiner wrote:
> > On Thu, Feb 01, 2024 at 03:49:05PM +0000, Chengming Zhou wrote:
> >> The !zswap_exclusive_loads_enabled mode leaves a compressed copy in
> >> the zswap tree and lru list after the folio is swapped in.
> >>
> >> There are some disadvantages in this mode:
> >> 1. It wastes memory since there are two copies of the data: the
> >>    folio and the compressed data in zswap. And it's unlikely the
> >>    compressed data will be useful in the near future.
> >>
> >> 2. If that folio is dirtied, the compressed data can't be useful,
> >>    but we don't know that and don't invalidate the trashy copy in zswap.
> >>
> >> 3. It's not reclaimable by the zswap shrinker since zswap_writeback_entry()
> >>    will always return -EEXIST and terminate the shrinking process.
> >>
> >> On the other hand, the only downside of zswap_exclusive_loads_enabled
> >> is a little more cpu usage/latency to compress the folio again, and
> >> the cost is the same in both modes once the folio is removed from
> >> swapcache or dirtied.
> >>
> >> Not sure if we should accept the above disadvantages in the case of
> >> !zswap_exclusive_loads_enabled, so I'm sending this out for discussion.
> >>
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >
> > This is interesting.
> >
> > First, I will say that I never liked this config option, because it's
> > nearly impossible for a user to answer this question. Much better to
> > just pick a reasonable default.
>
> Agree.
>
> >
> > What should the default be?
> >
> > Caching "swapout work" is helpful when the system is thrashing. Then
> > recently swapped in pages might get swapped out again very soon. It
> > certainly makes sense with conventional swap, because keeping a clean
> > copy on the disk saves IO work and doesn't cost any additional memory.
> >
> > But with zswap, it's different. It saves some compression work on a
> > thrashing page. But the act of keeping compressed memory contributes
> > to a higher rate of thrashing. And that can cause IO in other places
> > like zswap writeback and file memory.
> >
> > It would be useful to have an A/B test to confirm that not caching is
> > better. Can you run your test with and without keeping the cache, and
> > in addition to the timings also compare the deltas for pgscan_anon,
> > pgscan_file, workingset_refault_anon, workingset_refault_file?
>
> I just ran an A/B test of kernel building in a tmpfs directory, memory.max=2GB.
> (zswap writeback enabled and shrinker_enabled, one 50GB swapfile)
>
> From the results below, exclusive mode has fewer scans and refaults.
>
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
> real                          63.80                         63.01
> user                          1063.83                       1061.32
> sys                           290.31                        266.15
>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive

This is one of those cases where something might make sense
conceptually, but does not pan out in practice. Removing the
non-exclusive mode seems to simplify the code a bit, and that's one
less thing to worry about for users, so I like this :)

Reviewed-by: Nhat Pham <nphamcs@gmail.com>

> workingset_refault_anon       2383084.40                    1976397.40
> workingset_refault_file       44134.00                      45689.40
> workingset_activate_anon      837878.00                     728441.20
> workingset_activate_file      4710.00                       4085.20
> workingset_restore_anon       732622.60                     639428.40
> workingset_restore_file       1007.00                       926.80
> workingset_nodereclaim        0.00                          0.00
> pgscan                        14343003.40                   12409570.20
> pgscan_kswapd                 0.00                          0.00
> pgscan_direct                 14343003.40                   12409570.20
> pgscan_khugepaged             0.00                          0.00

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore
  2024-02-01 15:49 ` [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore Chengming Zhou
  2024-02-02  1:11   ` Yosry Ahmed
  2024-02-02 16:28   ` Johannes Weiner
@ 2024-02-02 22:33   ` Nhat Pham
  2024-02-02 22:36     ` Yosry Ahmed
  2 siblings, 1 reply; 37+ messages in thread
From: Nhat Pham @ 2024-02-02 22:33 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Feb 1, 2024 at 7:50 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Since we no longer need to leave the zswap entry on the zswap tree,
> we should remove it from the tree as soon as we find it there.
>
> Then after using it, we can free it directly; no concurrent path
> can find it in the tree. Only the shrinker can still see it on the
> lru list, and it double checks under the tree lock, so there is no race.
>
> So the zswap entry doesn't need a refcount anymore, and we don't need
> to take the spinlock a second time to invalidate it.
>
> The side effect is that zswap_entry_free() may not happen under the
> tree spinlock, but that's ok since nothing there needs to be protected
> by the lock.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Oh this is sweet! Fewer things to keep in mind.
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

> ---
>  mm/zswap.c | 63 +++++++++++---------------------------------------------------
>  1 file changed, 11 insertions(+), 52 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index cbf379abb6c7..cd67f7f6b302 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -193,12 +193,6 @@ struct zswap_pool {
>   *
>   * rbnode - links the entry into red-black tree for the appropriate swap type
>   * swpentry - associated swap entry, the offset indexes into the red-black tree
> - * refcount - the number of outstanding reference to the entry. This is needed
> - *            to protect against premature freeing of the entry by code
> - *            concurrent calls to load, invalidate, and writeback.  The lock
> - *            for the zswap_tree structure that contains the entry must
> - *            be held while changing the refcount.  Since the lock must
> - *            be held, there is no reason to also make refcount atomic.
>   * length - the length in bytes of the compressed page data.  Needed during
>   *          decompression. For a same value filled page length is 0, and both
>   *          pool and lru are invalid and must be ignored.
> @@ -211,7 +205,6 @@ struct zswap_pool {
>  struct zswap_entry {
>         struct rb_node rbnode;
>         swp_entry_t swpentry;
> -       int refcount;

Hah this should even make zswap a bit more space-efficient. IIRC Yosry
has some analysis regarding how much less efficient zswap will be
every time we add a new field to zswap entry - this should go in the
opposite direction :)

>         unsigned int length;
>         struct zswap_pool *pool;
>         union {
> @@ -222,11 +215,6 @@ struct zswap_entry {
>         struct list_head lru;
>  };
>
> -/*
> - * The tree lock in the zswap_tree struct protects a few things:
> - * - the rbtree
> - * - the refcount field of each entry in the tree
> - */
>  struct zswap_tree {
>         struct rb_root rbroot;
>         spinlock_t lock;
> @@ -890,14 +878,10 @@ static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
>         return 0;
>  }
>
> -static bool zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> +static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
>  {
> -       if (!RB_EMPTY_NODE(&entry->rbnode)) {
> -               rb_erase(&entry->rbnode, root);
> -               RB_CLEAR_NODE(&entry->rbnode);
> -               return true;
> -       }
> -       return false;
> +       rb_erase(&entry->rbnode, root);
> +       RB_CLEAR_NODE(&entry->rbnode);
>  }
>
>  /*********************************
> @@ -911,7 +895,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
>         entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
>         if (!entry)
>                 return NULL;
> -       entry->refcount = 1;
>         RB_CLEAR_NODE(&entry->rbnode);
>         return entry;
>  }
> @@ -954,33 +937,15 @@ static void zswap_entry_free(struct zswap_entry *entry)
>         zswap_update_total_size();
>  }
>
> -/* caller must hold the tree lock */
> -static void zswap_entry_get(struct zswap_entry *entry)
> -{
> -       WARN_ON_ONCE(!entry->refcount);
> -       entry->refcount++;
> -}
> -
> -/* caller must hold the tree lock */
> -static void zswap_entry_put(struct zswap_entry *entry)
> -{
> -       WARN_ON_ONCE(!entry->refcount);
> -       if (--entry->refcount == 0) {
> -               WARN_ON_ONCE(!RB_EMPTY_NODE(&entry->rbnode));
> -               zswap_entry_free(entry);
> -       }
> -}
> -
>  /*
> - * If the entry is still valid in the tree, drop the initial ref and remove it
> - * from the tree. This function must be called with an additional ref held,
> - * otherwise it may race with another invalidation freeing the entry.
> + * The caller hold the tree lock and search the entry from the tree,
> + * so it must be on the tree, remove it from the tree and free it.
>   */
>  static void zswap_invalidate_entry(struct zswap_tree *tree,
>                                    struct zswap_entry *entry)
>  {
> -       if (zswap_rb_erase(&tree->rbroot, entry))
> -               zswap_entry_put(entry);
> +       zswap_rb_erase(&tree->rbroot, entry);
> +       zswap_entry_free(entry);
>  }
>
>  /*********************************
> @@ -1219,7 +1184,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         }
>
>         /* Safe to deref entry after the entry is verified above. */
> -       zswap_entry_get(entry);
> +       zswap_rb_erase(&tree->rbroot, entry);
>         spin_unlock(&tree->lock);
>
>         zswap_decompress(entry, &folio->page);
> @@ -1228,10 +1193,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPWB);
>
> -       spin_lock(&tree->lock);
> -       zswap_invalidate_entry(tree, entry);
> -       zswap_entry_put(entry);
> -       spin_unlock(&tree->lock);
> +       zswap_entry_free(entry);
>
>         /* folio is up to date */
>         folio_mark_uptodate(folio);
> @@ -1702,7 +1664,7 @@ bool zswap_load(struct folio *folio)
>                 spin_unlock(&tree->lock);
>                 return false;
>         }
> -       zswap_entry_get(entry);
> +       zswap_rb_erase(&tree->rbroot, entry);
>         spin_unlock(&tree->lock);
>
>         if (entry->length)
> @@ -1717,10 +1679,7 @@ bool zswap_load(struct folio *folio)
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPIN);
>
> -       spin_lock(&tree->lock);
> -       zswap_invalidate_entry(tree, entry);
> -       zswap_entry_put(entry);
> -       spin_unlock(&tree->lock);
> +       zswap_entry_free(entry);
>
>         folio_mark_dirty(folio);
>
>
> --
> b4 0.10.1

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore
  2024-02-02 22:33   ` Nhat Pham
@ 2024-02-02 22:36     ` Yosry Ahmed
  2024-02-02 22:44       ` Nhat Pham
  0 siblings, 1 reply; 37+ messages in thread
From: Yosry Ahmed @ 2024-02-02 22:36 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Chengming Zhou, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Fri, Feb 2, 2024 at 2:33 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Feb 1, 2024 at 7:50 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > Since we no longer need to leave the zswap entry on the zswap tree,
> > we should remove it from the tree as soon as we find it there.
> >
> > Then after using it, we can free it directly; no concurrent path
> > can find it in the tree. Only the shrinker can still see it on the
> > lru list, and it double checks under the tree lock, so there is no race.
> >
> > So the zswap entry doesn't need a refcount anymore, and we don't need
> > to take the spinlock a second time to invalidate it.
> >
> > The side effect is that zswap_entry_free() may not happen under the
> > tree spinlock, but that's ok since nothing there needs to be protected
> > by the lock.
> >
> > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>
> Oh this is sweet! Fewer things to keep in mind.
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
>
> > ---
> >  mm/zswap.c | 63 +++++++++++---------------------------------------------------
> >  1 file changed, 11 insertions(+), 52 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index cbf379abb6c7..cd67f7f6b302 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -193,12 +193,6 @@ struct zswap_pool {
> >   *
> >   * rbnode - links the entry into red-black tree for the appropriate swap type
> >   * swpentry - associated swap entry, the offset indexes into the red-black tree
> > - * refcount - the number of outstanding reference to the entry. This is needed
> > - *            to protect against premature freeing of the entry by code
> > - *            concurrent calls to load, invalidate, and writeback.  The lock
> > - *            for the zswap_tree structure that contains the entry must
> > - *            be held while changing the refcount.  Since the lock must
> > - *            be held, there is no reason to also make refcount atomic.
> >   * length - the length in bytes of the compressed page data.  Needed during
> >   *          decompression. For a same value filled page length is 0, and both
> >   *          pool and lru are invalid and must be ignored.
> > @@ -211,7 +205,6 @@ struct zswap_pool {
> >  struct zswap_entry {
> >         struct rb_node rbnode;
> >         swp_entry_t swpentry;
> > -       int refcount;
>
> Hah this should even make zswap a bit more space-efficient. IIRC Yosry
> has some analysis regarding how much less efficient zswap will be
> every time we add a new field to zswap entry - this should go in the
> opposite direction :)

Unfortunately in this specific case I think it won't change the size
of the allocation for struct zswap_entry anyway, but it is a step
nonetheless :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore
  2024-02-02 22:36     ` Yosry Ahmed
@ 2024-02-02 22:44       ` Nhat Pham
  2024-02-03  5:09         ` Chengming Zhou
  0 siblings, 1 reply; 37+ messages in thread
From: Nhat Pham @ 2024-02-02 22:44 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chengming Zhou, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Fri, Feb 2, 2024 at 2:37 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Feb 2, 2024 at 2:33 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Thu, Feb 1, 2024 at 7:50 AM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> > >
> > > Since we no longer need to leave the zswap entry on the zswap tree,
> > > we should remove it from the tree as soon as we find it there.
> > >
> > > Then after using it, we can free it directly; no concurrent path
> > > can find it in the tree. Only the shrinker can still see it on the
> > > lru list, and it double checks under the tree lock, so there is no race.
> > >
> > > So the zswap entry doesn't need a refcount anymore, and we don't need
> > > to take the spinlock a second time to invalidate it.
> > >
> > > The side effect is that zswap_entry_free() may not happen under the
> > > tree spinlock, but that's ok since nothing there needs to be protected
> > > by the lock.
> > >
> > > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >
> > Oh this is sweet! Fewer things to keep in mind.
> > Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> >
> > > ---
> > >  mm/zswap.c | 63 +++++++++++---------------------------------------------------
> > >  1 file changed, 11 insertions(+), 52 deletions(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index cbf379abb6c7..cd67f7f6b302 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -193,12 +193,6 @@ struct zswap_pool {
> > >   *
> > >   * rbnode - links the entry into red-black tree for the appropriate swap type
> > >   * swpentry - associated swap entry, the offset indexes into the red-black tree
> > > - * refcount - the number of outstanding reference to the entry. This is needed
> > > - *            to protect against premature freeing of the entry by code
> > > - *            concurrent calls to load, invalidate, and writeback.  The lock
> > > - *            for the zswap_tree structure that contains the entry must
> > > - *            be held while changing the refcount.  Since the lock must
> > > - *            be held, there is no reason to also make refcount atomic.
> > >   * length - the length in bytes of the compressed page data.  Needed during
> > >   *          decompression. For a same value filled page length is 0, and both
> > >   *          pool and lru are invalid and must be ignored.
> > > @@ -211,7 +205,6 @@ struct zswap_pool {
> > >  struct zswap_entry {
> > >         struct rb_node rbnode;
> > >         swp_entry_t swpentry;
> > > -       int refcount;
> >
> > Hah this should even make zswap a bit more space-efficient. IIRC Yosry
> > has some analysis regarding how much less efficient zswap will be
> > every time we add a new field to zswap entry - this should go in the
> > opposite direction :)
>
> Unfortunately in this specific case I think it won't change the size
> of the allocation for struct zswap_entry anyway, but it is a step
> nonetheless :)

Ah, is it because of the field alignment requirement? But yeah, one
day we will remove enough of them :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/6] mm/zswap: remove duplicate_entry debug value
  2024-02-02 22:28   ` Nhat Pham
@ 2024-02-03  4:29     ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-03  4:29 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Johannes Weiner, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On 2024/2/3 06:28, Nhat Pham wrote:
> On Thu, Feb 1, 2024 at 7:50 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> cat /sys/kernel/debug/zswap/duplicate_entry
>> 2086447
>>
>> When testing, the duplicate_entry value is very high, but there is no
>> warning message in the kernel log. From the comment on duplicate_entry,
>> "Duplicate store was encountered (rare)", it seems something went wrong.
>>
>> Actually it's incremented at the beginning of zswap_store() when the
>> folio's zswap entry is found already on the tree. And this is a normal
>> case, since the folio could leave its zswap entry on the tree after
>> swapin, later get dirtied and go through swapout/zswap_store() again,
>> and find its original zswap entry. (Maybe we could reuse it instead of
>> invalidating it?)
> 
> Interesting. So if we make invalidate load the only mode, this oddity
> is gone as well?

Good point!
This oddity is why we need to invalidate it first at the beginning.

But there is another oddity: a stored folio may be dirtied again, so
that folio needs to be written back/stored a second time, in which
case we still need to invalidate it first to avoid the WARN_ON later.

Thanks.

>>
>> So duplicate_entry should only be incremented in the real bug case,
>> which already has "WARN_ON(1)"; it looks redundant to count the bug
>> case, so this patch just removes it.
> 
> But yeah, I have literally never checked this value (maybe I should
> ha). I'm fine with removing it, unless someone has a strong case for
> this counter?
> 
> For now:
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> 
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  mm/zswap.c | 9 +--------
>>  1 file changed, 1 insertion(+), 8 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 4381b7a2d4d6..3fbb7e2c8b8d 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -71,8 +71,6 @@ static u64 zswap_reject_compress_poor;
>>  static u64 zswap_reject_alloc_fail;
>>  /* Store failed because the entry metadata could not be allocated (rare) */
>>  static u64 zswap_reject_kmemcache_fail;
>> -/* Duplicate store was encountered (rare) */
>> -static u64 zswap_duplicate_entry;
>>
>>  /* Shrinker work queue */
>>  static struct workqueue_struct *shrink_wq;
>> @@ -1571,10 +1569,8 @@ bool zswap_store(struct folio *folio)
>>          */
>>         spin_lock(&tree->lock);
>>         entry = zswap_rb_search(&tree->rbroot, offset);
>> -       if (entry) {
>> +       if (entry)
>>                 zswap_invalidate_entry(tree, entry);
>> -               zswap_duplicate_entry++;
>> -       }
>>         spin_unlock(&tree->lock);
>>         objcg = get_obj_cgroup_from_folio(folio);
>>         if (objcg && !obj_cgroup_may_zswap(objcg)) {
>> @@ -1661,7 +1657,6 @@ bool zswap_store(struct folio *folio)
>>          */
>>         while (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
>>                 WARN_ON(1);
>> -               zswap_duplicate_entry++;
>>                 zswap_invalidate_entry(tree, dupentry);
>>         }
>>         if (entry->length) {
>> @@ -1822,8 +1817,6 @@ static int zswap_debugfs_init(void)
>>                            zswap_debugfs_root, &zswap_reject_compress_poor);
>>         debugfs_create_u64("written_back_pages", 0444,
>>                            zswap_debugfs_root, &zswap_written_back_pages);
>> -       debugfs_create_u64("duplicate_entry", 0444,
>> -                          zswap_debugfs_root, &zswap_duplicate_entry);
>>         debugfs_create_u64("pool_total_size", 0444,
>>                            zswap_debugfs_root, &zswap_pool_total_size);
>>         debugfs_create_atomic_t("stored_pages", 0444,
>>
>> --
>> b4 0.10.1

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled
  2024-02-02 16:26       ` Johannes Weiner
@ 2024-02-03  4:33         ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-03  4:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Nhat Pham, Andrew Morton, Yosry Ahmed, linux-kernel, linux-mm

On 2024/2/3 00:26, Johannes Weiner wrote:
> On Fri, Feb 02, 2024 at 08:57:38PM +0800, Chengming Zhou wrote:
>> On 2024/2/2 02:12, Johannes Weiner wrote:
>>> Caching "swapout work" is helpful when the system is thrashing. Then
>>> recently swapped in pages might get swapped out again very soon. It
>>> certainly makes sense with conventional swap, because keeping a clean
>>> copy on the disk saves IO work and doesn't cost any additional memory.
>>>
>>> But with zswap, it's different. It saves some compression work on a
>>> thrashing page. But the act of keeping compressed memory contributes
>>> to a higher rate of thrashing. And that can cause IO in other places
>>> like zswap writeback and file memory.
>>
>> [...] ran an A/B test of kernel building in a tmpfs directory, memory.max=2GB.
>> (zswap writeback enabled and shrinker_enabled, one 50GB swapfile)
>>
>> From the results below, exclusive mode has fewer scans and refaults.
>>
>>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
>> real                          63.80                         63.01                         
>> user                          1063.83                       1061.32                       
>> sys                           290.31                        266.15                        
>>                               zswap-invalidate-entry        zswap-invalidate-entry-exclusive
>> workingset_refault_anon       2383084.40                    1976397.40                    
>> workingset_refault_file       44134.00                      45689.40                      
>> workingset_activate_anon      837878.00                     728441.20                     
>> workingset_activate_file      4710.00                       4085.20                       
>> workingset_restore_anon       732622.60                     639428.40                     
>> workingset_restore_file       1007.00                       926.80                        
>> workingset_nodereclaim        0.00                          0.00                          
>> pgscan                        14343003.40                   12409570.20                   
>> pgscan_kswapd                 0.00                          0.00                          
>> pgscan_direct                 14343003.40                   12409570.20                   
>> pgscan_khugepaged             0.00                          0.00                         
> 
> That's perfect. Thanks!
> 
> Would you mind adding all of the above into the changelog?

Yeah, will do. Thanks!

> 
> With that,
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore
  2024-02-02 22:44       ` Nhat Pham
@ 2024-02-03  5:09         ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2024-02-03  5:09 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed
  Cc: Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On 2024/2/3 06:44, Nhat Pham wrote:
> On Fri, Feb 2, 2024 at 2:37 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Fri, Feb 2, 2024 at 2:33 PM Nhat Pham <nphamcs@gmail.com> wrote:
>>>
>>> On Thu, Feb 1, 2024 at 7:50 AM Chengming Zhou
>>> <zhouchengming@bytedance.com> wrote:
>>>>
>>>> Since we no longer need to leave the zswap entry on the zswap tree,
>>>> we should remove it from the tree as soon as we find it there.
>>>>
>>>> Then after using it, we can free it directly; no concurrent path
>>>> can find it in the tree. Only the shrinker can still see it on the
>>>> lru list, and it double checks under the tree lock, so there is no race.
>>>>
>>>> So the zswap entry doesn't need a refcount anymore, and we don't need
>>>> to take the spinlock a second time to invalidate it.
>>>>
>>>> The side effect is that zswap_entry_free() may not happen under the
>>>> tree spinlock, but that's ok since nothing there needs to be protected
>>>> by the lock.
>>>>
>>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>>>
>>> Oh this is sweet! Fewer things to keep in mind.
>>> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
>>>
>>>> ---
>>>>  mm/zswap.c | 63 +++++++++++---------------------------------------------------
>>>>  1 file changed, 11 insertions(+), 52 deletions(-)
>>>>
>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>> index cbf379abb6c7..cd67f7f6b302 100644
>>>> --- a/mm/zswap.c
>>>> +++ b/mm/zswap.c
>>>> @@ -193,12 +193,6 @@ struct zswap_pool {
>>>>   *
>>>>   * rbnode - links the entry into red-black tree for the appropriate swap type
>>>>   * swpentry - associated swap entry, the offset indexes into the red-black tree
>>>> - * refcount - the number of outstanding reference to the entry. This is needed
>>>> - *            to protect against premature freeing of the entry by code
>>>> - *            concurrent calls to load, invalidate, and writeback.  The lock
>>>> - *            for the zswap_tree structure that contains the entry must
>>>> - *            be held while changing the refcount.  Since the lock must
>>>> - *            be held, there is no reason to also make refcount atomic.
>>>>   * length - the length in bytes of the compressed page data.  Needed during
>>>>   *          decompression. For a same value filled page length is 0, and both
>>>>   *          pool and lru are invalid and must be ignored.
>>>> @@ -211,7 +205,6 @@ struct zswap_pool {
>>>>  struct zswap_entry {
>>>>         struct rb_node rbnode;
>>>>         swp_entry_t swpentry;
>>>> -       int refcount;
>>>
>>> Hah this should even make zswap a bit more space-efficient. IIRC Yosry
>>> has some analysis regarding how much less efficient zswap will be
>>> every time we add a new field to zswap entry - this should go in the
>>> opposite direction :)
>>
>> Unfortunately in this specific case I think it won't change the size
>> of the allocation for struct zswap_entry anyway, but it is a step
>> nonetheless :)
> 
> Ah, is it because of the field alignment requirement? But yeah, one
> day we will remove enough of them :)

Yeah, the zswap_entry size is not changed :)

If the xarray conversion lands later, the rb_node will be gone, so
one cacheline will be enough.

struct zswap_entry {
	struct rb_node             rbnode __attribute__((__aligned__(8))); /*     0    24 */
	swp_entry_t                swpentry;             /*    24     8 */
	unsigned int               length;               /*    32     4 */

	/* XXX 4 bytes hole, try to pack */

	struct zswap_pool *        pool;                 /*    40     8 */
	union {
		long unsigned int  handle;               /*    48     8 */
		long unsigned int  value;                /*    48     8 */
	};                                               /*    48     8 */
	struct obj_cgroup *        objcg;                /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct list_head           lru;                  /*    64    16 */

	/* size: 80, cachelines: 2, members: 7 */
	/* sum members: 76, holes: 1, sum holes: 4 */
	/* forced alignments: 1 */
	/* last cacheline: 16 bytes */
} __attribute__((__aligned__(8)));

^ permalink raw reply	[flat|nested] 37+ messages in thread
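
To illustrate the one-cacheline point above, here is a hypothetical layout
once the rb_node is dropped. This is a sketch only, not taken from the thread
or from any posted patch, and it assumes the same x86_64 layout as the pahole
output above with no other field changes.

	/* Hypothetical zswap_entry after an xarray conversion (sketch). */
	struct zswap_entry {
		swp_entry_t swpentry;           /* offset  0, size  8 */
		unsigned int length;            /* offset  8, size  4 */
						/* 4-byte hole        */
		struct zswap_pool *pool;        /* offset 16, size  8 */
		union {
			unsigned long handle;   /* offset 24, size  8 */
			unsigned long value;    /* offset 24, size  8 */
		};
		struct obj_cgroup *objcg;       /* offset 32, size  8 */
		struct list_head lru;           /* offset 40, size 16 */
	};
	/* total: 56 bytes, which fits in a single 64-byte cacheline */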

end of thread, other threads:[~2024-02-03  5:09 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-01 15:49 [PATCH 0/6] mm/zswap: optimize zswap lru list Chengming Zhou
2024-02-01 15:49 ` [PATCH 1/6] mm/zswap: add more comments in shrink_memcg_cb() Chengming Zhou
2024-02-01 17:45   ` Johannes Weiner
2024-02-01 23:55   ` Yosry Ahmed
2024-02-02 22:25   ` Nhat Pham
2024-02-01 15:49 ` [PATCH 2/6] mm/zswap: invalidate zswap entry when swap entry free Chengming Zhou
2024-02-01 17:49   ` Johannes Weiner
2024-02-01 20:56   ` Nhat Pham
2024-02-02  0:11   ` Yosry Ahmed
2024-02-02  8:10     ` Chengming Zhou
2024-02-01 15:49 ` [PATCH 3/6] mm/zswap: stop lru list shrinking when encounter warm region Chengming Zhou
2024-02-01 17:51   ` Johannes Weiner
2024-02-01 18:10   ` Nhat Pham
2024-02-02  0:15   ` Yosry Ahmed
2024-02-02  8:12     ` Chengming Zhou
2024-02-01 15:49 ` [PATCH 4/6] mm/zswap: remove duplicate_entry debug value Chengming Zhou
2024-02-01 17:55   ` Johannes Weiner
2024-02-02  8:18     ` Chengming Zhou
2024-02-02 22:17   ` Yosry Ahmed
2024-02-02 22:28   ` Nhat Pham
2024-02-03  4:29     ` Chengming Zhou
2024-02-01 15:49 ` [PATCH 5/6] mm/zswap: only support zswap_exclusive_loads_enabled Chengming Zhou
2024-02-01 18:12   ` Johannes Weiner
2024-02-02  1:04     ` Yosry Ahmed
2024-02-02 12:57     ` Chengming Zhou
2024-02-02 16:26       ` Johannes Weiner
2024-02-03  4:33         ` Chengming Zhou
2024-02-02 22:15       ` Yosry Ahmed
2024-02-02 22:31       ` Nhat Pham
2024-02-01 15:49 ` [PATCH 6/6] mm/zswap: zswap entry doesn't need refcount anymore Chengming Zhou
2024-02-02  1:11   ` Yosry Ahmed
2024-02-02 13:00     ` Chengming Zhou
2024-02-02 16:28   ` Johannes Weiner
2024-02-02 22:33   ` Nhat Pham
2024-02-02 22:36     ` Yosry Ahmed
2024-02-02 22:44       ` Nhat Pham
2024-02-03  5:09         ` Chengming Zhou

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).