* [PATCH 0/5] mm/zswap: dstmem reuse optimizations and cleanups
@ 2023-12-13  4:17 Chengming Zhou
  2023-12-13  4:17 ` [PATCH 1/5] mm/zswap: reuse dstmem when decompress Chengming Zhou
                   ` (4 more replies)
  0 siblings, 5 replies; 41+ messages in thread
From: Chengming Zhou @ 2023-12-13  4:17 UTC (permalink / raw)
  To: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, Yosry Ahmed
  Cc: Nhat Pham, linux-kernel, linux-mm, Chengming Zhou

Hi everyone,

This series is split from [1] to include only the zswap dstmem reuse
optimizations and cleanups; the other part, the rbtree breakdown, will
be deferred and retested after the rbtree is converted to an xarray.

The problem this series addresses is that zswap_load() and
zswap_writeback_entry() have to kmalloc a temporary buffer to support
the !zpool_can_sleep_mapped() case. We can avoid that allocation by
reusing the percpu crypto_acomp_ctx->dstmem, which is also used by
zswap_store() and protected by the same percpu crypto_acomp_ctx->mutex.
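
Condensed, the change on the decompress side looks roughly like this
(an illustrative sketch only; the real code is in the diffs below):

```
	/* Before: a temporary buffer allocated on every load/writeback */
	tmp = kmalloc(entry->length, GFP_KERNEL);
	memcpy(tmp, src, entry->length);
	src = tmp;
	/* ... decompress ... */
	kfree(tmp);

	/* After: reuse the percpu buffer already used by zswap_store() */
	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
	mutex_lock(acomp_ctx->mutex);
	memcpy(acomp_ctx->dstmem, src, entry->length);
	src = acomp_ctx->dstmem;
	/* ... decompress ... */
	mutex_unlock(acomp_ctx->mutex);
```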

[1] https://lore.kernel.org/all/20231206-zswap-lock-optimize-v1-0-e25b059f9c3a@bytedance.com/

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Chengming Zhou (5):
      mm/zswap: reuse dstmem when decompress
      mm/zswap: change dstmem size to one page
      mm/zswap: refactor out __zswap_load()
      mm/zswap: cleanup zswap_load()
      mm/zswap: cleanup zswap_reclaim_entry()

 mm/zswap.c | 158 +++++++++++++++++++------------------------------------------
 1 file changed, 49 insertions(+), 109 deletions(-)
---
base-commit: 1f242c1964cf9b8d663a2fd72159b296205a8126
change-id: 20231213-zswap-dstmem-d828f563303d

Best regards,
-- 
Chengming Zhou <zhouchengming@bytedance.com>

* [PATCH 1/5] mm/zswap: reuse dstmem when decompress
  2023-12-13  4:17 [PATCH 0/5] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
@ 2023-12-13  4:17 ` Chengming Zhou
  2023-12-13 23:24   ` Yosry Ahmed
  2023-12-14 17:59   ` Chris Li
  2023-12-13  4:17 ` [PATCH 2/5] mm/zswap: change dstmem size to one page Chengming Zhou
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 41+ messages in thread
From: Chengming Zhou @ 2023-12-13  4:17 UTC (permalink / raw)
  To: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, Yosry Ahmed
  Cc: Nhat Pham, linux-kernel, linux-mm, Chengming Zhou

In the !zpool_can_sleep_mapped() case, such as zsmalloc, we first need
to copy the entry->handle memory to a temporary buffer, which is
allocated with kmalloc.

Obviously we can reuse the per-compressor dstmem to avoid allocating it
every time, since it is percpu per-compressor and protected by a mutex.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/zswap.c | 29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 7ee54a3d8281..edb8b45ed5a1 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1772,9 +1772,9 @@ bool zswap_load(struct folio *folio)
 	struct zswap_entry *entry;
 	struct scatterlist input, output;
 	struct crypto_acomp_ctx *acomp_ctx;
-	u8 *src, *dst, *tmp;
+	unsigned int dlen = PAGE_SIZE;
+	u8 *src, *dst;
 	struct zpool *zpool;
-	unsigned int dlen;
 	bool ret;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
@@ -1796,27 +1796,18 @@ bool zswap_load(struct folio *folio)
 		goto stats;
 	}
 
-	zpool = zswap_find_zpool(entry);
-	if (!zpool_can_sleep_mapped(zpool)) {
-		tmp = kmalloc(entry->length, GFP_KERNEL);
-		if (!tmp) {
-			ret = false;
-			goto freeentry;
-		}
-	}
-
 	/* decompress */
-	dlen = PAGE_SIZE;
-	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
+	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+	mutex_lock(acomp_ctx->mutex);
 
+	zpool = zswap_find_zpool(entry);
+	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
 	if (!zpool_can_sleep_mapped(zpool)) {
-		memcpy(tmp, src, entry->length);
-		src = tmp;
+		memcpy(acomp_ctx->dstmem, src, entry->length);
+		src = acomp_ctx->dstmem;
 		zpool_unmap_handle(zpool, entry->handle);
 	}
 
-	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-	mutex_lock(acomp_ctx->mutex);
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_page(&output, page, PAGE_SIZE, 0);
@@ -1827,15 +1818,13 @@ bool zswap_load(struct folio *folio)
 
 	if (zpool_can_sleep_mapped(zpool))
 		zpool_unmap_handle(zpool, entry->handle);
-	else
-		kfree(tmp);
 
 	ret = true;
 stats:
 	count_vm_event(ZSWPIN);
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPIN);
-freeentry:
+
 	spin_lock(&tree->lock);
 	if (ret && zswap_exclusive_loads_enabled) {
 		zswap_invalidate_entry(tree, entry);

-- 
b4 0.10.1

* [PATCH 2/5] mm/zswap: change dstmem size to one page
  2023-12-13  4:17 [PATCH 0/5] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
  2023-12-13  4:17 ` [PATCH 1/5] mm/zswap: reuse dstmem when decompress Chengming Zhou
@ 2023-12-13  4:17 ` Chengming Zhou
  2023-12-13 23:34   ` Yosry Ahmed
  2023-12-13  4:18 ` [PATCH 3/5] mm/zswap: refactor out __zswap_load() Chengming Zhou
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Chengming Zhou @ 2023-12-13  4:17 UTC (permalink / raw)
  To: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, Yosry Ahmed
  Cc: Nhat Pham, linux-kernel, linux-mm, Chengming Zhou

Change the dstmem size from 2 * PAGE_SIZE to one page, since we need at
most one page when compressing, and the "dlen" passed to
acomp_request_set_params() is also PAGE_SIZE. If the output size exceeds
PAGE_SIZE, we don't want to store the output in zswap anyway.

So change it to one page, and delete the stale comment.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index edb8b45ed5a1..fa186945010d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -707,7 +707,7 @@ static int zswap_dstmem_prepare(unsigned int cpu)
 	struct mutex *mutex;
 	u8 *dst;
 
-	dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+	dst = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
 	if (!dst)
 		return -ENOMEM;
 
@@ -1662,8 +1662,7 @@ bool zswap_store(struct folio *folio)
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
-	/* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
-	sg_init_one(&output, dst, PAGE_SIZE * 2);
+	sg_init_one(&output, dst, PAGE_SIZE);
 	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,

-- 
b4 0.10.1

* [PATCH 3/5] mm/zswap: refactor out __zswap_load()
  2023-12-13  4:17 [PATCH 0/5] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
  2023-12-13  4:17 ` [PATCH 1/5] mm/zswap: reuse dstmem when decompress Chengming Zhou
  2023-12-13  4:17 ` [PATCH 2/5] mm/zswap: change dstmem size to one page Chengming Zhou
@ 2023-12-13  4:18 ` Chengming Zhou
  2023-12-13 23:37   ` Yosry Ahmed
  2023-12-14  0:52   ` Yosry Ahmed
  2023-12-13  4:18 ` [PATCH 4/5] mm/zswap: cleanup zswap_load() Chengming Zhou
  2023-12-13  4:18 ` [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry() Chengming Zhou
  4 siblings, 2 replies; 41+ messages in thread
From: Chengming Zhou @ 2023-12-13  4:18 UTC (permalink / raw)
  To: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, Yosry Ahmed
  Cc: Nhat Pham, linux-kernel, linux-mm, Chengming Zhou

zswap_load() and zswap_writeback_entry() share the same code that
decompresses the data from a zswap_entry into a page, so refactor out
the common part as __zswap_load(entry, page).

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/zswap.c | 107 ++++++++++++++++++++++---------------------------------------
 1 file changed, 38 insertions(+), 69 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index fa186945010d..2f095c919a5c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1392,6 +1392,41 @@ static int zswap_enabled_param_set(const char *val,
 	return ret;
 }
 
+static void __zswap_load(struct zswap_entry *entry, struct page *page)
+{
+	struct scatterlist input, output;
+	unsigned int dlen = PAGE_SIZE;
+	struct crypto_acomp_ctx *acomp_ctx;
+	struct zpool *zpool;
+	u8 *src;
+	int ret;
+
+	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+	mutex_lock(acomp_ctx->mutex);
+
+	zpool = zswap_find_zpool(entry);
+	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
+	if (!zpool_can_sleep_mapped(zpool)) {
+		memcpy(acomp_ctx->dstmem, src, entry->length);
+		src = acomp_ctx->dstmem;
+		zpool_unmap_handle(zpool, entry->handle);
+	}
+
+	sg_init_one(&input, src, entry->length);
+	sg_init_table(&output, 1);
+	sg_set_page(&output, page, PAGE_SIZE, 0);
+	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
+	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
+	dlen = acomp_ctx->req->dlen;
+	mutex_unlock(acomp_ctx->mutex);
+
+	if (zpool_can_sleep_mapped(zpool))
+		zpool_unmap_handle(zpool, entry->handle);
+
+	BUG_ON(ret);
+	BUG_ON(dlen != PAGE_SIZE);
+}
+
 /*********************************
 * writeback code
 **********************************/
@@ -1413,23 +1448,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	swp_entry_t swpentry = entry->swpentry;
 	struct page *page;
 	struct mempolicy *mpol;
-	struct scatterlist input, output;
-	struct crypto_acomp_ctx *acomp_ctx;
-	struct zpool *pool = zswap_find_zpool(entry);
 	bool page_was_allocated;
-	u8 *src, *tmp = NULL;
-	unsigned int dlen;
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
 	};
 
-	if (!zpool_can_sleep_mapped(pool)) {
-		tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
-		if (!tmp)
-			return -ENOMEM;
-	}
-
 	/* try to allocate swap cache page */
 	mpol = get_task_policy(current);
 	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
@@ -1462,33 +1486,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	}
 	spin_unlock(&tree->lock);
 
-	/* decompress */
-	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-	dlen = PAGE_SIZE;
-
-	src = zpool_map_handle(pool, entry->handle, ZPOOL_MM_RO);
-	if (!zpool_can_sleep_mapped(pool)) {
-		memcpy(tmp, src, entry->length);
-		src = tmp;
-		zpool_unmap_handle(pool, entry->handle);
-	}
-
-	mutex_lock(acomp_ctx->mutex);
-	sg_init_one(&input, src, entry->length);
-	sg_init_table(&output, 1);
-	sg_set_page(&output, page, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
-	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
-	mutex_unlock(acomp_ctx->mutex);
-
-	if (!zpool_can_sleep_mapped(pool))
-		kfree(tmp);
-	else
-		zpool_unmap_handle(pool, entry->handle);
-
-	BUG_ON(ret);
-	BUG_ON(dlen != PAGE_SIZE);
+	__zswap_load(entry, page);
 
 	/* page is up to date */
 	SetPageUptodate(page);
@@ -1508,9 +1506,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	return ret;
 
 fail:
-	if (!zpool_can_sleep_mapped(pool))
-		kfree(tmp);
-
 	/*
 	 * If we get here because the page is already in swapcache, a
 	 * load may be happening concurrently. It is safe and okay to
@@ -1769,11 +1764,7 @@ bool zswap_load(struct folio *folio)
 	struct page *page = &folio->page;
 	struct zswap_tree *tree = zswap_trees[type];
 	struct zswap_entry *entry;
-	struct scatterlist input, output;
-	struct crypto_acomp_ctx *acomp_ctx;
-	unsigned int dlen = PAGE_SIZE;
-	u8 *src, *dst;
-	struct zpool *zpool;
+	u8 *dst;
 	bool ret;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
@@ -1795,29 +1786,7 @@ bool zswap_load(struct folio *folio)
 		goto stats;
 	}
 
-	/* decompress */
-	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-	mutex_lock(acomp_ctx->mutex);
-
-	zpool = zswap_find_zpool(entry);
-	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
-	if (!zpool_can_sleep_mapped(zpool)) {
-		memcpy(acomp_ctx->dstmem, src, entry->length);
-		src = acomp_ctx->dstmem;
-		zpool_unmap_handle(zpool, entry->handle);
-	}
-
-	sg_init_one(&input, src, entry->length);
-	sg_init_table(&output, 1);
-	sg_set_page(&output, page, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
-	if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait))
-		WARN_ON(1);
-	mutex_unlock(acomp_ctx->mutex);
-
-	if (zpool_can_sleep_mapped(zpool))
-		zpool_unmap_handle(zpool, entry->handle);
-
+	__zswap_load(entry, page);
 	ret = true;
 stats:
 	count_vm_event(ZSWPIN);

-- 
b4 0.10.1

* [PATCH 4/5] mm/zswap: cleanup zswap_load()
  2023-12-13  4:17 [PATCH 0/5] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
                   ` (2 preceding siblings ...)
  2023-12-13  4:18 ` [PATCH 3/5] mm/zswap: refactor out __zswap_load() Chengming Zhou
@ 2023-12-13  4:18 ` Chengming Zhou
  2023-12-14  0:56   ` Yosry Ahmed
  2023-12-13  4:18 ` [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry() Chengming Zhou
  4 siblings, 1 reply; 41+ messages in thread
From: Chengming Zhou @ 2023-12-13  4:18 UTC (permalink / raw)
  To: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, Yosry Ahmed
  Cc: Nhat Pham, linux-kernel, linux-mm, Chengming Zhou

After the common decompression code moves into __zswap_load(), we can
clean up zswap_load() a little.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 2f095c919a5c..0476e1c553c2 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1765,7 +1765,6 @@ bool zswap_load(struct folio *folio)
 	struct zswap_tree *tree = zswap_trees[type];
 	struct zswap_entry *entry;
 	u8 *dst;
-	bool ret;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 
@@ -1782,19 +1781,16 @@ bool zswap_load(struct folio *folio)
 		dst = kmap_local_page(page);
 		zswap_fill_page(dst, entry->value);
 		kunmap_local(dst);
-		ret = true;
-		goto stats;
+	} else {
+		__zswap_load(entry, page);
 	}
 
-	__zswap_load(entry, page);
-	ret = true;
-stats:
 	count_vm_event(ZSWPIN);
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPIN);
 
 	spin_lock(&tree->lock);
-	if (ret && zswap_exclusive_loads_enabled) {
+	if (zswap_exclusive_loads_enabled) {
 		zswap_invalidate_entry(tree, entry);
 		folio_mark_dirty(folio);
 	} else if (entry->length) {
@@ -1804,7 +1800,7 @@ bool zswap_load(struct folio *folio)
 	zswap_entry_put(tree, entry);
 	spin_unlock(&tree->lock);
 
-	return ret;
+	return true;
 }
 
 void zswap_invalidate(int type, pgoff_t offset)

-- 
b4 0.10.1

* [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry()
  2023-12-13  4:17 [PATCH 0/5] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
                   ` (3 preceding siblings ...)
  2023-12-13  4:18 ` [PATCH 4/5] mm/zswap: cleanup zswap_load() Chengming Zhou
@ 2023-12-13  4:18 ` Chengming Zhou
  2023-12-13 23:27   ` Nhat Pham
  2023-12-14  1:02   ` Yosry Ahmed
  4 siblings, 2 replies; 41+ messages in thread
From: Chengming Zhou @ 2023-12-13  4:18 UTC (permalink / raw)
  To: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, Yosry Ahmed
  Cc: Nhat Pham, linux-kernel, linux-mm, Chengming Zhou

Also, after the common decompression code moves into __zswap_load(), we
can clean up zswap_reclaim_entry() a little.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 23 +++++------------------
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 0476e1c553c2..9c709368a0e6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1449,7 +1449,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct page *page;
 	struct mempolicy *mpol;
 	bool page_was_allocated;
-	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
 	};
@@ -1458,16 +1457,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	mpol = get_task_policy(current);
 	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
 				NO_INTERLEAVE_INDEX, &page_was_allocated, true);
-	if (!page) {
-		ret = -ENOMEM;
-		goto fail;
-	}
+	if (!page)
+		return -ENOMEM;
 
 	/* Found an existing page, we raced with load/swapin */
 	if (!page_was_allocated) {
 		put_page(page);
-		ret = -EEXIST;
-		goto fail;
+		return -EEXIST;
 	}
 
 	/*
@@ -1481,8 +1477,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
 		spin_unlock(&tree->lock);
 		delete_from_swap_cache(page_folio(page));
-		ret = -ENOMEM;
-		goto fail;
+		return -ENOMEM;
 	}
 	spin_unlock(&tree->lock);
 
@@ -1503,15 +1498,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	__swap_writepage(page, &wbc);
 	put_page(page);
 
-	return ret;
-
-fail:
-	/*
-	 * If we get here because the page is already in swapcache, a
-	 * load may be happening concurrently. It is safe and okay to
-	 * not free the entry. It is also okay to return !0.
-	 */
-	return ret;
+	return 0;
 }
 
 static int zswap_is_page_same_filled(void *ptr, unsigned long *value)

-- 
b4 0.10.1

* Re: [PATCH 1/5] mm/zswap: reuse dstmem when decompress
  2023-12-13  4:17 ` [PATCH 1/5] mm/zswap: reuse dstmem when decompress Chengming Zhou
@ 2023-12-13 23:24   ` Yosry Ahmed
  2023-12-14 13:29     ` Chengming Zhou
  2023-12-14 17:59   ` Chris Li
  1 sibling, 1 reply; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-13 23:24 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
> copy the entry->handle memory to a temporary memory, which is allocated
> using kmalloc.
>
> Obviously we can reuse the per-compressor dstmem to avoid allocating
> every time, since it's percpu-compressor and protected in mutex.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  mm/zswap.c | 29 +++++++++--------------------
>  1 file changed, 9 insertions(+), 20 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 7ee54a3d8281..edb8b45ed5a1 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1772,9 +1772,9 @@ bool zswap_load(struct folio *folio)
>         struct zswap_entry *entry;
>         struct scatterlist input, output;
>         struct crypto_acomp_ctx *acomp_ctx;
> -       u8 *src, *dst, *tmp;
> +       unsigned int dlen = PAGE_SIZE;
> +       u8 *src, *dst;
>         struct zpool *zpool;
> -       unsigned int dlen;
>         bool ret;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> @@ -1796,27 +1796,18 @@ bool zswap_load(struct folio *folio)
>                 goto stats;
>         }
>
> -       zpool = zswap_find_zpool(entry);
> -       if (!zpool_can_sleep_mapped(zpool)) {
> -               tmp = kmalloc(entry->length, GFP_KERNEL);
> -               if (!tmp) {
> -                       ret = false;
> -                       goto freeentry;
> -               }
> -       }
> -
>         /* decompress */
> -       dlen = PAGE_SIZE;
> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> +       mutex_lock(acomp_ctx->mutex);
>
> +       zpool = zswap_find_zpool(entry);
> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>         if (!zpool_can_sleep_mapped(zpool)) {
> -               memcpy(tmp, src, entry->length);
> -               src = tmp;
> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> +               src = acomp_ctx->dstmem;

I don't like that we are now using acomp_ctx->dstmem and
acomp_ctx->mutex for purposes other than what the naming suggests.

How about removing these two fields from acomp_ctx, and directly using
zswap_dstmem and zswap_mutex in both the load and store paths, rename
them, and add proper comments above their definitions that they are
for generic percpu buffering on the load and store paths?
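
Something along these lines (the names are just examples, not actual
code):

```
/*
 * Generic percpu scratch buffer (and its lock), used for compression
 * on the store path and decompression on the load/writeback paths.
 */
static DEFINE_PER_CPU(u8 *, zswap_buffer);
static DEFINE_PER_CPU(struct mutex *, zswap_buffer_lock);
```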

* Re: [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry()
  2023-12-13  4:18 ` [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry() Chengming Zhou
@ 2023-12-13 23:27   ` Nhat Pham
  2023-12-14  1:02   ` Yosry Ahmed
  1 sibling, 0 replies; 41+ messages in thread
From: Nhat Pham @ 2023-12-13 23:27 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Chris Li, Johannes Weiner, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yosry Ahmed, linux-kernel, linux-mm

On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Also after the common decompress part goes to __zswap_load(), we can
> cleanup the zswap_reclaim_entry() a little.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/zswap.c | 23 +++++------------------
>  1 file changed, 5 insertions(+), 18 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 0476e1c553c2..9c709368a0e6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1449,7 +1449,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         struct page *page;
>         struct mempolicy *mpol;
>         bool page_was_allocated;
> -       int ret;
>         struct writeback_control wbc = {
>                 .sync_mode = WB_SYNC_NONE,
>         };
> @@ -1458,16 +1457,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         mpol = get_task_policy(current);
>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>                                 NO_INTERLEAVE_INDEX, &page_was_allocated, true);
> -       if (!page) {
> -               ret = -ENOMEM;
> -               goto fail;
> -       }
> +       if (!page)
> +               return -ENOMEM;
>
>         /* Found an existing page, we raced with load/swapin */
>         if (!page_was_allocated) {
>                 put_page(page);
> -               ret = -EEXIST;
> -               goto fail;
> +               return -EEXIST;
>         }
>
>         /*
> @@ -1481,8 +1477,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
>                 spin_unlock(&tree->lock);
>                 delete_from_swap_cache(page_folio(page));
> -               ret = -ENOMEM;
> -               goto fail;
> +               return -ENOMEM;
>         }
>         spin_unlock(&tree->lock);
>
> @@ -1503,15 +1498,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         __swap_writepage(page, &wbc);
>         put_page(page);
>
> -       return ret;
> -
> -fail:
> -       /*
> -        * If we get here because the page is already in swapcache, a
> -        * load may be happening concurrently. It is safe and okay to
> -        * not free the entry. It is also okay to return !0.
> -        */
> -       return ret;
> +       return 0;
>  }
>
>  static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
>
> --
> b4 0.10.1

LGTM. The fail label was primarily to free the temporary memory. Since
we're re-using dstmem, this isn't really needed anymore. Nice cleanup.
So, FWIW:
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

* Re: [PATCH 2/5] mm/zswap: change dstmem size to one page
  2023-12-13  4:17 ` [PATCH 2/5] mm/zswap: change dstmem size to one page Chengming Zhou
@ 2023-12-13 23:34   ` Yosry Ahmed
  2023-12-14  0:18     ` Nhat Pham
  0 siblings, 1 reply; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-13 23:34 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> we only need at most one page when compress, and the "dlen" is also
> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> we don't wanna store the output in zswap anyway.
>
> So change it to one page, and delete the stale comment.

I couldn't find the history of why we needed 2 * PAGE_SIZE; it would
be nice if someone had the context, perhaps one of the maintainers.

One potential reason is that we used to store a zswap header
containing the swap entry in the compressed page for writeback
purposes, but we don't do that anymore. Maybe we wanted to be able to
handle the case where an incompressible page would exceed PAGE_SIZE
because of that?

* Re: [PATCH 3/5] mm/zswap: refactor out __zswap_load()
  2023-12-13  4:18 ` [PATCH 3/5] mm/zswap: refactor out __zswap_load() Chengming Zhou
@ 2023-12-13 23:37   ` Yosry Ahmed
  2023-12-14  0:52   ` Yosry Ahmed
  1 sibling, 0 replies; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-13 23:37 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> The zswap_load() and zswap_writeback_entry() have the same part that
> decompress the data from zswap_entry to page, so refactor out the
> common part as __zswap_load(entry, page).
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>

Great cleanup,
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

* Re: [PATCH 2/5] mm/zswap: change dstmem size to one page
  2023-12-13 23:34   ` Yosry Ahmed
@ 2023-12-14  0:18     ` Nhat Pham
  2023-12-14 13:33       ` Chengming Zhou
  0 siblings, 1 reply; 41+ messages in thread
From: Nhat Pham @ 2023-12-14  0:18 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chengming Zhou, Andrew Morton, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > Change the dstmem size from 2 * PAGE_SIZE to only one page since
> > we only need at most one page when compress, and the "dlen" is also
> > PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> > we don't wanna store the output in zswap anyway.
> >
> > So change it to one page, and delete the stale comment.
>
> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
> be nice if someone has the context, perhaps one of the maintainers.

It'd be very nice indeed.

>
> One potential reason is that we used to store a zswap header
> containing the swap entry in the compressed page for writeback
> purposes, but we don't do that anymore. Maybe we wanted to be able to
> handle the case where an incompressible page would exceed PAGE_SIZE
> because of that?

It could be, hmm. I didn't study the old zswap architecture too much,
but it has been 2 * PAGE_SIZE since the time zswap was first merged,
last I checked.
I'm not 100% comfortable ACK-ing the undoing of something that looks
so intentional, but FTR, AFAICT, this looks correct to me.

* Re: [PATCH 3/5] mm/zswap: refactor out __zswap_load()
  2023-12-13  4:18 ` [PATCH 3/5] mm/zswap: refactor out __zswap_load() Chengming Zhou
  2023-12-13 23:37   ` Yosry Ahmed
@ 2023-12-14  0:52   ` Yosry Ahmed
  2023-12-14 14:45     ` Chengming Zhou
  2023-12-18  8:15     ` Chengming Zhou
  1 sibling, 2 replies; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-14  0:52 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> The zswap_load() and zswap_writeback_entry() have the same part that
> decompress the data from zswap_entry to page, so refactor out the
> common part as __zswap_load(entry, page).
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>

On a second look, there are a few nits here.

First, I think it makes more sense to move this refactoring ahead of
reusing dstmem. Right now, we add the dstmem reuse to zswap_load()
only, then we do the refactoring and zswap_writeback_entry() gets it
automatically, so there is a slight change to zswap_writeback_entry()
hidden in the refactoring patch.

Let's refactor out __zswap_load() first, then reuse dstmem in it.

> ---
>  mm/zswap.c | 107 ++++++++++++++++++++++---------------------------------------
>  1 file changed, 38 insertions(+), 69 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index fa186945010d..2f095c919a5c 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1392,6 +1392,41 @@ static int zswap_enabled_param_set(const char *val,
>         return ret;
>  }
>
> +static void __zswap_load(struct zswap_entry *entry, struct page *page)
> +{
> +       struct scatterlist input, output;
> +       unsigned int dlen = PAGE_SIZE;
> +       struct crypto_acomp_ctx *acomp_ctx;
> +       struct zpool *zpool;
> +       u8 *src;
> +       int ret;
> +
> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> +       mutex_lock(acomp_ctx->mutex);
> +
> +       zpool = zswap_find_zpool(entry);
> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> +       if (!zpool_can_sleep_mapped(zpool)) {
> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> +               src = acomp_ctx->dstmem;
> +               zpool_unmap_handle(zpool, entry->handle);
> +       }
> +
> +       sg_init_one(&input, src, entry->length);
> +       sg_init_table(&output, 1);
> +       sg_set_page(&output, page, PAGE_SIZE, 0);
> +       acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);

We should pass PAGE_SIZE here directly, BUG_ON(acomp_ctx->req->dlen !=
PAGE_SIZE) below, and remove the dlen variable.

> +       ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);

We should just BUG_ON() here directly and remove the ret variable.
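
i.e. something roughly like this for the tail of __zswap_load()
(untested sketch):

```
	sg_init_one(&input, src, entry->length);
	sg_init_table(&output, 1);
	sg_set_page(&output, page, PAGE_SIZE, 0);
	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
	mutex_unlock(acomp_ctx->mutex);
```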

* Re: [PATCH 4/5] mm/zswap: cleanup zswap_load()
  2023-12-13  4:18 ` [PATCH 4/5] mm/zswap: cleanup zswap_load() Chengming Zhou
@ 2023-12-14  0:56   ` Yosry Ahmed
  0 siblings, 0 replies; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-14  0:56 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> After the common decompress part goes to __zswap_load(), we can cleanup
> the zswap_load() a little.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

LGTM, I think it can be squashed into the patch creating
__zswap_load(), but it doesn't matter much.

Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

> ---
>  mm/zswap.c | 12 ++++--------
>  1 file changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2f095c919a5c..0476e1c553c2 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1765,7 +1765,6 @@ bool zswap_load(struct folio *folio)
>         struct zswap_tree *tree = zswap_trees[type];
>         struct zswap_entry *entry;
>         u8 *dst;
> -       bool ret;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
> @@ -1782,19 +1781,16 @@ bool zswap_load(struct folio *folio)
>                 dst = kmap_local_page(page);
>                 zswap_fill_page(dst, entry->value);
>                 kunmap_local(dst);
> -               ret = true;
> -               goto stats;
> +       } else {
> +               __zswap_load(entry, page);
>         }
>
> -       __zswap_load(entry, page);
> -       ret = true;
> -stats:
>         count_vm_event(ZSWPIN);
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPIN);
>
>         spin_lock(&tree->lock);
> -       if (ret && zswap_exclusive_loads_enabled) {
> +       if (zswap_exclusive_loads_enabled) {
>                 zswap_invalidate_entry(tree, entry);
>                 folio_mark_dirty(folio);
>         } else if (entry->length) {
> @@ -1804,7 +1800,7 @@ bool zswap_load(struct folio *folio)
>         zswap_entry_put(tree, entry);
>         spin_unlock(&tree->lock);
>
> -       return ret;
> +       return true;
>  }
>
>  void zswap_invalidate(int type, pgoff_t offset)
>
> --
> b4 0.10.1

* Re: [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry()
  2023-12-13  4:18 ` [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry() Chengming Zhou
  2023-12-13 23:27   ` Nhat Pham
@ 2023-12-14  1:02   ` Yosry Ahmed
  2023-12-14 22:23     ` Andrew Morton
  1 sibling, 1 reply; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-14  1:02 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Also after the common decompress part goes to __zswap_load(), we can
> cleanup the zswap_reclaim_entry() a little.

I think you mean zswap_writeback_entry(), same for the commit title.

>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/zswap.c | 23 +++++------------------
>  1 file changed, 5 insertions(+), 18 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 0476e1c553c2..9c709368a0e6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1449,7 +1449,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         struct page *page;
>         struct mempolicy *mpol;
>         bool page_was_allocated;
> -       int ret;
>         struct writeback_control wbc = {
>                 .sync_mode = WB_SYNC_NONE,
>         };
> @@ -1458,16 +1457,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         mpol = get_task_policy(current);
>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>                                 NO_INTERLEAVE_INDEX, &page_was_allocated, true);
> -       if (!page) {
> -               ret = -ENOMEM;
> -               goto fail;
> -       }
> +       if (!page)
> +               return -ENOMEM;
>
>         /* Found an existing page, we raced with load/swapin */
>         if (!page_was_allocated) {
>                 put_page(page);
> -               ret = -EEXIST;
> -               goto fail;
> +               return -EEXIST;
>         }
>
>         /*
> @@ -1481,8 +1477,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
>                 spin_unlock(&tree->lock);
>                 delete_from_swap_cache(page_folio(page));
> -               ret = -ENOMEM;
> -               goto fail;
> +               return -ENOMEM;
>         }
>         spin_unlock(&tree->lock);
>
> @@ -1503,15 +1498,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         __swap_writepage(page, &wbc);
>         put_page(page);
>
> -       return ret;
> -
> -fail:
> -       /*
> -        * If we get here because the page is already in swapcache, a
> -        * load may be happening concurrently. It is safe and okay to
> -        * not free the entry. It is also okay to return !0.
> -        */

This comment should be moved above the failure check of
__read_swap_cache_async(), not completely removed. With that:

Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

Feel free to squash this patch into the one creating __zswap_load(), or
leave it as-is.

* Re: [PATCH 1/5] mm/zswap: reuse dstmem when decompress
  2023-12-13 23:24   ` Yosry Ahmed
@ 2023-12-14 13:29     ` Chengming Zhou
  2023-12-14 13:32       ` Yosry Ahmed
  0 siblings, 1 reply; 41+ messages in thread
From: Chengming Zhou @ 2023-12-14 13:29 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On 2023/12/14 07:24, Yosry Ahmed wrote:
> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
>> copy the entry->handle memory to a temporary memory, which is allocated
>> using kmalloc.
>>
>> Obviously we can reuse the per-compressor dstmem to avoid allocating
>> every time, since it's percpu-compressor and protected in mutex.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
>> ---
>>  mm/zswap.c | 29 +++++++++--------------------
>>  1 file changed, 9 insertions(+), 20 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 7ee54a3d8281..edb8b45ed5a1 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -1772,9 +1772,9 @@ bool zswap_load(struct folio *folio)
>>         struct zswap_entry *entry;
>>         struct scatterlist input, output;
>>         struct crypto_acomp_ctx *acomp_ctx;
>> -       u8 *src, *dst, *tmp;
>> +       unsigned int dlen = PAGE_SIZE;
>> +       u8 *src, *dst;
>>         struct zpool *zpool;
>> -       unsigned int dlen;
>>         bool ret;
>>
>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>> @@ -1796,27 +1796,18 @@ bool zswap_load(struct folio *folio)
>>                 goto stats;
>>         }
>>
>> -       zpool = zswap_find_zpool(entry);
>> -       if (!zpool_can_sleep_mapped(zpool)) {
>> -               tmp = kmalloc(entry->length, GFP_KERNEL);
>> -               if (!tmp) {
>> -                       ret = false;
>> -                       goto freeentry;
>> -               }
>> -       }
>> -
>>         /* decompress */
>> -       dlen = PAGE_SIZE;
>> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>> +       mutex_lock(acomp_ctx->mutex);
>>
>> +       zpool = zswap_find_zpool(entry);
>> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>>         if (!zpool_can_sleep_mapped(zpool)) {
>> -               memcpy(tmp, src, entry->length);
>> -               src = tmp;
>> +               memcpy(acomp_ctx->dstmem, src, entry->length);
>> +               src = acomp_ctx->dstmem;
> 
> I don't like that we are now using acomp_ctx->dstmem and
> acomp_ctx->mutex now for purposes other than what the naming suggests.

The "mutex" name is coherent, "dstmem" depends on how we use IMHO.
Change to just "mem"? Or do you have a better name to replace?

> 
> How about removing these two fields from acomp_ctx, and directly using
> zswap_dstmem and zswap_mutex in both the load and store paths, rename
> them, and add proper comments above their definitions that they are
> for generic percpu buffering on the load and store paths?

Yes, they are a percpu buffer and lock, but they are used per acomp_ctx,
and the CPU may change in the middle, so it may be better to keep them.

Thanks!

* Re: [PATCH 1/5] mm/zswap: reuse dstmem when decompress
  2023-12-14 13:29     ` Chengming Zhou
@ 2023-12-14 13:32       ` Yosry Ahmed
  2023-12-14 14:42         ` Chengming Zhou
  0 siblings, 1 reply; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-14 13:32 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Thu, Dec 14, 2023 at 5:29 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/14 07:24, Yosry Ahmed wrote:
> > On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
> >> copy the entry->handle memory to a temporary memory, which is allocated
> >> using kmalloc.
> >>
> >> Obviously we can reuse the per-compressor dstmem to avoid allocating
> >> every time, since it's percpu-compressor and protected in mutex.
> >>
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> >> ---
> >>  mm/zswap.c | 29 +++++++++--------------------
> >>  1 file changed, 9 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/mm/zswap.c b/mm/zswap.c
> >> index 7ee54a3d8281..edb8b45ed5a1 100644
> >> --- a/mm/zswap.c
> >> +++ b/mm/zswap.c
> >> @@ -1772,9 +1772,9 @@ bool zswap_load(struct folio *folio)
> >>         struct zswap_entry *entry;
> >>         struct scatterlist input, output;
> >>         struct crypto_acomp_ctx *acomp_ctx;
> >> -       u8 *src, *dst, *tmp;
> >> +       unsigned int dlen = PAGE_SIZE;
> >> +       u8 *src, *dst;
> >>         struct zpool *zpool;
> >> -       unsigned int dlen;
> >>         bool ret;
> >>
> >>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >> @@ -1796,27 +1796,18 @@ bool zswap_load(struct folio *folio)
> >>                 goto stats;
> >>         }
> >>
> >> -       zpool = zswap_find_zpool(entry);
> >> -       if (!zpool_can_sleep_mapped(zpool)) {
> >> -               tmp = kmalloc(entry->length, GFP_KERNEL);
> >> -               if (!tmp) {
> >> -                       ret = false;
> >> -                       goto freeentry;
> >> -               }
> >> -       }
> >> -
> >>         /* decompress */
> >> -       dlen = PAGE_SIZE;
> >> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> >> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >> +       mutex_lock(acomp_ctx->mutex);
> >>
> >> +       zpool = zswap_find_zpool(entry);
> >> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> >>         if (!zpool_can_sleep_mapped(zpool)) {
> >> -               memcpy(tmp, src, entry->length);
> >> -               src = tmp;
> >> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> >> +               src = acomp_ctx->dstmem;
> >
> > I don't like that we are now using acomp_ctx->dstmem and
> > acomp_ctx->mutex now for purposes other than what the naming suggests.
>
> The "mutex" name is coherent, "dstmem" depends on how we use IMHO.
> Change to just "mem"? Or do you have a better name to replace?
>
> >
> > How about removing these two fields from acomp_ctx, and directly using
> > zswap_dstmem and zswap_mutex in both the load and store paths, rename
> > them, and add proper comments above their definitions that they are
> > for generic percpu buffering on the load and store paths?
>
> Yes, they are percpu memory and lock, but they are used by per acomp_ctx,
> and the cpu maybe changing in the middle, so maybe better to keep them.

I don't mean to remove them completely. Keep them as (for example)
zswap_mem and zswap_mutex global percpu variables, and don't keep
pointers to them in acomp_ctx. Instead of using acomp_ctx->dstmem as we
do today, we directly use the global zswap_mem (same for the mutex).

This makes it clear that the buffers are not owned or exclusively used
by the acomp_ctx. WDYT?

* Re: [PATCH 2/5] mm/zswap: change dstmem size to one page
  2023-12-14  0:18     ` Nhat Pham
@ 2023-12-14 13:33       ` Chengming Zhou
  2023-12-14 13:37         ` Yosry Ahmed
  0 siblings, 1 reply; 41+ messages in thread
From: Chengming Zhou @ 2023-12-14 13:33 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed
  Cc: Andrew Morton, Chris Li, Johannes Weiner, Seth Jennings,
	Dan Streetman, Vitaly Wool, linux-kernel, linux-mm

On 2023/12/14 08:18, Nhat Pham wrote:
> On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
>> <zhouchengming@bytedance.com> wrote:
>>>
>>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
>>> we only need at most one page when compress, and the "dlen" is also
>>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
>>> we don't wanna store the output in zswap anyway.
>>>
>>> So change it to one page, and delete the stale comment.
>>
>> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
>> be nice if someone has the context, perhaps one of the maintainers.
> 
> It'd be very nice indeed.
> 
>>
>> One potential reason is that we used to store a zswap header
>> containing the swap entry in the compressed page for writeback
>> purposes, but we don't do that anymore. Maybe we wanted to be able to
>> handle the case where an incompressible page would exceed PAGE_SIZE
>> because of that?
> 
> It could be hmm. I didn't study the old zswap architecture too much,
> but it has been 2 * PAGE_SIZE since the time zswap was first merged
> last I checked.
> I'm not 100% comfortable ACK-ing the undoing of something that looks
> so intentional, but FTR, AFAICT, this looks correct to me.

Right, there is no history explaining why we needed 2 pages.
But from the current code, obviously only one page is needed, and no
problem was found in the kernel build stress testing.

Thanks!

* Re: [PATCH 2/5] mm/zswap: change dstmem size to one page
  2023-12-14 13:33       ` Chengming Zhou
@ 2023-12-14 13:37         ` Yosry Ahmed
  2023-12-14 13:57           ` Chengming Zhou
  0 siblings, 1 reply; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-14 13:37 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Andrew Morton, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Thu, Dec 14, 2023 at 5:33 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/14 08:18, Nhat Pham wrote:
> > On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>
> >> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> >> <zhouchengming@bytedance.com> wrote:
> >>>
> >>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> >>> we only need at most one page when compress, and the "dlen" is also
> >>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> >>> we don't wanna store the output in zswap anyway.
> >>>
> >>> So change it to one page, and delete the stale comment.
> >>
> >> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
> >> be nice if someone has the context, perhaps one of the maintainers.
> >
> > It'd be very nice indeed.
> >
> >>
> >> One potential reason is that we used to store a zswap header
> >> containing the swap entry in the compressed page for writeback
> >> purposes, but we don't do that anymore. Maybe we wanted to be able to
> >> handle the case where an incompressible page would exceed PAGE_SIZE
> >> because of that?
> >
> > It could be hmm. I didn't study the old zswap architecture too much,
> > but it has been 2 * PAGE_SIZE since the time zswap was first merged
> > last I checked.
> > I'm not 100% comfortable ACK-ing the undoing of something that looks
> > so intentional, but FTR, AFAICT, this looks correct to me.
>
> Right, there is no any history about the reason why we needed 2 pages.
> But obviously only one page is needed from the current code and no any
> problem found in the kernel build stress testing.

Could you try manually stressing the compression with data that
doesn't compress at all (i.e. dlen == PAGE_SIZE)? I want to make sure
that this case is specifically handled. I think using data from
/dev/random will do that but please double check that dlen ==
PAGE_SIZE.

* Re: [PATCH 2/5] mm/zswap: change dstmem size to one page
  2023-12-14 13:37         ` Yosry Ahmed
@ 2023-12-14 13:57           ` Chengming Zhou
  2023-12-14 15:03             ` Chengming Zhou
  2023-12-14 18:30             ` Yosry Ahmed
  0 siblings, 2 replies; 41+ messages in thread
From: Chengming Zhou @ 2023-12-14 13:57 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, Andrew Morton, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On 2023/12/14 21:37, Yosry Ahmed wrote:
> On Thu, Dec 14, 2023 at 5:33 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> On 2023/12/14 08:18, Nhat Pham wrote:
>>> On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>
>>>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
>>>> <zhouchengming@bytedance.com> wrote:
>>>>>
>>>>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
>>>>> we only need at most one page when compress, and the "dlen" is also
>>>>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
>>>>> we don't wanna store the output in zswap anyway.
>>>>>
>>>>> So change it to one page, and delete the stale comment.
>>>>
>>>> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
>>>> be nice if someone has the context, perhaps one of the maintainers.
>>>
>>> It'd be very nice indeed.
>>>
>>>>
>>>> One potential reason is that we used to store a zswap header
>>>> containing the swap entry in the compressed page for writeback
>>>> purposes, but we don't do that anymore. Maybe we wanted to be able to
>>>> handle the case where an incompressible page would exceed PAGE_SIZE
>>>> because of that?
>>>
>>> It could be hmm. I didn't study the old zswap architecture too much,
>>> but it has been 2 * PAGE_SIZE since the time zswap was first merged
>>> last I checked.
>>> I'm not 100% comfortable ACK-ing the undoing of something that looks
>>> so intentional, but FTR, AFAICT, this looks correct to me.
>>
>> Right, there is no any history about the reason why we needed 2 pages.
>> But obviously only one page is needed from the current code and no any
>> problem found in the kernel build stress testing.
> 
> Could you try manually stressing the compression with data that
> doesn't compress at all (i.e. dlen == PAGE_SIZE)? I want to make sure
> that this case is specifically handled. I think using data from
> /dev/random will do that but please double check that dlen ==
> PAGE_SIZE.

I just did the same kernel build testing; indeed there are a few cases
where the output dlen == PAGE_SIZE.

bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'

@[1]: 2
@[0]: 12011430

* Re: [PATCH 1/5] mm/zswap: reuse dstmem when decompress
  2023-12-14 13:32       ` Yosry Ahmed
@ 2023-12-14 14:42         ` Chengming Zhou
  2023-12-14 18:24           ` Yosry Ahmed
  0 siblings, 1 reply; 41+ messages in thread
From: Chengming Zhou @ 2023-12-14 14:42 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On 2023/12/14 21:32, Yosry Ahmed wrote:
> On Thu, Dec 14, 2023 at 5:29 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> On 2023/12/14 07:24, Yosry Ahmed wrote:
>>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
>>> <zhouchengming@bytedance.com> wrote:
>>>>
>>>> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
>>>> copy the entry->handle memory to a temporary memory, which is allocated
>>>> using kmalloc.
>>>>
>>>> Obviously we can reuse the per-compressor dstmem to avoid allocating
>>>> every time, since it's percpu-compressor and protected in mutex.
>>>>
>>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>>>> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
>>>> ---
>>>>  mm/zswap.c | 29 +++++++++--------------------
>>>>  1 file changed, 9 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>> index 7ee54a3d8281..edb8b45ed5a1 100644
>>>> --- a/mm/zswap.c
>>>> +++ b/mm/zswap.c
>>>> @@ -1772,9 +1772,9 @@ bool zswap_load(struct folio *folio)
>>>>         struct zswap_entry *entry;
>>>>         struct scatterlist input, output;
>>>>         struct crypto_acomp_ctx *acomp_ctx;
>>>> -       u8 *src, *dst, *tmp;
>>>> +       unsigned int dlen = PAGE_SIZE;
>>>> +       u8 *src, *dst;
>>>>         struct zpool *zpool;
>>>> -       unsigned int dlen;
>>>>         bool ret;
>>>>
>>>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>>> @@ -1796,27 +1796,18 @@ bool zswap_load(struct folio *folio)
>>>>                 goto stats;
>>>>         }
>>>>
>>>> -       zpool = zswap_find_zpool(entry);
>>>> -       if (!zpool_can_sleep_mapped(zpool)) {
>>>> -               tmp = kmalloc(entry->length, GFP_KERNEL);
>>>> -               if (!tmp) {
>>>> -                       ret = false;
>>>> -                       goto freeentry;
>>>> -               }
>>>> -       }
>>>> -
>>>>         /* decompress */
>>>> -       dlen = PAGE_SIZE;
>>>> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>>>> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>>>> +       mutex_lock(acomp_ctx->mutex);
>>>>
>>>> +       zpool = zswap_find_zpool(entry);
>>>> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>>>>         if (!zpool_can_sleep_mapped(zpool)) {
>>>> -               memcpy(tmp, src, entry->length);
>>>> -               src = tmp;
>>>> +               memcpy(acomp_ctx->dstmem, src, entry->length);
>>>> +               src = acomp_ctx->dstmem;
>>>
>>> I don't like that we are now using acomp_ctx->dstmem and
>>> acomp_ctx->mutex now for purposes other than what the naming suggests.
>>
>> The "mutex" name is coherent, "dstmem" depends on how we use IMHO.
>> Change to just "mem"? Or do you have a better name to replace?
>>
>>>
>>> How about removing these two fields from acomp_ctx, and directly using
>>> zswap_dstmem and zswap_mutex in both the load and store paths, rename
>>> them, and add proper comments above their definitions that they are
>>> for generic percpu buffering on the load and store paths?
>>
>> Yes, they are percpu memory and lock, but they are used by per acomp_ctx,
>> and the cpu maybe changing in the middle, so maybe better to keep them.
> 
> I don't mean to remove completely. Keep them as (for example)
> zswap_mem and zswap_mutex global percpu variables, and not have
> pointers in acomp_ctx to them. Instead of using acomp_ctx->dstmem
> today, we directly use the global zswap_mem (same for the mutex).
> 
> This makes it clear that the buffers are not owned or exclusively used
> by the acomp_ctx. WDYT?

Does this look good to you?

```
int cpu = raw_smp_processor_id();

mutex = per_cpu(zswap_mutex, cpu);
mutex_lock(mutex);

dstmem = per_cpu(zswap_dstmem, cpu);
acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);

/* compress or decompress */
```

Another way I just thought of is to make acomp_ctx own its lock and
buffer, so we could delete the percpu zswap_mutex and zswap_dstmem
instead.
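
Something like this (a rough sketch; the field names are just for
illustration):

```
struct crypto_acomp_ctx {
	struct crypto_acomp *acomp;
	struct acomp_req *req;
	struct crypto_wait wait;
	u8 *buffer;		/* percpu PAGE_SIZE buffer, replaces zswap_dstmem */
	struct mutex mutex;	/* replaces the separate percpu zswap_mutex */
};
```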

Thanks.

* Re: [PATCH 3/5] mm/zswap: refactor out __zswap_load()
  2023-12-14  0:52   ` Yosry Ahmed
@ 2023-12-14 14:45     ` Chengming Zhou
  2023-12-18  8:15     ` Chengming Zhou
  1 sibling, 0 replies; 41+ messages in thread
From: Chengming Zhou @ 2023-12-14 14:45 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On 2023/12/14 08:52, Yosry Ahmed wrote:
> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> The zswap_load() and zswap_writeback_entry() have the same part that
>> decompress the data from zswap_entry to page, so refactor out the
>> common part as __zswap_load(entry, page).
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> 
> On a second look, there a few nits here.
> 
> First I think it makes more sense to move this refactoring ahead of
> reusing destmem. Right now, we add the destmem reuse to zswap_load()
> only, then we do the refactor and zswap_writeback_entry() gets it
> automatically, so there is a slight change coming to
> zswap_writeback_entry() hidden in the refactoring patch.
> 
> Let's refactor out __zswap_load() first, then reuse destmem in it.
> 

Ok, will put it first.

>> ---
>>  mm/zswap.c | 107 ++++++++++++++++++++++---------------------------------------
>>  1 file changed, 38 insertions(+), 69 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index fa186945010d..2f095c919a5c 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -1392,6 +1392,41 @@ static int zswap_enabled_param_set(const char *val,
>>         return ret;
>>  }
>>
>> +static void __zswap_load(struct zswap_entry *entry, struct page *page)
>> +{
>> +       struct scatterlist input, output;
>> +       unsigned int dlen = PAGE_SIZE;
>> +       struct crypto_acomp_ctx *acomp_ctx;
>> +       struct zpool *zpool;
>> +       u8 *src;
>> +       int ret;
>> +
>> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>> +       mutex_lock(acomp_ctx->mutex);
>> +
>> +       zpool = zswap_find_zpool(entry);
>> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>> +       if (!zpool_can_sleep_mapped(zpool)) {
>> +               memcpy(acomp_ctx->dstmem, src, entry->length);
>> +               src = acomp_ctx->dstmem;
>> +               zpool_unmap_handle(zpool, entry->handle);
>> +       }
>> +
>> +       sg_init_one(&input, src, entry->length);
>> +       sg_init_table(&output, 1);
>> +       sg_set_page(&output, page, PAGE_SIZE, 0);
>> +       acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> 
> We should pass PAGE_SIZE here directly, BUG_ON(acomp_ctx->req->dlen)
> below, and remove the dlen variable.
> 
>> +       ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> 
> We should just BUG_ON() here directly an remove the ret variable.

Ok, thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/5] mm/zswap: change dstmem size to one page
  2023-12-14 13:57           ` Chengming Zhou
@ 2023-12-14 15:03             ` Chengming Zhou
  2023-12-14 18:34               ` Yosry Ahmed
  2023-12-14 18:30             ` Yosry Ahmed
  1 sibling, 1 reply; 41+ messages in thread
From: Chengming Zhou @ 2023-12-14 15:03 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed
  Cc: Andrew Morton, Chris Li, Johannes Weiner, Seth Jennings,
	Dan Streetman, Vitaly Wool, linux-kernel, linux-mm

On 2023/12/14 21:57, Chengming Zhou wrote:
> On 2023/12/14 21:37, Yosry Ahmed wrote:
>> On Thu, Dec 14, 2023 at 5:33 AM Chengming Zhou
>> <zhouchengming@bytedance.com> wrote:
>>>
>>> On 2023/12/14 08:18, Nhat Pham wrote:
>>>> On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>>
>>>>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
>>>>> <zhouchengming@bytedance.com> wrote:
>>>>>>
>>>>>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
>>>>>> we only need at most one page when compress, and the "dlen" is also
>>>>>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
>>>>>> we don't wanna store the output in zswap anyway.
>>>>>>
>>>>>> So change it to one page, and delete the stale comment.
>>>>>
>>>>> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
>>>>> be nice if someone has the context, perhaps one of the maintainers.
>>>>
>>>> It'd be very nice indeed.
>>>>
>>>>>
>>>>> One potential reason is that we used to store a zswap header
>>>>> containing the swap entry in the compressed page for writeback
>>>>> purposes, but we don't do that anymore. Maybe we wanted to be able to
>>>>> handle the case where an incompressible page would exceed PAGE_SIZE
>>>>> because of that?
>>>>
>>>> It could be hmm. I didn't study the old zswap architecture too much,
>>>> but it has been 2 * PAGE_SIZE since the time zswap was first merged
>>>> last I checked.
>>>> I'm not 100% comfortable ACK-ing the undoing of something that looks
>>>> so intentional, but FTR, AFAICT, this looks correct to me.
>>>
>>> Right, there is no any history about the reason why we needed 2 pages.
>>> But obviously only one page is needed from the current code and no any
>>> problem found in the kernel build stress testing.
>>
>> Could you try manually stressing the compression with data that
>> doesn't compress at all (i.e. dlen == PAGE_SIZE)? I want to make sure
>> that this case is specifically handled. I think using data from
>> /dev/random will do that but please double check that dlen ==
>> PAGE_SIZE.
> 
> I just did the same kernel build testing, indeed there are a few cases
> that output dlen == PAGE_SIZE.
> 
> bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
> 
> @[1]: 2
> @[0]: 12011430

I think we shouldn't put such poorly compressed output into zswap;
maybe it's better to return early in these cases when the compression
ratio is below a threshold that can be tuned by the user?

e.g. in the same kernel build testing:

bpftrace -e 'k:zpool_malloc {@[(uint32)arg1>2048]=count()}'

@[1]: 1597706
@[0]: 10886138
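
A rough sketch of the early-return idea above, just to be concrete; the
threshold knob and its default here are hypothetical, not an existing
zswap parameter:

```
/* hypothetical tunable: reject pages still larger than 90% of PAGE_SIZE after compression */
static unsigned int zswap_reject_ratio = 90;

	/* in zswap_store(), right after compression */
	dlen = acomp_ctx->req->dlen;
	if (dlen * 100 > PAGE_SIZE * zswap_reject_ratio) {
		zswap_reject_compress_poor++;
		goto put_dstmem;
	}
```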


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/5] mm/zswap: reuse dstmem when decompress
  2023-12-13  4:17 ` [PATCH 1/5] mm/zswap: reuse dstmem when decompress Chengming Zhou
  2023-12-13 23:24   ` Yosry Ahmed
@ 2023-12-14 17:59   ` Chris Li
  2023-12-14 18:26     ` Yosry Ahmed
  2023-12-14 20:33     ` Nhat Pham
  1 sibling, 2 replies; 41+ messages in thread
From: Chris Li @ 2023-12-14 17:59 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Nhat Pham, Johannes Weiner, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yosry Ahmed, linux-kernel, linux-mm

On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
> copy the entry->handle memory to a temporary memory, which is allocated
> using kmalloc.
>
> Obviously we can reuse the per-compressor dstmem to avoid allocating
> every time, since it's percpu-compressor and protected in mutex.

You are trading more memory for faster speed.
Per-cpu data structures do not come free. They are expensive in terms
of memory on a big server with a lot of CPUs; think more than a few
hundred CPUs. On such big servers, we might want to disable this
optimization to save a few MB of RAM, depending on the gain it brings.
Do we have any benchmark suggesting how much CPU overhead or latency
this per-CPU page buys us, compared to using kmalloc?

Chris

>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  mm/zswap.c | 29 +++++++++--------------------
>  1 file changed, 9 insertions(+), 20 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 7ee54a3d8281..edb8b45ed5a1 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1772,9 +1772,9 @@ bool zswap_load(struct folio *folio)
>         struct zswap_entry *entry;
>         struct scatterlist input, output;
>         struct crypto_acomp_ctx *acomp_ctx;
> -       u8 *src, *dst, *tmp;
> +       unsigned int dlen = PAGE_SIZE;
> +       u8 *src, *dst;
>         struct zpool *zpool;
> -       unsigned int dlen;
>         bool ret;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> @@ -1796,27 +1796,18 @@ bool zswap_load(struct folio *folio)
>                 goto stats;
>         }
>
> -       zpool = zswap_find_zpool(entry);
> -       if (!zpool_can_sleep_mapped(zpool)) {
> -               tmp = kmalloc(entry->length, GFP_KERNEL);
> -               if (!tmp) {
> -                       ret = false;
> -                       goto freeentry;
> -               }
> -       }
> -
>         /* decompress */
> -       dlen = PAGE_SIZE;
> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> +       mutex_lock(acomp_ctx->mutex);
>
> +       zpool = zswap_find_zpool(entry);
> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>         if (!zpool_can_sleep_mapped(zpool)) {
> -               memcpy(tmp, src, entry->length);
> -               src = tmp;
> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> +               src = acomp_ctx->dstmem;
>                 zpool_unmap_handle(zpool, entry->handle);
>         }
>
> -       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> -       mutex_lock(acomp_ctx->mutex);
>         sg_init_one(&input, src, entry->length);
>         sg_init_table(&output, 1);
>         sg_set_page(&output, page, PAGE_SIZE, 0);
> @@ -1827,15 +1818,13 @@ bool zswap_load(struct folio *folio)
>
>         if (zpool_can_sleep_mapped(zpool))
>                 zpool_unmap_handle(zpool, entry->handle);
> -       else
> -               kfree(tmp);
>
>         ret = true;
>  stats:
>         count_vm_event(ZSWPIN);
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPIN);
> -freeentry:
> +
>         spin_lock(&tree->lock);
>         if (ret && zswap_exclusive_loads_enabled) {
>                 zswap_invalidate_entry(tree, entry);
>
> --
> b4 0.10.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/5] mm/zswap: reuse dstmem when decompress
  2023-12-14 14:42         ` Chengming Zhou
@ 2023-12-14 18:24           ` Yosry Ahmed
  2023-12-18  8:06             ` Chengming Zhou
  0 siblings, 1 reply; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-14 18:24 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Thu, Dec 14, 2023 at 6:42 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
[..]
> >>>> -
> >>>>         /* decompress */
> >>>> -       dlen = PAGE_SIZE;
> >>>> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> >>>> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >>>> +       mutex_lock(acomp_ctx->mutex);
> >>>>
> >>>> +       zpool = zswap_find_zpool(entry);
> >>>> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> >>>>         if (!zpool_can_sleep_mapped(zpool)) {
> >>>> -               memcpy(tmp, src, entry->length);
> >>>> -               src = tmp;
> >>>> +               memcpy(acomp_ctx->dstmem, src, entry->length);
> >>>> +               src = acomp_ctx->dstmem;
> >>>
> >>> I don't like that we are now using acomp_ctx->dstmem and
> >>> acomp_ctx->mutex now for purposes other than what the naming suggests.
> >>
> >> The "mutex" name is coherent, "dstmem" depends on how we use IMHO.
> >> Change to just "mem"? Or do you have a better name to replace?
> >>
> >>>
> >>> How about removing these two fields from acomp_ctx, and directly using
> >>> zswap_dstmem and zswap_mutex in both the load and store paths, rename
> >>> them, and add proper comments above their definitions that they are
> >>> for generic percpu buffering on the load and store paths?
> >>
> >> Yes, they are percpu memory and lock, but they are used by per acomp_ctx,
> >> and the cpu maybe changing in the middle, so maybe better to keep them.
> >
> > I don't mean to remove completely. Keep them as (for example)
> > zswap_mem and zswap_mutex global percpu variables, and not have
> > pointers in acomp_ctx to them. Instead of using acomp_ctx->dstmem
> > today, we directly use the global zswap_mem (same for the mutex).
> >
> > This makes it clear that the buffers are not owned or exclusively used
> > by the acomp_ctx. WDYT?
>
> Does this look good to you?
>
> ```
> int cpu = raw_smp_processor_id();
>
> mutex = per_cpu(zswap_mutex, cpu);
> mutex_lock(mutex);
>
> dstmem = per_cpu(zswap_dstmem, cpu);

Renaming to zswap_buffer or zswap_mem would be better I think, but
yeah, what I had in mind is having zswap_mutex and
zswap_[dstmem/mem/buffer] be generic percpu resources that are used by
the store and load paths for different purposes, not directly linked to
acomp_ctx.

> acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>
> /* compress or decompress */
> ```
>
> Another way I just think of is to make acomp_ctx own its lock and buffer,
> and we could delete these percpu zswap_mutex and zswap_dstmem instead.

You mean having two separate sets of percpu buffers for the zswap load &
store paths? This is probably unnecessary.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/5] mm/zswap: reuse dstmem when decompress
  2023-12-14 17:59   ` Chris Li
@ 2023-12-14 18:26     ` Yosry Ahmed
  2023-12-14 22:02       ` Chris Li
  2023-12-14 20:33     ` Nhat Pham
  1 sibling, 1 reply; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-14 18:26 UTC (permalink / raw)
  To: Chris Li
  Cc: Chengming Zhou, Andrew Morton, Nhat Pham, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Thu, Dec 14, 2023 at 9:59 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
> > copy the entry->handle memory to a temporary memory, which is allocated
> > using kmalloc.
> >
> > Obviously we can reuse the per-compressor dstmem to avoid allocating
> > every time, since it's percpu-compressor and protected in mutex.
>
> You are trading more memory for faster speed.
> Per-cpu data structure does not come free. It is expensive in terms of
> memory on a big server with a lot of CPU. Think more than a few
> hundred CPU. On the big servers, we might want to disable this
> optimization to save a few MB RAM, depending on the gain of this
> optimization.
> Do we have any benchmark suggesting how much CPU overhead or latency
> this per-CPU page buys us, compared to using kmalloc?

IIUC we are not creating any new percpu data structures here. We are
reusing the existing percpu buffers that the store path compresses
into. Now we also use them in the load path when we need a temporary
buffer to decompress into because the zpool backend does not support
sleeping while the memory is mapped.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/5] mm/zswap: change dstmem size to one page
  2023-12-14 13:57           ` Chengming Zhou
  2023-12-14 15:03             ` Chengming Zhou
@ 2023-12-14 18:30             ` Yosry Ahmed
  2023-12-14 20:29               ` Nhat Pham
  1 sibling, 1 reply; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-14 18:30 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Andrew Morton, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Thu, Dec 14, 2023 at 5:57 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/14 21:37, Yosry Ahmed wrote:
> > On Thu, Dec 14, 2023 at 5:33 AM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> On 2023/12/14 08:18, Nhat Pham wrote:
> >>> On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>>>
> >>>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> >>>> <zhouchengming@bytedance.com> wrote:
> >>>>>
> >>>>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> >>>>> we only need at most one page when compress, and the "dlen" is also
> >>>>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> >>>>> we don't wanna store the output in zswap anyway.
> >>>>>
> >>>>> So change it to one page, and delete the stale comment.
> >>>>
> >>>> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
> >>>> be nice if someone has the context, perhaps one of the maintainers.
> >>>
> >>> It'd be very nice indeed.
> >>>
> >>>>
> >>>> One potential reason is that we used to store a zswap header
> >>>> containing the swap entry in the compressed page for writeback
> >>>> purposes, but we don't do that anymore. Maybe we wanted to be able to
> >>>> handle the case where an incompressible page would exceed PAGE_SIZE
> >>>> because of that?
> >>>
> >>> It could be hmm. I didn't study the old zswap architecture too much,
> >>> but it has been 2 * PAGE_SIZE since the time zswap was first merged
> >>> last I checked.
> >>> I'm not 100% comfortable ACK-ing the undoing of something that looks
> >>> so intentional, but FTR, AFAICT, this looks correct to me.
> >>
> >> Right, there is no any history about the reason why we needed 2 pages.
> >> But obviously only one page is needed from the current code and no any
> >> problem found in the kernel build stress testing.
> >
> > Could you try manually stressing the compression with data that
> > doesn't compress at all (i.e. dlen == PAGE_SIZE)? I want to make sure
> > that this case is specifically handled. I think using data from
> > /dev/random will do that but please double check that dlen ==
> > PAGE_SIZE.
>
> I just did the same kernel build testing, indeed there are a few cases
> that output dlen == PAGE_SIZE.
>
> bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
>
> @[1]: 2
> @[0]: 12011430

That's very useful information, thanks for testing that. Please
include this in the commit log. Please also include the fact that we
used to store a zswap header with the compressed page but don't do
that anymore, which *may* be the reason why this was needed back then.

I still want someone who knows the history to Ack this, but FWIW it
looks correct to me, so low-key:
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/5] mm/zswap: change dstmem size to one page
  2023-12-14 15:03             ` Chengming Zhou
@ 2023-12-14 18:34               ` Yosry Ahmed
  0 siblings, 0 replies; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-14 18:34 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Andrew Morton, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

[..]
>
> I think we shouldn't put these poorly compressed output into zswap,
> maybe it's better to early return in these cases when compress ratio
> < threshold ratio, which can be tune by the user?

We have something similar at Google, but because we use zswap without
a backing swapfile, we make those pages unevictable. For the upstream
code, the pages will go to a backing swapfile, which arguably violates
the LRU ordering, but may be the correct thing to do. There was a
recent upstream attempt to solidify storing those incompressible pages
in zswap in their uncompressed form to retain the LRU ordering.

If you want, feel free to start a discussion about this separately,
it's out of context for this patch series.

Thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/5] mm/zswap: change dstmem size to one page
  2023-12-14 18:30             ` Yosry Ahmed
@ 2023-12-14 20:29               ` Nhat Pham
  0 siblings, 0 replies; 41+ messages in thread
From: Nhat Pham @ 2023-12-14 20:29 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chengming Zhou, Andrew Morton, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Thu, Dec 14, 2023 at 10:30 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Dec 14, 2023 at 5:57 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > On 2023/12/14 21:37, Yosry Ahmed wrote:
> > > On Thu, Dec 14, 2023 at 5:33 AM Chengming Zhou
> > > <zhouchengming@bytedance.com> wrote:
> > >>
> > >> On 2023/12/14 08:18, Nhat Pham wrote:
> > >>> On Wed, Dec 13, 2023 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >>>>
> > >>>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> > >>>> <zhouchengming@bytedance.com> wrote:
> > >>>>>
> > >>>>> Change the dstmem size from 2 * PAGE_SIZE to only one page since
> > >>>>> we only need at most one page when compress, and the "dlen" is also
> > >>>>> PAGE_SIZE in acomp_request_set_params(). If the output size > PAGE_SIZE
> > >>>>> we don't wanna store the output in zswap anyway.
> > >>>>>
> > >>>>> So change it to one page, and delete the stale comment.
> > >>>>
> > >>>> I couldn't find the history of why we needed 2 * PAGE_SIZE, it would
> > >>>> be nice if someone has the context, perhaps one of the maintainers.
> > >>>
> > >>> It'd be very nice indeed.
> > >>>
> > >>>>
> > >>>> One potential reason is that we used to store a zswap header
> > >>>> containing the swap entry in the compressed page for writeback
> > >>>> purposes, but we don't do that anymore. Maybe we wanted to be able to
> > >>>> handle the case where an incompressible page would exceed PAGE_SIZE
> > >>>> because of that?
> > >>>
> > >>> It could be hmm. I didn't study the old zswap architecture too much,
> > >>> but it has been 2 * PAGE_SIZE since the time zswap was first merged
> > >>> last I checked.
> > >>> I'm not 100% comfortable ACK-ing the undoing of something that looks
> > >>> so intentional, but FTR, AFAICT, this looks correct to me.
> > >>
> > >> Right, there is no any history about the reason why we needed 2 pages.
> > >> But obviously only one page is needed from the current code and no any
> > >> problem found in the kernel build stress testing.
> > >
> > > Could you try manually stressing the compression with data that
> > > doesn't compress at all (i.e. dlen == PAGE_SIZE)? I want to make sure
> > > that this case is specifically handled. I think using data from
> > > /dev/random will do that but please double check that dlen ==
> > > PAGE_SIZE.

FWIW, zsmalloc supports storing objects that are PAGE_SIZE in
length, so a use case is probably there (although it could be for
ZRAM). We tested it during the storing-uncompressed-pages patch.
Architecturally, it seems that zswap just lets the backend allocator
handle the rejection of compressed objects that are too large, and the
compressor reject pages that compress too poorly.
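
For reference, a boiled-down sketch of the store-side flow being
described (heavily simplified, not the literal mm/zswap.c code):

```
	ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
	dlen = acomp_ctx->req->dlen;
	if (ret)
		goto reject;		/* compressor refused the page */

	ret = zpool_malloc(zpool, dlen, gfp, &handle);
	if (ret == -ENOSPC)
		goto reject;		/* allocator refused an object that is too large */
```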

> >
> > I just did the same kernel build testing, indeed there are a few cases
> > that output dlen == PAGE_SIZE.
> >
> > bpftrace -e 'k:zpool_malloc {@[(uint32)arg1==4096]=count()}'
> >
> > @[1]: 2
> > @[0]: 12011430
>
> That's very useful information, thanks for testing that. Please
> include this in the commit log. Please also include the fact that we
> used to store a zswap header with the compressed page but don't do
> that anymore, which *may* be the reason why this was needed back then.
>
> I still want someone who knows the history to Ack this, but FWIW it
> looks correct to me, so low-key:
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

Anyway:
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/5] mm/zswap: reuse dstmem when decompress
  2023-12-14 17:59   ` Chris Li
  2023-12-14 18:26     ` Yosry Ahmed
@ 2023-12-14 20:33     ` Nhat Pham
  1 sibling, 0 replies; 41+ messages in thread
From: Nhat Pham @ 2023-12-14 20:33 UTC (permalink / raw)
  To: Chris Li
  Cc: Chengming Zhou, Andrew Morton, Johannes Weiner, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yosry Ahmed, linux-kernel, linux-mm

On Thu, Dec 14, 2023 at 9:59 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
> > copy the entry->handle memory to a temporary memory, which is allocated
> > using kmalloc.
> >
> > Obviously we can reuse the per-compressor dstmem to avoid allocating
> > every time, since it's percpu-compressor and protected in mutex.
>
> You are trading more memory for faster speed.
> Per-cpu data structure does not come free. It is expensive in terms of
> memory on a big server with a lot of CPU. Think more than a few
> hundred CPU. On the big servers, we might want to disable this
> optimization to save a few MB RAM, depending on the gain of this
> optimization.
> Do we have any benchmark suggesting how much CPU overhead or latency
> this per-CPU page buys us, compared to using kmalloc?

I think Chengming is re-using an existing per-CPU buffer for this
purpose. IIUC, it was previously only used for compression
(zswap_store) - Chengming is leveraging it for decompression (load and
writeback) too with this patch. This sounds fine to me tbh, because
both directions have to hold the mutex anyway, so that buffer is
locked out - might as well use it.

We're doing a bit more work in the mutex section (memcpy and handle
(un)mapping) - but seems fine to me tbh.

>
> Chris
>
> >
> > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> > Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> > ---
> >  mm/zswap.c | 29 +++++++++--------------------
> >  1 file changed, 9 insertions(+), 20 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 7ee54a3d8281..edb8b45ed5a1 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1772,9 +1772,9 @@ bool zswap_load(struct folio *folio)
> >         struct zswap_entry *entry;
> >         struct scatterlist input, output;
> >         struct crypto_acomp_ctx *acomp_ctx;
> > -       u8 *src, *dst, *tmp;
> > +       unsigned int dlen = PAGE_SIZE;
> > +       u8 *src, *dst;
> >         struct zpool *zpool;
> > -       unsigned int dlen;
> >         bool ret;
> >
> >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > @@ -1796,27 +1796,18 @@ bool zswap_load(struct folio *folio)
> >                 goto stats;
> >         }
> >
> > -       zpool = zswap_find_zpool(entry);
> > -       if (!zpool_can_sleep_mapped(zpool)) {
> > -               tmp = kmalloc(entry->length, GFP_KERNEL);
> > -               if (!tmp) {
> > -                       ret = false;
> > -                       goto freeentry;
> > -               }
> > -       }
> > -
> >         /* decompress */
> > -       dlen = PAGE_SIZE;
> > -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> > +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > +       mutex_lock(acomp_ctx->mutex);
> >
> > +       zpool = zswap_find_zpool(entry);
> > +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> >         if (!zpool_can_sleep_mapped(zpool)) {
> > -               memcpy(tmp, src, entry->length);
> > -               src = tmp;
> > +               memcpy(acomp_ctx->dstmem, src, entry->length);
> > +               src = acomp_ctx->dstmem;
> >                 zpool_unmap_handle(zpool, entry->handle);
> >         }
> >
> > -       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > -       mutex_lock(acomp_ctx->mutex);
> >         sg_init_one(&input, src, entry->length);
> >         sg_init_table(&output, 1);
> >         sg_set_page(&output, page, PAGE_SIZE, 0);
> > @@ -1827,15 +1818,13 @@ bool zswap_load(struct folio *folio)
> >
> >         if (zpool_can_sleep_mapped(zpool))
> >                 zpool_unmap_handle(zpool, entry->handle);
> > -       else
> > -               kfree(tmp);
> >
> >         ret = true;
> >  stats:
> >         count_vm_event(ZSWPIN);
> >         if (entry->objcg)
> >                 count_objcg_event(entry->objcg, ZSWPIN);
> > -freeentry:
> > +
> >         spin_lock(&tree->lock);
> >         if (ret && zswap_exclusive_loads_enabled) {
> >                 zswap_invalidate_entry(tree, entry);
> >
> > --
> > b4 0.10.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/5] mm/zswap: reuse dstmem when decompress
  2023-12-14 18:26     ` Yosry Ahmed
@ 2023-12-14 22:02       ` Chris Li
  0 siblings, 0 replies; 41+ messages in thread
From: Chris Li @ 2023-12-14 22:02 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chengming Zhou, Andrew Morton, Nhat Pham, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

Hi Yosry,

On Thu, Dec 14, 2023 at 10:27 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Dec 14, 2023 at 9:59 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> > >
> > > In the !zpool_can_sleep_mapped() case such as zsmalloc, we need to first
> > > copy the entry->handle memory to a temporary memory, which is allocated
> > > using kmalloc.
> > >
> > > Obviously we can reuse the per-compressor dstmem to avoid allocating
> > > every time, since it's percpu-compressor and protected in mutex.
> >
> > You are trading more memory for faster speed.
> > Per-cpu data structure does not come free. It is expensive in terms of
> > memory on a big server with a lot of CPU. Think more than a few
> > hundred CPU. On the big servers, we might want to disable this
> > optimization to save a few MB RAM, depending on the gain of this
> > optimization.
> > Do we have any benchmark suggesting how much CPU overhead or latency
> > this per-CPU page buys us, compared to using kmalloc?
>
> IIUC we are not creating any new percpu data structures here. We are
> reusing existing percpu buffers used in the store path to compress
> into. Now we also use them in the load path if we need a temporary
> buffer to decompress into if the zpool backend does not support
> sleeping while the memory is mapped.

That sounds like a pure win then. Thanks for explaining it.

Hi Nhat,

> I think Chengming is re-using an existing per-CPU buffer for this
> purpose. IIUC, it was previously only used for compression
> (zswap_store) - Chengming is leveraging it for decompression (load and
> writeback) too with this patch. This sounds fine to me tbh, because
> both directions have to hold the mutex anyway, so that buffer is
> locked out - might as well use it.

Agree.

Acked-by: Chris Li <chrisl@kernel.org>

>
> We're doing a bit more work in the mutex section (memcpy and handle
> (un)mapping) - but seems fine to me tbh.

Thanks for the heads up.

Chris

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry()
  2023-12-14  1:02   ` Yosry Ahmed
@ 2023-12-14 22:23     ` Andrew Morton
  2023-12-14 22:41       ` Yosry Ahmed
  0 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2023-12-14 22:23 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chengming Zhou, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Wed, 13 Dec 2023 17:02:25 -0800 Yosry Ahmed <yosryahmed@google.com> wrote:

> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > Also after the common decompress part goes to __zswap_load(), we can
> > cleanup the zswap_reclaim_entry() a little.
> 
> I think you mean zswap_writeback_entry(), same for the commit title.

I updated my copy of the changelog, thanks.

> > -       /*
> > -        * If we get here because the page is already in swapcache, a
> > -        * load may be happening concurrently. It is safe and okay to
> > -        * not free the entry. It is also okay to return !0.
> > -        */
> 
> This comment should be moved above the failure check of
> __read_swap_cache_async() above, not completely removed.

This?

--- a/mm/zswap.c~mm-zswap-cleanup-zswap_reclaim_entry-fix
+++ a/mm/zswap.c
@@ -1457,8 +1457,14 @@ static int zswap_writeback_entry(struct
 	mpol = get_task_policy(current);
 	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
 				NO_INTERLEAVE_INDEX, &page_was_allocated, true);
-	if (!page)
+	if (!page) {
+		/*
+		 * If we get here because the page is already in swapcache, a
+		 * load may be happening concurrently. It is safe and okay to
+		 * not free the entry. It is also okay to return !0.
+		 */
 		return -ENOMEM;
+	}
 
 	/* Found an existing page, we raced with load/swapin */
 	if (!page_was_allocated) {


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry()
  2023-12-14 22:23     ` Andrew Morton
@ 2023-12-14 22:41       ` Yosry Ahmed
  2023-12-18 14:03         ` Johannes Weiner
  0 siblings, 1 reply; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-14 22:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chengming Zhou, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Thu, Dec 14, 2023 at 2:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 13 Dec 2023 17:02:25 -0800 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> > >
> > > Also after the common decompress part goes to __zswap_load(), we can
> > > cleanup the zswap_reclaim_entry() a little.
> >
> > I think you mean zswap_writeback_entry(), same for the commit title.
>
> I updated my copy of the changelog, thanks.
>
> > > -       /*
> > > -        * If we get here because the page is already in swapcache, a
> > > -        * load may be happening concurrently. It is safe and okay to
> > > -        * not free the entry. It is also okay to return !0.
> > > -        */
> >
> > This comment should be moved above the failure check of
> > __read_swap_cache_async() above, not completely removed.
>
> This?

Yes, thanks a lot. Although I think a new version is needed anyway to
address other comments.

>
> --- a/mm/zswap.c~mm-zswap-cleanup-zswap_reclaim_entry-fix
> +++ a/mm/zswap.c
> @@ -1457,8 +1457,14 @@ static int zswap_writeback_entry(struct
>         mpol = get_task_policy(current);
>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>                                 NO_INTERLEAVE_INDEX, &page_was_allocated, true);
> -       if (!page)
> +       if (!page) {
> +               /*
> +                * If we get here because the page is already in swapcache, a
> +                * load may be happening concurrently. It is safe and okay to
> +                * not free the entry. It is also okay to return !0.
> +                */
>                 return -ENOMEM;
> +       }
>
>         /* Found an existing page, we raced with load/swapin */
>         if (!page_was_allocated) {
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/5] mm/zswap: reuse dstmem when decompress
  2023-12-14 18:24           ` Yosry Ahmed
@ 2023-12-18  8:06             ` Chengming Zhou
  0 siblings, 0 replies; 41+ messages in thread
From: Chengming Zhou @ 2023-12-18  8:06 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On 2023/12/15 02:24, Yosry Ahmed wrote:
> On Thu, Dec 14, 2023 at 6:42 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> [..]
>>>>>> -
>>>>>>         /* decompress */
>>>>>> -       dlen = PAGE_SIZE;
>>>>>> -       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>>>>>> +       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>>>>>> +       mutex_lock(acomp_ctx->mutex);
>>>>>>
>>>>>> +       zpool = zswap_find_zpool(entry);
>>>>>> +       src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
>>>>>>         if (!zpool_can_sleep_mapped(zpool)) {
>>>>>> -               memcpy(tmp, src, entry->length);
>>>>>> -               src = tmp;
>>>>>> +               memcpy(acomp_ctx->dstmem, src, entry->length);
>>>>>> +               src = acomp_ctx->dstmem;
>>>>>
>>>>> I don't like that we are now using acomp_ctx->dstmem and
>>>>> acomp_ctx->mutex now for purposes other than what the naming suggests.
>>>>
>>>> The "mutex" name is coherent, "dstmem" depends on how we use IMHO.
>>>> Change to just "mem"? Or do you have a better name to replace?
>>>>
>>>>>
>>>>> How about removing these two fields from acomp_ctx, and directly using
>>>>> zswap_dstmem and zswap_mutex in both the load and store paths, rename
>>>>> them, and add proper comments above their definitions that they are
>>>>> for generic percpu buffering on the load and store paths?
>>>>
>>>> Yes, they are percpu memory and lock, but they are used by per acomp_ctx,
>>>> and the cpu maybe changing in the middle, so maybe better to keep them.
>>>
>>> I don't mean to remove completely. Keep them as (for example)
>>> zswap_mem and zswap_mutex global percpu variables, and not have
>>> pointers in acomp_ctx to them. Instead of using acomp_ctx->dstmem
>>> today, we directly use the global zswap_mem (same for the mutex).
>>>
>>> This makes it clear that the buffers are not owned or exclusively used
>>> by the acomp_ctx. WDYT?
>>
>> Does this look good to you?
>>
>> ```
>> int cpu = raw_smp_processor_id();
>>
>> mutex = per_cpu(zswap_mutex, cpu);
>> mutex_lock(mutex);
>>
>> dstmem = per_cpu(zswap_dstmem, cpu);
> 
> Renaming to zswap_buffer or zswap_mem would be better I think, but
> yeah what I had in mind is having zswap_mutex and
> zswap_[dstmem/mem/buffer] be generic percpu buffers that are used by
> store and load paths for different purposes, not directly linked to
> acomp_ctx.
> 

Ok, I'll append a patch to do the refactor & cleanup: remove mutex
and dstmem from acomp_ctx, rename the percpu buffer to zswap_buffer, and
directly use them on the load/store paths (rough sketch below).
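
Roughly along these lines (sketch only, names not final):

```
/* acomp_ctx keeps only the crypto state */
struct crypto_acomp_ctx {
	struct crypto_acomp *acomp;
	struct acomp_req *req;
	struct crypto_wait wait;
};

/* buffer and lock stay as standalone percpu variables, renamed */
static DEFINE_PER_CPU(u8 *, zswap_buffer);	/* was zswap_dstmem */
static DEFINE_PER_CPU(struct mutex *, zswap_mutex);
```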

>> acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>>
>> /* compress or decompress */
>> ```
>>
>> Another way I just think of is to make acomp_ctx own its lock and buffer,
>> and we could delete these percpu zswap_mutex and zswap_dstmem instead.
> 
> You mean have two separate set of percpu buffers for zswap load &
> stores paths? This is probably unnecessary.

Alright. Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] mm/zswap: refactor out __zswap_load()
  2023-12-14  0:52   ` Yosry Ahmed
  2023-12-14 14:45     ` Chengming Zhou
@ 2023-12-18  8:15     ` Chengming Zhou
  2023-12-18  9:38       ` Yosry Ahmed
  1 sibling, 1 reply; 41+ messages in thread
From: Chengming Zhou @ 2023-12-18  8:15 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On 2023/12/14 08:52, Yosry Ahmed wrote:
> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> The zswap_load() and zswap_writeback_entry() have the same part that
>> decompress the data from zswap_entry to page, so refactor out the
>> common part as __zswap_load(entry, page).
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> 
> On a second look, there a few nits here.
> 
> First I think it makes more sense to move this refactoring ahead of
> reusing destmem. Right now, we add the destmem reuse to zswap_load()
> only, then we do the refactor and zswap_writeback_entry() gets it
> automatically, so there is a slight change coming to
> zswap_writeback_entry() hidden in the refactoring patch.
> 
> Let's refactor out __zswap_load() first, then reuse destmem in it.

I tried, but found that putting __zswap_load() first would introduce
another failure case in zswap_writeback_entry(), since the temporary
memory allocation may fail.

So instead, I also moved the dstmem reuse in zswap_writeback_entry() into
the dstmem reuse patch. Then this patch contains only the refactoring.

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] mm/zswap: refactor out __zswap_load()
  2023-12-18  8:15     ` Chengming Zhou
@ 2023-12-18  9:38       ` Yosry Ahmed
  0 siblings, 0 replies; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-18  9:38 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Andrew Morton, Nhat Pham, Chris Li, Johannes Weiner,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Mon, Dec 18, 2023 at 12:15 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/14 08:52, Yosry Ahmed wrote:
> > On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> The zswap_load() and zswap_writeback_entry() have the same part that
> >> decompress the data from zswap_entry to page, so refactor out the
> >> common part as __zswap_load(entry, page).
> >>
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> >
> > On a second look, there a few nits here.
> >
> > First I think it makes more sense to move this refactoring ahead of
> > reusing destmem. Right now, we add the destmem reuse to zswap_load()
> > only, then we do the refactor and zswap_writeback_entry() gets it
> > automatically, so there is a slight change coming to
> > zswap_writeback_entry() hidden in the refactoring patch.
> >
> > Let's refactor out __zswap_load() first, then reuse destmem in it.
>
> I tried but found that putting the __zswap_load() first would introduce
> another failure case in zswap_writeback_entry(), since the temporary
> memory allocation may fail.
>
> So instead, I also move the dstmem reusing in zswap_writeback_entry() to
> the dstmem reusing patch. Then this patch becomes having only refactoring.

We could have still refactored __zswap_load() first by making it
return an int initially when split, then void later. Anyway, it's not
a big deal. The new series looks fine.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry()
  2023-12-14 22:41       ` Yosry Ahmed
@ 2023-12-18 14:03         ` Johannes Weiner
  2023-12-18 14:39           ` Yosry Ahmed
  0 siblings, 1 reply; 41+ messages in thread
From: Johannes Weiner @ 2023-12-18 14:03 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Chengming Zhou, Nhat Pham, Chris Li,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Thu, Dec 14, 2023 at 02:41:26PM -0800, Yosry Ahmed wrote:
> On Thu, Dec 14, 2023 at 2:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed, 13 Dec 2023 17:02:25 -0800 Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > > On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> > > <zhouchengming@bytedance.com> wrote:
> > > >
> > > > Also after the common decompress part goes to __zswap_load(), we can
> > > > cleanup the zswap_reclaim_entry() a little.
> > >
> > > I think you mean zswap_writeback_entry(), same for the commit title.
> >
> > I updated my copy of the changelog, thanks.
> >
> > > > -       /*
> > > > -        * If we get here because the page is already in swapcache, a
> > > > -        * load may be happening concurrently. It is safe and okay to
> > > > -        * not free the entry. It is also okay to return !0.
> > > > -        */
> > >
> > > This comment should be moved above the failure check of
> > > __read_swap_cache_async() above, not completely removed.
> >
> > This?
> 
> Yes, thanks a lot. Although I think a new version is needed anyway to
> address other comments.
> 
> >
> > --- a/mm/zswap.c~mm-zswap-cleanup-zswap_reclaim_entry-fix
> > +++ a/mm/zswap.c
> > @@ -1457,8 +1457,14 @@ static int zswap_writeback_entry(struct
> >         mpol = get_task_policy(current);
> >         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> >                                 NO_INTERLEAVE_INDEX, &page_was_allocated, true);
> > -       if (!page)
> > +       if (!page) {
> > +               /*
> > +                * If we get here because the page is already in swapcache, a
> > +                * load may be happening concurrently. It is safe and okay to
> > +                * not free the entry. It is also okay to return !0.
> > +                */
> >                 return -ENOMEM;
> > +       }
> >
> >         /* Found an existing page, we raced with load/swapin */
> >         if (!page_was_allocated) {

That's the wrong branch, no?

!page -> -ENOMEM

page && !page_was_allocated -> already in swapcache

Personally, I don't really get the comment. What does it mean that
it's "okay" not to free the entry? There is a put, which may or may
not free the entry if somebody else is using it. Is it explaining how
lifetime works for refcounted objects? I'm similarly confused by the
"it's okay" to return non-zero. What is that trying to convey?

Deletion seemed like the right choice here, IMO ;)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry()
  2023-12-18 14:03         ` Johannes Weiner
@ 2023-12-18 14:39           ` Yosry Ahmed
  2023-12-18 14:58             ` Johannes Weiner
  0 siblings, 1 reply; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-18 14:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Chengming Zhou, Nhat Pham, Chris Li,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Mon, Dec 18, 2023 at 6:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Dec 14, 2023 at 02:41:26PM -0800, Yosry Ahmed wrote:
> > On Thu, Dec 14, 2023 at 2:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Wed, 13 Dec 2023 17:02:25 -0800 Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > > On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> > > > <zhouchengming@bytedance.com> wrote:
> > > > >
> > > > > Also after the common decompress part goes to __zswap_load(), we can
> > > > > cleanup the zswap_reclaim_entry() a little.
> > > >
> > > > I think you mean zswap_writeback_entry(), same for the commit title.
> > >
> > > I updated my copy of the changelog, thanks.
> > >
> > > > > -       /*
> > > > > -        * If we get here because the page is already in swapcache, a
> > > > > -        * load may be happening concurrently. It is safe and okay to
> > > > > -        * not free the entry. It is also okay to return !0.
> > > > > -        */
> > > >
> > > > This comment should be moved above the failure check of
> > > > __read_swap_cache_async() above, not completely removed.
> > >
> > > This?
> >
> > Yes, thanks a lot. Although I think a new version is needed anyway to
> > address other comments.
> >
> > >
> > > --- a/mm/zswap.c~mm-zswap-cleanup-zswap_reclaim_entry-fix
> > > +++ a/mm/zswap.c
> > > @@ -1457,8 +1457,14 @@ static int zswap_writeback_entry(struct
> > >         mpol = get_task_policy(current);
> > >         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> > >                                 NO_INTERLEAVE_INDEX, &page_was_allocated, true);
> > > -       if (!page)
> > > +       if (!page) {
> > > +               /*
> > > +                * If we get here because the page is already in swapcache, a
> > > +                * load may be happening concurrently. It is safe and okay to
> > > +                * not free the entry. It is also okay to return !0.
> > > +                */
> > >                 return -ENOMEM;
> > > +       }
> > >
> > >         /* Found an existing page, we raced with load/swapin */
> > >         if (!page_was_allocated) {
>
> That's the wrong branch, no?
>
> !page -> -ENOMEM
>
> page && !page_was_allocated -> already in swapcache

Ah yes, my bad.

>
> Personally, I don't really get the comment. What does it mean that
> it's "okay" not to free the entry? There is a put, which may or may
> not free the entry if somebody else is using it. Is it explaining how
> lifetime works for refcounted objects? I'm similarly confused by the
> "it's okay" to return non-zero. What is that trying to convey?
>
> Deletion seemed like the right choice here, IMO ;)

It's not the clearest of comments for sure. I think it is just trying
to say that it is okay not to write back the entry from zswap and to
fail, because the caller will just try another page. I did not like
silently deleting the comment during the refactoring. How about
rewriting it to something like:

/*
 * If we get here because the page is already in the swapcache, a
 * load may be happening concurrently. Skip this page, the caller
 * will move on to a different page.
 */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry()
  2023-12-18 14:39           ` Yosry Ahmed
@ 2023-12-18 14:58             ` Johannes Weiner
  2023-12-18 20:52               ` Yosry Ahmed
  0 siblings, 1 reply; 41+ messages in thread
From: Johannes Weiner @ 2023-12-18 14:58 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Chengming Zhou, Nhat Pham, Chris Li,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Mon, Dec 18, 2023 at 06:39:13AM -0800, Yosry Ahmed wrote:
> On Mon, Dec 18, 2023 at 6:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Dec 14, 2023 at 02:41:26PM -0800, Yosry Ahmed wrote:
> > > On Thu, Dec 14, 2023 at 2:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Wed, 13 Dec 2023 17:02:25 -0800 Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > > On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> > > > > <zhouchengming@bytedance.com> wrote:
> > > > > >
> > > > > > Also after the common decompress part goes to __zswap_load(), we can
> > > > > > cleanup the zswap_reclaim_entry() a little.
> > > > >
> > > > > I think you mean zswap_writeback_entry(), same for the commit title.
> > > >
> > > > I updated my copy of the changelog, thanks.
> > > >
> > > > > > -       /*
> > > > > > -        * If we get here because the page is already in swapcache, a
> > > > > > -        * load may be happening concurrently. It is safe and okay to
> > > > > > -        * not free the entry. It is also okay to return !0.
> > > > > > -        */
> > > > >
> > > > > This comment should be moved above the failure check of
> > > > > __read_swap_cache_async() above, not completely removed.
> > > >
> > > > This?
> > >
> > > Yes, thanks a lot. Although I think a new version is needed anyway to
> > > address other comments.
> > >
> > > >
> > > > --- a/mm/zswap.c~mm-zswap-cleanup-zswap_reclaim_entry-fix
> > > > +++ a/mm/zswap.c
> > > > @@ -1457,8 +1457,14 @@ static int zswap_writeback_entry(struct
> > > >         mpol = get_task_policy(current);
> > > >         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> > > >                                 NO_INTERLEAVE_INDEX, &page_was_allocated, true);
> > > > -       if (!page)
> > > > +       if (!page) {
> > > > +               /*
> > > > +                * If we get here because the page is already in swapcache, a
> > > > +                * load may be happening concurrently. It is safe and okay to
> > > > +                * not free the entry. It is also okay to return !0.
> > > > +                */
> > > >                 return -ENOMEM;
> > > > +       }
> > > >
> > > >         /* Found an existing page, we raced with load/swapin */
> > > >         if (!page_was_allocated) {
> >
> > That's the wrong branch, no?
> >
> > !page -> -ENOMEM
> >
> > page && !page_was_allocated -> already in swapcache
> 
> Ah yes, my bad.
> 
> >
> > Personally, I don't really get the comment. What does it mean that
> > it's "okay" not to free the entry? There is a put, which may or may
> > not free the entry if somebody else is using it. Is it explaining how
> > lifetime works for refcounted objects? I'm similarly confused by the
> > "it's okay" to return non-zero. What is that trying to convey?
> >
> > Deletion seemed like the right choice here, IMO ;)
> 
> It's not the clearest of comments for sure. I think it is just trying
> to say that it is okay not to write back the entry from zswap and to
> fail, because the caller will just try another page. I did not like
> silently deleting the comment during the refactoring. How about
> rewriting it to something like:
> 
> /*
>  * If we get here because the page is already in the swapcache, a
>  * load may be happening concurrently. Skip this page, the caller
>  * will move on to a different page.
>  */

Well there is this one already on the branch:

/* Found an existing page, we raced with load/swapin */

which covers the first half. The unspoken assumption there is that
writeback is an operation for an aged out page, while swapin means the
age just got reset to 0. Maybe it makes sense to elaborate on that?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry()
  2023-12-18 14:58             ` Johannes Weiner
@ 2023-12-18 20:52               ` Yosry Ahmed
  2023-12-19 12:16                 ` Chengming Zhou
  2023-12-20  4:30                 ` Johannes Weiner
  0 siblings, 2 replies; 41+ messages in thread
From: Yosry Ahmed @ 2023-12-18 20:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Chengming Zhou, Nhat Pham, Chris Li,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Mon, Dec 18, 2023 at 6:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Dec 18, 2023 at 06:39:13AM -0800, Yosry Ahmed wrote:
> > On Mon, Dec 18, 2023 at 6:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Thu, Dec 14, 2023 at 02:41:26PM -0800, Yosry Ahmed wrote:
> > > > On Thu, Dec 14, 2023 at 2:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > >
> > > > > On Wed, 13 Dec 2023 17:02:25 -0800 Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > > On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
> > > > > > <zhouchengming@bytedance.com> wrote:
> > > > > > >
> > > > > > > Also after the common decompress part goes to __zswap_load(), we can
> > > > > > > cleanup the zswap_reclaim_entry() a little.
> > > > > >
> > > > > > I think you mean zswap_writeback_entry(), same for the commit title.
> > > > >
> > > > > I updated my copy of the changelog, thanks.
> > > > >
> > > > > > > -       /*
> > > > > > > -        * If we get here because the page is already in swapcache, a
> > > > > > > -        * load may be happening concurrently. It is safe and okay to
> > > > > > > -        * not free the entry. It is also okay to return !0.
> > > > > > > -        */
> > > > > >
> > > > > > This comment should be moved above the failure check of
> > > > > > __read_swap_cache_async() above, not completely removed.
> > > > >
> > > > > This?
> > > >
> > > > Yes, thanks a lot. Although I think a new version is needed anyway to
> > > > address other comments.
> > > >
> > > > >
> > > > > --- a/mm/zswap.c~mm-zswap-cleanup-zswap_reclaim_entry-fix
> > > > > +++ a/mm/zswap.c
> > > > > @@ -1457,8 +1457,14 @@ static int zswap_writeback_entry(struct
> > > > >         mpol = get_task_policy(current);
> > > > >         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> > > > >                                 NO_INTERLEAVE_INDEX, &page_was_allocated, true);
> > > > > -       if (!page)
> > > > > +       if (!page) {
> > > > > +               /*
> > > > > +                * If we get here because the page is already in swapcache, a
> > > > > +                * load may be happening concurrently. It is safe and okay to
> > > > > +                * not free the entry. It is also okay to return !0.
> > > > > +                */
> > > > >                 return -ENOMEM;
> > > > > +       }
> > > > >
> > > > >         /* Found an existing page, we raced with load/swapin */
> > > > >         if (!page_was_allocated) {
> > >
> > > That's the wrong branch, no?
> > >
> > > !page -> -ENOMEM
> > >
> > > page && !page_was_allocated -> already in swapcache
> >
> > Ah yes, my bad.
> >
> > >
> > > Personally, I don't really get the comment. What does it mean that
> > > it's "okay" not to free the entry? There is a put, which may or may
> > > not free the entry if somebody else is using it. Is it explaining how
> > > lifetime works for refcounted objects? I'm similarly confused by the
> > > "it's okay" to return non-zero. What is that trying to convey?
> > >
> > > Deletion seemed like the right choice here, IMO ;)
> >
> > It's not the clearest of comments for sure. I think it is just trying
> > to say that it is okay not to write back the entry from zswap and to
> > fail, because the caller will just try another page. I did not like
> > silently deleting the comment during the refactoring. How about
> > rewriting it to something like:
> >
> > /*
> >  * If we get here because the page is already in the swapcache, a
> >  * load may be happening concurrently. Skip this page, the caller
> >  * will move on to a different page.
> >  */
>
> Well there is this one already on the branch:
>
> /* Found an existing page, we raced with load/swapin */
>
> which covers the first half. The unspoken assumption there is that
> writeback is an operation for an aged out page, while swapin means the
> age just got reset to 0. Maybe it makes sense to elaborate on that?

How about the following diff? This applies on top of Andrew's fix:

diff --git a/mm/zswap.c b/mm/zswap.c
index e8f8f47596dae..8228a0b370979 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1458,15 +1458,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
        page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
                                NO_INTERLEAVE_INDEX, &page_was_allocated, true);
        if (!page) {
-               /*
-                * If we get here because the page is already in swapcache, a
-                * load may be happening concurrently. It is safe and okay to
-                * not free the entry. It is also okay to return !0.
-                */
                return -ENOMEM;
        }

-       /* Found an existing page, we raced with load/swapin */
+       /*
+        * Found an existing page, we raced with load/swapin. We generally
+        * writeback cold pages from zswap, and swapin means the page just
+        * became hot. Skip this page and let the caller find another one.
+        */
        if (!page_was_allocated) {
                put_page(page);
                return -EEXIST;

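With Andrew's fix plus the diff above folded in, the branch would end up
reading roughly like this (just a sketch of the combined result, surrounding
context elided):

	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
				NO_INTERLEAVE_INDEX, &page_was_allocated, true);
	if (!page) {
		return -ENOMEM;
	}

	/*
	 * Found an existing page, we raced with load/swapin. We generally
	 * writeback cold pages from zswap, and swapin means the page just
	 * became hot. Skip this page and let the caller find another one.
	 */
	if (!page_was_allocated) {
		put_page(page);
		return -EEXIST;
	}
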
^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry()
  2023-12-18 20:52               ` Yosry Ahmed
@ 2023-12-19 12:16                 ` Chengming Zhou
  2023-12-20  4:30                 ` Johannes Weiner
  1 sibling, 0 replies; 41+ messages in thread
From: Chengming Zhou @ 2023-12-19 12:16 UTC (permalink / raw)
  To: Yosry Ahmed, Johannes Weiner
  Cc: Andrew Morton, Nhat Pham, Chris Li, Seth Jennings, Dan Streetman,
	Vitaly Wool, linux-kernel, linux-mm

On 2023/12/19 04:52, Yosry Ahmed wrote:
> On Mon, Dec 18, 2023 at 6:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> On Mon, Dec 18, 2023 at 06:39:13AM -0800, Yosry Ahmed wrote:
>>> On Mon, Dec 18, 2023 at 6:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>>
>>>> On Thu, Dec 14, 2023 at 02:41:26PM -0800, Yosry Ahmed wrote:
>>>>> On Thu, Dec 14, 2023 at 2:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>>>>
>>>>>> On Wed, 13 Dec 2023 17:02:25 -0800 Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>>>
>>>>>>> On Tue, Dec 12, 2023 at 8:18 PM Chengming Zhou
>>>>>>> <zhouchengming@bytedance.com> wrote:
>>>>>>>>
>>>>>>>> Also after the common decompress part goes to __zswap_load(), we can
>>>>>>>> cleanup the zswap_reclaim_entry() a little.
>>>>>>>
>>>>>>> I think you mean zswap_writeback_entry(), same for the commit title.
>>>>>>
>>>>>> I updated my copy of the changelog, thanks.
>>>>>>
>>>>>>>> -       /*
>>>>>>>> -        * If we get here because the page is already in swapcache, a
>>>>>>>> -        * load may be happening concurrently. It is safe and okay to
>>>>>>>> -        * not free the entry. It is also okay to return !0.
>>>>>>>> -        */
>>>>>>>
>>>>>>> This comment should be moved above the failure check of
>>>>>>> __read_swap_cache_async() above, not completely removed.
>>>>>>
>>>>>> This?
>>>>>
>>>>> Yes, thanks a lot. Although I think a new version is needed anyway to
>>>>> address other comments.
>>>>>
>>>>>>
>>>>>> --- a/mm/zswap.c~mm-zswap-cleanup-zswap_reclaim_entry-fix
>>>>>> +++ a/mm/zswap.c
>>>>>> @@ -1457,8 +1457,14 @@ static int zswap_writeback_entry(struct
>>>>>>         mpol = get_task_policy(current);
>>>>>>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>>>>>>                                 NO_INTERLEAVE_INDEX, &page_was_allocated, true);
>>>>>> -       if (!page)
>>>>>> +       if (!page) {
>>>>>> +               /*
>>>>>> +                * If we get here because the page is already in swapcache, a
>>>>>> +                * load may be happening concurrently. It is safe and okay to
>>>>>> +                * not free the entry. It is also okay to return !0.
>>>>>> +                */
>>>>>>                 return -ENOMEM;
>>>>>> +       }
>>>>>>
>>>>>>         /* Found an existing page, we raced with load/swapin */
>>>>>>         if (!page_was_allocated) {
>>>>
>>>> That's the wrong branch, no?
>>>>
>>>> !page -> -ENOMEM
>>>>
>>>> page && !page_was_allocated -> already in swapcache
>>>
>>> Ah yes, my bad.
>>>
>>>>
>>>> Personally, I don't really get the comment. What does it mean that
>>>> it's "okay" not to free the entry? There is a put, which may or may
>>>> not free the entry if somebody else is using it. Is it explaining how
>>>> lifetime works for refcounted objects? I'm similarly confused by the
>>>> "it's okay" to return non-zero. What is that trying to convey?
>>>>
>>>> Deletion seemed like the right choice here, IMO ;)
>>>
>>> It's not the clearest of comments for sure. I think it is just trying
>>> to say that it is okay not to write back the entry from zswap and to
>>> fail, because the caller will just try another page. I did not like
>>> silently deleting the comment during the refactoring. How about
>>> rewriting it to something like:
>>>
>>> /*
>>>  * If we get here because the page is already in the swapcache, a
>>>  * load may be happening concurrently. Skip this page, the caller
>>>  * will move on to a different page.
>>>  */
>>
>> Well there is this one already on the branch:
>>
>> /* Found an existing page, we raced with load/swapin */
>>
>> which covers the first half. The unspoken assumption there is that
>> writeback is an operation for an aged out page, while swapin means the
>> age just got reset to 0. Maybe it makes sense to elaborate on that?
> 
> How about the following diff? This applies on top of Andrew's fix:
> 

Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>

The latest v3 also puts the comment on the wrong branch, and this diff
could be folded in to fix it.

v3: https://lore.kernel.org/all/20231213-zswap-dstmem-v3-5-4eac09b94ece@bytedance.com/

> diff --git a/mm/zswap.c b/mm/zswap.c
> index e8f8f47596dae..8228a0b370979 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1458,15 +1458,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>                                 NO_INTERLEAVE_INDEX, &page_was_allocated, true);
>         if (!page) {
> -               /*
> -                * If we get here because the page is already in swapcache, a
> -                * load may be happening concurrently. It is safe and okay to
> -                * not free the entry. It is also okay to return !0.
> -                */
>                 return -ENOMEM;
>         }
> 
> -       /* Found an existing page, we raced with load/swapin */
> +       /*
> +        * Found an existing page, we raced with load/swapin. We generally
> +        * writeback cold pages from zswap, and swapin means the page just
> +        * became hot. Skip this page and let the caller find another one.
> +        */
>         if (!page_was_allocated) {
>                 put_page(page);
>                 return -EEXIST;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry()
  2023-12-18 20:52               ` Yosry Ahmed
  2023-12-19 12:16                 ` Chengming Zhou
@ 2023-12-20  4:30                 ` Johannes Weiner
  1 sibling, 0 replies; 41+ messages in thread
From: Johannes Weiner @ 2023-12-20  4:30 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Chengming Zhou, Nhat Pham, Chris Li,
	Seth Jennings, Dan Streetman, Vitaly Wool, linux-kernel,
	linux-mm

On Mon, Dec 18, 2023 at 12:52:40PM -0800, Yosry Ahmed wrote:
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1458,15 +1458,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>                                 NO_INTERLEAVE_INDEX, &page_was_allocated, true);
>         if (!page) {
> -               /*
> -                * If we get here because the page is already in swapcache, a
> -                * load may be happening concurrently. It is safe and okay to
> -                * not free the entry. It is also okay to return !0.
> -                */
>                 return -ENOMEM;
>         }
> 
> -       /* Found an existing page, we raced with load/swapin */
> +       /*
> +        * Found an existing page, we raced with load/swapin. We generally
> +        * writeback cold pages from zswap, and swapin means the page just
> +        * became hot. Skip this page and let the caller find another one.
> +        */
>         if (!page_was_allocated) {
>                 put_page(page);
>                 return -EEXIST;

ACK, that looks good to me!

Thanks

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2023-12-20  4:30 UTC | newest]

Thread overview: 41+ messages
2023-12-13  4:17 [PATCH 0/5] mm/zswap: dstmem reuse optimizations and cleanups Chengming Zhou
2023-12-13  4:17 ` [PATCH 1/5] mm/zswap: reuse dstmem when decompress Chengming Zhou
2023-12-13 23:24   ` Yosry Ahmed
2023-12-14 13:29     ` Chengming Zhou
2023-12-14 13:32       ` Yosry Ahmed
2023-12-14 14:42         ` Chengming Zhou
2023-12-14 18:24           ` Yosry Ahmed
2023-12-18  8:06             ` Chengming Zhou
2023-12-14 17:59   ` Chris Li
2023-12-14 18:26     ` Yosry Ahmed
2023-12-14 22:02       ` Chris Li
2023-12-14 20:33     ` Nhat Pham
2023-12-13  4:17 ` [PATCH 2/5] mm/zswap: change dstmem size to one page Chengming Zhou
2023-12-13 23:34   ` Yosry Ahmed
2023-12-14  0:18     ` Nhat Pham
2023-12-14 13:33       ` Chengming Zhou
2023-12-14 13:37         ` Yosry Ahmed
2023-12-14 13:57           ` Chengming Zhou
2023-12-14 15:03             ` Chengming Zhou
2023-12-14 18:34               ` Yosry Ahmed
2023-12-14 18:30             ` Yosry Ahmed
2023-12-14 20:29               ` Nhat Pham
2023-12-13  4:18 ` [PATCH 3/5] mm/zswap: refactor out __zswap_load() Chengming Zhou
2023-12-13 23:37   ` Yosry Ahmed
2023-12-14  0:52   ` Yosry Ahmed
2023-12-14 14:45     ` Chengming Zhou
2023-12-18  8:15     ` Chengming Zhou
2023-12-18  9:38       ` Yosry Ahmed
2023-12-13  4:18 ` [PATCH 4/5] mm/zswap: cleanup zswap_load() Chengming Zhou
2023-12-14  0:56   ` Yosry Ahmed
2023-12-13  4:18 ` [PATCH 5/5] mm/zswap: cleanup zswap_reclaim_entry() Chengming Zhou
2023-12-13 23:27   ` Nhat Pham
2023-12-14  1:02   ` Yosry Ahmed
2023-12-14 22:23     ` Andrew Morton
2023-12-14 22:41       ` Yosry Ahmed
2023-12-18 14:03         ` Johannes Weiner
2023-12-18 14:39           ` Yosry Ahmed
2023-12-18 14:58             ` Johannes Weiner
2023-12-18 20:52               ` Yosry Ahmed
2023-12-19 12:16                 ` Chengming Zhou
2023-12-20  4:30                 ` Johannes Weiner
