linux-mm.kvack.org archive mirror
* [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration
@ 2020-07-07 12:52 Barry Song
  2020-07-08 14:59 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2020-07-07 12:52 UTC (permalink / raw)
  To: akpm, herbert, davem
  Cc: linux-crypto, linux-mm, linux-kernel, linuxarm, Barry Song,
	Luis Claudio R . Goncalves, Sebastian Andrzej Siewior,
	Mahipal Challa, Seth Jennings, Dan Streetman, Vitaly Wool,
	Zhou Wang, Colin Ian King

Right now, all new ZIP drivers use the crypto_acomp APIs rather than the
legacy crypto_comp APIs, but zswap.c is still using the old APIs. That
means zswap won't be able to use any of the new ZIP drivers in the kernel.

This patch moves zswap to the crypto_acomp APIs to fix the problem. On
the other hand, traditional compressors like lz4 and lzo have been
wrapped into acomp via the scomp backend, so platforms without async
compressors can fall back to acomp via the scomp backend.

zswap is probably the first real user of acomp, but perhaps not a good
example of how multiple acomp requests can be executed in parallel in
one acomp instance: frontswap loads and stores pages one by one, and it
has no queuing or buffering mechanism to let multiple pages go through
frontswap simultaneously in one thread. However, this patch creates
multiple acomp instances, so multiple threads running on different CPUs
can do (de)compression in parallel, leveraging the power of multiple
ZIP hardware queues. This is also consistent with frontswap's page
management model.

In addition, the current zswap implementation has per-CPU global
resources such as zswap_dstmem, so we create one acomp instance per CPU,
just as zswap previously created one comp instance per CPU.

Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mahipal Challa <mahipalreddy2006@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Zhou Wang <wangzhou1@hisilicon.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 -v4: refine changelog and comment

 mm/zswap.c | 177 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 134 insertions(+), 43 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index fbb782924ccc..ab73957716c3 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -24,8 +24,10 @@
 #include <linux/rbtree.h>
 #include <linux/swap.h>
 #include <linux/crypto.h>
+#include <linux/scatterlist.h>
 #include <linux/mempool.h>
 #include <linux/zpool.h>
+#include <crypto/acompress.h>
 
 #include <linux/mm_types.h>
 #include <linux/page-flags.h>
@@ -127,9 +129,17 @@ module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
 * data structures
 **********************************/
 
+struct crypto_acomp_ctx {
+	struct crypto_acomp *acomp;
+	struct acomp_req *req;
+	struct crypto_wait wait;
+	u8 *dstmem;
+	struct mutex mutex;
+};
+
 struct zswap_pool {
 	struct zpool *zpool;
-	struct crypto_comp * __percpu *tfm;
+	struct crypto_acomp_ctx * __percpu *acomp_ctx;
 	struct kref kref;
 	struct list_head list;
 	struct work_struct release_work;
@@ -415,30 +425,73 @@ static int zswap_dstmem_dead(unsigned int cpu)
 static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
-	struct crypto_comp *tfm;
+	struct crypto_acomp *acomp;
+	struct acomp_req *req;
+	struct crypto_acomp_ctx *acomp_ctx;
+	int ret;
 
-	if (WARN_ON(*per_cpu_ptr(pool->tfm, cpu)))
+	if (WARN_ON(*per_cpu_ptr(pool->acomp_ctx, cpu)))
 		return 0;
 
-	tfm = crypto_alloc_comp(pool->tfm_name, 0, 0);
-	if (IS_ERR_OR_NULL(tfm)) {
-		pr_err("could not alloc crypto comp %s : %ld\n",
-		       pool->tfm_name, PTR_ERR(tfm));
+	acomp_ctx = kzalloc(sizeof(*acomp_ctx), GFP_KERNEL);
+	if (!acomp_ctx)
 		return -ENOMEM;
+
+	acomp = crypto_alloc_acomp(pool->tfm_name, 0, 0);
+	if (IS_ERR(acomp)) {
+		pr_err("could not alloc crypto acomp %s : %ld\n",
+				pool->tfm_name, PTR_ERR(acomp));
+		ret = PTR_ERR(acomp);
+		goto free_ctx;
 	}
-	*per_cpu_ptr(pool->tfm, cpu) = tfm;
+	acomp_ctx->acomp = acomp;
+
+	req = acomp_request_alloc(acomp_ctx->acomp);
+	if (!req) {
+		pr_err("could not alloc crypto acomp_request %s\n",
+		       pool->tfm_name);
+		ret = -ENOMEM;
+		goto free_acomp;
+	}
+	acomp_ctx->req = req;
+
+	mutex_init(&acomp_ctx->mutex);
+	crypto_init_wait(&acomp_ctx->wait);
+	/*
+	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
+	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
+	 * won't be called, crypto_wait_req() will return without blocking.
+	 */
+	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, &acomp_ctx->wait);
+
+	acomp_ctx->dstmem = per_cpu(zswap_dstmem, cpu);
+	*per_cpu_ptr(pool->acomp_ctx, cpu) = acomp_ctx;
+
 	return 0;
+
+free_acomp:
+	crypto_free_acomp(acomp_ctx->acomp);
+free_ctx:
+	kfree(acomp_ctx);
+	return ret;
 }
 
 static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
-	struct crypto_comp *tfm;
+	struct crypto_acomp_ctx *acomp_ctx;
+
+	acomp_ctx = *per_cpu_ptr(pool->acomp_ctx, cpu);
+	if (!IS_ERR_OR_NULL(acomp_ctx)) {
+		if (!IS_ERR_OR_NULL(acomp_ctx->req))
+			acomp_request_free(acomp_ctx->req);
+		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+			crypto_free_acomp(acomp_ctx->acomp);
+		kfree(acomp_ctx);
+	}
+	*per_cpu_ptr(pool->acomp_ctx, cpu) = NULL;
 
-	tfm = *per_cpu_ptr(pool->tfm, cpu);
-	if (!IS_ERR_OR_NULL(tfm))
-		crypto_free_comp(tfm);
-	*per_cpu_ptr(pool->tfm, cpu) = NULL;
 	return 0;
 }
 
@@ -561,8 +614,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 	pr_debug("using %s zpool\n", zpool_get_type(pool->zpool));
 
 	strlcpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
-	pool->tfm = alloc_percpu(struct crypto_comp *);
-	if (!pool->tfm) {
+
+	pool->acomp_ctx = alloc_percpu(struct crypto_acomp_ctx *);
+	if (!pool->acomp_ctx) {
 		pr_err("percpu alloc failed\n");
 		goto error;
 	}
@@ -585,7 +639,7 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 	return pool;
 
 error:
-	free_percpu(pool->tfm);
+	free_percpu(pool->acomp_ctx);
 	if (pool->zpool)
 		zpool_destroy_pool(pool->zpool);
 	kfree(pool);
@@ -596,14 +650,14 @@ static __init struct zswap_pool *__zswap_pool_create_fallback(void)
 {
 	bool has_comp, has_zpool;
 
-	has_comp = crypto_has_comp(zswap_compressor, 0, 0);
+	has_comp = crypto_has_acomp(zswap_compressor, 0, 0);
 	if (!has_comp && strcmp(zswap_compressor,
 				CONFIG_ZSWAP_COMPRESSOR_DEFAULT)) {
 		pr_err("compressor %s not available, using default %s\n",
 		       zswap_compressor, CONFIG_ZSWAP_COMPRESSOR_DEFAULT);
 		param_free_charp(&zswap_compressor);
 		zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT;
-		has_comp = crypto_has_comp(zswap_compressor, 0, 0);
+		has_comp = crypto_has_acomp(zswap_compressor, 0, 0);
 	}
 	if (!has_comp) {
 		pr_err("default compressor %s not available\n",
@@ -639,7 +693,7 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
 	zswap_pool_debug("destroying", pool);
 
 	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
-	free_percpu(pool->tfm);
+	free_percpu(pool->acomp_ctx);
 	zpool_destroy_pool(pool->zpool);
 	kfree(pool);
 }
@@ -723,7 +777,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 		}
 		type = s;
 	} else if (!compressor) {
-		if (!crypto_has_comp(s, 0, 0)) {
+		if (!crypto_has_acomp(s, 0, 0)) {
 			pr_err("compressor %s not available\n", s);
 			return -ENOENT;
 		}
@@ -774,7 +828,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 		 * failed, maybe both compressor and zpool params were bad.
 		 * Allow changing this param, so pool creation will succeed
 		 * when the other param is changed. We already verified this
-		 * param is ok in the zpool_has_pool() or crypto_has_comp()
+		 * param is ok in the zpool_has_pool() or crypto_has_acomp()
 		 * checks above.
 		 */
 		ret = param_set_charp(s, kp);
@@ -876,7 +930,9 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
 	pgoff_t offset;
 	struct zswap_entry *entry;
 	struct page *page;
-	struct crypto_comp *tfm;
+	struct scatterlist input, output;
+	struct crypto_acomp_ctx *acomp_ctx;
+
 	u8 *src, *dst;
 	unsigned int dlen;
 	int ret;
@@ -916,14 +972,21 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
 
 	case ZSWAP_SWAPCACHE_NEW: /* page is locked */
 		/* decompress */
+		acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
+
 		dlen = PAGE_SIZE;
 		src = (u8 *)zhdr + sizeof(struct zswap_header);
-		dst = kmap_atomic(page);
-		tfm = *get_cpu_ptr(entry->pool->tfm);
-		ret = crypto_comp_decompress(tfm, src, entry->length,
-					     dst, &dlen);
-		put_cpu_ptr(entry->pool->tfm);
-		kunmap_atomic(dst);
+		dst = kmap(page);
+
+		mutex_lock(&acomp_ctx->mutex);
+		sg_init_one(&input, src, entry->length);
+		sg_init_one(&output, dst, dlen);
+		acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
+		ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
+		dlen = acomp_ctx->req->dlen;
+		mutex_unlock(&acomp_ctx->mutex);
+
+		kunmap(page);
 		BUG_ON(ret);
 		BUG_ON(dlen != PAGE_SIZE);
 
@@ -1004,7 +1067,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 {
 	struct zswap_tree *tree = zswap_trees[type];
 	struct zswap_entry *entry, *dupentry;
-	struct crypto_comp *tfm;
+	struct scatterlist input, output;
+	struct crypto_acomp_ctx *acomp_ctx;
 	int ret;
 	unsigned int hlen, dlen = PAGE_SIZE;
 	unsigned long handle, value;
@@ -1074,12 +1138,32 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 	}
 
 	/* compress */
-	dst = get_cpu_var(zswap_dstmem);
-	tfm = *get_cpu_ptr(entry->pool->tfm);
-	src = kmap_atomic(page);
-	ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
-	kunmap_atomic(src);
-	put_cpu_ptr(entry->pool->tfm);
+	acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
+
+	mutex_lock(&acomp_ctx->mutex);
+
+	src = kmap(page);
+	dst = acomp_ctx->dstmem;
+	sg_init_one(&input, src, PAGE_SIZE);
+	/* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
+	sg_init_one(&output, dst, PAGE_SIZE * 2);
+	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	/*
+	 * it maybe looks a little bit silly that we send an asynchronous request,
+	 * then wait for its completion synchronously. This makes the process look
+	 * synchronous in fact.
+	 * Theoretically, acomp supports users send multiple acomp requests in one
+	 * acomp instance, then get those requests done simultaneously. but in this
+	 * case, frontswap actually does store and load page by page, there is no
+	 * existing method to send the second page before the first page is done
+	 * in one thread doing frontswap.
+	 * but in different threads running on different cpu, we have different
+	 * acomp instance, so multiple threads can do (de)compression in parallel.
+	 */
+	ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
+	dlen = acomp_ctx->req->dlen;
+	kunmap(page);
+
 	if (ret) {
 		ret = -EINVAL;
 		goto put_dstmem;
@@ -1103,7 +1187,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 	memcpy(buf, &zhdr, hlen);
 	memcpy(buf + hlen, dst, dlen);
 	zpool_unmap_handle(entry->pool->zpool, handle);
-	put_cpu_var(zswap_dstmem);
+	mutex_unlock(&acomp_ctx->mutex);
 
 	/* populate entry */
 	entry->offset = offset;
@@ -1131,7 +1215,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 	return 0;
 
 put_dstmem:
-	put_cpu_var(zswap_dstmem);
+	mutex_unlock(&acomp_ctx->mutex);
 	zswap_pool_put(entry->pool);
 freepage:
 	zswap_entry_cache_free(entry);
@@ -1148,7 +1232,8 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 {
 	struct zswap_tree *tree = zswap_trees[type];
 	struct zswap_entry *entry;
-	struct crypto_comp *tfm;
+	struct scatterlist input, output;
+	struct crypto_acomp_ctx *acomp_ctx;
 	u8 *src, *dst;
 	unsigned int dlen;
 	int ret;
@@ -1175,11 +1260,17 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
 	if (zpool_evictable(entry->pool->zpool))
 		src += sizeof(struct zswap_header);
-	dst = kmap_atomic(page);
-	tfm = *get_cpu_ptr(entry->pool->tfm);
-	ret = crypto_comp_decompress(tfm, src, entry->length, dst, &dlen);
-	put_cpu_ptr(entry->pool->tfm);
-	kunmap_atomic(dst);
+	dst = kmap(page);
+
+	acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
+	mutex_lock(&acomp_ctx->mutex);
+	sg_init_one(&input, src, entry->length);
+	sg_init_one(&output, dst, dlen);
+	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
+	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
+	mutex_unlock(&acomp_ctx->mutex);
+
+	kunmap(page);
 	zpool_unmap_handle(entry->pool->zpool, entry->handle);
 	BUG_ON(ret);
 
-- 
2.27.0

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration
  2020-07-07 12:52 [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration Barry Song
@ 2020-07-08 14:59 ` Sebastian Andrzej Siewior
  2020-07-08 21:45   ` Song Bao Hua (Barry Song)
  2020-07-09  1:32   ` Song Bao Hua (Barry Song)
  0 siblings, 2 replies; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-07-08 14:59 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, herbert, davem, linux-crypto, linux-mm, linux-kernel,
	linuxarm, Luis Claudio R . Goncalves, Mahipal Challa,
	Seth Jennings, Dan Streetman, Vitaly Wool, Zhou Wang,
	Colin Ian King

On 2020-07-08 00:52:10 [+1200], Barry Song wrote:
…
> @@ -127,9 +129,17 @@ module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
>  * data structures
>  **********************************/
>  
> +struct crypto_acomp_ctx {
> +	struct crypto_acomp *acomp;
> +	struct acomp_req *req;
> +	struct crypto_wait wait;
> +	u8 *dstmem;
> +	struct mutex mutex;
> +};
> @@ -561,8 +614,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
>  	pr_debug("using %s zpool\n", zpool_get_type(pool->zpool));
>  
>  	strlcpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
> -	pool->tfm = alloc_percpu(struct crypto_comp *);
> -	if (!pool->tfm) {
> +
> +	pool->acomp_ctx = alloc_percpu(struct crypto_acomp_ctx *);

Can't you allocate the whole structure instead of just a pointer to it?
The structure looks just like a bunch of pointers anyway. Less time for
pointer chasing means more time for fun.

> @@ -1074,12 +1138,32 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>  	}
>  
>  	/* compress */
> -	dst = get_cpu_var(zswap_dstmem);
> -	tfm = *get_cpu_ptr(entry->pool->tfm);
> -	src = kmap_atomic(page);
> -	ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
> -	kunmap_atomic(src);
> -	put_cpu_ptr(entry->pool->tfm);
> +	acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
> +
> +	mutex_lock(&acomp_ctx->mutex);
> +
> +	src = kmap(page);
> +	dst = acomp_ctx->dstmem;

that mutex is per-CPU, per-context. The dstmem pointer is per-CPU. So if
I read this right, you can get preempted after crypto_wait_req() and
another context in this CPU writes its data to the same dstmem and then…

> +	sg_init_one(&input, src, PAGE_SIZE);
> +	/* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
> +	sg_init_one(&output, dst, PAGE_SIZE * 2);
> +	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> +	/*
> +	 * it maybe looks a little bit silly that we send an asynchronous request,
> +	 * then wait for its completion synchronously. This makes the process look
> +	 * synchronous in fact.
> +	 * Theoretically, acomp supports users send multiple acomp requests in one
> +	 * acomp instance, then get those requests done simultaneously. but in this
> +	 * case, frontswap actually does store and load page by page, there is no
> +	 * existing method to send the second page before the first page is done
> +	 * in one thread doing frontswap.
> +	 * but in different threads running on different cpu, we have different
> +	 * acomp instance, so multiple threads can do (de)compression in parallel.
> +	 */
> +	ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> +	dlen = acomp_ctx->req->dlen;
> +	kunmap(page);
> +
>  	if (ret) {
>  		ret = -EINVAL;
>  		goto put_dstmem;

This looks like wrapping a synchronous mechanism around an asynchronous
interface. It works as a PoC.

As far as I remember the crypto async interface, the incoming skbs were
fed to the async interface and returned to the caller so the NIC could
continue to allocate new RX skbs and move on. Only if the queue of
requests was getting too long did the code start to throttle. Eventually
the async crypto code completed the decryption operation in a different
context and fed the decrypted packet(s) into the stack.

From a quick view, you would have to return -EINPROGRESS here and have
something like this at the caller side:

diff --git a/mm/page_io.c b/mm/page_io.c
index e8726f3e3820b..9d1baa46ec3ed 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -252,12 +252,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
                unlock_page(page);
                goto out;
        }
-       if (frontswap_store(page) == 0) {
+       ret = frontswap_store(page);
+       if (ret == 0) {
                set_page_writeback(page);
                unlock_page(page);
                end_page_writeback(page);
                goto out;
        }
+       if (ret == -EINPROGRESS)
+               goto out;
        ret = __swap_writepage(page, wbc, end_swap_bio_write);
 out:
        return ret;

so that eventually callers like write_cache_pages() could feed all pages
into *writepage and then wait for that bulk to finish.

Having it this way would also reshape the memory allocation you have.
You now have a per-CPU crypto request and everything per context. With
64 or 128 cores I'm not sure you will use all those resources. With a
truly async interface you would be forced to have a resource pool or so,
which you would use to allow only a certain amount of parallel
requests.

Sebastian



* RE: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration
  2020-07-08 14:59 ` Sebastian Andrzej Siewior
@ 2020-07-08 21:45   ` Song Bao Hua (Barry Song)
  2020-07-09  7:17     ` Sebastian Andrzej Siewior
  2020-07-09  1:32   ` Song Bao Hua (Barry Song)
  1 sibling, 1 reply; 10+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-07-08 21:45 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: akpm, herbert, davem, linux-crypto, linux-mm, linux-kernel,
	Linuxarm, Luis Claudio R . Goncalves, Mahipal Challa,
	Seth Jennings, Dan Streetman, Vitaly Wool, Wangzhou (B),
	Colin Ian King



> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org
> [mailto:linux-crypto-owner@vger.kernel.org] On Behalf Of Sebastian Andrzej
> Siewior
> Sent: Thursday, July 9, 2020 3:00 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: akpm@linux-foundation.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; linux-crypto@vger.kernel.org; linux-mm@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; Luis Claudio
> R . Goncalves <lgoncalv@redhat.com>; Mahipal Challa
> <mahipalreddy2006@gmail.com>; Seth Jennings <sjenning@redhat.com>;
> Dan Streetman <ddstreet@ieee.org>; Vitaly Wool
> <vitaly.wool@konsulko.com>; Wangzhou (B) <wangzhou1@hisilicon.com>;
> Colin Ian King <colin.king@canonical.com>
> Subject: Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On 2020-07-08 00:52:10 [+1200], Barry Song wrote:
> …
> > @@ -127,9 +129,17 @@
> module_param_named(same_filled_pages_enabled,
> zswap_same_filled_pages_enabled,
> >  * data structures
> >  **********************************/
> >
> > +struct crypto_acomp_ctx {
> > +	struct crypto_acomp *acomp;
> > +	struct acomp_req *req;
> > +	struct crypto_wait wait;
> > +	u8 *dstmem;
> > +	struct mutex mutex;
> > +};
> …
> > @@ -561,8 +614,9 @@ static struct zswap_pool *zswap_pool_create(char
> *type, char *compressor)
> >  	pr_debug("using %s zpool\n", zpool_get_type(pool->zpool));
> >
> >  	strlcpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
> > -	pool->tfm = alloc_percpu(struct crypto_comp *);
> > -	if (!pool->tfm) {
> > +
> > +	pool->acomp_ctx = alloc_percpu(struct crypto_acomp_ctx *);
> 
> Can't you allocate the whole structure instead just a pointer to it? The
> structure looks just like bunch of pointers anyway. Less time for
> pointer chasing means more time for fun.
> 

Should be possible.

> > @@ -1074,12 +1138,32 @@ static int zswap_frontswap_store(unsigned
> type, pgoff_t offset,
> >  	}
> >
> >  	/* compress */
> > -	dst = get_cpu_var(zswap_dstmem);
> > -	tfm = *get_cpu_ptr(entry->pool->tfm);
> > -	src = kmap_atomic(page);
> > -	ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
> > -	kunmap_atomic(src);
> > -	put_cpu_ptr(entry->pool->tfm);
> > +	acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
> > +
> > +	mutex_lock(&acomp_ctx->mutex);
> > +
> > +	src = kmap(page);
> > +	dst = acomp_ctx->dstmem;
> 
> that mutex is per-CPU, per-context. The dstmem pointer is per-CPU. So if
> I read this right, you can get preempted after crypto_wait_req() and
> another context in this CPU writes its data to the same dstmem and then…
> 

This isn't true. Another thread on this CPU will be blocked by the mutex,
so it is impossible for two threads to write the same dstmem.
If thread1 runs on cpu1, it holds cpu1's mutex, and any other thread that
wants to run on cpu1 is blocked. If thread1 runs on cpu1 first, holds
cpu1's mutex, and then migrates to cpu2 (which is rare):
	a. a thread that wants to run on cpu1 is blocked;
	b. a thread that wants to run on cpu2 is not blocked, but it writes cpu2's dstmem, not cpu1's.

> > +	sg_init_one(&input, src, PAGE_SIZE);
> > +	/* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
> > +	sg_init_one(&output, dst, PAGE_SIZE * 2);
> > +	acomp_request_set_params(acomp_ctx->req, &input, &output,
> PAGE_SIZE, dlen);
> > +	/*
> > +	 * it maybe looks a little bit silly that we send an asynchronous request,
> > +	 * then wait for its completion synchronously. This makes the process
> look
> > +	 * synchronous in fact.
> > +	 * Theoretically, acomp supports users send multiple acomp requests in
> one
> > +	 * acomp instance, then get those requests done simultaneously. but in
> this
> > +	 * case, frontswap actually does store and load page by page, there is no
> > +	 * existing method to send the second page before the first page is done
> > +	 * in one thread doing frontswap.
> > +	 * but in different threads running on different cpu, we have different
> > +	 * acomp instance, so multiple threads can do (de)compression in
> parallel.
> > +	 */
> > +	ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> &acomp_ctx->wait);
> > +	dlen = acomp_ctx->req->dlen;
> > +	kunmap(page);
> > +
> >  	if (ret) {
> >  		ret = -EINVAL;
> >  		goto put_dstmem;
> 
> This looks using the same synchronous mechanism around an asynchronous
> interface. It works as a PoC.
> 
> As far as I remember the crypto async interface, the incoming skbs were
> fed to the async interface and returned to the caller so the NIC could
> continue allocate new RX skbs and move on. Only if the queue of requests
> was getting to long the code started to throttle. Eventually the async
> crypto code completed the decryption operation in a different context
> and fed the decrypted packet(s) into the stack.
> 
> From a quick view, you would have to return -EINPROGRESS here and have
> at the caller side something like that:
> 
> diff --git a/mm/page_io.c b/mm/page_io.c
> index e8726f3e3820b..9d1baa46ec3ed 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -252,12 +252,15 @@ int swap_writepage(struct page *page, struct
> writeback_control *wbc)
>                 unlock_page(page);
>                 goto out;
>         }
> -       if (frontswap_store(page) == 0) {
> +       ret = frontswap_store(page);
> +       if (ret == 0) {
>                 set_page_writeback(page);
>                 unlock_page(page);
>                 end_page_writeback(page);
>                 goto out;
>         }
> +       if (ret == -EINPROGRESS)
> +               goto out;
>         ret = __swap_writepage(page, wbc, end_swap_bio_write);
>  out:
>         return ret;
> 
> so that eventually callers like write_cache_pages() could feed all pages
> into *writepage and then wait for that bulk to finish.
> 
> Having it this way would also reshape the memory allocation you have.
> You have now per-context a per-CPU crypto request and everything. With
> a 64 or 128 core I'm not sure you will use all that resources.
> With a truly async interface you would be force to have a resource pool
> or so which you would use and then only allow a certain amount of
> parallel requests.

I agree we can optimize swap, frontswap, and zswap by making frontswap
async to improve performance. But this needs very careful thinking and
benchmarking: we need benchmarks to prove the performance improvement of
those changes. I am very interested in figuring out a patchset for that.
But as a first step, we need to build a base so that everything else can
move ahead. Right now, zswap really can't work with the new compression
drivers at all. Once we have a base, we can try many interesting things,
including making frontswap more efficient.

> 
> Sebastian

Thanks
Barry



* RE: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration
  2020-07-08 14:59 ` Sebastian Andrzej Siewior
  2020-07-08 21:45   ` Song Bao Hua (Barry Song)
@ 2020-07-09  1:32   ` Song Bao Hua (Barry Song)
  2020-07-09  7:39     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 10+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-07-09  1:32 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: akpm, herbert, davem, linux-crypto, linux-mm, linux-kernel,
	Linuxarm, Luis Claudio R . Goncalves, Mahipal Challa,
	Seth Jennings, Dan Streetman, Vitaly Wool, Wangzhou (B),
	Colin Ian King



> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org
> [mailto:linux-crypto-owner@vger.kernel.org] On Behalf Of Sebastian Andrzej
> Siewior
> Sent: Thursday, July 9, 2020 3:00 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: akpm@linux-foundation.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; linux-crypto@vger.kernel.org; linux-mm@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; Luis Claudio
> R . Goncalves <lgoncalv@redhat.com>; Mahipal Challa
> <mahipalreddy2006@gmail.com>; Seth Jennings <sjenning@redhat.com>;
> Dan Streetman <ddstreet@ieee.org>; Vitaly Wool
> <vitaly.wool@konsulko.com>; Wangzhou (B) <wangzhou1@hisilicon.com>;
> Colin Ian King <colin.king@canonical.com>
> Subject: Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On 2020-07-08 00:52:10 [+1200], Barry Song wrote:
> …
> > @@ -127,9 +129,17 @@
> module_param_named(same_filled_pages_enabled,
> > zswap_same_filled_pages_enabled,
> >  * data structures
> >  **********************************/
> >
> > +struct crypto_acomp_ctx {
> > +	struct crypto_acomp *acomp;
> > +	struct acomp_req *req;
> > +	struct crypto_wait wait;
> > +	u8 *dstmem;
> > +	struct mutex mutex;
> > +};
> …
> > @@ -561,8 +614,9 @@ static struct zswap_pool *zswap_pool_create(char
> *type, char *compressor)
> >  	pr_debug("using %s zpool\n", zpool_get_type(pool->zpool));
> >
> >  	strlcpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
> > -	pool->tfm = alloc_percpu(struct crypto_comp *);
> > -	if (!pool->tfm) {
> > +
> > +	pool->acomp_ctx = alloc_percpu(struct crypto_acomp_ctx *);
> 
> Can't you allocate the whole structure instead just a pointer to it? The
> structure looks just like bunch of pointers anyway. Less time for pointer
> chasing means more time for fun.
> 
> > @@ -1074,12 +1138,32 @@ static int zswap_frontswap_store(unsigned
> type, pgoff_t offset,
> >  	}
> >
> >  	/* compress */
> > -	dst = get_cpu_var(zswap_dstmem);
> > -	tfm = *get_cpu_ptr(entry->pool->tfm);
> > -	src = kmap_atomic(page);
> > -	ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
> > -	kunmap_atomic(src);
> > -	put_cpu_ptr(entry->pool->tfm);
> > +	acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
> > +
> > +	mutex_lock(&acomp_ctx->mutex);
> > +
> > +	src = kmap(page);
> > +	dst = acomp_ctx->dstmem;
> 
> that mutex is per-CPU, per-context. The dstmem pointer is per-CPU. So if I read
> this right, you can get preempted after crypto_wait_req() and another context
> in this CPU writes its data to the same dstmem and then…
> 
> > +	sg_init_one(&input, src, PAGE_SIZE);
> > +	/* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
> > +	sg_init_one(&output, dst, PAGE_SIZE * 2);
> > +	acomp_request_set_params(acomp_ctx->req, &input, &output,
> PAGE_SIZE, dlen);
> > +	/*
> > +	 * it maybe looks a little bit silly that we send an asynchronous request,
> > +	 * then wait for its completion synchronously. This makes the process
> look
> > +	 * synchronous in fact.
> > +	 * Theoretically, acomp supports users send multiple acomp requests in
> one
> > +	 * acomp instance, then get those requests done simultaneously. but in
> this
> > +	 * case, frontswap actually does store and load page by page, there is no
> > +	 * existing method to send the second page before the first page is done
> > +	 * in one thread doing frontswap.
> > +	 * but in different threads running on different cpu, we have different
> > +	 * acomp instance, so multiple threads can do (de)compression in
> parallel.
> > +	 */
> > +	ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> &acomp_ctx->wait);
> > +	dlen = acomp_ctx->req->dlen;
> > +	kunmap(page);
> > +
> >  	if (ret) {
> >  		ret = -EINVAL;
> >  		goto put_dstmem;
> 
> This looks using the same synchronous mechanism around an asynchronous
> interface. It works as a PoC.
> 
> As far as I remember the crypto async interface, the incoming skbs were fed to
> the async interface and returned to the caller so the NIC could continue
> allocate new RX skbs and move on. Only if the queue of requests was getting
> to long the code started to throttle. Eventually the async crypto code
> completed the decryption operation in a different context and fed the
> decrypted packet(s) into the stack.
> 
> From a quick view, you would have to return -EINPROGRESS here and have at
> the caller side something like that:
> 
> diff --git a/mm/page_io.c b/mm/page_io.c
> index e8726f3e3820b..9d1baa46ec3ed 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -252,12 +252,15 @@ int swap_writepage(struct page *page, struct
> writeback_control *wbc)
>                 unlock_page(page);
>                 goto out;
>         }
> -       if (frontswap_store(page) == 0) {
> +       ret = frontswap_store(page);
> +       if (ret == 0) {
>                 set_page_writeback(page);
>                 unlock_page(page);
>                 end_page_writeback(page);
>                 goto out;
>         }
> > +       if (ret == -EINPROGRESS)
> +               goto out;
>         ret = __swap_writepage(page, wbc, end_swap_bio_write);
>  out:
>         return ret;
> 
Unfortunately, this is not true and things are not that simple.

We can't simply depend on -EINPROGRESS and go out.
We have to wait for the result of the compression to decide whether we
should do __swap_writepage(). One page might be compressed into two
pages, and in that case we will still need to do __swap_writepage().
As I replied in the latest email, all of the async improvements to frontswap
need very careful thinking and benchmarking. They can only happen after
we build the base in this patch, fixing the broken connection between
zswap and those new zip drivers.

Thanks
Barry

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration
  2020-07-08 21:45   ` Song Bao Hua (Barry Song)
@ 2020-07-09  7:17     ` Sebastian Andrzej Siewior
  2020-07-09 12:14       ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-07-09  7:17 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: akpm, herbert, davem, linux-crypto, linux-mm, linux-kernel,
	Linuxarm, Luis Claudio R . Goncalves, Mahipal Challa,
	Seth Jennings, Dan Streetman, Vitaly Wool, Wangzhou (B),
	Colin Ian King

On 2020-07-08 21:45:47 [+0000], Song Bao Hua (Barry Song) wrote:
> > On 2020-07-08 00:52:10 [+1200], Barry Song wrote:
> > > @@ -127,9 +129,17 @@
> > > +struct crypto_acomp_ctx {
> > > +	struct crypto_acomp *acomp;
> > > +	struct acomp_req *req;
> > > +	struct crypto_wait wait;
> > > +	u8 *dstmem;
> > > +	struct mutex mutex;
> > > +};
> > …
> > > @@ -1074,12 +1138,32 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > >  	}
> > >
> > >  	/* compress */
> > > -	dst = get_cpu_var(zswap_dstmem);
> > > -	tfm = *get_cpu_ptr(entry->pool->tfm);
> > > -	src = kmap_atomic(page);
> > > -	ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
> > > -	kunmap_atomic(src);
> > > -	put_cpu_ptr(entry->pool->tfm);
> > > +	acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
> > > +
> > > +	mutex_lock(&acomp_ctx->mutex);
> > > +
> > > +	src = kmap(page);
> > > +	dst = acomp_ctx->dstmem;
> > 
> > that mutex is per-CPU, per-context. The dstmem pointer is per-CPU. So if
> > I read this right, you can get preempted after crypto_wait_req() and
> > another context in this CPU writes its data to the same dstmem and then…
> > 
> 
> This isn't true. Another thread on this cpu will be blocked by the mutex.
> It is impossible for two threads to write the same dstmem.
> If thread1 ran on cpu1, it held cpu1's mutex; if another thread wants to run on cpu1, it is blocked.
> If thread1 ran on cpu1 first, it held cpu1's mutex, then migrated to cpu2 (a very rare chance):
> 	a. if another thread wants to run on cpu1, it is blocked;

How is it blocked? That "struct crypto_acomp_ctx" is
"this_cpu_ptr(entry->pool->acomp_ctx)" - which is per-CPU of a pool,
and you can have multiple pools. But `dstmem' you have only one per-CPU,
no matter how many pools you have.
So pool1 on CPU1 uses the same `dstmem' as pool2 on CPU1, but pool1 and
pool2 on CPU1 use different mutexes for protection of this `dstmem'.

> Thanks
> Barry

Sebastian



* Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration
  2020-07-09  1:32   ` Song Bao Hua (Barry Song)
@ 2020-07-09  7:39     ` Sebastian Andrzej Siewior
  2020-07-09  7:55       ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-07-09  7:39 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: akpm, herbert, davem, linux-crypto, linux-mm, linux-kernel,
	Linuxarm, Luis Claudio R . Goncalves, Mahipal Challa,
	Seth Jennings, Dan Streetman, Vitaly Wool, Wangzhou (B),
	Colin Ian King

On 2020-07-09 01:32:38 [+0000], Song Bao Hua (Barry Song) wrote:
> > This looks like using the same synchronous mechanism around an asynchronous
> > interface. It works as a PoC.
> > 
> > As far as I remember the crypto async interface, the incoming skbs were fed to
> > the async interface and returned to the caller so the NIC could continue
> > allocating new RX skbs and move on. Only if the queue of requests was getting
> > too long did the code start to throttle. Eventually the async crypto code
> > completed the decryption operation in a different context and fed the
> > decrypted packet(s) into the stack.
> > 
> > From a quick view, you would have to return -EINPROGRESS here and have
> > something like this at the caller side:
> > 
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index e8726f3e3820b..9d1baa46ec3ed 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -252,12 +252,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >                 unlock_page(page);
> >                 goto out;
> >         }
> > -       if (frontswap_store(page) == 0) {
> > +       ret = frontswap_store(page);
> > +       if (ret == 0) {
> >                 set_page_writeback(page);
> >                 unlock_page(page);
> >                 end_page_writeback(page);
> >                 goto out;
> >         }
> > +       if (ret == -EINPROGRESS)
> > +               goto out;
> >         ret = __swap_writepage(page, wbc, end_swap_bio_write);
> >  out:
> >         return ret;
> > 
> Unfortunately, this is not true and things are not that simple.
> 
> We can't simply depend on -EINPROGRESS and go out.
> We have to wait for the result of the compression to decide whether we
> should do __swap_writepage(). One page might be compressed into two
> pages, and in that case we will still need to do __swap_writepage().
> As I replied in the latest email, all of the async improvements to frontswap
> need very careful thinking and benchmarking. They can only happen after
> we build the base in this patch, fixing the broken connection between
> zswap and those new zip drivers.

At the time the compression finishes, you see what happened and, based on
the design, you can either complete it immediately (the 0/error case from
above) or forward the result to the caller and let him decide.

> Thanks
> Barry

Sebastian



* RE: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration
  2020-07-09  7:39     ` Sebastian Andrzej Siewior
@ 2020-07-09  7:55       ` Song Bao Hua (Barry Song)
  2020-07-09  8:40         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 10+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-07-09  7:55 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: akpm, herbert, davem, linux-crypto, linux-mm, linux-kernel,
	Linuxarm, Luis Claudio R . Goncalves, Mahipal Challa,
	Seth Jennings, Dan Streetman, Vitaly Wool, Wangzhou (B),
	Colin Ian King



> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org
> [mailto:linux-crypto-owner@vger.kernel.org] On Behalf Of Sebastian Andrzej
> Siewior
> Sent: Thursday, July 9, 2020 7:39 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: akpm@linux-foundation.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; linux-crypto@vger.kernel.org; linux-mm@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; Luis Claudio
> R . Goncalves <lgoncalv@redhat.com>; Mahipal Challa
> <mahipalreddy2006@gmail.com>; Seth Jennings <sjenning@redhat.com>;
> Dan Streetman <ddstreet@ieee.org>; Vitaly Wool
> <vitaly.wool@konsulko.com>; Wangzhou (B) <wangzhou1@hisilicon.com>;
> Colin Ian King <colin.king@canonical.com>
> Subject: Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On 2020-07-09 01:32:38 [+0000], Song Bao Hua (Barry Song) wrote:
> > > This looks like using the same synchronous mechanism around an asynchronous
> > > interface. It works as a PoC.
> > >
> > > As far as I remember the crypto async interface, the incoming skbs were fed to
> > > the async interface and returned to the caller so the NIC could continue
> > > allocating new RX skbs and move on. Only if the queue of requests was getting
> > > too long did the code start to throttle. Eventually the async crypto code
> > > completed the decryption operation in a different context and fed the
> > > decrypted packet(s) into the stack.
> > >
> > > From a quick view, you would have to return -EINPROGRESS here and have
> > > something like this at the caller side:
> > >
> > > diff --git a/mm/page_io.c b/mm/page_io.c
> > > index e8726f3e3820b..9d1baa46ec3ed 100644
> > > --- a/mm/page_io.c
> > > +++ b/mm/page_io.c
> > > @@ -252,12 +252,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> > >                 unlock_page(page);
> > >                 goto out;
> > >         }
> > > -       if (frontswap_store(page) == 0) {
> > > +       ret = frontswap_store(page);
> > > +       if (ret == 0) {
> > >                 set_page_writeback(page);
> > >                 unlock_page(page);
> > >                 end_page_writeback(page);
> > >                 goto out;
> > >         }
> > > +       if (ret == -EINPROGRESS)
> > > +               goto out;
> > >         ret = __swap_writepage(page, wbc, end_swap_bio_write);
> > >  out:
> > >         return ret;
> > >
> > Unfortunately, this is not true and things are not that simple.
> >
> > We can't simply depend on -EINPROGRESS and go out.
> > We have to wait for the result of the compression to decide whether we
> > should do __swap_writepage(). One page might be compressed into two
> > pages, and in that case we will still need to do __swap_writepage().
> > As I replied in the latest email, all of the async improvements to frontswap
> > need very careful thinking and benchmarking. They can only happen after
> > we build the base in this patch, fixing the broken connection between
> > zswap and those new zip drivers.
> 
> At the time the compression finishes, you see what happened and, based on
> the design, you can either complete it immediately (the 0/error case from
> above) or forward the result to the caller and let him decide.

Hello Sebastian, thanks for your reply and careful review.

Right now, frontswap is pretty much a step that happens before __swap_writepage().
The whole design is built on the assumption that frontswap is synchronous: if frontswap
consumes a page without any error, the page won't go to __swap_writepage(),
which is async. On the other hand, if frontswap's store returns any error,
the page needs to be swapped to disk.

int swap_writepage(struct page *page, struct writeback_control *wbc)
{
	int ret = 0;

	if (try_to_free_swap(page)) {
		unlock_page(page);
		goto out;
	}
	if (frontswap_store(page) == 0) {
		set_page_writeback(page);
		unlock_page(page);
		end_page_writeback(page);
		goto out;
	}
	ret = __swap_writepage(page, wbc, end_swap_bio_write);
out:
	return ret;
}

I don't think we can simply "forward the result to the caller and let him decide".
Would you like to present some pseudo code?

Thanks
Barry



* Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration
  2020-07-09  7:55       ` Song Bao Hua (Barry Song)
@ 2020-07-09  8:40         ` Sebastian Andrzej Siewior
  2020-07-09  9:09           ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-07-09  8:40 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: akpm, herbert, davem, linux-crypto, linux-mm, linux-kernel,
	Linuxarm, Luis Claudio R . Goncalves, Mahipal Challa,
	Seth Jennings, Dan Streetman, Vitaly Wool, Wangzhou (B),
	Colin Ian King

On 2020-07-09 07:55:22 [+0000], Song Bao Hua (Barry Song) wrote:
> Hello Sebastian, thanks for your reply and careful review.
Hi,

> I don't think we can simply "forward the result to the caller and let him decide".
> Would you like to present some pseudo code?

I provided just some pseudo code to illustrate how the async
interface should look (more or less). The essential part is that you
allow feeding multiple requests without blocking.
I went up the call-chain and found one potential user which seems to have
a list of pages which are processed. That looked like a nice example. I
haven't looked at the details.

I have no opinion whether or not it makes sense to switch to the async
interface in a sync way.

> Thanks
> Barry

Sebastian



* RE: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration
  2020-07-09  8:40         ` Sebastian Andrzej Siewior
@ 2020-07-09  9:09           ` Song Bao Hua (Barry Song)
  0 siblings, 0 replies; 10+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-07-09  9:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: akpm, herbert, davem, linux-crypto, linux-mm, linux-kernel,
	Linuxarm, Luis Claudio R . Goncalves, Mahipal Challa,
	Seth Jennings, Dan Streetman, Vitaly Wool, Wangzhou (B),
	Colin Ian King



> -----Original Message-----
> From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
> Behalf Of Sebastian Andrzej Siewior
> Sent: Thursday, July 9, 2020 8:41 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: akpm@linux-foundation.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; linux-crypto@vger.kernel.org; linux-mm@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; Luis Claudio
> R . Goncalves <lgoncalv@redhat.com>; Mahipal Challa
> <mahipalreddy2006@gmail.com>; Seth Jennings <sjenning@redhat.com>;
> Dan Streetman <ddstreet@ieee.org>; Vitaly Wool
> <vitaly.wool@konsulko.com>; Wangzhou (B) <wangzhou1@hisilicon.com>;
> Colin Ian King <colin.king@canonical.com>
> Subject: Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On 2020-07-09 07:55:22 [+0000], Song Bao Hua (Barry Song) wrote:
> > Hello Sebastian, thanks for your reply and careful review.
> Hi,
> 
> > I don't think we can simply "forward the result to the caller and let him
> decide".
> > Would you like to present some pseudo code?
> 
> I provided just some pseudo code to illustrate how the async
> interface should look (more or less). The essential part is that
> you allow feeding multiple requests without blocking.

Sebastian, Do you mean the below code?

@@ -252,12 +252,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
                unlock_page(page);
                goto out;
        }
-       if (frontswap_store(page) == 0) {
+       ret = frontswap_store(page);
+       if (ret == 0) {
                set_page_writeback(page);
                unlock_page(page);
                end_page_writeback(page);
                goto out;
        }
+       if (ret == -EINPROGRESS)
+               goto out;
        ret = __swap_writepage(page, wbc, end_swap_bio_write);
 out:
        return ret;

I think this won't work. -EINPROGRESS alone can't tell us whether we should goto out. We can only goto out if the compression
has completed without any error. The error might come from the hardware, like an EIO, or because the data is not suitable for
compression. We only know the result after the compression is really done and the completion callback has been called by the
ZIP driver.

If the compression is still in progress, we don't know what will happen.

> I went up the call-chain and found one potential user which seems to have
> a list of pages which are processed. That looked like a nice example. I
> haven't looked at the details.
> 
> I have no opinion whether or not it makes sense to switch to the async
> interface in a sync way.

I always appreciate your comment and your opinion.

The real problem here is that all of those new zip drivers are adapted to the async interface. No legacy-interface support
has been mainlined for those new drivers in recent years. zswap doesn't work with those new drivers as they simply don't support
crypto_comp_compress()
crypto_comp_decompress()
...

So the initial goal of this patch is fixing the disconnected bridge between the new zip drivers and zswap.

Making frontswap async can probably happen if we see a performance improvement, but it seems to be a big project, not
that simple. On the other hand, it seems hisi_zip in drivers/crypto is the only async driver so far (sorry if I am missing
any); other drivers are adapted to the acomp APIs via the scomp APIs. For example:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b3

So even if we make frontswap totally async, most zip drivers are still sync and we won't get the benefit. From my perspective,
I am glad to try the possibility of making frontswap async to leverage the power of the ZIP hardware. This can probably
only happen after we have a base that supports the acomp APIs.

Thanks
Barry


* RE: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration
  2020-07-09  7:17     ` Sebastian Andrzej Siewior
@ 2020-07-09 12:14       ` Song Bao Hua (Barry Song)
  0 siblings, 0 replies; 10+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-07-09 12:14 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: akpm, herbert, davem, linux-crypto, linux-mm, linux-kernel,
	Linuxarm, Luis Claudio R . Goncalves, Mahipal Challa,
	Seth Jennings, Dan Streetman, Vitaly Wool, Wangzhou (B),
	Colin Ian King



> -----Original Message-----
> From: linux-crypto-owner@vger.kernel.org
> [mailto:linux-crypto-owner@vger.kernel.org] On Behalf Of Sebastian Andrzej
> Siewior
> Sent: Thursday, July 9, 2020 7:17 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: akpm@linux-foundation.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; linux-crypto@vger.kernel.org; linux-mm@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; Luis Claudio
> R . Goncalves <lgoncalv@redhat.com>; Mahipal Challa
> <mahipalreddy2006@gmail.com>; Seth Jennings <sjenning@redhat.com>;
> Dan Streetman <ddstreet@ieee.org>; Vitaly Wool
> <vitaly.wool@konsulko.com>; Wangzhou (B) <wangzhou1@hisilicon.com>;
> Colin Ian King <colin.king@canonical.com>
> Subject: Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On 2020-07-08 21:45:47 [+0000], Song Bao Hua (Barry Song) wrote:
> > > On 2020-07-08 00:52:10 [+1200], Barry Song wrote:
> > > > @@ -127,9 +129,17 @@
> > > > +struct crypto_acomp_ctx {
> > > > +	struct crypto_acomp *acomp;
> > > > +	struct acomp_req *req;
> > > > +	struct crypto_wait wait;
> > > > +	u8 *dstmem;
> > > > +	struct mutex mutex;
> > > > +};
> > > …
> > > > @@ -1074,12 +1138,32 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > >  	}
> > > >
> > > >  	/* compress */
> > > > -	dst = get_cpu_var(zswap_dstmem);
> > > > -	tfm = *get_cpu_ptr(entry->pool->tfm);
> > > > -	src = kmap_atomic(page);
> > > > -	ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
> > > > -	kunmap_atomic(src);
> > > > -	put_cpu_ptr(entry->pool->tfm);
> > > > +	acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
> > > > +
> > > > +	mutex_lock(&acomp_ctx->mutex);
> > > > +
> > > > +	src = kmap(page);
> > > > +	dst = acomp_ctx->dstmem;
> > >
> > > that mutex is per-CPU, per-context. The dstmem pointer is per-CPU.
> > > So if I read this right, you can get preempted after
> > > crypto_wait_req() and another context in this CPU writes its data to
> > > the same dstmem and then…
> > >
> >
> > This isn't true. Another thread on this cpu will be blocked by the mutex.
> > It is impossible for two threads to write the same dstmem.
> > If thread1 ran on cpu1, it held cpu1's mutex; if another thread wants to run
> > on cpu1, it is blocked.
> > If thread1 ran on cpu1 first, it held cpu1's mutex, then migrated to cpu2
> > (a very rare chance):
> > 	a. if another thread wants to run on cpu1, it is blocked;
> 
> How is it blocked? That "struct crypto_acomp_ctx" is
> "this_cpu_ptr(entry->pool->acomp_ctx)" - which is per-CPU of a pool,
> and you can have multiple pools. But `dstmem' you have only one per-CPU,
> no matter how many pools you have.
> So pool1 on CPU1 uses the same `dstmem' as pool2 on CPU1, but pool1 and
> pool2 on CPU1 use different mutexes for protection of this `dstmem'.

Good catch, Sebastian, thanks!
This is a corner case that testing has not encountered yet. There is a race if we change the pool type at runtime.
Typically, a group of initial parameters was set, then software wrote/read lots of anon pages to generate
swapping as busy as possible, but we never tried to change the compressor/pool type at runtime.

I will address this problem in v5 with a cleanup of the acomp_ctx pointer in zswap_pool. I mean to
create acomp instances per-CPU, not per (pools * per-CPU).

Thanks
Barry


end of thread, other threads:[~2020-07-09 12:14 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-07 12:52 [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration Barry Song
2020-07-08 14:59 ` Sebastian Andrzej Siewior
2020-07-08 21:45   ` Song Bao Hua (Barry Song)
2020-07-09  7:17     ` Sebastian Andrzej Siewior
2020-07-09 12:14       ` Song Bao Hua (Barry Song)
2020-07-09  1:32   ` Song Bao Hua (Barry Song)
2020-07-09  7:39     ` Sebastian Andrzej Siewior
2020-07-09  7:55       ` Song Bao Hua (Barry Song)
2020-07-09  8:40         ` Sebastian Andrzej Siewior
2020-07-09  9:09           ` Song Bao Hua (Barry Song)
