linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/1] dm-crypt excessive overhead
@ 2020-06-19 16:41 Ignat Korchagin
  2020-06-19 16:41 ` [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target Ignat Korchagin
  2020-06-19 16:55 ` [RFC PATCH 0/1] dm-crypt excessive overhead Mike Snitzer
  0 siblings, 2 replies; 31+ messages in thread
From: Ignat Korchagin @ 2020-06-19 16:41 UTC (permalink / raw)
  To: agk, snitzer, dm-devel, dm-crypt, linux-kernel
  Cc: Ignat Korchagin, kernel-team

This is a follow-up to the long-forgotten [1], but with some more convincing
evidence. Consider the following script:

#!/bin/bash -e

# create 4G ramdisk
sudo modprobe brd rd_nr=1 rd_size=4194304

# create a dm-crypt device with NULL cipher on top of /dev/ram0
echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup create eram0

# create a dm-crypt device with NULL cipher and custom force_inline flag
echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | sudo dmsetup create inline-eram0

# read all data from /dev/ram0
sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum

# read the same data from /dev/mapper/eram0
sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum

# read the same data from /dev/mapper/inline-eram0
sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum

This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
for "encryption"). The first instance is the current dm-crypt implementation from
5.8-rc1, the second is a dm-crypt instance with the custom new flag from the
patch attached to this thread enabled. On my VM (Debian in VirtualBox with 4 cores
of a 2.8 GHz quad-core Intel Core i7) I get the following output (formatted for
better readability):

# plain ram0
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -

# eram0 (current dm-crypt)
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -

# inline-eram0 (patched dm-crypt)
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -

As we can see, the current dm-crypt implementation creates a significant IO
performance overhead (at least for small IO block sizes) for both latency and
throughput. We suspect that offloading IO request processing into workqueues and
async threads is more harmful these days with modern fast storage. I also
did some digging into the dm-crypt git history, and much of this async processing
is not needed anymore, because the reasons it was added are mostly gone from the
kernel. More details can be found in [2] (see the "Git archeology" section).
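
The dd numbers above only capture throughput. For the latency side of the
claim, a minimal userspace sketch along the following lines (not part of the
patch; it assumes gcc -O2 and the /dev/mapper/eram0 and
/dev/mapper/inline-eram0 devices created by the script above) can be pointed
at each device to compare per-read latency:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK 4096
#define READS 100000

int main(int argc, char **argv)
{
	struct timespec a, b;
	double us, total = 0.0, worst = 0.0;
	void *buf;
	long i;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, BLOCK, BLOCK)) {
		fprintf(stderr, "allocation failed\n");
		return 1;
	}
	for (i = 0; i < READS; i++) {
		clock_gettime(CLOCK_MONOTONIC, &a);
		/* 4k O_DIRECT reads within the first 4M of the device */
		if (pread(fd, buf, BLOCK, (i % 1024) * BLOCK) != BLOCK) {
			perror("pread");
			return 1;
		}
		clock_gettime(CLOCK_MONOTONIC, &b);
		us = (b.tv_sec - a.tv_sec) * 1e6 +
		     (b.tv_nsec - a.tv_nsec) / 1e3;
		total += us;
		if (us > worst)
			worst = us;
	}
	printf("%s: avg %.2f us, worst %.2f us over %d reads\n",
	       argv[1], total / READS, worst, READS);
	return 0;
}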

We have been running the attached patch on different hardware generations in
more than 200 datacentres on both SATA and NVME SSDs and have so far been very
happy with the performance benefits.

[1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
[2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/

Ignat Korchagin (1):
  Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target

 drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 12 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
  2020-06-19 16:41 [RFC PATCH 0/1] dm-crypt excessive overhead Ignat Korchagin
@ 2020-06-19 16:41 ` Ignat Korchagin
  2020-06-24  5:04   ` [dm-crypt] " Eric Biggers
  2020-06-24  5:12   ` [dm-devel] " Damien Le Moal
  2020-06-19 16:55 ` [RFC PATCH 0/1] dm-crypt excessive overhead Mike Snitzer
  1 sibling, 2 replies; 31+ messages in thread
From: Ignat Korchagin @ 2020-06-19 16:41 UTC (permalink / raw)
  To: agk, snitzer, dm-devel, dm-crypt, linux-kernel
  Cc: Ignat Korchagin, kernel-team

Sometimes the extra thread offloading imposed by dm-crypt hurts IO latency. This is
especially visible on busy systems with many processes/threads. Moreover, most
Crypto API implementations are async, that is, they offload crypto operations on
their own, so this dm-crypt offloading is excessive.

This adds a new flag, which directs dm-crypt not to offload crypto operations
and to process everything inline. For cases where crypto operations cannot happen
inline (hard interrupt context, for example the read path of the NVME driver),
we offload the work to a tasklet rather than a workqueue.

Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
---
 drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 12 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 000ddfab5ba0..5a9bac4fdffb 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -69,6 +69,7 @@ struct dm_crypt_io {
 	u8 *integrity_metadata;
 	bool integrity_metadata_from_pool;
 	struct work_struct work;
+	struct tasklet_struct tasklet;
 
 	struct convert_context ctx;
 
@@ -127,7 +128,7 @@ struct iv_elephant_private {
  * and encrypts / decrypts at the same time.
  */
 enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
-	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
+	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = (sizeof(unsigned long) * 8 - 1) };
 
 enum cipher_flags {
 	CRYPT_MODE_INTEGRITY_AEAD,	/* Use authenticated mode for cihper */
@@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
 
 	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
 
-	/*
-	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
-	 * requests if driver request queue is full.
-	 */
-	skcipher_request_set_callback(ctx->r.req,
-	    CRYPTO_TFM_REQ_MAY_BACKLOG,
-	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
+	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
+		/* make sure we zero important fields of the request */
+		skcipher_request_set_callback(ctx->r.req,
+	        0, NULL, NULL);
+	else
+		/*
+		 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
+		 * requests if driver request queue is full.
+		 */
+		skcipher_request_set_callback(ctx->r.req,
+	        CRYPTO_TFM_REQ_MAY_BACKLOG,
+	        kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
 }
 
 static void crypt_alloc_req_aead(struct crypt_config *cc,
@@ -1566,7 +1572,8 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
 			atomic_dec(&ctx->cc_pending);
 			ctx->cc_sector += sector_step;
 			tag_offset++;
-			cond_resched();
+			if (!test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
+				cond_resched();
 			continue;
 		/*
 		 * There was a data integrity error.
@@ -1892,6 +1899,11 @@ static void kcryptd_crypt_write_io_submit(struct dm_crypt_io *io, int async)
 
 	clone->bi_iter.bi_sector = cc->start + io->sector;
 
+	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
+		generic_make_request(clone);
+		return;
+	}
+
 	if (likely(!async) && test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags)) {
 		generic_make_request(clone);
 		return;
@@ -2031,12 +2043,26 @@ static void kcryptd_crypt(struct work_struct *work)
 		kcryptd_crypt_write_convert(io);
 }
 
+static void kcryptd_crypt_tasklet(unsigned long work)
+{
+	kcryptd_crypt((struct work_struct *)work);
+}
+
 static void kcryptd_queue_crypt(struct dm_crypt_io *io)
 {
 	struct crypt_config *cc = io->cc;
 
-	INIT_WORK(&io->work, kcryptd_crypt);
-	queue_work(cc->crypt_queue, &io->work);
+	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
+		if (in_irq()) {
+			/* Crypto API will fail hard in hard IRQ context */
+			tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work);
+			tasklet_schedule(&io->tasklet);
+		} else
+			kcryptd_crypt(&io->work);
+	} else {
+		INIT_WORK(&io->work, kcryptd_crypt);
+		queue_work(cc->crypt_queue, &io->work);
+	}
 }
 
 static void crypt_free_tfms_aead(struct crypt_config *cc)
@@ -2838,7 +2864,7 @@ static int crypt_ctr_optional(struct dm_target *ti, unsigned int argc, char **ar
 	struct crypt_config *cc = ti->private;
 	struct dm_arg_set as;
 	static const struct dm_arg _args[] = {
-		{0, 6, "Invalid number of feature args"},
+		{0, 7, "Invalid number of feature args"},
 	};
 	unsigned int opt_params, val;
 	const char *opt_string, *sval;
@@ -2868,6 +2894,8 @@ static int crypt_ctr_optional(struct dm_target *ti, unsigned int argc, char **ar
 
 		else if (!strcasecmp(opt_string, "submit_from_crypt_cpus"))
 			set_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags);
+		else if (!strcasecmp(opt_string, "force_inline"))
+			set_bit(DM_CRYPT_FORCE_INLINE, &cc->flags);
 		else if (sscanf(opt_string, "integrity:%u:", &val) == 1) {
 			if (val == 0 || val > MAX_TAG_SIZE) {
 				ti->error = "Invalid integrity arguments";
@@ -3196,6 +3224,7 @@ static void crypt_status(struct dm_target *ti, status_type_t type,
 		num_feature_args += !!ti->num_discard_bios;
 		num_feature_args += test_bit(DM_CRYPT_SAME_CPU, &cc->flags);
 		num_feature_args += test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags);
+		num_feature_args += test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags);
 		num_feature_args += cc->sector_size != (1 << SECTOR_SHIFT);
 		num_feature_args += test_bit(CRYPT_IV_LARGE_SECTORS, &cc->cipher_flags);
 		if (cc->on_disk_tag_size)
@@ -3208,6 +3237,8 @@ static void crypt_status(struct dm_target *ti, status_type_t type,
 				DMEMIT(" same_cpu_crypt");
 			if (test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags))
 				DMEMIT(" submit_from_crypt_cpus");
+			if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
+				DMEMIT(" force_inline");
 			if (cc->on_disk_tag_size)
 				DMEMIT(" integrity:%u:%s", cc->on_disk_tag_size, cc->cipher_auth);
 			if (cc->sector_size != (1 << SECTOR_SHIFT))
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-19 16:41 [RFC PATCH 0/1] dm-crypt excessive overhead Ignat Korchagin
  2020-06-19 16:41 ` [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target Ignat Korchagin
@ 2020-06-19 16:55 ` Mike Snitzer
  2020-06-19 18:39   ` Mikulas Patocka
  2020-06-22  0:45   ` [dm-devel] " Damien Le Moal
  1 sibling, 2 replies; 31+ messages in thread
From: Mike Snitzer @ 2020-06-19 16:55 UTC (permalink / raw)
  To: Ignat Korchagin, Mikulas Patocka
  Cc: agk, dm-devel, dm-crypt, linux-kernel, kernel-team

On Fri, Jun 19 2020 at 12:41pm -0400,
Ignat Korchagin <ignat@cloudflare.com> wrote:

> This is a follow up from the long-forgotten [1], but with some more convincing
> evidence. Consider the following script:
> 
> #!/bin/bash -e
> 
> # create 4G ramdisk
> sudo modprobe brd rd_nr=1 rd_size=4194304
> 
> # create a dm-crypt device with NULL cipher on top of /dev/ram0
> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup create eram0
> 
> # create a dm-crypt device with NULL cipher and custom force_inline flag
> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | sudo dmsetup create inline-eram0
> 
> # read all data from /dev/ram0
> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
> 
> # read the same data from /dev/mapper/eram0
> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
> 
> # read the same data from /dev/mapper/inline-eram0
> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
> 
> This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
> two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
> to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
> for "encyption"). The first instance is the current dm-crypt implementation from
> 5.8-rc1, the second is the dm-crypt instance with a custom new flag enabled from
> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 cores
> on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for
> better readability):
> 
> # plain ram0
> 1048576+0 records in
> 1048576+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> 
> # eram0 (current dm-crypt)
> 1048576+0 records in
> 1048576+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> 
> # inline-eram0 (patched dm-crypt)
> 1048576+0 records in
> 1048576+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> 
> As we can see, current dm-crypt implementation creates a significant IO
> performance overhead (at least on small IO block sizes) for both latency and
> throughput. We suspect offloading IO request processing into workqueues and
> async threads is more harmful these days with the modern fast storage. I also
> did some digging into the dm-crypt git history and much of this async processing
> is not needed anymore, because the reasons it was added are mostly gone from the
> kernel. More details can be found in [2] (see "Git archeology" section).
> 
> We have been running the attached patch on different hardware generations in
> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were very
> happy with the performance benefits.
> 
> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> 
> Ignat Korchagin (1):
>   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
> 
>  drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
>  1 file changed, 43 insertions(+), 12 deletions(-)
> 
> -- 
> 2.20.1
> 

Hi,

I saw [2] and have been expecting something from cloudflare ever since.
Nice to see this submission.

There is useful context in your 0th patch header.  I'll likely merge
parts of this patch header with the more terse 1/1 header (in reality
there only needed to be a single patch submission).

Will review and stage accordingly if all looks fine to me.  Mikulas,
please have a look too.

Thanks,
Mike


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-19 16:55 ` [RFC PATCH 0/1] dm-crypt excessive overhead Mike Snitzer
@ 2020-06-19 18:39   ` Mikulas Patocka
  2020-06-19 19:44     ` Ignat Korchagin
  2020-06-20  1:23     ` Herbert Xu
  2020-06-22  0:45   ` [dm-devel] " Damien Le Moal
  1 sibling, 2 replies; 31+ messages in thread
From: Mikulas Patocka @ 2020-06-19 18:39 UTC (permalink / raw)
  To: Ignat Korchagin, Herbert Xu, David S. Miller
  Cc: Mike Snitzer, agk, dm-devel, dm-crypt, linux-kernel, kernel-team



On Fri, 19 Jun 2020, Mike Snitzer wrote:

> On Fri, Jun 19 2020 at 12:41pm -0400,
> Ignat Korchagin <ignat@cloudflare.com> wrote:
> 
> > This is a follow up from the long-forgotten [1], but with some more convincing
> > evidence. Consider the following script:
> > 
> > [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> > [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> > 
> > Ignat Korchagin (1):
> >   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
> > 
> >  drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
> >  1 file changed, 43 insertions(+), 12 deletions(-)
> > 
> > -- 
> > 2.20.1
> > 
> 
> Hi,
> 
> I saw [2] and have been expecting something from cloudflare ever since.
> Nice to see this submission.
> 
> There is useful context in your 0th patch header.  I'll likely merge
> parts of this patch header with the more terse 1/1 header (reality is
> there only needed to be a single patch submission).
> 
> Will review and stage accordingly if all looks fine to me.  Mikulas,
> please have a look too.
> 
> Thanks,
> Mike

+       if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
+               if (in_irq()) {
+                       /* Crypto API will fail hard in hard IRQ context */
+                       tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work);
+                       tasklet_schedule(&io->tasklet);
+               } else
+                       kcryptd_crypt(&io->work);
+       } else {
+               INIT_WORK(&io->work, kcryptd_crypt);
+               queue_work(cc->crypt_queue, &io->work);
+       }

I'm looking at this and I'd like to know why the crypto API fails in
hard-irq context and why it works in tasklet context. What's the exact
reason behind this?

Mikulas


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-19 18:39   ` Mikulas Patocka
@ 2020-06-19 19:44     ` Ignat Korchagin
  2020-06-20  1:23     ` Herbert Xu
  1 sibling, 0 replies; 31+ messages in thread
From: Ignat Korchagin @ 2020-06-19 19:44 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Herbert Xu, David S. Miller, Mike Snitzer, agk, dm-devel,
	dm-crypt, linux-kernel, kernel-team

On Fri, Jun 19, 2020 at 7:39 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
>
> On Fri, 19 Jun 2020, Mike Snitzer wrote:
>
> > On Fri, Jun 19 2020 at 12:41pm -0400,
> > Ignat Korchagin <ignat@cloudflare.com> wrote:
> >
> > > This is a follow up from the long-forgotten [1], but with some more convincing
> > > evidence. Consider the following script:
> > >
> > > [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> > > [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> > >
> > > Ignat Korchagin (1):
> > >   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
> > >
> > >  drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
> > >  1 file changed, 43 insertions(+), 12 deletions(-)
> > >
> > > --
> > > 2.20.1
> > >
> >
> > Hi,
> >
> > I saw [2] and have been expecting something from cloudflare ever since.
> > Nice to see this submission.
> >
> > There is useful context in your 0th patch header.  I'll likely merge
> > parts of this patch header with the more terse 1/1 header (reality is
> > there only needed to be a single patch submission).
> >
> > Will review and stage accordingly if all looks fine to me.  Mikulas,
> > please have a look too.
> >
> > Thanks,
> > Mike
>
> +       if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
> +               if (in_irq()) {
> +                       /* Crypto API will fail hard in hard IRQ context */
> +                       tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work);
> +                       tasklet_schedule(&io->tasklet);
> +               } else
> +                       kcryptd_crypt(&io->work);
> +       } else {
> +               INIT_WORK(&io->work, kcryptd_crypt);
> +               queue_work(cc->crypt_queue, &io->work);
> +       }
>
> I'm looking at this and I'd like to know why does the crypto API fail in
> hard-irq context and why does it work in tasklet context. What's the exact
> reason behind this?

This comes from
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/crypto/skcipher.c?id=5e857ce6eae7ca21b2055cca4885545e29228fe2#n433
And, I believe, it is there for the same reason kcryptd was introduced
into dm-crypt in 2005:
"...because it would be very unwise to do decryption in an interrupt
context..." (that is, when other interrupts are disabled). In a tasklet,
however, we can still service other interrupts even if we process a
large block.
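
For reference, the check behind that link is essentially (paraphrased
from memory, see the link above for the exact code):

	if (WARN_ON_ONCE(in_irq()))
		return -EDEADLK;

so the skcipher walk refuses to even start in hard IRQ context. A tasklet
runs in softirq context, where in_irq() is false and interrupts are enabled
again, so the same code proceeds normally there.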

>
>
> Mikulas

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-19 18:39   ` Mikulas Patocka
  2020-06-19 19:44     ` Ignat Korchagin
@ 2020-06-20  1:23     ` Herbert Xu
  2020-06-20 19:36       ` Mikulas Patocka
  2020-06-23 15:33       ` Mike Snitzer
  1 sibling, 2 replies; 31+ messages in thread
From: Herbert Xu @ 2020-06-20  1:23 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Ignat Korchagin, David S. Miller, Mike Snitzer, agk, dm-devel,
	dm-crypt, linux-kernel, kernel-team

On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote:
>
> I'm looking at this and I'd like to know why does the crypto API fail in 
> hard-irq context and why does it work in tasklet context. What's the exact 
> reason behind this?

You're not supposed to do any real work in IRQ handlers.  All
the substantial work should be postponed to softirq context.

Why do you need to do work in hard IRQ context?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-20  1:23     ` Herbert Xu
@ 2020-06-20 19:36       ` Mikulas Patocka
  2020-06-20 21:02         ` Ignat Korchagin
  2020-06-23 15:33       ` Mike Snitzer
  1 sibling, 1 reply; 31+ messages in thread
From: Mikulas Patocka @ 2020-06-20 19:36 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Ignat Korchagin, David S. Miller, Mike Snitzer, agk, dm-devel,
	dm-crypt, linux-kernel, kernel-team



On Sat, 20 Jun 2020, Herbert Xu wrote:

> On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote:
> >
> > I'm looking at this and I'd like to know why does the crypto API fail in 
> > hard-irq context and why does it work in tasklet context. What's the exact 
> > reason behind this?
> 
> You're not supposed to do any real work in IRQ handlers.  All
> the substantial work should be postponed to softirq context.

I see.

BTW - should it also warn if it is running in a process context with 
interrupts disabled?

Mikulas

> Why do you need to do work in hard IRQ context?
> 
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-20 19:36       ` Mikulas Patocka
@ 2020-06-20 21:02         ` Ignat Korchagin
  0 siblings, 0 replies; 31+ messages in thread
From: Ignat Korchagin @ 2020-06-20 21:02 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Herbert Xu, David S. Miller, Mike Snitzer, agk, dm-devel,
	dm-crypt, linux-kernel, kernel-team

Yes, it should.

I got one when I was testing the first iteration (without the tasklet)
of the patch on an NVMe disk.

On Sat, Jun 20, 2020 at 8:36 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
>
> On Sat, 20 Jun 2020, Herbert Xu wrote:
>
> > On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote:
> > >
> > > I'm looking at this and I'd like to know why does the crypto API fail in
> > > hard-irq context and why does it work in tasklet context. What's the exact
> > > reason behind this?
> >
> > You're not supposed to do any real work in IRQ handlers.  All
> > the substantial work should be postponed to softirq context.
>
> I see.
>
> BTW - should it also warn if it is running in a process context with
> interrupts disabled?
>
> Mikulas
>
> > Why do you need to do work in hard IRQ context?
> >
> > Cheers,
> > --
> > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> >
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-19 16:55 ` [RFC PATCH 0/1] dm-crypt excessive overhead Mike Snitzer
  2020-06-19 18:39   ` Mikulas Patocka
@ 2020-06-22  0:45   ` Damien Le Moal
  2020-06-22  7:55     ` Ignat Korchagin
  2020-06-23 15:01     ` Mike Snitzer
  1 sibling, 2 replies; 31+ messages in thread
From: Damien Le Moal @ 2020-06-22  0:45 UTC (permalink / raw)
  To: Mike Snitzer, Ignat Korchagin, Mikulas Patocka
  Cc: dm-crypt, dm-devel, linux-kernel, agk, kernel-team

On 2020/06/20 1:56, Mike Snitzer wrote:
> On Fri, Jun 19 2020 at 12:41pm -0400,
> Ignat Korchagin <ignat@cloudflare.com> wrote:
> 
>> This is a follow up from the long-forgotten [1], but with some more convincing
>> evidence. Consider the following script:
>>
>> #!/bin/bash -e
>>
>> # create 4G ramdisk
>> sudo modprobe brd rd_nr=1 rd_size=4194304
>>
>> # create a dm-crypt device with NULL cipher on top of /dev/ram0
>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup create eram0
>>
>> # create a dm-crypt device with NULL cipher and custom force_inline flag
>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | sudo dmsetup create inline-eram0
>>
>> # read all data from /dev/ram0
>> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
>>
>> # read the same data from /dev/mapper/eram0
>> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
>>
>> # read the same data from /dev/mapper/inline-eram0
>> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
>>
>> This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
>> two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
>> to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
>> for "encyption"). The first instance is the current dm-crypt implementation from
>> 5.8-rc1, the second is the dm-crypt instance with a custom new flag enabled from
>> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 cores
>> on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for
>> better readability):
>>
>> # plain ram0
>> 1048576+0 records in
>> 1048576+0 records out
>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>
>> # eram0 (current dm-crypt)
>> 1048576+0 records in
>> 1048576+0 records out
>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>
>> # inline-eram0 (patched dm-crypt)
>> 1048576+0 records in
>> 1048576+0 records out
>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>
>> As we can see, current dm-crypt implementation creates a significant IO
>> performance overhead (at least on small IO block sizes) for both latency and
>> throughput. We suspect offloading IO request processing into workqueues and
>> async threads is more harmful these days with the modern fast storage. I also
>> did some digging into the dm-crypt git history and much of this async processing
>> is not needed anymore, because the reasons it was added are mostly gone from the
>> kernel. More details can be found in [2] (see "Git archeology" section).
>>
>> We have been running the attached patch on different hardware generations in
>> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were very
>> happy with the performance benefits.
>>
>> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
>> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
>>
>> Ignat Korchagin (1):
>>   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
>>
>>  drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
>>  1 file changed, 43 insertions(+), 12 deletions(-)
>>
>> -- 
>> 2.20.1
>>
> 
> Hi,
> 
> I saw [2] and have been expecting something from cloudflare ever since.
> Nice to see this submission.
> 
> There is useful context in your 0th patch header.  I'll likely merge
> parts of this patch header with the more terse 1/1 header (reality is
> there only needed to be a single patch submission).
> 
> Will review and stage accordingly if all looks fine to me.  Mikulas,
> please have a look too.

Very timely: I was about to send a couple of patches to add zoned block device
support to dm-crypt :)

I used the work in [1] as a base to have all _write_ requests be processed inline in
the submitter context so that the submission order is preserved, avoiding the
potential reordering of sequential writes that the normal workqueue-based
processing can generate. This inline processing is done only for writes. Reads
are unaffected.

To do this, I added an "inline_io" flag to struct convert_context which is
initialized in crypt_io_init() based on the BIO op.
kcryptd_crypt_write_io_submit() then uses this flag to call
generic_make_request() directly if inline_io is true.
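
A rough sketch of what that amounts to (inferred from the description above;
this is not the actual zoned-support patch):

	/* in crypt_io_init(): writes bypass the offload workqueue so
	 * their submission order is preserved */
	io->ctx.inline_io = op_is_write(bio_op(bio));

	/* and in kcryptd_crypt_write_io_submit(), ahead of the existing
	 * DM_CRYPT_NO_OFFLOAD check */
	if (io->ctx.inline_io) {
		generic_make_request(clone);
		return;
	}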

This simplifies things compared to [1] since reads can still be processed as is,
so there are no issues with irq context and no need for a tasklet.

Should I send these patches as an RFC to see what can be merged? Or I can wait
for these patches to land and rebase on top.

> 
> Thanks,
> Mike
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-22  0:45   ` [dm-devel] " Damien Le Moal
@ 2020-06-22  7:55     ` Ignat Korchagin
  2020-06-22  8:08       ` Damien Le Moal
  2020-06-23 15:01     ` Mike Snitzer
  1 sibling, 1 reply; 31+ messages in thread
From: Ignat Korchagin @ 2020-06-22  7:55 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Mike Snitzer, Mikulas Patocka, dm-crypt, dm-devel, linux-kernel,
	agk, kernel-team

On Mon, Jun 22, 2020 at 1:45 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
> On 2020/06/20 1:56, Mike Snitzer wrote:
> > On Fri, Jun 19 2020 at 12:41pm -0400,
> > Ignat Korchagin <ignat@cloudflare.com> wrote:
> >
> >> This is a follow up from the long-forgotten [1], but with some more convincing
> >> evidence. Consider the following script:
> >>
> >> #!/bin/bash -e
> >>
> >> # create 4G ramdisk
> >> sudo modprobe brd rd_nr=1 rd_size=4194304
> >>
> >> # create a dm-crypt device with NULL cipher on top of /dev/ram0
> >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup create eram0
> >>
> >> # create a dm-crypt device with NULL cipher and custom force_inline flag
> >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | sudo dmsetup create inline-eram0
> >>
> >> # read all data from /dev/ram0
> >> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
> >>
> >> # read the same data from /dev/mapper/eram0
> >> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
> >>
> >> # read the same data from /dev/mapper/inline-eram0
> >> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
> >>
> >> This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
> >> two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
> >> to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
> >> for "encyption"). The first instance is the current dm-crypt implementation from
> >> 5.8-rc1, the second is the dm-crypt instance with a custom new flag enabled from
> >> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 cores
> >> on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for
> >> better readability):
> >>
> >> # plain ram0
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> # eram0 (current dm-crypt)
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> # inline-eram0 (patched dm-crypt)
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> As we can see, current dm-crypt implementation creates a significant IO
> >> performance overhead (at least on small IO block sizes) for both latency and
> >> throughput. We suspect offloading IO request processing into workqueues and
> >> async threads is more harmful these days with the modern fast storage. I also
> >> did some digging into the dm-crypt git history and much of this async processing
> >> is not needed anymore, because the reasons it was added are mostly gone from the
> >> kernel. More details can be found in [2] (see "Git archeology" section).
> >>
> >> We have been running the attached patch on different hardware generations in
> >> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were very
> >> happy with the performance benefits.
> >>
> >> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> >> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> >>
> >> Ignat Korchagin (1):
> >>   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
> >>
> >>  drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
> >>  1 file changed, 43 insertions(+), 12 deletions(-)
> >>
> >> --
> >> 2.20.1
> >>
> >
> > Hi,
> >
> > I saw [2] and have been expecting something from cloudflare ever since.
> > Nice to see this submission.
> >
> > There is useful context in your 0th patch header.  I'll likely merge
> > parts of this patch header with the more terse 1/1 header (reality is
> > there only needed to be a single patch submission).
> >
> > Will review and stage accordingly if all looks fine to me.  Mikulas,
> > please have a look too.
>
> Very timely: I was about to send a couple of patches to add zoned block device
> support to dm-crypt :)
>
> I used [1] work as a base to have all _write_ requests be processed inline in
> the submitter context so that the submission order is preserved, avoiding the
> potential reordering of sequential writes that the normal workqueue based
> processing can generate. This inline processing is done only for writes. Reads
> are unaffected.
>
> To do this, I added a "inline_io" flag to struct convert_context which is
> initialized in crypt_io_init() based on the BIO op.
> kcryptd_crypt_write_io_submit() then uses this flag to call directly
> generic_make_request() if inline_io is true.
>
> This simplifies things compared to [1] since reads can still be processed as is,
> so there are no issued with irq context and no need for a tasklet.

In one of our major IO workflows (CDN cache), using dm-crypt created
high and spiky p99 response times, which is what actually prompted this
investigation. So, if anything, we prefer the read path to be
inlined even more than the write path.

> Should I send these patches as RFC to see what can be merged ? Or I can wait for
> these patches and rebase on top.
>
> >
> > Thanks,
> > Mike
> >
> > --
> > dm-devel mailing list
> > dm-devel@redhat.com
> > https://www.redhat.com/mailman/listinfo/dm-devel
> >
> >
>
>
> --
> Damien Le Moal
> Western Digital Research

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-22  7:55     ` Ignat Korchagin
@ 2020-06-22  8:08       ` Damien Le Moal
  0 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2020-06-22  8:08 UTC (permalink / raw)
  To: Ignat Korchagin
  Cc: Mike Snitzer, Mikulas Patocka, dm-crypt, dm-devel, linux-kernel,
	agk, kernel-team

On 2020/06/22 16:55, Ignat Korchagin wrote:
> On Mon, Jun 22, 2020 at 1:45 AM Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>>
>> On 2020/06/20 1:56, Mike Snitzer wrote:
>>> On Fri, Jun 19 2020 at 12:41pm -0400,
>>> Ignat Korchagin <ignat@cloudflare.com> wrote:
>>>
>>>> This is a follow up from the long-forgotten [1], but with some more convincing
>>>> evidence. Consider the following script:
>>>>
>>>> #!/bin/bash -e
>>>>
>>>> # create 4G ramdisk
>>>> sudo modprobe brd rd_nr=1 rd_size=4194304
>>>>
>>>> # create a dm-crypt device with NULL cipher on top of /dev/ram0
>>>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup create eram0
>>>>
>>>> # create a dm-crypt device with NULL cipher and custom force_inline flag
>>>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | sudo dmsetup create inline-eram0
>>>>
>>>> # read all data from /dev/ram0
>>>> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
>>>>
>>>> # read the same data from /dev/mapper/eram0
>>>> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
>>>>
>>>> # read the same data from /dev/mapper/inline-eram0
>>>> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
>>>>
>>>> This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
>>>> two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
>>>> to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
>>>> for "encyption"). The first instance is the current dm-crypt implementation from
>>>> 5.8-rc1, the second is the dm-crypt instance with a custom new flag enabled from
>>>> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 cores
>>>> on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for
>>>> better readability):
>>>>
>>>> # plain ram0
>>>> 1048576+0 records in
>>>> 1048576+0 records out
>>>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
>>>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>>>
>>>> # eram0 (current dm-crypt)
>>>> 1048576+0 records in
>>>> 1048576+0 records out
>>>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
>>>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>>>
>>>> # inline-eram0 (patched dm-crypt)
>>>> 1048576+0 records in
>>>> 1048576+0 records out
>>>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
>>>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>>>
>>>> As we can see, current dm-crypt implementation creates a significant IO
>>>> performance overhead (at least on small IO block sizes) for both latency and
>>>> throughput. We suspect offloading IO request processing into workqueues and
>>>> async threads is more harmful these days with the modern fast storage. I also
>>>> did some digging into the dm-crypt git history and much of this async processing
>>>> is not needed anymore, because the reasons it was added are mostly gone from the
>>>> kernel. More details can be found in [2] (see "Git archeology" section).
>>>>
>>>> We have been running the attached patch on different hardware generations in
>>>> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were very
>>>> happy with the performance benefits.
>>>>
>>>> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
>>>> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
>>>>
>>>> Ignat Korchagin (1):
>>>>   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
>>>>
>>>>  drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
>>>>  1 file changed, 43 insertions(+), 12 deletions(-)
>>>>
>>>> --
>>>> 2.20.1
>>>>
>>>
>>> Hi,
>>>
>>> I saw [2] and have been expecting something from cloudflare ever since.
>>> Nice to see this submission.
>>>
>>> There is useful context in your 0th patch header.  I'll likely merge
>>> parts of this patch header with the more terse 1/1 header (reality is
>>> there only needed to be a single patch submission).
>>>
>>> Will review and stage accordingly if all looks fine to me.  Mikulas,
>>> please have a look too.
>>
>> Very timely: I was about to send a couple of patches to add zoned block device
>> support to dm-crypt :)
>>
>> I used [1] work as a base to have all _write_ requests be processed inline in
>> the submitter context so that the submission order is preserved, avoiding the
>> potential reordering of sequential writes that the normal workqueue based
>> processing can generate. This inline processing is done only for writes. Reads
>> are unaffected.
>>
>> To do this, I added a "inline_io" flag to struct convert_context which is
>> initialized in crypt_io_init() based on the BIO op.
>> kcryptd_crypt_write_io_submit() then uses this flag to call directly
>> generic_make_request() if inline_io is true.
>>
>> This simplifies things compared to [1] since reads can still be processed as is,
>> so there are no issued with irq context and no need for a tasklet.
> 
> In one of our major IO workflows (CDN cache) using dm-crypt created
> high and spiky p99 response times, which actually prompted this
> investigation. So, of all the things we do prefer the read path to be
> inlined even more than the write path.

Yes, I understand. Zoned block device support requires at the very least that
writes be inline to preserve order. For reads, inline or not is fine either way.
The choice can be made by the user here by using (or not using) the force_inline
flag you are proposing. These two pieces of work are definitely not incompatible.
Doing everything inline with a zoned block device backend is perfectly fine.

I mentioned my work because half of it is very similar to your patches. So we
could combine the patches, or, as mentioned, one of us will have to rebase on top
of the other, as we will clearly have conflicts :)

> 
>> Should I send these patches as RFC to see what can be merged ? Or I can wait for
>> these patches and rebase on top.
>>
>>>
>>> Thanks,
>>> Mike
>>>
>>> --
>>> dm-devel mailing list
>>> dm-devel@redhat.com
>>> https://www.redhat.com/mailman/listinfo/dm-devel
>>>
>>>
>>
>>
>> --
>> Damien Le Moal
>> Western Digital Research
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-22  0:45   ` [dm-devel] " Damien Le Moal
  2020-06-22  7:55     ` Ignat Korchagin
@ 2020-06-23 15:01     ` Mike Snitzer
  2020-06-23 15:07       ` Ignat Korchagin
  2020-06-24  4:28       ` Damien Le Moal
  1 sibling, 2 replies; 31+ messages in thread
From: Mike Snitzer @ 2020-06-23 15:01 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Ignat Korchagin, Mikulas Patocka, dm-crypt, dm-devel,
	linux-kernel, agk, kernel-team

On Sun, Jun 21 2020 at  8:45pm -0400,
Damien Le Moal <Damien.LeMoal@wdc.com> wrote:

> On 2020/06/20 1:56, Mike Snitzer wrote:
> > On Fri, Jun 19 2020 at 12:41pm -0400,
> > Ignat Korchagin <ignat@cloudflare.com> wrote:
> > 
> >> This is a follow up from the long-forgotten [1], but with some more convincing
> >> evidence. Consider the following script:
> >>
> >> #!/bin/bash -e
> >>
> >> # create 4G ramdisk
> >> sudo modprobe brd rd_nr=1 rd_size=4194304
> >>
> >> # create a dm-crypt device with NULL cipher on top of /dev/ram0
> >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup create eram0
> >>
> >> # create a dm-crypt device with NULL cipher and custom force_inline flag
> >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | sudo dmsetup create inline-eram0
> >>
> >> # read all data from /dev/ram0
> >> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
> >>
> >> # read the same data from /dev/mapper/eram0
> >> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
> >>
> >> # read the same data from /dev/mapper/inline-eram0
> >> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
> >>
> >> This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
> >> two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
> >> to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
> >> for "encyption"). The first instance is the current dm-crypt implementation from
> >> 5.8-rc1, the second is the dm-crypt instance with a custom new flag enabled from
> >> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 cores
> >> on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for
> >> better readability):
> >>
> >> # plain ram0
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> # eram0 (current dm-crypt)
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> # inline-eram0 (patched dm-crypt)
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> As we can see, current dm-crypt implementation creates a significant IO
> >> performance overhead (at least on small IO block sizes) for both latency and
> >> throughput. We suspect offloading IO request processing into workqueues and
> >> async threads is more harmful these days with the modern fast storage. I also
> >> did some digging into the dm-crypt git history and much of this async processing
> >> is not needed anymore, because the reasons it was added are mostly gone from the
> >> kernel. More details can be found in [2] (see "Git archeology" section).
> >>
> >> We have been running the attached patch on different hardware generations in
> >> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were very
> >> happy with the performance benefits.
> >>
> >> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> >> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> >>
> >> Ignat Korchagin (1):
> >>   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
> >>
> >>  drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
> >>  1 file changed, 43 insertions(+), 12 deletions(-)
> >>
> >> -- 
> >> 2.20.1
> >>
> > 
> > Hi,
> > 
> > I saw [2] and have been expecting something from cloudflare ever since.
> > Nice to see this submission.
> > 
> > There is useful context in your 0th patch header.  I'll likely merge
> > parts of this patch header with the more terse 1/1 header (reality is
> > there only needed to be a single patch submission).
> > 
> > Will review and stage accordingly if all looks fine to me.  Mikulas,
> > please have a look too.
> 
> Very timely: I was about to send a couple of patches to add zoned block device
> support to dm-crypt :)
> 
> I used [1] work as a base to have all _write_ requests be processed inline in
> the submitter context so that the submission order is preserved, avoiding the
> potential reordering of sequential writes that the normal workqueue based
> processing can generate. This inline processing is done only for writes. Reads
> are unaffected.
> 
> To do this, I added a "inline_io" flag to struct convert_context which is
> initialized in crypt_io_init() based on the BIO op.
> kcryptd_crypt_write_io_submit() then uses this flag to call directly
> generic_make_request() if inline_io is true.
> 
> This simplifies things compared to [1] since reads can still be processed as is,
> so there are no issued with irq context and no need for a tasklet.
> 
> Should I send these patches as RFC to see what can be merged ? Or I can wait for
> these patches and rebase on top.

It'd be ideal for this inline capability to address both Ignat's and
your needs.  Given that Ignat's changes _should_ enable yours (and Ignat
clarified that having reads inline is actually important), I think it
best if you review Ignat's patch closely, rebase on it, and test that it
meets your needs.

I'll wait for you to do this work so that I can get your feedback on
whether Ignat's changes look good for you too.  We have some time before
the 5.9 merge window opens, so let's just keep the communication going and
make sure what we send upstream addresses everyone's needs and concerns.

Thanks,
Mike


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-23 15:01     ` Mike Snitzer
@ 2020-06-23 15:07       ` Ignat Korchagin
  2020-06-23 15:22         ` Mike Snitzer
  2020-06-24  4:28       ` Damien Le Moal
  1 sibling, 1 reply; 31+ messages in thread
From: Ignat Korchagin @ 2020-06-23 15:07 UTC (permalink / raw)
  To: Mike Snitzer, Damien Le Moal
  Cc: Mikulas Patocka, dm-crypt, dm-devel, linux-kernel, agk, kernel-team

Do you think it may be better to break it into two flags: one for the read
path and one for the write path? That way, depending on the needs and workflow,
these could be enabled independently.
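
Something along these lines in crypt_ctr_optional(), purely illustrative
(the option and flag names here are made up, not from any posted patch):

	else if (!strcasecmp(opt_string, "force_inline_reads"))
		set_bit(DM_CRYPT_FORCE_INLINE_READS, &cc->flags);
	else if (!strcasecmp(opt_string, "force_inline_writes"))
		set_bit(DM_CRYPT_FORCE_INLINE_WRITES, &cc->flags);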

Regards,
Ignat

On Tue, Jun 23, 2020 at 4:01 PM Mike Snitzer <snitzer@redhat.com> wrote:
>
> On Sun, Jun 21 2020 at  8:45pm -0400,
> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
> > On 2020/06/20 1:56, Mike Snitzer wrote:
> > > On Fri, Jun 19 2020 at 12:41pm -0400,
> > > Ignat Korchagin <ignat@cloudflare.com> wrote:
> > >
> > >> This is a follow up from the long-forgotten [1], but with some more convincing
> > >> evidence. Consider the following script:
> > >>
> > >> #!/bin/bash -e
> > >>
> > >> # create 4G ramdisk
> > >> sudo modprobe brd rd_nr=1 rd_size=4194304
> > >>
> > >> # create a dm-crypt device with NULL cipher on top of /dev/ram0
> > >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup create eram0
> > >>
> > >> # create a dm-crypt device with NULL cipher and custom force_inline flag
> > >> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | sudo dmsetup create inline-eram0
> > >>
> > >> # read all data from /dev/ram0
> > >> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
> > >>
> > >> # read the same data from /dev/mapper/eram0
> > >> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
> > >>
> > >> # read the same data from /dev/mapper/inline-eram0
> > >> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
> > >>
> > >> This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
> > >> two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
> > >> to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
> > >> for "encyption"). The first instance is the current dm-crypt implementation from
> > >> 5.8-rc1, the second is the dm-crypt instance with a custom new flag enabled from
> > >> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 cores
> > >> on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for
> > >> better readability):
> > >>
> > >> # plain ram0
> > >> 1048576+0 records in
> > >> 1048576+0 records out
> > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
> > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> > >>
> > >> # eram0 (current dm-crypt)
> > >> 1048576+0 records in
> > >> 1048576+0 records out
> > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
> > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> > >>
> > >> # inline-eram0 (patched dm-crypt)
> > >> 1048576+0 records in
> > >> 1048576+0 records out
> > >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
> > >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> > >>
> > >> As we can see, current dm-crypt implementation creates a significant IO
> > >> performance overhead (at least on small IO block sizes) for both latency and
> > >> throughput. We suspect offloading IO request processing into workqueues and
> > >> async threads is more harmful these days with the modern fast storage. I also
> > >> did some digging into the dm-crypt git history and much of this async processing
> > >> is not needed anymore, because the reasons it was added are mostly gone from the
> > >> kernel. More details can be found in [2] (see "Git archeology" section).
> > >>
> > >> We have been running the attached patch on different hardware generations in
> > >> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were very
> > >> happy with the performance benefits.
> > >>
> > >> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> > >> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> > >>
> > >> Ignat Korchagin (1):
> > >>   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
> > >>
> > >>  drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
> > >>  1 file changed, 43 insertions(+), 12 deletions(-)
> > >>
> > >> --
> > >> 2.20.1
> > >>
> > >
> > > Hi,
> > >
> > > I saw [2] and have been expecting something from cloudflare ever since.
> > > Nice to see this submission.
> > >
> > > There is useful context in your 0th patch header.  I'll likely merge
> > > parts of this patch header with the more terse 1/1 header (reality is
> > > there only needed to be a single patch submission).
> > >
> > > Will review and stage accordingly if all looks fine to me.  Mikulas,
> > > please have a look too.
> >
> > Very timely: I was about to send a couple of patches to add zoned block device
> > support to dm-crypt :)
> >
> > I used [1] work as a base to have all _write_ requests be processed inline in
> > the submitter context so that the submission order is preserved, avoiding the
> > potential reordering of sequential writes that the normal workqueue based
> > processing can generate. This inline processing is done only for writes. Reads
> > are unaffected.
> >
> > To do this, I added a "inline_io" flag to struct convert_context which is
> > initialized in crypt_io_init() based on the BIO op.
> > kcryptd_crypt_write_io_submit() then uses this flag to call directly
> > generic_make_request() if inline_io is true.
> >
> > This simplifies things compared to [1] since reads can still be processed as is,
> > so there are no issued with irq context and no need for a tasklet.
> >
> > Should I send these patches as RFC to see what can be merged ? Or I can wait for
> > these patches and rebase on top.
>
> It'd be ideal for this inline capability to address both Ignat's and
> your needs.  Given Ignat's changes _should_ enable yours (and Ignat
> clarified that having reads inline is actually important) then I think it
> best if you review Ignat's patch closely, rebase on it and test that it
> meets your needs.
>
> I'll wait for you to do this work so that I can get your feedback on
> whether Ignat's changes look good for you too.  We have some time before
> the 5.9 merge window opens, lets just keep the communication going and
> make sure what we send upstream addresses everyone's needs and concerns.
>
> Thanks,
> Mike
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-23 15:07       ` Ignat Korchagin
@ 2020-06-23 15:22         ` Mike Snitzer
  2020-06-24  4:54           ` [dm-devel] " Damien Le Moal
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Snitzer @ 2020-06-23 15:22 UTC (permalink / raw)
  To: Ignat Korchagin
  Cc: Damien Le Moal, Mikulas Patocka, dm-crypt, dm-devel,
	linux-kernel, agk, kernel-team

On Tue, Jun 23 2020 at 11:07am -0400,
Ignat Korchagin <ignat@cloudflare.com> wrote:

> Do you think it may be better to break it in two flags: one for read
> path and one for write? So, depending on the needs and workflow these
> could be enabled independently?

If there is a need to split, then sure.  But I think Damien had a hard
requirement that writes had to be inlined but that reads didn't _need_
to be for his dm-zoned use case.  Damien may not yet have assessed the
performance implications of not having reads inlined as much as you
have.

So let's see how Damien's work goes, and if he truly doesn't need/want
reads to be inlined then 2 flags can be created.

Mike


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-20  1:23     ` Herbert Xu
  2020-06-20 19:36       ` Mikulas Patocka
@ 2020-06-23 15:33       ` Mike Snitzer
  2020-06-23 16:24         ` Ignat Korchagin
  1 sibling, 1 reply; 31+ messages in thread
From: Mike Snitzer @ 2020-06-23 15:33 UTC (permalink / raw)
  To: Herbert Xu, Ignat Korchagin
  Cc: Mikulas Patocka, Ignat Korchagin, David S. Miller, agk, dm-devel,
	dm-crypt, linux-kernel, kernel-team

On Fri, Jun 19 2020 at  9:23pm -0400,
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote:
> >
> > I'm looking at this and I'd like to know why does the crypto API fail in 
> > hard-irq context and why does it work in tasklet context. What's the exact 
> > reason behind this?
> 
> You're not supposed to do any real work in IRQ handlers.  All
> the substantial work should be postponed to softirq context.
> 
> Why do you need to do work in hard IRQ context?

Ignat, think you may have missed Herbert's question?

My understanding is that you're doing work in hard IRQ context (via
tasklet) precisely to avoid the overhead of putting the work on a
workqueue?  Did you try using a workqueue and it didn't work adequately?
If so, do you have a handle on why that is?  E.g. was it due to
increased latency? Or IO completion occurring on different CPUs than the
ones that submitted the IO (are you leaning heavily on blk-mq's ability
to pin IO completion to the same CPU the IO was submitted from?)

I'm fishing here but I'd just like to tease out the details for _why_
you need to do work from hard IRQ via tasklet so that I can potentially
defend it if needed.

Thanks,
Mike


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-23 15:33       ` Mike Snitzer
@ 2020-06-23 16:24         ` Ignat Korchagin
  2020-06-24  0:23           ` Herbert Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Ignat Korchagin @ 2020-06-23 16:24 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Herbert Xu, Mikulas Patocka, David S. Miller, agk, dm-devel,
	dm-crypt, linux-kernel, kernel-team

On Tue, Jun 23, 2020 at 4:34 PM Mike Snitzer <snitzer@redhat.com> wrote:
>
> On Fri, Jun 19 2020 at  9:23pm -0400,
> Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> > On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote:
> > >
> > > I'm looking at this and I'd like to know why does the crypto API fail in
> > > hard-irq context and why does it work in tasklet context. What's the exact
> > > reason behind this?
> >
> > You're not supposed to do any real work in IRQ handlers.  All
> > the substantial work should be postponed to softirq context.
> >
> > Why do you need to do work in hard IRQ context?
>
> Ignat, think you may have missed Herbert's question?
>
> My understanding is that you're doing work in hard IRQ context (via
> tasklet) precisely to avoid the overhead of putting the work on a
> workqueue?  Did you try using a workqueue and it didn't work adequately?
> If so, do you have a handle on why that is?  E.g. was it due to
> increased latency? Or IO completion occurring on different CPUs than the
> ones that submitted the IO (are you leaning heavily on blk-mq's ability
> to pin IO completion to the same CPU the IO was submitted from?)
>
> I'm fishing here but I'd just like to tease out the details for _why_
> you need to do work from hard IRQ via tasklet so that I can potentially
> defend it if needed.

I may be misunderstanding the terminology, but tasklets execute in
soft IRQ, don't they? What we care about is to execute the decryption
as fast as possible, but we can't do it in a hard IRQ context (that
is, the interrupt context where other interrupts are disabled). As far
as I understand, tasklets are executed right after the hard IRQ
context, but with interrupts enabled - which is the first safe-ish
place to do more lengthy processing without the risk of missing an
interrupt.

Workqueues (instead of tasklets) are how it is mostly implemented now.
But that creates additional IO latency, most probably because we're
introducing CPU scheduling latency into the overall read IO latency.
This is supported by the fact that our busier production systems have
much worse and, more importantly, spiky and unstable p99 read latency,
which somewhat correlates with the high CPU scheduling latency reported
by our metrics.

So, by inlining crypto or using a tasklet we're effectively
prioritising IO encryption/decryption. What we want to avoid is mixing
in unpredictable additional latency from an unrelated subsystem (the CPU
scheduler), because our expectation is that the total latency should
be the real HW IO latency + the crypto operation latency (which is
usually quite stable).

I hope this makes sense.

>
> Thanks,
> Mike
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-23 16:24         ` Ignat Korchagin
@ 2020-06-24  0:23           ` Herbert Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Herbert Xu @ 2020-06-24  0:23 UTC (permalink / raw)
  To: Ignat Korchagin
  Cc: Mike Snitzer, Mikulas Patocka, David S. Miller, agk, dm-devel,
	dm-crypt, linux-kernel, kernel-team

On Tue, Jun 23, 2020 at 05:24:39PM +0100, Ignat Korchagin wrote:
> 
> I may be misunderstanding the terminology, but tasklets execute in
> soft IRQ, don't they? What we care about is to execute the decryption
> as fast as possible, but we can't do it in a hard IRQ context (that
> is, the interrupt context where other interrupts are disabled). As far
> as I understand, tasklets are executed right after the hard IRQ
> context, but with interrupts enabled - which is the first safe-ish
> place to do more lengthy processing without the risk of missing an
> interrupt.

Yes you are absolutely right.  In general high-performance work
should be carried out in softirq context.  That's how the networking
stack works for example.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-23 15:01     ` Mike Snitzer
  2020-06-23 15:07       ` Ignat Korchagin
@ 2020-06-24  4:28       ` Damien Le Moal
  1 sibling, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2020-06-24  4:28 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ignat Korchagin, Mikulas Patocka, dm-crypt, dm-devel,
	linux-kernel, agk, kernel-team

On 2020/06/24 0:01, Mike Snitzer wrote:
> On Sun, Jun 21 2020 at  8:45pm -0400,
> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> 
>> On 2020/06/20 1:56, Mike Snitzer wrote:
>>> On Fri, Jun 19 2020 at 12:41pm -0400,
>>> Ignat Korchagin <ignat@cloudflare.com> wrote:
>>>
>>>> This is a follow up from the long-forgotten [1], but with some more convincing
>>>> evidence. Consider the following script:
>>>>
>>>> #!/bin/bash -e
>>>>
>>>> # create 4G ramdisk
>>>> sudo modprobe brd rd_nr=1 rd_size=4194304
>>>>
>>>> # create a dm-crypt device with NULL cipher on top of /dev/ram0
>>>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup create eram0
>>>>
>>>> # create a dm-crypt device with NULL cipher and custom force_inline flag
>>>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | sudo dmsetup create inline-eram0
>>>>
>>>> # read all data from /dev/ram0
>>>> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
>>>>
>>>> # read the same data from /dev/mapper/eram0
>>>> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
>>>>
>>>> # read the same data from /dev/mapper/inline-eram0
>>>> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
>>>>
>>>> This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
>>>> two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
>>>> to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
>>>> for "encyption"). The first instance is the current dm-crypt implementation from
>>>> 5.8-rc1, the second is the dm-crypt instance with a custom new flag enabled from
>>>> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 cores
>>>> on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for
>>>> better readability):
>>>>
>>>> # plain ram0
>>>> 1048576+0 records in
>>>> 1048576+0 records out
>>>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
>>>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>>>
>>>> # eram0 (current dm-crypt)
>>>> 1048576+0 records in
>>>> 1048576+0 records out
>>>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
>>>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>>>
>>>> # inline-eram0 (patched dm-crypt)
>>>> 1048576+0 records in
>>>> 1048576+0 records out
>>>> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
>>>> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
>>>>
>>>> As we can see, current dm-crypt implementation creates a significant IO
>>>> performance overhead (at least on small IO block sizes) for both latency and
>>>> throughput. We suspect offloading IO request processing into workqueues and
>>>> async threads is more harmful these days with the modern fast storage. I also
>>>> did some digging into the dm-crypt git history and much of this async processing
>>>> is not needed anymore, because the reasons it was added are mostly gone from the
>>>> kernel. More details can be found in [2] (see "Git archeology" section).
>>>>
>>>> We have been running the attached patch on different hardware generations in
>>>> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were very
>>>> happy with the performance benefits.
>>>>
>>>> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
>>>> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
>>>>
>>>> Ignat Korchagin (1):
>>>>   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
>>>>
>>>>  drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
>>>>  1 file changed, 43 insertions(+), 12 deletions(-)
>>>>
>>>> -- 
>>>> 2.20.1
>>>>
>>>
>>> Hi,
>>>
>>> I saw [2] and have been expecting something from cloudflare ever since.
>>> Nice to see this submission.
>>>
>>> There is useful context in your 0th patch header.  I'll likely merge
>>> parts of this patch header with the more terse 1/1 header (reality is
>>> there only needed to be a single patch submission).
>>>
>>> Will review and stage accordingly if all looks fine to me.  Mikulas,
>>> please have a look too.
>>
>> Very timely: I was about to send a couple of patches to add zoned block device
>> support to dm-crypt :)
>>
>> I used [1] work as a base to have all _write_ requests be processed inline in
>> the submitter context so that the submission order is preserved, avoiding the
>> potential reordering of sequential writes that the normal workqueue based
>> processing can generate. This inline processing is done only for writes. Reads
>> are unaffected.
>>
>> To do this, I added a "inline_io" flag to struct convert_context which is
>> initialized in crypt_io_init() based on the BIO op.
>> kcryptd_crypt_write_io_submit() then uses this flag to call directly
>> generic_make_request() if inline_io is true.
>>
>> This simplifies things compared to [1] since reads can still be processed as is,
>> so there are no issues with irq context and no need for a tasklet.
>>
>> Should I send these patches as RFC to see what can be merged ? Or I can wait for
>> these patches and rebase on top.
> 
> It'd be ideal for this inline capability to address both Ignat's and
> your needs.  Given Ignat's changes _should_ enable yours (and Ignat
> clarified that having reads inline is actually important) then I think it
> best if you review Ignat's patch closely, rebase on it and test that it
> meets your needs.

We did a lot of testing of the Cloudflare inline crypto patch to understand
implications on performance when using a server with a large population of
drives. These tests used regular drives, but we already understood while going
through this exercise that inline writes are an easy way to support SMR drives.
Point is: all the testing went well, no problem whatsoever detected. I will
review Ignat's patch.

> I'll wait for you to do this work so that I can get your feedback on
> whether Ignat's changes look good for you too.  We have some time before
> the 5.9 merge window opens, lets just keep the communication going and
> make sure what we send upstream addresses everyone's needs and concerns.

I based my work on Ignat's patch that was available on github. While that
patch was initially developed for 4.x kernels, it was easy to apply onto 5.8-rc
and I used this as a base. I ended up changing a lot of things because:
1) I did not need inline reads, but they can be enabled if the user wants them
for performance reasons.
2) I micro-optimized writes to conventional zones, allowing those to not be
inline, inlining only writes targeting sequential zones. The benefits of this
optimization are rather dubious though (read: hard to measure) as conventional
zones generally represent about 1% of the drive's total capacity, and the tendency
is to move toward SMR drives that only have sequential zones. Dropping this
optimization is fine. It simplifies things, and adding SMR support is basically
reduced to adding the report zones method and marking dm-crypt features with
DM_TARGET_ZONED_HM (see the sketch below). All on top of Ignat's patch.
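
For context, that amounts to roughly the following (modeled on what dm-linear
does for its report_zones method; the crypt_report_zones name and the exact
sector remapping shown here are illustrative, not taken from an actual patch):

static int crypt_report_zones(struct dm_target *ti,
		struct dm_report_zones_args *args, unsigned int nr_zones)
{
	struct crypt_config *cc = ti->private;
	sector_t sector = cc->start + dm_target_offset(ti, args->next_sector);

	/* let the generic dm helper remap zone info back to the dm device */
	args->start = cc->start;
	return blkdev_report_zones(cc->dev->bdev, sector, nr_zones,
				   dm_report_zones_cb, args);
}

and, in dm-crypt's target_type:

	.features     = DM_TARGET_ZONED_HM,
	.report_zones = crypt_report_zones,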

Please see my following reply to Ignat's idea of separating read & write
inlining with 2 flags.

> 
> Thanks,
> Mike
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-23 15:22         ` Mike Snitzer
@ 2020-06-24  4:54           ` Damien Le Moal
  2020-06-24  5:22             ` Mike Snitzer
  0 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2020-06-24  4:54 UTC (permalink / raw)
  To: Mike Snitzer, Ignat Korchagin
  Cc: kernel-team, dm-crypt, linux-kernel, dm-devel, Mikulas Patocka, agk

On 2020/06/24 0:23, Mike Snitzer wrote:
> On Tue, Jun 23 2020 at 11:07am -0400,
> Ignat Korchagin <ignat@cloudflare.com> wrote:
> 
>> Do you think it may be better to break it in two flags: one for read
>> path and one for write? So, depending on the needs and workflow these
>> could be enabled independently?
> 
> If there is a need to split, then sure.  But I think Damien had a hard
> requirement that writes had to be inlined but that reads didn't _need_
> to be for his dm-zoned usecase.  Damien may not yet have assessed the
> performance implications, of not have reads inlined, as much as you
> have.

We did do performance testing :)
The results are mixed and performance differences between inline vs workqueues
depend on the workload (IO size, IO queue depth and number of drives being used
mostly). In many cases, inlining everything does really improve performance as
Ignat reported.

In our testing, we used hard drives and so focused mostly on throughput rather
than command latency. The added workqueue context switch overhead and crypto
work latency compared to typical HDD IO times is small, and significant only if
the backend storage has short IO times.

In the case of HDDs, especially for large IO sizes, inlining crypto work does
not shine as it prevents an efficient use of CPU resources. This is especially
true with reads on a large system with many drives connected to a single HBA:
the softirq context decryption work does not lend itself well to using other
CPUs that did not receive the HBA IRQ signaling command completions. The test
results clearly show much higher throughputs using dm-crypt as is.

On the other hand, inlining crypto work significantly improves workloads of
small random IOs, even for a large number of disks: removing the overhead of
context switches allows faster completions, allowing sending more requests to
the drives more quickly, keeping them busy.

For SMR, the inlining of write requests is *mandatory* to preserve the issuer
write sequence, but with encryption work being done in the issuer context (writes
to SMR drives can only be O_DIRECT writes), efficient CPU resource usage can
still be achieved by simply using multiple writer threads/processes working on
different zones of different disks. This is a very reasonable model for SMR as
writes into a single zone have to be done under mutual exclusion to ensure
sequentiality.

For reads, SMR drives are essentially the same as regular disks, so as-is or
inline processing are both OK. Based on our performance results, allowing the
user to choose whether or not to inline reads based on the target workload
would be great.

Of note is that zone append writes (emulated in SCSI, native with NVMe) are not
subject to the sequential write constraint, so they can also be executed either
inline or asynchronously.

> So let's see how Damien's work goes and if he trully doesn't need/want
> reads to be inlined then 2 flags can be created.

For SMR, I do not need inline reads, but I do want the user to have the
possibility of using this setup as that can provide better performance for some
workloads. I think that splitting the inline flag in 2 is exactly what we want:

1) For SMR, the write-inline flag can be automatically turned on when the target
device is created if the backend device used is a host-managed zoned drive (SCSI
or NVMe ZNS). For reads, it would be the user's choice, based on the target workload.
2) For regular block devices, write-inline only, read-inline only or both would
be the user's choice, to optimize for their target workload.

With the split into 2 flags, my SMR support patch becomes very simple.
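
To make the split concrete, the constructor side could be as simple as the
following (the flag and option names below are only placeholders for whatever a
v2 patch ends up using):

	else if (!strcasecmp(opt_string, "force_inline_reads"))
		set_bit(DM_CRYPT_FORCE_INLINE_READ, &cc->flags);
	else if (!strcasecmp(opt_string, "force_inline_writes"))
		set_bit(DM_CRYPT_FORCE_INLINE_WRITE, &cc->flags);

with, for the SMR case, write-inline turned on automatically when the backend is
host-managed zoned:

	if (bdev_zoned_model(cc->dev->bdev) == BLK_ZONED_HM)
		set_bit(DM_CRYPT_FORCE_INLINE_WRITE, &cc->flags);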

> 
> Mike
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
  2020-06-19 16:41 ` [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target Ignat Korchagin
@ 2020-06-24  5:04   ` Eric Biggers
  2020-06-24  5:21     ` [dm-devel] " Damien Le Moal
                       ` (2 more replies)
  2020-06-24  5:12   ` [dm-devel] " Damien Le Moal
  1 sibling, 3 replies; 31+ messages in thread
From: Eric Biggers @ 2020-06-24  5:04 UTC (permalink / raw)
  To: Ignat Korchagin
  Cc: agk, snitzer, dm-devel, dm-crypt, linux-kernel, kernel-team

On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
> Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. This is
> especially visible on busy systems with many processes/threads. Moreover, most
> Crypto API implementations are async, that is they offload crypto operations on
> their own, so this dm-crypt offloading is excessive.

This really should say "some Crypto API implementations are async" instead of
"most Crypto API implementations are async".

Notably, the AES-NI implementation of AES-XTS is synchronous if you call it in a
context where SIMD instructions are usable.  It's only asynchronous when SIMD is
not usable.  (This seems to have been missed in your blog post.)

> This adds a new flag, which directs dm-crypt not to offload crypto operations
> and process everything inline. For cases, where crypto operations cannot happen
> inline (hard interrupt context, for example the read path of the NVME driver),
> we offload the work to a tasklet rather than a workqueue.

This patch both removes some dm-crypt specific queueing, and changes decryption
to use softIRQ context instead of a workqueue.  It would be useful to know how
much of a difference the workqueue => softIRQ change makes by itself.  Such a
change could be useful for fscrypt as well.  (fscrypt uses a workqueue for
decryption, but besides that doesn't use any other queueing.)

> @@ -127,7 +128,7 @@ struct iv_elephant_private {
>   * and encrypts / decrypts at the same time.
>   */
>  enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
> -	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
> +	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = (sizeof(unsigned long) * 8 - 1) };

Assigning a specific enum value isn't necessary.

> @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
>  
>  	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
>  
> -	/*
> -	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> -	 * requests if driver request queue is full.
> -	 */
> -	skcipher_request_set_callback(ctx->r.req,
> -	    CRYPTO_TFM_REQ_MAY_BACKLOG,
> -	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
> +	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
> +		/* make sure we zero important fields of the request */
> +		skcipher_request_set_callback(ctx->r.req,
> +	        0, NULL, NULL);
> +	else
> +		/*
> +		 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> +		 * requests if driver request queue is full.
> +		 */
> +		skcipher_request_set_callback(ctx->r.req,
> +	        CRYPTO_TFM_REQ_MAY_BACKLOG,
> +	        kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>  }

This looks wrong.  Unless type=0 and mask=CRYPTO_ALG_ASYNC are passed to
crypto_alloc_skcipher(), the skcipher implementation can still be asynchronous,
in which case providing a callback is required.

Do you intend that the "force_inline" option forces the use of a synchronous
skcipher (alongside the other things it does)?  Or should it still allow
asynchronous ones?

We may not actually have a choice in that matter, since xts-aes-aesni has the
CRYPTO_ALG_ASYNC bit set (as I mentioned) despite being synchronous in most
cases; thus, the crypto API won't give you it if you ask for a synchronous
cipher.  So I think you still need to allow async skciphers?  That means a
callback is still always required.
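
(For reference, requesting a synchronous-only implementation is just:

	tfm = crypto_alloc_skcipher(cipher_name, 0, CRYPTO_ALG_ASYNC);

i.e. type=0, mask=CRYPTO_ALG_ASYNC, which restricts the lookup to algorithms
with the ASYNC bit clear -- and, as noted above, that would rule out
xts-aes-aesni.)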

- Eric

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
  2020-06-19 16:41 ` [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target Ignat Korchagin
  2020-06-24  5:04   ` [dm-crypt] " Eric Biggers
@ 2020-06-24  5:12   ` Damien Le Moal
  1 sibling, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2020-06-24  5:12 UTC (permalink / raw)
  To: Ignat Korchagin, agk, snitzer, dm-devel, dm-crypt, linux-kernel
  Cc: kernel-team

On 2020/06/22 17:47, Ignat Korchagin wrote:
> Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. This is
> especially visible on busy systems with many processes/threads. Moreover, most
> Crypto API implementations are async, that is they offload crypto operations on
> their own, so this dm-crypt offloading is excessive.
> 
> This adds a new flag, which directs dm-crypt not to offload crypto operations
> and process everything inline. For cases, where crypto operations cannot happen
> inline (hard interrupt context, for example the read path of the NVME driver),
> we offload the work to a tasklet rather than a workqueue.
> 
> Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
> ---
>  drivers/md/dm-crypt.c | 55 +++++++++++++++++++++++++++++++++----------
>  1 file changed, 43 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 000ddfab5ba0..5a9bac4fdffb 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -69,6 +69,7 @@ struct dm_crypt_io {
>  	u8 *integrity_metadata;
>  	bool integrity_metadata_from_pool;
>  	struct work_struct work;
> +	struct tasklet_struct tasklet;
>  
>  	struct convert_context ctx;
>  
> @@ -127,7 +128,7 @@ struct iv_elephant_private {
>   * and encrypts / decrypts at the same time.
>   */
>  enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
> -	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
> +	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = (sizeof(unsigned long) * 8 - 1) };

I do not see why a special value needs to be defined for DM_CRYPT_FORCE_INLINE.
It is clear from the number of members in the enum that we have far fewer than 32
flags. Also, it may be good to put DM_CRYPT_FORCE_INLINE on a new line to avoid
the long-ish line.

>  
>  enum cipher_flags {
>  	CRYPT_MODE_INTEGRITY_AEAD,	/* Use authenticated mode for cihper */
> @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
>  
>  	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
>  
> -	/*
> -	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> -	 * requests if driver request queue is full.
> -	 */
> -	skcipher_request_set_callback(ctx->r.req,
> -	    CRYPTO_TFM_REQ_MAY_BACKLOG,
> -	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
> +	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
> +		/* make sure we zero important fields of the request */
> +		skcipher_request_set_callback(ctx->r.req,
> +	        0, NULL, NULL);

Maybe add a return here to avoid the need for the else?

> +	else
> +		/*
> +		 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> +		 * requests if driver request queue is full.
> +		 */
> +		skcipher_request_set_callback(ctx->r.req,
> +	        CRYPTO_TFM_REQ_MAY_BACKLOG,
> +	        kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>  }
>  
>  static void crypt_alloc_req_aead(struct crypt_config *cc,
> @@ -1566,7 +1572,8 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
>  			atomic_dec(&ctx->cc_pending);
>  			ctx->cc_sector += sector_step;
>  			tag_offset++;
> -			cond_resched();
> +			if (!test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
> +				cond_resched();
>  			continue;
>  		/*
>  		 * There was a data integrity error.
> @@ -1892,6 +1899,11 @@ static void kcryptd_crypt_write_io_submit(struct dm_crypt_io *io, int async)
>  
>  	clone->bi_iter.bi_sector = cc->start + io->sector;
>  
> +	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
> +		generic_make_request(clone);
> +		return;
> +	}
> +
>  	if (likely(!async) && test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags)) {

A little inline helper such as kcryptd_crypt_write_io_inline(io, async) would
simplify things here: the conditions leading to an inline write would be gathered
together and could be explained. And for SMR, since we also need an IO-type-based
selection, we could extend the helper without needing another change here.
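
For instance (just a sketch of the idea, using only the flags in this patch, not
tested):

static bool kcryptd_crypt_write_io_inline(struct dm_crypt_io *io, int async)
{
	struct crypt_config *cc = io->cc;

	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
		return true;
	/* the existing submit_from_crypt_cpus case */
	if (!async && test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags))
		return true;
	/* an SMR check for writes targeting sequential zones would slot in here */
	return false;
}

so that kcryptd_crypt_write_io_submit() only needs:

	if (kcryptd_crypt_write_io_inline(io, async)) {
		generic_make_request(clone);
		return;
	}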

>  		generic_make_request(clone);
>  		return;
> @@ -2031,12 +2043,26 @@ static void kcryptd_crypt(struct work_struct *work)
>  		kcryptd_crypt_write_convert(io);
>  }
>  
> +static void kcryptd_crypt_tasklet(unsigned long work)
> +{
> +	kcryptd_crypt((struct work_struct *)work);
> +}
> +
>  static void kcryptd_queue_crypt(struct dm_crypt_io *io)
>  {
>  	struct crypt_config *cc = io->cc;
>  
> -	INIT_WORK(&io->work, kcryptd_crypt);
> -	queue_work(cc->crypt_queue, &io->work);
> +	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
> +		if (in_irq()) {
> +			/* Crypto API will fail hard in hard IRQ context */
> +			tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work);
> +			tasklet_schedule(&io->tasklet);
> +		} else
> +			kcryptd_crypt(&io->work);

Same as above: return here to avoid the else?
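
i.e. the tail of kcryptd_queue_crypt() could then look like (sketch only, same
logic as the hunk being reviewed, just restructured with early returns):

	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
		if (in_irq()) {
			/* Crypto API will fail hard in hard IRQ context */
			tasklet_init(&io->tasklet, kcryptd_crypt_tasklet,
				     (unsigned long)&io->work);
			tasklet_schedule(&io->tasklet);
			return;
		}
		kcryptd_crypt(&io->work);
		return;
	}

	INIT_WORK(&io->work, kcryptd_crypt);
	queue_work(cc->crypt_queue, &io->work);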

> +	} else {
> +		INIT_WORK(&io->work, kcryptd_crypt);
> +		queue_work(cc->crypt_queue, &io->work);
> +	}
>  }
>  
>  static void crypt_free_tfms_aead(struct crypt_config *cc)
> @@ -2838,7 +2864,7 @@ static int crypt_ctr_optional(struct dm_target *ti, unsigned int argc, char **ar
>  	struct crypt_config *cc = ti->private;
>  	struct dm_arg_set as;
>  	static const struct dm_arg _args[] = {
> -		{0, 6, "Invalid number of feature args"},
> +		{0, 7, "Invalid number of feature args"},
>  	};
>  	unsigned int opt_params, val;
>  	const char *opt_string, *sval;
> @@ -2868,6 +2894,8 @@ static int crypt_ctr_optional(struct dm_target *ti, unsigned int argc, char **ar
>  
>  		else if (!strcasecmp(opt_string, "submit_from_crypt_cpus"))
>  			set_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags);
> +		else if (!strcasecmp(opt_string, "force_inline"))
> +			set_bit(DM_CRYPT_FORCE_INLINE, &cc->flags);
>  		else if (sscanf(opt_string, "integrity:%u:", &val) == 1) {
>  			if (val == 0 || val > MAX_TAG_SIZE) {
>  				ti->error = "Invalid integrity arguments";
> @@ -3196,6 +3224,7 @@ static void crypt_status(struct dm_target *ti, status_type_t type,
>  		num_feature_args += !!ti->num_discard_bios;
>  		num_feature_args += test_bit(DM_CRYPT_SAME_CPU, &cc->flags);
>  		num_feature_args += test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags);
> +		num_feature_args += test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags);
>  		num_feature_args += cc->sector_size != (1 << SECTOR_SHIFT);
>  		num_feature_args += test_bit(CRYPT_IV_LARGE_SECTORS, &cc->cipher_flags);
>  		if (cc->on_disk_tag_size)
> @@ -3208,6 +3237,8 @@ static void crypt_status(struct dm_target *ti, status_type_t type,
>  				DMEMIT(" same_cpu_crypt");
>  			if (test_bit(DM_CRYPT_NO_OFFLOAD, &cc->flags))
>  				DMEMIT(" submit_from_crypt_cpus");
> +			if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
> +				DMEMIT(" force_inline");
>  			if (cc->on_disk_tag_size)
>  				DMEMIT(" integrity:%u:%s", cc->on_disk_tag_size, cc->cipher_auth);
>  			if (cc->sector_size != (1 << SECTOR_SHIFT))
> 

Apart from the above few comments, this all looks OK to me (and tested).
One question though: do you have patches for the cryptsetup userland tools to
allow specifying the inline flag on device open?

It turns out that the most difficult part of the SMR support is patching
cryptsetup. Not much work is needed for plain use, but formats such as LUKS do not
write the superblock and metadata sequentially, which causes problems with drives
that do not have conventional zones at LBA 0...
But this is my problem to solve :)

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
  2020-06-24  5:04   ` [dm-crypt] " Eric Biggers
@ 2020-06-24  5:21     ` Damien Le Moal
  2020-06-24  5:27       ` Eric Biggers
  2020-06-24  7:49     ` Damien Le Moal
  2020-06-24  8:24     ` Ignat Korchagin
  2 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2020-06-24  5:21 UTC (permalink / raw)
  To: Eric Biggers, Ignat Korchagin
  Cc: snitzer, kernel-team, dm-crypt, linux-kernel, dm-devel, agk

On 2020/06/24 14:05, Eric Biggers wrote:
> On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
>> Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. This is
>> especially visible on busy systems with many processes/threads. Moreover, most
>> Crypto API implementations are async, that is they offload crypto operations on
>> their own, so this dm-crypt offloading is excessive.
> 
> This really should say "some Crypto API implementations are async" instead of
> "most Crypto API implementations are async".
> 
> Notably, the AES-NI implementation of AES-XTS is synchronous if you call it in a
> context where SIMD instructions are usable.  It's only asynchronous when SIMD is
> not usable.  (This seems to have been missed in your blog post.)
> 
>> This adds a new flag, which directs dm-crypt not to offload crypto operations
>> and process everything inline. For cases, where crypto operations cannot happen
>> inline (hard interrupt context, for example the read path of the NVME driver),
>> we offload the work to a tasklet rather than a workqueue.
> 
> This patch both removes some dm-crypt specific queueing, and changes decryption
> to use softIRQ context instead of a workqueue.  It would be useful to know how
> much of a difference the workqueue => softIRQ change makes by itself.  Such a
> change could be useful for fscrypt as well.  (fscrypt uses a workqueue for
> decryption, but besides that doesn't use any other queueing.)
> 
>> @@ -127,7 +128,7 @@ struct iv_elephant_private {
>>   * and encrypts / decrypts at the same time.
>>   */
>>  enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
>> -	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
>> +	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = (sizeof(unsigned long) * 8 - 1) };
> 
> Assigning a specific enum value isn't necessary.
> 
>> @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
>>  
>>  	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
>>  
>> -	/*
>> -	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
>> -	 * requests if driver request queue is full.
>> -	 */
>> -	skcipher_request_set_callback(ctx->r.req,
>> -	    CRYPTO_TFM_REQ_MAY_BACKLOG,
>> -	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>> +	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
>> +		/* make sure we zero important fields of the request */
>> +		skcipher_request_set_callback(ctx->r.req,
>> +	        0, NULL, NULL);
>> +	else
>> +		/*
>> +		 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
>> +		 * requests if driver request queue is full.
>> +		 */
>> +		skcipher_request_set_callback(ctx->r.req,
>> +	        CRYPTO_TFM_REQ_MAY_BACKLOG,
>> +	        kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>>  }
> 
> This looks wrong.  Unless type=0 and mask=CRYPTO_ALG_ASYNC are passed to
> crypto_alloc_skcipher(), the skcipher implementation can still be asynchronous,
> in which case providing a callback is required.
> 
> Do you intend that the "force_inline" option forces the use of a synchronous
> skcipher (alongside the other things it does)?  Or should it still allow
> asynchronous ones?
> 
> We may not actually have a choice in that matter, since xts-aes-aesni has the
> CRYPTO_ALG_ASYNC bit set (as I mentioned) despite being synchronous in most
> cases; thus, the crypto API won't give you it if you ask for a synchronous
> cipher.  So I think you still need to allow async skciphers?  That means a
> callback is still always required.

Arg... So it means that some skciphers will not be OK at all for SMR writes. I
was not aware of these differences (tested with aes-xts-plain64 only). The ugly
way to support async ciphers would be to just wait inline for the crypto API to
complete using a completion for instance. But that is very ugly. Back to
brainstorming, and need to learn more about the crypto API...

> 
> - Eric
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-24  4:54           ` [dm-devel] " Damien Le Moal
@ 2020-06-24  5:22             ` Mike Snitzer
  2020-06-24  8:02               ` Ignat Korchagin
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Snitzer @ 2020-06-24  5:22 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Ignat Korchagin, kernel-team, dm-crypt, linux-kernel, dm-devel,
	Mikulas Patocka, agk

On Wed, Jun 24 2020 at 12:54am -0400,
Damien Le Moal <Damien.LeMoal@wdc.com> wrote:

> On 2020/06/24 0:23, Mike Snitzer wrote:
> > On Tue, Jun 23 2020 at 11:07am -0400,
> > Ignat Korchagin <ignat@cloudflare.com> wrote:
> > 
> >> Do you think it may be better to break it in two flags: one for read
> >> path and one for write? So, depending on the needs and workflow these
> >> could be enabled independently?
> > 
> > If there is a need to split, then sure.  But I think Damien had a hard
> > requirement that writes had to be inlined but that reads didn't _need_
> > to be for his dm-zoned usecase.  Damien may not yet have assessed the
> > performance implications, of not have reads inlined, as much as you
> > have.
> 
> We did do performance testing :)
> The results are mixed and performance differences between inline vs workqueues
> depend on the workload (IO size, IO queue depth and number of drives being used
> mostly). In many cases, inlining everything does really improve performance as
> Ignat reported.
> 
> In our testing, we used hard drives and so focused mostly on throughput rather
> than command latency. The added workqueue context switch overhead and crypto
> work latency compared to typical HDD IO times is small, and significant only if
> the backend storage as short IO times.
> 
> In the case of HDDs, especially for large IO sizes, inlining crypto work does
> not shine as it prevents an efficient use of CPU resources. This is especially
> true with reads on a large system with many drives connected to a single HBA:
> the softirq context decryption work does not lend itself well to using other
> CPUs that did not receive the HBA IRQ signaling command completions. The test
> results clearly show much higher throughputs using dm-crypt as is.
> 
> On the other hand, inlining crypto work significantly improves workloads of
> small random IOs, even for a large number of disks: removing the overhead of
> context switches allows faster completions, allowing sending more requests to
> the drives more quickly, keeping them busy.
> 
> For SMR, the inlining of write requests is *mandatory* to preserve the issuer
> write sequence, but encryption work being done in the issuer context (writes to
> SMR drives can only be O_DIRECT writes), efficient CPU resource usage can be
> achieved by simply using multiple writer thread/processes, working on different
> zones of different disks. This is a very reasonable model for SMR as writes into
> a single zone have to be done under mutual exclusion to ensure sequentiality.
> 
> For reads, SMR drives are essentially exactly the same as regular disks, so
> as-is or inline are both OK. Based on our performance results, allowing the user
> to have the choice of inlining or not reads based on the target workload would
> be great.
> 
> Of note is that zone append writes (emulated in SCSI, native with NVMe) are not
> subject to the sequential write constraint, so they can also be executed either
> inline or asynchronously.
> 
> > So let's see how Damien's work goes and if he trully doesn't need/want
> > reads to be inlined then 2 flags can be created.
> 
> For SMR, I do not need inline reads, but I do want the user to have the
> possibility of using this setup as that can provide better performance for some
> workloads. I think that splitting the inline flag in 2 is exactly what we want:
> 
> 1) For SMR, the write-inline flag can be automatically turned on when the target
> device is created if the backend device used is a host-managed zoned drive (scsi
> or NVMe ZNS). For reads, it would be the user choice, based on the target workload.
> 2) For regular block devices, write-inline only, read-inline only or both would
> be the user choice, to optimize for their target workload.
> 
> With the split into 2 flags, my SMR support patch becomes very simple.

OK, thanks for all the context.  Was a fun read ;)

So let's run with splitting into 2 flags.  Ignat, would you be up to
tweaking your patch to provide that and post a v2?

An added bonus would be to consolidate your 0/1 and 1/1 patch headers,
and add in the additional answers you provided in this thread to help
others understand the patch (mainly some more detail about why tasklet
is used).

Thanks,
Mike


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
  2020-06-24  5:21     ` [dm-devel] " Damien Le Moal
@ 2020-06-24  5:27       ` Eric Biggers
  2020-06-24  6:46         ` Damien Le Moal
  2020-06-24  7:24         ` Damien Le Moal
  0 siblings, 2 replies; 31+ messages in thread
From: Eric Biggers @ 2020-06-24  5:27 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Ignat Korchagin, snitzer, kernel-team, dm-crypt, linux-kernel,
	dm-devel, agk

On Wed, Jun 24, 2020 at 05:21:24AM +0000, Damien Le Moal wrote:
> >> @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
> >>  
> >>  	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
> >>  
> >> -	/*
> >> -	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> >> -	 * requests if driver request queue is full.
> >> -	 */
> >> -	skcipher_request_set_callback(ctx->r.req,
> >> -	    CRYPTO_TFM_REQ_MAY_BACKLOG,
> >> -	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
> >> +	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
> >> +		/* make sure we zero important fields of the request */
> >> +		skcipher_request_set_callback(ctx->r.req,
> >> +	        0, NULL, NULL);
> >> +	else
> >> +		/*
> >> +		 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> >> +		 * requests if driver request queue is full.
> >> +		 */
> >> +		skcipher_request_set_callback(ctx->r.req,
> >> +	        CRYPTO_TFM_REQ_MAY_BACKLOG,
> >> +	        kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
> >>  }
> > 
> > This looks wrong.  Unless type=0 and mask=CRYPTO_ALG_ASYNC are passed to
> > crypto_alloc_skcipher(), the skcipher implementation can still be asynchronous,
> > in which case providing a callback is required.
> > 
> > Do you intend that the "force_inline" option forces the use of a synchronous
> > skcipher (alongside the other things it does)?  Or should it still allow
> > asynchronous ones?
> > 
> > We may not actually have a choice in that matter, since xts-aes-aesni has the
> > CRYPTO_ALG_ASYNC bit set (as I mentioned) despite being synchronous in most
> > cases; thus, the crypto API won't give you it if you ask for a synchronous
> > cipher.  So I think you still need to allow async skciphers?  That means a
> > callback is still always required.
> 
> Arg... So it means that some skciphers will not be OK at all for SMR writes. I
> was not aware of these differences (tested with aes-xts-plain64 only). The ugly
> way to support async ciphers would be to just wait inline for the crypto API to
> complete using a completion for instance. But that is very ugly. Back to
> brainstorming, and need to learn more about the crypto API...
> 

It's easy to wait for crypto API requests to complete if you need to --
just use crypto_wait_req().

We do this in fs/crypto/, for example.  (Not many people are using fscrypt with
crypto API based accelerators, so there hasn't yet been much need to support the
complexity of issuing multiple async crypto requests like dm-crypt supports.)
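
A minimal sketch of that pattern (the same one fs/crypto/ uses):

	DECLARE_CRYPTO_WAIT(wait);

	skcipher_request_set_callback(req,
			CRYPTO_TFM_REQ_MAY_BACKLOG | CRYPTO_TFM_REQ_MAY_SLEEP,
			crypto_req_done, &wait);
	/* set up src/dst scatterlists, IV, etc. as usual */
	err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);

crypto_wait_req() returns the result directly for a synchronous completion and
sleeps on the completion otherwise, so it is only usable from process context.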

- Eric

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
  2020-06-24  5:27       ` Eric Biggers
@ 2020-06-24  6:46         ` Damien Le Moal
  2020-06-24  7:24         ` Damien Le Moal
  1 sibling, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2020-06-24  6:46 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Ignat Korchagin, snitzer, kernel-team, dm-crypt, linux-kernel,
	dm-devel, agk

On 2020/06/24 14:27, Eric Biggers wrote:
> On Wed, Jun 24, 2020 at 05:21:24AM +0000, Damien Le Moal wrote:
>>>> @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
>>>>  
>>>>  	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
>>>>  
>>>> -	/*
>>>> -	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
>>>> -	 * requests if driver request queue is full.
>>>> -	 */
>>>> -	skcipher_request_set_callback(ctx->r.req,
>>>> -	    CRYPTO_TFM_REQ_MAY_BACKLOG,
>>>> -	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>>>> +	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
>>>> +		/* make sure we zero important fields of the request */
>>>> +		skcipher_request_set_callback(ctx->r.req,
>>>> +	        0, NULL, NULL);
>>>> +	else
>>>> +		/*
>>>> +		 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
>>>> +		 * requests if driver request queue is full.
>>>> +		 */
>>>> +		skcipher_request_set_callback(ctx->r.req,
>>>> +	        CRYPTO_TFM_REQ_MAY_BACKLOG,
>>>> +	        kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>>>>  }
>>>
>>> This looks wrong.  Unless type=0 and mask=CRYPTO_ALG_ASYNC are passed to
>>> crypto_alloc_skcipher(), the skcipher implementation can still be asynchronous,
>>> in which case providing a callback is required.
>>>
>>> Do you intend that the "force_inline" option forces the use of a synchronous
>>> skcipher (alongside the other things it does)?  Or should it still allow
>>> asynchronous ones?
>>>
>>> We may not actually have a choice in that matter, since xts-aes-aesni has the
>>> CRYPTO_ALG_ASYNC bit set (as I mentioned) despite being synchronous in most
>>> cases; thus, the crypto API won't give you it if you ask for a synchronous
>>> cipher.  So I think you still need to allow async skciphers?  That means a
>>> callback is still always required.
>>
>> Arg... So it means that some skciphers will not be OK at all for SMR writes. I
>> was not aware of these differences (tested with aes-xts-plain64 only). The ugly
>> way to support async ciphers would be to just wait inline for the crypto API to
>> complete using a completion for instance. But that is very ugly. Back to
>> brainstorming, and need to learn more about the crypto API...
>>
> 
> It's easy to wait for crypto API requests to complete if you need to --
> just use crypto_wait_req().

OK. Thanks for the information. I will look into this and the performance
implications. A quick grep shows that a lot of different accelerators for
different architectures have CRYPTO_ALG_ASYNC set. So this is definitely
something that needs to be checked for SMR, and for Ignat's inline patch.

> We do this in fs/crypto/, for example.  (Not many people are using fscrypt with
> crypto API based accelerators, so there hasn't yet been much need to support the
> complexity of issuing multiple async crypto requests like dm-crypt supports.)

Zonefs fscrypt support is on my to do list too :)

Thanks !

>
> - Eric
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
  2020-06-24  5:27       ` Eric Biggers
  2020-06-24  6:46         ` Damien Le Moal
@ 2020-06-24  7:24         ` Damien Le Moal
  1 sibling, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2020-06-24  7:24 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Ignat Korchagin, snitzer, kernel-team, dm-crypt, linux-kernel,
	dm-devel, agk

On 2020/06/24 14:27, Eric Biggers wrote:
> On Wed, Jun 24, 2020 at 05:21:24AM +0000, Damien Le Moal wrote:
>>>> @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
>>>>  
>>>>  	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
>>>>  
>>>> -	/*
>>>> -	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
>>>> -	 * requests if driver request queue is full.
>>>> -	 */
>>>> -	skcipher_request_set_callback(ctx->r.req,
>>>> -	    CRYPTO_TFM_REQ_MAY_BACKLOG,
>>>> -	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>>>> +	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
>>>> +		/* make sure we zero important fields of the request */
>>>> +		skcipher_request_set_callback(ctx->r.req,
>>>> +	        0, NULL, NULL);
>>>> +	else
>>>> +		/*
>>>> +		 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
>>>> +		 * requests if driver request queue is full.
>>>> +		 */
>>>> +		skcipher_request_set_callback(ctx->r.req,
>>>> +	        CRYPTO_TFM_REQ_MAY_BACKLOG,
>>>> +	        kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>>>>  }
>>>
>>> This looks wrong.  Unless type=0 and mask=CRYPTO_ALG_ASYNC are passed to
>>> crypto_alloc_skcipher(), the skcipher implementation can still be asynchronous,
>>> in which case providing a callback is required.

Digging into the code further, in light of your hints, it looks like all that
needs to be done to fix this is to change crypt_convert_block_skcipher() from doing:

	if (bio_data_dir(ctx->bio_in) == WRITE)
		r = crypto_skcipher_encrypt(req);
	else
		r = crypto_skcipher_decrypt(req);

to do something like:

	struct crypto_wait wait;

	...

	if (bio_data_dir(ctx->bio_in) == WRITE) {
		if (test_bit(DM_CRYPT_FORCE_INLINE_WRITE, &cc->flags))
			r = crypto_wait_req(crypto_skcipher_encrypt(req),
					    &wait);
		else
			r = crypto_skcipher_encrypt(req);
	} else {
		if (test_bit(DM_CRYPT_FORCE_INLINE_READ, &cc->flags))
			r = crypto_wait_req(crypto_skcipher_decrypt(req),
					    &wait);
		else
			r = crypto_skcipher_decrypt(req);
	}

Doing so, crypt_convert_block_skcipher() cannot return -EBUSY nor -EINPROGRESS
for inline IOs, leading crypt_convert() to see synchronous completions, as
expected for the inline case. The above likely does not add much overhead at all
for synchronous skciphers/accelerators, and handles asynchronous ones as if they
were synchronous. Would this be correct?



>>>
>>> Do you intend that the "force_inline" option forces the use of a synchronous
>>> skcipher (alongside the other things it does)?  Or should it still allow
>>> asynchronous ones?
>>>
>>> We may not actually have a choice in that matter, since xts-aes-aesni has the
>>> CRYPTO_ALG_ASYNC bit set (as I mentioned) despite being synchronous in most
>>> cases; thus, the crypto API won't give you it if you ask for a synchronous
>>> cipher.  So I think you still need to allow async skciphers?  That means a
>>> callback is still always required.
>>
>> Arg... So it means that some skciphers will not be OK at all for SMR writes. I
>> was not aware of these differences (tested with aes-xts-plain64 only). The ugly
>> way to support async ciphers would be to just wait inline for the crypto API to
>> complete using a completion for instance. But that is very ugly. Back to
>> brainstorming, and need to learn more about the crypto API...
>>
> 
> It's easy to wait for crypto API requests to complete if you need to --
> just use crypto_wait_req().
> 
> We do this in fs/crypto/, for example.  (Not many people are using fscrypt with
> crypto API based accelerators, so there hasn't yet been much need to support the
> complexity of issuing multiple async crypto requests like dm-crypt supports.)
> 
> - Eric
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
  2020-06-24  5:04   ` [dm-crypt] " Eric Biggers
  2020-06-24  5:21     ` [dm-devel] " Damien Le Moal
@ 2020-06-24  7:49     ` Damien Le Moal
  2020-06-24  8:24     ` Ignat Korchagin
  2 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2020-06-24  7:49 UTC (permalink / raw)
  To: Eric Biggers, Ignat Korchagin
  Cc: snitzer, kernel-team, dm-crypt, linux-kernel, dm-devel, agk

On 2020/06/24 14:05, Eric Biggers wrote:
> On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
>> Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. This is
>> especially visible on busy systems with many processes/threads. Moreover, most
>> Crypto API implementations are async, that is they offload crypto operations on
>> their own, so this dm-crypt offloading is excessive.
> 
> This really should say "some Crypto API implementations are async" instead of
> "most Crypto API implementations are async".
> 
> Notably, the AES-NI implementation of AES-XTS is synchronous if you call it in a
> context where SIMD instructions are usable.  It's only asynchronous when SIMD is
> not usable.  (This seems to have been missed in your blog post.)
> 
>> This adds a new flag, which directs dm-crypt not to offload crypto operations
>> and process everything inline. For cases, where crypto operations cannot happen
>> inline (hard interrupt context, for example the read path of the NVME driver),
>> we offload the work to a tasklet rather than a workqueue.
> 
> This patch both removes some dm-crypt specific queueing, and changes decryption
> to use softIRQ context instead of a workqueue.  It would be useful to know how
> much of a difference the workqueue => softIRQ change makes by itself.  Such a
> change could be useful for fscrypt as well.  (fscrypt uses a workqueue for
> decryption, but besides that doesn't use any other queueing.)
> 
>> @@ -127,7 +128,7 @@ struct iv_elephant_private {
>>   * and encrypts / decrypts at the same time.
>>   */
>>  enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
>> -	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
>> +	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = (sizeof(unsigned long) * 8 - 1) };
> 
> Assigning a specific enum value isn't necessary.
> 
>> @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
>>  
>>  	skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
>>  
>> -	/*
>> -	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
>> -	 * requests if driver request queue is full.
>> -	 */
>> -	skcipher_request_set_callback(ctx->r.req,
>> -	    CRYPTO_TFM_REQ_MAY_BACKLOG,
>> -	    kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>> +	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
>> +		/* make sure we zero important fields of the request */
>> +		skcipher_request_set_callback(ctx->r.req,
>> +	        0, NULL, NULL);
>> +	else
>> +		/*
>> +		 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
>> +		 * requests if driver request queue is full.
>> +		 */
>> +		skcipher_request_set_callback(ctx->r.req,
>> +	        CRYPTO_TFM_REQ_MAY_BACKLOG,
>> +	        kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
>>  }
> 
> This looks wrong.  Unless type=0 and mask=CRYPTO_ALG_ASYNC are passed to
> crypto_alloc_skcipher(), the skcipher implementation can still be asynchronous,
> in which case providing a callback is required.

Another point: for a skcipher implementation that is asynchronous, in the
regular (not-inline) case, can't we just issue the request directly without using
the workqueue? If yes, that would save one context switch, since queueing of
the request can be handled by the crypto API when the request callback is set
with CRYPTO_TFM_REQ_MAY_BACKLOG. At least that is how I understood the
documentation & comments. I may be wrong here...

> 
> Do you intend that the "force_inline" option forces the use of a synchronous
> skcipher (alongside the other things it does)?  Or should it still allow
> asynchronous ones?
> 
> We may not actually have a choice in that matter, since xts-aes-aesni has the
> CRYPTO_ALG_ASYNC bit set (as I mentioned) despite being synchronous in most
> cases; thus, the crypto API won't give you it if you ask for a synchronous
> cipher.  So I think you still need to allow async skciphers?  That means a
> callback is still always required.
> 
> - Eric
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/1] dm-crypt excessive overhead
  2020-06-24  5:22             ` Mike Snitzer
@ 2020-06-24  8:02               ` Ignat Korchagin
  0 siblings, 0 replies; 31+ messages in thread
From: Ignat Korchagin @ 2020-06-24  8:02 UTC (permalink / raw)
  To: Mike Snitzer, Damien Le Moal
  Cc: kernel-team, dm-crypt, linux-kernel, dm-devel, Mikulas Patocka, agk

On Wed, Jun 24, 2020 at 6:22 AM Mike Snitzer <snitzer@redhat.com> wrote:
>
> On Wed, Jun 24 2020 at 12:54am -0400,
> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
> > On 2020/06/24 0:23, Mike Snitzer wrote:
> > > On Tue, Jun 23 2020 at 11:07am -0400,
> > > Ignat Korchagin <ignat@cloudflare.com> wrote:
> > >
> > >> Do you think it may be better to break it in two flags: one for read
> > >> path and one for write? So, depending on the needs and workflow these
> > >> could be enabled independently?
> > >
> > > If there is a need to split, then sure.  But I think Damien had a hard
> > > requirement that writes had to be inlined but that reads didn't _need_
> > > to be for his dm-zoned usecase.  Damien may not yet have assessed the
> > > performance implications, of not have reads inlined, as much as you
> > > have.
> >
> > We did do performance testing :)
> > The results are mixed and performance differences between inline vs workqueues
> > depend on the workload (IO size, IO queue depth and number of drives being used
> > mostly). In many cases, inlining everything does really improve performance as
> > Ignat reported.
> >
> > In our testing, we used hard drives and so focused mostly on throughput rather
> > than command latency. The added workqueue context switch overhead and crypto
> > work latency compared to typical HDD IO times is small, and significant only if
> > the backend storage as short IO times.
> >
> > In the case of HDDs, especially for large IO sizes, inlining crypto work does
> > not shine as it prevents an efficient use of CPU resources. This is especially
> > true with reads on a large system with many drives connected to a single HBA:
> > the softirq context decryption work does not lend itself well to using other
> > CPUs that did not receive the HBA IRQ signaling command completions. The test
> > results clearly show much higher throughputs using dm-crypt as is.
> >
> > On the other hand, inlining crypto work significantly improves workloads of
> > small random IOs, even for a large number of disks: removing the overhead of
> > context switches allows faster completions, allowing sending more requests to
> > the drives more quickly, keeping them busy.
> >
> > For SMR, the inlining of write requests is *mandatory* to preserve the issuer
> > write sequence, but with encryption work done in the issuer context (writes to
> > SMR drives can only be O_DIRECT writes), efficient CPU resource usage can be
> > achieved by simply using multiple writer threads/processes, working on different
> > zones of different disks. This is a very reasonable model for SMR as writes into
> > a single zone have to be done under mutual exclusion to ensure sequentiality.
> >
> > For reads, SMR drives are essentially exactly the same as regular disks, so
> > as-is or inline are both OK. Based on our performance results, allowing the user
> > to choose whether or not to inline reads, based on the target workload, would
> > be great.
> >
> > Of note is that zone append writes (emulated in SCSI, native with NVMe) are not
> > subject to the sequential write constraint, so they can also be executed either
> > inline or asynchronously.
> >
> > > So let's see how Damien's work goes and if he truly doesn't need/want
> > > reads to be inlined then 2 flags can be created.
> >
> > For SMR, I do not need inline reads, but I do want the user to have the
> > possibility of using this setup as that can provide better performance for some
> > workloads. I think that splitting the inline flag in 2 is exactly what we want:
> >
> > 1) For SMR, the write-inline flag can be automatically turned on when the target
> > device is created if the backend device used is a host-managed zoned drive (scsi
> > or NVMe ZNS). For reads, it would be the user choice, based on the target workload.
> > 2) For regular block devices, write-inline only, read-inline only or both would
> > be the user choice, to optimize for their target workload.
> >
> > With the split into 2 flags, my SMR support patch becomes very simple.
>
> OK, thanks for all the context.  Was a fun read ;)
>
> So let's run with splitting into 2 flags.  Ignat, would you be up to
> tweaking your patch to provide that and post a v2?
>
> An added bonus would be to consolidate your 0/1 and 1/1 patch headers,
> and add in the additional answers you provided in this thread to help
> others understand the patch (mainly some more detail about why tasklet
> is used).

Yes, will do
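
For what it's worth, a split interface on the table line could look roughly
like this (a sketch only; the flag names, the backing device and the all-zero
key are placeholders for illustration, not the final interface):

# hypothetical v2: two optional args instead of a single force_inline
KEY=$(printf '0%.0s' {1..128})   # 512-bit all-zero XTS key, illustration only
echo "0 2097152 crypt aes-xts-plain64 $KEY 0 /dev/sdb 0 2 inline_reads inline_writes" | sudo dmsetup create inline-sdb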

> Thanks,
> Mike
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
  2020-06-24  5:04   ` [dm-crypt] " Eric Biggers
  2020-06-24  5:21     ` [dm-devel] " Damien Le Moal
  2020-06-24  7:49     ` Damien Le Moal
@ 2020-06-24  8:24     ` Ignat Korchagin
  2020-06-24 16:24       ` Eric Biggers
  2 siblings, 1 reply; 31+ messages in thread
From: Ignat Korchagin @ 2020-06-24  8:24 UTC (permalink / raw)
  To: Eric Biggers
  Cc: agk, Mike Snitzer, dm-devel, dm-crypt, linux-kernel, kernel-team

On Wed, Jun 24, 2020 at 6:04 AM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
> > Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. This is
> > especially visible on busy systems with many processes/threads. Moreover, most
> > Crypto API implementations are async, that is they offload crypto operations on
> > their own, so this dm-crypt offloading is excessive.
>
> This really should say "some Crypto API implementations are async" instead of
> "most Crypto API implementations are async".

The most accurate would probably be: most hardware-accelerated Crypto
API implementations are async

> Notably, the AES-NI implementation of AES-XTS is synchronous if you call it in a
> context where SIMD instructions are usable.  It's only asynchronous when SIMD is
> not usable.  (This seems to have been missed in your blog post.)

No, it was not. This is exactly why we made the xts-proxy Crypto API
module as a second patch. But it now seems it does not make a big
difference whether the Crypto API implementation in use is synchronous
as well (based on some benchmarks outlined in the cover letter to this
patch). I think the v2 of this patch will not require a synchronous
Crypto API. This is probably the right thing to do, as the "inline"
flag should control how dm-crypt itself handles requests, not how the
Crypto API handles requests. If a user wants to ensure a particular
synchronous Crypto API implementation, they can already reconfigure
dm-crypt and specify the implementation with a "capi:" prefix in the
dm table description.
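
For reference, the "capi:" syntax mentioned above looks roughly like this
(a sketch; the device, size and all-zero key are placeholders, and whether a
particular synchronous driver can be pinned this way depends on the algorithm
names the crypto layer registers on the system):

# pin the cipher via the crypto API "capi:" spec instead of the legacy name
KEY=$(printf '0%.0s' {1..128})   # 512-bit all-zero XTS key, illustration only
echo "0 2097152 crypt capi:xts(aes)-plain64 $KEY 0 /dev/sdb 0" | sudo dmsetup create enc-sdb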

> > This adds a new flag, which directs dm-crypt not to offload crypto operations
> > and process everything inline. For cases, where crypto operations cannot happen
> > inline (hard interrupt context, for example the read path of the NVME driver),
> > we offload the work to a tasklet rather than a workqueue.
>
> This patch both removes some dm-crypt specific queueing, and changes decryption
> to use softIRQ context instead of a workqueue.  It would be useful to know how
> much of a difference the workqueue => softIRQ change makes by itself.  Such a
> change could be useful for fscrypt as well.  (fscrypt uses a workqueue for
> decryption, but besides that doesn't use any other queueing.)
>
> > @@ -127,7 +128,7 @@ struct iv_elephant_private {
> >   * and encrypts / decrypts at the same time.
> >   */
> >  enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
> > -          DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
> > +          DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE = (sizeof(unsigned long) * 8 - 1) };
>
> Assigning a specific enum value isn't necessary.

Yes, this is a leftover from our "internal" patch, which I wanted to
make "future proof" against flag collisions in case future iterations
of dm-crypt add more flags. Will remove in v2.
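
In v2 that would presumably reduce to the flag simply taking the next free
bit, along these lines (a sketch):

enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD, DM_CRYPT_FORCE_INLINE };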

>
> > @@ -1458,13 +1459,18 @@ static void crypt_alloc_req_skcipher(struct crypt_config *cc,
> >
> >       skcipher_request_set_tfm(ctx->r.req, cc->cipher_tfm.tfms[key_index]);
> >
> > -     /*
> > -      * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> > -      * requests if driver request queue is full.
> > -      */
> > -     skcipher_request_set_callback(ctx->r.req,
> > -         CRYPTO_TFM_REQ_MAY_BACKLOG,
> > -         kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
> > +     if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
> > +             /* make sure we zero important fields of the request */
> > +             skcipher_request_set_callback(ctx->r.req,
> > +             0, NULL, NULL);
> > +     else
> > +             /*
> > +              * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
> > +              * requests if driver request queue is full.
> > +              */
> > +             skcipher_request_set_callback(ctx->r.req,
> > +             CRYPTO_TFM_REQ_MAY_BACKLOG,
> > +             kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
> >  }
>
> This looks wrong.  Unless type=0 and mask=CRYPTO_ALG_ASYNC are passed to
> crypto_alloc_skcipher(), the skcipher implementation can still be asynchronous,
> in which case providing a callback is required.
>
> Do you intend that the "force_inline" option forces the use of a synchronous
> skcipher (alongside the other things it does)?  Or should it still allow
> asynchronous ones?

As mentioned above, I don't think we should require synchronous crypto
with the "force_inline" flag anymore. We may still drop
CRYPTO_TFM_REQ_MAY_BACKLOG when the inline flag is set, though.
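
Something along these lines, perhaps (a sketch only, not the final v2: the
completion callback is kept unconditionally, as you suggest, and only the
backlog flag is dropped when inlining):

	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags))
		/* inline mode: no backlogging, but keep the completion
		 * callback so async implementations still work */
		skcipher_request_set_callback(ctx->r.req, 0,
			kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));
	else
		/*
		 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
		 * requests if the driver request queue is full.
		 */
		skcipher_request_set_callback(ctx->r.req,
			CRYPTO_TFM_REQ_MAY_BACKLOG,
			kcryptd_async_done, dmreq_of_req(cc, ctx->r.req));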

> We may not actually have a choice in that matter, since xts-aes-aesni has the
> CRYPTO_ALG_ASYNC bit set (as I mentioned) despite being synchronous in most
> cases; thus, the crypto API won't give you it if you ask for a synchronous
> cipher.  So I think you still need to allow async skciphers?  That means a
> callback is still always required.
>
> - Eric

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
  2020-06-24  8:24     ` Ignat Korchagin
@ 2020-06-24 16:24       ` Eric Biggers
  2020-06-24 17:00         ` Ignat Korchagin
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Biggers @ 2020-06-24 16:24 UTC (permalink / raw)
  To: Ignat Korchagin
  Cc: agk, Mike Snitzer, dm-devel, dm-crypt, linux-kernel, kernel-team

On Wed, Jun 24, 2020 at 09:24:07AM +0100, Ignat Korchagin wrote:
> On Wed, Jun 24, 2020 at 6:04 AM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
> > > Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. This is
> > > especially visible on busy systems with many processes/threads. Moreover, most
> > > Crypto API implementations are async, that is they offload crypto operations on
> > > their own, so this dm-crypt offloading is excessive.
> >
> > This really should say "some Crypto API implementations are async" instead of
> > "most Crypto API implementations are async".
> 
> The most accurate would probably be: most hardware-accelerated Crypto
> API implementations are async
> 
> > Notably, the AES-NI implementation of AES-XTS is synchronous if you call it in a
> > context where SIMD instructions are usable.  It's only asynchronous when SIMD is
> > not usable.  (This seems to have been missed in your blog post.)
> 
> > No, it was not. This is exactly why we made the xts-proxy Crypto API
> > module as a second patch. But it now seems it does not make a big
> > difference whether the Crypto API implementation in use is synchronous
> > as well (based on some benchmarks outlined in the cover letter to this
> > patch). I think the v2 of this patch will not require a synchronous
> > Crypto API. This is probably the right thing to do, as the "inline"
> > flag should control how dm-crypt itself handles requests, not how the
> > Crypto API handles requests. If a user wants to ensure a particular
> > synchronous Crypto API implementation, they can already reconfigure
> > dm-crypt and specify the implementation with a "capi:" prefix in the
> > dm table description.

I think you're missing the point.  Although xts-aes-aesni has the
CRYPTO_ALG_ASYNC bit set, the actual implementation processes the request
synchronously if SIMD instructions are currently usable.  That's always the case
in dm-crypt, as far as I can tell.  This algorithm has the ASYNC flag only
because it's not synchronous when called in hardIRQ context.

That's why your "xts-proxy" doesn't make a difference, and why it's misleading
to suggest that the crypto API is doing its own queueing when you're primarily
talking about xts-aes-aesni.  The crypto API definitely can do its own queueing,
mainly with hardware drivers.  But it doesn't in this common and relevant case.

- Eric

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-crypt] [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
  2020-06-24 16:24       ` Eric Biggers
@ 2020-06-24 17:00         ` Ignat Korchagin
  0 siblings, 0 replies; 31+ messages in thread
From: Ignat Korchagin @ 2020-06-24 17:00 UTC (permalink / raw)
  To: Eric Biggers
  Cc: agk, Mike Snitzer, dm-devel, dm-crypt, linux-kernel, kernel-team

On Wed, Jun 24, 2020 at 5:24 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Wed, Jun 24, 2020 at 09:24:07AM +0100, Ignat Korchagin wrote:
> > On Wed, Jun 24, 2020 at 6:04 AM Eric Biggers <ebiggers@kernel.org> wrote:
> > >
> > > On Fri, Jun 19, 2020 at 05:41:32PM +0100, Ignat Korchagin wrote:
> > > > Sometimes extra thread offloading imposed by dm-crypt hurts IO latency. This is
> > > > especially visible on busy systems with many processes/threads. Moreover, most
> > > > Crypto API implementations are async, that is they offload crypto operations on
> > > > their own, so this dm-crypt offloading is excessive.
> > >
> > > This really should say "some Crypto API implementations are async" instead of
> > > "most Crypto API implementations are async".
> >
> > The most accurate would probably be: most hardware-accelerated Crypto
> > API implementations are async
> >
> > > Notably, the AES-NI implementation of AES-XTS is synchronous if you call it in a
> > > context where SIMD instructions are usable.  It's only asynchronous when SIMD is
> > > not usable.  (This seems to have been missed in your blog post.)
> >
> > No, it was not. This is exactly why we made the xts-proxy Crypto API
> > module as a second patch. But it now seems it does not make a big
> > difference whether the Crypto API implementation in use is synchronous
> > as well (based on some benchmarks outlined in the cover letter to this
> > patch). I think the v2 of this patch will not require a synchronous
> > Crypto API. This is probably the right thing to do, as the "inline"
> > flag should control how dm-crypt itself handles requests, not how the
> > Crypto API handles requests. If a user wants to ensure a particular
> > synchronous Crypto API implementation, they can already reconfigure
> > dm-crypt and specify the implementation with a "capi:" prefix in the
> > dm table description.
>
> I think you're missing the point.  Although xts-aes-aesni has the
> CRYPTO_ALG_ASYNC bit set, the actual implementation processes the request
> synchronously if SIMD instructions are currently usable.  That's always the case
> in dm-crypt, as far as I can tell.  This algorithm has the ASYNC flag only
> because it's not synchronous when called in hardIRQ context.
>
> That's why your "xts-proxy" doesn't make a difference, and why it's misleading
> to suggest that the crypto API is doing its own queueing when you're primarily
> talking about xts-aes-aesni.  The crypto API definitely can do its own queueing,
> mainly with hardware drivers.  But it doesn't in this common and relevant case.

I think we're talking about the same things but from different points
of view. I would like to clarify that neither the blog post nor this
change is intended to focus on aesni (or any x86-specific crypto
optimizations). In fact it is quite the opposite: we want to optimize
the generic dm-crypt regardless of which crypto is used (that's why I
just used a NULL cipher in the cover letter). We also have some arm64
machines [1] and I bet they would benefit here as well. The important
point my post tries to make is that the original workqueue offloading
in dm-crypt was added because the Crypto API was synchronous back in
the day and, exactly as you say, you may not be able to use some
hw-accelerated crypto in hard IRQ context, and doing non-hw crypto in
hard IRQ context is a bad idea as well. Now, most Crypto API
implementations are smart enough to figure out on their own whether
they should process the request inline or offload it to a workqueue,
so the workarounds in dm-crypt itself are most likely not needed.
Though the generic Crypto API "cipher walk" function does refuse to
"walk" the buffers in hard IRQ context, so the "tasklet" functionality
is still required.
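
Roughly, the dispatch described above could look like this (a sketch, not the
exact patch; io->tasklet and kcryptd_crypt_tasklet are helpers assumed to be
added alongside the flag):

static void kcryptd_queue_crypt(struct dm_crypt_io *io)
{
	struct crypt_config *cc = io->cc;

	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
		if (in_irq()) {
			/* the crypto walk code cannot run in hard IRQ context,
			 * so defer to a tasklet (softirq), not a workqueue */
			tasklet_init(&io->tasklet, kcryptd_crypt_tasklet,
				     (unsigned long)&io->work);
			tasklet_schedule(&io->tasklet);
		} else {
			/* process the request in the caller's context */
			kcryptd_crypt(&io->work);
		}
		return;
	}

	INIT_WORK(&io->work, kcryptd_crypt);
	queue_work(cc->crypt_queue, &io->work);
}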

But from the dm-crypt perspective, it should not take into account
whether a particular xts-aes-aesni implementation is MOSTLY
synchronous - those are implementation details of this particular
cipher that dm-crypt has no visibility into. So, in my opinion, it
would be right to say that if the cipher has the CRYPTO_ALG_ASYNC
flag set, it can offload the crypto request to a workqueue at any
time. How often it actually does so is another story, and should
probably be reviewed elsewhere if it does it too often.
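
In code terms, the only signal dm-crypt can reasonably act on is the
algorithm flag itself, roughly (a sketch; the helper name is made up for
illustration):

#include <crypto/skcipher.h>

/* from dm-crypt's point of view the only reliable signal is the algorithm's
 * ASYNC flag, not how often the driver actually defers work */
static bool crypt_tfm_may_defer(struct crypto_skcipher *tfm)
{
	return crypto_skcipher_alg(tfm)->base.cra_flags & CRYPTO_ALG_ASYNC;
}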

Ignat

[1]: https://blog.cloudflare.com/arm-takes-wing/

> - Eric

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2020-06-24 17:01 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-19 16:41 [RFC PATCH 0/1] dm-crypt excessive overhead Ignat Korchagin
2020-06-19 16:41 ` [RFC PATCH 1/1] Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target Ignat Korchagin
2020-06-24  5:04   ` [dm-crypt] " Eric Biggers
2020-06-24  5:21     ` [dm-devel] " Damien Le Moal
2020-06-24  5:27       ` Eric Biggers
2020-06-24  6:46         ` Damien Le Moal
2020-06-24  7:24         ` Damien Le Moal
2020-06-24  7:49     ` Damien Le Moal
2020-06-24  8:24     ` Ignat Korchagin
2020-06-24 16:24       ` Eric Biggers
2020-06-24 17:00         ` Ignat Korchagin
2020-06-24  5:12   ` [dm-devel] " Damien Le Moal
2020-06-19 16:55 ` [RFC PATCH 0/1] dm-crypt excessive overhead Mike Snitzer
2020-06-19 18:39   ` Mikulas Patocka
2020-06-19 19:44     ` Ignat Korchagin
2020-06-20  1:23     ` Herbert Xu
2020-06-20 19:36       ` Mikulas Patocka
2020-06-20 21:02         ` Ignat Korchagin
2020-06-23 15:33       ` Mike Snitzer
2020-06-23 16:24         ` Ignat Korchagin
2020-06-24  0:23           ` Herbert Xu
2020-06-22  0:45   ` [dm-devel] " Damien Le Moal
2020-06-22  7:55     ` Ignat Korchagin
2020-06-22  8:08       ` Damien Le Moal
2020-06-23 15:01     ` Mike Snitzer
2020-06-23 15:07       ` Ignat Korchagin
2020-06-23 15:22         ` Mike Snitzer
2020-06-24  4:54           ` [dm-devel] " Damien Le Moal
2020-06-24  5:22             ` Mike Snitzer
2020-06-24  8:02               ` Ignat Korchagin
2020-06-24  4:28       ` Damien Le Moal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).