linux-kernel.vger.kernel.org archive mirror
* [RFC 0/2] nocopy bvec for direct IO
@ 2020-12-09  2:19 Pavel Begunkov
  2020-12-09  2:19 ` [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED Pavel Begunkov
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-09  2:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner

The idea is to avoid copying, merging, etc. the bvec from the iterator to
the bio in direct I/O, and to use the one we've already got. Hook it up for
io_uring. I've had an eye on this for a long time, and it was also brought
up by Matthew just recently. Let me know if I forgot or misplaced some tags.

A benchmark got me 430 KIOPS vs 540 KIOPS, or +25%, on bare metal, and perf
shows that bio_iov_iter_get_pages() was taking ~20%. The test is pretty
silly, but the numbers are still imposing. I'll redo it closer to reality
for the next iteration; I need to double-check some cases anyway.

If the same is applied to iomap, the common chunk can be moved from
block_dev into bio_iov_iter_get_pages(), but if there is any benefit for
filesystems, they should explicitly opt in with ITER_BVEC_FLAG_FIXED.
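
In essence (a distilled sketch of what patch 2 below does), instead of
walking the iterator and re-adding every page, the bio borrows the
iterator's vector directly:

	bio->bi_io_vec = (struct bio_vec *)iter->bvec;	/* no copy */
	bio->bi_vcnt = iter->nr_segs;
	bio->bi_iter.bi_size = iter->count;
	bio->bi_iter.bi_bvec_done = iter->iov_offset;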

# how to apply
based on Jens' for-5.11/block
+ Ming's nr_vec patch,
+ io_uring fix, 9c3a205c5ffa36e96903c2 ("io_uring: fix ITER_BVEC check")

or there:
https://github.com/isilence/linux/commits/bvec_nocopy

# how to reproduce
null_blk queue_mode=2 completion_nsec=0 submit_queues=NUM_CPU
fio/t/io_uring with null blk, no iopoll, BS=16*4096
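
For example (a sketch; the exact t/io_uring switches are an assumption and
may differ between fio versions):

	modprobe null_blk queue_mode=2 completion_nsec=0 submit_queues=$(nproc)
	cd fio && make
	# -b: block size (16*4096), -p0: no iopoll, -B1: fixed (registered) buffers
	./t/io_uring -b65536 -p0 -B1 /dev/nullb0

Fixed buffers (-B1) matter here: the no-copy path is only taken for
io_uring's registered buffers, which are submitted as a bvec-backed iter.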


Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

Pavel Begunkov (2):
  iov: introduce ITER_BVEC_FLAG_FIXED
  block: no-copy bvec for direct IO

 fs/block_dev.c      | 30 +++++++++++++++++++++++++++++-
 fs/io_uring.c       |  1 +
 include/linux/uio.h | 14 +++++++++++---
 3 files changed, 41 insertions(+), 4 deletions(-)

-- 
2.24.0



* [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED
  2020-12-09  2:19 [RFC 0/2] nocopy bvec for direct IO Pavel Begunkov
@ 2020-12-09  2:19 ` Pavel Begunkov
  2020-12-09  8:36   ` Christoph Hellwig
  2020-12-09  2:19 ` [PATCH 2/2] block: no-copy bvec for direct IO Pavel Begunkov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-09  2:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Alexander Viro, linux-fsdevel, io-uring, linux-block, linux-kernel

Add an ITER_BVEC_FLAG_FIXED iov_iter flag, which will allow us to reuse
the passed-in bvec instead of copying it. In particular it means that
iter->bvec won't be freed and that page references taken by the caller
remain valid until callees no longer need them, including during
asynchronous execution.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c       |  1 +
 include/linux/uio.h | 14 +++++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c536462920a3..9ff2805d0075 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2920,6 +2920,7 @@ static ssize_t io_import_fixed(struct io_kiocb *req, int rw,
 		}
 	}
 
+	iter->type |= ITER_BVEC_FLAG_FIXED;
 	return len;
 }
 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 72d88566694e..af626eb970cf 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -18,6 +18,8 @@ struct kvec {
 };
 
 enum iter_type {
+	ITER_BVEC_FLAG_FIXED = 2,
+
 	/* iter types */
 	ITER_IOVEC = 4,
 	ITER_KVEC = 8,
@@ -29,8 +31,9 @@ enum iter_type {
 struct iov_iter {
 	/*
 	 * Bit 0 is the read/write bit, set if we're writing.
-	 * Bit 1 is the BVEC_FLAG_NO_REF bit, set if type is a bvec and
-	 * the caller isn't expecting to drop a page reference when done.
+	 * Bit 1 is the BVEC_FLAG_FIXED bit, set if type is a bvec and the
+	 * caller ensures that page references and memory backing the bvec won't
+	 * go away until callees finish with them.
 	 */
 	unsigned int type;
 	size_t iov_offset;
@@ -52,7 +55,7 @@ struct iov_iter {
 
 static inline enum iter_type iov_iter_type(const struct iov_iter *i)
 {
-	return i->type & ~(READ | WRITE);
+	return i->type & ~(READ | WRITE | ITER_BVEC_FLAG_FIXED);
 }
 
 static inline bool iter_is_iovec(const struct iov_iter *i)
@@ -85,6 +88,11 @@ static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 	return i->type & (READ | WRITE);
 }
 
+static inline unsigned char iov_iter_bvec_fixed(const struct iov_iter *i)
+{
+	return i->type & ITER_BVEC_FLAG_FIXED;
+}
+
 /*
  * Total number of bytes covered by an iovec.
  *
-- 
2.24.0



* [PATCH 2/2] block: no-copy bvec for direct IO
  2020-12-09  2:19 [RFC 0/2] nocopy bvec for direct IO Pavel Begunkov
  2020-12-09  2:19 ` [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED Pavel Begunkov
@ 2020-12-09  2:19 ` Pavel Begunkov
  2020-12-09  8:40   ` Christoph Hellwig
  2020-12-09 21:13   ` David Laight
  2020-12-09  6:50 ` [RFC 0/2] nocopy " Christoph Hellwig
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-09  2:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Matthew Wilcox

The block layer spends quite a while in blkdev_direct_IO() copying and
initialising the bio's bvec. However, if we've already got a bvec in the
input iterator, it can be reused in some cases, i.e. when the new
ITER_BVEC_FLAG_FIXED flag is set. Simple tests show a considerable
performance boost, and it also reduces the memory footprint.

Suggested-by: Matthew Wilcox <willy@infradead.org>
[BIO_WORKINGSET] Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/block_dev.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index d699f3af1a09..aee5d2e4f324 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -349,6 +349,28 @@ static void blkdev_bio_end_io(struct bio *bio)
 	}
 }
 
+static int bio_iov_fixed_bvec_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	bio->bi_vcnt = iter->nr_segs;
+	bio->bi_max_vecs = iter->nr_segs;
+	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
+	bio->bi_iter.bi_bvec_done = iter->iov_offset;
+	bio->bi_iter.bi_size = iter->count;
+
+	/*
+	 * In practice, groups of pages tend to be accessed/reclaimed/refaulted
+	 * together. To avoid iterating over the bvec for callers that didn't
+	 * set BIO_WORKINGSET, approximate it by looking at the first page and
+	 * applying the flag to the whole bio.
+	 */
+	if (unlikely(PageWorkingset(iter->bvec->bv_page)))
+		bio_set_flag(bio, BIO_WORKINGSET);
+	bio_set_flag(bio, BIO_NO_PAGE_REF);
+
+	iter->count = 0;
+	return 0;
+}
+
 static ssize_t
 __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_vecs)
 {
@@ -368,6 +390,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_vecs)
 	    (bdev_logical_block_size(bdev) - 1))
 		return -EINVAL;
 
+	if (iov_iter_bvec_fixed(iter))
+		nr_vecs = 0;
 	bio = bio_alloc_bioset(GFP_KERNEL, nr_vecs, &blkdev_dio_pool);
 
 	dio = container_of(bio, struct blkdev_dio, bio);
@@ -398,7 +422,11 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_vecs)
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
 
-		ret = bio_iov_iter_get_pages(bio, iter);
+		if (iov_iter_is_bvec(iter) && iov_iter_bvec_fixed(iter))
+			ret = bio_iov_fixed_bvec_get_pages(bio, iter);
+		else
+			ret = bio_iov_iter_get_pages(bio, iter);
+
 		if (unlikely(ret)) {
 			bio->bi_status = BLK_STS_IOERR;
 			bio_endio(bio);
-- 
2.24.0



* Re: [RFC 0/2] nocopy bvec for direct IO
  2020-12-09  2:19 [RFC 0/2] nocopy bvec for direct IO Pavel Begunkov
  2020-12-09  2:19 ` [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED Pavel Begunkov
  2020-12-09  2:19 ` [PATCH 2/2] block: no-copy bvec for direct IO Pavel Begunkov
@ 2020-12-09  6:50 ` Christoph Hellwig
  2020-12-09 11:54   ` Pavel Begunkov
  2020-12-09 16:53 ` Jens Axboe
  2020-12-09 17:06 ` Al Viro
  4 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2020-12-09  6:50 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner

On Wed, Dec 09, 2020 at 02:19:50AM +0000, Pavel Begunkov wrote:
> A benchmark got me 430 KIOPS vs 540 KIOPS, or +25%, on bare metal, and perf
> shows that bio_iov_iter_get_pages() was taking ~20%. The test is pretty
> silly, but the numbers are still imposing. I'll redo it closer to reality
> for the next iteration; I need to double-check some cases anyway.

That is pretty impressive.  But I only got this cover letter, not the
actual patches...


* Re: [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED
  2020-12-09  2:19 ` [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED Pavel Begunkov
@ 2020-12-09  8:36   ` Christoph Hellwig
  2020-12-09  9:06     ` Christoph Hellwig
  2020-12-09 13:07     ` Al Viro
  0 siblings, 2 replies; 26+ messages in thread
From: Christoph Hellwig @ 2020-12-09  8:36 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel

Ok, seems like the patches made it to the lists, while you only
sent the cover letter to my address, which is very strange.

> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 72d88566694e..af626eb970cf 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -18,6 +18,8 @@ struct kvec {
>  };
>  
>  enum iter_type {
> +	ITER_BVEC_FLAG_FIXED = 2,
> +
>  	/* iter types */
>  	ITER_IOVEC = 4,
>  	ITER_KVEC = 8,

This is making the iter type even more of a mess than it already is.
I think we at least need placeholders for 0/1 here and an explicit
flags namespace, preferably after the types.
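
Something along these lines, presumably (an illustrative sketch, not a
patch; the flag value is arbitrary):

	enum iter_type {
		/* bits 0 and 1 are reserved for READ/WRITE */

		/* iter types */
		ITER_IOVEC = 4,
		ITER_KVEC = 8,
		ITER_BVEC = 16,
		ITER_PIPE = 32,
		ITER_DISCARD = 64,

		/* iter flags, namespaced after the types */
		ITER_BVEC_FLAG_FIXED = 128,
	};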

Then again I'd much prefer if we didn't even add the flag or at best
just add it for a short-term transition and move everyone over to the
new scheme.  Otherwise the amount of different interfaces and supporting
code keeps exploding.

> @@ -29,8 +31,9 @@ enum iter_type {
>  struct iov_iter {
>  	/*
>  	 * Bit 0 is the read/write bit, set if we're writing.
> -	 * Bit 1 is the BVEC_FLAG_NO_REF bit, set if type is a bvec and
> -	 * the caller isn't expecting to drop a page reference when done.
> +	 * Bit 1 is the BVEC_FLAG_FIXED bit, set if type is a bvec and the
> +	 * caller ensures that page references and memory backing the bvec won't
> +	 * go away until callees finish with them.
>  	 */
>  	unsigned int type;

I think the comment needs to move to the enum.


* Re: [PATCH 2/2] block: no-copy bvec for direct IO
  2020-12-09  2:19 ` [PATCH 2/2] block: no-copy bvec for direct IO Pavel Begunkov
@ 2020-12-09  8:40   ` Christoph Hellwig
  2020-12-09 12:01     ` Pavel Begunkov
  2020-12-11 14:06     ` Johannes Weiner
  2020-12-09 21:13   ` David Laight
  1 sibling, 2 replies; 26+ messages in thread
From: Christoph Hellwig @ 2020-12-09  8:40 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Matthew Wilcox, Johannes Weiner

> +	/*
> +	 * In practice, groups of pages tend to be accessed/reclaimed/refaulted
> +	 * together. To avoid iterating over the bvec for callers that didn't
> +	 * set BIO_WORKINGSET, approximate it by looking at the first page and
> +	 * applying the flag to the whole bio.
> +	 */
> +	if (unlikely(PageWorkingset(iter->bvec->bv_page)))
> +		bio_set_flag(bio, BIO_WORKINGSET);

IIRC the feedback was that we do not need to deal with BIO_WORKINGSET
at all for direct I/O.

> +	bio_set_flag(bio, BIO_NO_PAGE_REF);
> +
> +	iter->count = 0;
> +	return 0;
> +}

This helper should go into bio.c, next to bio_iov_iter_get_pages.
And please convert the other callers of bio_iov_iter_get_pages to this
scheme as well.


* Re: [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED
  2020-12-09  8:36   ` Christoph Hellwig
@ 2020-12-09  9:06     ` Christoph Hellwig
  2020-12-09 11:54       ` Pavel Begunkov
  2020-12-09 13:07     ` Al Viro
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2020-12-09  9:06 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel

On Wed, Dec 09, 2020 at 08:36:45AM +0000, Christoph Hellwig wrote:
> This is making the iter type even more of a mess than it already is.
> I think we at least need placeholders for 0/1 here and an explicit
> flags namespace, preferably after the types.
> 
> Then again I'd much prefer if we didn't even add the flag or at best
> just add it for a short-term transition and move everyone over to the
> new scheme.  Otherwise the amount of different interfaces and supporting
> code keeps exploding.

Note that the only other callers that use iov_iter_bvec and asynchronous
read/write are loop, target and nvme-target, so this should actually
be pretty simple.  Out of these only target needs something like this
trivial change to keep the bvec available over the duration of the I/O,
the other two should be fine already:

---
From 581a8eabbb1759e3dcfee4b1d2e419f814a8cb80 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Wed, 9 Dec 2020 10:05:04 +0100
Subject: target/file: allocate the bvec array as part of struct target_core_file_cmd

This saves one memory allocation, and ensures the bvecs aren't freed
before the AIO completion.  This will allow the lower level code to be
optimized so that it can avoid allocating another bvec array.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/target/target_core_file.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 7143d03f0e027e..ed0c39a1f7c649 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -241,6 +241,7 @@ struct target_core_file_cmd {
 	unsigned long	len;
 	struct se_cmd	*cmd;
 	struct kiocb	iocb;
+	struct bio_vec	bvecs[];
 };
 
 static void cmd_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
@@ -268,29 +269,22 @@ fd_execute_rw_aio(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents,
 	struct target_core_file_cmd *aio_cmd;
 	struct iov_iter iter = {};
 	struct scatterlist *sg;
-	struct bio_vec *bvec;
 	ssize_t len = 0;
 	int ret = 0, i;
 
-	aio_cmd = kmalloc(sizeof(struct target_core_file_cmd), GFP_KERNEL);
+	aio_cmd = kmalloc(struct_size(aio_cmd, bvecs, sgl_nents), GFP_KERNEL);
 	if (!aio_cmd)
 		return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 
-	bvec = kcalloc(sgl_nents, sizeof(struct bio_vec), GFP_KERNEL);
-	if (!bvec) {
-		kfree(aio_cmd);
-		return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
-	}
-
 	for_each_sg(sgl, sg, sgl_nents, i) {
-		bvec[i].bv_page = sg_page(sg);
-		bvec[i].bv_len = sg->length;
-		bvec[i].bv_offset = sg->offset;
+		aio_cmd->bvecs[i].bv_page = sg_page(sg);
+		aio_cmd->bvecs[i].bv_len = sg->length;
+		aio_cmd->bvecs[i].bv_offset = sg->offset;
 
 		len += sg->length;
 	}
 
-	iov_iter_bvec(&iter, is_write, bvec, sgl_nents, len);
+	iov_iter_bvec(&iter, is_write, aio_cmd->bvecs, sgl_nents, len);
 
 	aio_cmd->cmd = cmd;
 	aio_cmd->len = len;
@@ -307,8 +301,6 @@ fd_execute_rw_aio(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents,
 	else
 		ret = call_read_iter(file, &aio_cmd->iocb, &iter);
 
-	kfree(bvec);
-
 	if (ret != -EIOCBQUEUED)
 		cmd_rw_aio_complete(&aio_cmd->iocb, ret, 0);
 
-- 
2.29.2



* Re: [RFC 0/2] nocopy bvec for direct IO
  2020-12-09  6:50 ` [RFC 0/2] nocopy " Christoph Hellwig
@ 2020-12-09 11:54   ` Pavel Begunkov
  0 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-09 11:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Matthew Wilcox, Ming Lei, Johannes Weiner

On 09/12/2020 06:50, Christoph Hellwig wrote:
> On Wed, Dec 09, 2020 at 02:19:50AM +0000, Pavel Begunkov wrote:
>> A benchmark got me 430 KIOPS vs 540 KIOPS, or +25%, on bare metal, and perf
>> shows that bio_iov_iter_get_pages() was taking ~20%. The test is pretty
>> silly, but the numbers are still imposing. I'll redo it closer to reality
>> for the next iteration; I need to double-check some cases anyway.
> 
> That is pretty impressive.  

The difference will go down with BS=~1-2 pages; I just need to find a
moment to test it properly.

> But I only got this cover letter, not the
> actual patches...

Apologies, that goes for everyone, as the CCs were lost on the patches.


-- 
Pavel Begunkov


* Re: [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED
  2020-12-09  9:06     ` Christoph Hellwig
@ 2020-12-09 11:54       ` Pavel Begunkov
  0 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-09 11:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Matthew Wilcox, Ming Lei

On 09/12/2020 09:06, Christoph Hellwig wrote:
> On Wed, Dec 09, 2020 at 08:36:45AM +0000, Christoph Hellwig wrote:
>> This is making the iter type even more of a mess than it already is.
>> I think we at least need placeholders for 0/1 here and an explicit
>> flags namespace, preferably after the types.
>>
>> Then again I'd much prefer if we didn't even add the flag or at best
>> just add it for a short-term transition and move everyone over to the
>> new scheme.  Otherwise the amount of different interfaces and supporting
>> code keeps exploding.

At least the flag can be ignored. Anyway, sounds good to me. I'll add
your patch below to the series, thanks!

> 
> Note that the only other callers that use iov_iter_bvec and asynchronous
> read/write are loop, target and nvme-target, so this should actually
> be pretty simple.  Out of these only target needs something like this
> trivial change to keep the bvec available over the duration of the I/O,
> the other two should be fine already:
> 
> ---
> From 581a8eabbb1759e3dcfee4b1d2e419f814a8cb80 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Wed, 9 Dec 2020 10:05:04 +0100
> Subject: target/file: allocate the bvec array as part of struct target_core_file_cmd
> 
> This saves one memory allocation, and ensures the bvecs aren't freed
> before the AIO completion.  This will allow the lower level code to be
> optimized so that it can avoid allocating another bvec array.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/target/target_core_file.c | 20 ++++++--------------
>  1 file changed, 6 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
> index 7143d03f0e027e..ed0c39a1f7c649 100644
> --- a/drivers/target/target_core_file.c
> +++ b/drivers/target/target_core_file.c
> @@ -241,6 +241,7 @@ struct target_core_file_cmd {
>  	unsigned long	len;
>  	struct se_cmd	*cmd;
>  	struct kiocb	iocb;
> +	struct bio_vec	bvecs[];
>  };
>  
>  static void cmd_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
> @@ -268,29 +269,22 @@ fd_execute_rw_aio(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents,
>  	struct target_core_file_cmd *aio_cmd;
>  	struct iov_iter iter = {};
>  	struct scatterlist *sg;
> -	struct bio_vec *bvec;
>  	ssize_t len = 0;
>  	int ret = 0, i;
>  
> -	aio_cmd = kmalloc(sizeof(struct target_core_file_cmd), GFP_KERNEL);
> +	aio_cmd = kmalloc(struct_size(aio_cmd, bvecs, sgl_nents), GFP_KERNEL);
>  	if (!aio_cmd)
>  		return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
>  
> -	bvec = kcalloc(sgl_nents, sizeof(struct bio_vec), GFP_KERNEL);
> -	if (!bvec) {
> -		kfree(aio_cmd);
> -		return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
> -	}
> -
>  	for_each_sg(sgl, sg, sgl_nents, i) {
> -		bvec[i].bv_page = sg_page(sg);
> -		bvec[i].bv_len = sg->length;
> -		bvec[i].bv_offset = sg->offset;
> +		aio_cmd->bvecs[i].bv_page = sg_page(sg);
> +		aio_cmd->bvecs[i].bv_len = sg->length;
> +		aio_cmd->bvecs[i].bv_offset = sg->offset;
>  
>  		len += sg->length;
>  	}
>  
> -	iov_iter_bvec(&iter, is_write, bvec, sgl_nents, len);
> +	iov_iter_bvec(&iter, is_write, aio_cmd->bvecs, sgl_nents, len);
>  
>  	aio_cmd->cmd = cmd;
>  	aio_cmd->len = len;
> @@ -307,8 +301,6 @@ fd_execute_rw_aio(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents,
>  	else
>  		ret = call_read_iter(file, &aio_cmd->iocb, &iter);
>  
> -	kfree(bvec);
> -
>  	if (ret != -EIOCBQUEUED)
>  		cmd_rw_aio_complete(&aio_cmd->iocb, ret, 0);
>  
> 

-- 
Pavel Begunkov


* Re: [PATCH 2/2] block: no-copy bvec for direct IO
  2020-12-09  8:40   ` Christoph Hellwig
@ 2020-12-09 12:01     ` Pavel Begunkov
  2020-12-09 12:05       ` Christoph Hellwig
  2020-12-11 14:06     ` Johannes Weiner
  1 sibling, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-09 12:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Matthew Wilcox, Johannes Weiner, Ming Lei

On 09/12/2020 08:40, Christoph Hellwig wrote:
>> +	/*
>> +	 * In practice, groups of pages tend to be accessed/reclaimed/refaulted
>> +	 * together. To avoid iterating over the bvec for callers that didn't
>> +	 * set BIO_WORKINGSET, approximate it by looking at the first page and
>> +	 * applying the flag to the whole bio.
>> +	 */
>> +	if (unlikely(PageWorkingset(iter->bvec->bv_page)))
>> +		bio_set_flag(bio, BIO_WORKINGSET);
> 
> IIRC the feedback was that we do not need to deal with BIO_WORKINGSET
> at all for direct I/O.

I was waiting for the conversation to unfold, i.e. for Johannes to
answer. BTW, would the same (skipping BIO_WORKINGSET) hold true for
iomap?

> 
>> +	bio_set_flag(bio, BIO_NO_PAGE_REF);
>> +
>> +	iter->count = 0;
>> +	return 0;
>> +}
> 
> This helper should go into bio.c, next to bio_iov_iter_get_pages.
> And please convert the other callers of bio_iov_iter_get_pages to this
> scheme as well.

Agreed. In the end I want to merge that into bio_iov_iter_get_pages().
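
Presumably ending up with a single entry point along these lines (a sketch
of the intent, not the final code):

	int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
	{
		/* reuse the caller's bvec when it is guaranteed to stay alive */
		if (iov_iter_is_bvec(iter) && iov_iter_bvec_fixed(iter))
			return bio_iov_fixed_bvec_get_pages(bio, iter);
		/* ... otherwise fall back to pinning pages as before ... */
	}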

-- 
Pavel Begunkov


* Re: [PATCH 2/2] block: no-copy bvec for direct IO
  2020-12-09 12:05       ` Christoph Hellwig
@ 2020-12-09 12:03         ` Pavel Begunkov
  0 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-09 12:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Matthew Wilcox, Johannes Weiner, Ming Lei

On 09/12/2020 12:05, Christoph Hellwig wrote:
> On Wed, Dec 09, 2020 at 12:01:27PM +0000, Pavel Begunkov wrote:
>> On 09/12/2020 08:40, Christoph Hellwig wrote:
>>>> +	/*
>>>> +	 * In practice, groups of pages tend to be accessed/reclaimed/refaulted
>>>> +	 * together. To avoid iterating over the bvec for callers that didn't
>>>> +	 * set BIO_WORKINGSET, approximate it by looking at the first page and
>>>> +	 * applying the flag to the whole bio.
>>>> +	 */
>>>> +	if (unlikely(PageWorkingset(iter->bvec->bv_page)))
>>>> +		bio_set_flag(bio, BIO_WORKINGSET);
>>>
>>> IIRC the feedback was that we do not need to deal with BIO_WORKINGSET
>>> at all for direct I/O.
>>
>> I was waiting for the conversation to unfold, i.e. for Johannes to
>> answer. BTW, would the same (skipping BIO_WORKINGSET) hold true for
>> iomap?
> 
> iomap direct I/O: yes.

That one, got it

-- 
Pavel Begunkov


* Re: [PATCH 2/2] block: no-copy bvec for direct IO
  2020-12-09 12:01     ` Pavel Begunkov
@ 2020-12-09 12:05       ` Christoph Hellwig
  2020-12-09 12:03         ` Pavel Begunkov
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2020-12-09 12:05 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Christoph Hellwig, Jens Axboe, Alexander Viro, linux-fsdevel,
	io-uring, linux-block, linux-kernel, Matthew Wilcox,
	Johannes Weiner, Ming Lei

On Wed, Dec 09, 2020 at 12:01:27PM +0000, Pavel Begunkov wrote:
> On 09/12/2020 08:40, Christoph Hellwig wrote:
> >> +	/*
> >> +	 * In practice, groups of pages tend to be accessed/reclaimed/refaulted
> >> +	 * together. To avoid iterating over the bvec for callers that didn't
> >> +	 * set BIO_WORKINGSET, approximate it by looking at the first page and
> >> +	 * applying the flag to the whole bio.
> >> +	 */
> >> +	if (unlikely(PageWorkingset(iter->bvec->bv_page)))
> >> +		bio_set_flag(bio, BIO_WORKINGSET);
> > 
> > IIRC the feedback was that we do not need to deal with BIO_WORKINGSET
> > at all for direct I/O.
> 
> I was waiting for the conversation to unfold, i.e. for Johannes to
> answer. BTW, would the same (skipping BIO_WORKINGSET) hold true for
> iomap?

iomap direct I/O: yes.


* Re: [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED
  2020-12-09  8:36   ` Christoph Hellwig
  2020-12-09  9:06     ` Christoph Hellwig
@ 2020-12-09 13:07     ` Al Viro
  2020-12-09 13:37       ` Pavel Begunkov
  1 sibling, 1 reply; 26+ messages in thread
From: Al Viro @ 2020-12-09 13:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Begunkov, Jens Axboe, linux-fsdevel, io-uring, linux-block,
	linux-kernel

On Wed, Dec 09, 2020 at 08:36:45AM +0000, Christoph Hellwig wrote:
> 
> This is making the iter type even more of a mess than it already is.
> I think we at least need placeholders for 0/1 here and an explicit
> flags namespace, preferably after the types.
> 
> Then again I'd much prefer if we didn't even add the flag or at best
> just add it for a short-term transition and move everyone over to the
> new scheme.  Otherwise the amount of different interfaces and supporting
> code keeps exploding.

Yes.  The only problem I see is how to describe the rules - "bdev-backed
iterators need the bvec array to stay allocated until IO completes"?
And one way or another, that needs to be documented - D/f/porting with
"mandatory" for priority.

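Roughly this kind of entry in Documentation/filesystems/porting, one would
assume (the wording is a sketch):

	---
	**mandatory**

	For bvec-backed iov_iters used for I/O, the bvec array (and the pages
	it references) must stay intact until the I/O completes; for async I/O
	that means until ->ki_complete() has been called.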

* Re: [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED
  2020-12-09 13:07     ` Al Viro
@ 2020-12-09 13:37       ` Pavel Begunkov
  2020-12-09 17:55         ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-09 13:37 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: Jens Axboe, linux-fsdevel, io-uring, linux-block, linux-kernel

On 09/12/2020 13:07, Al Viro wrote:
> On Wed, Dec 09, 2020 at 08:36:45AM +0000, Christoph Hellwig wrote:
>>
>> This is making the iter type even more of a mess than it already is.
>> I think we at least need placeholders for 0/1 here and an explicit
>> flags namespace, preferably after the types.
>>
>> Then again I'd much prefer if we didn't even add the flag or at best
>> just add it for a short-term transition and move everyone over to the
>> new scheme.  Otherwise the amount of different interfaces and supporting
>> code keeps exploding.
> 
> Yes.  The only problem I see is how to describe the rules - "bdev-backed
> iterators need the bvec array to stay allocated until IO completes"?
> And one way or another, that needs to be documented - D/f/porting with
> "mandatory" for priority.

Yeah, I had trouble putting comments around it, and it's still open.

For the current cases it can be bound to the kiocb, e.g. "if a bvec iter is
passed "together" with a kiocb then the vector should stay intact up to
->ki_complete()". But that "together" is rather full of holes.

-- 
Pavel Begunkov


* Re: [RFC 0/2] nocopy bvec for direct IO
  2020-12-09  2:19 [RFC 0/2] nocopy bvec for direct IO Pavel Begunkov
                   ` (2 preceding siblings ...)
  2020-12-09  6:50 ` [RFC 0/2] nocopy " Christoph Hellwig
@ 2020-12-09 16:53 ` Jens Axboe
  2020-12-13 22:03   ` Pavel Begunkov
  2020-12-09 17:06 ` Al Viro
  4 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2020-12-09 16:53 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner

On 12/8/20 7:19 PM, Pavel Begunkov wrote:
> The idea is to avoid copying, merging, etc. the bvec from the iterator to
> the bio in direct I/O, and to use the one we've already got. Hook it up for
> io_uring. I've had an eye on this for a long time, and it was also brought
> up by Matthew just recently. Let me know if I forgot or misplaced some tags.
> 
> A benchmark got me 430 KIOPS vs 540 KIOPS, or +25%, on bare metal, and perf
> shows that bio_iov_iter_get_pages() was taking ~20%. The test is pretty
> silly, but the numbers are still imposing. I'll redo it closer to reality
> for the next iteration; I need to double-check some cases anyway.
> 
> If the same is applied to iomap, the common chunk can be moved from
> block_dev into bio_iov_iter_get_pages(), but if there is any benefit for
> filesystems, they should explicitly opt in with ITER_BVEC_FLAG_FIXED.

Ran this on a real device, and I get a 10% bump in performance with it.
That's pretty amazing! So please do pursue this one and pull it to
completion.

-- 
Jens Axboe



* Re: [RFC 0/2] nocopy bvec for direct IO
  2020-12-09  2:19 [RFC 0/2] nocopy bvec for direct IO Pavel Begunkov
                   ` (3 preceding siblings ...)
  2020-12-09 16:53 ` Jens Axboe
@ 2020-12-09 17:06 ` Al Viro
  4 siblings, 0 replies; 26+ messages in thread
From: Al Viro @ 2020-12-09 17:06 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, linux-fsdevel, io-uring, linux-block, linux-kernel,
	Christoph Hellwig, Matthew Wilcox, Ming Lei, Johannes Weiner

On Wed, Dec 09, 2020 at 02:19:50AM +0000, Pavel Begunkov wrote:
> The idea is to avoid copying, merging, etc. the bvec from the iterator to
> the bio in direct I/O, and to use the one we've already got. Hook it up for
> io_uring. I've had an eye on this for a long time, and it was also brought
> up by Matthew just recently. Let me know if I forgot or misplaced some tags.
> 
> A benchmark got me 430 KIOPS vs 540 KIOPS, or +25%, on bare metal, and perf
> shows that bio_iov_iter_get_pages() was taking ~20%. The test is pretty
> silly, but the numbers are still imposing. I'll redo it closer to reality
> for the next iteration; I need to double-check some cases anyway.
> 
> If the same is applied to iomap, the common chunk can be moved from
> block_dev into bio_iov_iter_get_pages(), but if there is any benefit for
> filesystems, they should explicitly opt in with ITER_BVEC_FLAG_FIXED.

To reiterate what hch said - this "opt in" is wrong.  Out-of-tree
code that does async IO on bvec-backed iov_iter, setting it up on
its own, will have to adapt; that's all.

iov_iter and its users are already in serious need of simplification
and cleanups; piling more on top of that would be a bloody bad idea.

The proposed semantics change for bvec-backed iov_iter makes a lot of sense,
so let's make sure that everything in tree can live with it, document
the change and switch to better semantics.

This thing should be unconditional.  Document it in D/f/porting and
if something out of tree complains, it's their problem - not ours.


* Re: [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED
  2020-12-09 13:37       ` Pavel Begunkov
@ 2020-12-09 17:55         ` Christoph Hellwig
  2020-12-09 18:24           ` Matthew Wilcox
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2020-12-09 17:55 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Al Viro, Christoph Hellwig, Jens Axboe, linux-fsdevel, io-uring,
	linux-block, linux-kernel

On Wed, Dec 09, 2020 at 01:37:05PM +0000, Pavel Begunkov wrote:
> Yeah, I had trouble putting comments around it, and it's still open.
> 
> For the current cases it can be bound to the kiocb, e.g. "if a bvec iter is
> passed "together" with a kiocb then the vector should stay intact up to
> ->ki_complete()". But that "together" is rather full of holes.

What about: "For bvec based iters the bvec must not be freed until the
I/O has completed.  For asynchronous I/O that means it must be freed
no earlier than from ->ki_complete."


* Re: [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED
  2020-12-09 17:55         ` Christoph Hellwig
@ 2020-12-09 18:24           ` Matthew Wilcox
  2020-12-13 22:09             ` Pavel Begunkov
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2020-12-09 18:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Begunkov, Al Viro, Jens Axboe, linux-fsdevel, io-uring,
	linux-block, linux-kernel

On Wed, Dec 09, 2020 at 05:55:53PM +0000, Christoph Hellwig wrote:
> On Wed, Dec 09, 2020 at 01:37:05PM +0000, Pavel Begunkov wrote:
> > Yeah, I had trouble putting comments around it, and it's still open.
> > 
> > For the current cases it can be bound to the kiocb, e.g. "if a bvec iter is
> > passed "together" with a kiocb then the vector should stay intact up to
> > ->ki_complete()". But that "together" is rather full of holes.
> 
> What about: "For bvec based iters the bvec must not be freed until the
> I/O has completed.  For asynchronous I/O that means it must be freed
> no earlier than from ->ki_complete."

Perhaps for the second sentence "If the I/O is completed asynchronously,
the bvec must not be freed before ->ki_complete() has been called"?
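
Combining the two, the resulting uio.h comment could read (a sketch):

	/*
	 * For bvec based iters the bvec must not be freed until the I/O has
	 * completed. If the I/O is completed asynchronously, the bvec must
	 * not be freed before ->ki_complete() has been called.
	 */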


* RE: [PATCH 2/2] block: no-copy bvec for direct IO
  2020-12-09  2:19 ` [PATCH 2/2] block: no-copy bvec for direct IO Pavel Begunkov
  2020-12-09  8:40   ` Christoph Hellwig
@ 2020-12-09 21:13   ` David Laight
  1 sibling, 0 replies; 26+ messages in thread
From: David Laight @ 2020-12-09 21:13 UTC (permalink / raw)
  To: 'Pavel Begunkov', Jens Axboe
  Cc: Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Matthew Wilcox

From: Pavel Begunkov
> Sent: 09 December 2020 02:20
> 
> The block layer spends quite a while in blkdev_direct_IO() copying and
> initialising the bio's bvec. However, if we've already got a bvec in the
> input iterator, it can be reused in some cases, i.e. when the new
> ITER_BVEC_FLAG_FIXED flag is set. Simple tests show a considerable
> performance boost, and it also reduces the memory footprint.
...
> @@ -398,7 +422,11 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_vecs)
>  		bio->bi_end_io = blkdev_bio_end_io;
>  		bio->bi_ioprio = iocb->ki_ioprio;
> 
> -		ret = bio_iov_iter_get_pages(bio, iter);
> +		if (iov_iter_is_bvec(iter) && iov_iter_bvec_fixed(iter))
> +			ret = bio_iov_fixed_bvec_get_pages(bio, iter);
> +		else
> +			ret = bio_iov_iter_get_pages(bio, iter);
> +

Is it necessary to check iov_iter_is_bvec() as well as iov_iter_bvec_fixed()?
If so, it is probably worth using & not && so the compiler stands
a chance of generating (a & (B | C)) == (B | C) instead of 2 conditionals.
(I think I saw the bits in the same field being tested.)
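
Something like the following, i.e. one masked compare instead of two
branches (a sketch, assuming both bits live in iter->type as in patch 1):

	if ((iter->type & (ITER_BVEC | ITER_BVEC_FLAG_FIXED)) ==
	    (ITER_BVEC | ITER_BVEC_FLAG_FIXED))
		ret = bio_iov_fixed_bvec_get_pages(bio, iter);
	else
		ret = bio_iov_iter_get_pages(bio, iter);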

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: [PATCH 2/2] block: no-copy bvec for direct IO
  2020-12-09  8:40   ` Christoph Hellwig
  2020-12-09 12:01     ` Pavel Begunkov
@ 2020-12-11 14:06     ` Johannes Weiner
  2020-12-11 14:20       ` Pavel Begunkov
  1 sibling, 1 reply; 26+ messages in thread
From: Johannes Weiner @ 2020-12-11 14:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Begunkov, Jens Axboe, Alexander Viro, linux-fsdevel,
	io-uring, linux-block, linux-kernel, Matthew Wilcox

On Wed, Dec 09, 2020 at 08:40:05AM +0000, Christoph Hellwig wrote:
> > +	/*
> > +	 * In practice, groups of pages tend to be accessed/reclaimed/refaulted
> > +	 * together. To avoid iterating over the bvec for callers that didn't
> > +	 * set BIO_WORKINGSET, approximate it by looking at the first page and
> > +	 * applying the flag to the whole bio.
> > +	 */
> > +	if (unlikely(PageWorkingset(iter->bvec->bv_page)))
> > +		bio_set_flag(bio, BIO_WORKINGSET);
> 
> IIRC the feedback was that we do not need to deal with BIO_WORKINGSET
> at all for direct I/O.

Yes, this hunk is incorrect. We must not use this flag for direct IO.
It's only for paging IO, when you bring in the data at page->mapping +
page->index. Otherwise you tell the pressure accounting code that you
are paging in a thrashing page, when really you're just reading new
data into a page frame that happens to be hot.

(As per the other thread, bio_add_page() currently makes that same
mistake for direct IO. I'm fixing that.)


* Re: [PATCH 2/2] block: no-copy bvec for direct IO
  2020-12-11 14:06     ` Johannes Weiner
@ 2020-12-11 14:20       ` Pavel Begunkov
  2020-12-11 15:38         ` Johannes Weiner
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-11 14:20 UTC (permalink / raw)
  To: Johannes Weiner, Christoph Hellwig
  Cc: Jens Axboe, Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Matthew Wilcox

On 11/12/2020 14:06, Johannes Weiner wrote:
> On Wed, Dec 09, 2020 at 08:40:05AM +0000, Christoph Hellwig wrote:
>>> +	/*
>>> +	 * In practice, groups of pages tend to be accessed/reclaimed/refaulted
>>> +	 * together. To avoid iterating over the bvec for callers that didn't
>>> +	 * set BIO_WORKINGSET, approximate it by looking at the first page and
>>> +	 * applying the flag to the whole bio.
>>> +	 */
>>> +	if (unlikely(PageWorkingset(iter->bvec->bv_page)))
>>> +		bio_set_flag(bio, BIO_WORKINGSET);
>>
>> IIRC the feedback was that we do not need to deal with BIO_WORKINGSET
>> at all for direct I/O.
> 
> Yes, this hunk is incorrect. We must not use this flag for direct IO.
> It's only for paging IO, when you bring in the data at page->mapping +
> page->index. Otherwise you tell the pressure accounting code that you
> are paging in a thrashing page, when really you're just reading new
> data into a page frame that happens to be hot.
> 
> (As per the other thread, bio_add_page() currently makes that same
> mistake for direct IO. I'm fixing that.)

I have that stuff fixed, it just didn't go into the RFC. That's basically
removing replacing add_page() with its version without BIO_WORKINGSET
in bio_iov_iter_get_pages() and all __bio_iov_*_{add,get}_pages() +
fix up ./fs/direct-io.c. Should cover all direct cases if I didn't miss
some.

-- 
Pavel Begunkov


* Re: [PATCH 2/2] block: no-copy bvec for direct IO
  2020-12-11 14:20       ` Pavel Begunkov
@ 2020-12-11 15:38         ` Johannes Weiner
  2020-12-11 15:47           ` Pavel Begunkov
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Weiner @ 2020-12-11 15:38 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Christoph Hellwig, Jens Axboe, Alexander Viro, linux-fsdevel,
	io-uring, linux-block, linux-kernel, Matthew Wilcox

On Fri, Dec 11, 2020 at 02:20:11PM +0000, Pavel Begunkov wrote:
> On 11/12/2020 14:06, Johannes Weiner wrote:
> > On Wed, Dec 09, 2020 at 08:40:05AM +0000, Christoph Hellwig wrote:
> >>> +	/*
> >>> +	 * In practice, groups of pages tend to be accessed/reclaimed/refaulted
> >>> +	 * together. To avoid iterating over the bvec for callers that didn't
> >>> +	 * set BIO_WORKINGSET, approximate it by looking at the first page and
> >>> +	 * applying the flag to the whole bio.
> >>> +	 */
> >>> +	if (unlikely(PageWorkingset(iter->bvec->bv_page)))
> >>> +		bio_set_flag(bio, BIO_WORKINGSET);
> >>
> >> IIRC the feedback was that we do not need to deal with BIO_WORKINGSET
> >> at all for direct I/O.
> > 
> > Yes, this hunk is incorrect. We must not use this flag for direct IO.
> > It's only for paging IO, when you bring in the data at page->mapping +
> > page->index. Otherwise you tell the pressure accounting code that you
> > are paging in a thrashing page, when really you're just reading new
> > data into a page frame that happens to be hot.
> > 
> > (As per the other thread, bio_add_page() currently makes that same
> > mistake for direct IO. I'm fixing that.)
> 
> I have that stuff fixed, it just didn't go into the RFC. That's basically
> removing replacing add_page() with its version without BIO_WORKINGSET
> in bio_iov_iter_get_pages() and all __bio_iov_*_{add,get}_pages() +
> fix up ./fs/direct-io.c. Should cover all direct cases if I didn't miss
> some.

Ah, that's fantastic! Thanks for clarifying.


* Re: [PATCH 2/2] block: no-copy bvec for direct IO
  2020-12-11 15:38         ` Johannes Weiner
@ 2020-12-11 15:47           ` Pavel Begunkov
  2020-12-11 16:13             ` Johannes Weiner
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-11 15:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christoph Hellwig, Jens Axboe, Alexander Viro, linux-fsdevel,
	io-uring, linux-block, linux-kernel, Matthew Wilcox

On 11/12/2020 15:38, Johannes Weiner wrote:
> On Fri, Dec 11, 2020 at 02:20:11PM +0000, Pavel Begunkov wrote:
>> On 11/12/2020 14:06, Johannes Weiner wrote:
>>> On Wed, Dec 09, 2020 at 08:40:05AM +0000, Christoph Hellwig wrote:
>>>>> +	/*
>>>>> +	 * In practice, groups of pages tend to be accessed/reclaimed/refaulted
>>>>> +	 * together. To avoid iterating over the bvec for callers that didn't
>>>>> +	 * set BIO_WORKINGSET, approximate it by looking at the first page and
>>>>> +	 * applying the flag to the whole bio.
>>>>> +	 */
>>>>> +	if (unlikely(PageWorkingset(iter->bvec->bv_page)))
>>>>> +		bio_set_flag(bio, BIO_WORKINGSET);
>>>>
>>>> IIRC the feedback was that we do not need to deal with BIO_WORKINGSET
>>>> at all for direct I/O.
>>>
>>> Yes, this hunk is incorrect. We must not use this flag for direct IO.
>>> It's only for paging IO, when you bring in the data at page->mapping +
>>> page->index. Otherwise you tell the pressure accounting code that you
>>> are paging in a thrashing page, when really you're just reading new
>>> data into a page frame that happens to be hot.
>>>
>>> (As per the other thread, bio_add_page() currently makes that same
>>> mistake for direct IO. I'm fixing that.)
>>
>> I have that stuff fixed, it just didn't go into the RFC. That's basically
>> removing replacing add_page() with its version without BIO_WORKINGSET

I wrote something strange... Should have been "replacing add_page() in
those functions with a version without BIO_WORKINGSET".

>> in bio_iov_iter_get_pages() and all __bio_iov_*_{add,get}_pages() +
>> fix up ./fs/direct-io.c. Should cover all direct cases if I didn't miss
>> some.
> 
> Ah, that's fantastic! Thanks for clarifying.

To keep it clear, do we go with what I have stashed (I'm planning to
reiterate this weekend)? Or are you going to write it up yourself?
Just in case there is some cooler way you have in mind :)

-- 
Pavel Begunkov


* Re: [PATCH 2/2] block: no-copy bvec for direct IO
  2020-12-11 15:47           ` Pavel Begunkov
@ 2020-12-11 16:13             ` Johannes Weiner
  0 siblings, 0 replies; 26+ messages in thread
From: Johannes Weiner @ 2020-12-11 16:13 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Christoph Hellwig, Jens Axboe, Alexander Viro, linux-fsdevel,
	io-uring, linux-block, linux-kernel, Matthew Wilcox

On Fri, Dec 11, 2020 at 03:47:23PM +0000, Pavel Begunkov wrote:
> On 11/12/2020 15:38, Johannes Weiner wrote:
> > On Fri, Dec 11, 2020 at 02:20:11PM +0000, Pavel Begunkov wrote:
> >> On 11/12/2020 14:06, Johannes Weiner wrote:
> >>> On Wed, Dec 09, 2020 at 08:40:05AM +0000, Christoph Hellwig wrote:
> >>>>> +	/*
> >>>>> +	 * In practice, groups of pages tend to be accessed/reclaimed/refaulted
> >>>>> +	 * together. To avoid iterating over the bvec for callers that didn't
> >>>>> +	 * set BIO_WORKINGSET, approximate it by looking at the first page and
> >>>>> +	 * applying the flag to the whole bio.
> >>>>> +	 */
> >>>>> +	if (unlikely(PageWorkingset(iter->bvec->bv_page)))
> >>>>> +		bio_set_flag(bio, BIO_WORKINGSET);
> >>>>
> >>>> IIRC the feedback was that we do not need to deal with BIO_WORKINGSET
> >>>> at all for direct I/O.
> >>>
> >>> Yes, this hunk is incorrect. We must not use this flag for direct IO.
> >>> It's only for paging IO, when you bring in the data at page->mapping +
> >>> page->index. Otherwise you tell the pressure accounting code that you
> >>> are paging in a thrashing page, when really you're just reading new
> >>> data into a page frame that happens to be hot.
> >>>
> >>> (As per the other thread, bio_add_page() currently makes that same
> >>> mistake for direct IO. I'm fixing that.)
> >>
> >> I have that stuff fixed, it just didn't go into the RFC. That's basically
> >> removing replacing add_page() with its version without BIO_WORKINGSET
> 
> I wrote something strange... Should have been "replacing add_page() in
> those functions with a version without BIO_WORKINGSET".

No worries, I understood.

> >> in bio_iov_iter_get_pages() and all __bio_iov_*_{add,get}_pages() +
> >> fix up ./fs/direct-io.c. Should cover all direct cases if I didn't miss
> >> some.
> > 
> > Ah, that's fantastic! Thanks for clarifying.
> 
> To keep it clear, do we go with what I have stashed (I'm planning to
> reiterate this weekend)? Or are you going to write it up yourself?
> Just in case there is some cooler way you have in mind :)

Honestly, I only wrote all my ideas down and asked for feedback
because I wasn't super excited about any of them ;-)

If your changes happen to separate the direct io path from the
buffered io path naturally, I'm okay with it.

I'd say let's go with what you already have and see whether Jens and
Christoph like it. We can always do follow-on cleanups.


* Re: [RFC 0/2] nocopy bvec for direct IO
  2020-12-09 16:53 ` Jens Axboe
@ 2020-12-13 22:03   ` Pavel Begunkov
  0 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-13 22:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Alexander Viro, linux-fsdevel, io-uring, linux-block,
	linux-kernel, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner

On 09/12/2020 16:53, Jens Axboe wrote:
> On 12/8/20 7:19 PM, Pavel Begunkov wrote:
>> The idea is to avoid copying, merging, etc. the bvec from the iterator to
>> the bio in direct I/O, and to use the one we've already got. Hook it up for
>> io_uring. I've had an eye on this for a long time, and it was also brought
>> up by Matthew just recently. Let me know if I forgot or misplaced some tags.
>>
>> A benchmark got me 430 KIOPS vs 540 KIOPS, or +25%, on bare metal, and perf
>> shows that bio_iov_iter_get_pages() was taking ~20%. The test is pretty
>> silly, but the numbers are still imposing. I'll redo it closer to reality
>> for the next iteration; I need to double-check some cases anyway.
>>
>> If the same is applied to iomap, the common chunk can be moved from
>> block_dev into bio_iov_iter_get_pages(), but if there is any benefit for
>> filesystems, they should explicitly opt in with ITER_BVEC_FLAG_FIXED.
> 
> Ran this on a real device, and I get a 10% bump in performance with it.
> That's pretty amazing! So please do pursue this one and pull it to
> completion.

I'm curious, what block size did you use?

-- 
Pavel Begunkov


* Re: [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED
  2020-12-09 18:24           ` Matthew Wilcox
@ 2020-12-13 22:09             ` Pavel Begunkov
  0 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2020-12-13 22:09 UTC (permalink / raw)
  To: Matthew Wilcox, Christoph Hellwig
  Cc: Al Viro, Jens Axboe, linux-fsdevel, io-uring, linux-block, linux-kernel

On 09/12/2020 18:24, Matthew Wilcox wrote:
> On Wed, Dec 09, 2020 at 05:55:53PM +0000, Christoph Hellwig wrote:
>> On Wed, Dec 09, 2020 at 01:37:05PM +0000, Pavel Begunkov wrote:
>>> Yeah, I had trouble putting comments around it, and it's still open.
>>>
>>> For the current cases it can be bound to the kiocb, e.g. "if a bvec iter is
>>> passed "together" with a kiocb then the vector should stay intact up to
>>> ->ki_complete()". But that "together" is rather full of holes.
>>
>> What about: "For bvec based iters the bvec must not be freed until the
>> I/O has completed.  For asynchronous I/O that means it must be freed
>> no earlier than from ->ki_complete."
> 
> Perhaps for the second sentence "If the I/O is completed asynchronously,
> the bvec must not be freed before ->ki_complete() has been called"?

Sounds good, I'll use it. Thanks!

-- 
Pavel Begunkov


end of thread

Thread overview: 26+ messages
-- message index for this thread --
2020-12-09  2:19 [RFC 0/2] nocopy bvec for direct IO Pavel Begunkov
2020-12-09  2:19 ` [PATCH 1/2] iov: introduce ITER_BVEC_FLAG_FIXED Pavel Begunkov
2020-12-09  8:36   ` Christoph Hellwig
2020-12-09  9:06     ` Christoph Hellwig
2020-12-09 11:54       ` Pavel Begunkov
2020-12-09 13:07     ` Al Viro
2020-12-09 13:37       ` Pavel Begunkov
2020-12-09 17:55         ` Christoph Hellwig
2020-12-09 18:24           ` Matthew Wilcox
2020-12-13 22:09             ` Pavel Begunkov
2020-12-09  2:19 ` [PATCH 2/2] block: no-copy bvec for direct IO Pavel Begunkov
2020-12-09  8:40   ` Christoph Hellwig
2020-12-09 12:01     ` Pavel Begunkov
2020-12-09 12:05       ` Christoph Hellwig
2020-12-09 12:03         ` Pavel Begunkov
2020-12-11 14:06     ` Johannes Weiner
2020-12-11 14:20       ` Pavel Begunkov
2020-12-11 15:38         ` Johannes Weiner
2020-12-11 15:47           ` Pavel Begunkov
2020-12-11 16:13             ` Johannes Weiner
2020-12-09 21:13   ` David Laight
2020-12-09  6:50 ` [RFC 0/2] nocopy " Christoph Hellwig
2020-12-09 11:54   ` Pavel Begunkov
2020-12-09 16:53 ` Jens Axboe
2020-12-13 22:03   ` Pavel Begunkov
2020-12-09 17:06 ` Al Viro
