linux-fsdevel.vger.kernel.org archive mirror
* [PATCHSET 0/3] io_uring improvements
@ 2019-03-15 14:59 Jens Axboe
  2019-03-15 14:59 ` [PATCH 1/3] iov_iter: add ITER_BVEC_FLAG_NO_REF flag Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Jens Axboe @ 2019-03-15 14:59 UTC
  To: linux-fsdevel, linux-block; +Cc: viro

This is on top of the just posted "[PATCHSET v2 0/7] io_uring fixes"
series.

This re-introduces the ability to NOT have to touch page reference
counts for our fixed buffers. Instead of assuming that skipping them
is always safe with a BVEC iter, we add an explicit flag for it. This
shifts the iter type values a bit, but otherwise shouldn't cause any
functional changes.

The final patch re-adds the "we hit cache" flag. I left this out of
the initial series due to concerns over informing userspace about
it, given all the mincore etc discussions.

 block/bio.c                   | 43 +++++++++++++++++++----------------
 fs/block_dev.c                | 12 ++++++----
 fs/io_uring.c                 |  9 +++++++-
 fs/iomap.c                    | 12 ++++++----
 include/linux/blk_types.h     |  1 +
 include/linux/uio.h           | 24 +++++++++++++++----
 include/uapi/linux/io_uring.h |  5 ++++
 7 files changed, 71 insertions(+), 35 deletions(-)

-- 
Jens Axboe




* [PATCH 1/3] iov_iter: add ITER_BVEC_FLAG_NO_REF flag
  2019-03-15 14:59 [PATCHSET 0/3] io_uring improvements Jens Axboe
@ 2019-03-15 14:59 ` Jens Axboe
  2019-03-15 14:59 ` [PATCH 2/3] block: add BIO_NO_PAGE_REF flag Jens Axboe
  2019-03-15 14:59 ` [PATCH 3/3] io_uring: add io_uring_event cache hit information Jens Axboe
  2 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2019-03-15 14:59 UTC
  To: linux-fsdevel, linux-block; +Cc: viro, Jens Axboe

For ITER_BVEC, if we're holding on to kernel pages, the caller
doesn't need to grab a reference to the bvec pages, and drop that
same reference on IO completion. This is essentially safe for any
ITER_BVEC, but some use cases end up reusing pages and unconditionally
dropping a page reference on completion. An example of that is
sendfile(2), which ends up being a splice_in + splice_out on the
pipe pages.

Add a flag that tells us it's fine to not grab a page reference
to the bvec pages, since the caller knows not to drop a reference
when it's done with the pages.
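
For illustration, a minimal sketch of how an in-kernel caller that owns
its pages for the IO's lifetime would opt in; this mirrors the
io_import_fixed() hunk below, with the bvec array and length as
placeholders:

	struct iov_iter iter;

	/* pages stay pinned by the caller for the IO's lifetime */
	iov_iter_bvec(&iter, WRITE, bvecs, nr_bvecs, len);

	/* tell the block layer not to get/put these pages */
	iter.type |= ITER_BVEC_FLAG_NO_REF;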

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c       |  3 +++
 include/linux/uio.h | 24 +++++++++++++++++++-----
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5be6e4f99a9e..f66a4a5daf35 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -862,6 +862,9 @@ static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
 	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
 	if (offset)
 		iov_iter_advance(iter, offset);
+
+	/* don't drop a reference to these pages */
+	iter->type |= ITER_BVEC_FLAG_NO_REF;
 	return 0;
 }
 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index ecf584f6b82d..4e926641fa80 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -23,14 +23,23 @@ struct kvec {
 };
 
 enum iter_type {
-	ITER_IOVEC = 0,
-	ITER_KVEC = 2,
-	ITER_BVEC = 4,
-	ITER_PIPE = 8,
-	ITER_DISCARD = 16,
+	/* set if ITER_BVEC doesn't hold a bv_page ref */
+	ITER_BVEC_FLAG_NO_REF = 2,
+
+	/* iter types */
+	ITER_IOVEC = 4,
+	ITER_KVEC = 8,
+	ITER_BVEC = 16,
+	ITER_PIPE = 32,
+	ITER_DISCARD = 64,
 };
 
 struct iov_iter {
+	/*
+	 * Bit 0 is the read/write bit, set if we're writing.
+	 * Bit 1 is the BVEC_FLAG_NO_REF bit, set if type is a bvec and
+	 * the caller isn't expecting to drop a page reference when done.
+	 */
 	unsigned int type;
 	size_t iov_offset;
 	size_t count;
@@ -84,6 +93,11 @@ static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 	return i->type & (READ | WRITE);
 }
 
+static inline bool iov_iter_bvec_no_ref(const struct iov_iter *i)
+{
+	return (i->type & ITER_BVEC_FLAG_NO_REF) != 0;
+}
+
 /*
  * Total number of bytes covered by an iovec.
  *
-- 
2.17.1



* [PATCH 2/3] block: add BIO_NO_PAGE_REF flag
  2019-03-15 14:59 [PATCHSET 0/3] io_uring improvements Jens Axboe
  2019-03-15 14:59 ` [PATCH 1/3] iov_iter: add ITER_BVEC_FLAG_NO_REF flag Jens Axboe
@ 2019-03-15 14:59 ` Jens Axboe
  2019-03-15 14:59 ` [PATCH 3/3] io_uring: add io_uring_event cache hit information Jens Axboe
  2 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2019-03-15 14:59 UTC
  To: linux-fsdevel, linux-block; +Cc: viro, Jens Axboe

If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
with NO_REF, then we don't need to grab a reference to the pages that
are added.

Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
not to drop a reference to these pages.
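
For reference, a rough sketch of the completion-side pattern this flag
enables (not part of this patch; the handler name is hypothetical),
mirroring the fs/block_dev.c and fs/iomap.c hunks below:

	static void example_end_io(struct bio *bio)
	{
		/* only drop page refs if this bio actually took them */
		if (!bio_flagged(bio, BIO_NO_PAGE_REF)) {
			struct bvec_iter_all iter_all;
			struct bio_vec *bvec;
			int i;

			bio_for_each_segment_all(bvec, bio, i, iter_all)
				put_page(bvec->bv_page);
		}
		bio_put(bio);
	}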

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c               | 43 ++++++++++++++++++++++-----------------
 fs/block_dev.c            | 12 ++++++-----
 fs/iomap.c                | 12 ++++++-----
 include/linux/blk_types.h |  1 +
 4 files changed, 39 insertions(+), 29 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 71a78d9fb8b7..b64cedc7f87c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -849,20 +849,14 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 	size = bio_add_page(bio, bv->bv_page, len,
 				bv->bv_offset + iter->iov_offset);
 	if (size == len) {
-		struct page *page;
-		int i;
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF)) {
+			struct page *page;
+			int i;
+
+			mp_bvec_for_each_page(page, bv, i)
+				get_page(page);
+		}
 
-		/*
-		 * For the normal O_DIRECT case, we could skip grabbing this
-		 * reference and then not have to put them again when IO
-		 * completes. But this breaks some in-kernel users, like
-		 * splicing to/from a loop device, where we release the pipe
-		 * pages unconditionally. If we can fix that case, we can
-		 * get rid of the get here and the need to call
-		 * bio_release_pages() at IO completion time.
-		 */
-		mp_bvec_for_each_page(page, bv, i)
-			get_page(page);
 		iov_iter_advance(iter, size);
 		return 0;
 	}
@@ -925,10 +919,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
  * This takes either an iterator pointing to user memory, or one pointing to
  * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
  * map them into the kernel. On IO completion, the caller should put those
- * pages. For now, when adding kernel pages, we still grab a reference to the
- * page. This isn't strictly needed for the common case, but some call paths
- * end up releasing pages from eg a pipe and we can't easily control these.
- * See comment in __bio_iov_bvec_add_pages().
+ * pages. If we're adding kernel pages, and the caller told us it's safe to
+ * do so, we just have to add the pages to the bio directly. We don't grab an
+ * extra reference to those pages (the user should already have that), and we
+ * don't put the page on IO completion. The caller needs to check if the bio is
+ * flagged BIO_NO_PAGE_REF on IO completion. If it isn't, then pages should be
+ * released.
  *
  * The function tries, but does not guarantee, to pin as many pages as
  * fit into the bio, or are requested in *iter, whatever is smaller. If
@@ -940,6 +936,13 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	const bool is_bvec = iov_iter_is_bvec(iter);
 	unsigned short orig_vcnt = bio->bi_vcnt;
 
+	/*
+	 * If this is a BVEC iter, then the pages are kernel pages. Don't
+	 * release them on IO completion if the caller asked us not to.
+	 */
+	if (is_bvec && iov_iter_bvec_no_ref(iter))
+		bio_set_flag(bio, BIO_NO_PAGE_REF);
+
 	do {
 		int ret;
 
@@ -1696,7 +1699,8 @@ static void bio_dirty_fn(struct work_struct *work)
 		next = bio->bi_private;
 
 		bio_set_pages_dirty(bio);
-		bio_release_pages(bio);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_release_pages(bio);
 		bio_put(bio);
 	}
 }
@@ -1713,7 +1717,8 @@ void bio_check_pages_dirty(struct bio *bio)
 			goto defer;
 	}
 
-	bio_release_pages(bio);
+	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+		bio_release_pages(bio);
 	bio_put(bio);
 	return;
 defer:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e9faa52bb489..78d3257435c0 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -336,12 +336,14 @@ static void blkdev_bio_end_io(struct bio *bio)
 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
 	} else {
-		struct bio_vec *bvec;
-		int i;
-		struct bvec_iter_all iter_all;
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF)) {
+			struct bvec_iter_all iter_all;
+			struct bio_vec *bvec;
+			int i;
 
-		bio_for_each_segment_all(bvec, bio, i, iter_all)
-			put_page(bvec->bv_page);
+			bio_for_each_segment_all(bvec, bio, i, iter_all)
+				put_page(bvec->bv_page);
+		}
 		bio_put(bio);
 	}
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 97cb9d486a7d..abdd18e404f8 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1589,12 +1589,14 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
 	} else {
-		struct bio_vec *bvec;
-		int i;
-		struct bvec_iter_all iter_all;
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF)) {
+			struct bvec_iter_all iter_all;
+			struct bio_vec *bvec;
+			int i;
 
-		bio_for_each_segment_all(bvec, bio, i, iter_all)
-			put_page(bvec->bv_page);
+			bio_for_each_segment_all(bvec, bio, i, iter_all)
+				put_page(bvec->bv_page);
+		}
 		bio_put(bio);
 	}
 }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d66bf5f32610..791fee35df88 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -215,6 +215,7 @@ struct bio {
 /*
  * bio flags
  */
+#define BIO_NO_PAGE_REF	0	/* don't put/release bvec pages */
 #define BIO_SEG_VALID	1	/* bi_phys_segments valid */
 #define BIO_CLONED	2	/* doesn't own data */
 #define BIO_BOUNCED	3	/* bio is a bounce bio */
-- 
2.17.1



* [PATCH 3/3] io_uring: add io_uring_event cache hit information
  2019-03-15 14:59 [PATCHSET 0/3] io_uring improvements Jens Axboe
  2019-03-15 14:59 ` [PATCH 1/3] iov_iter: add ITER_BVEC_FLAG_NO_REF flag Jens Axboe
  2019-03-15 14:59 ` [PATCH 2/3] block: add BIO_NO_PAGE_REF flag Jens Axboe
@ 2019-03-15 14:59 ` Jens Axboe
  2019-03-15 16:27   ` Jens Axboe
  2 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2019-03-15 14:59 UTC
  To: linux-fsdevel, linux-block; +Cc: viro, Jens Axboe

Add a hint on whether a read was served out of the page cache, or if it
hit media. This is useful for buffered async IO; O_DIRECT reads would
never have this set (for obvious reasons).

If the read hit the page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT
set.
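
A rough sketch of how an application might consume this, assuming
liburing's completion helpers (the ring setup and the cache_hits
counter are placeholders):

	struct io_uring_cqe *cqe;

	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		if (cqe->res > 0 && (cqe->flags & IOCQE_FLAG_CACHEHIT))
			cache_hits++;	/* read was served from page cache */
		io_uring_cqe_seen(&ring, cqe);
	}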

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 6 +++++-
 include/uapi/linux/io_uring.h | 5 +++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f66a4a5daf35..513df722702e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -635,10 +635,14 @@ static void kiocb_end_write(struct kiocb *kiocb)
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+	unsigned ev_flags = 0;
 
 	kiocb_end_write(kiocb);
 
-	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+	if (res > 0 && (req->flags & REQ_F_FORCE_NONBLOCK))
+		ev_flags = IOCQE_FLAG_CACHEHIT;
+
+	io_cqring_add_event(req->ctx, req->user_data, res, ev_flags);
 	io_put_req(req);
 }
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e23408692118..24906e99fdc7 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -69,6 +69,11 @@ struct io_uring_cqe {
 	__u32	flags;
 };
 
+/*
+ * io_uring_event->flags
+ */
+#define IOCQE_FLAG_CACHEHIT	(1U << 0)	/* IO did not hit media */
+
 /*
  * Magic offsets for the application to mmap the data it needs
  */
-- 
2.17.1



* Re: [PATCH 3/3] io_uring: add io_uring_event cache hit information
  2019-03-15 14:59 ` [PATCH 3/3] io_uring: add io_uring_event cache hit information Jens Axboe
@ 2019-03-15 16:27   ` Jens Axboe
  2019-03-16  1:34     ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2019-03-15 16:27 UTC
  To: linux-fsdevel, linux-block; +Cc: viro, Linus Torvalds

On 3/15/19 8:59 AM, Jens Axboe wrote:
> Add a hint on whether a read was served out of the page cache, or if it
> hit media. This is useful for buffered async IO; O_DIRECT reads would
> never have this set (for obvious reasons).
> 
> If the read hit the page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT
> set.

Linus, curious on your opinion on this one. I had this as part of the
original series, but removed it from the pull request due to the
mincore etc discussions.

Leaving the patch intact for you. This is part of a small series of
improvements, which sit on top of a series of fixes for this release.

> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index f66a4a5daf35..513df722702e 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -635,10 +635,14 @@ static void kiocb_end_write(struct kiocb *kiocb)
>  static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
>  {
>  	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
> +	unsigned ev_flags = 0;
>  
>  	kiocb_end_write(kiocb);
>  
> -	io_cqring_add_event(req->ctx, req->user_data, res, 0);
> +	if (res > 0 && (req->flags & REQ_F_FORCE_NONBLOCK))
> +		ev_flags = IOCQE_FLAG_CACHEHIT;
> +
> +	io_cqring_add_event(req->ctx, req->user_data, res, ev_flags);
>  	io_put_req(req);
>  }
>  
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index e23408692118..24906e99fdc7 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -69,6 +69,11 @@ struct io_uring_cqe {
>  	__u32	flags;
>  };
>  
> +/*
> + * io_uring_event->flags
> + */
> +#define IOCQE_FLAG_CACHEHIT	(1U << 0)	/* IO did not hit media */
> +
>  /*
>   * Magic offsets for the application to mmap the data it needs
>   */
> 


-- 
Jens Axboe



* Re: [PATCH 3/3] io_uring: add io_uring_event cache hit information
  2019-03-15 16:27   ` Jens Axboe
@ 2019-03-16  1:34     ` Linus Torvalds
  2019-03-16 16:27       ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2019-03-16  1:34 UTC
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, Al Viro

On Fri, Mar 15, 2019 at 9:27 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Linus, curious on your opinion on this one. I had this as part of the
> original series, but removed it from the pull request due to the
> mincore etc discussions.

I'd rather not have new ways to leak cache information, and in fact
IOCB_NOWAIT is already a bit questionable for that. But afaik, the
non-O_DIRECT paths don't even support it to begin with for some
filesystems.

Wasn't the whole point of this io_uring that we'd move *away* from
direct block access and O_DIRECT?

I'm seeing a lot of stuff that looks like just "more of the same old
O_DIRECT garbage" that seems to be all about thinking that IO doesn't
cache.

Haven't we gotten over that already? It was one of the arguments you
used for io_uring to begin with.

                      Linus


* Re: [PATCH 3/3] io_uring: add io_uring_event cache hit information
  2019-03-16  1:34     ` Linus Torvalds
@ 2019-03-16 16:27       ` Jens Axboe
  0 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2019-03-16 16:27 UTC
  To: Linus Torvalds; +Cc: linux-fsdevel, linux-block, Al Viro

On 3/15/19 7:34 PM, Linus Torvalds wrote:
> On Fri, Mar 15, 2019 at 9:27 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Linus, curious on your opinion on this one. I had this as part of the
>> original series, but removed it from the pull request due to the
>> mincore etc discussions.
> 
> I'd rather not have new ways to leak cache information, and in fact
> IOCB_NOWAIT is already a bit questionable for that. But afaik, the
> non-O_DIRECT paths don't even support it to begin with for some
> filesystems.

For the most popular filesystems it works fine, though. For buffered
be nice to have it be fully reliable, so we can ensure we punt when we
have to. Currently for writes, we just ignore it and punt to async,
since it doesn't work at all there. The latter should be fixable.

On the hint, I hear ya, that's what I figured. I'll drop the patch for
now.

> Wasn't the whole point of this io_uring that we'd move *away* from
> direct block access and O_DIRECT?
> 
> I'm seeing a lot of stuff that looks like just "more of the same old
> O_DIRECT garbage" that seems to be all about thinking that IO doesn't
> cache.
> 
> Haven't we gotten over that already? It was one of the arguments you
> used for io_uring to begin with.

There are still folks that use it, even if O_DIRECT is nasty. If you
want to get peak performance, there's no way around it. That doesn't
mean that io_uring is in any way centered around O_DIRECT at all, it
isn't. At its core, it's "just" an efficient delivery mechanism for
submissions and completions.

Not sure what "stuff" you are referring to, the patches in this short
series? They're not centered around block devices at all, and work just
fine for regular files. The bvec iterator is what io_uring uses for
fixed buffers; it's not exclusive to block devices.

-- 
Jens Axboe



