All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET v6 0/8] Improve async iomap DIO performance
@ 2023-07-24 22:55 Jens Axboe
  2023-07-24 22:55 ` [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io() Jens Axboe
                   ` (9 more replies)
  0 siblings, 10 replies; 18+ messages in thread
From: Jens Axboe @ 2023-07-24 22:55 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, djwong

Hi,

Hi,

This patchset improves async iomap DIO performance, for XFS and ext4.
For full details on this patchset, see the v4 posting:

https://lore.kernel.org/io-uring/20230720181310.71589-1-axboe@kernel.dk/

 fs/iomap/direct-io.c | 163 ++++++++++++++++++++++++++++++++-----------
 include/linux/fs.h   |  35 +++++++++-
 io_uring/rw.c        |  26 ++++++-
 3 files changed, 179 insertions(+), 45 deletions(-)

Can also be found here:

https://git.kernel.dk/cgit/linux/log/?h=xfs-async-dio.6

No change in performance since last time, and passes my testing without
complaints.

Changes in v6:
- Drop the polled patch, it's not needed anymore
- Change the "inline is safe" logic based on Dave's suggestions
- Gate HIPRI on INLINE_COMP|CALLER_COMP, so polled IO follows the
  same rules as inline/deferred completions.
- INLINE_COMP is purely for reads, writes can user CALLER_COMP to
  avoid a workqueue punt. This is necessary as we need to invalidate
  pages on write completions, and if we race with a buffered reader
  or writer on the file.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io()
  2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
@ 2023-07-24 22:55 ` Jens Axboe
  2023-07-24 22:55 ` [PATCH 2/8] iomap: use an unsigned type for IOMAP_DIO_* defines Jens Axboe
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2023-07-24 22:55 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, djwong, Jens Axboe

Make the logic a bit easier to follow:

1) Add a release_bio out path, as everybody needs to touch that, and
   have our bio ref check jump there if it's non-zero.
2) Add a kiocb local variable.
3) Add comments for each of the three conditions (sync, inline, or
   async workqueue punt).

No functional changes in this patch.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/direct-io.c | 46 +++++++++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ea3b868c8355..0ce60e80c901 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -152,27 +152,43 @@ void iomap_dio_bio_end_io(struct bio *bio)
 {
 	struct iomap_dio *dio = bio->bi_private;
 	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
+	struct kiocb *iocb = dio->iocb;
 
 	if (bio->bi_status)
 		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
+	if (!atomic_dec_and_test(&dio->ref))
+		goto release_bio;
 
-	if (atomic_dec_and_test(&dio->ref)) {
-		if (dio->wait_for_completion) {
-			struct task_struct *waiter = dio->submit.waiter;
-			WRITE_ONCE(dio->submit.waiter, NULL);
-			blk_wake_io_task(waiter);
-		} else if (dio->flags & IOMAP_DIO_WRITE) {
-			struct inode *inode = file_inode(dio->iocb->ki_filp);
-
-			WRITE_ONCE(dio->iocb->private, NULL);
-			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
-			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
-		} else {
-			WRITE_ONCE(dio->iocb->private, NULL);
-			iomap_dio_complete_work(&dio->aio.work);
-		}
+	/*
+	 * Synchronous dio, task itself will handle any completion work
+	 * that needs after IO. All we need to do is wake the task.
+	 */
+	if (dio->wait_for_completion) {
+		struct task_struct *waiter = dio->submit.waiter;
+
+		WRITE_ONCE(dio->submit.waiter, NULL);
+		blk_wake_io_task(waiter);
+		goto release_bio;
+	}
+
+	/* Read completion can always complete inline. */
+	if (!(dio->flags & IOMAP_DIO_WRITE)) {
+		WRITE_ONCE(iocb->private, NULL);
+		iomap_dio_complete_work(&dio->aio.work);
+		goto release_bio;
 	}
 
+	/*
+	 * Async DIO completion that requires filesystem level completion work
+	 * gets punted to a work queue to complete as the operation may require
+	 * more IO to be issued to finalise filesystem metadata changes or
+	 * guarantee data integrity.
+	 */
+	WRITE_ONCE(iocb->private, NULL);
+	INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
+	queue_work(file_inode(iocb->ki_filp)->i_sb->s_dio_done_wq,
+			&dio->aio.work);
+release_bio:
 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
 	} else {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/8] iomap: use an unsigned type for IOMAP_DIO_* defines
  2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
  2023-07-24 22:55 ` [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io() Jens Axboe
@ 2023-07-24 22:55 ` Jens Axboe
  2023-07-24 22:55 ` [PATCH 3/8] iomap: treat a write through cache the same as FUA Jens Axboe
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2023-07-24 22:55 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, djwong, Jens Axboe

IOMAP_DIO_DIRTY shifts by 31 bits, which makes UBSAN unhappy. Clean up
all the defines by making the shifted value an unsigned value.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/direct-io.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 0ce60e80c901..7d627d43d10b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -20,10 +20,10 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
-#define IOMAP_DIO_WRITE_FUA	(1 << 28)
-#define IOMAP_DIO_NEED_SYNC	(1 << 29)
-#define IOMAP_DIO_WRITE		(1 << 30)
-#define IOMAP_DIO_DIRTY		(1 << 31)
+#define IOMAP_DIO_WRITE_FUA	(1U << 28)
+#define IOMAP_DIO_NEED_SYNC	(1U << 29)
+#define IOMAP_DIO_WRITE		(1U << 30)
+#define IOMAP_DIO_DIRTY		(1U << 31)
 
 struct iomap_dio {
 	struct kiocb		*iocb;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/8] iomap: treat a write through cache the same as FUA
  2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
  2023-07-24 22:55 ` [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io() Jens Axboe
  2023-07-24 22:55 ` [PATCH 2/8] iomap: use an unsigned type for IOMAP_DIO_* defines Jens Axboe
@ 2023-07-24 22:55 ` Jens Axboe
  2023-07-24 22:55 ` [PATCH 4/8] iomap: only set iocb->private for polled bio Jens Axboe
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2023-07-24 22:55 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, djwong, Jens Axboe

Whether we have a write back cache and are using FUA or don't have
a write back cache at all is the same situation. Treat them the same.

This makes the IOMAP_DIO_WRITE_FUA name a bit misleading, as we have
two cases that provide stable writes:

1) Volatile write cache with FUA writes
2) Normal write without a volatile write cache

Rename that flag to IOMAP_DIO_STABLE_WRITE to make that clearer, and
update some of the FUA comments as well.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/direct-io.c | 34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 7d627d43d10b..6b690fc22365 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -20,7 +20,7 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
-#define IOMAP_DIO_WRITE_FUA	(1U << 28)
+#define IOMAP_DIO_WRITE_THROUGH	(1U << 28)
 #define IOMAP_DIO_NEED_SYNC	(1U << 29)
 #define IOMAP_DIO_WRITE		(1U << 30)
 #define IOMAP_DIO_DIRTY		(1U << 31)
@@ -219,7 +219,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 /*
  * Figure out the bio's operation flags from the dio request, the
  * mapping, and whether or not we want FUA.  Note that we can end up
- * clearing the WRITE_FUA flag in the dio request.
+ * clearing the WRITE_THROUGH flag in the dio request.
  */
 static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 		const struct iomap *iomap, bool use_fua)
@@ -233,7 +233,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 	if (use_fua)
 		opflags |= REQ_FUA;
 	else
-		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
+		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
 
 	return opflags;
 }
@@ -273,11 +273,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		 * Use a FUA write if we need datasync semantics, this is a pure
 		 * data IO that doesn't require any metadata updates (including
 		 * after IO completion such as unwritten extent conversion) and
-		 * the underlying device supports FUA. This allows us to avoid
-		 * cache flushes on IO completion.
+		 * the underlying device either supports FUA or doesn't have
+		 * a volatile write cache. This allows us to avoid cache flushes
+		 * on IO completion.
 		 */
 		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
-		    (dio->flags & IOMAP_DIO_WRITE_FUA) && bdev_fua(iomap->bdev))
+		    (dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
+		    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
 			use_fua = true;
 	}
 
@@ -553,13 +555,16 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			dio->flags |= IOMAP_DIO_NEED_SYNC;
 
 		       /*
-			* For datasync only writes, we optimistically try
-			* using FUA for this IO.  Any non-FUA write that
-			* occurs will clear this flag, hence we know before
-			* completion whether a cache flush is necessary.
+			* For datasync only writes, we optimistically try using
+			* WRITE_THROUGH for this IO. This flag requires either
+			* FUA writes through the device's write cache, or a
+			* normal write to a device without a volatile write
+			* cache. For the former, Any non-FUA write that occurs
+			* will clear this flag, hence we know before completion
+			* whether a cache flush is necessary.
 			*/
 			if (!(iocb->ki_flags & IOCB_SYNC))
-				dio->flags |= IOMAP_DIO_WRITE_FUA;
+				dio->flags |= IOMAP_DIO_WRITE_THROUGH;
 		}
 
 		/*
@@ -621,10 +626,11 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		iomap_dio_set_error(dio, ret);
 
 	/*
-	 * If all the writes we issued were FUA, we don't need to flush the
-	 * cache on IO completion. Clear the sync flag for this case.
+	 * If all the writes we issued were already written through to the
+	 * media, we don't need to flush the cache on IO completion. Clear the
+	 * sync flag for this case.
 	 */
-	if (dio->flags & IOMAP_DIO_WRITE_FUA)
+	if (dio->flags & IOMAP_DIO_WRITE_THROUGH)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
 	WRITE_ONCE(iocb->private, dio->submit.poll_bio);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/8] iomap: only set iocb->private for polled bio
  2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (2 preceding siblings ...)
  2023-07-24 22:55 ` [PATCH 3/8] iomap: treat a write through cache the same as FUA Jens Axboe
@ 2023-07-24 22:55 ` Jens Axboe
  2023-07-24 22:55 ` [PATCH 5/8] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2023-07-24 22:55 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, djwong, Jens Axboe

iocb->private is only used for polled IO, where the completer will
find the bio to poll through that field.

Assign it when we're submitting a polled bio, and get rid of the
dio->poll_bio indirection.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/direct-io.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 6b690fc22365..e4b9d9123b75 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -41,7 +41,6 @@ struct iomap_dio {
 		struct {
 			struct iov_iter		*iter;
 			struct task_struct	*waiter;
-			struct bio		*poll_bio;
 		} submit;
 
 		/* used for aio completion: */
@@ -63,12 +62,14 @@ static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
 static void iomap_dio_submit_bio(const struct iomap_iter *iter,
 		struct iomap_dio *dio, struct bio *bio, loff_t pos)
 {
+	struct kiocb *iocb = dio->iocb;
+
 	atomic_inc(&dio->ref);
 
 	/* Sync dio can't be polled reliably */
-	if ((dio->iocb->ki_flags & IOCB_HIPRI) && !is_sync_kiocb(dio->iocb)) {
-		bio_set_polled(bio, dio->iocb);
-		dio->submit.poll_bio = bio;
+	if ((iocb->ki_flags & IOCB_HIPRI) && !is_sync_kiocb(iocb)) {
+		bio_set_polled(bio, iocb);
+		WRITE_ONCE(iocb->private, bio);
 	}
 
 	if (dio->dops && dio->dops->submit_io)
@@ -184,7 +185,6 @@ void iomap_dio_bio_end_io(struct bio *bio)
 	 * more IO to be issued to finalise filesystem metadata changes or
 	 * guarantee data integrity.
 	 */
-	WRITE_ONCE(iocb->private, NULL);
 	INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
 	queue_work(file_inode(iocb->ki_filp)->i_sb->s_dio_done_wq,
 			&dio->aio.work);
@@ -523,7 +523,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	dio->submit.iter = iter;
 	dio->submit.waiter = current;
-	dio->submit.poll_bio = NULL;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iomi.flags |= IOMAP_NOWAIT;
@@ -633,8 +632,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_THROUGH)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
-	WRITE_ONCE(iocb->private, dio->submit.poll_bio);
-
 	/*
 	 * We are about to drop our additional submission reference, which
 	 * might be the last reference to the dio.  There are three different
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/8] iomap: add IOMAP_DIO_INLINE_COMP
  2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (3 preceding siblings ...)
  2023-07-24 22:55 ` [PATCH 4/8] iomap: only set iocb->private for polled bio Jens Axboe
@ 2023-07-24 22:55 ` Jens Axboe
  2023-07-24 22:55 ` [PATCH 6/8] fs: add IOCB flags related to passing back dio completions Jens Axboe
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2023-07-24 22:55 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, djwong, Jens Axboe

Rather than gate whether or not we need to punt a dio completion to a
workqueue on whether the IO is a write or not, add an explicit flag for
it. For now we treat them the same, reads always set the flags and async
writes do not.

No functional changes in this patch.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/direct-io.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index e4b9d9123b75..b943bc5c7b18 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -20,6 +20,7 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_INLINE_COMP	(1U << 27)
 #define IOMAP_DIO_WRITE_THROUGH	(1U << 28)
 #define IOMAP_DIO_NEED_SYNC	(1U << 29)
 #define IOMAP_DIO_WRITE		(1U << 30)
@@ -172,8 +173,10 @@ void iomap_dio_bio_end_io(struct bio *bio)
 		goto release_bio;
 	}
 
-	/* Read completion can always complete inline. */
-	if (!(dio->flags & IOMAP_DIO_WRITE)) {
+	/*
+	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
+	 */
+	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
 		WRITE_ONCE(iocb->private, NULL);
 		iomap_dio_complete_work(&dio->aio.work);
 		goto release_bio;
@@ -528,6 +531,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		iomi.flags |= IOMAP_NOWAIT;
 
 	if (iov_iter_rw(iter) == READ) {
+		/* reads can always complete inline */
+		dio->flags |= IOMAP_DIO_INLINE_COMP;
+
 		if (iomi.pos >= dio->i_size)
 			goto out_free_dio;
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 6/8] fs: add IOCB flags related to passing back dio completions
  2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (4 preceding siblings ...)
  2023-07-24 22:55 ` [PATCH 5/8] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
@ 2023-07-24 22:55 ` Jens Axboe
  2023-07-24 22:55 ` [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_CALLER_COMP Jens Axboe
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2023-07-24 22:55 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, djwong, Jens Axboe

Async dio completions generally happen from hard/soft IRQ context, which
means that users like iomap may need to defer some of the completion
handling to a workqueue. This is less efficient than having the original
issuer handle it, like we do for sync IO, and it adds latency to the
completions.

Add IOCB_DIO_CALLER_COMP, which the issuer can set if it is able to
safely punt these completions to a safe context. If the dio handler is
aware of this flag, assign a callback handler in kiocb->dio_complete and
associated data io kiocb->private. The issuer will then call this
handler with that data from task context.

No functional changes in this patch.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6867512907d6..1e6dbe309d52 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -338,6 +338,20 @@ enum rw_hint {
 #define IOCB_NOIO		(1 << 20)
 /* can use bio alloc cache */
 #define IOCB_ALLOC_CACHE	(1 << 21)
+/*
+ * IOCB_DIO_CALLER_COMP can be set by the iocb owner, to indicate that the
+ * iocb completion can be passed back to the owner for execution from a safe
+ * context rather than needing to be punted through a workqueue. If this
+ * flag is set, the bio completion handling may set iocb->dio_complete to a
+ * handler function and iocb->private to context information for that handler.
+ * The issuer should call the handler with that context information from task
+ * context to complete the processing of the iocb. Note that while this
+ * provides a task context for the dio_complete() callback, it should only be
+ * used on the completion side for non-IO generating completions. It's fine to
+ * call blocking functions from this callback, but they should not wait for
+ * unrelated IO (like cache flushing, new IO generation, etc).
+ */
+#define IOCB_DIO_CALLER_COMP	(1 << 22)
 
 /* for use in trace events */
 #define TRACE_IOCB_STRINGS \
@@ -351,7 +365,8 @@ enum rw_hint {
 	{ IOCB_WRITE,		"WRITE" }, \
 	{ IOCB_WAITQ,		"WAITQ" }, \
 	{ IOCB_NOIO,		"NOIO" }, \
-	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }
+	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }, \
+	{ IOCB_DIO_CALLER_COMP,	"CALLER_COMP" }
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -360,7 +375,23 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
-	struct wait_page_queue	*ki_waitq; /* for async buffered IO */
+	union {
+		/*
+		 * Only used for async buffered reads, where it denotes the
+		 * page waitqueue associated with completing the read. Valid
+		 * IFF IOCB_WAITQ is set.
+		 */
+		struct wait_page_queue	*ki_waitq;
+		/*
+		 * Can be used for O_DIRECT IO, where the completion handling
+		 * is punted back to the issuer of the IO. May only be set
+		 * if IOCB_DIO_CALLER_COMP is set by the issuer, and the issuer
+		 * must then check for presence of this handler when ki_complete
+		 * is invoked. The data passed in to this handler must be
+		 * assigned to ->private when dio_complete is assigned.
+		 */
+		ssize_t (*dio_complete)(void *data);
+	};
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_CALLER_COMP
  2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (5 preceding siblings ...)
  2023-07-24 22:55 ` [PATCH 6/8] fs: add IOCB flags related to passing back dio completions Jens Axboe
@ 2023-07-24 22:55 ` Jens Axboe
  2023-07-24 22:55 ` [PATCH 8/8] iomap: support IOCB_DIO_CALLER_COMP Jens Axboe
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2023-07-24 22:55 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, djwong, Jens Axboe

If the filesystem dio handler understands IOCB_DIO_CALLER_COMP, we'll
get a kiocb->ki_complete() callback with kiocb->dio_complete set. In
that case, rather than complete the IO directly through task_work, queue
up an intermediate task_work handler that first processes this callback
and then immediately completes the request.

For XFS, this avoids a punt through a workqueue, which is a lot less
efficient and adds latency to lower queue depth (or sync) O_DIRECT
writes.

Only do this for non-polled IO, as polled IO doesn't need this kind
of deferral as it always completes within the task itself. This then
avoids a check for deferral in the polled IO completion handler.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/rw.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 1bce2208b65c..f19f65b3f0ee 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -105,6 +105,7 @@ int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	} else {
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
+	rw->kiocb.dio_complete = NULL;
 
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
@@ -285,6 +286,14 @@ static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
 
 void io_req_rw_complete(struct io_kiocb *req, struct io_tw_state *ts)
 {
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+
+	if (rw->kiocb.dio_complete) {
+		long res = rw->kiocb.dio_complete(rw->kiocb.private);
+
+		io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+	}
+
 	io_req_io_end(req);
 
 	if (req->flags & (REQ_F_BUFFER_SELECTED|REQ_F_BUFFER_RING)) {
@@ -300,9 +309,11 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
 	struct io_rw *rw = container_of(kiocb, struct io_rw, kiocb);
 	struct io_kiocb *req = cmd_to_io_kiocb(rw);
 
-	if (__io_complete_rw_common(req, res))
-		return;
-	io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+	if (!rw->kiocb.dio_complete) {
+		if (__io_complete_rw_common(req, res))
+			return;
+		io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+	}
 	req->io_task_work.func = io_req_rw_complete;
 	__io_req_task_work_add(req, IOU_F_TWQ_LAZY_WAKE);
 }
@@ -916,6 +927,15 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	}
 	kiocb->ki_flags |= IOCB_WRITE;
 
+	/*
+	 * For non-polled IO, set IOCB_DIO_CALLER_COMP, stating that our handler
+	 * groks deferring the completion to task context. This isn't
+	 * necessary and useful for polled IO as that can always complete
+	 * directly.
+	 */
+	if (!(kiocb->ki_flags & IOCB_HIPRI))
+		kiocb->ki_flags |= IOCB_DIO_CALLER_COMP;
+
 	if (likely(req->file->f_op->write_iter))
 		ret2 = call_write_iter(req->file, kiocb, &s->iter);
 	else if (req->file->f_op->write)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 8/8] iomap: support IOCB_DIO_CALLER_COMP
  2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (6 preceding siblings ...)
  2023-07-24 22:55 ` [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_CALLER_COMP Jens Axboe
@ 2023-07-24 22:55 ` Jens Axboe
  2023-07-26 23:07 ` [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
  2023-08-01 22:13 ` Dave Chinner
  9 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2023-07-24 22:55 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, djwong, Jens Axboe

If IOCB_DIO_CALLER_COMP is set, utilize that to set kiocb->dio_complete
handler and data for that callback. Rather than punt the completion to a
workqueue, we pass back the handler and data to the issuer and will get
a callback from a safe task context.

Using the following fio job to randomly dio write 4k blocks at
queue depths of 1..16:

fio --name=dio-write --filename=/data1/file --time_based=1 \
--runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
--cpus_allowed=4 --ioengine=io_uring --iodepth=$depth

shows the following results before and after this patch:

	Stock	Patched		Diff
=======================================
QD1	155K	162K		+ 4.5%
QD2	290K	313K		+ 7.9%
QD4	533K	597K		+12.0%
QD8	604K	827K		+36.9%
QD16	615K	845K		+37.4%

which shows nice wins all around. If we factored in per-IOP efficiency,
the wins look even nicer. This becomes apparent as queue depth rises,
as the offloaded workqueue completions runs out of steam.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/direct-io.c | 62 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 60 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b943bc5c7b18..bcd3f8cf5ea4 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -20,6 +20,7 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_CALLER_COMP	(1U << 26)
 #define IOMAP_DIO_INLINE_COMP	(1U << 27)
 #define IOMAP_DIO_WRITE_THROUGH	(1U << 28)
 #define IOMAP_DIO_NEED_SYNC	(1U << 29)
@@ -132,6 +133,11 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 }
 EXPORT_SYMBOL_GPL(iomap_dio_complete);
 
+static ssize_t iomap_dio_deferred_complete(void *data)
+{
+	return iomap_dio_complete(data);
+}
+
 static void iomap_dio_complete_work(struct work_struct *work)
 {
 	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
@@ -182,6 +188,31 @@ void iomap_dio_bio_end_io(struct bio *bio)
 		goto release_bio;
 	}
 
+	/*
+	 * If this dio is flagged with IOMAP_DIO_CALLER_COMP, then schedule
+	 * our completion that way to avoid an async punt to a workqueue.
+	 */
+	if (dio->flags & IOMAP_DIO_CALLER_COMP) {
+		/* only polled IO cares about private cleared */
+		iocb->private = dio;
+		iocb->dio_complete = iomap_dio_deferred_complete;
+
+		/*
+		 * Invoke ->ki_complete() directly. We've assigned our
+		 * dio_complete callback handler, and since the issuer set
+		 * IOCB_DIO_CALLER_COMP, we know their ki_complete handler will
+		 * notice ->dio_complete being set and will defer calling that
+		 * handler until it can be done from a safe task context.
+		 *
+		 * Note that the 'res' being passed in here is not important
+		 * for this case. The actual completion value of the request
+		 * will be gotten from dio_complete when that is run by the
+		 * issuer.
+		 */
+		iocb->ki_complete(iocb, 0);
+		goto release_bio;
+	}
+
 	/*
 	 * Async DIO completion that requires filesystem level completion work
 	 * gets punted to a work queue to complete as the operation may require
@@ -278,12 +309,17 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		 * after IO completion such as unwritten extent conversion) and
 		 * the underlying device either supports FUA or doesn't have
 		 * a volatile write cache. This allows us to avoid cache flushes
-		 * on IO completion.
+		 * on IO completion. If we can't use writethrough and need to
+		 * sync, disable in-task completions as dio completion will
+		 * need to call generic_write_sync() which will do a blocking
+		 * fsync / cache flush call.
 		 */
 		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
 		    (dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
 		    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
 			use_fua = true;
+		else if (dio->flags & IOMAP_DIO_NEED_SYNC)
+			dio->flags &= ~IOMAP_DIO_CALLER_COMP;
 	}
 
 	/*
@@ -298,10 +334,23 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		goto out;
 
 	/*
-	 * We can only poll for single bio I/Os.
+	 * We can only do deferred completion for pure overwrites that
+	 * don't require additional IO at completion. This rules out
+	 * writes that need zeroing or extent conversion, extend
+	 * the file size, or issue journal IO or cache flushes
+	 * during completion processing.
 	 */
 	if (need_zeroout ||
+	    ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
 	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
+		dio->flags &= ~IOMAP_DIO_CALLER_COMP;
+
+	/*
+	 * The rules for polled IO completions follow the guidelines as the
+	 * ones we set for inline and deferred completions. If none of those
+	 * are available for this IO, clear the polled flag.
+	 */
+	if (!(dio->flags & (IOMAP_DIO_INLINE_COMP|IOMAP_DIO_CALLER_COMP)))
 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
 
 	if (need_zeroout) {
@@ -547,6 +596,15 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		iomi.flags |= IOMAP_WRITE;
 		dio->flags |= IOMAP_DIO_WRITE;
 
+		/*
+		 * Flag as supporting deferred completions, if the issuer
+		 * groks it. This can avoid a workqueue punt for writes.
+		 * We may later clear this flag if we need to do other IO
+		 * as part of this IO completion.
+		 */
+		if (iocb->ki_flags & IOCB_DIO_CALLER_COMP)
+			dio->flags |= IOMAP_DIO_CALLER_COMP;
+
 		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
 			ret = -EAGAIN;
 			if (iomi.pos >= dio->i_size ||
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCHSET v6 0/8] Improve async iomap DIO performance
  2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (7 preceding siblings ...)
  2023-07-24 22:55 ` [PATCH 8/8] iomap: support IOCB_DIO_CALLER_COMP Jens Axboe
@ 2023-07-26 23:07 ` Jens Axboe
  2023-07-28 21:42   ` Dave Chinner
  2023-08-01 22:13 ` Dave Chinner
  9 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2023-07-26 23:07 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, djwong

On 7/24/23 4:55?PM, Jens Axboe wrote:
> Hi,
> 
> Hi,
> 
> This patchset improves async iomap DIO performance, for XFS and ext4.
> For full details on this patchset, see the v4 posting:
> 
> https://lore.kernel.org/io-uring/20230720181310.71589-1-axboe@kernel.dk/
> 
>  fs/iomap/direct-io.c | 163 ++++++++++++++++++++++++++++++++-----------
>  include/linux/fs.h   |  35 +++++++++-
>  io_uring/rw.c        |  26 ++++++-
>  3 files changed, 179 insertions(+), 45 deletions(-)
> 
> Can also be found here:
> 
> https://git.kernel.dk/cgit/linux/log/?h=xfs-async-dio.6
> 
> No change in performance since last time, and passes my testing without
> complaints.
> 
> Changes in v6:
> - Drop the polled patch, it's not needed anymore
> - Change the "inline is safe" logic based on Dave's suggestions
> - Gate HIPRI on INLINE_COMP|CALLER_COMP, so polled IO follows the
>   same rules as inline/deferred completions.
> - INLINE_COMP is purely for reads, writes can user CALLER_COMP to
>   avoid a workqueue punt. This is necessary as we need to invalidate
>   pages on write completions, and if we race with a buffered reader
>   or writer on the file.

Dave, are you happy with this one?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHSET v6 0/8] Improve async iomap DIO performance
  2023-07-26 23:07 ` [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
@ 2023-07-28 21:42   ` Dave Chinner
  2023-08-01  0:20     ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2023-07-28 21:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, djwong

On Wed, Jul 26, 2023 at 05:07:57PM -0600, Jens Axboe wrote:
> On 7/24/23 4:55?PM, Jens Axboe wrote:
> > Hi,
> > 
> > Hi,
> > 
> > This patchset improves async iomap DIO performance, for XFS and ext4.
> > For full details on this patchset, see the v4 posting:
> > 
> > https://lore.kernel.org/io-uring/20230720181310.71589-1-axboe@kernel.dk/
> > 
> >  fs/iomap/direct-io.c | 163 ++++++++++++++++++++++++++++++++-----------
> >  include/linux/fs.h   |  35 +++++++++-
> >  io_uring/rw.c        |  26 ++++++-
> >  3 files changed, 179 insertions(+), 45 deletions(-)
> > 
> > Can also be found here:
> > 
> > https://git.kernel.dk/cgit/linux/log/?h=xfs-async-dio.6
> > 
> > No change in performance since last time, and passes my testing without
> > complaints.
> > 
> > Changes in v6:
> > - Drop the polled patch, it's not needed anymore
> > - Change the "inline is safe" logic based on Dave's suggestions
> > - Gate HIPRI on INLINE_COMP|CALLER_COMP, so polled IO follows the
> >   same rules as inline/deferred completions.
> > - INLINE_COMP is purely for reads, writes can user CALLER_COMP to
> >   avoid a workqueue punt. This is necessary as we need to invalidate
> >   pages on write completions, and if we race with a buffered reader
> >   or writer on the file.
> 
> Dave, are you happy with this one?

I haven't had a chance to look at it yet. Had my head in log hang
bug reports these last few days...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHSET v6 0/8] Improve async iomap DIO performance
  2023-07-28 21:42   ` Dave Chinner
@ 2023-08-01  0:20     ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2023-08-01  0:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: io-uring, linux-xfs, hch, andres, djwong

On 7/28/23 3:42 PM, Dave Chinner wrote:
> On Wed, Jul 26, 2023 at 05:07:57PM -0600, Jens Axboe wrote:
>> On 7/24/23 4:55?PM, Jens Axboe wrote:
>>> Hi,
>>>
>>> Hi,
>>>
>>> This patchset improves async iomap DIO performance, for XFS and ext4.
>>> For full details on this patchset, see the v4 posting:
>>>
>>> https://lore.kernel.org/io-uring/20230720181310.71589-1-axboe@kernel.dk/
>>>
>>>  fs/iomap/direct-io.c | 163 ++++++++++++++++++++++++++++++++-----------
>>>  include/linux/fs.h   |  35 +++++++++-
>>>  io_uring/rw.c        |  26 ++++++-
>>>  3 files changed, 179 insertions(+), 45 deletions(-)
>>>
>>> Can also be found here:
>>>
>>> https://git.kernel.dk/cgit/linux/log/?h=xfs-async-dio.6
>>>
>>> No change in performance since last time, and passes my testing without
>>> complaints.
>>>
>>> Changes in v6:
>>> - Drop the polled patch, it's not needed anymore
>>> - Change the "inline is safe" logic based on Dave's suggestions
>>> - Gate HIPRI on INLINE_COMP|CALLER_COMP, so polled IO follows the
>>>   same rules as inline/deferred completions.
>>> - INLINE_COMP is purely for reads, writes can user CALLER_COMP to
>>>   avoid a workqueue punt. This is necessary as we need to invalidate
>>>   pages on write completions, and if we race with a buffered reader
>>>   or writer on the file.
>>
>> Dave, are you happy with this one?
> 
> I haven't had a chance to look at it yet. Had my head in log hang
> bug reports these last few days...

Is it going to happen anytime soon? Would be nice to get this
flushed out for 6.6.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHSET v6 0/8] Improve async iomap DIO performance
  2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (8 preceding siblings ...)
  2023-07-26 23:07 ` [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
@ 2023-08-01 22:13 ` Dave Chinner
  2023-08-01 23:34   ` Jens Axboe
  9 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2023-08-01 22:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, djwong

On Mon, Jul 24, 2023 at 04:55:03PM -0600, Jens Axboe wrote:
> Hi,
> 
> Hi,
> 
> This patchset improves async iomap DIO performance, for XFS and ext4.
> For full details on this patchset, see the v4 posting:
> 
> https://lore.kernel.org/io-uring/20230720181310.71589-1-axboe@kernel.dk/
> 
>  fs/iomap/direct-io.c | 163 ++++++++++++++++++++++++++++++++-----------
>  include/linux/fs.h   |  35 +++++++++-
>  io_uring/rw.c        |  26 ++++++-
>  3 files changed, 179 insertions(+), 45 deletions(-)
> 
> Can also be found here:
> 
> https://git.kernel.dk/cgit/linux/log/?h=xfs-async-dio.6
> 
> No change in performance since last time, and passes my testing without
> complaints.

All looks good now. You can add:

Reviewed-by: Dave Chinner <dchinner@redhat.com>

To all the patches in the series.

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHSET v6 0/8] Improve async iomap DIO performance
  2023-08-01 22:13 ` Dave Chinner
@ 2023-08-01 23:34   ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2023-08-01 23:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: io-uring, linux-xfs, hch, andres, djwong

On 8/1/23 4:13 PM, Dave Chinner wrote:
> On Mon, Jul 24, 2023 at 04:55:03PM -0600, Jens Axboe wrote:
>> Hi,
>>
>> Hi,
>>
>> This patchset improves async iomap DIO performance, for XFS and ext4.
>> For full details on this patchset, see the v4 posting:
>>
>> https://lore.kernel.org/io-uring/20230720181310.71589-1-axboe@kernel.dk/
>>
>>  fs/iomap/direct-io.c | 163 ++++++++++++++++++++++++++++++++-----------
>>  include/linux/fs.h   |  35 +++++++++-
>>  io_uring/rw.c        |  26 ++++++-
>>  3 files changed, 179 insertions(+), 45 deletions(-)
>>
>> Can also be found here:
>>
>> https://git.kernel.dk/cgit/linux/log/?h=xfs-async-dio.6
>>
>> No change in performance since last time, and passes my testing without
>> complaints.
> 
> All looks good now. You can add:
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> To all the patches in the series.

Great, thank you Dave!

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 6/8] fs: add IOCB flags related to passing back dio completions
  2023-07-21 15:48   ` Darrick J. Wong
@ 2023-07-21 15:53     ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2023-07-21 15:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: io-uring, linux-xfs, hch, andres, david

On 7/21/23 9:48?AM, Darrick J. Wong wrote:
> On Thu, Jul 20, 2023 at 12:13:08PM -0600, Jens Axboe wrote:
>> Async dio completions generally happen from hard/soft IRQ context, which
>> means that users like iomap may need to defer some of the completion
>> handling to a workqueue. This is less efficient than having the original
>> issuer handle it, like we do for sync IO, and it adds latency to the
>> completions.
>>
>> Add IOCB_DIO_DEFER, which the issuer can set if it is able to safely
>> punt these completions to a safe context. If the dio handler is aware
>> of this flag, assign a callback handler in kiocb->dio_complete and
>> associated data io kiocb->private. The issuer will then call this handler
>> with that data from task context.
>>
>> No functional changes in this patch.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  include/linux/fs.h | 34 ++++++++++++++++++++++++++++++++--
>>  1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 6867512907d6..2c589418a078 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -338,6 +338,20 @@ enum rw_hint {
>>  #define IOCB_NOIO		(1 << 20)
>>  /* can use bio alloc cache */
>>  #define IOCB_ALLOC_CACHE	(1 << 21)
>> +/*
>> + * IOCB_DIO_DEFER can be set by the iocb owner, to indicate that the
>> + * iocb completion can be passed back to the owner for execution from a safe
>> + * context rather than needing to be punted through a workqueue. If this
>> + * flag is set, the completion handling may set iocb->dio_complete to a
>> + * handler, which the issuer will then call from task context to complete
>> + * the processing of the iocb. iocb->private should then also be set to
>> + * the argument being passed to this handler. Note that while this provides
> 
> Who should be setting iocb->private?  Can I suggest rewording this to:
> 
> "If this flag is set, the bio completion handling may set
> iocb->dio_complete to a handler function and iocb->private to context
> information for that handler.  The issuer should call the handler with
> that context information from task context to complete the processing of
> the iocb."
> 
> Assuming I've understood what this does from the next patch? :)

Yep this is definitely better - thanks, I'll update it!

>> + * a task context for the dio_complete() callback, it should only be used
>> + * on the completion side for non-IO generating completions. It's fine to
>> + * call blocking functions from this callback, but they should not wait for
>> + * unrelated IO (like cache flushing, new IO generation, etc).
>> + */
>> +#define IOCB_DIO_DEFER		(1 << 22)
> 
> Sorry to nitpick names here, but "defer" feels a little vague to me.
> Defer what?  And to whom?
> 
> This flag means "defer iocb completion to the caller", right?  If so,
> wouldn't this be better named IOCB_DIO_CALLER_COMP?

That is probably better indeed. Naming is hard! CALLER_COMP or
ISSUER_COMP would be better and more descriptive. I'll go with your
suggestion.

>>  /* for use in trace events */
>>  #define TRACE_IOCB_STRINGS \
>> @@ -351,7 +365,8 @@ enum rw_hint {
>>  	{ IOCB_WRITE,		"WRITE" }, \
>>  	{ IOCB_WAITQ,		"WAITQ" }, \
>>  	{ IOCB_NOIO,		"NOIO" }, \
>> -	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }
>> +	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }, \
>> +	{ IOCB_DIO_DEFER,	"DIO_DEFER" }
>>  
>>  struct kiocb {
>>  	struct file		*ki_filp;
>> @@ -360,7 +375,22 @@ struct kiocb {
>>  	void			*private;
>>  	int			ki_flags;
>>  	u16			ki_ioprio; /* See linux/ioprio.h */
>> -	struct wait_page_queue	*ki_waitq; /* for async buffered IO */
>> +	union {
>> +		/*
>> +		 * Only used for async buffered reads, where it denotes the
>> +		 * page waitqueue associated with completing the read. Valid
>> +		 * IFF IOCB_WAITQ is set.
>> +		 */
>> +		struct wait_page_queue	*ki_waitq;
>> +		/*
>> +		 * Can be used for O_DIRECT IO, where the completion handling
>> +		 * is punted back to the issuer of the IO. May only be set
>> +		 * if IOCB_DIO_DEFER is set by the issuer, and the issuer must
>> +		 * then check for presence of this handler when ki_complete is
>> +		 * invoked.
> 
> Might want to reiterate in the comment that kiocb.private should be
> passed as @data.

OK, will do.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 6/8] fs: add IOCB flags related to passing back dio completions
  2023-07-20 18:13 ` [PATCH 6/8] fs: add IOCB flags related to passing back dio completions Jens Axboe
  2023-07-21  6:18   ` Christoph Hellwig
@ 2023-07-21 15:48   ` Darrick J. Wong
  2023-07-21 15:53     ` Jens Axboe
  1 sibling, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2023-07-21 15:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

On Thu, Jul 20, 2023 at 12:13:08PM -0600, Jens Axboe wrote:
> Async dio completions generally happen from hard/soft IRQ context, which
> means that users like iomap may need to defer some of the completion
> handling to a workqueue. This is less efficient than having the original
> issuer handle it, like we do for sync IO, and it adds latency to the
> completions.
> 
> Add IOCB_DIO_DEFER, which the issuer can set if it is able to safely
> punt these completions to a safe context. If the dio handler is aware
> of this flag, assign a callback handler in kiocb->dio_complete and
> associated data io kiocb->private. The issuer will then call this handler
> with that data from task context.
> 
> No functional changes in this patch.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/fs.h | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6867512907d6..2c589418a078 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -338,6 +338,20 @@ enum rw_hint {
>  #define IOCB_NOIO		(1 << 20)
>  /* can use bio alloc cache */
>  #define IOCB_ALLOC_CACHE	(1 << 21)
> +/*
> + * IOCB_DIO_DEFER can be set by the iocb owner, to indicate that the
> + * iocb completion can be passed back to the owner for execution from a safe
> + * context rather than needing to be punted through a workqueue. If this
> + * flag is set, the completion handling may set iocb->dio_complete to a
> + * handler, which the issuer will then call from task context to complete
> + * the processing of the iocb. iocb->private should then also be set to
> + * the argument being passed to this handler. Note that while this provides

Who should be setting iocb->private?  Can I suggest rewording this to:

"If this flag is set, the bio completion handling may set
iocb->dio_complete to a handler function and iocb->private to context
information for that handler.  The issuer should call the handler with
that context information from task context to complete the processing of
the iocb."

Assuming I've understood what this does from the next patch? :)

> + * a task context for the dio_complete() callback, it should only be used
> + * on the completion side for non-IO generating completions. It's fine to
> + * call blocking functions from this callback, but they should not wait for
> + * unrelated IO (like cache flushing, new IO generation, etc).
> + */
> +#define IOCB_DIO_DEFER		(1 << 22)

Sorry to nitpick names here, but "defer" feels a little vague to me.
Defer what?  And to whom?

This flag means "defer iocb completion to the caller", right?  If so,
wouldn't this be better named IOCB_DIO_CALLER_COMP?

>  /* for use in trace events */
>  #define TRACE_IOCB_STRINGS \
> @@ -351,7 +365,8 @@ enum rw_hint {
>  	{ IOCB_WRITE,		"WRITE" }, \
>  	{ IOCB_WAITQ,		"WAITQ" }, \
>  	{ IOCB_NOIO,		"NOIO" }, \
> -	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }
> +	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }, \
> +	{ IOCB_DIO_DEFER,	"DIO_DEFER" }
>  
>  struct kiocb {
>  	struct file		*ki_filp;
> @@ -360,7 +375,22 @@ struct kiocb {
>  	void			*private;
>  	int			ki_flags;
>  	u16			ki_ioprio; /* See linux/ioprio.h */
> -	struct wait_page_queue	*ki_waitq; /* for async buffered IO */
> +	union {
> +		/*
> +		 * Only used for async buffered reads, where it denotes the
> +		 * page waitqueue associated with completing the read. Valid
> +		 * IFF IOCB_WAITQ is set.
> +		 */
> +		struct wait_page_queue	*ki_waitq;
> +		/*
> +		 * Can be used for O_DIRECT IO, where the completion handling
> +		 * is punted back to the issuer of the IO. May only be set
> +		 * if IOCB_DIO_DEFER is set by the issuer, and the issuer must
> +		 * then check for presence of this handler when ki_complete is
> +		 * invoked.

Might want to reiterate in the comment that kiocb.private should be
passed as @data.

--D

> +		 */
> +		ssize_t (*dio_complete)(void *data);
> +	};
>  };
>  
>  static inline bool is_sync_kiocb(struct kiocb *kiocb)
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 6/8] fs: add IOCB flags related to passing back dio completions
  2023-07-20 18:13 ` [PATCH 6/8] fs: add IOCB flags related to passing back dio completions Jens Axboe
@ 2023-07-21  6:18   ` Christoph Hellwig
  2023-07-21 15:48   ` Darrick J. Wong
  1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2023-07-21  6:18 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 6/8] fs: add IOCB flags related to passing back dio completions
  2023-07-20 18:13 [PATCHSET v4 " Jens Axboe
@ 2023-07-20 18:13 ` Jens Axboe
  2023-07-21  6:18   ` Christoph Hellwig
  2023-07-21 15:48   ` Darrick J. Wong
  0 siblings, 2 replies; 18+ messages in thread
From: Jens Axboe @ 2023-07-20 18:13 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

Async dio completions generally happen from hard/soft IRQ context, which
means that users like iomap may need to defer some of the completion
handling to a workqueue. This is less efficient than having the original
issuer handle it, like we do for sync IO, and it adds latency to the
completions.

Add IOCB_DIO_DEFER, which the issuer can set if it is able to safely
punt these completions to a safe context. If the dio handler is aware
of this flag, assign a callback handler in kiocb->dio_complete and
associated data io kiocb->private. The issuer will then call this handler
with that data from task context.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6867512907d6..2c589418a078 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -338,6 +338,20 @@ enum rw_hint {
 #define IOCB_NOIO		(1 << 20)
 /* can use bio alloc cache */
 #define IOCB_ALLOC_CACHE	(1 << 21)
+/*
+ * IOCB_DIO_DEFER can be set by the iocb owner, to indicate that the
+ * iocb completion can be passed back to the owner for execution from a safe
+ * context rather than needing to be punted through a workqueue. If this
+ * flag is set, the completion handling may set iocb->dio_complete to a
+ * handler, which the issuer will then call from task context to complete
+ * the processing of the iocb. iocb->private should then also be set to
+ * the argument being passed to this handler. Note that while this provides
+ * a task context for the dio_complete() callback, it should only be used
+ * on the completion side for non-IO generating completions. It's fine to
+ * call blocking functions from this callback, but they should not wait for
+ * unrelated IO (like cache flushing, new IO generation, etc).
+ */
+#define IOCB_DIO_DEFER		(1 << 22)
 
 /* for use in trace events */
 #define TRACE_IOCB_STRINGS \
@@ -351,7 +365,8 @@ enum rw_hint {
 	{ IOCB_WRITE,		"WRITE" }, \
 	{ IOCB_WAITQ,		"WAITQ" }, \
 	{ IOCB_NOIO,		"NOIO" }, \
-	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }
+	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }, \
+	{ IOCB_DIO_DEFER,	"DIO_DEFER" }
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -360,7 +375,22 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
-	struct wait_page_queue	*ki_waitq; /* for async buffered IO */
+	union {
+		/*
+		 * Only used for async buffered reads, where it denotes the
+		 * page waitqueue associated with completing the read. Valid
+		 * IFF IOCB_WAITQ is set.
+		 */
+		struct wait_page_queue	*ki_waitq;
+		/*
+		 * Can be used for O_DIRECT IO, where the completion handling
+		 * is punted back to the issuer of the IO. May only be set
+		 * if IOCB_DIO_DEFER is set by the issuer, and the issuer must
+		 * then check for presence of this handler when ki_complete is
+		 * invoked.
+		 */
+		ssize_t (*dio_complete)(void *data);
+	};
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-08-01 23:35 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
2023-07-24 22:55 ` [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io() Jens Axboe
2023-07-24 22:55 ` [PATCH 2/8] iomap: use an unsigned type for IOMAP_DIO_* defines Jens Axboe
2023-07-24 22:55 ` [PATCH 3/8] iomap: treat a write through cache the same as FUA Jens Axboe
2023-07-24 22:55 ` [PATCH 4/8] iomap: only set iocb->private for polled bio Jens Axboe
2023-07-24 22:55 ` [PATCH 5/8] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
2023-07-24 22:55 ` [PATCH 6/8] fs: add IOCB flags related to passing back dio completions Jens Axboe
2023-07-24 22:55 ` [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_CALLER_COMP Jens Axboe
2023-07-24 22:55 ` [PATCH 8/8] iomap: support IOCB_DIO_CALLER_COMP Jens Axboe
2023-07-26 23:07 ` [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
2023-07-28 21:42   ` Dave Chinner
2023-08-01  0:20     ` Jens Axboe
2023-08-01 22:13 ` Dave Chinner
2023-08-01 23:34   ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2023-07-20 18:13 [PATCHSET v4 " Jens Axboe
2023-07-20 18:13 ` [PATCH 6/8] fs: add IOCB flags related to passing back dio completions Jens Axboe
2023-07-21  6:18   ` Christoph Hellwig
2023-07-21 15:48   ` Darrick J. Wong
2023-07-21 15:53     ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.