[PATCH RFC 0/2] ext4: Do endio process under irq context for DIO overwrites

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC 0/2] ext4: Do endio process under irq context for DIO overwrites
@ 2024-02-29 11:38 Zhihao Cheng
  2024-02-29 11:38 ` [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag Zhihao Cheng
  2024-02-29 11:38 ` [PATCH RFC 2/2] ext4: Optimize endio process for DIO overwrites Zhihao Cheng
  0 siblings, 2 replies; 6+ messages in thread
From: Zhihao Cheng @ 2024-02-29 11:38 UTC (permalink / raw)
  To: brauner, david, djwong, jack, tytso
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, yi.zhang

Recently we found an ext4 performance regression problem between 4.18
and 5.10 by following test command on a x86 physical machine with nvme:
  fio -direct=1 -iodepth=128 -rw=randwrite -ioengine=libaio -bs=4k
      -size=2G -numjobs=1 -time_based -runtime=60 -group_reporting
      -filename=/test/test -name=Rand_write_Testing --cpus_allowed=1
4.18: 288k IOPS    5.10: 234k IOPS

After anlayzing the changes between above two versions, we found that
the endio context changed since commit 378f32bab3714("ext4: introduce
direct I/O write using iomap infrastructure"), in aio DIO overwriting
case. And the problem still exist in latest mainline.
4.18: endio is processed under irq context
 dio_bio_end_aio
  defer_completion = dio->defer_completion || ... // defer_completion is
                                                  // false for overwrite
  if (defer_completion) {
    queue_work
  } else {
    dio_complete                          // endio in irq context
  }
mainline: endio is processed in kworker
 iomap_dio_bio_end_io
  if (dio->flags & IOMAP_DIO_INLINE_COMP) // false, only for read
  if (dio->flags & IOMAP_DIO_CALLER_COMP) // false, only for io_uring
  queue_work                              // endio in kworker context

Assume that nvme irq registers on cpu 1, and fio runs on cpu 1, it could
be possible(Actually we did reproduce it easily) that fio, nvme irq and
kworker all run on cpu 1. Firstly, compared with 4.18, there are extra
operations(eg. queue work and wake up kworker) while processing endio,
which is slower than processing endio under nvme irq context. Secondly,
fio and kworker will race to get cpu resource, which will slow down fio.

There is no need to do
ext4_convert_unwritten_extents/ext4_handle_inode_extension under
ext4_dio_write_end_io in overwriting case, so we can put ext4 dio endio
under irq context.

It is worth noting that iomap_dio_complete shouldn't be executed under
irq context if datasync is required(IOMAP_DIO_NEED_SYNC), so overwriting
endio is still done in kworker for this situation.

Zhihao Cheng (2):
  iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag
  ext4: Optimize endio process for DIO overwrites

 fs/ext4/file.c        |  8 ++++++--
 fs/iomap/direct-io.c  | 10 ++++++++--
 include/linux/iomap.h |  6 ++++++
 3 files changed, 20 insertions(+), 4 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag
  2024-02-29 11:38 [PATCH RFC 0/2] ext4: Do endio process under irq context for DIO overwrites Zhihao Cheng
@ 2024-02-29 11:38 ` Zhihao Cheng
  2024-03-01  0:40   ` Dave Chinner
  2024-02-29 11:38 ` [PATCH RFC 2/2] ext4: Optimize endio process for DIO overwrites Zhihao Cheng
  1 sibling, 1 reply; 6+ messages in thread
From: Zhihao Cheng @ 2024-02-29 11:38 UTC (permalink / raw)
  To: brauner, david, djwong, jack, tytso
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, yi.zhang

It will be more efficient to execute quick endio process(eg. non-sync
overwriting case) under irq process rather than starting a worker to
do it.
Add a flag to control DIO to be finished inline(under irq context), which
can be used for non-sync overwriting case.
Besides, skip invalidating pages if DIO is finished inline, which will
keep the same logic with dio_bio_end_aio in non-sync overwriting case.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
---
 fs/iomap/direct-io.c  | 10 ++++++++--
 include/linux/iomap.h |  6 ++++++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..221715b38ce2 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -110,7 +110,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	 * ->end_io() when necessary, otherwise a racing buffer read would cache
 	 * zeros from unwritten extents.
 	 */
-	if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE))
+	if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE) &&
+	    !(dio->flags & IOMAP_DIO_INLINE_COMP))
 		kiocb_invalidate_post_direct_write(iocb, dio->size);
 
 	inode_dio_end(file_inode(iocb->ki_filp));
@@ -122,8 +123,10 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 		 * If this is a DSYNC write, make sure we push it to stable
 		 * storage now that we've written data.
 		 */
-		if (dio->flags & IOMAP_DIO_NEED_SYNC)
+		if (dio->flags & IOMAP_DIO_NEED_SYNC) {
+			WARN_ON_ONCE(dio->flags & IOMAP_DIO_INLINE_COMP);
 			ret = generic_write_sync(iocb, ret);
+		}
 		if (ret > 0)
 			ret += dio->done_before;
 	}
@@ -628,6 +631,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			*/
 			if (!(iocb->ki_flags & IOCB_SYNC))
 				dio->flags |= IOMAP_DIO_WRITE_THROUGH;
+		} else if (dio_flags & IOMAP_DIO_MAY_INLINE_COMP) {
+			/* writes could complete inline */
+			dio->flags |= IOMAP_DIO_INLINE_COMP;
 		}
 
 		/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 96dd0acbba44..f292b10028d0 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -382,6 +382,12 @@ struct iomap_dio_ops {
  */
 #define IOMAP_DIO_PARTIAL		(1 << 2)
 
+/*
+ * DIO will be completed inline unless sync operation is needed after io is
+ * finished.
+ */
+#define IOMAP_DIO_MAY_INLINE_COMP	(1 << 3)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH RFC 2/2] ext4: Optimize endio process for DIO overwrites
  2024-02-29 11:38 [PATCH RFC 0/2] ext4: Do endio process under irq context for DIO overwrites Zhihao Cheng
  2024-02-29 11:38 ` [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag Zhihao Cheng
@ 2024-02-29 11:38 ` Zhihao Cheng
  1 sibling, 0 replies; 6+ messages in thread
From: Zhihao Cheng @ 2024-02-29 11:38 UTC (permalink / raw)
  To: brauner, david, djwong, jack, tytso
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, yi.zhang

In DIO overwriting case, there is no need to convert unwritten exntents
and ext4_handle_inode_extension() can be ignored, which means that endio
process can be executed under irq context. Since commit 240930fb7e6b5
("ext4: dio take shared inode lock when overwriting preallocated blocks")
has provided a method to judge whether overwriting is happening, just do
nothing in endio process if DIO overwriting happens.
This patch enables ext4 processing endio under irq context in DIO
overwriting case, which brings a performance improvement in the
following fio test on a x86 physical machine with nvme when irq
and fio run on the same cpu:

Test: fio -direct=1 -iodepth=128 -rw=randwrite -ioengine=libaio -bs=4k
-size=2G -numjobs=1 -overwrite=1 -time_based -runtime=60 -group_reporting
-filename=/test/test -name=Rand_write_Testing --cpus_allowed=1

before: 953 MiB/s  after: 1350 MiB/s, ~41% perf improvement.

Suggested-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
---
 fs/ext4/file.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 54d6ff22585c..411a05c6b96e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -503,6 +503,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	loff_t offset = iocb->ki_pos;
 	size_t count = iov_iter_count(from);
 	const struct iomap_ops *iomap_ops = &ext4_iomap_ops;
+	const struct iomap_dio_ops *iomap_dops = &ext4_dio_write_ops;
 	bool extend = false, unwritten = false;
 	bool ilock_shared = true;
 	int dio_flags = 0;
@@ -572,9 +573,12 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		ext4_journal_stop(handle);
 	}
 
-	if (ilock_shared && !unwritten)
+	if (ilock_shared && !unwritten) {
 		iomap_ops = &ext4_iomap_overwrite_ops;
-	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
+		iomap_dops = NULL;
+		dio_flags = IOMAP_DIO_MAY_INLINE_COMP;
+	}
+	ret = iomap_dio_rw(iocb, from, iomap_ops, iomap_dops,
 			   dio_flags, NULL, 0);
 	if (ret == -ENOTBLK)
 		ret = 0;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag
  2024-02-29 11:38 ` [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag Zhihao Cheng
@ 2024-03-01  0:40   ` Dave Chinner
  2024-03-01 10:02     ` Zhihao Cheng
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2024-03-01  0:40 UTC (permalink / raw)
  To: Zhihao Cheng
  Cc: brauner, djwong, jack, tytso, linux-xfs, linux-fsdevel,
	linux-kernel, linux-ext4, yi.zhang

On Thu, Feb 29, 2024 at 07:38:48PM +0800, Zhihao Cheng wrote:
> It will be more efficient to execute quick endio process(eg. non-sync
> overwriting case) under irq process rather than starting a worker to
> do it.
> Add a flag to control DIO to be finished inline(under irq context), which
> can be used for non-sync overwriting case.
> Besides, skip invalidating pages if DIO is finished inline, which will
> keep the same logic with dio_bio_end_aio in non-sync overwriting case.
> 
> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>

A nice idea, but I don't think an ext4 specific API flag is the
right way to go about enabling this. The iomap dio code knows if
the write is pure overwrite already - we have the IOMAP_F_DIRTY flag
for that, and we combine this with IOMAP_DIO_WRITE_THROUGH to do the
pure overwrite FUA optimisations.

That is:

		/*
                 * Use a FUA write if we need datasync semantics, this is a pure
                 * data IO that doesn't require any metadata updates (including
                 * after IO completion such as unwritten extent conversion) and
                 * the underlying device either supports FUA or doesn't have
                 * a volatile write cache. This allows us to avoid cache flushes
                 * on IO completion. If we can't use writethrough and need to
                 * sync, disable in-task completions as dio completion will
                 * need to call generic_write_sync() which will do a blocking
                 * fsync / cache flush call.
                 */
                if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
                    (dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
                    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
                        use_fua = true;

Hence if we want to optimise pure overwrites that have no data sync
requirements, we already have the detection and triggers in place to
do this. We just need to change the way we set up the IO flags to
allow write-through (i.e. non-blocking IO completions) to use inline
completions.

In __iomap_dio_rw():

+	/* Always try to complete inline. */
+	dio->flags |= IOMAP_DIO_INLINE_COMP;
	if (iov_iter_rw(iter) == READ) {                                         
-               /* reads can always complete inline */                           
-               dio->flags |= IOMAP_DIO_INLINE_COMP;
....

	} else {
+		/* Always try write-through semantics. If we can't
+		 * use writethough, it will be disabled along with
+		 * IOMAP_DIO_INLINE_COMP before dio completion is run
+		 * so it can be deferred to a task completion context
+		 * appropriately.
+		 */
+               dio->flags |= IOMAP_DIO_WRITE | IOMAP_DIO_WRITE_THROUGH;
		iomi.flags |= IOMAP_WRITE;
-               dio->flags |= IOMAP_DIO_WRITE;
.....
		/* for data sync or sync, we need sync completion processing */
                if (iocb_is_dsync(iocb)) {
                        dio->flags |= IOMAP_DIO_NEED_SYNC;

-                      /*
-                       * For datasync only writes, we optimistically try using
-                       * WRITE_THROUGH for this IO. This flag requires either
-                       * FUA writes through the device's write cache, or a
-                       * normal write to a device without a volatile write
-                       * cache. For the former, Any non-FUA write that occurs
-                       * will clear this flag, hence we know before completion
-                       * whether a cache flush is necessary.
-                       */
-                       if (!(iocb->ki_flags & IOCB_SYNC))
-                               dio->flags |= IOMAP_DIO_WRITE_THROUGH;
+			* For sync writes we know we are going to need
+			* blocking completion processing, so turn off
+			* writethrough now.
+			*/
			if (iocb->ki_flags & IOCB_SYNC) {
				dio->flags &= ~(IOMAP_DIO_WRITE_THROUGH |
						IOMAP_DIO_INLINE_COMP);
			}
                }

This then sets up iomap_dio_bio_iter() to be able to determine if
the iomap returned is for a pure overwrite and allow for the use of
inline write completions.

	/*
	 * If we have a pure overwrite, we know that IO completion
	 * will not block and so we can use write through completion
	 * semantics and complete the write inline. If it's not a
	 * pure overwrite, make sure that we always defer
	 * completions to a task context.
	 */
	if (dio->flags & IOMAP_DIO_WRITE_THROUGH) {
		if (iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) {
			dio->flags &= ~(IOMAP_DIO_WRITE_THROUGH |
					IOMAP_DIO_INLINE_COMP);
		} else if ((dio->flags & IOMAP_DIO_NEED_SYNC) &&
			   (bdev_fua(iomap->bdev) ||
			    !bdev_write_cache(iomap->bdev))) {
			/*
			 * Use REQ_FUA for datasync overwrites to
			 * avoid cache flushes on IO completion on
			 * devices that support FUA or don't have
			 * volatile caches.
			 */
			use_fua = true;
		}
	}


and iomap_dio_bio_opflags() gets changed to also clear
IOMAP_DIO_INLINE_COMP when it clears IOMAP_DIO_WRITE_THROUGH....

That then makes all pure overwrites for every filesystem do inline
completions without changing calling conventions. i.e. it's up to
the filesystem ->iomap_begin callouts to indicate whether the write
mapping returned requires blocking operations to be done in IO
completion (i.e. set the IOMAP_F_DIRTY flag) appropriately.

However, this does mean that any spinlock taken in the ->end_io()
callbacks now needs to be irq safe. e.g. in xfs_dio_write_end_io()
the spinlock protection around inode size updates will need to use
an irq safe locking, as will the locking in the DIO submission path
that it serialises against in xfs_file_write_checks(). That probably
is best implemented as a separate spinlock.

There will also be other filesystems that need to set IOMAP_F_DIRTY
unconditionally (e.g. zonefs) because they always take blocking
locks in their ->end_io callbacks and so must always run in task
context...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag
  2024-03-01  0:40   ` Dave Chinner
@ 2024-03-01 10:02     ` Zhihao Cheng
  2024-03-11  7:55       ` Zhihao Cheng
  0 siblings, 1 reply; 6+ messages in thread
From: Zhihao Cheng @ 2024-03-01 10:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: brauner, djwong, jack, tytso, linux-xfs, linux-fsdevel,
	linux-kernel, linux-ext4, yi.zhang

在 2024/3/1 8:40, Dave Chinner 写道:

Hi Dave, thanks for your detailed and nice suggestions, I have a few 
questions below.
> On Thu, Feb 29, 2024 at 07:38:48PM +0800, Zhihao Cheng wrote:
>> It will be more efficient to execute quick endio process(eg. non-sync
>> overwriting case) under irq process rather than starting a worker to
>> do it.
>> Add a flag to control DIO to be finished inline(under irq context), which
>> can be used for non-sync overwriting case.
>> Besides, skip invalidating pages if DIO is finished inline, which will
>> keep the same logic with dio_bio_end_aio in non-sync overwriting case.
>>
>> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
> 
> A nice idea, but I don't think an ext4 specific API flag is the
> right way to go about enabling this. The iomap dio code knows if
> the write is pure overwrite already - we have the IOMAP_F_DIRTY flag
> for that, and we combine this with IOMAP_DIO_WRITE_THROUGH to do the
> pure overwrite FUA optimisations.
> 
> That is:
> 
> 		/*
>                   * Use a FUA write if we need datasync semantics, this is a pure
>                   * data IO that doesn't require any metadata updates (including
>                   * after IO completion such as unwritten extent conversion) and
>                   * the underlying device either supports FUA or doesn't have
>                   * a volatile write cache. This allows us to avoid cache flushes
>                   * on IO completion. If we can't use writethrough and need to
>                   * sync, disable in-task completions as dio completion will
>                   * need to call generic_write_sync() which will do a blocking
>                   * fsync / cache flush call.
>                   */
>                  if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
>                      (dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
>                      (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
>                          use_fua = true;
> 
> Hence if we want to optimise pure overwrites that have no data sync
> requirements, we already have the detection and triggers in place to
> do this. We just need to change the way we set up the IO flags to
> allow write-through (i.e. non-blocking IO completions) to use inline
> completions.
> 
> In __iomap_dio_rw():
> 
> +	/* Always try to complete inline. */
> +	dio->flags |= IOMAP_DIO_INLINE_COMP;
> 	if (iov_iter_rw(iter) == READ) {
> -               /* reads can always complete inline */
> -               dio->flags |= IOMAP_DIO_INLINE_COMP;
> ....
> 
> 	} else {
> +		/* Always try write-through semantics. If we can't
> +		 * use writethough, it will be disabled along with
> +		 * IOMAP_DIO_INLINE_COMP before dio completion is run
> +		 * so it can be deferred to a task completion context
> +		 * appropriately.
> +		 */
> +               dio->flags |= IOMAP_DIO_WRITE | IOMAP_DIO_WRITE_THROUGH;

There is a behavior change here, if we set IOMAP_DIO_WRITE_THROUGH 
unconditionally, non-datasync IO will be set with REQ_FUA, which means 
that device will flush writecache for each IO, will it affect the 
performance in non-sync dio case?
> 		iomi.flags |= IOMAP_WRITE;
> -               dio->flags |= IOMAP_DIO_WRITE;
> .....
> 		/* for data sync or sync, we need sync completion processing */
>                  if (iocb_is_dsync(iocb)) {
>                          dio->flags |= IOMAP_DIO_NEED_SYNC;
> 
> -                      /*
> -                       * For datasync only writes, we optimistically try using
> -                       * WRITE_THROUGH for this IO. This flag requires either
> -                       * FUA writes through the device's write cache, or a
> -                       * normal write to a device without a volatile write
> -                       * cache. For the former, Any non-FUA write that occurs
> -                       * will clear this flag, hence we know before completion
> -                       * whether a cache flush is necessary.
> -                       */
> -                       if (!(iocb->ki_flags & IOCB_SYNC))
> -                               dio->flags |= IOMAP_DIO_WRITE_THROUGH;
> +			* For sync writes we know we are going to need
> +			* blocking completion processing, so turn off
> +			* writethrough now.
> +			*/
> 			if (iocb->ki_flags & IOCB_SYNC) {
> 				dio->flags &= ~(IOMAP_DIO_WRITE_THROUGH |
> 						IOMAP_DIO_INLINE_COMP);
> 			}
>                  }
> 

[...]
> 
> However, this does mean that any spinlock taken in the ->end_io()
> callbacks now needs to be irq safe. e.g. in xfs_dio_write_end_io()
> the spinlock protection around inode size updates will need to use
> an irq safe locking, as will the locking in the DIO submission path
> that it serialises against in xfs_file_write_checks(). That probably
> is best implemented as a separate spinlock.
> 
> There will also be other filesystems that need to set IOMAP_F_DIRTY
> unconditionally (e.g. zonefs) because they always take blocking
> locks in their ->end_io callbacks and so must always run in task
> context...
Should we add a new flag(eg. IOMAP_F_ENDIO_IRQ ?) to indicate that the 
endio cannot be done under irq? Because I think IOMAP_F_DIRTY means that 
the metadata needs to be written, if we add a new semantics(endio must 
be done in defered work) for this flag, the code will looks a little 
complicated.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag
  2024-03-01 10:02     ` Zhihao Cheng
@ 2024-03-11  7:55       ` Zhihao Cheng
  0 siblings, 0 replies; 6+ messages in thread
From: Zhihao Cheng @ 2024-03-11  7:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: brauner, djwong, jack, tytso, linux-xfs, linux-fsdevel,
	linux-kernel, linux-ext4, yi.zhang

在 2024/3/1 18:02, Zhihao Cheng 写道:
> 在 2024/3/1 8:40, Dave Chinner 写道:
> 
Hi Dave. Friendly ping.
> Hi Dave, thanks for your detailed and nice suggestions, I have a few 
> questions below.
>> On Thu, Feb 29, 2024 at 07:38:48PM +0800, Zhihao Cheng wrote:
>>> It will be more efficient to execute quick endio process(eg. non-sync
>>> overwriting case) under irq process rather than starting a worker to
>>> do it.
>>> Add a flag to control DIO to be finished inline(under irq context), 
>>> which
>>> can be used for non-sync overwriting case.
>>> Besides, skip invalidating pages if DIO is finished inline, which will
>>> keep the same logic with dio_bio_end_aio in non-sync overwriting case.
>>>
>>> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
>>
>> A nice idea, but I don't think an ext4 specific API flag is the
>> right way to go about enabling this. The iomap dio code knows if
>> the write is pure overwrite already - we have the IOMAP_F_DIRTY flag
>> for that, and we combine this with IOMAP_DIO_WRITE_THROUGH to do the
>> pure overwrite FUA optimisations.
>>
>> That is:
>>
>>         /*
>>                   * Use a FUA write if we need datasync semantics, 
>> this is a pure
>>                   * data IO that doesn't require any metadata updates 
>> (including
>>                   * after IO completion such as unwritten extent 
>> conversion) and
>>                   * the underlying device either supports FUA or 
>> doesn't have
>>                   * a volatile write cache. This allows us to avoid 
>> cache flushes
>>                   * on IO completion. If we can't use writethrough and 
>> need to
>>                   * sync, disable in-task completions as dio 
>> completion will
>>                   * need to call generic_write_sync() which will do a 
>> blocking
>>                   * fsync / cache flush call.
>>                   */
>>                  if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
>>                      (dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
>>                      (bdev_fua(iomap->bdev) || 
>> !bdev_write_cache(iomap->bdev)))
>>                          use_fua = true;
>>
>> Hence if we want to optimise pure overwrites that have no data sync
>> requirements, we already have the detection and triggers in place to
>> do this. We just need to change the way we set up the IO flags to
>> allow write-through (i.e. non-blocking IO completions) to use inline
>> completions.
>>
>> In __iomap_dio_rw():
>>
>> +    /* Always try to complete inline. */
>> +    dio->flags |= IOMAP_DIO_INLINE_COMP;
>>     if (iov_iter_rw(iter) == READ) {
>> -               /* reads can always complete inline */
>> -               dio->flags |= IOMAP_DIO_INLINE_COMP;
>> ....
>>
>>     } else {
>> +        /* Always try write-through semantics. If we can't
>> +         * use writethough, it will be disabled along with
>> +         * IOMAP_DIO_INLINE_COMP before dio completion is run
>> +         * so it can be deferred to a task completion context
>> +         * appropriately.
>> +         */
>> +               dio->flags |= IOMAP_DIO_WRITE | IOMAP_DIO_WRITE_THROUGH;
> 
> There is a behavior change here, if we set IOMAP_DIO_WRITE_THROUGH 
> unconditionally, non-datasync IO will be set with REQ_FUA, which means 
> that device will flush writecache for each IO, will it affect the 
> performance in non-sync dio case?
>>         iomi.flags |= IOMAP_WRITE;
>> -               dio->flags |= IOMAP_DIO_WRITE;
>> .....
>>         /* for data sync or sync, we need sync completion processing */
>>                  if (iocb_is_dsync(iocb)) {
>>                          dio->flags |= IOMAP_DIO_NEED_SYNC;
>>
>> -                      /*
>> -                       * For datasync only writes, we optimistically 
>> try using
>> -                       * WRITE_THROUGH for this IO. This flag 
>> requires either
>> -                       * FUA writes through the device's write cache, 
>> or a
>> -                       * normal write to a device without a volatile 
>> write
>> -                       * cache. For the former, Any non-FUA write 
>> that occurs
>> -                       * will clear this flag, hence we know before 
>> completion
>> -                       * whether a cache flush is necessary.
>> -                       */
>> -                       if (!(iocb->ki_flags & IOCB_SYNC))
>> -                               dio->flags |= IOMAP_DIO_WRITE_THROUGH;
>> +            * For sync writes we know we are going to need
>> +            * blocking completion processing, so turn off
>> +            * writethrough now.
>> +            */
>>             if (iocb->ki_flags & IOCB_SYNC) {
>>                 dio->flags &= ~(IOMAP_DIO_WRITE_THROUGH |
>>                         IOMAP_DIO_INLINE_COMP);
>>             }
>>                  }
>>
> 
> [...]
>>
>> However, this does mean that any spinlock taken in the ->end_io()
>> callbacks now needs to be irq safe. e.g. in xfs_dio_write_end_io()
>> the spinlock protection around inode size updates will need to use
>> an irq safe locking, as will the locking in the DIO submission path
>> that it serialises against in xfs_file_write_checks(). That probably
>> is best implemented as a separate spinlock.
>>
>> There will also be other filesystems that need to set IOMAP_F_DIRTY
>> unconditionally (e.g. zonefs) because they always take blocking
>> locks in their ->end_io callbacks and so must always run in task
>> context...
> Should we add a new flag(eg. IOMAP_F_ENDIO_IRQ ?) to indicate that the 
> endio cannot be done under irq? Because I think IOMAP_F_DIRTY means that 
> the metadata needs to be written, if we add a new semantics(endio must 
> be done in defered work) for this flag, the code will looks a little 
> complicated.
> 
> 
> .


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-03-11  7:55 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-29 11:38 [PATCH RFC 0/2] ext4: Do endio process under irq context for DIO overwrites Zhihao Cheng
2024-02-29 11:38 ` [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag Zhihao Cheng
2024-03-01  0:40   ` Dave Chinner
2024-03-01 10:02     ` Zhihao Cheng
2024-03-11  7:55       ` Zhihao Cheng
2024-02-29 11:38 ` [PATCH RFC 2/2] ext4: Optimize endio process for DIO overwrites Zhihao Cheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).