linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4 V2] iomap: Use FUA for O_DSYNC DIO writes
@ 2018-04-18  4:08 Dave Chinner
  2018-04-18  4:08 ` [PATCH 1/4] xfs: move generic_write_sync calls inwards Dave Chinner
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Dave Chinner @ 2018-04-18  4:08 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-block, hch, rdorr

Hi folks,

This is the latest version of the "use FUA for DIO writes" patchset
that was last posted here:

https://marc.info/?l=linux-xfs&m=152213446528167&w=2

Functionality and performance is the same as the previous version.
Changes in this version address Christoph's review comments.

Version 2:
- Fixed comment typos in first patch
- simplified iomap_dio_complete_work()
- changed IOMAP_DIO_WRITE_SYNC to IOMAP_DIO_NEED_SYNC
- split blk_queue_fua() into it's own patch
- fixed formatting issue in last patch
- update bio->io_opf directly rather than use bio_set_op_attrs()
- Updated comment to mention we try to use FUA optimistically.

Cheers,

Dave.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/4] xfs: move generic_write_sync calls inwards
  2018-04-18  4:08 [PATCH 0/4 V2] iomap: Use FUA for O_DSYNC DIO writes Dave Chinner
@ 2018-04-18  4:08 ` Dave Chinner
  2018-04-18  4:08 ` [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes Dave Chinner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2018-04-18  4:08 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-block, hch, rdorr

From: Dave Chinner <dchinner@redhat.com>

To prepare for iomap iinfrastructure based DSYNC optimisations.

While moving the code araound, move the XFS write bytes metric
update for direct IO into xfs_dio_write_end_io callback so that we
always capture the amount of data written via AIO+DIO. This fixes
the problem where queued AIO+DIO writes are not accounted to this
metric.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 48 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a5b6973bf15a..6f15027661b6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -416,6 +416,12 @@ xfs_dio_write_end_io(
 	if (size <= 0)
 		return size;
 
+	/*
+	 * Capture amount written on completion as we can't reliably account
+	 * for it on submission.
+	 */
+	XFS_STATS_ADD(ip->i_mount, xs_write_bytes, size);
+
 	if (flags & IOMAP_DIO_COW) {
 		error = xfs_reflink_end_cow(ip, offset, size);
 		if (error)
@@ -564,6 +570,11 @@ xfs_file_dio_aio_write(
 	 * complete fully or fail.
 	 */
 	ASSERT(ret < 0 || ret == count);
+
+	if (ret > 0) {
+		/* Handle various SYNC-type writes */
+		ret = generic_write_sync(iocb, ret);
+	}
 	return ret;
 }
 
@@ -601,7 +612,16 @@ xfs_file_dax_write(
 	}
 out:
 	xfs_iunlock(ip, iolock);
-	return error ? error : ret;
+	if (error)
+		return error;
+
+	if (ret > 0) {
+		XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
+
+		/* Handle various SYNC-type writes */
+		ret = generic_write_sync(iocb, ret);
+	}
+	return ret;
 }
 
 STATIC ssize_t
@@ -671,6 +691,12 @@ xfs_file_buffered_aio_write(
 out:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
+
+	if (ret > 0) {
+		XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
+		/* Handle various SYNC-type writes */
+		ret = generic_write_sync(iocb, ret);
+	}
 	return ret;
 }
 
@@ -695,8 +721,9 @@ xfs_file_write_iter(
 		return -EIO;
 
 	if (IS_DAX(inode))
-		ret = xfs_file_dax_write(iocb, from);
-	else if (iocb->ki_flags & IOCB_DIRECT) {
+		return xfs_file_dax_write(iocb, from);
+
+	if (iocb->ki_flags & IOCB_DIRECT) {
 		/*
 		 * Allow a directio write to fall back to a buffered
 		 * write *only* in the case that we're doing a reflink
@@ -704,20 +731,11 @@ xfs_file_write_iter(
 		 * allow an operation to fall back to buffered mode.
 		 */
 		ret = xfs_file_dio_aio_write(iocb, from);
-		if (ret == -EREMCHG)
-			goto buffered;
-	} else {
-buffered:
-		ret = xfs_file_buffered_aio_write(iocb, from);
+		if (ret != -EREMCHG)
+			return ret;
 	}
 
-	if (ret > 0) {
-		XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
-
-		/* Handle various SYNC-type writes */
-		ret = generic_write_sync(iocb, ret);
-	}
-	return ret;
+	return xfs_file_buffered_aio_write(iocb, from);
 }
 
 #define	XFS_FALLOC_FL_SUPPORTED						\
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes
  2018-04-18  4:08 [PATCH 0/4 V2] iomap: Use FUA for O_DSYNC DIO writes Dave Chinner
  2018-04-18  4:08 ` [PATCH 1/4] xfs: move generic_write_sync calls inwards Dave Chinner
@ 2018-04-18  4:08 ` Dave Chinner
  2018-04-21 13:03   ` Jan Kara
  2018-04-18  4:08 ` [PATCH 3/4] blk: add blk_queue_fua() helper function Dave Chinner
  2018-04-18  4:08 ` [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes Dave Chinner
  3 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2018-04-18  4:08 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-block, hch, rdorr

From: Dave Chinner <dchinner@redhat.com>

Currently iomap_dio_rw() only handles (data)sync write completions
for AIO. This means we can't optimised non-AIO IO to minimise device
flushes as we can't tell the caller whether a flush is required or
not.

To solve this problem and enable further optimisations, make
iomap_dio_rw responsible for data sync behaviour for all IO, not
just AIO.

In doing so, the sync operation is now accounted as part of the DIO
IO by inode_dio_end(), hence post-IO data stability updates will no
long race against operations that serialise via inode_dio_wait()
such as truncate or hole punch.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c        | 22 +++++++++++++++-------
 fs/xfs/xfs_file.c |  5 -----
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index afd163586aa0..1f59c2d9ade6 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -685,6 +685,7 @@ EXPORT_SYMBOL_GPL(iomap_seek_data);
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_NEED_SYNC	(1 << 29)
 #define IOMAP_DIO_WRITE		(1 << 30)
 #define IOMAP_DIO_DIRTY		(1 << 31)
 
@@ -759,6 +760,13 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 			dio_warn_stale_pagecache(iocb->ki_filp);
 	}
 
+	/*
+	 * If this is a DSYNC write, make sure we push it to stable storage now
+	 * that we've written data.
+	 */
+	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
+		ret = generic_write_sync(iocb, ret);
+
 	inode_dio_end(file_inode(iocb->ki_filp));
 	kfree(dio);
 
@@ -768,14 +776,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 static void iomap_dio_complete_work(struct work_struct *work)
 {
 	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
-	struct kiocb *iocb = dio->iocb;
-	bool is_write = (dio->flags & IOMAP_DIO_WRITE);
-	ssize_t ret;
 
-	ret = iomap_dio_complete(dio);
-	if (is_write && ret > 0)
-		ret = generic_write_sync(iocb, ret);
-	iocb->ki_complete(iocb, ret, 0);
+	dio->iocb->ki_complete(dio->iocb, iomap_dio_complete(dio), 0);
 }
 
 /*
@@ -961,6 +963,10 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	return copied;
 }
 
+/*
+ * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
+ * is being issued as AIO or not.
+ */
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, iomap_dio_end_io_t end_io)
@@ -1006,6 +1012,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			dio->flags |= IOMAP_DIO_DIRTY;
 	} else {
 		dio->flags |= IOMAP_DIO_WRITE;
+		if (iocb->ki_flags & IOCB_DSYNC)
+			dio->flags |= IOMAP_DIO_NEED_SYNC;
 		flags |= IOMAP_WRITE;
 	}
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 6f15027661b6..0c4b8313d544 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -570,11 +570,6 @@ xfs_file_dio_aio_write(
 	 * complete fully or fail.
 	 */
 	ASSERT(ret < 0 || ret == count);
-
-	if (ret > 0) {
-		/* Handle various SYNC-type writes */
-		ret = generic_write_sync(iocb, ret);
-	}
 	return ret;
 }
 
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/4] blk: add blk_queue_fua() helper function
  2018-04-18  4:08 [PATCH 0/4 V2] iomap: Use FUA for O_DSYNC DIO writes Dave Chinner
  2018-04-18  4:08 ` [PATCH 1/4] xfs: move generic_write_sync calls inwards Dave Chinner
  2018-04-18  4:08 ` [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes Dave Chinner
@ 2018-04-18  4:08 ` Dave Chinner
  2018-04-18 10:34   ` Christoph Hellwig
  2018-04-18 14:23   ` Jens Axboe
  2018-04-18  4:08 ` [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes Dave Chinner
  3 siblings, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2018-04-18  4:08 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-block, hch, rdorr

From: Dave Chinner <dchinner@redhat.com>

So we can check FUA support status from the iomap direct IO code.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 include/linux/blkdev.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9af3e0f430bc..c362aadfe036 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -737,6 +737,7 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q);
 #define blk_queue_quiesced(q)	test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
 #define blk_queue_preempt_only(q)				\
 	test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+#define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
 
 extern int blk_set_preempt_only(struct request_queue *q);
 extern void blk_clear_preempt_only(struct request_queue *q);
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes
  2018-04-18  4:08 [PATCH 0/4 V2] iomap: Use FUA for O_DSYNC DIO writes Dave Chinner
                   ` (2 preceding siblings ...)
  2018-04-18  4:08 ` [PATCH 3/4] blk: add blk_queue_fua() helper function Dave Chinner
@ 2018-04-18  4:08 ` Dave Chinner
  2018-04-21 12:54   ` Jan Kara
  3 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2018-04-18  4:08 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, linux-block, hch, rdorr

From: Dave Chinner <dchinner@redhat.com>

If we are doing direct IO writes with datasync semantics, we often
have to flush metadata changes along with the data write. However,
if we are overwriting existing data, there are no metadata changes
that we need to flush. In this case, optimising the IO by using
FUA write makes sense.

We know from the IOMAP_F_DIRTY flag as to whether a specific inode
requires a metadata flush - this is currently used by DAX to ensure
extent modification as stable in page fault operations. For direct
IO writes, we can use it to determine if we need to flush metadata
or not once the data is on disk.

Hence if we have been returned a mapped extent that is not new and
the IO mapping is not dirty, then we can use a FUA write to provide
datasync semantics. This allows us to short-cut the
generic_write_sync() call in IO completion and hence avoid
unnecessary operations. This makes pure direct IO data write
behaviour identical to the way block devices use REQ_FUA to provide
datasync semantics.

On a FUA enabled device, a synchronous direct IO write workload
(sequential 4k overwrites in 32MB file) had the following results:

# xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo

kernel		time	write()s	write iops	Write b/w
------		----	--------	----------	---------
(no dsync)	 4s	2173/s		2173		8.5MB/s
vanilla		22s	 370/s		 750		1.4MB/s
patched		19s	 420/s		 420		1.6MB/s

The patched code clearly doesn't send cache flushes anymore, but
instead uses FUA (confirmed via blktrace), and performance improves
a bit as a result. However, the benefits will be higher on workloads
that mix O_DSYNC overwrites with other write IO as we won't be
flushing the entire device cache on every DSYNC overwrite IO
anymore.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c | 48 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 43 insertions(+), 5 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 1f59c2d9ade6..62f1f8256da2 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -685,6 +685,7 @@ EXPORT_SYMBOL_GPL(iomap_seek_data);
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_WRITE_FUA	(1 << 28)
 #define IOMAP_DIO_NEED_SYNC	(1 << 29)
 #define IOMAP_DIO_WRITE		(1 << 30)
 #define IOMAP_DIO_DIRTY		(1 << 31)
@@ -860,6 +861,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct iov_iter iter;
 	struct bio *bio;
 	bool need_zeroout = false;
+	bool use_fua = false;
 	int nr_pages, ret;
 	size_t copied = 0;
 
@@ -883,8 +885,20 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	case IOMAP_MAPPED:
 		if (iomap->flags & IOMAP_F_SHARED)
 			dio->flags |= IOMAP_DIO_COW;
-		if (iomap->flags & IOMAP_F_NEW)
+		if (iomap->flags & IOMAP_F_NEW) {
 			need_zeroout = true;
+		} else {
+			/*
+			 * Use a FUA write if we need datasync semantics, this
+			 * is a pure data IO that doesn't require any metadata
+			 * updates and the underlying device supports FUA. This
+			 * allows us to avoid cache flushes on IO completion.
+			 */
+			if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
+			    (dio->flags & IOMAP_DIO_WRITE_FUA) &&
+			    blk_queue_fua(bdev_get_queue(iomap->bdev)))
+				use_fua = true;
+		}
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -932,10 +946,14 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
-			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
+			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
+			if (use_fua)
+				bio->bi_opf |= REQ_FUA;
+			else
+				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
 			task_io_account_write(n);
 		} else {
-			bio_set_op_attrs(bio, REQ_OP_READ, 0);
+			bio->bi_opf = REQ_OP_READ;
 			if (dio->flags & IOMAP_DIO_DIRTY)
 				bio_set_pages_dirty(bio);
 		}
@@ -965,7 +983,12 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 /*
  * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
- * is being issued as AIO or not.
+ * is being issued as AIO or not.  This allows us to optimise pure data writes
+ * to use REQ_FUA rather than requiring generic_write_sync() to issue a
+ * REQ_FLUSH post write. This is slightly tricky because a single request here
+ * can be mapped into multiple disjoint IOs and only a subset of the IOs issued
+ * may be pure data writes. In that case, we still need to do a full data sync
+ * completion.
  */
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
@@ -1012,8 +1035,16 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			dio->flags |= IOMAP_DIO_DIRTY;
 	} else {
 		dio->flags |= IOMAP_DIO_WRITE;
-		if (iocb->ki_flags & IOCB_DSYNC)
+		if (iocb->ki_flags & IOCB_DSYNC) {
 			dio->flags |= IOMAP_DIO_NEED_SYNC;
+			/*
+			 * We optimistically try using FUA for this IO.  Any
+			 * non-FUA write that occurs will clear this flag, hence
+			 * we know before completion whether a cache flush is
+			 * necessary.
+			 */
+			dio->flags |= IOMAP_DIO_WRITE_FUA;
+		}
 		flags |= IOMAP_WRITE;
 	}
 
@@ -1070,6 +1101,13 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (ret < 0)
 		iomap_dio_set_error(dio, ret);
 
+	/*
+	 * If all the writes we issued were FUA, we don't need to flush the
+	 * cache on IO completion. Clear the sync flag for this case.
+	 */
+	if (dio->flags & IOMAP_DIO_WRITE_FUA)
+		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
+
 	if (!atomic_dec_and_test(&dio->ref)) {
 		if (!is_sync_kiocb(iocb))
 			return -EIOCBQUEUED;
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] blk: add blk_queue_fua() helper function
  2018-04-18  4:08 ` [PATCH 3/4] blk: add blk_queue_fua() helper function Dave Chinner
@ 2018-04-18 10:34   ` Christoph Hellwig
  2018-04-18 12:18     ` Matthew Wilcox
  2018-04-18 14:23   ` Jens Axboe
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2018-04-18 10:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-block, hch, rdorr

s/blk/block/ for block patches.

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] blk: add blk_queue_fua() helper function
  2018-04-18 10:34   ` Christoph Hellwig
@ 2018-04-18 12:18     ` Matthew Wilcox
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2018-04-18 12:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-block, rdorr

On Wed, Apr 18, 2018 at 12:34:03PM +0200, Christoph Hellwig wrote:
> s/blk/block/ for block patches.

I think this is something we should put in MAINTAINERS.  Eventually
some tooling can pull it out, but I don't think this is something
that people can reasonably be expected to know.


diff --git a/MAINTAINERS b/MAINTAINERS
index 0a1410d5a621..c6358fa58fed 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -81,6 +81,7 @@ Descriptions of section entries:
 	R: Designated reviewer: FullName <address@domain>
 	   These reviewers should be CCed on patches.
 	L: Mailing list that is relevant to this area
+	J: Prefix to use when sending a patch
 	W: Web-page with status/info
 	B: URI for where to file bugs. A web-page with detailed bug
 	   filing info, a direct bug tracker link, or a mailto: URI.
@@ -2667,6 +2668,7 @@ F:	drivers/leds/leds-blinkm.c
 BLOCK LAYER
 M:	Jens Axboe <axboe@kernel.dk>
 L:	linux-block@vger.kernel.org
+J:	block
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
 S:	Maintained
 F:	block/
@@ -14927,6 +14929,7 @@ M:	Pawel Osciak <pawel@osciak.com>
 M:	Marek Szyprowski <m.szyprowski@samsung.com>
 M:	Kyungmin Park <kyungmin.park@samsung.com>
 L:	linux-media@vger.kernel.org
+J:	videobuf
 S:	Maintained
 F:	drivers/media/v4l2-core/videobuf2-*
 F:	include/media/videobuf2-*

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] blk: add blk_queue_fua() helper function
  2018-04-18  4:08 ` [PATCH 3/4] blk: add blk_queue_fua() helper function Dave Chinner
  2018-04-18 10:34   ` Christoph Hellwig
@ 2018-04-18 14:23   ` Jens Axboe
  1 sibling, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2018-04-18 14:23 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs; +Cc: linux-fsdevel, linux-block, hch, rdorr

On 4/17/18 10:08 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> So we can check FUA support status from the iomap direct IO code.

Applied, thanks.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes
  2018-04-18  4:08 ` [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes Dave Chinner
@ 2018-04-21 12:54   ` Jan Kara
  2018-04-24 17:34     ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2018-04-21 12:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-block, hch, rdorr

On Wed 18-04-18 14:08:28, Dave Chinner wrote:
> @@ -1012,8 +1035,16 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			dio->flags |= IOMAP_DIO_DIRTY;
>  	} else {
>  		dio->flags |= IOMAP_DIO_WRITE;
> -		if (iocb->ki_flags & IOCB_DSYNC)
> +		if (iocb->ki_flags & IOCB_DSYNC) {
>  			dio->flags |= IOMAP_DIO_NEED_SYNC;
> +			/*
> +			 * We optimistically try using FUA for this IO.  Any
> +			 * non-FUA write that occurs will clear this flag, hence
> +			 * we know before completion whether a cache flush is
> +			 * necessary.
> +			 */
> +			dio->flags |= IOMAP_DIO_WRITE_FUA;
> +		}

So I don't think this is quite correct. IOCB_DSYNC gets set also for O_SYNC
writes (in that case we also set IOCB_SYNC). And for those we cannot use
the FUA optimization AFAICT (definitely IOMAP_F_DIRTY isn't a safe
indicator of a need of full fsync for O_SYNC). Other than that the patch
looks good to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes
  2018-04-18  4:08 ` [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes Dave Chinner
@ 2018-04-21 13:03   ` Jan Kara
  2018-05-02  2:45     ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2018-04-21 13:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-block, hch, rdorr

On Wed 18-04-18 14:08:26, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently iomap_dio_rw() only handles (data)sync write completions
> for AIO. This means we can't optimised non-AIO IO to minimise device
> flushes as we can't tell the caller whether a flush is required or
> not.
> 
> To solve this problem and enable further optimisations, make
> iomap_dio_rw responsible for data sync behaviour for all IO, not
> just AIO.
> 
> In doing so, the sync operation is now accounted as part of the DIO
> IO by inode_dio_end(), hence post-IO data stability updates will no
> long race against operations that serialise via inode_dio_wait()
> such as truncate or hole punch.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/iomap.c        | 22 +++++++++++++++-------
>  fs/xfs/xfs_file.c |  5 -----
>  2 files changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index afd163586aa0..1f59c2d9ade6 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -685,6 +685,7 @@ EXPORT_SYMBOL_GPL(iomap_seek_data);
>   * Private flags for iomap_dio, must not overlap with the public ones in
>   * iomap.h:
>   */
> +#define IOMAP_DIO_NEED_SYNC	(1 << 29)
>  #define IOMAP_DIO_WRITE		(1 << 30)
>  #define IOMAP_DIO_DIRTY		(1 << 31)
>  
> @@ -759,6 +760,13 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  			dio_warn_stale_pagecache(iocb->ki_filp);
>  	}
>  
> +	/*
> +	 * If this is a DSYNC write, make sure we push it to stable storage now
> +	 * that we've written data.
> +	 */
> +	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> +		ret = generic_write_sync(iocb, ret);
> +
>  	inode_dio_end(file_inode(iocb->ki_filp));
>  	kfree(dio);
>  
> @@ -768,14 +776,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  static void iomap_dio_complete_work(struct work_struct *work)
>  {
>  	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
> -	struct kiocb *iocb = dio->iocb;
> -	bool is_write = (dio->flags & IOMAP_DIO_WRITE);
> -	ssize_t ret;
>  
> -	ret = iomap_dio_complete(dio);
> -	if (is_write && ret > 0)
> -		ret = generic_write_sync(iocb, ret);
> -	iocb->ki_complete(iocb, ret, 0);
> +	dio->iocb->ki_complete(dio->iocb, iomap_dio_complete(dio), 0);
>  }
>  
>  /*
> @@ -961,6 +963,10 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	return copied;
>  }
>  
> +/*
> + * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
> + * is being issued as AIO or not.
> + */
>  ssize_t
>  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops, iomap_dio_end_io_t end_io)
> @@ -1006,6 +1012,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			dio->flags |= IOMAP_DIO_DIRTY;
>  	} else {
>  		dio->flags |= IOMAP_DIO_WRITE;
> +		if (iocb->ki_flags & IOCB_DSYNC)
> +			dio->flags |= IOMAP_DIO_NEED_SYNC;
>  		flags |= IOMAP_WRITE;
>  	}
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 6f15027661b6..0c4b8313d544 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -570,11 +570,6 @@ xfs_file_dio_aio_write(
>  	 * complete fully or fail.
>  	 */
>  	ASSERT(ret < 0 || ret == count);
> -
> -	if (ret > 0) {
> -		/* Handle various SYNC-type writes */
> -		ret = generic_write_sync(iocb, ret);
> -	}
>  	return ret;
>  }
>  
> -- 
> 2.16.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes
  2018-04-21 12:54   ` Jan Kara
@ 2018-04-24 17:34     ` Christoph Hellwig
  2018-04-24 22:07       ` Holger Hoffstätte
  2018-05-02  2:34       ` Dave Chinner
  0 siblings, 2 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-04-24 17:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-block, hch, rdorr

On Sat, Apr 21, 2018 at 02:54:05PM +0200, Jan Kara wrote:
> > -		if (iocb->ki_flags & IOCB_DSYNC)
> > +		if (iocb->ki_flags & IOCB_DSYNC) {
> >  			dio->flags |= IOMAP_DIO_NEED_SYNC;
> > +			/*
> > +			 * We optimistically try using FUA for this IO.  Any
> > +			 * non-FUA write that occurs will clear this flag, hence
> > +			 * we know before completion whether a cache flush is
> > +			 * necessary.
> > +			 */
> > +			dio->flags |= IOMAP_DIO_WRITE_FUA;
> > +		}
> 
> So I don't think this is quite correct. IOCB_DSYNC gets set also for O_SYNC
> writes (in that case we also set IOCB_SYNC). And for those we cannot use
> the FUA optimization AFAICT (definitely IOMAP_F_DIRTY isn't a safe
> indicator of a need of full fsync for O_SYNC). Other than that the patch
> looks good to me.

Oops, good catch. I think the above if should just be

		if (iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC) == IOCB_DSYNC)) {

and we are fine. 

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
---end quoted text---

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes
  2018-04-24 17:34     ` Christoph Hellwig
@ 2018-04-24 22:07       ` Holger Hoffstätte
  2018-04-25  5:20         ` Christoph Hellwig
  2018-04-25 13:02         ` Jan Kara
  2018-05-02  2:34       ` Dave Chinner
  1 sibling, 2 replies; 19+ messages in thread
From: Holger Hoffstätte @ 2018-04-24 22:07 UTC (permalink / raw)
  To: Christoph Hellwig, Jan Kara
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-block, hch, rdorr

On 04/24/18 19:34, Christoph Hellwig wrote:
> On Sat, Apr 21, 2018 at 02:54:05PM +0200, Jan Kara wrote:
>>> -		if (iocb->ki_flags & IOCB_DSYNC)
>>> +		if (iocb->ki_flags & IOCB_DSYNC) {
>>>   			dio->flags |= IOMAP_DIO_NEED_SYNC;
>>> +			/*
>>> +			 * We optimistically try using FUA for this IO.  Any
>>> +			 * non-FUA write that occurs will clear this flag, hence
>>> +			 * we know before completion whether a cache flush is
>>> +			 * necessary.
>>> +			 */
>>> +			dio->flags |= IOMAP_DIO_WRITE_FUA;
>>> +		}
>>
>> So I don't think this is quite correct. IOCB_DSYNC gets set also for O_SYNC
>> writes (in that case we also set IOCB_SYNC). And for those we cannot use
>> the FUA optimization AFAICT (definitely IOMAP_F_DIRTY isn't a safe
>> indicator of a need of full fsync for O_SYNC). Other than that the patch
>> looks good to me.
> 
> Oops, good catch. I think the above if should just be
> 
> 		if (iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC) == IOCB_DSYNC)) {
> 
> and we are fine.

The above line just gives parenthesis salad errors, so why not compromise
on:

	if ((iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC)) == IOCB_DSYNC) {

Unless my bit twiddling has completely left me I think this is what was
intended, and it actually compiles too.

cheers
Holger

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes
  2018-04-24 22:07       ` Holger Hoffstätte
@ 2018-04-25  5:20         ` Christoph Hellwig
  2018-04-25 13:02         ` Jan Kara
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-04-25  5:20 UTC (permalink / raw)
  To: Holger Hoffstätte
  Cc: Christoph Hellwig, Jan Kara, Dave Chinner, linux-xfs,
	linux-fsdevel, linux-block, hch, rdorr

On Wed, Apr 25, 2018 at 12:07:07AM +0200, Holger Hoffst�tte wrote:
> The above line just gives parenthesis salad errors, so why not compromise
> on:
> 
> 	if ((iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC)) == IOCB_DSYNC) {
> 
> Unless my bit twiddling has completely left me I think this is what was
> intended, and it actually compiles too.

Yes, that is what was intended :)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes
  2018-04-24 22:07       ` Holger Hoffstätte
  2018-04-25  5:20         ` Christoph Hellwig
@ 2018-04-25 13:02         ` Jan Kara
  1 sibling, 0 replies; 19+ messages in thread
From: Jan Kara @ 2018-04-25 13:02 UTC (permalink / raw)
  To: Holger Hoffstätte
  Cc: Christoph Hellwig, Jan Kara, Dave Chinner, linux-xfs,
	linux-fsdevel, linux-block, hch, rdorr

On Wed 25-04-18 00:07:07, Holger Hoffst�tte wrote:
> On 04/24/18 19:34, Christoph Hellwig wrote:
> > On Sat, Apr 21, 2018 at 02:54:05PM +0200, Jan Kara wrote:
> > > > -		if (iocb->ki_flags & IOCB_DSYNC)
> > > > +		if (iocb->ki_flags & IOCB_DSYNC) {
> > > >   			dio->flags |= IOMAP_DIO_NEED_SYNC;
> > > > +			/*
> > > > +			 * We optimistically try using FUA for this IO.  Any
> > > > +			 * non-FUA write that occurs will clear this flag, hence
> > > > +			 * we know before completion whether a cache flush is
> > > > +			 * necessary.
> > > > +			 */
> > > > +			dio->flags |= IOMAP_DIO_WRITE_FUA;
> > > > +		}
> > > 
> > > So I don't think this is quite correct. IOCB_DSYNC gets set also for O_SYNC
> > > writes (in that case we also set IOCB_SYNC). And for those we cannot use
> > > the FUA optimization AFAICT (definitely IOMAP_F_DIRTY isn't a safe
> > > indicator of a need of full fsync for O_SYNC). Other than that the patch
> > > looks good to me.
> > 
> > Oops, good catch. I think the above if should just be
> > 
> > 		if (iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC) == IOCB_DSYNC)) {
> > 
> > and we are fine.
> 
> The above line just gives parenthesis salad errors, so why not compromise
> on:
> 
> 	if ((iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC)) == IOCB_DSYNC) {
> 
> Unless my bit twiddling has completely left me I think this is what was
> intended, and it actually compiles too.

Yup, I agree this is what needs to happen.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes
  2018-04-24 17:34     ` Christoph Hellwig
  2018-04-24 22:07       ` Holger Hoffstätte
@ 2018-05-02  2:34       ` Dave Chinner
  1 sibling, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2018-05-02  2:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-xfs, linux-fsdevel, linux-block, hch, rdorr

On Tue, Apr 24, 2018 at 10:34:44AM -0700, Christoph Hellwig wrote:
> On Sat, Apr 21, 2018 at 02:54:05PM +0200, Jan Kara wrote:
> > > -		if (iocb->ki_flags & IOCB_DSYNC)
> > > +		if (iocb->ki_flags & IOCB_DSYNC) {
> > >  			dio->flags |= IOMAP_DIO_NEED_SYNC;
> > > +			/*
> > > +			 * We optimistically try using FUA for this IO.  Any
> > > +			 * non-FUA write that occurs will clear this flag, hence
> > > +			 * we know before completion whether a cache flush is
> > > +			 * necessary.
> > > +			 */
> > > +			dio->flags |= IOMAP_DIO_WRITE_FUA;
> > > +		}
> > 
> > So I don't think this is quite correct. IOCB_DSYNC gets set also for O_SYNC
> > writes (in that case we also set IOCB_SYNC). And for those we cannot use
> > the FUA optimization AFAICT (definitely IOMAP_F_DIRTY isn't a safe
> > indicator of a need of full fsync for O_SYNC). Other than that the patch
> > looks good to me.
> 
> Oops, good catch. I think the above if should just be
> 
> 		if (iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC) == IOCB_DSYNC)) {
> 
> and we are fine. 

Ah, not exactly. IOMAP_DIO_NEED_SYNC needs to be set for either
DYSNC or SYNC writes, while IOMAP_DIO_WRITE_FUA should only be set
for DSYNC.

I'll fix this up appropriately.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes
  2018-04-21 13:03   ` Jan Kara
@ 2018-05-02  2:45     ` Dave Chinner
  2018-05-02 14:27       ` Robert Dorr
  2018-05-03 12:51       ` Jan Kara
  0 siblings, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2018-05-02  2:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-xfs, linux-fsdevel, linux-block, hch, rdorr

On Sat, Apr 21, 2018 at 03:03:09PM +0200, Jan Kara wrote:
> On Wed 18-04-18 14:08:26, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Currently iomap_dio_rw() only handles (data)sync write completions
> > for AIO. This means we can't optimised non-AIO IO to minimise device
> > flushes as we can't tell the caller whether a flush is required or
> > not.
> > 
> > To solve this problem and enable further optimisations, make
> > iomap_dio_rw responsible for data sync behaviour for all IO, not
> > just AIO.
> > 
> > In doing so, the sync operation is now accounted as part of the DIO
> > IO by inode_dio_end(), hence post-IO data stability updates will no
> > long race against operations that serialise via inode_dio_wait()
> > such as truncate or hole punch.
> > 
> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Looks good to me. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

It looks good, but it's broken in a subtle, nasty way. :/

> > @@ -768,14 +776,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >  static void iomap_dio_complete_work(struct work_struct *work)
> >  {
> >  	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
> > -	struct kiocb *iocb = dio->iocb;
> > -	bool is_write = (dio->flags & IOMAP_DIO_WRITE);
> > -	ssize_t ret;
> >  
> > -	ret = iomap_dio_complete(dio);
> > -	if (is_write && ret > 0)
> > -		ret = generic_write_sync(iocb, ret);
> > -	iocb->ki_complete(iocb, ret, 0);
> > +	dio->iocb->ki_complete(dio->iocb, iomap_dio_complete(dio), 0);

This generates a use after free from KASAN from generic/016. it
appears the compiler orders the code so that it dereferences
dio->iocb after iomap_dio_complete() has freed the dio structure
(yay!).

I'll post a new version of the patchset now that I've got changes to
2 of the 3 remaining patches in it....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes
  2018-05-02  2:45     ` Dave Chinner
@ 2018-05-02 14:27       ` Robert Dorr
  2018-05-03 13:34         ` Jan Kara
  2018-05-03 12:51       ` Jan Kara
  1 sibling, 1 reply; 19+ messages in thread
From: Robert Dorr @ 2018-05-02 14:27 UTC (permalink / raw)
  To: Dave Chinner, Jan Kara
  Cc: linux-xfs, linux-fsdevel, linux-block, hch, Slava Oks,
	Jasraj Dange, Michael Nelson, Amit Banerjee, Travis Wright,
	Bob Ward, Scott Konersmann

Thanks for your continued effort Dave.

In the current implementation the first write to the location updates the metadata and must issue the flush.   In Windows SQL Server can avoid this behavior.   SQL Server can issue DeviceIoControl with SET_FILE_VALID_DATA and then SetEndOfFile.  The SetEndOfFile acquires space and saves metadata without requiring the actual write.   This allows us to quickly create a large file and the writes do not need the added flush.

Is this something that fallocate could accommodate to avoid having to write once (triggers flush for metadata) and then secondary writes can use FUA and avoid the flush?



-----Original Message-----
From: Dave Chinner <david@fromorbit.com> 
Sent: Tuesday, May 1, 2018 9:46 PM
To: Jan Kara <jack@suse.cz>
Cc: linux-xfs@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-block@vger.kernel.org; hch@lst.de; Robert Dorr <rdorr@microsoft.com>
Subject: Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes

On Sat, Apr 21, 2018 at 03:03:09PM +0200, Jan Kara wrote:
> On Wed 18-04-18 14:08:26, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Currently iomap_dio_rw() only handles (data)sync write completions 
> > for AIO. This means we can't optimised non-AIO IO to minimise device 
> > flushes as we can't tell the caller whether a flush is required or 
> > not.
> > 
> > To solve this problem and enable further optimisations, make 
> > iomap_dio_rw responsible for data sync behaviour for all IO, not 
> > just AIO.
> > 
> > In doing so, the sync operation is now accounted as part of the DIO 
> > IO by inode_dio_end(), hence post-IO data stability updates will no 
> > long race against operations that serialise via inode_dio_wait() 
> > such as truncate or hole punch.
> > 
> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Looks good to me. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

It looks good, but it's broken in a subtle, nasty way. :/

> > @@ -768,14 +776,8 @@ static ssize_t iomap_dio_complete(struct 
> > iomap_dio *dio)  static void iomap_dio_complete_work(struct 
> > work_struct *work)  {
> >  	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
> > -	struct kiocb *iocb = dio->iocb;
> > -	bool is_write = (dio->flags & IOMAP_DIO_WRITE);
> > -	ssize_t ret;
> >  
> > -	ret = iomap_dio_complete(dio);
> > -	if (is_write && ret > 0)
> > -		ret = generic_write_sync(iocb, ret);
> > -	iocb->ki_complete(iocb, ret, 0);
> > +	dio->iocb->ki_complete(dio->iocb, iomap_dio_complete(dio), 0);

This generates a use after free from KASAN from generic/016. it appears the compiler orders the code so that it dereferences
dio->iocb after iomap_dio_complete() has freed the dio structure
(yay!).

I'll post a new version of the patchset now that I've got changes to
2 of the 3 remaining patches in it....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes
  2018-05-02  2:45     ` Dave Chinner
  2018-05-02 14:27       ` Robert Dorr
@ 2018-05-03 12:51       ` Jan Kara
  1 sibling, 0 replies; 19+ messages in thread
From: Jan Kara @ 2018-05-03 12:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-xfs, linux-fsdevel, linux-block, hch, rdorr

On Wed 02-05-18 12:45:40, Dave Chinner wrote:
> On Sat, Apr 21, 2018 at 03:03:09PM +0200, Jan Kara wrote:
> > On Wed 18-04-18 14:08:26, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Currently iomap_dio_rw() only handles (data)sync write completions
> > > for AIO. This means we can't optimised non-AIO IO to minimise device
> > > flushes as we can't tell the caller whether a flush is required or
> > > not.
> > > 
> > > To solve this problem and enable further optimisations, make
> > > iomap_dio_rw responsible for data sync behaviour for all IO, not
> > > just AIO.
> > > 
> > > In doing so, the sync operation is now accounted as part of the DIO
> > > IO by inode_dio_end(), hence post-IO data stability updates will no
> > > long race against operations that serialise via inode_dio_wait()
> > > such as truncate or hole punch.
> > > 
> > > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > 
> > Looks good to me. You can add:
> > 
> > Reviewed-by: Jan Kara <jack@suse.cz>
> 
> It looks good, but it's broken in a subtle, nasty way. :/
> 
> > > @@ -768,14 +776,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > >  static void iomap_dio_complete_work(struct work_struct *work)
> > >  {
> > >  	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
> > > -	struct kiocb *iocb = dio->iocb;
> > > -	bool is_write = (dio->flags & IOMAP_DIO_WRITE);
> > > -	ssize_t ret;
> > >  
> > > -	ret = iomap_dio_complete(dio);
> > > -	if (is_write && ret > 0)
> > > -		ret = generic_write_sync(iocb, ret);
> > > -	iocb->ki_complete(iocb, ret, 0);
> > > +	dio->iocb->ki_complete(dio->iocb, iomap_dio_complete(dio), 0);
> 
> This generates a use after free from KASAN from generic/016. it
> appears the compiler orders the code so that it dereferences
> dio->iocb after iomap_dio_complete() has freed the dio structure
> (yay!).

Yeah, very subtle but the compiler is indeed free to do this (in C the
sequence point is only the function call but the order of evaluation of
function arguments is unspecified). Thanks for catching this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes
  2018-05-02 14:27       ` Robert Dorr
@ 2018-05-03 13:34         ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2018-05-03 13:34 UTC (permalink / raw)
  To: Robert Dorr
  Cc: Dave Chinner, Jan Kara, linux-xfs, linux-fsdevel, linux-block,
	hch, Slava Oks, Jasraj Dange, Michael Nelson, Amit Banerjee,
	Travis Wright, Bob Ward, Scott Konersmann

On Wed 02-05-18 14:27:37, Robert Dorr wrote:
> In the current implementation the first write to the location updates the
> metadata and must issue the flush.   In Windows SQL Server can avoid this
> behavior.   SQL Server can issue DeviceIoControl with SET_FILE_VALID_DATA
> and then SetEndOfFile.  The SetEndOfFile acquires space and saves
> metadata without requiring the actual write.   This allows us to quickly
> create a large file and the writes do not need the added flush.
> 
> Is this something that fallocate could accommodate to avoid having to
> write once (triggers flush for metadata) and then secondary writes can
> use FUA and avoid the flush?

Well, the question then is what do you see in the file if you read those
blocks before writing them or if the system crashes before writing the
data. As you describe the feature, you'd see there just the old block
contents which is a security issue (you could see for example old
/etc/passwd contents there). That's why we've refused such feature in the
past [1].

								Honza

[1] https://www.spinics.net/lists/linux-ext4/msg31637.html

> -----Original Message-----
> From: Dave Chinner <david@fromorbit.com> 
> Sent: Tuesday, May 1, 2018 9:46 PM
> To: Jan Kara <jack@suse.cz>
> Cc: linux-xfs@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-block@vger.kernel.org; hch@lst.de; Robert Dorr <rdorr@microsoft.com>
> Subject: Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes
> 
> On Sat, Apr 21, 2018 at 03:03:09PM +0200, Jan Kara wrote:
> > On Wed 18-04-18 14:08:26, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Currently iomap_dio_rw() only handles (data)sync write completions 
> > > for AIO. This means we can't optimised non-AIO IO to minimise device 
> > > flushes as we can't tell the caller whether a flush is required or 
> > > not.
> > > 
> > > To solve this problem and enable further optimisations, make 
> > > iomap_dio_rw responsible for data sync behaviour for all IO, not 
> > > just AIO.
> > > 
> > > In doing so, the sync operation is now accounted as part of the DIO 
> > > IO by inode_dio_end(), hence post-IO data stability updates will no 
> > > long race against operations that serialise via inode_dio_wait() 
> > > such as truncate or hole punch.
> > > 
> > > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > 
> > Looks good to me. You can add:
> > 
> > Reviewed-by: Jan Kara <jack@suse.cz>
> 
> It looks good, but it's broken in a subtle, nasty way. :/
> 
> > > @@ -768,14 +776,8 @@ static ssize_t iomap_dio_complete(struct 
> > > iomap_dio *dio)  static void iomap_dio_complete_work(struct 
> > > work_struct *work)  {
> > >  	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
> > > -	struct kiocb *iocb = dio->iocb;
> > > -	bool is_write = (dio->flags & IOMAP_DIO_WRITE);
> > > -	ssize_t ret;
> > >  
> > > -	ret = iomap_dio_complete(dio);
> > > -	if (is_write && ret > 0)
> > > -		ret = generic_write_sync(iocb, ret);
> > > -	iocb->ki_complete(iocb, ret, 0);
> > > +	dio->iocb->ki_complete(dio->iocb, iomap_dio_complete(dio), 0);
> 
> This generates a use after free from KASAN from generic/016. it appears the compiler orders the code so that it dereferences
> dio->iocb after iomap_dio_complete() has freed the dio structure
> (yay!).
> 
> I'll post a new version of the patchset now that I've got changes to
> 2 of the 3 remaining patches in it....
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-05-03 13:34 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-18  4:08 [PATCH 0/4 V2] iomap: Use FUA for O_DSYNC DIO writes Dave Chinner
2018-04-18  4:08 ` [PATCH 1/4] xfs: move generic_write_sync calls inwards Dave Chinner
2018-04-18  4:08 ` [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes Dave Chinner
2018-04-21 13:03   ` Jan Kara
2018-05-02  2:45     ` Dave Chinner
2018-05-02 14:27       ` Robert Dorr
2018-05-03 13:34         ` Jan Kara
2018-05-03 12:51       ` Jan Kara
2018-04-18  4:08 ` [PATCH 3/4] blk: add blk_queue_fua() helper function Dave Chinner
2018-04-18 10:34   ` Christoph Hellwig
2018-04-18 12:18     ` Matthew Wilcox
2018-04-18 14:23   ` Jens Axboe
2018-04-18  4:08 ` [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes Dave Chinner
2018-04-21 12:54   ` Jan Kara
2018-04-24 17:34     ` Christoph Hellwig
2018-04-24 22:07       ` Holger Hoffstätte
2018-04-25  5:20         ` Christoph Hellwig
2018-04-25 13:02         ` Jan Kara
2018-05-02  2:34       ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).