linux-xfs.vger.kernel.org archive mirror
* [RFC] xfs: reduce sub-block DIO serialisation
@ 2021-01-12  1:07 Dave Chinner
  2021-01-12  1:07 ` [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Dave Chinner @ 2021-01-12  1:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, avi, andres

Hi folks,

This is the XFS implementation of the sub-block DIO optimisations
for written extents that I've mentioned on #xfs and a couple of
times now on the XFS mailing list.

It takes the approach of using the IOMAP_NOWAIT non-blocking
IO submission infrastructure to optimistically dispatch sub-block
DIO without exclusive locking. If the extent mapping callback
decides that it can't do the unaligned IO without extent
manipulation, sub-block zeroing, blocking or splitting the IO into
multiple parts, it aborts the IO with -EAGAIN. This allows the high
level filesystem code to then take exclusive locks and resubmit the
IO once it has guaranteed no other IO is in progress on the inode
(the current implementation).

This requires moving the IOMAP_NOWAIT setup decisions up into the
filesystem, adding yet another parameter to iomap_dio_rw(). So first
I convert iomap_dio_rw() to take an args structure so that we don't
have to modify the API every time we want to add another setup
parameter to the DIO submission code.

I then include Christoph's IOCB_NOWAIT fixes and cleanups to the XFS
code, because they needed to be done regardless of the unaligned DIO
issues and they make the changes simpler. Then I split the unaligned
DIO path out from the aligned path, because all the extra complexity
to support better unaligned DIO submission concurrency is not
necessary for the block aligned path. Finally, I modify the
unaligned IO path to first submit the unaligned IO using
non-blocking semantics and provide a fallback to run the IO
exclusively if that fails.

This means that we consider sub-block DIO into written extents a fast path
that should almost always succeed with minimal overhead and we put
all the overhead of failure into the slow path where exclusive
locking is required. Unlike Christoph's proposed patch, this means
we don't require an extra ILOCK cycle in the sub-block DIO setup
fast path, so it should perform almost identically to the block
aligned fast path.
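The optimistic-submit-then-fall-back flow described above can be sketched in userspace terms. All names here (the lock-mode enum, `map_for_dio()`, `submit_unaligned_dio()`) are hypothetical stand-ins, not the kernel API; the point is only the shape of the fast/slow path split:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of the optimistic sub-block DIO path. */
enum lock_mode { LOCK_NONE, LOCK_SHARED, LOCK_EXCL };

/* Extent mapping step: under nonblocking semantics it refuses any IO
 * that would need sub-block zeroing, blocking, or splitting. */
static int map_for_dio(bool nonblocking, bool needs_zeroing)
{
	if (nonblocking && needs_zeroing)
		return -EAGAIN;
	return 0;
}

/* Fast path: shared lock + nonblocking submit. Only on -EAGAIN do we
 * pay for the exclusive slow path, where blocking is allowed because
 * no other IO can be in flight on the inode. */
int submit_unaligned_dio(bool needs_zeroing, enum lock_mode *final_mode)
{
	int ret;

	*final_mode = LOCK_SHARED;
	ret = map_for_dio(true, needs_zeroing);
	if (ret != -EAGAIN)
		return ret;

	*final_mode = LOCK_EXCL;
	return map_for_dio(false, needs_zeroing);
}
```

The common case (no zeroing needed) never touches the exclusive lock, which is why the fast path should perform like the block-aligned one.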

Tested using fio with AIO+DIO randrw to a written file. Performance
increases from about 20k IOPS to 150k IOPS, which is the limit of
the setup I was using for testing. Also passed fstests auto group
on both v4 and v5 XFS filesystems.

Thoughts, comments?

-Dave.




* [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure
  2021-01-12  1:07 [RFC] xfs: reduce sub-block DIO serialisation Dave Chinner
@ 2021-01-12  1:07 ` Dave Chinner
  2021-01-12  1:22   ` Damien Le Moal
                     ` (2 more replies)
  2021-01-12  1:07 ` [PATCH 2/6] iomap: move DIO NOWAIT setup up into filesystems Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 3 replies; 24+ messages in thread
From: Dave Chinner @ 2021-01-12  1:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, avi, andres

From: Dave Chinner <dchinner@redhat.com>

Adding yet another parameter to the iomap_dio_rw() interface means
changing lots of filesystems to add the parameter. Convert this
interface to an args structure so in future we don't need to modify
every caller to add a new parameter.
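The pattern being applied here can be illustrated in plain C (names like `dio_rw_args` and `dio_rw()` are illustrative, not the iomap API): once parameters live in a struct, new ones become new fields, and designated initializers leave unset members zeroed so existing callers compile unchanged.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch of the args-structure conversion. */
struct dio_rw_args {
	int	fd;			/* stands in for the kiocb/iter */
	size_t	count;
	bool	wait_for_completion;
	/* future parameters (e.g. nonblocking) are appended here,
	 * without touching any existing caller */
};

long dio_rw(const struct dio_rw_args *args)
{
	if (!args->count)
		return 0;
	/* pretend submission succeeded and the full IO completed */
	return (long)args->count;
}
```

A caller fills in only the fields it cares about, e.g. `struct dio_rw_args args = { .fd = fd, .count = 512, .wait_for_completion = true };` and passes `&args`.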

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/btrfs/file.c       | 21 ++++++++++++++++-----
 fs/ext4/file.c        | 24 ++++++++++++++++++------
 fs/gfs2/file.c        | 19 ++++++++++++++-----
 fs/iomap/direct-io.c  | 30 ++++++++++++++----------------
 fs/xfs/xfs_file.c     | 30 +++++++++++++++++++++---------
 fs/zonefs/super.c     | 21 +++++++++++++++++----
 include/linux/iomap.h | 16 ++++++++++------
 7 files changed, 110 insertions(+), 51 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0e41459b8de6..a49d9fa918d1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1907,6 +1907,13 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t err;
 	unsigned int ilock_flags = 0;
 	struct iomap_dio *dio = NULL;
+	struct iomap_dio_rw_args args = {
+		.iocb			= iocb,
+		.iter			= from,
+		.ops			= &btrfs_dio_iomap_ops,
+		.dops			= &btrfs_dio_ops,
+		.wait_for_completion	= is_sync_kiocb(iocb),
+	};
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -1949,9 +1956,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		goto buffered;
 	}
 
-	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops,
-			     &btrfs_dio_ops, is_sync_kiocb(iocb));
-
+	dio = __iomap_dio_rw(&args);
 	btrfs_inode_unlock(inode, ilock_flags);
 
 	if (IS_ERR_OR_NULL(dio)) {
@@ -3617,13 +3622,19 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
+	struct iomap_dio_rw_args args = {
+		.iocb			= iocb,
+		.iter			= to,
+		.ops			= &btrfs_dio_iomap_ops,
+		.dops			= &btrfs_dio_ops,
+		.wait_for_completion	= is_sync_kiocb(iocb),
+	};
 
 	if (check_direct_read(btrfs_sb(inode->i_sb), to, iocb->ki_pos))
 		return 0;
 
 	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
-	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			   is_sync_kiocb(iocb));
+	ret = iomap_dio_rw(&args);
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 	return ret;
 }
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3ed8c048fb12..436508be6d88 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -53,6 +53,12 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t ret;
 	struct inode *inode = file_inode(iocb->ki_filp);
+	struct iomap_dio_rw_args args = {
+		.iocb			= iocb,
+		.iter			= to,
+		.ops			= &ext4_iomap_ops,
+		.wait_for_completion	= is_sync_kiocb(iocb),
+	};
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		if (!inode_trylock_shared(inode))
@@ -74,8 +80,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return generic_file_read_iter(iocb, to);
 	}
 
-	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
-			   is_sync_kiocb(iocb));
+	ret = iomap_dio_rw(&args);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -459,9 +464,15 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	loff_t offset = iocb->ki_pos;
 	size_t count = iov_iter_count(from);
-	const struct iomap_ops *iomap_ops = &ext4_iomap_ops;
 	bool extend = false, unaligned_io = false;
 	bool ilock_shared = true;
+	struct iomap_dio_rw_args args = {
+		.iocb			= iocb,
+		.iter			= from,
+		.ops			= &ext4_iomap_ops,
+		.dops			= &ext4_dio_write_ops,
+		.wait_for_completion	= is_sync_kiocb(iocb),
+	};
 
 	/*
 	 * We initially start with shared inode lock unless it is
@@ -548,9 +559,10 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	if (ilock_shared)
-		iomap_ops = &ext4_iomap_overwrite_ops;
-	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
-			   is_sync_kiocb(iocb) || unaligned_io || extend);
+		args.ops = &ext4_iomap_overwrite_ops;
+	if (unaligned_io || extend)
+		args.wait_for_completion = true;
+	ret = iomap_dio_rw(&args);
 	if (ret == -ENOTBLK)
 		ret = 0;
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index b39b339feddc..d44a5f9c5f34 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -788,6 +788,12 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
 	size_t count = iov_iter_count(to);
 	ssize_t ret;
+	struct iomap_dio_rw_args args = {
+		.iocb			= iocb,
+		.iter			= to,
+		.ops			= &gfs2_iomap_ops,
+		.wait_for_completion	= is_sync_kiocb(iocb),
+	};
 
 	if (!count)
 		return 0; /* skip atime */
@@ -797,9 +803,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	if (ret)
 		goto out_uninit;
 
-	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
-			   is_sync_kiocb(iocb));
-
+	ret = iomap_dio_rw(&args);
 	gfs2_glock_dq(gh);
 out_uninit:
 	gfs2_holder_uninit(gh);
@@ -815,6 +819,12 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	size_t len = iov_iter_count(from);
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
+	struct iomap_dio_rw_args args = {
+		.iocb			= iocb,
+		.iter			= from,
+		.ops			= &gfs2_iomap_ops,
+		.wait_for_completion	= is_sync_kiocb(iocb),
+	};
 
 	/*
 	 * Deferred lock, even if its a write, since we do no allocation on
@@ -833,8 +843,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
-	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
-			   is_sync_kiocb(iocb));
+	ret = iomap_dio_rw(&args);
 	if (ret == -ENOTBLK)
 		ret = 0;
 out:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 933f234d5bec..05cacc27578c 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -418,13 +418,13 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
  * writes.  The callers needs to fall back to buffered I/O in this case.
  */
 struct iomap_dio *
-__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
-		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		bool wait_for_completion)
+__iomap_dio_rw(struct iomap_dio_rw_args *args)
 {
+	struct kiocb *iocb = args->iocb;
+	struct iov_iter *iter = args->iter;
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = file_inode(iocb->ki_filp);
-	size_t count = iov_iter_count(iter);
+	size_t count = iov_iter_count(args->iter);
 	loff_t pos = iocb->ki_pos;
 	loff_t end = iocb->ki_pos + count - 1, ret = 0;
 	unsigned int flags = IOMAP_DIRECT;
@@ -434,7 +434,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (!count)
 		return NULL;
 
-	if (WARN_ON(is_sync_kiocb(iocb) && !wait_for_completion))
+	if (WARN_ON(is_sync_kiocb(iocb) && !args->wait_for_completion))
 		return ERR_PTR(-EIO);
 
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
@@ -445,7 +445,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	atomic_set(&dio->ref, 1);
 	dio->size = 0;
 	dio->i_size = i_size_read(inode);
-	dio->dops = dops;
+	dio->dops = args->dops;
 	dio->error = 0;
 	dio->flags = 0;
 
@@ -490,7 +490,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (ret)
 		goto out_free_dio;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iov_iter_rw(args->iter) == WRITE) {
 		/*
 		 * Try to invalidate cache pages for the range we are writing.
 		 * If this invalidation fails, let the caller fall back to
@@ -503,7 +503,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			goto out_free_dio;
 		}
 
-		if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
+		if (!args->wait_for_completion && !inode->i_sb->s_dio_done_wq) {
 			ret = sb_init_dio_done_wq(inode->i_sb);
 			if (ret < 0)
 				goto out_free_dio;
@@ -514,12 +514,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	blk_start_plug(&plug);
 	do {
-		ret = iomap_apply(inode, pos, count, flags, ops, dio,
+		ret = iomap_apply(inode, pos, count, flags, args->ops, dio,
 				iomap_dio_actor);
 		if (ret <= 0) {
 			/* magic error code to fall back to buffered I/O */
 			if (ret == -ENOTBLK) {
-				wait_for_completion = true;
+				args->wait_for_completion = true;
 				ret = 0;
 			}
 			break;
@@ -566,9 +566,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	 *	of the final reference, and we will complete and free it here
 	 *	after we got woken by the I/O completion handler.
 	 */
-	dio->wait_for_completion = wait_for_completion;
+	dio->wait_for_completion = args->wait_for_completion;
 	if (!atomic_dec_and_test(&dio->ref)) {
-		if (!wait_for_completion)
+		if (!args->wait_for_completion)
 			return ERR_PTR(-EIOCBQUEUED);
 
 		for (;;) {
@@ -596,13 +596,11 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 EXPORT_SYMBOL_GPL(__iomap_dio_rw);
 
 ssize_t
-iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
-		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		bool wait_for_completion)
+iomap_dio_rw(struct iomap_dio_rw_args *args)
 {
 	struct iomap_dio *dio;
 
-	dio = __iomap_dio_rw(iocb, iter, ops, dops, wait_for_completion);
+	dio = __iomap_dio_rw(args);
 	if (IS_ERR_OR_NULL(dio))
 		return PTR_ERR_OR_ZERO(dio);
 	return iomap_dio_complete(dio);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b0f93f73837..29f4204e551f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -205,6 +205,12 @@ xfs_file_dio_aio_read(
 	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
 	size_t			count = iov_iter_count(to);
 	ssize_t			ret;
+	struct iomap_dio_rw_args args = {
+		.iocb			= iocb,
+		.iter			= to,
+		.ops			= &xfs_read_iomap_ops,
+		.wait_for_completion	= is_sync_kiocb(iocb),
+	};
 
 	trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
 
@@ -219,8 +225,7 @@ xfs_file_dio_aio_read(
 	} else {
 		xfs_ilock(ip, XFS_IOLOCK_SHARED);
 	}
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL,
-			is_sync_kiocb(iocb));
+	ret = iomap_dio_rw(&args);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -519,6 +524,13 @@ xfs_file_dio_aio_write(
 	int			iolock;
 	size_t			count = iov_iter_count(from);
 	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
+	struct iomap_dio_rw_args args = {
+		.iocb			= iocb,
+		.iter			= from,
+		.ops			= &xfs_direct_write_iomap_ops,
+		.dops			= &xfs_dio_write_ops,
+		.wait_for_completion	= is_sync_kiocb(iocb),
+	};
 
 	/* DIO must be aligned to device logical sector size */
 	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
@@ -535,6 +547,12 @@ xfs_file_dio_aio_write(
 	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
 		unaligned_io = 1;
 
+		/*
+		 * This must be the only IO in-flight. Wait on it before we
+		 * release the iolock to prevent subsequent overlapping IO.
+		 */
+		args.wait_for_completion = true;
+
 		/*
 		 * We can't properly handle unaligned direct I/O to reflink
 		 * files yet, as we can't unshare a partial block.
@@ -578,13 +596,7 @@ xfs_file_dio_aio_write(
 	}
 
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
-	/*
-	 * If unaligned, this is the only IO in-flight. Wait on it before we
-	 * release the iolock to prevent subsequent overlapping IO.
-	 */
-	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
-			   &xfs_dio_write_ops,
-			   is_sync_kiocb(iocb) || unaligned_io);
+	ret = iomap_dio_rw(&args);
 out:
 	xfs_iunlock(ip, iolock);
 
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index bec47f2d074b..edf353ad1edc 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -735,6 +735,13 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	bool append = false;
 	size_t count;
 	ssize_t ret;
+	struct iomap_dio_rw_args args = {
+		.iocb			= iocb,
+		.iter			= from,
+		.ops			= &zonefs_iomap_ops,
+		.dops			= &zonefs_write_dio_ops,
+		.wait_for_completion	= sync,
+	};
 
 	/*
 	 * For async direct IOs to sequential zone files, refuse IOCB_NOWAIT
@@ -779,8 +786,8 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	if (append)
 		ret = zonefs_file_dio_append(iocb, from);
 	else
-		ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
-				   &zonefs_write_dio_ops, sync);
+		ret = iomap_dio_rw(&args);
+
 	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
 	    (ret > 0 || ret == -EIOCBQUEUED)) {
 		if (ret > 0)
@@ -909,6 +916,13 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	mutex_unlock(&zi->i_truncate_mutex);
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
+		struct iomap_dio_rw_args args = {
+			.iocb			= iocb,
+			.iter			= to,
+			.ops			= &zonefs_iomap_ops,
+			.dops			= &zonefs_read_dio_ops,
+			.wait_for_completion	= is_sync_kiocb(iocb),
+		};
 		size_t count = iov_iter_count(to);
 
 		if ((iocb->ki_pos | count) & (sb->s_blocksize - 1)) {
@@ -916,8 +930,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			goto inode_unlock;
 		}
 		file_accessed(iocb->ki_filp);
-		ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
-				   &zonefs_read_dio_ops, is_sync_kiocb(iocb));
+		ret = iomap_dio_rw(&args);
 	} else {
 		ret = generic_file_read_iter(iocb, to);
 		if (ret == -EIO)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5bd3cac4df9c..16d20c01b5bb 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -256,12 +256,16 @@ struct iomap_dio_ops {
 			struct bio *bio, loff_t file_offset);
 };
 
-ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
-		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		bool wait_for_completion);
-struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
-		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		bool wait_for_completion);
+struct iomap_dio_rw_args {
+	struct kiocb		*iocb;
+	struct iov_iter		*iter;
+	const struct iomap_ops	*ops;
+	const struct iomap_dio_ops *dops;
+	bool			wait_for_completion;
+};
+
+ssize_t iomap_dio_rw(struct iomap_dio_rw_args *args);
+struct iomap_dio *__iomap_dio_rw(struct iomap_dio_rw_args *args);
 ssize_t iomap_dio_complete(struct iomap_dio *dio);
 int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
-- 
2.28.0



* [PATCH 2/6] iomap: move DIO NOWAIT setup up into filesystems
  2021-01-12  1:07 [RFC] xfs: reduce sub-block DIO serialisation Dave Chinner
  2021-01-12  1:07 ` [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure Dave Chinner
@ 2021-01-12  1:07 ` Dave Chinner
  2021-01-12  1:07 ` [PATCH 3/6] xfs: factor out a xfs_ilock_iocb helper Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2021-01-12  1:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, avi, andres

From: Dave Chinner <dchinner@redhat.com>

Add a field to iomap_dio_rw_args to allow callers to specify
whether nonblocking (NOWAIT) submission semantics should be used by
the DIO. This allows filesystems to add their own non-blocking
constraints to DIO on top of the user-specified constraints held in
the iocb.
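The decision this moves into the filesystem can be sketched as follows; the flag value and helper name are hypothetical (the real IOCB_NOWAIT lives in include/linux/fs.h):

```c
#include <stdbool.h>

/* Illustrative flag value, not the kernel's. */
#define MY_IOCB_NOWAIT	(1u << 7)

/* Hypothetical: the filesystem derives the submission mode from the
 * user's iocb flags and may force nonblocking semantics on top of
 * them (e.g. for the optimistic unaligned DIO fast path). */
bool dio_nonblocking(unsigned int ki_flags, bool fs_forces_nowait)
{
	return (ki_flags & MY_IOCB_NOWAIT) || fs_forces_nowait;
}
```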

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/btrfs/file.c       | 4 +++-
 fs/ext4/file.c        | 5 +++--
 fs/gfs2/file.c        | 2 ++
 fs/iomap/direct-io.c  | 2 +-
 fs/xfs/xfs_file.c     | 2 ++
 fs/zonefs/super.c     | 7 ++++---
 include/linux/iomap.h | 3 +++
 7 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a49d9fa918d1..2e7c3b7b70fe 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1913,9 +1913,10 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		.ops			= &btrfs_dio_iomap_ops,
 		.dops			= &btrfs_dio_ops,
 		.wait_for_completion	= is_sync_kiocb(iocb),
+		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
 	};
 
-	if (iocb->ki_flags & IOCB_NOWAIT)
+	if (args.nonblocking)
 		ilock_flags |= BTRFS_ILOCK_TRY;
 
 	/* If the write DIO is within EOF, use a shared lock */
@@ -3628,6 +3629,7 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 		.ops			= &btrfs_dio_iomap_ops,
 		.dops			= &btrfs_dio_ops,
 		.wait_for_completion	= is_sync_kiocb(iocb),
+		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
 	};
 
 	if (check_direct_read(btrfs_sb(inode->i_sb), to, iocb->ki_pos))
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 436508be6d88..0ce5c4cae172 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -472,6 +472,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		.ops			= &ext4_iomap_ops,
 		.dops			= &ext4_dio_write_ops,
 		.wait_for_completion	= is_sync_kiocb(iocb),
+		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
 	};
 
 	/*
@@ -490,7 +491,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (offset + count > i_size_read(inode))
 		ilock_shared = false;
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
+	if (args.nonblocking) {
 		if (ilock_shared) {
 			if (!inode_trylock_shared(inode))
 				return -EAGAIN;
@@ -519,7 +520,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		return ret;
 
 	/* if we're going to block and IOCB_NOWAIT is set, return -EAGAIN */
-	if ((iocb->ki_flags & IOCB_NOWAIT) && (unaligned_io || extend)) {
+	if (args.nonblocking && (unaligned_io || extend)) {
 		ret = -EAGAIN;
 		goto out;
 	}
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index d44a5f9c5f34..ead246202144 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -793,6 +793,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 		.iter			= to,
 		.ops			= &gfs2_iomap_ops,
 		.wait_for_completion	= is_sync_kiocb(iocb),
+		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
 	};
 
 	if (!count)
@@ -824,6 +825,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 		.iter			= from,
 		.ops			= &gfs2_iomap_ops,
 		.wait_for_completion	= is_sync_kiocb(iocb),
+		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
 	};
 
 	/*
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 05cacc27578c..c0dd2db1253b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -478,7 +478,7 @@ __iomap_dio_rw(struct iomap_dio_rw_args *args)
 			dio->flags |= IOMAP_DIO_WRITE_FUA;
 	}
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
+	if (args->nonblocking) {
 		if (filemap_range_has_page(mapping, pos, end)) {
 			ret = -EAGAIN;
 			goto out_free_dio;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 29f4204e551f..3ced2746db4d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -210,6 +210,7 @@ xfs_file_dio_aio_read(
 		.iter			= to,
 		.ops			= &xfs_read_iomap_ops,
 		.wait_for_completion	= is_sync_kiocb(iocb),
+		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
 	};
 
 	trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
@@ -530,6 +531,7 @@ xfs_file_dio_aio_write(
 		.ops			= &xfs_direct_write_iomap_ops,
 		.dops			= &xfs_dio_write_ops,
 		.wait_for_completion	= is_sync_kiocb(iocb),
+		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
 	};
 
 	/* DIO must be aligned to device logical sector size */
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index edf353ad1edc..486ff4872077 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -741,6 +741,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 		.ops			= &zonefs_iomap_ops,
 		.dops			= &zonefs_write_dio_ops,
 		.wait_for_completion	= sync,
+		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
 	};
 
 	/*
@@ -748,11 +749,10 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	 * as this can cause write reordering (e.g. the first aio gets EAGAIN
 	 * on the inode lock but the second goes through but is now unaligned).
 	 */
-	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !sync &&
-	    (iocb->ki_flags & IOCB_NOWAIT))
+	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !sync && args.nonblocking)
 		return -EOPNOTSUPP;
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
+	if (args.nonblocking) {
 		if (!inode_trylock(inode))
 			return -EAGAIN;
 	} else {
@@ -922,6 +922,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			.ops			= &zonefs_iomap_ops,
 			.dops			= &zonefs_read_dio_ops,
 			.wait_for_completion	= is_sync_kiocb(iocb),
+			.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
 		};
 		size_t count = iov_iter_count(to);
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 16d20c01b5bb..3f85fc33a4c9 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -261,7 +261,10 @@ struct iomap_dio_rw_args {
 	struct iov_iter		*iter;
 	const struct iomap_ops	*ops;
 	const struct iomap_dio_ops *dops;
+	/* wait for completion of submitted IO if true */
 	bool			wait_for_completion;
+	/* use non-blocking IO submission semantics if true */
+	bool			nonblocking;
 };
 
 ssize_t iomap_dio_rw(struct iomap_dio_rw_args *args);
-- 
2.28.0



* [PATCH 3/6] xfs: factor out a xfs_ilock_iocb helper
  2021-01-12  1:07 [RFC] xfs: reduce sub-block DIO serialisation Dave Chinner
  2021-01-12  1:07 ` [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure Dave Chinner
  2021-01-12  1:07 ` [PATCH 2/6] iomap: move DIO NOWAIT setup up into filesystems Dave Chinner
@ 2021-01-12  1:07 ` Dave Chinner
  2021-01-12  1:07 ` [PATCH 4/6] xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2021-01-12  1:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, avi, andres

From: Christoph Hellwig <hch@lst.de>

Add a helper to factor out the nowait locking logic in the read/write
helpers.
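The shape of the factored-out helper is: trylock and return -EAGAIN under NOWAIT, block otherwise. A minimal userspace sketch (the lock primitives here are simulated stand-ins, not the XFS ones):

```c
#include <errno.h>
#include <stdbool.h>

/* Test knob standing in for lock contention. */
bool trylock_ok;

static bool sim_ilock_nowait(void)	{ return trylock_ok; }
static void sim_ilock(void)		{ /* would block until acquired */ }

/* Hypothetical analogue of xfs_ilock_iocb(): one place encodes the
 * "trylock or fail fast" vs "block" decision for all callers. */
int ilock_iocb(bool iocb_nowait)
{
	if (iocb_nowait) {
		if (!sim_ilock_nowait())
			return -EAGAIN;
	} else {
		sim_ilock();
	}
	return 0;
}
```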

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 55 +++++++++++++++++++++++++----------------------
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3ced2746db4d..4eb4555516e4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -197,6 +197,23 @@ xfs_file_fsync(
 	return error;
 }
 
+static int
+xfs_ilock_iocb(
+	struct kiocb		*iocb,
+	unsigned int		lock_mode)
+{
+	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!xfs_ilock_nowait(ip, lock_mode))
+			return -EAGAIN;
+	} else {
+		xfs_ilock(ip, lock_mode);
+	}
+
+	return 0;
+}
+
 STATIC ssize_t
 xfs_file_dio_aio_read(
 	struct kiocb		*iocb,
@@ -220,12 +237,9 @@ xfs_file_dio_aio_read(
 
 	file_accessed(iocb->ki_filp);
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
-			return -EAGAIN;
-	} else {
-		xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	}
+	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
+	if (ret)
+		return ret;
 	ret = iomap_dio_rw(&args);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
@@ -246,13 +260,9 @@ xfs_file_dax_read(
 	if (!count)
 		return 0; /* skip atime */
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
-			return -EAGAIN;
-	} else {
-		xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	}
-
+	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
+	if (ret)
+		return ret;
 	ret = dax_iomap_rw(iocb, to, &xfs_read_iomap_ops);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
@@ -270,12 +280,9 @@ xfs_file_buffered_aio_read(
 
 	trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
-			return -EAGAIN;
-	} else {
-		xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	}
+	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
+	if (ret)
+		return ret;
 	ret = generic_file_read_iter(iocb, to);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
@@ -622,13 +629,9 @@ xfs_file_dax_write(
 	size_t			count;
 	loff_t			pos;
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!xfs_ilock_nowait(ip, iolock))
-			return -EAGAIN;
-	} else {
-		xfs_ilock(ip, iolock);
-	}
-
+	ret = xfs_ilock_iocb(iocb, iolock);
+	if (ret)
+		return ret;
 	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
 	if (ret)
 		goto out;
-- 
2.28.0



* [PATCH 4/6] xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware
  2021-01-12  1:07 [RFC] xfs: reduce sub-block DIO serialisation Dave Chinner
                   ` (2 preceding siblings ...)
  2021-01-12  1:07 ` [PATCH 3/6] xfs: factor out a xfs_ilock_iocb helper Dave Chinner
@ 2021-01-12  1:07 ` Dave Chinner
  2021-01-12  1:07 ` [PATCH 5/6] xfs: split unaligned DIO write code out Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2021-01-12  1:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, avi, andres

From: Christoph Hellwig <hch@lst.de>

Ensure we don't block on the iolock, or waiting for I/O in
xfs_file_aio_write_checks if the caller asked to avoid that.
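The rule the patch enforces can be reduced to this sketch (the function and its parameters are illustrative): any step in the write checks that would have to wait returns -EAGAIN when NOWAIT was requested, instead of silently blocking.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of NOWAIT-aware write checks. */
int write_checks(bool nowait, bool beyond_eof)
{
	if (beyond_eof) {
		if (nowait)
			return -EAGAIN;	/* would need to drain IO and
					 * zero up to ki_pos: punt back
					 * to the caller */
		/* blocking path: drain in-flight DIO, zero the range
		 * between the old EOF and ki_pos, then proceed */
	}
	return 0;
}
```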

Fixes: 29a5d29ec181 ("xfs: nowait aio support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 4eb4555516e4..512833ce1d41 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -341,7 +341,14 @@ xfs_file_aio_write_checks(
 	if (error <= 0)
 		return error;
 
-	error = xfs_break_layouts(inode, iolock, BREAK_WRITE);
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		error = break_layout(inode, false);
+		if (error == -EWOULDBLOCK)
+			error = -EAGAIN;
+	} else {
+		error = xfs_break_layouts(inode, iolock, BREAK_WRITE);
+	}
+
 	if (error)
 		return error;
 
@@ -352,7 +359,11 @@ xfs_file_aio_write_checks(
 	if (*iolock == XFS_IOLOCK_SHARED && !IS_NOSEC(inode)) {
 		xfs_iunlock(ip, *iolock);
 		*iolock = XFS_IOLOCK_EXCL;
-		xfs_ilock(ip, *iolock);
+		error = xfs_ilock_iocb(iocb, *iolock);
+		if (error) {
+			*iolock = 0;
+			return error;
+		}
 		goto restart;
 	}
 	/*
@@ -374,6 +385,10 @@ xfs_file_aio_write_checks(
 	isize = i_size_read(inode);
 	if (iocb->ki_pos > isize) {
 		spin_unlock(&ip->i_flags_lock);
+
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+
 		if (!drained_dio) {
 			if (*iolock == XFS_IOLOCK_SHARED) {
 				xfs_iunlock(ip, *iolock);
@@ -607,7 +622,8 @@ xfs_file_dio_aio_write(
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
 	ret = iomap_dio_rw(&args);
 out:
-	xfs_iunlock(ip, iolock);
+	if (iolock)
+		xfs_iunlock(ip, iolock);
 
 	/*
 	 * No fallback to buffered IO after short writes for XFS, direct I/O
@@ -646,7 +662,8 @@ xfs_file_dax_write(
 		error = xfs_setfilesize(ip, pos, ret);
 	}
 out:
-	xfs_iunlock(ip, iolock);
+	if (iolock)
+		xfs_iunlock(ip, iolock);
 	if (error)
 		return error;
 
-- 
2.28.0



* [PATCH 5/6] xfs: split unaligned DIO write code out
  2021-01-12  1:07 [RFC] xfs: reduce sub-block DIO serialisation Dave Chinner
                   ` (3 preceding siblings ...)
  2021-01-12  1:07 ` [PATCH 4/6] xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware Dave Chinner
@ 2021-01-12  1:07 ` Dave Chinner
  2021-01-12 10:37   ` Christoph Hellwig
  2021-01-12  1:07 ` [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2021-01-12  1:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, avi, andres

From: Dave Chinner <dchinner@redhat.com>

The unaligned DIO write path is more convoluted than the normal path,
and we are about to make it more complex. Keep the block aligned
fast path dio write code trim and simple by splitting out the
unaligned DIO code from it.
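After the split, dispatch reduces to a pure alignment test, which the sketch below models (names are hypothetical; the real check uses the mount's m_blockmask against ki_pos and the iov length):

```c
#include <stddef.h>

enum { PATH_ALIGNED = 1, PATH_UNALIGNED = 2 };

/* Hypothetical dispatcher: each write path is now self-contained, and
 * a thin wrapper picks one based purely on block alignment of the
 * offset and length. blockmask is blocksize - 1. */
int dio_write_path(long long pos, size_t count, unsigned int blockmask)
{
	if (((unsigned long long)pos | count) & blockmask)
		return PATH_UNALIGNED;
	return PATH_ALIGNED;
}
```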

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 177 +++++++++++++++++++++++++++++-----------------
 1 file changed, 113 insertions(+), 64 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 512833ce1d41..bba33be17eff 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -508,7 +508,7 @@ static const struct iomap_dio_ops xfs_dio_write_ops = {
 };
 
 /*
- * xfs_file_dio_aio_write - handle direct IO writes
+ * Handle block aligned direct IO writes
  *
  * Lock the inode appropriately to prepare for and issue a direct IO write.
  * By separating it from the buffered write path we remove all the tricky to
@@ -518,35 +518,88 @@ static const struct iomap_dio_ops xfs_dio_write_ops = {
  * until we're sure the bytes at the new EOF have been zeroed and/or the cached
  * pages are flushed out.
  *
- * In most cases the direct IO writes will be done holding IOLOCK_SHARED
+ * Returns with locks held indicated by @iolock and errors indicated by
+ * negative return values.
+ */
+STATIC ssize_t
+xfs_file_dio_write_aligned(
+	struct xfs_inode	*ip,
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	int			iolock = XFS_IOLOCK_SHARED;
+	size_t			count;
+	ssize_t			ret;
+	struct iomap_dio_rw_args args = {
+		.iocb			= iocb,
+		.iter			= from,
+		.ops			= &xfs_direct_write_iomap_ops,
+		.dops			= &xfs_dio_write_ops,
+		.wait_for_completion	= is_sync_kiocb(iocb),
+		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
+	};
+
+	ret = xfs_ilock_iocb(iocb, iolock);
+	if (ret)
+		return ret;
+	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
+	if (ret)
+		goto out;
+	count = iov_iter_count(from);
+
+	/*
+	 * We don't need to hold the IOLOCK exclusively across the IO, so demote
+	 * the iolock back to shared if we had to take the exclusive lock in
+	 * xfs_file_aio_write_checks() for other reasons.
+	 */
+	if (iolock == XFS_IOLOCK_EXCL) {
+		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
+		iolock = XFS_IOLOCK_SHARED;
+	}
+
+	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
+	ret = iomap_dio_rw(&args);
+out:
+	if (iolock)
+		xfs_iunlock(ip, iolock);
+
+	/*
+	 * No fallback to buffered IO after short writes for XFS, direct I/O
+	 * will either complete fully or return an error.
+	 */
+	ASSERT(ret < 0 || ret == count);
+	return ret;
+}
+
+/*
+ * Handle block unaligned direct IO writes
+ *
+ * In most cases direct IO writes will be done holding IOLOCK_SHARED
  * allowing them to be done in parallel with reads and other direct IO writes.
  * However, if the IO is not aligned to filesystem blocks, the direct IO layer
- * needs to do sub-block zeroing and that requires serialisation against other
+ * may need to do sub-block zeroing and that requires serialisation against other
  * direct IOs to the same block. In this case we need to serialise the
  * submission of the unaligned IOs so that we don't get racing block zeroing in
- * the dio layer.  To avoid the problem with aio, we also need to wait for
+ * the dio layer.
+ *
+ * To provide the same serialisation for AIO, we also need to wait for
  * outstanding IOs to complete so that unwritten extent conversion is completed
  * before we try to map the overlapping block. This is currently implemented by
  * hitting it with a big hammer (i.e. inode_dio_wait()).
  *
- * Returns with locks held indicated by @iolock and errors indicated by
- * negative return values.
+ * This means that unaligned dio writes always block. There is no "nowait" fast
+ * path in this code - if IOCB_NOWAIT is set we simply return -EAGAIN up front
+ * and we don't have to worry about that anymore.
  */
-STATIC ssize_t
-xfs_file_dio_aio_write(
+static ssize_t
+xfs_file_dio_write_unaligned(
+	struct xfs_inode	*ip,
 	struct kiocb		*iocb,
 	struct iov_iter		*from)
 {
-	struct file		*file = iocb->ki_filp;
-	struct address_space	*mapping = file->f_mapping;
-	struct inode		*inode = mapping->host;
-	struct xfs_inode	*ip = XFS_I(inode);
-	struct xfs_mount	*mp = ip->i_mount;
-	ssize_t			ret = 0;
-	int			unaligned_io = 0;
-	int			iolock;
-	size_t			count = iov_iter_count(from);
-	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
+	int			iolock = XFS_IOLOCK_EXCL;
+	size_t			count = iov_iter_count(from);
+	ssize_t			ret;
 	struct iomap_dio_rw_args args = {
 		.iocb			= iocb,
 		.iter			= from,
@@ -556,49 +609,25 @@ xfs_file_dio_aio_write(
 		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
 	};
 
-	/* DIO must be aligned to device logical sector size */
-	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
-		return -EINVAL;
-
 	/*
-	 * Don't take the exclusive iolock here unless the I/O is unaligned to
-	 * the file system block size.  We don't need to consider the EOF
-	 * extension case here because xfs_file_aio_write_checks() will relock
-	 * the inode as necessary for EOF zeroing cases and fill out the new
-	 * inode size as appropriate.
+	 * This must be the only IO in-flight. Wait on it before we
+	 * release the iolock to prevent subsequent overlapping IO.
 	 */
-	if ((iocb->ki_pos & mp->m_blockmask) ||
-	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
-		unaligned_io = 1;
-
-		/*
-		 * This must be the only IO in-flight. Wait on it before we
-		 * release the iolock to prevent subsequent overlapping IO.
-		 */
-		args.wait_for_completion = true;
+	args.wait_for_completion = true;
 
-		/*
-		 * We can't properly handle unaligned direct I/O to reflink
-		 * files yet, as we can't unshare a partial block.
-		 */
-		if (xfs_is_cow_inode(ip)) {
-			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count);
-			return -ENOTBLK;
-		}
-		iolock = XFS_IOLOCK_EXCL;
-	} else {
-		iolock = XFS_IOLOCK_SHARED;
+	/*
+	 * We can't properly handle unaligned direct I/O to reflink
+	 * files yet, as we can't unshare a partial block.
+	 */
+	if (xfs_is_cow_inode(ip)) {
+		trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count);
+		return -ENOTBLK;
 	}
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		/* unaligned dio always waits, bail */
-		if (unaligned_io)
-			return -EAGAIN;
-		if (!xfs_ilock_nowait(ip, iolock))
-			return -EAGAIN;
-	} else {
-		xfs_ilock(ip, iolock);
-	}
+	/* unaligned dio always waits, bail */
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		return -EAGAIN;
+	xfs_ilock(ip, iolock);
 
 	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
 	if (ret)
@@ -612,13 +641,7 @@ xfs_file_dio_aio_write(
 	 * iolock if we had to take the exclusive lock in
 	 * xfs_file_aio_write_checks() for other reasons.
 	 */
-	if (unaligned_io) {
-		inode_dio_wait(inode);
-	} else if (iolock == XFS_IOLOCK_EXCL) {
-		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
-		iolock = XFS_IOLOCK_SHARED;
-	}
-
+	inode_dio_wait(VFS_I(ip));
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
 	ret = iomap_dio_rw(&args);
 out:
@@ -633,6 +656,32 @@ xfs_file_dio_aio_write(
 	return ret;
 }
 
+static ssize_t
+xfs_file_dio_write(
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
+	size_t			count = iov_iter_count(from);
+
+	/* DIO must be aligned to device logical sector size */
+	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
+		return -EINVAL;
+
+	/*
+	 * Don't take the exclusive iolock here unless the I/O is unaligned to
+	 * the file system block size.  We don't need to consider the EOF
+	 * extension case here because xfs_file_aio_write_checks() will relock
+	 * the inode as necessary for EOF zeroing cases and fill out the new
+	 * inode size as appropriate.
+	 */
+	if ((iocb->ki_pos | count) & mp->m_blockmask)
+		return xfs_file_dio_write_unaligned(ip, iocb, from);
+	return xfs_file_dio_write_aligned(ip, iocb, from);
+}
+
 static noinline ssize_t
 xfs_file_dax_write(
 	struct kiocb		*iocb,
@@ -783,7 +832,7 @@ xfs_file_write_iter(
 		 * CoW.  In all other directio scenarios we do not
 		 * allow an operation to fall back to buffered mode.
 		 */
-		ret = xfs_file_dio_aio_write(iocb, from);
+		ret = xfs_file_dio_write(iocb, from);
 		if (ret != -ENOTBLK)
 			return ret;
 	}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio
  2021-01-12  1:07 [RFC] xfs: reduce sub-block DIO serialisation Dave Chinner
                   ` (4 preceding siblings ...)
  2021-01-12  1:07 ` [PATCH 5/6] xfs: split unaligned DIO write code out Dave Chinner
@ 2021-01-12  1:07 ` Dave Chinner
  2021-01-12 10:42   ` Christoph Hellwig
  2021-01-12  8:01 ` [RFC] xfs: reduce sub-block DIO serialisation Avi Kivity
       [not found] ` <CACz=WechdgSnVHQsg0LKjMiG8kHLujBshmc270yrdjxfpffmDQ@mail.gmail.com>
  7 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2021-01-12  1:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel, avi, andres

From: Dave Chinner <dchinner@redhat.com>

Attempt shared locking for unaligned DIO, but only if the
underlying extent is already allocated and in written state. On
failure, retry with the existing exclusive locking.

Test case is fio randrw of 512 byte IOs using AIO and an iodepth of
32 IOs.

Vanilla:

  READ: bw=4560KiB/s (4670kB/s), 4560KiB/s-4560KiB/s (4670kB/s-4670kB/s), io=134MiB (140MB), run=30001-30001msec
  WRITE: bw=4567KiB/s (4676kB/s), 4567KiB/s-4567KiB/s (4676kB/s-4676kB/s), io=134MiB (140MB), run=30001-30001msec


Patched:
   READ: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1127MiB (1182MB), run=30002-30002msec
  WRITE: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1128MiB (1183MB), run=30002-30002msec

That's an improvement from ~18k IOPS to ~150k IOPS, which is
about the IOPS limit of the VM block device setup I'm testing on.

4kB block IO comparison:

   READ: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8868MiB (9299MB), run=30002-30002msec
  WRITE: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8878MiB (9309MB), run=30002-30002msec

Which is ~150k IOPS, same as what the test gets for sub-block
AIO+DIO writes with this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c  | 94 +++++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_iomap.c | 32 +++++++++++-----
 2 files changed, 86 insertions(+), 40 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index bba33be17eff..f5c75404b8a5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -408,7 +408,7 @@ xfs_file_aio_write_checks(
 			drained_dio = true;
 			goto restart;
 		}
-	
+
 		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
 		error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
 				NULL, &xfs_buffered_write_iomap_ops);
@@ -510,9 +510,9 @@ static const struct iomap_dio_ops xfs_dio_write_ops = {
 /*
  * Handle block aligned direct IO writes
  *
- * Lock the inode appropriately to prepare for and issue a direct IO write.
- * By separating it from the buffered write path we remove all the tricky to
- * follow locking changes and looping.
+ * Lock the inode appropriately to prepare for and issue a direct IO write.  By
+ * separating it from the buffered write path we remove all the tricky to follow
+ * locking changes and looping.
  *
  * If there are cached pages or we're extending the file, we need IOLOCK_EXCL
  * until we're sure the bytes at the new EOF have been zeroed and/or the cached
@@ -578,18 +578,31 @@ xfs_file_dio_write_aligned(
  * allowing them to be done in parallel with reads and other direct IO writes.
  * However, if the IO is not aligned to filesystem blocks, the direct IO layer
  * may need to do sub-block zeroing and that requires serialisation against other
- * direct IOs to the same block. In this case we need to serialise the
- * submission of the unaligned IOs so that we don't get racing block zeroing in
- * the dio layer.
+ * direct IOs to the same block. In the case where sub-block zeroing is not
+ * required, we can do concurrent sub-block dios to the same block successfully.
+ *
+ * Hence we have two cases here - the shared, optimistic fast path for written
+ * extents, and everything else that needs exclusive IO path access across the
+ * entire IO.
+ *
+ * For the first case, we do all the checks we need at the mapping layer in the
+ * DIO code as part of the existing NOWAIT infrastructure. Hence all we need to
+ * do to support concurrent subblock dio is first try a non-blocking submission.
+ * If that returns -EAGAIN, then we simply repeat the IO submission with full
+ * IO exclusivity guaranteed so that we avoid racing sub-block zeroing.
+ *
+ * The only wrinkle in this case is that the iomap DIO code always does
+ * partial tail sub-block zeroing for post-EOF writes. Hence for any IO that
+ * _ends_ past the current EOF we need to run with full exclusivity. Note that
+ * we also check for the start of IO being beyond EOF because then zeroing
+ * between the old EOF and the start of the IO is required and that also
+ * requires exclusivity. Hence we avoid lock cycles and blocking under
+ * IOCB_NOWAIT for this situation, too.
  *
- * To provide the same serialisation for AIO, we also need to wait for
+ * To provide the exclusivity required when using AIO, we also need to wait for
  * outstanding IOs to complete so that unwritten extent conversion is completed
  * before we try to map the overlapping block. This is currently implemented by
  * hitting it with a big hammer (i.e. inode_dio_wait()).
- *
- * This means that unaligned dio writes always block. There is no "nowait" fast
- * path in this code - if IOCB_NOWAIT is set we simply return -EAGAIN up front
- * and we don't have to worry about that anymore.
  */
 static ssize_t
 xfs_file_dio_write_unaligned(
@@ -597,23 +610,35 @@ xfs_file_dio_write_unaligned(
 	struct kiocb		*iocb,
 	struct iov_iter		*from)
 {
-	int			iolock = XFS_IOLOCK_EXCL;
+	int			iolock = XFS_IOLOCK_SHARED;
 	size_t			count = iov_iter_count(from);
 	ssize_t			ret;
+	loff_t			isize = i_size_read(VFS_I(ip));
 	struct iomap_dio_rw_args args = {
 		.iocb			= iocb,
 		.iter			= from,
 		.ops			= &xfs_direct_write_iomap_ops,
 		.dops			= &xfs_dio_write_ops,
 		.wait_for_completion	= is_sync_kiocb(iocb),
-		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
+		.nonblocking		= true,
 	};
 
 	/*
-	 * This must be the only IO in-flight. Wait on it before we
-	 * release the iolock to prevent subsequent overlapping IO.
+	 * Extending writes need exclusivity because of the sub-block zeroing
+	 * that the DIO code always does for partial tail blocks beyond EOF.
 	 */
-	args.wait_for_completion = true;
+	if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
+retry_exclusive:
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		iolock = XFS_IOLOCK_EXCL;
+		args.nonblocking = false;
+		args.wait_for_completion = true;
+	}
+
+	ret = xfs_ilock_iocb(iocb, iolock);
+	if (ret)
+		return ret;
 
 	/*
 	 * We can't properly handle unaligned direct I/O to reflink
@@ -621,30 +646,37 @@ xfs_file_dio_write_unaligned(
 	 */
 	if (xfs_is_cow_inode(ip)) {
 		trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count);
-		return -ENOTBLK;
+		ret = -ENOTBLK;
+		goto out_unlock;
 	}
 
-	/* unaligned dio always waits, bail */
-	if (iocb->ki_flags & IOCB_NOWAIT)
-		return -EAGAIN;
-	xfs_ilock(ip, iolock);
-
 	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
 	if (ret)
-		goto out;
+		goto out_unlock;
 	count = iov_iter_count(from);
 
 	/*
-	 * If we are doing unaligned IO, we can't allow any other overlapping IO
-	 * in-flight at the same time or we risk data corruption. Wait for all
-	 * other IO to drain before we submit. If the IO is aligned, demote the
-	 * iolock if we had to take the exclusive lock in
-	 * xfs_file_aio_write_checks() for other reasons.
+	 * If we are doing exclusive unaligned IO, we can't allow any other
+	 * overlapping IO in-flight at the same time or we risk data corruption.
+	 * Wait for all other IO to drain before we submit.
 	 */
-	inode_dio_wait(VFS_I(ip));
+	if (!args.nonblocking)
+		inode_dio_wait(VFS_I(ip));
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
 	ret = iomap_dio_rw(&args);
-out:
+
+	/*
+	 * Retry unaligned IO with exclusive blocking semantics if the DIO
+	 * layer rejected it for mapping or locking reasons. If we are doing
+	 * nonblocking user IO, propagate the error.
+	 */
+	if (ret == -EAGAIN) {
+		ASSERT(args.nonblocking == true);
+		xfs_iunlock(ip, iolock);
+		goto retry_exclusive;
+	}
+
+out_unlock:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 7b9ff824e82d..e5659200e5e8 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -783,16 +783,30 @@ xfs_direct_write_iomap_begin(
 	if (imap_needs_alloc(inode, flags, &imap, nimaps))
 		goto allocate_blocks;
 
-	/*
-	 * NOWAIT IO needs to span the entire requested IO with a single map so
-	 * that we avoid partial IO failures due to the rest of the IO range not
-	 * covered by this map triggering an EAGAIN condition when it is
-	 * subsequently mapped and aborting the IO.
-	 */
-	if ((flags & IOMAP_NOWAIT) &&
-	    !imap_spans_range(&imap, offset_fsb, end_fsb)) {
+	/* Handle special NOWAIT conditions for existing allocated extents. */
+	if (flags & IOMAP_NOWAIT) {
 		error = -EAGAIN;
-		goto out_unlock;
+		/*
+		 * NOWAIT IO needs to span the entire requested IO with a single
+		 * map so that we avoid partial IO failures due to the rest of
+		 * the IO range not covered by this map triggering an EAGAIN
+		 * condition when it is subsequently mapped and aborting the IO.
+		 */
+		if (!imap_spans_range(&imap, offset_fsb, end_fsb))
+			goto out_unlock;
+
+		/*
+		 * If the IO is unaligned and the caller holds a shared IOLOCK,
+		 * NOWAIT will be set because we can only do the IO if it spans
+		 * a written extent. Otherwise we have to do sub-block zeroing,
+		 * and that can only be done under an exclusive IOLOCK. Hence if
+		 * this is not a written extent, return -EAGAIN to tell the
+		 * caller to try again.
+		 */
+		if (imap.br_state != XFS_EXT_NORM &&
+		    ((offset & mp->m_blockmask) ||
+		     ((offset + length) & mp->m_blockmask)))
+			goto out_unlock;
 	}
 
 	xfs_iunlock(ip, lockmode);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure
  2021-01-12  1:07 ` [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure Dave Chinner
@ 2021-01-12  1:22   ` Damien Le Moal
  2021-01-12  1:40   ` Darrick J. Wong
  2021-01-12 10:31   ` Christoph Hellwig
  2 siblings, 0 replies; 24+ messages in thread
From: Damien Le Moal @ 2021-01-12  1:22 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs; +Cc: linux-fsdevel, avi, andres

On 2021/01/12 10:08, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Adding yet another parameter to the iomap_dio_rw() interface means
> changing lots of filesystems to add the parameter. Convert this
> interface to an args structure so in future we don't need to modify
> every caller to add a new parameter.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/btrfs/file.c       | 21 ++++++++++++++++-----
>  fs/ext4/file.c        | 24 ++++++++++++++++++------
>  fs/gfs2/file.c        | 19 ++++++++++++++-----
>  fs/iomap/direct-io.c  | 30 ++++++++++++++----------------
>  fs/xfs/xfs_file.c     | 30 +++++++++++++++++++++---------
>  fs/zonefs/super.c     | 21 +++++++++++++++++----
>  include/linux/iomap.h | 16 ++++++++++------
>  7 files changed, 110 insertions(+), 51 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0e41459b8de6..a49d9fa918d1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1907,6 +1907,13 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	ssize_t err;
>  	unsigned int ilock_flags = 0;
>  	struct iomap_dio *dio = NULL;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= from,
> +		.ops			= &btrfs_dio_iomap_ops,
> +		.dops			= &btrfs_dio_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		ilock_flags |= BTRFS_ILOCK_TRY;
> @@ -1949,9 +1956,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  		goto buffered;
>  	}
>  
> -	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops,
> -			     &btrfs_dio_ops, is_sync_kiocb(iocb));
> -
> +	dio = __iomap_dio_rw(&args);
>  	btrfs_inode_unlock(inode, ilock_flags);
>  
>  	if (IS_ERR_OR_NULL(dio)) {
> @@ -3617,13 +3622,19 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>  {
>  	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= to,
> +		.ops			= &btrfs_dio_iomap_ops,
> +		.dops			= &btrfs_dio_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	if (check_direct_read(btrfs_sb(inode->i_sb), to, iocb->ki_pos))
>  		return 0;
>  
>  	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> -	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -			   is_sync_kiocb(iocb));
> +	ret = iomap_dio_rw(&args);
>  	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
>  	return ret;
>  }
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 3ed8c048fb12..436508be6d88 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -53,6 +53,12 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  {
>  	ssize_t ret;
>  	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= to,
> +		.ops			= &ext4_iomap_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT) {
>  		if (!inode_trylock_shared(inode))
> @@ -74,8 +80,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  		return generic_file_read_iter(iocb, to);
>  	}
>  
> -	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
> -			   is_sync_kiocb(iocb));
> +	ret = iomap_dio_rw(&args);
>  	inode_unlock_shared(inode);
>  
>  	file_accessed(iocb->ki_filp);
> @@ -459,9 +464,15 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	struct inode *inode = file_inode(iocb->ki_filp);
>  	loff_t offset = iocb->ki_pos;
>  	size_t count = iov_iter_count(from);
> -	const struct iomap_ops *iomap_ops = &ext4_iomap_ops;
>  	bool extend = false, unaligned_io = false;
>  	bool ilock_shared = true;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= from,
> +		.ops			= &ext4_iomap_ops,
> +		.dops			= &ext4_dio_write_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	/*
>  	 * We initially start with shared inode lock unless it is
> @@ -548,9 +559,10 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	}
>  
>  	if (ilock_shared)
> -		iomap_ops = &ext4_iomap_overwrite_ops;
> -	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> -			   is_sync_kiocb(iocb) || unaligned_io || extend);
> +		args.ops = &ext4_iomap_overwrite_ops;
> +	if (unaligned_io || extend)
> +		args.wait_for_completion = true;
> +	ret = iomap_dio_rw(&args);
>  	if (ret == -ENOTBLK)
>  		ret = 0;
>  
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index b39b339feddc..d44a5f9c5f34 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -788,6 +788,12 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
>  	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
>  	size_t count = iov_iter_count(to);
>  	ssize_t ret;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= to,
> +		.ops			= &gfs2_iomap_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	if (!count)
>  		return 0; /* skip atime */
> @@ -797,9 +803,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
>  	if (ret)
>  		goto out_uninit;
>  
> -	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
> -			   is_sync_kiocb(iocb));
> -
> +	ret = iomap_dio_rw(&args);
>  	gfs2_glock_dq(gh);
>  out_uninit:
>  	gfs2_holder_uninit(gh);
> @@ -815,6 +819,12 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
>  	size_t len = iov_iter_count(from);
>  	loff_t offset = iocb->ki_pos;
>  	ssize_t ret;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= from,
> +		.ops			= &gfs2_iomap_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	/*
>  	 * Deferred lock, even if its a write, since we do no allocation on
> @@ -833,8 +843,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
>  	if (offset + len > i_size_read(&ip->i_inode))
>  		goto out;
>  
> -	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
> -			   is_sync_kiocb(iocb));
> +	ret = iomap_dio_rw(&args);
>  	if (ret == -ENOTBLK)
>  		ret = 0;
>  out:
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 933f234d5bec..05cacc27578c 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -418,13 +418,13 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>   * writes.  The callers needs to fall back to buffered I/O in this case.
>   */
>  struct iomap_dio *
> -__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> -		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		bool wait_for_completion)
> +__iomap_dio_rw(struct iomap_dio_rw_args *args)
>  {
> +	struct kiocb *iocb = args->iocb;
> +	struct iov_iter *iter = args->iter;
>  	struct address_space *mapping = iocb->ki_filp->f_mapping;
>  	struct inode *inode = file_inode(iocb->ki_filp);
> -	size_t count = iov_iter_count(iter);
> +	size_t count = iov_iter_count(args->iter);
>  	loff_t pos = iocb->ki_pos;
>  	loff_t end = iocb->ki_pos + count - 1, ret = 0;
>  	unsigned int flags = IOMAP_DIRECT;
> @@ -434,7 +434,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (!count)
>  		return NULL;
>  
> -	if (WARN_ON(is_sync_kiocb(iocb) && !wait_for_completion))
> +	if (WARN_ON(is_sync_kiocb(iocb) && !args->wait_for_completion))
>  		return ERR_PTR(-EIO);
>  
>  	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
> @@ -445,7 +445,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	atomic_set(&dio->ref, 1);
>  	dio->size = 0;
>  	dio->i_size = i_size_read(inode);
> -	dio->dops = dops;
> +	dio->dops = args->dops;
>  	dio->error = 0;
>  	dio->flags = 0;
>  
> @@ -490,7 +490,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (ret)
>  		goto out_free_dio;
>  
> -	if (iov_iter_rw(iter) == WRITE) {
> +	if (iov_iter_rw(args->iter) == WRITE) {
>  		/*
>  		 * Try to invalidate cache pages for the range we are writing.
>  		 * If this invalidation fails, let the caller fall back to
> @@ -503,7 +503,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			goto out_free_dio;
>  		}
>  
> -		if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
> +		if (!args->wait_for_completion && !inode->i_sb->s_dio_done_wq) {
>  			ret = sb_init_dio_done_wq(inode->i_sb);
>  			if (ret < 0)
>  				goto out_free_dio;
> @@ -514,12 +514,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	blk_start_plug(&plug);
>  	do {
> -		ret = iomap_apply(inode, pos, count, flags, ops, dio,
> +		ret = iomap_apply(inode, pos, count, flags, args->ops, dio,
>  				iomap_dio_actor);
>  		if (ret <= 0) {
>  			/* magic error code to fall back to buffered I/O */
>  			if (ret == -ENOTBLK) {
> -				wait_for_completion = true;
> +				args->wait_for_completion = true;
>  				ret = 0;
>  			}
>  			break;
> @@ -566,9 +566,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	 *	of the final reference, and we will complete and free it here
>  	 *	after we got woken by the I/O completion handler.
>  	 */
> -	dio->wait_for_completion = wait_for_completion;
> +	dio->wait_for_completion = args->wait_for_completion;
>  	if (!atomic_dec_and_test(&dio->ref)) {
> -		if (!wait_for_completion)
> +		if (!args->wait_for_completion)
>  			return ERR_PTR(-EIOCBQUEUED);
>  
>  		for (;;) {
> @@ -596,13 +596,11 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  EXPORT_SYMBOL_GPL(__iomap_dio_rw);
>  
>  ssize_t
> -iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> -		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		bool wait_for_completion)
> +iomap_dio_rw(struct iomap_dio_rw_args *args)
>  {
>  	struct iomap_dio *dio;
>  
> -	dio = __iomap_dio_rw(iocb, iter, ops, dops, wait_for_completion);
> +	dio = __iomap_dio_rw(args);
>  	if (IS_ERR_OR_NULL(dio))
>  		return PTR_ERR_OR_ZERO(dio);
>  	return iomap_dio_complete(dio);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5b0f93f73837..29f4204e551f 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -205,6 +205,12 @@ xfs_file_dio_aio_read(
>  	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
>  	size_t			count = iov_iter_count(to);
>  	ssize_t			ret;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= to,
> +		.ops			= &xfs_read_iomap_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
>  
> @@ -219,8 +225,7 @@ xfs_file_dio_aio_read(
>  	} else {
>  		xfs_ilock(ip, XFS_IOLOCK_SHARED);
>  	}
> -	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL,
> -			is_sync_kiocb(iocb));
> +	ret = iomap_dio_rw(&args);
>  	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
>  
>  	return ret;
> @@ -519,6 +524,13 @@ xfs_file_dio_aio_write(
>  	int			iolock;
>  	size_t			count = iov_iter_count(from);
>  	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= from,
> +		.ops			= &xfs_direct_write_iomap_ops,
> +		.dops			= &xfs_dio_write_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	/* DIO must be aligned to device logical sector size */
>  	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
> @@ -535,6 +547,12 @@ xfs_file_dio_aio_write(
>  	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
>  		unaligned_io = 1;
>  
> +		/*
> +		 * This must be the only IO in-flight. Wait on it before we
> +		 * release the iolock to prevent subsequent overlapping IO.
> +		 */
> +		args.wait_for_completion = true;
> +
>  		/*
>  		 * We can't properly handle unaligned direct I/O to reflink
>  		 * files yet, as we can't unshare a partial block.
> @@ -578,13 +596,7 @@ xfs_file_dio_aio_write(
>  	}
>  
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
> -	/*
> -	 * If unaligned, this is the only IO in-flight. Wait on it before we
> -	 * release the iolock to prevent subsequent overlapping IO.
> -	 */
> -	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> -			   &xfs_dio_write_ops,
> -			   is_sync_kiocb(iocb) || unaligned_io);
> +	ret = iomap_dio_rw(&args);
>  out:
>  	xfs_iunlock(ip, iolock);
>  
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index bec47f2d074b..edf353ad1edc 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -735,6 +735,13 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
>  	bool append = false;
>  	size_t count;
>  	ssize_t ret;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= from,
> +		.ops			= &zonefs_iomap_ops,
> +		.dops			= &zonefs_write_dio_ops,
> +		.wait_for_completion	= sync,
> +	};
>  
>  	/*
>  	 * For async direct IOs to sequential zone files, refuse IOCB_NOWAIT
> @@ -779,8 +786,8 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
>  	if (append)
>  		ret = zonefs_file_dio_append(iocb, from);
>  	else
> -		ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
> -				   &zonefs_write_dio_ops, sync);
> +		ret = iomap_dio_rw(&args);
> +
>  	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
>  	    (ret > 0 || ret == -EIOCBQUEUED)) {
>  		if (ret > 0)
> @@ -909,6 +916,13 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	mutex_unlock(&zi->i_truncate_mutex);
>  
>  	if (iocb->ki_flags & IOCB_DIRECT) {
> +		struct iomap_dio_rw_args args = {
> +			.iocb			= iocb,
> +			.iter			= to,
> +			.ops			= &zonefs_iomap_ops,
> +			.dops			= &zonefs_read_dio_ops,
> +			.wait_for_completion	= is_sync_kiocb(iocb),
> +		};
>  		size_t count = iov_iter_count(to);
>  
>  		if ((iocb->ki_pos | count) & (sb->s_blocksize - 1)) {
> @@ -916,8 +930,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  			goto inode_unlock;
>  		}
>  		file_accessed(iocb->ki_filp);
> -		ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
> -				   &zonefs_read_dio_ops, is_sync_kiocb(iocb));
> +		ret = iomap_dio_rw(&args);
>  	} else {
>  		ret = generic_file_read_iter(iocb, to);
>  		if (ret == -EIO)
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5bd3cac4df9c..16d20c01b5bb 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -256,12 +256,16 @@ struct iomap_dio_ops {
>  			struct bio *bio, loff_t file_offset);
>  };
>  
> -ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> -		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		bool wait_for_completion);
> -struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> -		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		bool wait_for_completion);
> +struct iomap_dio_rw_args {
> +	struct kiocb		*iocb;
> +	struct iov_iter		*iter;
> +	const struct iomap_ops	*ops;
> +	const struct iomap_dio_ops *dops;
> +	bool			wait_for_completion;
> +};
> +
> +ssize_t iomap_dio_rw(struct iomap_dio_rw_args *args);
> +struct iomap_dio *__iomap_dio_rw(struct iomap_dio_rw_args *args);
>  ssize_t iomap_dio_complete(struct iomap_dio *dio);
>  int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);

Looks all good to me.

For the zonefs part:

Acked-by: Damien Le Moal <damien.lemoal@wdc.com>


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure
  2021-01-12  1:07 ` [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure Dave Chinner
  2021-01-12  1:22   ` Damien Le Moal
@ 2021-01-12  1:40   ` Darrick J. Wong
  2021-01-12  1:53     ` Dave Chinner
  2021-01-12 10:31   ` Christoph Hellwig
  2 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2021-01-12  1:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, avi, andres

On Tue, Jan 12, 2021 at 12:07:41PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Adding yet another parameter to the iomap_dio_rw() interface means
> changing lots of filesystems to add the parameter. Convert this
> interface to an args structure so in future we don't need to modify
> every caller to add a new parameter.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/btrfs/file.c       | 21 ++++++++++++++++-----
>  fs/ext4/file.c        | 24 ++++++++++++++++++------
>  fs/gfs2/file.c        | 19 ++++++++++++++-----
>  fs/iomap/direct-io.c  | 30 ++++++++++++++----------------
>  fs/xfs/xfs_file.c     | 30 +++++++++++++++++++++---------
>  fs/zonefs/super.c     | 21 +++++++++++++++++----
>  include/linux/iomap.h | 16 ++++++++++------
>  7 files changed, 110 insertions(+), 51 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0e41459b8de6..a49d9fa918d1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1907,6 +1907,13 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	ssize_t err;
>  	unsigned int ilock_flags = 0;
>  	struct iomap_dio *dio = NULL;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= from,
> +		.ops			= &btrfs_dio_iomap_ops,
> +		.dops			= &btrfs_dio_ops,

/me wonders if it would make sense to move all the iomap_dio_ops fields
into iomap_dio_rw_args to reduce pointer dereferencing when making the
indirect call?

--D

> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		ilock_flags |= BTRFS_ILOCK_TRY;
> @@ -1949,9 +1956,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  		goto buffered;
>  	}
>  
> -	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops,
> -			     &btrfs_dio_ops, is_sync_kiocb(iocb));
> -
> +	dio = __iomap_dio_rw(&args);
>  	btrfs_inode_unlock(inode, ilock_flags);
>  
>  	if (IS_ERR_OR_NULL(dio)) {
> @@ -3617,13 +3622,19 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>  {
>  	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= to,
> +		.ops			= &btrfs_dio_iomap_ops,
> +		.dops			= &btrfs_dio_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	if (check_direct_read(btrfs_sb(inode->i_sb), to, iocb->ki_pos))
>  		return 0;
>  
>  	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> -	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -			   is_sync_kiocb(iocb));
> +	ret = iomap_dio_rw(&args);
>  	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
>  	return ret;
>  }
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 3ed8c048fb12..436508be6d88 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -53,6 +53,12 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  {
>  	ssize_t ret;
>  	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= to,
> +		.ops			= &ext4_iomap_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT) {
>  		if (!inode_trylock_shared(inode))
> @@ -74,8 +80,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  		return generic_file_read_iter(iocb, to);
>  	}
>  
> -	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
> -			   is_sync_kiocb(iocb));
> +	ret = iomap_dio_rw(&args);
>  	inode_unlock_shared(inode);
>  
>  	file_accessed(iocb->ki_filp);
> @@ -459,9 +464,15 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	struct inode *inode = file_inode(iocb->ki_filp);
>  	loff_t offset = iocb->ki_pos;
>  	size_t count = iov_iter_count(from);
> -	const struct iomap_ops *iomap_ops = &ext4_iomap_ops;
>  	bool extend = false, unaligned_io = false;
>  	bool ilock_shared = true;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= from,
> +		.ops			= &ext4_iomap_ops,
> +		.dops			= &ext4_dio_write_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	/*
>  	 * We initially start with shared inode lock unless it is
> @@ -548,9 +559,10 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	}
>  
>  	if (ilock_shared)
> -		iomap_ops = &ext4_iomap_overwrite_ops;
> -	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> -			   is_sync_kiocb(iocb) || unaligned_io || extend);
> +		args.ops = &ext4_iomap_overwrite_ops;
> +	if (unaligned_io || extend)
> +		args.wait_for_completion = true;
> +	ret = iomap_dio_rw(&args);
>  	if (ret == -ENOTBLK)
>  		ret = 0;
>  
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index b39b339feddc..d44a5f9c5f34 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -788,6 +788,12 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
>  	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
>  	size_t count = iov_iter_count(to);
>  	ssize_t ret;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= to,
> +		.ops			= &gfs2_iomap_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	if (!count)
>  		return 0; /* skip atime */
> @@ -797,9 +803,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
>  	if (ret)
>  		goto out_uninit;
>  
> -	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
> -			   is_sync_kiocb(iocb));
> -
> +	ret = iomap_dio_rw(&args);
>  	gfs2_glock_dq(gh);
>  out_uninit:
>  	gfs2_holder_uninit(gh);
> @@ -815,6 +819,12 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
>  	size_t len = iov_iter_count(from);
>  	loff_t offset = iocb->ki_pos;
>  	ssize_t ret;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= from,
> +		.ops			= &gfs2_iomap_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	/*
>  	 * Deferred lock, even if its a write, since we do no allocation on
> @@ -833,8 +843,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
>  	if (offset + len > i_size_read(&ip->i_inode))
>  		goto out;
>  
> -	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
> -			   is_sync_kiocb(iocb));
> +	ret = iomap_dio_rw(&args);
>  	if (ret == -ENOTBLK)
>  		ret = 0;
>  out:
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 933f234d5bec..05cacc27578c 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -418,13 +418,13 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>   * writes.  The callers needs to fall back to buffered I/O in this case.
>   */
>  struct iomap_dio *
> -__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> -		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		bool wait_for_completion)
> +__iomap_dio_rw(struct iomap_dio_rw_args *args)
>  {
> +	struct kiocb *iocb = args->iocb;
> +	struct iov_iter *iter = args->iter;
>  	struct address_space *mapping = iocb->ki_filp->f_mapping;
>  	struct inode *inode = file_inode(iocb->ki_filp);
> -	size_t count = iov_iter_count(iter);
> +	size_t count = iov_iter_count(args->iter);
>  	loff_t pos = iocb->ki_pos;
>  	loff_t end = iocb->ki_pos + count - 1, ret = 0;
>  	unsigned int flags = IOMAP_DIRECT;
> @@ -434,7 +434,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (!count)
>  		return NULL;
>  
> -	if (WARN_ON(is_sync_kiocb(iocb) && !wait_for_completion))
> +	if (WARN_ON(is_sync_kiocb(iocb) && !args->wait_for_completion))
>  		return ERR_PTR(-EIO);
>  
>  	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
> @@ -445,7 +445,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	atomic_set(&dio->ref, 1);
>  	dio->size = 0;
>  	dio->i_size = i_size_read(inode);
> -	dio->dops = dops;
> +	dio->dops = args->dops;
>  	dio->error = 0;
>  	dio->flags = 0;
>  
> @@ -490,7 +490,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (ret)
>  		goto out_free_dio;
>  
> -	if (iov_iter_rw(iter) == WRITE) {
> +	if (iov_iter_rw(args->iter) == WRITE) {
>  		/*
>  		 * Try to invalidate cache pages for the range we are writing.
>  		 * If this invalidation fails, let the caller fall back to
> @@ -503,7 +503,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			goto out_free_dio;
>  		}
>  
> -		if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
> +		if (!args->wait_for_completion && !inode->i_sb->s_dio_done_wq) {
>  			ret = sb_init_dio_done_wq(inode->i_sb);
>  			if (ret < 0)
>  				goto out_free_dio;
> @@ -514,12 +514,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	blk_start_plug(&plug);
>  	do {
> -		ret = iomap_apply(inode, pos, count, flags, ops, dio,
> +		ret = iomap_apply(inode, pos, count, flags, args->ops, dio,
>  				iomap_dio_actor);
>  		if (ret <= 0) {
>  			/* magic error code to fall back to buffered I/O */
>  			if (ret == -ENOTBLK) {
> -				wait_for_completion = true;
> +				args->wait_for_completion = true;
>  				ret = 0;
>  			}
>  			break;
> @@ -566,9 +566,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	 *	of the final reference, and we will complete and free it here
>  	 *	after we got woken by the I/O completion handler.
>  	 */
> -	dio->wait_for_completion = wait_for_completion;
> +	dio->wait_for_completion = args->wait_for_completion;
>  	if (!atomic_dec_and_test(&dio->ref)) {
> -		if (!wait_for_completion)
> +		if (!args->wait_for_completion)
>  			return ERR_PTR(-EIOCBQUEUED);
>  
>  		for (;;) {
> @@ -596,13 +596,11 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  EXPORT_SYMBOL_GPL(__iomap_dio_rw);
>  
>  ssize_t
> -iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> -		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		bool wait_for_completion)
> +iomap_dio_rw(struct iomap_dio_rw_args *args)
>  {
>  	struct iomap_dio *dio;
>  
> -	dio = __iomap_dio_rw(iocb, iter, ops, dops, wait_for_completion);
> +	dio = __iomap_dio_rw(args);
>  	if (IS_ERR_OR_NULL(dio))
>  		return PTR_ERR_OR_ZERO(dio);
>  	return iomap_dio_complete(dio);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5b0f93f73837..29f4204e551f 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -205,6 +205,12 @@ xfs_file_dio_aio_read(
>  	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
>  	size_t			count = iov_iter_count(to);
>  	ssize_t			ret;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= to,
> +		.ops			= &xfs_read_iomap_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
>  
> @@ -219,8 +225,7 @@ xfs_file_dio_aio_read(
>  	} else {
>  		xfs_ilock(ip, XFS_IOLOCK_SHARED);
>  	}
> -	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL,
> -			is_sync_kiocb(iocb));
> +	ret = iomap_dio_rw(&args);
>  	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
>  
>  	return ret;
> @@ -519,6 +524,13 @@ xfs_file_dio_aio_write(
>  	int			iolock;
>  	size_t			count = iov_iter_count(from);
>  	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= from,
> +		.ops			= &xfs_direct_write_iomap_ops,
> +		.dops			= &xfs_dio_write_ops,
> +		.wait_for_completion	= is_sync_kiocb(iocb),
> +	};
>  
>  	/* DIO must be aligned to device logical sector size */
>  	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
> @@ -535,6 +547,12 @@ xfs_file_dio_aio_write(
>  	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
>  		unaligned_io = 1;
>  
> +		/*
> +		 * This must be the only IO in-flight. Wait on it before we
> +		 * release the iolock to prevent subsequent overlapping IO.
> +		 */
> +		args.wait_for_completion = true;
> +
>  		/*
>  		 * We can't properly handle unaligned direct I/O to reflink
>  		 * files yet, as we can't unshare a partial block.
> @@ -578,13 +596,7 @@ xfs_file_dio_aio_write(
>  	}
>  
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
> -	/*
> -	 * If unaligned, this is the only IO in-flight. Wait on it before we
> -	 * release the iolock to prevent subsequent overlapping IO.
> -	 */
> -	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> -			   &xfs_dio_write_ops,
> -			   is_sync_kiocb(iocb) || unaligned_io);
> +	ret = iomap_dio_rw(&args);
>  out:
>  	xfs_iunlock(ip, iolock);
>  
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index bec47f2d074b..edf353ad1edc 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -735,6 +735,13 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
>  	bool append = false;
>  	size_t count;
>  	ssize_t ret;
> +	struct iomap_dio_rw_args args = {
> +		.iocb			= iocb,
> +		.iter			= from,
> +		.ops			= &zonefs_iomap_ops,
> +		.dops			= &zonefs_write_dio_ops,
> +		.wait_for_completion	= sync,
> +	};
>  
>  	/*
>  	 * For async direct IOs to sequential zone files, refuse IOCB_NOWAIT
> @@ -779,8 +786,8 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
>  	if (append)
>  		ret = zonefs_file_dio_append(iocb, from);
>  	else
> -		ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
> -				   &zonefs_write_dio_ops, sync);
> +		ret = iomap_dio_rw(&args);
> +
>  	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
>  	    (ret > 0 || ret == -EIOCBQUEUED)) {
>  		if (ret > 0)
> @@ -909,6 +916,13 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	mutex_unlock(&zi->i_truncate_mutex);
>  
>  	if (iocb->ki_flags & IOCB_DIRECT) {
> +		struct iomap_dio_rw_args args = {
> +			.iocb			= iocb,
> +			.iter			= to,
> +			.ops			= &zonefs_iomap_ops,
> +			.dops			= &zonefs_read_dio_ops,
> +			.wait_for_completion	= is_sync_kiocb(iocb),
> +		};
>  		size_t count = iov_iter_count(to);
>  
>  		if ((iocb->ki_pos | count) & (sb->s_blocksize - 1)) {
> @@ -916,8 +930,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  			goto inode_unlock;
>  		}
>  		file_accessed(iocb->ki_filp);
> -		ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
> -				   &zonefs_read_dio_ops, is_sync_kiocb(iocb));
> +		ret = iomap_dio_rw(&args);
>  	} else {
>  		ret = generic_file_read_iter(iocb, to);
>  		if (ret == -EIO)
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5bd3cac4df9c..16d20c01b5bb 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -256,12 +256,16 @@ struct iomap_dio_ops {
>  			struct bio *bio, loff_t file_offset);
>  };
>  
> -ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> -		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		bool wait_for_completion);
> -struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> -		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		bool wait_for_completion);
> +struct iomap_dio_rw_args {
> +	struct kiocb		*iocb;
> +	struct iov_iter		*iter;
> +	const struct iomap_ops	*ops;
> +	const struct iomap_dio_ops *dops;
> +	bool			wait_for_completion;
> +};
> +
> +ssize_t iomap_dio_rw(struct iomap_dio_rw_args *args);
> +struct iomap_dio *__iomap_dio_rw(struct iomap_dio_rw_args *args);
>  ssize_t iomap_dio_complete(struct iomap_dio *dio);
>  int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
>  
> -- 
> 2.28.0
> 



* Re: [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure
  2021-01-12  1:40   ` Darrick J. Wong
@ 2021-01-12  1:53     ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2021-01-12  1:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, avi, andres

On Mon, Jan 11, 2021 at 05:40:23PM -0800, Darrick J. Wong wrote:
> On Tue, Jan 12, 2021 at 12:07:41PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Adding yet another parameter to the iomap_dio_rw() interface means
> > changing lots of filesystems to add the parameter. Convert this
> > interface to an args structure so in future we don't need to modify
> > every caller to add a new parameter.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/btrfs/file.c       | 21 ++++++++++++++++-----
> >  fs/ext4/file.c        | 24 ++++++++++++++++++------
> >  fs/gfs2/file.c        | 19 ++++++++++++++-----
> >  fs/iomap/direct-io.c  | 30 ++++++++++++++----------------
> >  fs/xfs/xfs_file.c     | 30 +++++++++++++++++++++---------
> >  fs/zonefs/super.c     | 21 +++++++++++++++++----
> >  include/linux/iomap.h | 16 ++++++++++------
> >  7 files changed, 110 insertions(+), 51 deletions(-)
> > 
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 0e41459b8de6..a49d9fa918d1 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1907,6 +1907,13 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> >  	ssize_t err;
> >  	unsigned int ilock_flags = 0;
> >  	struct iomap_dio *dio = NULL;
> > +	struct iomap_dio_rw_args args = {
> > +		.iocb			= iocb,
> > +		.iter			= from,
> > +		.ops			= &btrfs_dio_iomap_ops,
> > +		.dops			= &btrfs_dio_ops,
> 
> /me wonders if it would make sense to move all the iomap_dio_ops fields
> into iomap_dio_rw_args to reduce pointer dereferencing when making the
> indirect call?

Perhaps so - there are only two ops defined in that structure so
there's not a whole lot of gain/loss there either way.  Trivial to
do, though, with them encapsulated in this structure...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC] xfs: reduce sub-block DIO serialisation
  2021-01-12  1:07 [RFC] xfs: reduce sub-block DIO serialisation Dave Chinner
                   ` (5 preceding siblings ...)
  2021-01-12  1:07 ` [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio Dave Chinner
@ 2021-01-12  8:01 ` Avi Kivity
  2021-01-12 22:13   ` Dave Chinner
       [not found] ` <CACz=WechdgSnVHQsg0LKjMiG8kHLujBshmc270yrdjxfpffmDQ@mail.gmail.com>
  7 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2021-01-12  8:01 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs; +Cc: linux-fsdevel, andres

On 1/12/21 3:07 AM, Dave Chinner wrote:
> Hi folks,
>
> This is the XFS implementation on the sub-block DIO optimisations
> for written extents that I've mentioned on #xfs and a couple of
> times now on the XFS mailing list.
>
> It takes the approach of using the IOMAP_NOWAIT non-blocking
> IO submission infrastructure to optimistically dispatch sub-block
> DIO without exclusive locking. If the extent mapping callback
> decides that it can't do the unaligned IO without extent
> manipulation, sub-block zeroing, blocking or splitting the IO into
> multiple parts, it aborts the IO with -EAGAIN. This allows the high
> level filesystem code to then take exclusive locks and resubmit the
> IO once it has guaranteed no other IO is in progress on the inode
> (the current implementation).


Can you expand on the no-splitting requirement? Does it involve only 
splitting by XFS (IO spans >1 extents) or lower layers (RAID)?


The reason I'm concerned is that it's the constraint the application has
the least control over. I guess I could use RWF_NOWAIT to avoid blocking
my main thread (but last time I tried I'd get occasional EIOs that
frightened me off that). It also seems to me to be the one easiest to
resolve - perhaps do two passes, with the first verifying the other
constraints are achieved, or one pass that copies the results into a
temporary structure that is discarded if the other constraints fail.


> This requires moving the IOMAP_NOWAIT setup decisions up into the
> filesystem, adding yet another parameter to iomap_dio_rw(). So first
> I convert iomap_dio_rw() to take an args structure so that we don't
> have to modify the API every time we want to add another setup
> parameter to the DIO submission code.
>
> I then include Christophs IOCB_NOWAIT fxies and cleanups to the XFS
> code, because they needed to be done regardless of the unaligned DIO
> issues and they make the changes simpler. Then I split the unaligned
> DIO path out from the aligned path, because all the extra complexity
> to support better unaligned DIO submission concurrency is not
> necessary for the block aligned path. Finally, I modify the
> unaligned IO path to first submit the unaligned IO using
> non-blocking semantics and provide a fallback to run the IO
> exclusively if that fails.
>
> This means that we consider sub-block dio into written a fast path
> that should almost always succeed with minimal overhead and we put
> all the overhead of failure into the slow path where exclusive
> locking is required. Unlike Christoph's proposed patch, this means
> we don't require an extra ILOCK cycle in the sub-block DIO setup
> fast path, so it should perform almost identically to the block
> aligned fast path.
>
> Tested using fio with AIO+DIO randrw to a written file. Performance
> increases from about 20k IOPS to 150k IOPS, which is the limit of
> the setup I was using for testing. Also passed fstests auto group
> on a both v4 and v5 XFS filesystems.
>
> Thoughts, comments?
>
> -Dave.
>
>



* Re: [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure
  2021-01-12  1:07 ` [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure Dave Chinner
  2021-01-12  1:22   ` Damien Le Moal
  2021-01-12  1:40   ` Darrick J. Wong
@ 2021-01-12 10:31   ` Christoph Hellwig
  2 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2021-01-12 10:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, avi, andres

On Tue, Jan 12, 2021 at 12:07:41PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Adding yet another parameter to the iomap_dio_rw() interface means
> changing lots of filesystems to add the parameter. Convert this
> interface to an args structure so in future we don't need to modify
> every caller to add a new parameter.


I don't like this at all - it leads to bloating of both the source and
binary code without a good reason, as you're only passing additional
flags.  Converting the existing wait_for_completion to a flags value gives
you everything you need while both being more readable and generating
better code.


* Re: [PATCH 5/6] xfs: split unaligned DIO write code out
  2021-01-12  1:07 ` [PATCH 5/6] xfs: split unaligned DIO write code out Dave Chinner
@ 2021-01-12 10:37   ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2021-01-12 10:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, avi, andres

On Tue, Jan 12, 2021 at 12:07:45PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The unaligned DIO write path is more convulted than the normal path,

s/convulted/convoluted/

>
> and we are about to make it more complex. Keep the block aligned
> fast path dio write code trim and simple by splitting out the
> unaligned DIO code from it.

I like this, but a few comments below:

>  /*
> + * Handle block aligned direct IO writes
>   *
>   * Lock the inode appropriately to prepare for and issue a direct IO write.
>   * By separating it from the buffered write path we remove all the tricky to
> @@ -518,35 +518,88 @@ static const struct iomap_dio_ops xfs_dio_write_ops = {
>   * until we're sure the bytes at the new EOF have been zeroed and/or the cached
>   * pages are flushed out.
>   *
> + * Returns with locks held indicated by @iolock and errors indicated by
> + * negative return values.

As far as I can tell no locks are held when returning from
xfs_file_dio_write_aligned.

> + */
> +STATIC ssize_t
> +xfs_file_dio_write_aligned(

I thought we got rid of STATIC for new code?

> +	/*
> +	 * No fallback to buffered IO after short writes for XFS, direct I/O
> +	 * will either complete fully or return an error.
> +	 */
> +	ASSERT(ret < 0 || ret == count);

Maybe it is time to drop this assert rather than duplicating it, given that
iomap direct I/O has behaved sanely in this regard from the start?

> + * To provide the same serialisation for AIO, we also need to wait for
>   * outstanding IOs to complete so that unwritten extent conversion is completed
>   * before we try to map the overlapping block. This is currently implemented by
>   * hitting it with a big hammer (i.e. inode_dio_wait()).
>   *
> + * This means that unaligned dio writes alwys block. There is no "nowait" fast

s/alwys/always/

> +	/*
> +	 * We can't properly handle unaligned direct I/O to reflink
> +	 * files yet, as we can't unshare a partial block.
> +	 */

FYI, this could use up the whole 80 chars.

> +	/*
> +	 * Don't take the exclusive iolock here unless the I/O is unaligned to
> +	 * the file system block size.  We don't need to consider the EOF
> +	 * extension case here because xfs_file_aio_write_checks() will relock
> +	 * the inode as necessary for EOF zeroing cases and fill out the new
> +	 * inode size as appropriate.
> +	 */
> +	if ((iocb->ki_pos | count) & mp->m_blockmask)
> +		return xfs_file_dio_write_unaligned(ip, iocb, from);
> +	return xfs_file_dio_write_aligned(ip, iocb, from);

I don't think that whole comment makes much sense here, the locking
documentation belongs into the aligned/unaligned helpers now.

> -		ret = xfs_file_dio_aio_write(iocb, from);
> +		ret = xfs_file_dio_write(iocb, from);

If we change this naming it would be nice to throw in a patch to
also change the read side as well.


* Re: [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio
  2021-01-12  1:07 ` [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio Dave Chinner
@ 2021-01-12 10:42   ` Christoph Hellwig
  2021-01-12 17:01     ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2021-01-12 10:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, avi, andres

> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index bba33be17eff..f5c75404b8a5 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -408,7 +408,7 @@ xfs_file_aio_write_checks(
>  			drained_dio = true;
>  			goto restart;
>  		}
> -	
> +

Spurious unrelated whitespace change.

>  	struct iomap_dio_rw_args args = {
>  		.iocb			= iocb,
>  		.iter			= from,
>  		.ops			= &xfs_direct_write_iomap_ops,
>  		.dops			= &xfs_dio_write_ops,
>  		.wait_for_completion	= is_sync_kiocb(iocb),
> -		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
> +		.nonblocking		= true,

I think this is in many ways wrong.  As far as I can tell you want this
so that we get the imap_spans_range in xfs_direct_write_iomap_begin. But
we should not trigger any of the other checks, so we'd really need
another flag instead of reusing this one.

imap_spans_range is a bit pessimistic for avoiding the exclusive lock,
but I guess we could live with that if it is clearly documented as helping
with the implementation, but we really should not automatically trigger
all the other effects of nowait I/O.


* Re: [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio
  2021-01-12 10:42   ` Christoph Hellwig
@ 2021-01-12 17:01     ` Brian Foster
  2021-01-12 17:10       ` Christoph Hellwig
  2021-01-12 22:06       ` Dave Chinner
  0 siblings, 2 replies; 24+ messages in thread
From: Brian Foster @ 2021-01-12 17:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs, linux-fsdevel, avi, andres

On Tue, Jan 12, 2021 at 11:42:57AM +0100, Christoph Hellwig wrote:
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index bba33be17eff..f5c75404b8a5 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -408,7 +408,7 @@ xfs_file_aio_write_checks(
> >  			drained_dio = true;
> >  			goto restart;
> >  		}
> > -	
> > +
> 
> Spurious unrelated whitespace change.
> 
> >  	struct iomap_dio_rw_args args = {
> >  		.iocb			= iocb,
> >  		.iter			= from,
> >  		.ops			= &xfs_direct_write_iomap_ops,
> >  		.dops			= &xfs_dio_write_ops,
> >  		.wait_for_completion	= is_sync_kiocb(iocb),
> > -		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
> > +		.nonblocking		= true,
> 
> I think this is in many ways wrong.  As far as I can tell you want this
> so that we get the imap_spans_range in xfs_direct_write_iomap_begin. But
> we should not trigger any of the other checks, so we'd really need
> another flag instead of reusing this one.
> 

It's really the br_state != XFS_EXT_NORM check that we want for the
unaligned case, isn't it?

> imap_spans_range is a bit pessimistic for avoiding the exclusive lock,
> but I guess we could live that if it is clearly documented as helping
> with the implementation, but we really should not automatically trigger
> all the other effects of nowait I/O.
> 

Regardless, I agree on this point. I don't have a strong opinion in
general on this approach vs. the other, but it does seem odd to me to
overload the broader nowait semantics with the unaligned I/O checks. I
see that it works for the primary case we care about, but this also
means things like the _has_page() check now trigger exclusivity for the
unaligned case where that doesn't seem to be necessary. I do like the
previous cleanups so I suspect if we worked this into a new
'subblock_io' flag that indicates to the lower layer whether the
filesystem can allow zeroing, that might clean much of this up.

Brian



* Re: [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio
  2021-01-12 17:01     ` Brian Foster
@ 2021-01-12 17:10       ` Christoph Hellwig
  2021-01-12 22:06       ` Dave Chinner
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2021-01-12 17:10 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, Dave Chinner, linux-xfs, linux-fsdevel, avi, andres

On Tue, Jan 12, 2021 at 12:01:33PM -0500, Brian Foster wrote:
> > I think this is in many ways wrong.  As far as I can tell you want this
> > so that we get the imap_spans_range in xfs_direct_write_iomap_begin. But
> > we should not trigger any of the other checks, so we'd really need
> > another flag instead of reusing this one.
> > 
> 
> It's really the br_state != XFS_EXT_NORM check that we want for the
> unaligned case, isn't it?

Inherently, yes.  But if we want to avoid the extra irec lookup outside
->iomap_begin we have to limit ourselves to a single I/O; otherwise we'd
do a partial write when only the extent that the end of the write falls
into is unwritten and not block aligned.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio
  2021-01-12 17:01     ` Brian Foster
  2021-01-12 17:10       ` Christoph Hellwig
@ 2021-01-12 22:06       ` Dave Chinner
  1 sibling, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2021-01-12 22:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, avi, andres

On Tue, Jan 12, 2021 at 12:01:33PM -0500, Brian Foster wrote:
> On Tue, Jan 12, 2021 at 11:42:57AM +0100, Christoph Hellwig wrote:
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index bba33be17eff..f5c75404b8a5 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -408,7 +408,7 @@ xfs_file_aio_write_checks(
> > >  			drained_dio = true;
> > >  			goto restart;
> > >  		}
> > > -	
> > > +
> > 
> > Spurious unrelated whitespace change.
> > 
> > >  	struct iomap_dio_rw_args args = {
> > >  		.iocb			= iocb,
> > >  		.iter			= from,
> > >  		.ops			= &xfs_direct_write_iomap_ops,
> > >  		.dops			= &xfs_dio_write_ops,
> > >  		.wait_for_completion	= is_sync_kiocb(iocb),
> > > -		.nonblocking		= (iocb->ki_flags & IOCB_NOWAIT),
> > > +		.nonblocking		= true,
> > 
> > I think this is in many ways wrong.  As far as I can tell you want this
> > so that we get the imap_spans_range in xfs_direct_write_iomap_begin. But
> > we should not trigger any of the other checks, so we'd really need
> > another flag instead of reusing this one.
> > 
> 
> It's really the br_state != XFS_EXT_NORM check that we want for the
> unaligned case, isn't it?

We can only submit unaligned DIO with a shared IOLOCK to a written
range, which means we need to abort the IO if we hit a COW range
(imap_needs_cow()), a hole (imap_needs_alloc()), a range spanning
multiple extents (imap_spans_range()) or, finally, an unwritten
extent (the new check I added).

IOMAP_NOWAIT aborts on all these cases and returns EAGAIN.

> > imap_spans_range is a bit pessimistic for avoiding the exclusive lock,

No, it's absolutely required.

If the sub-block aligned dio spans multiple extents, we don't know
what locking is required for that next extent until iomap_apply()
loops and calls us again for that range. While the first range might
be written and OK to issue, the next extent range could
require allocation, COW or unwritten extent conversion and so would
require exclusive IO locking.  And so we end up with partial IO
submission, which causes all sorts of problems...

IOWs, if the unaligned dio cannot be mapped to a single written
extent, we can't do it under shared locking conditions - it must be
done under exclusive locking to maintain the "no partial submission"
rules we have for DIO.

> > but I guess we could live with that if it is clearly documented as helping
> > with the implementation, but we really should not automatically trigger
> > all the other effects of nowait I/O.
> 
> Regardless, I agree on this point.

The only thing that IOMAP_NOWAIT does that might be questionable is
the xfs_ilock_nowait() call on the ILOCK. We want it to abort shared
IO if we don't have the extents read in - Christoph's patch made
this trigger exclusive IO, too, and so of all the things that
IOMAP_NOWAIT triggers, the -only thing- we can raise a question
about is the trylock.

And, quite frankly, if something is modifying the inode metadata
while we are trying to do sub-block DIO, I want the sub-block DIO to
fall back to exclusive locking just to be safe. It may not be
necessary, but right now I'd prefer to err on the side of caution
and be conservative about when this optimisation triggers. If we get
it wrong, we corrupt data....

> I don't have a strong opinion in general on this approach vs. the
> other, but it does seem odd to me to overload the broader nowait
> semantics with the unaligned I/O checks. I see that it works for
> the primary case we care about, but this also means things like
> the _has_page() check now trigger exclusivity for the unaligned
> case where that doesn't seem to be necessary.

Actually, it's another case of being safe rather than sorry. If the
sub-block DIO is racing with mmap or write() dirtying the page that
spans the DIO range, we end up issuing concurrent IOs to the same
LBA range, something that results in undefined behaviour and is
something we must absolutely not do.

That is:

	DIO	(1024, 512)
		submit_bio (1024, 512)
		.....
	mmap
		(0, 4096)
		touch byte 0
		page dirty

	DIO	(2048, 512)
		filemap_write_and_wait_range(2048, 512)
		submit_bio(0, 4096)
		.....

and now we have overlapping concurrent IO in flight even though
userspace has not done any overlapping modifications at all.
Overlapping IO should never be issued by the filesystem as the
result is undefined. Yes, the application should not be mixing
mmap+DIO, but the filesystem in this case is doing something even
worse and something we tell userspace developers that *they should
never do*. We can trivially avoid this corruption case by falling
back to exclusive locking for subblock dio if writeback and/or page
cache invalidation may be required.

IOWs, IOMAP_NOWAIT gives us exactly the behaviour we need here for
serialising concurrent sub-block dio against page cache based IO...

> I do like the
> previous cleanups so I suspect if we worked this into a new
> 'subblock_io' flag that indicates to the lower layer whether the
> filesystem can allow zeroing, that might clean much of this up.

Allow zeroing where, exactly? e.g. some filesystems do zeroing in
their allocation routines during mapping. IOWs, this strikes me as
encoding specific filesystem implementation requirements into the
generic API as opposed to using generic functionality to implement
specific FS behavioural requirements.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] xfs: reduce sub-block DIO serialisation
  2021-01-12  8:01 ` [RFC] xfs: reduce sub-block DIO serialisation Avi Kivity
@ 2021-01-12 22:13   ` Dave Chinner
  2021-01-13  8:00     ` Avi Kivity
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2021-01-12 22:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs, linux-fsdevel, andres

On Tue, Jan 12, 2021 at 10:01:35AM +0200, Avi Kivity wrote:
> On 1/12/21 3:07 AM, Dave Chinner wrote:
> > Hi folks,
> > 
> > This is the XFS implementation on the sub-block DIO optimisations
> > for written extents that I've mentioned on #xfs and a couple of
> > times now on the XFS mailing list.
> > 
> > It takes the approach of using the IOMAP_NOWAIT non-blocking
> > IO submission infrastructure to optimistically dispatch sub-block
> > DIO without exclusive locking. If the extent mapping callback
> > decides that it can't do the unaligned IO without extent
> > manipulation, sub-block zeroing, blocking or splitting the IO into
> > multiple parts, it aborts the IO with -EAGAIN. This allows the high
> > level filesystem code to then take exclusive locks and resubmit the
> > IO once it has guaranteed no other IO is in progress on the inode
> > (the current implementation).
> 
> 
> Can you expand on the no-splitting requirement? Does it involve only
> splitting by XFS (IO spans >1 extents) or lower layers (RAID)?

XFS only.

> The reason I'm concerned is that it's the constraint that the application
> has least control over. I guess I could use RWF_NOWAIT to avoid blocking my
> main thread (but last time I tried I'd get occasional EIOs that frightened
> me off that).

Spurious EIO from RWF_NOWAIT is a bug that needs to be fixed. Do you
have any details?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] xfs: reduce sub-block DIO serialisation
  2021-01-12 22:13   ` Dave Chinner
@ 2021-01-13  8:00     ` Avi Kivity
  2021-01-13 20:38       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2021-01-13  8:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, andres

On 1/13/21 12:13 AM, Dave Chinner wrote:
> On Tue, Jan 12, 2021 at 10:01:35AM +0200, Avi Kivity wrote:
>> On 1/12/21 3:07 AM, Dave Chinner wrote:
>>> Hi folks,
>>>
>>> This is the XFS implementation on the sub-block DIO optimisations
>>> for written extents that I've mentioned on #xfs and a couple of
>>> times now on the XFS mailing list.
>>>
>>> It takes the approach of using the IOMAP_NOWAIT non-blocking
>>> IO submission infrastructure to optimistically dispatch sub-block
>>> DIO without exclusive locking. If the extent mapping callback
>>> decides that it can't do the unaligned IO without extent
>>> manipulation, sub-block zeroing, blocking or splitting the IO into
>>> multiple parts, it aborts the IO with -EAGAIN. This allows the high
>>> level filesystem code to then take exclusive locks and resubmit the
>>> IO once it has guaranteed no other IO is in progress on the inode
>>> (the current implementation).
>>
>> Can you expand on the no-splitting requirement? Does it involve only
>> splitting by XFS (IO spans >1 extents) or lower layers (RAID)?
> XFS only.


Ok, that is somewhat under control as I can provide an extent hint, and 
wish really hard that the filesystem isn't fragmented.


>> The reason I'm concerned is that it's the constraint that the application
>> has least control over. I guess I could use RWF_NOWAIT to avoid blocking my
>> main thread (but last time I tried I'd get occasional EIOs that frightened
>> me off that).
> Spurious EIO from RWF_NOWAIT is a bug that needs to be fixed. Do you
> have any details?
>

I reported it in [1]. It's long since gone since I disabled RWF_NOWAIT. 
It was relatively rare, sometimes happening in continuous integration 
runs that take hours, and sometimes not.


I expect it's fixed by now since io_uring relies on it. Maybe I should 
turn it on for kernels > some_random_version.


[1] 
https://lore.kernel.org/lkml/9bab0f40-5748-f147-efeb-5aac4fd44533@scylladb.com/t/#u


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] xfs: reduce sub-block DIO serialisation
  2021-01-13  8:00     ` Avi Kivity
@ 2021-01-13 20:38       ` Dave Chinner
  2021-01-14  6:48         ` Avi Kivity
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2021-01-13 20:38 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs, linux-fsdevel, andres

On Wed, Jan 13, 2021 at 10:00:37AM +0200, Avi Kivity wrote:
> On 1/13/21 12:13 AM, Dave Chinner wrote:
> > On Tue, Jan 12, 2021 at 10:01:35AM +0200, Avi Kivity wrote:
> > > On 1/12/21 3:07 AM, Dave Chinner wrote:
> > > > Hi folks,
> > > > 
> > > > This is the XFS implementation on the sub-block DIO optimisations
> > > > for written extents that I've mentioned on #xfs and a couple of
> > > > times now on the XFS mailing list.
> > > > 
> > > > It takes the approach of using the IOMAP_NOWAIT non-blocking
> > > > IO submission infrastructure to optimistically dispatch sub-block
> > > > DIO without exclusive locking. If the extent mapping callback
> > > > decides that it can't do the unaligned IO without extent
> > > > manipulation, sub-block zeroing, blocking or splitting the IO into
> > > > multiple parts, it aborts the IO with -EAGAIN. This allows the high
> > > > level filesystem code to then take exclusive locks and resubmit the
> > > > IO once it has guaranteed no other IO is in progress on the inode
> > > > (the current implementation).
> > > 
> > > Can you expand on the no-splitting requirement? Does it involve only
> > > splitting by XFS (IO spans >1 extents) or lower layers (RAID)?
> > XFS only.
> 
> 
> Ok, that is somewhat under control as I can provide an extent hint, and wish
> really hard that the filesystem isn't fragmented.
> 
> 
> > > The reason I'm concerned is that it's the constraint that the application
> > > has least control over. I guess I could use RWF_NOWAIT to avoid blocking my
> > > main thread (but last time I tried I'd get occasional EIOs that frightened
> > > me off that).
> > Spurious EIO from RWF_NOWAIT is a bug that needs to be fixed. Do you
> > have any details?
> > 
> 
> I reported it in [1]. It's long since gone since I disabled RWF_NOWAIT. It
> was relatively rare, sometimes happening in continuous integration runs that
> take hours, and sometimes not.
> 
> 
> I expect it's fixed by now since io_uring relies on it. Maybe I should turn
> it on for kernels > some_random_version.
> 
> 
> [1] https://lore.kernel.org/lkml/9bab0f40-5748-f147-efeb-5aac4fd44533@scylladb.com/t/#u

Yeah, as I thought. Usage of REQ_NOWAIT with filesystem based IO is
simply broken - it causes spurious IO failures to be reported to IO
completion callbacks, which are very difficult to track and/or
retry. iomap does not use REQ_NOWAIT at all, so you should not ever
see this from XFS or ext4 DIO anymore...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] xfs: reduce sub-block DIO serialisation
  2021-01-13 20:38       ` Dave Chinner
@ 2021-01-14  6:48         ` Avi Kivity
  2021-01-17 21:34           ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2021-01-14  6:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, andres

On 1/13/21 10:38 PM, Dave Chinner wrote:
> On Wed, Jan 13, 2021 at 10:00:37AM +0200, Avi Kivity wrote:
>> On 1/13/21 12:13 AM, Dave Chinner wrote:
>>> On Tue, Jan 12, 2021 at 10:01:35AM +0200, Avi Kivity wrote:
>>>> On 1/12/21 3:07 AM, Dave Chinner wrote:
>>>>> Hi folks,
>>>>>
>>>>> This is the XFS implementation on the sub-block DIO optimisations
>>>>> for written extents that I've mentioned on #xfs and a couple of
>>>>> times now on the XFS mailing list.
>>>>>
>>>>> It takes the approach of using the IOMAP_NOWAIT non-blocking
>>>>> IO submission infrastructure to optimistically dispatch sub-block
>>>>> DIO without exclusive locking. If the extent mapping callback
>>>>> decides that it can't do the unaligned IO without extent
>>>>> manipulation, sub-block zeroing, blocking or splitting the IO into
>>>>> multiple parts, it aborts the IO with -EAGAIN. This allows the high
>>>>> level filesystem code to then take exclusive locks and resubmit the
>>>>> IO once it has guaranteed no other IO is in progress on the inode
>>>>> (the current implementation).
>>>> Can you expand on the no-splitting requirement? Does it involve only
>>>> splitting by XFS (IO spans >1 extents) or lower layers (RAID)?
>>> XFS only.
>>
>> Ok, that is somewhat under control as I can provide an extent hint, and wish
>> really hard that the filesystem isn't fragmented.
>>
>>
>>>> The reason I'm concerned is that it's the constraint that the application
>>>> has least control over. I guess I could use RWF_NOWAIT to avoid blocking my
>>>> main thread (but last time I tried I'd get occasional EIOs that frightened
>>>> me off that).
>>> Spurious EIO from RWF_NOWAIT is a bug that needs to be fixed. Do you
>>> have any details?
>>>
>> I reported it in [1]. It's long since gone since I disabled RWF_NOWAIT. It
>> was relatively rare, sometimes happening in continuous integration runs that
>> take hours, and sometimes not.
>>
>>
>> I expect it's fixed by now since io_uring relies on it. Maybe I should turn
>> it on for kernels > some_random_version.
>>
>>
>> [1] https://lore.kernel.org/lkml/9bab0f40-5748-f147-efeb-5aac4fd44533@scylladb.com/t/#u
> Yeah, as I thought. Usage of REQ_NOWAIT with filesystem based IO is
> simply broken - it causes spurious IO failures to be reported to IO
> completion callbacks, which are very difficult to track and/or
> retry. iomap does not use REQ_NOWAIT at all, so you should not ever
> see this from XFS or ext4 DIO anymore...


What kernel version would be good?


Searching the log I found


commit 4503b7676a2e0abe69c2f2c0d8b03aec53f2f048
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Jun 1 10:00:27 2020 -0600

     io_uring: catch -EIO from buffered issue request failure

     -EIO bubbles up like -EAGAIN if we fail to allocate a request at the
     lower level. Play it safe and treat it like -EAGAIN in terms of sync
     retry, to avoid passing back an errant -EIO.

     Catch some of these early for block based file, as non-mq devices
     generally do not support NOWAIT. That saves us some overhead by
     not first trying, then retrying from async context. We can go straight
     to async punt instead.

     Signed-off-by: Jens Axboe <axboe@kernel.dk>

but this looks to be an io_uring-specific fix (somewhat frightening
too), not a removal of REQ_NOWAIT.




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] xfs: reduce sub-block DIO serialisation
  2021-01-14  6:48         ` Avi Kivity
@ 2021-01-17 21:34           ` Dave Chinner
  2021-01-18  7:41             ` Avi Kivity
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2021-01-17 21:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-xfs, linux-fsdevel, andres

On Thu, Jan 14, 2021 at 08:48:36AM +0200, Avi Kivity wrote:
> On 1/13/21 10:38 PM, Dave Chinner wrote:
> > On Wed, Jan 13, 2021 at 10:00:37AM +0200, Avi Kivity wrote:
> > > On 1/13/21 12:13 AM, Dave Chinner wrote:
> > > > On Tue, Jan 12, 2021 at 10:01:35AM +0200, Avi Kivity wrote:
> > > > > On 1/12/21 3:07 AM, Dave Chinner wrote:
> > > > > > Hi folks,
> > > > > > 
> > > > > > This is the XFS implementation on the sub-block DIO optimisations
> > > > > > for written extents that I've mentioned on #xfs and a couple of
> > > > > > times now on the XFS mailing list.
> > > > > > 
> > > > > > It takes the approach of using the IOMAP_NOWAIT non-blocking
> > > > > > IO submission infrastructure to optimistically dispatch sub-block
> > > > > > DIO without exclusive locking. If the extent mapping callback
> > > > > > decides that it can't do the unaligned IO without extent
> > > > > > manipulation, sub-block zeroing, blocking or splitting the IO into
> > > > > > multiple parts, it aborts the IO with -EAGAIN. This allows the high
> > > > > > level filesystem code to then take exclusive locks and resubmit the
> > > > > > IO once it has guaranteed no other IO is in progress on the inode
> > > > > > (the current implementation).
> > > > > Can you expand on the no-splitting requirement? Does it involve only
> > > > > splitting by XFS (IO spans >1 extents) or lower layers (RAID)?
> > > > XFS only.
> > > 
> > > Ok, that is somewhat under control as I can provide an extent hint, and wish
> > > really hard that the filesystem isn't fragmented.
> > > 
> > > 
> > > > > The reason I'm concerned is that it's the constraint that the application
> > > > > has least control over. I guess I could use RWF_NOWAIT to avoid blocking my
> > > > > main thread (but last time I tried I'd get occasional EIOs that frightened
> > > > > me off that).
> > > > Spurious EIO from RWF_NOWAIT is a bug that needs to be fixed. Do you
> > > > have any details?
> > > > 
> > > I reported it in [1]. It's long since gone since I disabled RWF_NOWAIT. It
> > > was relatively rare, sometimes happening in continuous integration runs that
> > > take hours, and sometimes not.
> > > 
> > > 
> > > I expect it's fixed by now since io_uring relies on it. Maybe I should turn
> > > it on for kernels > some_random_version.
> > > 
> > > 
> > > [1] https://lore.kernel.org/lkml/9bab0f40-5748-f147-efeb-5aac4fd44533@scylladb.com/t/#u
> > Yeah, as I thought. Usage of REQ_NOWAIT with filesystem based IO is
> > simply broken - it causes spurious IO failures to be reported to IO
> > completion callbacks, which are very difficult to track and/or
> > retry. iomap does not use REQ_NOWAIT at all, so you should not ever
> > see this from XFS or ext4 DIO anymore...
> 
> What kernel version would be good?

For ext4? >= 5.5, when it was converted to the iomap DIO path,
should be safe.  Before that it would use the old DIO path, which
sets REQ_NOWAIT when IOCB_NOWAIT (i.e. RWF_NOWAIT) was set for the
IO.

Btrfs is an even more recent convert to iomap-based dio (5.9?).

The REQ_NOWAIT behaviour was introduced into the old DIO path back
in 4.13 by commit 03a07c92a9ed ("block: return on congested block
device") and was intended to support RWF_NOWAIT on raw block
devices.  Hence it was not added to the iomap path as block devices
don't use that path.

Another example of how REQ_NOWAIT breaks filesystems was an io_uring
hack that forced REQ_NOWAIT IO behaviour through filesystems via
"nowait block plugs", resulting in XFS filesystem shutdowns because
of unexpected IO errors during journal writes:

https://lore.kernel.org/linux-xfs/20200915113327.GA1554921@bfoster/

There have been patches proposed to add REQ_NOWAIT to the iomap DIO
code, but they've all been NACKed because they will break
filesystem-based RWF_NOWAIT DIO.

So, long story short: On XFS you are fine on all kernels. On all
other block based filesystems you need <4.13, except for ext4 where
>= 5.5 and btrfs where >=5.9 will work correctly.

> commit 4503b7676a2e0abe69c2f2c0d8b03aec53f2f048
> Author: Jens Axboe <axboe@kernel.dk>
> Date:   Mon Jun 1 10:00:27 2020 -0600
> 
>     io_uring: catch -EIO from buffered issue request failure
> 
>     -EIO bubbles up like -EAGAIN if we fail to allocate a request at the
>     lower level. Play it safe and treat it like -EAGAIN in terms of sync
>     retry, to avoid passing back an errant -EIO.
> 
>     Catch some of these early for block based file, as non-mq devices
>     generally do not support NOWAIT. That saves us some overhead by
>     not first trying, then retrying from async context. We can go straight
>     to async punt instead.
> 
>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> but this looks to be an io_uring-specific fix (somewhat frightening too), not
> a removal of REQ_NOWAIT.

That looks like a similar case to the one I mention above where
io_uring and REQ_NOWAIT aren't playing well with others....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] xfs: reduce sub-block DIO serialisation
       [not found] ` <CACz=WechdgSnVHQsg0LKjMiG8kHLujBshmc270yrdjxfpffmDQ@mail.gmail.com>
@ 2021-01-17 21:36   ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2021-01-17 21:36 UTC (permalink / raw)
  To: Raphael Carvalho; +Cc: linux-xfs, linux-fsdevel, Avi Kivity, andres

On Fri, Jan 15, 2021 at 03:45:14PM -0300, Raphael Carvalho wrote:
> On Tue, Jan 12, 2021 at 7:46 AM Dave Chinner <david@fromorbit.com> wrote:
> 
> > Hi folks,
> >
> > This is the XFS implementation on the sub-block DIO optimisations
> > for written extents that I've mentioned on #xfs and a couple of
> > times now on the XFS mailing list.
> >
> > It takes the approach of using the IOMAP_NOWAIT non-blocking
> > IO submission infrastructure to optimistically dispatch sub-block
> > DIO without exclusive locking. If the extent mapping callback
> > decides that it can't do the unaligned IO without extent
> > manipulation, sub-block zeroing, blocking or splitting the IO into
> > multiple parts, it aborts the IO with -EAGAIN. This allows the high
> > level filesystem code to then take exclusive locks and resubmit the
> > IO once it has guaranteed no other IO is in progress on the inode
> > (the current implementation).
> >
> 
> I like this optimistic approach very much. One question though: if the
> application submits IO with RWF_NOWAIT, then this fallback step will be
> avoided and the application will receive EAGAIN, right?

Yes, all the proposed patches do this correctly.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] xfs: reduce sub-block DIO serialisation
  2021-01-17 21:34           ` Dave Chinner
@ 2021-01-18  7:41             ` Avi Kivity
  0 siblings, 0 replies; 24+ messages in thread
From: Avi Kivity @ 2021-01-18  7:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, andres

On 1/17/21 11:34 PM, Dave Chinner wrote:
> On Thu, Jan 14, 2021 at 08:48:36AM +0200, Avi Kivity wrote:
>> On 1/13/21 10:38 PM, Dave Chinner wrote:
>>> On Wed, Jan 13, 2021 at 10:00:37AM +0200, Avi Kivity wrote:
>>>> On 1/13/21 12:13 AM, Dave Chinner wrote:
>>>>> On Tue, Jan 12, 2021 at 10:01:35AM +0200, Avi Kivity wrote:
>>>>>> On 1/12/21 3:07 AM, Dave Chinner wrote:
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> This is the XFS implementation on the sub-block DIO optimisations
>>>>>>> for written extents that I've mentioned on #xfs and a couple of
>>>>>>> times now on the XFS mailing list.
>>>>>>>
>>>>>>> It takes the approach of using the IOMAP_NOWAIT non-blocking
>>>>>>> IO submission infrastructure to optimistically dispatch sub-block
>>>>>>> DIO without exclusive locking. If the extent mapping callback
>>>>>>> decides that it can't do the unaligned IO without extent
>>>>>>> manipulation, sub-block zeroing, blocking or splitting the IO into
>>>>>>> multiple parts, it aborts the IO with -EAGAIN. This allows the high
>>>>>>> level filesystem code to then take exclusive locks and resubmit the
>>>>>>> IO once it has guaranteed no other IO is in progress on the inode
>>>>>>> (the current implementation).
>>>>>> Can you expand on the no-splitting requirement? Does it involve only
>>>>>> splitting by XFS (IO spans >1 extents) or lower layers (RAID)?
>>>>> XFS only.
>>>> Ok, that is somewhat under control as I can provide an extent hint, and wish
>>>> really hard that the filesystem isn't fragmented.
>>>>
>>>>
>>>>>> The reason I'm concerned is that it's the constraint that the application
>>>>>> has least control over. I guess I could use RWF_NOWAIT to avoid blocking my
>>>>>> main thread (but last time I tried I'd get occasional EIOs that frightened
>>>>>> me off that).
>>>>> Spurious EIO from RWF_NOWAIT is a bug that needs to be fixed. Do you
>>>>> have any details?
>>>>>
>>>> I reported it in [1]. It's long since gone since I disabled RWF_NOWAIT. It
>>>> was relatively rare, sometimes happening in continuous integration runs that
>>>> take hours, and sometimes not.
>>>>
>>>>
>>>> I expect it's fixed by now since io_uring relies on it. Maybe I should turn
>>>> it on for kernels > some_random_version.
>>>>
>>>>
>>>> [1] https://lore.kernel.org/lkml/9bab0f40-5748-f147-efeb-5aac4fd44533@scylladb.com/t/#u
>>> Yeah, as I thought. Usage of REQ_NOWAIT with filesystem based IO is
>>> simply broken - it causes spurious IO failures to be reported to IO
>>> completion callbacks, which are very difficult to track and/or
>>> retry. iomap does not use REQ_NOWAIT at all, so you should not ever
>>> see this from XFS or ext4 DIO anymore...
>> What kernel version would be good?
> For ext4? >= 5.5 was when it was converted to the iomap DIO path
> should be safe.  Before that it would use the old DIO path which
> sets REQ_NOWAIT when IOCB_NOWAIT (i.e. RWF_NOWAIT) was set for the
> IO.
>
> Btrfs is an even more recent convert to iomap-based dio (5.9?).
>
> The REQ_NOWAIT behaviour was introduced into the old DIO path back
> in 4.13 by commit 03a07c92a9ed ("block: return on congested block
> device") and was intended to support RWF_NOWAIT on raw block
> devices.  Hence it was not added to the iomap path as block devices
> don't use that path.
>
> Other examples of how REQ_NOWAIT breaks filesystems was an io_uring
> hack to force REQ_NOWAIT IO behaviour through filesystems via
> "nowait block plugs" resulted in XFS filesystem shutdowns because
> of unexpected IO errors during journal writes:
>
> https://lore.kernel.org/linux-xfs/20200915113327.GA1554921@bfoster/
>
> There have been patches proposed to add REQ_NOWAIT to the iomap DIO
> code, but they've all been NACKed because of the fact it
> will break filesystem-based RWF_NOWAIT DIO.
>
> So, long story short: On XFS you are fine on all kernels. On all
> other block based filesystems you need <4.13, except for ext4 where
> >= 5.5 and btrfs where >=5.9 will work correctly.


My report mentions XFS, though it was so long ago that I'm willing to
treat it as measurement error. I'll incorporate these numbers into the
code, and we'll see. Luckily I was already forced to have filesystem
specific code, so the ugliness is already there.




^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-01-18  7:42 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-12  1:07 [RFC] xfs: reduce sub-block DIO serialisation Dave Chinner
2021-01-12  1:07 ` [PATCH 1/6] iomap: convert iomap_dio_rw() to an args structure Dave Chinner
2021-01-12  1:22   ` Damien Le Moal
2021-01-12  1:40   ` Darrick J. Wong
2021-01-12  1:53     ` Dave Chinner
2021-01-12 10:31   ` Christoph Hellwig
2021-01-12  1:07 ` [PATCH 2/6] iomap: move DIO NOWAIT setup up into filesystems Dave Chinner
2021-01-12  1:07 ` [PATCH 3/6] xfs: factor out a xfs_ilock_iocb helper Dave Chinner
2021-01-12  1:07 ` [PATCH 4/6] xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware Dave Chinner
2021-01-12  1:07 ` [PATCH 5/6] xfs: split unaligned DIO write code out Dave Chinner
2021-01-12 10:37   ` Christoph Hellwig
2021-01-12  1:07 ` [PATCH 6/6] xfs: reduce exclusive locking on unaligned dio Dave Chinner
2021-01-12 10:42   ` Christoph Hellwig
2021-01-12 17:01     ` Brian Foster
2021-01-12 17:10       ` Christoph Hellwig
2021-01-12 22:06       ` Dave Chinner
2021-01-12  8:01 ` [RFC] xfs: reduce sub-block DIO serialisation Avi Kivity
2021-01-12 22:13   ` Dave Chinner
2021-01-13  8:00     ` Avi Kivity
2021-01-13 20:38       ` Dave Chinner
2021-01-14  6:48         ` Avi Kivity
2021-01-17 21:34           ` Dave Chinner
2021-01-18  7:41             ` Avi Kivity
     [not found] ` <CACz=WechdgSnVHQsg0LKjMiG8kHLujBshmc270yrdjxfpffmDQ@mail.gmail.com>
2021-01-17 21:36   ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).