Linux-ext4 Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/5] ext4: direct IO via iomap infrastructure
@ 2019-08-12 12:52 Matthew Bobrowski
  2019-08-12 12:52 ` [PATCH 1/5] ext4: introduce direct IO read code path using " Matthew Bobrowski
                   ` (5 more replies)
  0 siblings, 6 replies; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-12 12:52 UTC (permalink / raw)
  To: linux-ext4; +Cc: linux-fsdevel, jack, tytso, riteshh

This patch series converts the ext4 direct IO code paths to make use of the
iomap infrastructure and removes the old buffer_head direct-io based
implementation. The result is that ext4 is converted to the newer framework
and that it may _possibly_ gain a performance boost for O_SYNC | O_DIRECT IO.

These changes have been tested using xfstests in both DAX and non-DAX modes
using various configurations i.e. 4k, dioread_nolock, dax.

Matthew Bobrowski (5):
  ext4: introduce direct IO read code path using iomap infrastructure
  ext4: move inode extension/truncate code out from ext4_iomap_end()
  iomap: modify ->end_io() calling convention
  ext4: introduce direct IO write code path using iomap infrastructure
  ext4: clean up redundant buffer_head direct IO code

 fs/ext4/ext4.h        |   3 -
 fs/ext4/extents.c     |   8 +-
 fs/ext4/file.c        | 329 +++++++++++++++++++++++++++-------
 fs/ext4/inode.c       | 488 +++++---------------------------------------------
 fs/iomap/direct-io.c  |   9 +-
 fs/xfs/xfs_file.c     |  17 +-
 include/linux/iomap.h |   4 +-
 7 files changed, 322 insertions(+), 536 deletions(-)

-- 
2.16.4


-- 
Matthew Bobrowski

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/5] ext4: introduce direct IO read code path using iomap infrastructure
  2019-08-12 12:52 [PATCH 0/5] ext4: direct IO via iomap infrastructure Matthew Bobrowski
@ 2019-08-12 12:52 ` " Matthew Bobrowski
  2019-08-12 17:18   ` Christoph Hellwig
  2019-08-12 12:52 ` [PATCH 2/5] ext4: move inode extension/truncate code out from ext4_iomap_end() Matthew Bobrowski
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-12 12:52 UTC (permalink / raw)
  To: linux-ext4; +Cc: linux-fsdevel, jack, tytso, riteshh

This patch introduces a new direct IO read code path implementation
that makes use of the iomap infrastructure.

The new function ext4_dio_read_iter() is responsible for calling into the
iomap infrastructure via iomap_dio_rw(). If the inode in question does not
pass preliminary checks in ext4_dio_checks(), then we simply fallback to
buffered IO and try to take that path. Prior to doing so, we drop the
IOCB_DIRECT flag from iocb->ki_flags to prevent generic_file_read_iter() from
taking the direct IO code path once again.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
---
 fs/ext4/file.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 53 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 70b0438dbc94..360eff7b6aa2 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -34,6 +34,53 @@
 #include "xattr.h"
 #include "acl.h"
 
+static bool ext4_dio_checks(struct inode *inode)
+{
+#ifdef CONFIG_FS_ENCRYPTION
+	if (IS_ENCRYPTED(inode))
+		return false;
+#endif
+	if (ext4_should_journal_data(inode))
+		return false;
+	if (ext4_has_inline_data(inode))
+		return false;
+	return true;
+}
+
+static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	ssize_t ret;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	/*
+	 * Get exclusion from truncate and other inode operations.
+	 */
+	if (!inode_trylock_shared(inode)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		inode_lock_shared(inode);
+	}
+
+	if (!ext4_dio_checks(inode)) {
+		inode_unlock_shared(inode);
+		/*
+		 * Fallback to buffered IO if the operation being
+		 * performed on the inode is not supported by direct
+		 * IO. The IOCB_DIRECT flag from iocb->ki_flags needs
+		 * to be cleared here to ensure that the direct IO
+		 * code path in generic_file_read_iter() is not taken.
+		 */
+		iocb->ki_flags &= ~IOCB_DIRECT;
+		return generic_file_read_iter(iocb, to);
+	}
+
+	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL);
+	inode_unlock_shared(inode);
+
+	file_accessed(iocb->ki_filp);
+	return ret;
+}
+
 #ifdef CONFIG_FS_DAX
 static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
@@ -64,16 +111,20 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
-	if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
 		return -EIO;
 
 	if (!iov_iter_count(to))
 		return 0; /* skip atime */
 
 #ifdef CONFIG_FS_DAX
-	if (IS_DAX(file_inode(iocb->ki_filp)))
+	if (IS_DAX(inode))
 		return ext4_dax_read_iter(iocb, to);
 #endif
+	if (iocb->ki_flags & IOCB_DIRECT)
+		return ext4_dio_read_iter(iocb, to);
 	return generic_file_read_iter(iocb, to);
 }
 
-- 
2.16.4


-- 
Matthew Bobrowski

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 2/5] ext4: move inode extension/truncate code out from ext4_iomap_end()
  2019-08-12 12:52 [PATCH 0/5] ext4: direct IO via iomap infrastructure Matthew Bobrowski
  2019-08-12 12:52 ` [PATCH 1/5] ext4: introduce direct IO read code path using " Matthew Bobrowski
@ 2019-08-12 12:52 ` Matthew Bobrowski
  2019-08-12 17:18   ` Christoph Hellwig
  2019-08-28 19:59   ` Jan Kara
  2019-08-12 12:53 ` [PATCH 3/5] iomap: modify ->end_io() calling convention Matthew Bobrowski
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-12 12:52 UTC (permalink / raw)
  To: linux-ext4; +Cc: linux-fsdevel, jack, tytso, riteshh

In preparation for implementing the direct IO write code path modifications
that make us of iomap infrastructure we need to move out the inode
extension/truncate code from ext4_iomap_end() callback. For direct IO, if the
current code remained it would behave incorrectly. If we update the inode size
prior to converting unwritten extents we run the risk of allowing a racing
direct IO read operation to find unwritten extents before they are converted.

The inode extension/truncate has been moved out into a new function
ext4_handle_inode_extension(). This will be used by both direct IO and DAX
code paths if the write results with the inode being extended.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
---
 fs/ext4/file.c  | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/ext4/inode.c | 48 +--------------------------------------------
 2 files changed, 60 insertions(+), 48 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 360eff7b6aa2..7470800c63b7 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -33,6 +33,7 @@
 #include "ext4_jbd2.h"
 #include "xattr.h"
 #include "acl.h"
+#include "truncate.h"
 
 static bool ext4_dio_checks(struct inode *inode)
 {
@@ -234,12 +235,62 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	return iov_iter_count(from);
 }
 
+static int ext4_handle_inode_extension(struct inode *inode, loff_t size,
+				       size_t count)
+{
+	handle_t *handle;
+	bool truncate = false;
+	ext4_lblk_t written_blk, end_blk;
+	int ret = 0, blkbits = inode->i_blkbits;
+
+	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto orphan_del;
+	}
+
+	if (ext4_update_inode_size(inode, size))
+		ext4_mark_inode_dirty(handle, inode);
+
+	/*
+	 * We may need truncate allocated but not written blocks
+	 * beyond EOF.
+	 */
+	written_blk = ALIGN(size, 1 << blkbits);
+	end_blk = ALIGN(size + count, 1 << blkbits);
+	if (written_blk < end_blk && ext4_can_truncate(inode))
+		truncate = true;
+
+	/*
+	 * Remove the inode from the orphan list if it has been
+	 * extended and everything went OK.
+	 */
+	if (!truncate && inode->i_nlink)
+		ext4_orphan_del(handle, inode);
+	ext4_journal_stop(handle);
+
+	if (truncate) {
+		ext4_truncate_failed_write(inode);
+orphan_del:
+		/*
+		 * If the truncate operation failed early the inode
+		 * may still be on the orphan list. In that case, we
+		 * need try remove the inode from the linked list in
+		 * memory.
+		 */
+		if (inode->i_nlink)
+			ext4_orphan_del(NULL, inode);
+	}
+	return ret;
+}
+
 #ifdef CONFIG_FS_DAX
 static ssize_t
 ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
-	struct inode *inode = file_inode(iocb->ki_filp);
+	int err;
 	ssize_t ret;
+	struct inode *inode = file_inode(iocb->ki_filp);
 
 	if (!inode_trylock(inode)) {
 		if (iocb->ki_flags & IOCB_NOWAIT)
@@ -257,6 +308,13 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		goto out;
 
 	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
+
+	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
+		err = ext4_handle_inode_extension(inode, iocb->ki_pos,
+						  iov_iter_count(from));
+		if (err)
+			ret = err;
+	}
 out:
 	inode_unlock(inode);
 	if (ret > 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 420fe3deed39..761ce6286b05 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3601,53 +3601,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 			  ssize_t written, unsigned flags, struct iomap *iomap)
 {
-	int ret = 0;
-	handle_t *handle;
-	int blkbits = inode->i_blkbits;
-	bool truncate = false;
-
-	if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
-		return 0;
-
-	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto orphan_del;
-	}
-	if (ext4_update_inode_size(inode, offset + written))
-		ext4_mark_inode_dirty(handle, inode);
-	/*
-	 * We may need to truncate allocated but not written blocks beyond EOF.
-	 */
-	if (iomap->offset + iomap->length > 
-	    ALIGN(inode->i_size, 1 << blkbits)) {
-		ext4_lblk_t written_blk, end_blk;
-
-		written_blk = (offset + written) >> blkbits;
-		end_blk = (offset + length) >> blkbits;
-		if (written_blk < end_blk && ext4_can_truncate(inode))
-			truncate = true;
-	}
-	/*
-	 * Remove inode from orphan list if we were extending a inode and
-	 * everything went fine.
-	 */
-	if (!truncate && inode->i_nlink &&
-	    !list_empty(&EXT4_I(inode)->i_orphan))
-		ext4_orphan_del(handle, inode);
-	ext4_journal_stop(handle);
-	if (truncate) {
-		ext4_truncate_failed_write(inode);
-orphan_del:
-		/*
-		 * If truncate failed early the inode might still be on the
-		 * orphan list; we need to make sure the inode is removed from
-		 * the orphan list in that case.
-		 */
-		if (inode->i_nlink)
-			ext4_orphan_del(NULL, inode);
-	}
-	return ret;
+	return 0;
 }
 
 const struct iomap_ops ext4_iomap_ops = {
-- 
2.16.4


-- 
Matthew Bobrowski

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 3/5] iomap: modify ->end_io() calling convention
  2019-08-12 12:52 [PATCH 0/5] ext4: direct IO via iomap infrastructure Matthew Bobrowski
  2019-08-12 12:52 ` [PATCH 1/5] ext4: introduce direct IO read code path using " Matthew Bobrowski
  2019-08-12 12:52 ` [PATCH 2/5] ext4: move inode extension/truncate code out from ext4_iomap_end() Matthew Bobrowski
@ 2019-08-12 12:53 ` Matthew Bobrowski
  2019-08-12 17:18   ` Christoph Hellwig
  2019-08-12 12:53 ` [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure Matthew Bobrowski
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-12 12:53 UTC (permalink / raw)
  To: linux-ext4; +Cc: linux-fsdevel, jack, tytso, riteshh

This patch modifies the calling convention for the iomap ->end_io()
callback. Rather than passing either dio->error or dio->size as the 'size'
argument, we instead pass both dio->error and dio->size values separately.

In the instance that an error occurred during a write, we currently cannot
determine whether any blocks have been allocated beyond the current EOF and
data has subsequently been written to these blocks within the ->end_io()
callback. As a result, we cannot judge whether we should take the truncate
failed write path. Having both dio->error and dio->size will allow us to
perform such checks within this callback.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
---
 fs/iomap/direct-io.c  |  9 +++------
 fs/xfs/xfs_file.c     | 17 +++++++++--------
 include/linux/iomap.h |  4 ++--
 3 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 10517cea9682..2ccf1c6460d4 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -77,13 +77,10 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
 
-	if (dio->end_io) {
-		ret = dio->end_io(iocb,
-				dio->error ? dio->error : dio->size,
-				dio->flags);
-	} else {
+	if (dio->end_io)
+		ret = dio->end_io(iocb, dio->size, dio->error, dio->flags);
+	else
 		ret = dio->error;
-	}
 
 	if (likely(!ret)) {
 		ret = dio->size;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 28101bbc0b78..f2bc3ac4a60e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -369,21 +369,22 @@ static int
 xfs_dio_write_end_io(
 	struct kiocb		*iocb,
 	ssize_t			size,
+	ssize_t                 error,
 	unsigned		flags)
 {
 	struct inode		*inode = file_inode(iocb->ki_filp);
 	struct xfs_inode	*ip = XFS_I(inode);
 	loff_t			offset = iocb->ki_pos;
 	unsigned int		nofs_flag;
-	int			error = 0;
+	int			ret = 0;
 
 	trace_xfs_end_io_direct_write(ip, offset, size);
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
-	if (size <= 0)
-		return size;
+	if (error || !size)
+		return error ? error : size;
 
 	/*
 	 * Capture amount written on completion as we can't reliably account
@@ -399,8 +400,8 @@ xfs_dio_write_end_io(
 	nofs_flag = memalloc_nofs_save();
 
 	if (flags & IOMAP_DIO_COW) {
-		error = xfs_reflink_end_cow(ip, offset, size);
-		if (error)
+		ret = xfs_reflink_end_cow(ip, offset, size);
+		if (ret)
 			goto out;
 	}
 
@@ -411,7 +412,7 @@ xfs_dio_write_end_io(
 	 * they are converted.
 	 */
 	if (flags & IOMAP_DIO_UNWRITTEN) {
-		error = xfs_iomap_write_unwritten(ip, offset, size, true);
+		ret = xfs_iomap_write_unwritten(ip, offset, size, true);
 		goto out;
 	}
 
@@ -430,14 +431,14 @@ xfs_dio_write_end_io(
 	if (offset + size > i_size_read(inode)) {
 		i_size_write(inode, offset + size);
 		spin_unlock(&ip->i_flags_lock);
-		error = xfs_setfilesize(ip, offset, size);
+		ret = xfs_setfilesize(ip, offset, size);
 	} else {
 		spin_unlock(&ip->i_flags_lock);
 	}
 
 out:
 	memalloc_nofs_restore(nofs_flag);
-	return error;
+	return ret;
 }
 
 /*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index bc499ceae392..900284e5c06c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -188,8 +188,8 @@ sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
  */
 #define IOMAP_DIO_UNWRITTEN	(1 << 0)	/* covers unwritten extent(s) */
 #define IOMAP_DIO_COW		(1 << 1)	/* covers COW extent(s) */
-typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
-		unsigned flags);
+typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t size,
+				 ssize_t error, unsigned int flags);
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
 int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
-- 
2.16.4


-- 
Matthew Bobrowski

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-12 12:52 [PATCH 0/5] ext4: direct IO via iomap infrastructure Matthew Bobrowski
                   ` (2 preceding siblings ...)
  2019-08-12 12:53 ` [PATCH 3/5] iomap: modify ->end_io() calling convention Matthew Bobrowski
@ 2019-08-12 12:53 ` Matthew Bobrowski
  2019-08-12 17:04   ` RITESH HARJANI
                     ` (2 more replies)
  2019-08-12 12:53 ` [PATCH 5/5] ext4: clean up redundant buffer_head direct IO code Matthew Bobrowski
  2019-08-12 17:31 ` [PATCH 0/5] ext4: direct IO via iomap infrastructure RITESH HARJANI
  5 siblings, 3 replies; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-12 12:53 UTC (permalink / raw)
  To: linux-ext4; +Cc: linux-fsdevel, jack, tytso, riteshh

This patch introduces a new direct IO write code path implementation
that makes use of the iomap infrastructure.

All direct IO write operations are now passed from the ->write_iter() callback
to the new function ext4_dio_write_iter(). This function is responsible for
calling into iomap infrastructure via iomap_dio_rw(). Snippets of the direct
IO code from within ext4_file_write_iter(), such as checking whether the IO
request is unaligned asynchronous IO, or whether it will ber overwriting
allocated and initialized blocks has been moved out and into
ext4_dio_write_iter().

The block mapping flags that are passed to ext4_map_blocks() from within
ext4_dio_get_block() and friends have effectively been taken out and
introduced within the ext4_iomap_begin(). If ext4_map_blocks() happens to have
instantiated blocks beyond the i_size, then we attempt to place the inode onto
the orphan list. Despite being able to perform i_size extension checking
earlier on in the direct IO code path, it makes most sense to perform this bit
post successful block allocation.

The ->end_io() callback ext4_dio_write_end_io() is responsible for removing
the inode from the orphan list and determining if we should truncate a failed
write in the case of an error. We also convert a range of unwritten extents to
written if IOMAP_DIO_UNWRITTEN is set and perform the necessary
i_size/i_disksize extension if the iocb->ki_pos + dio->size > i_size_read(inode).

In the instance of a short write, we fallback to buffered IO and complete
whatever is left the 'iter'. Any blocks that may have been allocated in
preparation for direct IO will be reused by buffered IO, so there's no issue
with leaving allocated blocks beyond EOF.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
---
 fs/ext4/file.c  | 227 ++++++++++++++++++++++++++++++++++++++++----------------
 fs/ext4/inode.c |  42 +++++++++--
 2 files changed, 199 insertions(+), 70 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 7470800c63b7..d74576821676 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -29,6 +29,7 @@
 #include <linux/pagevec.h>
 #include <linux/uio.h>
 #include <linux/mman.h>
+#include <linux/backing-dev.h>
 #include "ext4.h"
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -218,6 +219,14 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if (ret <= 0)
 		return ret;
 
+	ret = file_remove_privs(iocb->ki_filp);
+	if (ret)
+		return 0;
+
+	ret = file_update_time(iocb->ki_filp);
+	if (ret)
+		return 0;
+
 	if (unlikely(IS_IMMUTABLE(inode)))
 		return -EPERM;
 
@@ -235,6 +244,34 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	return iov_iter_count(from);
 }
 
+static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
+					struct iov_iter *from)
+{
+	ssize_t ret;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (!inode_trylock(inode)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EOPNOTSUPP;
+		inode_lock(inode);
+	}
+
+	ret = ext4_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out;
+
+	current->backing_dev_info = inode_to_bdi(inode);
+	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
+	current->backing_dev_info = NULL;
+out:
+	inode_unlock(inode);
+	if (likely(ret > 0)) {
+		iocb->ki_pos += ret;
+		ret = generic_write_sync(iocb, ret);
+	}
+	return ret;
+}
+
 static int ext4_handle_inode_extension(struct inode *inode, loff_t size,
 				       size_t count)
 {
@@ -284,6 +321,128 @@ static int ext4_handle_inode_extension(struct inode *inode, loff_t size,
 	return ret;
 }
 
+static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
+				 ssize_t error, unsigned int flags)
+{
+	int ret = 0;
+	handle_t *handle;
+	loff_t offset = iocb->ki_pos;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (error) {
+		if (offset + size > i_size_read(inode))
+			ext4_truncate_failed_write(inode);
+
+		/*
+		 * The inode may have been placed onto the orphan list
+		 * as a result of an extension. However, an error may
+		 * have been encountered prior to being able to
+		 * complete the write operation. Perform any necessary
+		 * clean up in this case.
+		 */
+		if (!list_empty(&EXT4_I(inode)->i_orphan)) {
+			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+			if (IS_ERR(handle)) {
+				if (inode->i_nlink)
+					ext4_orphan_del(NULL, inode);
+				return PTR_ERR(handle);
+			}
+
+			if (inode->i_nlink)
+				ext4_orphan_del(handle, inode);
+			ext4_journal_stop(handle);
+		}
+		return error;
+	}
+
+	if (flags & IOMAP_DIO_UNWRITTEN) {
+		ret = ext4_convert_unwritten_extents(NULL, inode, offset, size);
+		if (ret)
+			return ret;
+	}
+
+	if (offset + size > i_size_read(inode)) {
+		ret = ext4_handle_inode_extension(inode, offset + size, 0);
+		if (ret)
+			return ret;
+	}
+	return ret;
+}
+
+static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t ret;
+	loff_t offset = iocb->ki_pos;
+	size_t count = iov_iter_count(from);
+	struct inode *inode = file_inode(iocb->ki_filp);
+	bool extend = false, overwrite = false, unaligned_aio = false;
+
+	if (!inode_trylock(inode)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		inode_lock(inode);
+	}
+
+	if (!ext4_dio_checks(inode)) {
+		inode_unlock(inode);
+		/*
+		 * Fallback to buffered IO if the operation on the
+		 * inode is not supported by direct IO.
+		 */
+		return ext4_buffered_write_iter(iocb, from);
+	}
+
+	ret = ext4_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out;
+
+	/*
+	 * Unaligned direct AIO must be serialized among each other as
+	 * the zeroing of partial blocks of two competing unaligned
+	 * AIOs can result in data corruption.
+	 */
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
+	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
+		unaligned_aio = true;
+		inode_dio_wait(inode);
+	}
+
+	/*
+	 * Determine whether the IO operation will overwrite allocated
+	 * and initialized blocks. If so, check to see whether it is
+	 * possible to take the dioread_nolock path.
+	 */
+	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
+	    ext4_should_dioread_nolock(inode)) {
+		overwrite = true;
+		downgrade_write(&inode->i_rwsem);
+	}
+
+	if (offset + count > i_size_read(inode) ||
+	    offset + count > EXT4_I(inode)->i_disksize) {
+		ext4_update_i_disksize(inode, inode->i_size);
+		extend = true;
+	}
+
+	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
+
+	/*
+	 * Unaligned direct AIO must be the only IO in flight or else
+	 * any overlapping aligned IO after unaligned IO might result
+	 * in data corruption.
+	 */
+	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
+		inode_dio_wait(inode);
+
+	if (ret >= 0 && iov_iter_count(from)) {
+		overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
+		return ext4_buffered_write_iter(iocb, from);
+	}
+out:
+	overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
+	return ret;
+}
+
 #ifdef CONFIG_FS_DAX
 static ssize_t
 ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
@@ -300,12 +459,6 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	ret = ext4_write_checks(iocb, from);
 	if (ret <= 0)
 		goto out;
-	ret = file_remove_privs(iocb->ki_filp);
-	if (ret)
-		goto out;
-	ret = file_update_time(iocb->ki_filp);
-	if (ret)
-		goto out;
 
 	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
 
@@ -327,10 +480,6 @@ static ssize_t
 ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
-	int o_direct = iocb->ki_flags & IOCB_DIRECT;
-	int unaligned_aio = 0;
-	int overwrite = 0;
-	ssize_t ret;
 
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
 		return -EIO;
@@ -339,61 +488,9 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (IS_DAX(inode))
 		return ext4_dax_write_iter(iocb, from);
 #endif
-	if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
-		return -EOPNOTSUPP;
-
-	if (!inode_trylock(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			return -EAGAIN;
-		inode_lock(inode);
-	}
-
-	ret = ext4_write_checks(iocb, from);
-	if (ret <= 0)
-		goto out;
-
-	/*
-	 * Unaligned direct AIO must be serialized among each other as zeroing
-	 * of partial blocks of two competing unaligned AIOs can result in data
-	 * corruption.
-	 */
-	if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
-	    !is_sync_kiocb(iocb) &&
-	    ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
-		unaligned_aio = 1;
-		ext4_unwritten_wait(inode);
-	}
-
-	iocb->private = &overwrite;
-	/* Check whether we do a DIO overwrite or not */
-	if (o_direct && !unaligned_aio) {
-		if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
-			if (ext4_should_dioread_nolock(inode))
-				overwrite = 1;
-		} else if (iocb->ki_flags & IOCB_NOWAIT) {
-			ret = -EAGAIN;
-			goto out;
-		}
-	}
-
-	ret = __generic_file_write_iter(iocb, from);
-	/*
-	 * Unaligned direct AIO must be the only IO in flight. Otherwise
-	 * overlapping aligned IO after unaligned might result in data
-	 * corruption.
-	 */
-	if (ret == -EIOCBQUEUED && unaligned_aio)
-		ext4_unwritten_wait(inode);
-	inode_unlock(inode);
-
-	if (ret > 0)
-		ret = generic_write_sync(iocb, ret);
-
-	return ret;
-
-out:
-	inode_unlock(inode);
-	return ret;
+	if (iocb->ki_flags & IOCB_DIRECT)
+		return ext4_dio_write_iter(iocb, from);
+	return ext4_buffered_write_iter(iocb, from);
 }
 
 #ifdef CONFIG_FS_DAX
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 761ce6286b05..9155a8a6eb0b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3533,8 +3533,38 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		if (IS_ERR(handle))
 			return PTR_ERR(handle);
 
-		ret = ext4_map_blocks(handle, inode, &map,
-				      EXT4_GET_BLOCKS_CREATE_ZERO);
+		if (IS_DAX(inode)) {
+			ret = ext4_map_blocks(handle, inode, &map,
+					      EXT4_GET_BLOCKS_CREATE_ZERO);
+		} else {
+			/*
+			 * DAX and direct IO are the only two
+			 * operations currently supported with
+			 * IOMAP_WRITE.
+			 */
+			WARN_ON(!(flags & IOMAP_DIRECT));
+			if (round_down(offset, i_blocksize(inode)) >=
+			    i_size_read(inode)) {
+				ret = ext4_map_blocks(handle, inode, &map,
+						      EXT4_GET_BLOCKS_CREATE);
+			} else if (!ext4_test_inode_flag(inode,
+							 EXT4_INODE_EXTENTS)) {
+				/*
+				 * We cannot fill holes in indirect
+				 * tree based inodes as that could
+				 * expose stale data in the case of a
+				 * crash. Use magic error code to
+				 * fallback to buffered IO.
+				 */
+				ret = ext4_map_blocks(handle, inode, &map, 0);
+				if (ret == 0)
+					ret = -ENOTBLK;
+			} else {
+				ret = ext4_map_blocks(handle, inode, &map,
+						      EXT4_GET_BLOCKS_IO_CREATE_EXT);
+			}
+		}
+
 		if (ret < 0) {
 			ext4_journal_stop(handle);
 			if (ret == -ENOSPC &&
@@ -3581,10 +3611,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
 		iomap->addr = IOMAP_NULL_ADDR;
 	} else {
-		if (map.m_flags & EXT4_MAP_MAPPED) {
-			iomap->type = IOMAP_MAPPED;
-		} else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+		if (map.m_flags & EXT4_MAP_UNWRITTEN) {
 			iomap->type = IOMAP_UNWRITTEN;
+		} else if (map.m_flags & EXT4_MAP_MAPPED) {
+			iomap->type = IOMAP_MAPPED;
 		} else {
 			WARN_ON_ONCE(1);
 			return -EIO;
@@ -3601,6 +3631,8 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 			  ssize_t written, unsigned flags, struct iomap *iomap)
 {
+	if (flags & IOMAP_DIRECT && written == 0)
+		return -ENOTBLK;
 	return 0;
 }
 
-- 
2.16.4


-- 
Matthew Bobrowski

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 5/5] ext4: clean up redundant buffer_head direct IO code
  2019-08-12 12:52 [PATCH 0/5] ext4: direct IO via iomap infrastructure Matthew Bobrowski
                   ` (3 preceding siblings ...)
  2019-08-12 12:53 ` [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure Matthew Bobrowski
@ 2019-08-12 12:53 ` Matthew Bobrowski
  2019-08-12 17:31 ` [PATCH 0/5] ext4: direct IO via iomap infrastructure RITESH HARJANI
  5 siblings, 0 replies; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-12 12:53 UTC (permalink / raw)
  To: linux-ext4; +Cc: linux-fsdevel, jack, tytso, riteshh

Clean up all the redundant buffer_head direct IO code that was implemented to
work with the direct-io infrastructure. The ->direct_IO() callbacks have also
been stubbed with noop_direct_IO() as the direct IO support is now provided by
iomap infrastructure.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
---
 fs/ext4/ext4.h    |   3 -
 fs/ext4/extents.c |   8 +-
 fs/ext4/file.c    |   7 -
 fs/ext4/inode.c   | 398 +-----------------------------------------------------
 4 files changed, 6 insertions(+), 410 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf660aa7a9e0..2ab91815f52d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1555,7 +1555,6 @@ enum {
 	EXT4_STATE_NO_EXPAND,		/* No space for expansion */
 	EXT4_STATE_DA_ALLOC_CLOSE,	/* Alloc DA blks on close */
 	EXT4_STATE_EXT_MIGRATE,		/* Inode is migrating */
-	EXT4_STATE_DIO_UNWRITTEN,	/* need convert on dio done*/
 	EXT4_STATE_NEWENTRY,		/* File just added to dir */
 	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
 	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
@@ -2522,8 +2521,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
 			     struct buffer_head *bh_result, int create);
 int ext4_get_block(struct inode *inode, sector_t iblock,
 		   struct buffer_head *bh_result, int create);
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
-		       struct buffer_head *bh_result, int create);
 int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
 			   struct buffer_head *bh, int create);
 int ext4_walk_page_buffers(handle_t *handle,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 92266a2da7d6..cfff7d34b438 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1755,13 +1755,11 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 		return 0;
 	/*
 	 * The check for IO to unwritten extent is somewhat racy as we
-	 * increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after
-	 * dropping i_data_sem. But reserved blocks should save us in that
-	 * case.
+	 * increment i_unwritten only after dropping i_data_sem. But
+	 * reserved blocks should save us in that case.
 	 */
 	if (ext4_ext_is_unwritten(ex1) &&
-	    (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
-	     atomic_read(&EXT4_I(inode)->i_unwritten) ||
+	    (atomic_read(&EXT4_I(inode)->i_unwritten) ||
 	     (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
 		return 0;
 #ifdef AGGRESSIVE_TEST
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d74576821676..3736dbf28d0a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -156,13 +156,6 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static void ext4_unwritten_wait(struct inode *inode)
-{
-	wait_queue_head_t *wq = ext4_ioend_wq(inode);
-
-	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
-}
-
 /*
  * This tests whether the IO in question is block-aligned or not.
  * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9155a8a6eb0b..15715d22808d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -826,136 +826,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
 /* Maximum number of blocks we map for direct IO at once. */
 #define DIO_MAX_BLOCKS 4096
 
-/*
- * Get blocks function for the cases that need to start a transaction -
- * generally difference cases of direct IO and DAX IO. It also handles retries
- * in case of ENOSPC.
- */
-static int ext4_get_block_trans(struct inode *inode, sector_t iblock,
-				struct buffer_head *bh_result, int flags)
-{
-	int dio_credits;
-	handle_t *handle;
-	int retries = 0;
-	int ret;
-
-	/* Trim mapping request to maximum we can map at once for DIO */
-	if (bh_result->b_size >> inode->i_blkbits > DIO_MAX_BLOCKS)
-		bh_result->b_size = DIO_MAX_BLOCKS << inode->i_blkbits;
-	dio_credits = ext4_chunk_trans_blocks(inode,
-				      bh_result->b_size >> inode->i_blkbits);
-retry:
-	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
-	if (IS_ERR(handle))
-		return PTR_ERR(handle);
-
-	ret = _ext4_get_block(inode, iblock, bh_result, flags);
-	ext4_journal_stop(handle);
-
-	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
-		goto retry;
-	return ret;
-}
-
-/* Get block function for DIO reads and writes to inodes without extents */
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
-		       struct buffer_head *bh, int create)
-{
-	/* We don't expect handle for direct IO */
-	WARN_ON_ONCE(ext4_journal_current_handle());
-
-	if (!create)
-		return _ext4_get_block(inode, iblock, bh, 0);
-	return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
-}
-
-/*
- * Get block function for AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete.
- */
-static int ext4_dio_get_block_unwritten_async(struct inode *inode,
-		sector_t iblock, struct buffer_head *bh_result,	int create)
-{
-	int ret;
-
-	/* We don't expect handle for direct IO */
-	WARN_ON_ONCE(ext4_journal_current_handle());
-
-	ret = ext4_get_block_trans(inode, iblock, bh_result,
-				   EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
-	/*
-	 * When doing DIO using unwritten extents, we need io_end to convert
-	 * unwritten extents to written on IO completion. We allocate io_end
-	 * once we spot unwritten extent and store it in b_private. Generic
-	 * DIO code keeps b_private set and furthermore passes the value to
-	 * our completion callback in 'private' argument.
-	 */
-	if (!ret && buffer_unwritten(bh_result)) {
-		if (!bh_result->b_private) {
-			ext4_io_end_t *io_end;
-
-			io_end = ext4_init_io_end(inode, GFP_KERNEL);
-			if (!io_end)
-				return -ENOMEM;
-			bh_result->b_private = io_end;
-			ext4_set_io_unwritten_flag(inode, io_end);
-		}
-		set_buffer_defer_completion(bh_result);
-	}
-
-	return ret;
-}
-
-/*
- * Get block function for non-AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete by ext4_direct_IO_write().
- */
-static int ext4_dio_get_block_unwritten_sync(struct inode *inode,
-		sector_t iblock, struct buffer_head *bh_result,	int create)
-{
-	int ret;
-
-	/* We don't expect handle for direct IO */
-	WARN_ON_ONCE(ext4_journal_current_handle());
-
-	ret = ext4_get_block_trans(inode, iblock, bh_result,
-				   EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
-	/*
-	 * Mark inode as having pending DIO writes to unwritten extents.
-	 * ext4_direct_IO_write() checks this flag and converts extents to
-	 * written.
-	 */
-	if (!ret && buffer_unwritten(bh_result))
-		ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
-
-	return ret;
-}
-
-static int ext4_dio_get_block_overwrite(struct inode *inode, sector_t iblock,
-		   struct buffer_head *bh_result, int create)
-{
-	int ret;
-
-	ext4_debug("ext4_dio_get_block_overwrite: inode %lu, create flag %d\n",
-		   inode->i_ino, create);
-	/* We don't expect handle for direct IO */
-	WARN_ON_ONCE(ext4_journal_current_handle());
-
-	ret = _ext4_get_block(inode, iblock, bh_result, 0);
-	/*
-	 * Blocks should have been preallocated! ext4_file_write_iter() checks
-	 * that.
-	 */
-	WARN_ON_ONCE(!buffer_mapped(bh_result) || buffer_unwritten(bh_result));
-
-	return ret;
-}
-
-
 /*
  * `handle' can be NULL if create is zero
  */
@@ -3641,268 +3511,6 @@ const struct iomap_ops ext4_iomap_ops = {
 	.iomap_end		= ext4_iomap_end,
 };
 
-static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
-			    ssize_t size, void *private)
-{
-        ext4_io_end_t *io_end = private;
-
-	/* if not async direct IO just return */
-	if (!io_end)
-		return 0;
-
-	ext_debug("ext4_end_io_dio(): io_end 0x%p "
-		  "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
-		  io_end, io_end->inode->i_ino, iocb, offset, size);
-
-	/*
-	 * Error during AIO DIO. We cannot convert unwritten extents as the
-	 * data was not written. Just clear the unwritten flag and drop io_end.
-	 */
-	if (size <= 0) {
-		ext4_clear_io_unwritten_flag(io_end);
-		size = 0;
-	}
-	io_end->offset = offset;
-	io_end->size = size;
-	ext4_put_io_end(io_end);
-
-	return 0;
-}
-
-/*
- * Handling of direct IO writes.
- *
- * For ext4 extent files, ext4 will do direct-io write even to holes,
- * preallocated extents, and those write extend the file, no need to
- * fall back to buffered IO.
- *
- * For holes, we fallocate those blocks, mark them as unwritten
- * If those blocks were preallocated, we mark sure they are split, but
- * still keep the range to write as unwritten.
- *
- * The unwritten extents will be converted to written when DIO is completed.
- * For async direct IO, since the IO may still pending when return, we
- * set up an end_io call back function, which will do the conversion
- * when async direct IO completed.
- *
- * If the O_DIRECT write will extend the file then add this inode to the
- * orphan list.  So recovery will truncate it back to the original size
- * if the machine crashes during the write.
- *
- */
-static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
-{
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	ssize_t ret;
-	loff_t offset = iocb->ki_pos;
-	size_t count = iov_iter_count(iter);
-	int overwrite = 0;
-	get_block_t *get_block_func = NULL;
-	int dio_flags = 0;
-	loff_t final_size = offset + count;
-	int orphan = 0;
-	handle_t *handle;
-
-	if (final_size > inode->i_size || final_size > ei->i_disksize) {
-		/* Credits for sb + inode write */
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			goto out;
-		}
-		ret = ext4_orphan_add(handle, inode);
-		if (ret) {
-			ext4_journal_stop(handle);
-			goto out;
-		}
-		orphan = 1;
-		ext4_update_i_disksize(inode, inode->i_size);
-		ext4_journal_stop(handle);
-	}
-
-	BUG_ON(iocb->private == NULL);
-
-	/*
-	 * Make all waiters for direct IO properly wait also for extent
-	 * conversion. This also disallows race between truncate() and
-	 * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
-	 */
-	inode_dio_begin(inode);
-
-	/* If we do a overwrite dio, i_mutex locking can be released */
-	overwrite = *((int *)iocb->private);
-
-	if (overwrite)
-		inode_unlock(inode);
-
-	/*
-	 * For extent mapped files we could direct write to holes and fallocate.
-	 *
-	 * Allocated blocks to fill the hole are marked as unwritten to prevent
-	 * parallel buffered read to expose the stale data before DIO complete
-	 * the data IO.
-	 *
-	 * As to previously fallocated extents, ext4 get_block will just simply
-	 * mark the buffer mapped but still keep the extents unwritten.
-	 *
-	 * For non AIO case, we will convert those unwritten extents to written
-	 * after return back from blockdev_direct_IO. That way we save us from
-	 * allocating io_end structure and also the overhead of offloading
-	 * the extent convertion to a workqueue.
-	 *
-	 * For async DIO, the conversion needs to be deferred when the
-	 * IO is completed. The ext4 end_io callback function will be
-	 * called to take care of the conversion work.  Here for async
-	 * case, we allocate an io_end structure to hook to the iocb.
-	 */
-	iocb->private = NULL;
-	if (overwrite)
-		get_block_func = ext4_dio_get_block_overwrite;
-	else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
-		   round_down(offset, i_blocksize(inode)) >= inode->i_size) {
-		get_block_func = ext4_dio_get_block;
-		dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
-	} else if (is_sync_kiocb(iocb)) {
-		get_block_func = ext4_dio_get_block_unwritten_sync;
-		dio_flags = DIO_LOCKING;
-	} else {
-		get_block_func = ext4_dio_get_block_unwritten_async;
-		dio_flags = DIO_LOCKING;
-	}
-	ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
-				   get_block_func, ext4_end_io_dio, NULL,
-				   dio_flags);
-
-	if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
-						EXT4_STATE_DIO_UNWRITTEN)) {
-		int err;
-		/*
-		 * for non AIO case, since the IO is already
-		 * completed, we could do the conversion right here
-		 */
-		err = ext4_convert_unwritten_extents(NULL, inode,
-						     offset, ret);
-		if (err < 0)
-			ret = err;
-		ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
-	}
-
-	inode_dio_end(inode);
-	/* take i_mutex locking again if we do a ovewrite dio */
-	if (overwrite)
-		inode_lock(inode);
-
-	if (ret < 0 && final_size > inode->i_size)
-		ext4_truncate_failed_write(inode);
-
-	/* Handle extending of i_size after direct IO write */
-	if (orphan) {
-		int err;
-
-		/* Credits for sb + inode write */
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-		if (IS_ERR(handle)) {
-			/*
-			 * We wrote the data but cannot extend
-			 * i_size. Bail out. In async io case, we do
-			 * not return error here because we have
-			 * already submmitted the corresponding
-			 * bio. Returning error here makes the caller
-			 * think that this IO is done and failed
-			 * resulting in race with bio's completion
-			 * handler.
-			 */
-			if (!ret)
-				ret = PTR_ERR(handle);
-			if (inode->i_nlink)
-				ext4_orphan_del(NULL, inode);
-
-			goto out;
-		}
-		if (inode->i_nlink)
-			ext4_orphan_del(handle, inode);
-		if (ret > 0) {
-			loff_t end = offset + ret;
-			if (end > inode->i_size || end > ei->i_disksize) {
-				ext4_update_i_disksize(inode, end);
-				if (end > inode->i_size)
-					i_size_write(inode, end);
-				/*
-				 * We're going to return a positive `ret'
-				 * here due to non-zero-length I/O, so there's
-				 * no way of reporting error returns from
-				 * ext4_mark_inode_dirty() to userspace.  So
-				 * ignore it.
-				 */
-				ext4_mark_inode_dirty(handle, inode);
-			}
-		}
-		err = ext4_journal_stop(handle);
-		if (ret == 0)
-			ret = err;
-	}
-out:
-	return ret;
-}
-
-static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
-{
-	struct address_space *mapping = iocb->ki_filp->f_mapping;
-	struct inode *inode = mapping->host;
-	size_t count = iov_iter_count(iter);
-	ssize_t ret;
-
-	/*
-	 * Shared inode_lock is enough for us - it protects against concurrent
-	 * writes & truncates and since we take care of writing back page cache,
-	 * we are protected against page writeback as well.
-	 */
-	inode_lock_shared(inode);
-	ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
-					   iocb->ki_pos + count - 1);
-	if (ret)
-		goto out_unlock;
-	ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
-				   iter, ext4_dio_get_block, NULL, NULL, 0);
-out_unlock:
-	inode_unlock_shared(inode);
-	return ret;
-}
-
-static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
-	size_t count = iov_iter_count(iter);
-	loff_t offset = iocb->ki_pos;
-	ssize_t ret;
-
-#ifdef CONFIG_FS_ENCRYPTION
-	if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode))
-		return 0;
-#endif
-
-	/*
-	 * If we are doing data journalling we don't support O_DIRECT
-	 */
-	if (ext4_should_journal_data(inode))
-		return 0;
-
-	/* Let buffer I/O handle the inline data case. */
-	if (ext4_has_inline_data(inode))
-		return 0;
-
-	trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
-	if (iov_iter_rw(iter) == READ)
-		ret = ext4_direct_IO_read(iocb, iter);
-	else
-		ret = ext4_direct_IO_write(iocb, iter);
-	trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
-	return ret;
-}
-
 /*
  * Pages can be marked dirty completely asynchronously from ext4's journalling
  * activity.  By filemap_sync_pte(), try_to_unmap_one(), etc.  We cannot do
@@ -3940,7 +3548,7 @@ static const struct address_space_operations ext4_aops = {
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_invalidatepage,
 	.releasepage		= ext4_releasepage,
-	.direct_IO		= ext4_direct_IO,
+	.direct_IO		= noop_direct_IO,
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
@@ -3957,7 +3565,7 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_journalled_invalidatepage,
 	.releasepage		= ext4_releasepage,
-	.direct_IO		= ext4_direct_IO,
+	.direct_IO		= noop_direct_IO,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 };
@@ -3973,7 +3581,7 @@ static const struct address_space_operations ext4_da_aops = {
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_da_invalidatepage,
 	.releasepage		= ext4_releasepage,
-	.direct_IO		= ext4_direct_IO,
+	.direct_IO		= noop_direct_IO,
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
-- 
2.16.4


-- 
Matthew Bobrowski

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-12 12:53 ` [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure Matthew Bobrowski
@ 2019-08-12 17:04   ` RITESH HARJANI
  2019-08-13 12:58     ` Matthew Bobrowski
  2019-08-12 17:34   ` Christoph Hellwig
  2019-08-28 20:26   ` Jan Kara
  2 siblings, 1 reply; 48+ messages in thread
From: RITESH HARJANI @ 2019-08-12 17:04 UTC (permalink / raw)
  To: Matthew Bobrowski, linux-ext4; +Cc: linux-fsdevel, jack, tytso


On 8/12/19 6:23 PM, Matthew Bobrowski wrote:
> This patch introduces a new direct IO write code path implementation
> that makes use of the iomap infrastructure.
>
> All direct IO write operations are now passed from the ->write_iter() callback
> to the new function ext4_dio_write_iter(). This function is responsible for
> calling into iomap infrastructure via iomap_dio_rw(). Snippets of the direct
> IO code from within ext4_file_write_iter(), such as checking whether the IO
> request is unaligned asynchronous IO, or whether it will ber overwriting
> allocated and initialized blocks has been moved out and into
> ext4_dio_write_iter().
>
> The block mapping flags that are passed to ext4_map_blocks() from within
> ext4_dio_get_block() and friends have effectively been taken out and
> introduced within the ext4_iomap_begin(). If ext4_map_blocks() happens to have
> instantiated blocks beyond the i_size, then we attempt to place the inode onto
> the orphan list. Despite being able to perform i_size extension checking
> earlier on in the direct IO code path, it makes most sense to perform this bit
> post successful block allocation.
>
> The ->end_io() callback ext4_dio_write_end_io() is responsible for removing
> the inode from the orphan list and determining if we should truncate a failed
> write in the case of an error. We also convert a range of unwritten extents to
> written if IOMAP_DIO_UNWRITTEN is set and perform the necessary
> i_size/i_disksize extension if the iocb->ki_pos + dio->size > i_size_read(inode).
>
> In the instance of a short write, we fallback to buffered IO and complete
> whatever is left the 'iter'. Any blocks that may have been allocated in
> preparation for direct IO will be reused by buffered IO, so there's no issue
> with leaving allocated blocks beyond EOF.
>
> Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
> ---
>   fs/ext4/file.c  | 227 ++++++++++++++++++++++++++++++++++++++++----------------
>   fs/ext4/inode.c |  42 +++++++++--
>   2 files changed, 199 insertions(+), 70 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 7470800c63b7..d74576821676 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -29,6 +29,7 @@
>   #include <linux/pagevec.h>
>   #include <linux/uio.h>
>   #include <linux/mman.h>
> +#include <linux/backing-dev.h>
>   #include "ext4.h"
>   #include "ext4_jbd2.h"
>   #include "xattr.h"
> @@ -218,6 +219,14 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>   	if (ret <= 0)
>   		return ret;
>   
> +	ret = file_remove_privs(iocb->ki_filp);
> +	if (ret)
> +		return 0;
> +
> +	ret = file_update_time(iocb->ki_filp);
> +	if (ret)
> +		return 0;
> +
>   	if (unlikely(IS_IMMUTABLE(inode)))
>   		return -EPERM;
>   
> @@ -235,6 +244,34 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>   	return iov_iter_count(from);
>   }
>   
> +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> +					struct iov_iter *from)
> +{
> +	ssize_t ret;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (!inode_trylock(inode)) {
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EOPNOTSUPP;
> +		inode_lock(inode);
> +	}
> +
> +	ret = ext4_write_checks(iocb, from);
> +	if (ret <= 0)
> +		goto out;
> +
> +	current->backing_dev_info = inode_to_bdi(inode);
> +	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
> +	current->backing_dev_info = NULL;
> +out:
> +	inode_unlock(inode);
> +	if (likely(ret > 0)) {
> +		iocb->ki_pos += ret;
> +		ret = generic_write_sync(iocb, ret);
> +	}
> +	return ret;
> +}
> +
>   static int ext4_handle_inode_extension(struct inode *inode, loff_t size,
>   				       size_t count)
>   {
> @@ -284,6 +321,128 @@ static int ext4_handle_inode_extension(struct inode *inode, loff_t size,
>   	return ret;
>   }
>   
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> +				 ssize_t error, unsigned int flags)
> +{
> +	int ret = 0;
> +	handle_t *handle;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (error) {
> +		if (offset + size > i_size_read(inode))
> +			ext4_truncate_failed_write(inode);
> +
> +		/*
> +		 * The inode may have been placed onto the orphan list
> +		 * as a result of an extension. However, an error may
> +		 * have been encountered prior to being able to
> +		 * complete the write operation. Perform any necessary
> +		 * clean up in this case.
> +		 */
> +		if (!list_empty(&EXT4_I(inode)->i_orphan)) {
> +			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> +			if (IS_ERR(handle)) {
> +				if (inode->i_nlink)
> +					ext4_orphan_del(NULL, inode);
> +				return PTR_ERR(handle);
> +			}
> +
> +			if (inode->i_nlink)
> +				ext4_orphan_del(handle, inode);
> +			ext4_journal_stop(handle);
> +		}
> +		return error;
> +	}
> +
> +	if (flags & IOMAP_DIO_UNWRITTEN) {
> +		ret = ext4_convert_unwritten_extents(NULL, inode, offset, size);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (offset + size > i_size_read(inode)) {
> +		ret = ext4_handle_inode_extension(inode, offset + size, 0);
> +		if (ret)
> +			return ret;
> +	}
> +	return ret;
> +}
> +
> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	ssize_t ret;
> +	loff_t offset = iocb->ki_pos;
> +	size_t count = iov_iter_count(from);
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	bool extend = false, overwrite = false, unaligned_aio = false;
> +
> +	if (!inode_trylock(inode)) {
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EAGAIN;
> +		inode_lock(inode);
> +	}
> +
> +	if (!ext4_dio_checks(inode)) {
> +		inode_unlock(inode);
> +		/*
> +		 * Fallback to buffered IO if the operation on the
> +		 * inode is not supported by direct IO.
> +		 */
> +		return ext4_buffered_write_iter(iocb, from);
> +	}
> +
> +	ret = ext4_write_checks(iocb, from);
> +	if (ret <= 0)
> +		goto out;
> +
> +	/*
> +	 * Unaligned direct AIO must be serialized among each other as
> +	 * the zeroing of partial blocks of two competing unaligned
> +	 * AIOs can result in data corruption.
> +	 */
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> +	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> +		unaligned_aio = true;
> +		inode_dio_wait(inode);
> +	}
> +
> +	/*
> +	 * Determine whether the IO operation will overwrite allocated
> +	 * and initialized blocks. If so, check to see whether it is
> +	 * possible to take the dioread_nolock path.
> +	 */
> +	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> +	    ext4_should_dioread_nolock(inode)) {
> +		overwrite = true;
> +		downgrade_write(&inode->i_rwsem);
> +	}
> +
> +	if (offset + count > i_size_read(inode) ||
> +	    offset + count > EXT4_I(inode)->i_disksize) {
> +		ext4_update_i_disksize(inode, inode->i_size);
> +		extend = true;
> +	}
> +
> +	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> +
> +	/*
> +	 * Unaligned direct AIO must be the only IO in flight or else
> +	 * any overlapping aligned IO after unaligned IO might result
> +	 * in data corruption.
> +	 */
> +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> +		inode_dio_wait(inode);

Could you please add explain & add a comment about why we wait in AIO 
DIO case
when extend is true? As I see without iomap code this case was not 
present earlier.


> +
> +	if (ret >= 0 && iov_iter_count(from)) {
> +		overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> +		return ext4_buffered_write_iter(iocb, from);
> +	}
should not we copy code from "__generic_file_write_iter" which does below?

3436                 /*
3437                  * We need to ensure that the page cache pages are 
written to
3438                  * disk and invalidated to preserve the expected 
O_DIRECT
3439                  * semantics.
3440                  */


> +out:
> +	overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> +	return ret;
> +}
> +
>   #ifdef CONFIG_FS_DAX
>   static ssize_t
>   ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> @@ -300,12 +459,6 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   	ret = ext4_write_checks(iocb, from);
>   	if (ret <= 0)
>   		goto out;
> -	ret = file_remove_privs(iocb->ki_filp);
> -	if (ret)
> -		goto out;
> -	ret = file_update_time(iocb->ki_filp);
> -	if (ret)
> -		goto out;
>   
>   	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
>   
> @@ -327,10 +480,6 @@ static ssize_t
>   ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   {
>   	struct inode *inode = file_inode(iocb->ki_filp);
> -	int o_direct = iocb->ki_flags & IOCB_DIRECT;
> -	int unaligned_aio = 0;
> -	int overwrite = 0;
> -	ssize_t ret;
>   
>   	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
>   		return -EIO;
> @@ -339,61 +488,9 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   	if (IS_DAX(inode))
>   		return ext4_dax_write_iter(iocb, from);
>   #endif
> -	if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
> -		return -EOPNOTSUPP;
> -
> -	if (!inode_trylock(inode)) {
> -		if (iocb->ki_flags & IOCB_NOWAIT)
> -			return -EAGAIN;
> -		inode_lock(inode);
> -	}
> -
> -	ret = ext4_write_checks(iocb, from);
> -	if (ret <= 0)
> -		goto out;
> -
> -	/*
> -	 * Unaligned direct AIO must be serialized among each other as zeroing
> -	 * of partial blocks of two competing unaligned AIOs can result in data
> -	 * corruption.
> -	 */
> -	if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> -	    !is_sync_kiocb(iocb) &&
> -	    ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
> -		unaligned_aio = 1;
> -		ext4_unwritten_wait(inode);
> -	}
> -
> -	iocb->private = &overwrite;
> -	/* Check whether we do a DIO overwrite or not */
> -	if (o_direct && !unaligned_aio) {
> -		if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
> -			if (ext4_should_dioread_nolock(inode))
> -				overwrite = 1;
> -		} else if (iocb->ki_flags & IOCB_NOWAIT) {
> -			ret = -EAGAIN;
> -			goto out;
> -		}
> -	}
> -
> -	ret = __generic_file_write_iter(iocb, from);
> -	/*
> -	 * Unaligned direct AIO must be the only IO in flight. Otherwise
> -	 * overlapping aligned IO after unaligned might result in data
> -	 * corruption.
> -	 */
> -	if (ret == -EIOCBQUEUED && unaligned_aio)
> -		ext4_unwritten_wait(inode);
> -	inode_unlock(inode);
> -
> -	if (ret > 0)
> -		ret = generic_write_sync(iocb, ret);
> -
> -	return ret;
> -
> -out:
> -	inode_unlock(inode);
> -	return ret;
> +	if (iocb->ki_flags & IOCB_DIRECT)
> +		return ext4_dio_write_iter(iocb, from);
> +	return ext4_buffered_write_iter(iocb, from);
>   }
>   
>   #ifdef CONFIG_FS_DAX
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 761ce6286b05..9155a8a6eb0b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3533,8 +3533,38 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   		if (IS_ERR(handle))
>   			return PTR_ERR(handle);
>   
> -		ret = ext4_map_blocks(handle, inode, &map,
> -				      EXT4_GET_BLOCKS_CREATE_ZERO);
> +		if (IS_DAX(inode)) {
> +			ret = ext4_map_blocks(handle, inode, &map,
> +					      EXT4_GET_BLOCKS_CREATE_ZERO);
> +		} else {
> +			/*
> +			 * DAX and direct IO are the only two
> +			 * operations currently supported with
> +			 * IOMAP_WRITE.
> +			 */
> +			WARN_ON(!(flags & IOMAP_DIRECT));
> +			if (round_down(offset, i_blocksize(inode)) >=
> +			    i_size_read(inode)) {
> +				ret = ext4_map_blocks(handle, inode, &map,
> +						      EXT4_GET_BLOCKS_CREATE);
> +			} else if (!ext4_test_inode_flag(inode,
> +							 EXT4_INODE_EXTENTS)) {
> +				/*
> +				 * We cannot fill holes in indirect
> +				 * tree based inodes as that could
> +				 * expose stale data in the case of a
> +				 * crash. Use magic error code to
> +				 * fallback to buffered IO.
> +				 */
> +				ret = ext4_map_blocks(handle, inode, &map, 0);
> +				if (ret == 0)
> +					ret = -ENOTBLK;
> +			} else {
> +				ret = ext4_map_blocks(handle, inode, &map,
> +						      EXT4_GET_BLOCKS_IO_CREATE_EXT);
> +			}
> +		}

Could you please check & confirm on below points -
1. Do you see a problem @above in case of *overwrite* with extents mapping?
It will fall into EXT4_GET_BLOCKS_IO_CREATE_EXT case.
So are we piggy backing on the fact that ext4_map_blocks first call 
ext4_ext_map_blocks
with flags & EXT4_GET_BLOCKS_KEEP_SIZE. And so for overwrite case since 
it will return
val > 0 then we will anyway not create any blocks and so we don't need 
to check overwrite
case specifically here?


2. For cases with flags passed is 0 to ext4_map_blocks (overwrite & 
fallocate without extent case),
we need not start the journaling transaction. But in above we are doing 
ext4_journal_start/stop unconditionally
& unnecessarily reserving dio_credits blocks.
We need to take care of that right?


> +
>   		if (ret < 0) {
>   			ext4_journal_stop(handle);
>   			if (ret == -ENOSPC &&
> @@ -3581,10 +3611,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   		iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
>   		iomap->addr = IOMAP_NULL_ADDR;
>   	} else {
> -		if (map.m_flags & EXT4_MAP_MAPPED) {
> -			iomap->type = IOMAP_MAPPED;
> -		} else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> +		if (map.m_flags & EXT4_MAP_UNWRITTEN) {
>   			iomap->type = IOMAP_UNWRITTEN;
> +		} else if (map.m_flags & EXT4_MAP_MAPPED) {
> +			iomap->type = IOMAP_MAPPED;
Maybe a comment as to explaining why checking UNWRITTEN before is 
necessary for others.
So in case of fallocate & DIO write case we may get extent which is both 
unwritten & mapped (right?).
so we need to check if we have an unwritten extent first so that it will 
need the conversion in ->end_io
callback.

>   		} else {
>   			WARN_ON_ONCE(1);
>   			return -EIO;
> @@ -3601,6 +3631,8 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>   			  ssize_t written, unsigned flags, struct iomap *iomap)
>   {
> +	if (flags & IOMAP_DIRECT && written == 0)
> +		return -ENOTBLK;
>   	return 0;
>   }
>   


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] iomap: modify ->end_io() calling convention
  2019-08-12 12:53 ` [PATCH 3/5] iomap: modify ->end_io() calling convention Matthew Bobrowski
@ 2019-08-12 17:18   ` Christoph Hellwig
  2019-08-13 10:43     ` Matthew Bobrowski
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2019-08-12 17:18 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: linux-ext4, linux-fsdevel, jack, tytso, riteshh

please add linux-xfs to the cc list for the whole series.  Besides
touching xfs itself it is also mentioned in MAINTAINERS for the iomap
code.

On Mon, Aug 12, 2019 at 10:53:11PM +1000, Matthew Bobrowski wrote:
> -	if (size <= 0)
> -		return size;
> +	if (error || !size)
> +		return error ? error : size;

This should be:

	if (error)
		return error;
	if (!size)
		return 0;

>  	if (flags & IOMAP_DIO_COW) {
> -		error = xfs_reflink_end_cow(ip, offset, size);
> -		if (error)
> +		ret = xfs_reflink_end_cow(ip, offset, size);
> +		if (ret)

I think we can just keep reusing error here.

> +typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t size,
> +				 ssize_t error, unsigned int flags);

error should be an int and not a ssize_t.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/5] ext4: introduce direct IO read code path using iomap infrastructure
  2019-08-12 12:52 ` [PATCH 1/5] ext4: introduce direct IO read code path using " Matthew Bobrowski
@ 2019-08-12 17:18   ` Christoph Hellwig
  2019-08-12 20:17     ` Matthew Wilcox
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2019-08-12 17:18 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: linux-ext4, linux-fsdevel, jack, tytso, riteshh

On Mon, Aug 12, 2019 at 10:52:35PM +1000, Matthew Bobrowski wrote:
> +#ifdef CONFIG_FS_ENCRYPTION
> +	if (IS_ENCRYPTED(inode))
> +		return false;
> +#endif

This could use IS_ENABLED.

>  		return -EIO;
>  
>  	if (!iov_iter_count(to))
>  		return 0; /* skip atime */
>  
>  #ifdef CONFIG_FS_DAX
> -	if (IS_DAX(file_inode(iocb->ki_filp)))
> +	if (IS_DAX(inode))
>  		return ext4_dax_read_iter(iocb, to);
>  #endif

Same here.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] ext4: move inode extension/truncate code out from ext4_iomap_end()
  2019-08-12 12:52 ` [PATCH 2/5] ext4: move inode extension/truncate code out from ext4_iomap_end() Matthew Bobrowski
@ 2019-08-12 17:18   ` Christoph Hellwig
  2019-08-13 10:46     ` Matthew Bobrowski
  2019-08-28 19:59   ` Jan Kara
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2019-08-12 17:18 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: linux-ext4, linux-fsdevel, jack, tytso, riteshh

On Mon, Aug 12, 2019 at 10:52:53PM +1000, Matthew Bobrowski wrote:
> In preparation for implementing the direct IO write code path modifications
> that make us of iomap infrastructure we need to move out the inode
> extension/truncate code from ext4_iomap_end() callback. For direct IO, if the
> current code remained it would behave incorrectly. If we update the inode size
> prior to converting unwritten extents we run the risk of allowing a racing
> direct IO read operation to find unwritten extents before they are converted.
> 
> The inode extension/truncate has been moved out into a new function
> ext4_handle_inode_extension(). This will be used by both direct IO and DAX
> code paths if the write results with the inode being extended.

ext4_iomap_end is empty now, so you could as well remove it entirely.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-12 12:52 [PATCH 0/5] ext4: direct IO via iomap infrastructure Matthew Bobrowski
                   ` (4 preceding siblings ...)
  2019-08-12 12:53 ` [PATCH 5/5] ext4: clean up redundant buffer_head direct IO code Matthew Bobrowski
@ 2019-08-12 17:31 ` RITESH HARJANI
  2019-08-13 11:10   ` Matthew Bobrowski
  5 siblings, 1 reply; 48+ messages in thread
From: RITESH HARJANI @ 2019-08-12 17:31 UTC (permalink / raw)
  To: Matthew Bobrowski, linux-ext4; +Cc: linux-fsdevel, jack, tytso

Hello Matthew,

On 8/12/19 6:22 PM, Matthew Bobrowski wrote:

> This patch series converts the ext4 direct IO code paths to make use of the
> iomap infrastructure and removes the old buffer_head direct-io based
> implementation. The result is that ext4 is converted to the newer framework
> and that it may _possibly_ gain a performance boost for O_SYNC | O_DIRECT IO.
>
> These changes have been tested using xfstests in both DAX and non-DAX modes
> using various configurations i.e. 4k, dioread_nolock, dax.

I had some minor review comments posted on Patch-4.
But the rest of the patch series looks good to me.
I will also do some basic testing of xfstests which I did for my patches 
and will revert back.

One query, could you please help answering below for my understanding :-

I was under the assumption that we need to maintain 
ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or 
atomic_read(&EXT4_I(inode)->i_unwritten))
in case of non-AIO directIO or AIO directIO case as well (when we may 
allocate unwritten extents),
to protect with some kind of race with other parts of code(maybe 
truncate/bufferedIO/fallocate not sure?) which may call for 
ext4_can_extents_be_merged()
to check if extents can be merged or not.

Is it not the case?
Now that directIO code has no way of specifying that this inode has 
unwritten extent, will it not race with any other path, where this info 
was necessary (like
in above func ext4_can_extents_be_merged())?


Thanks
Ritesh

>
> Matthew Bobrowski (5):
>    ext4: introduce direct IO read code path using iomap infrastructure
>    ext4: move inode extension/truncate code out from ext4_iomap_end()
>    iomap: modify ->end_io() calling convention
>    ext4: introduce direct IO write code path using iomap infrastructure
>    ext4: clean up redundant buffer_head direct IO code
>
>   fs/ext4/ext4.h        |   3 -
>   fs/ext4/extents.c     |   8 +-
>   fs/ext4/file.c        | 329 +++++++++++++++++++++++++++-------
>   fs/ext4/inode.c       | 488 +++++---------------------------------------------
>   fs/iomap/direct-io.c  |   9 +-
>   fs/xfs/xfs_file.c     |  17 +-
>   include/linux/iomap.h |   4 +-
>   7 files changed, 322 insertions(+), 536 deletions(-)
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-12 12:53 ` [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure Matthew Bobrowski
  2019-08-12 17:04   ` RITESH HARJANI
@ 2019-08-12 17:34   ` Christoph Hellwig
  2019-08-13 10:45     ` Matthew Bobrowski
  2019-08-28 20:26   ` Jan Kara
  2 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2019-08-12 17:34 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: linux-ext4, linux-fsdevel, jack, tytso, riteshh

> +	if (error) {
> +		if (offset + size > i_size_read(inode))
> +			ext4_truncate_failed_write(inode);
> +
> +		/*
> +		 * The inode may have been placed onto the orphan list
> +		 * as a result of an extension. However, an error may
> +		 * have been encountered prior to being able to
> +		 * complete the write operation. Perform any necessary
> +		 * clean up in this case.
> +		 */
> +		if (!list_empty(&EXT4_I(inode)->i_orphan)) {
> +			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> +			if (IS_ERR(handle)) {
> +				if (inode->i_nlink)
> +					ext4_orphan_del(NULL, inode);
> +				return PTR_ERR(handle);
> +			}
> +
> +			if (inode->i_nlink)
> +				ext4_orphan_del(handle, inode);
> +			ext4_journal_stop(handle);
> +		}
> +		return error;

I'd split this branch into a separate function just to keep the
end_io handler tidy.

> +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> +		inode_dio_wait(inode);
> +
> +	if (ret >= 0 && iov_iter_count(from)) {
> +		overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> +		return ext4_buffered_write_iter(iocb, from);
> +	}
> +out:
> +	overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> +	return ret;

the ? : expression here is weird.

I'd write this as:

	if (overwrite)
		inode_unlock_shared(inode);
	else
		inode_unlock(inode);

	if (ret >= 0 && iov_iter_count(from))
		return ext4_buffered_write_iter(iocb, from);
	return ret;

and handle the only place we jump to the current out label manually,
as that always does an exclusive unlock anyway.

> +		if (IS_DAX(inode)) {
> +			ret = ext4_map_blocks(handle, inode, &map,
> +					      EXT4_GET_BLOCKS_CREATE_ZERO);
> +		} else {
> +			/*
> +			 * DAX and direct IO are the only two
> +			 * operations currently supported with
> +			 * IOMAP_WRITE.
> +			 */
> +			WARN_ON(!(flags & IOMAP_DIRECT));
> +			if (round_down(offset, i_blocksize(inode)) >=
> +			    i_size_read(inode)) {
> +				ret = ext4_map_blocks(handle, inode, &map,
> +						      EXT4_GET_BLOCKS_CREATE);
> +			} else if (!ext4_test_inode_flag(inode,
> +							 EXT4_INODE_EXTENTS)) {
> +				/*
> +				 * We cannot fill holes in indirect
> +				 * tree based inodes as that could
> +				 * expose stale data in the case of a
> +				 * crash. Use magic error code to
> +				 * fallback to buffered IO.
> +				 */
> +				ret = ext4_map_blocks(handle, inode, &map, 0);
> +				if (ret == 0)
> +					ret = -ENOTBLK;
> +			} else {
> +				ret = ext4_map_blocks(handle, inode, &map,
> +						      EXT4_GET_BLOCKS_IO_CREATE_EXT);
> +			}
> +		}

I think this could be simplified down to something like:

		int flags = 0;

		...

		/*
		 * DAX and direct IO are the only two operations currently
		 * supported with IOMAP_WRITE.
		 */
		WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));

		if (IS_DAX(inode))
			flags = EXT4_GET_BLOCKS_CREATE_ZERO;
		else if (round_down(offset, i_blocksize(inode)) >=
				i_size_read(inode)) {
			flags = EXT4_GET_BLOCKS_CREATE;
		else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
			flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;

		/*
		 * We cannot fill holes in indirect tree based inodes as that
		 * could expose stale data in the case of a crash.  Use the
		 * magic error code to fallback to buffered IO.
		 */
		if (!flags && !ret)
			ret = -ENOTBLK;


> @@ -3601,6 +3631,8 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>  			  ssize_t written, unsigned flags, struct iomap *iomap)
>  {
> +	if (flags & IOMAP_DIRECT && written == 0)
> +		return -ENOTBLK;

This probably wants a comment, too.  But do we actually ever end up
here?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/5] ext4: introduce direct IO read code path using iomap infrastructure
  2019-08-12 17:18   ` Christoph Hellwig
@ 2019-08-12 20:17     ` Matthew Wilcox
  2019-08-13 10:45       ` Matthew Bobrowski
  0 siblings, 1 reply; 48+ messages in thread
From: Matthew Wilcox @ 2019-08-12 20:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Bobrowski, linux-ext4, linux-fsdevel, jack, tytso, riteshh

On Mon, Aug 12, 2019 at 10:18:35AM -0700, Christoph Hellwig wrote:
> >  		return -EIO;
> >  
> >  	if (!iov_iter_count(to))
> >  		return 0; /* skip atime */
> >  
> >  #ifdef CONFIG_FS_DAX
> > -	if (IS_DAX(file_inode(iocb->ki_filp)))
> > +	if (IS_DAX(inode))
> >  		return ext4_dax_read_iter(iocb, to);
> >  #endif
> 
> Same here.

It doesn't even need IS_ENABLED.

include/linux/fs.h:#define IS_DAX(inode)                ((inode)->i_flags & S_DAX)

#ifdef CONFIG_FS_DAX
#define S_DAX           8192    /* Direct Access, avoiding the page cache */
#else
#define S_DAX           0       /* Make all the DAX code disappear */
#endif


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] iomap: modify ->end_io() calling convention
  2019-08-12 17:18   ` Christoph Hellwig
@ 2019-08-13 10:43     ` Matthew Bobrowski
  0 siblings, 0 replies; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-13 10:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-ext4, linux-fsdevel, jack, tytso, riteshh

On Mon, Aug 12, 2019 at 10:18:32AM -0700, Christoph Hellwig wrote:
> please add linux-xfs to the cc list for the whole series.  Besides
> touching xfs itself it is also mentioned in MAINTAINERS for the iomap
> code.

Firstly, thank you for the review, highly appreciated! Secondly, not a
problem, will do.

> On Mon, Aug 12, 2019 at 10:53:11PM +1000, Matthew Bobrowski wrote:
> > -	if (size <= 0)
> > -		return size;
> > +	if (error || !size)
> > +		return error ? error : size;
> 
> This should be:
> 
> 	if (error)
> 		return error;
> 	if (!size)
> 		return 0;

OK.

> >  	if (flags & IOMAP_DIO_COW) {
> > -		error = xfs_reflink_end_cow(ip, offset, size);
> > -		if (error)
> > +		ret = xfs_reflink_end_cow(ip, offset, size);
> > +		if (ret)
> 
> I think we can just keep reusing error here.

Ah yes, that will work.

> > +typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t size,
> > +				 ssize_t error, unsigned int flags);
> 
> error should be an int and not a ssize_t.

Updated.

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-12 17:34   ` Christoph Hellwig
@ 2019-08-13 10:45     ` Matthew Bobrowski
  0 siblings, 0 replies; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-13 10:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-ext4, linux-fsdevel, jack, tytso, riteshh

On Mon, Aug 12, 2019 at 10:34:03AM -0700, Christoph Hellwig wrote:
> > +	if (error) {
> > +		if (offset + size > i_size_read(inode))
> > +			ext4_truncate_failed_write(inode);
> > +
> > +		/*
> > +		 * The inode may have been placed onto the orphan list
> > +		 * as a result of an extension. However, an error may
> > +		 * have been encountered prior to being able to
> > +		 * complete the write operation. Perform any necessary
> > +		 * clean up in this case.
> > +		 */
> > +		if (!list_empty(&EXT4_I(inode)->i_orphan)) {
> > +			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > +			if (IS_ERR(handle)) {
> > +				if (inode->i_nlink)
> > +					ext4_orphan_del(NULL, inode);
> > +				return PTR_ERR(handle);
> > +			}
> > +
> > +			if (inode->i_nlink)
> > +				ext4_orphan_del(handle, inode);
> > +			ext4_journal_stop(handle);
> > +		}
> > +		return error;
> 
> I'd split this branch into a separate function just to keep the
> end_io handler tidy.

Good idea. I'll do that...

> > +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> > +		inode_dio_wait(inode);
> > +
> > +	if (ret >= 0 && iov_iter_count(from)) {
> > +		overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> > +		return ext4_buffered_write_iter(iocb, from);
> > +	}
> > +out:
> > +	overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> > +	return ret;
> 
> the ? : expression here is weird.
> 
> I'd write this as:
> 
> 	if (overwrite)
> 		inode_unlock_shared(inode);
> 	else
> 		inode_unlock(inode);
> 
> 	if (ret >= 0 && iov_iter_count(from))
> 		return ext4_buffered_write_iter(iocb, from);
> 	return ret;
> 
> and handle the only place we jump to the current out label manually,
> as that always does an exclusive unlock anyway.

Yeah, the ternary operators do look weird here and I'd prefer if we
also dropped them. I was at a point where I was trying to clean up
some of the code, but I had been staring at the screen for so long my
brain went numb and couldn't think of how to do this neatly. I'm happy
with this suggestion. :-)

> > +		if (IS_DAX(inode)) {
> > +			ret = ext4_map_blocks(handle, inode, &map,
> > +					      EXT4_GET_BLOCKS_CREATE_ZERO);
> > +		} else {
> > +			/*
> > +			 * DAX and direct IO are the only two
> > +			 * operations currently supported with
> > +			 * IOMAP_WRITE.
> > +			 */
> > +			WARN_ON(!(flags & IOMAP_DIRECT));
> > +			if (round_down(offset, i_blocksize(inode)) >=
> > +			    i_size_read(inode)) {
> > +				ret = ext4_map_blocks(handle, inode, &map,
> > +						      EXT4_GET_BLOCKS_CREATE);
> > +			} else if (!ext4_test_inode_flag(inode,
> > +							 EXT4_INODE_EXTENTS)) {
> > +				/*
> > +				 * We cannot fill holes in indirect
> > +				 * tree based inodes as that could
> > +				 * expose stale data in the case of a
> > +				 * crash. Use magic error code to
> > +				 * fallback to buffered IO.
> > +				 */
> > +				ret = ext4_map_blocks(handle, inode, &map, 0);
> > +				if (ret == 0)
> > +					ret = -ENOTBLK;
> > +			} else {
> > +				ret = ext4_map_blocks(handle, inode, &map,
> > +						      EXT4_GET_BLOCKS_IO_CREATE_EXT);
> > +			}
> > +		}
> 
> I think this could be simplified down to something like:
> 
> 		int flags = 0;
> 
> 		...
> 
> 		/*
> 		 * DAX and direct IO are the only two operations currently
> 		 * supported with IOMAP_WRITE.
> 		 */
> 		WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
> 
> 		if (IS_DAX(inode))
> 			flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> 		else if (round_down(offset, i_blocksize(inode)) >=
> 				i_size_read(inode)) {
> 			flags = EXT4_GET_BLOCKS_CREATE;
> 		else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> 			flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> 
> 		/*
> 		 * We cannot fill holes in indirect tree based inodes as that
> 		 * could expose stale data in the case of a crash.  Use the
> 		 * magic error code to fallback to buffered IO.
> 		 */
> 		if (!flags && !ret)
> 			ret = -ENOTBLK;

This also seems OK to me.

> > @@ -3601,6 +3631,8 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> >  static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> >  			  ssize_t written, unsigned flags, struct iomap *iomap)
> >  {
> > +	if (flags & IOMAP_DIRECT && written == 0)
> > +		return -ENOTBLK;
> 
> This probably wants a comment, too.  But do we actually ever end up
> here?

Sure, I can append a comment. Also, I don't believe that we can
completely drop the ->iomap_end() callback as hinted in one of your
other comments. The reason I say this is because we still need this to
catch the case where an error an occurs within 'iomap_actor_t'. If
that happens to be, within iomap_dio_rw() we wait for IO completion
before returning and then we fallback to buffered IO to complete the
remainder of the IO. We will also be able to reuse the extent that was
allocated when preparing for direct IO if we do this.

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/5] ext4: introduce direct IO read code path using iomap infrastructure
  2019-08-12 20:17     ` Matthew Wilcox
@ 2019-08-13 10:45       ` Matthew Bobrowski
  0 siblings, 0 replies; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-13 10:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-ext4, linux-fsdevel, jack, tytso, riteshh

On Mon, Aug 12, 2019 at 01:17:35PM -0700, Matthew Wilcox wrote:
> On Mon, Aug 12, 2019 at 10:18:35AM -0700, Christoph Hellwig wrote:
> > >  		return -EIO;
> > >  
> > >  	if (!iov_iter_count(to))
> > >  		return 0; /* skip atime */
> > >  
> > >  #ifdef CONFIG_FS_DAX
> > > -	if (IS_DAX(file_inode(iocb->ki_filp)))
> > > +	if (IS_DAX(inode))
> > >  		return ext4_dax_read_iter(iocb, to);
> > >  #endif
> > 
> > Same here.
> 
> It doesn't even need IS_ENABLED.
> 
> include/linux/fs.h:#define IS_DAX(inode)                ((inode)->i_flags & S_DAX)
> 
> #ifdef CONFIG_FS_DAX
> #define S_DAX           8192    /* Direct Access, avoiding the page cache */
> #else
> #define S_DAX           0       /* Make all the DAX code disappear */
> #endif

Ah, clever - I like it! I actually didn't see this and thank you for
highlighting. I guess I will be dropping the CONFIG_FS_DAX statement
here...

--M


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] ext4: move inode extension/truncate code out from ext4_iomap_end()
  2019-08-12 17:18   ` Christoph Hellwig
@ 2019-08-13 10:46     ` Matthew Bobrowski
  0 siblings, 0 replies; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-13 10:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-ext4, linux-fsdevel, jack, tytso, riteshh

On Mon, Aug 12, 2019 at 10:18:39AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 12, 2019 at 10:52:53PM +1000, Matthew Bobrowski wrote:
> > In preparation for implementing the direct IO write code path modifications
> > that make us of iomap infrastructure we need to move out the inode
> > extension/truncate code from ext4_iomap_end() callback. For direct IO, if the
> > current code remained it would behave incorrectly. If we update the inode size
> > prior to converting unwritten extents we run the risk of allowing a racing
> > direct IO read operation to find unwritten extents before they are converted.
> > 
> > The inode extension/truncate has been moved out into a new function
> > ext4_handle_inode_extension(). This will be used by both direct IO and DAX
> > code paths if the write results with the inode being extended.
> 
> ext4_iomap_end is empty now, so you could as well remove it entirely.

As mentioned in my other email (4/5), this is callback needs to remain.

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-12 17:31 ` [PATCH 0/5] ext4: direct IO via iomap infrastructure RITESH HARJANI
@ 2019-08-13 11:10   ` Matthew Bobrowski
  2019-08-13 12:27     ` RITESH HARJANI
  0 siblings, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-13 11:10 UTC (permalink / raw)
  To: RITESH HARJANI; +Cc: linux-ext4, linux-fsdevel, jack, tytso

On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote:
> > This patch series converts the ext4 direct IO code paths to make use of the
> > iomap infrastructure and removes the old buffer_head direct-io based
> > implementation. The result is that ext4 is converted to the newer framework
> > and that it may _possibly_ gain a performance boost for O_SYNC | O_DIRECT IO.
> > 
> > These changes have been tested using xfstests in both DAX and non-DAX modes
> > using various configurations i.e. 4k, dioread_nolock, dax.
> 
> I had some minor review comments posted on Patch-4.
> But the rest of the patch series looks good to me.

Thanks for the review, much appreciated! Also, apologies about any
delayed response to your queries, I predominantly do all this work in
my personal time.

> I will also do some basic testing of xfstests which I did for my patches and
> will revert back.

Sounds good!

> One query, could you please help answering below for my understanding :-
> 
> I was under the assumption that we need to maintain
> ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or
> atomic_read(&EXT4_I(inode)->i_unwritten))
> in case of non-AIO directIO or AIO directIO case as well (when we may
> allocate unwritten extents),
> to protect with some kind of race with other parts of code(maybe
> truncate/bufferedIO/fallocate not sure?) which may call for
> ext4_can_extents_be_merged()
> to check if extents can be merged or not.
> 
> Is it not the case?
> Now that directIO code has no way of specifying that this inode has
> unwritten extent, will it not race with any other path, where this info was
> necessary (like
> in above func ext4_can_extents_be_merged())?

Ah yes, I was under the same assumption when reviewing the code
initially and one of my first solutions was to also use this dynamic
'state' flag in the ->end_io() handler. But, I fell flat on my face as
that deemed to be problematic... This is because there can be multiple
direct IOs to unwritten extents against the same inode, so you cannot
possibly get away with tracking them using this single inode flag. So,
hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use
IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks
whether _this_ particular IO has an underlying unwritten extent.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-13 11:10   ` Matthew Bobrowski
@ 2019-08-13 12:27     ` RITESH HARJANI
  2019-08-14  9:48       ` Matthew Bobrowski
  2019-08-21 13:14       ` Matthew Bobrowski
  0 siblings, 2 replies; 48+ messages in thread
From: RITESH HARJANI @ 2019-08-13 12:27 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: linux-ext4, linux-fsdevel, jack, tytso


On 8/13/19 4:40 PM, Matthew Bobrowski wrote:
> On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote:
>>> This patch series converts the ext4 direct IO code paths to make use of the
>>> iomap infrastructure and removes the old buffer_head direct-io based
>>> implementation. The result is that ext4 is converted to the newer framework
>>> and that it may _possibly_ gain a performance boost for O_SYNC | O_DIRECT IO.
>>>
>>> These changes have been tested using xfstests in both DAX and non-DAX modes
>>> using various configurations i.e. 4k, dioread_nolock, dax.
>> I had some minor review comments posted on Patch-4.
>> But the rest of the patch series looks good to me.
> Thanks for the review, much appreciated! Also, apologies about any
> delayed response to your queries, I predominantly do all this work in
> my personal time.
Np at all.

>
>> I will also do some basic testing of xfstests which I did for my patches and
>> will revert back.
> Sounds good!

I did not find any failure new failures in xfstests with 4K block size.
Neither in basic fio DIO/AIO testing. So my basic testing looks good
(these are mostly the tests which I was using for my debug/dev too)


>
>> One query, could you please help answering below for my understanding :-
>>
>> I was under the assumption that we need to maintain
>> ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or
>> atomic_read(&EXT4_I(inode)->i_unwritten))
>> in case of non-AIO directIO or AIO directIO case as well (when we may
>> allocate unwritten extents),
>> to protect with some kind of race with other parts of code(maybe
>> truncate/bufferedIO/fallocate not sure?) which may call for
>> ext4_can_extents_be_merged()
>> to check if extents can be merged or not.
>>
>> Is it not the case?
>> Now that directIO code has no way of specifying that this inode has
>> unwritten extent, will it not race with any other path, where this info was
>> necessary (like
>> in above func ext4_can_extents_be_merged())?
> Ah yes, I was under the same assumption when reviewing the code
> initially and one of my first solutions was to also use this dynamic
> 'state' flag in the ->end_io() handler. But, I fell flat on my face as
> that deemed to be problematic... This is because there can be multiple
> direct IOs to unwritten extents against the same inode, so you cannot
> possibly get away with tracking them using this single inode flag. So,
> hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use
> IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks
> whether _this_ particular IO has an underlying unwritten extent.

Thanks for taking time to explain this.

Yes, I do understand that part - i.e. while preparing block mapping in 
->iomap_begin
we get to know(from ext4_map_blocks) whether this is an unwritten extent 
and we add the flag
IOMAP_DIO_UNWRITTEN to iomap. This is needed so that we can convert 
unwritten extents in ->end_io
before we update the inode size and mark the inode dirty - to avoid any 
race with other AIO DIO or bufferedIO.

But what I meant was this (I may be wrong here since I haven't really 
looked into it),
but for my understanding I would like to discuss this -

So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining 
on whether a newextent can be merged with ex1 in function
ext4_extents_can_be_merged. But now since we have removed that flag we 
have no way of knowing that whether
this inode has any unwritten extents or not from any DIO path.
What I meant is isn't this removal of setting/unsetting of 
flag(EXT4_STATE_DIO_UNWRITTEN)
changing the behavior of this func - ext4_extents_can_be_merged?

Also - could you please explain why this check returns 0 in the first 
place (line 1762 - 1766 below)?

1733 int
1734 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent 
*ex1,
1735                                 struct ext4_extent *ex2)
<...>

1762         if (ext4_ext_is_unwritten(ex1) &&
1763             (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
1764              atomic_read(&EXT4_I(inode)->i_unwritten) ||
1765              (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
1766                 return 0;
<...>

Regards
Ritesh


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-12 17:04   ` RITESH HARJANI
@ 2019-08-13 12:58     ` Matthew Bobrowski
  2019-08-13 14:35       ` Darrick J. Wong
  0 siblings, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-13 12:58 UTC (permalink / raw)
  To: RITESH HARJANI; +Cc: linux-ext4, linux-fsdevel, jack, tytso

On Mon, Aug 12, 2019 at 10:34:29PM +0530, RITESH HARJANI wrote:
> > +	if (offset + count > i_size_read(inode) ||
> > +	    offset + count > EXT4_I(inode)->i_disksize) {
> > +		ext4_update_i_disksize(inode, inode->i_size);
> > +		extend = true;
> > +	}
> > +
> > +	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> > +
> > +	/*
> > +	 * Unaligned direct AIO must be the only IO in flight or else
> > +	 * any overlapping aligned IO after unaligned IO might result
> > +	 * in data corruption.
> > +	 */
> > +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> > +		inode_dio_wait(inode);
> 
> Could you please add explain & add a comment about why we wait in AIO DIO
> case
> when extend is true? As I see without iomap code this case was not present
> earlier.

Because while using the iomap infrastructure for AIO writes, we return to the
caller prior to invoking the ->end_io() handler. This callback is responsible
for performing the in-core/on-disk inode extension if it is deemed
necessary. If we don't wait in the case of an extend, we run the risk of
loosing inode size consistencies in addition to things leading to possible
data corruption...

> > +
> > +	if (ret >= 0 && iov_iter_count(from)) {
> > +		overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> > +		return ext4_buffered_write_iter(iocb, from);
> > +	}
> should not we copy code from "__generic_file_write_iter" which does below?
> 
> 3436                 /*
> 3437                  * We need to ensure that the page cache pages are
> written to
> 3438                  * disk and invalidated to preserve the expected
> O_DIRECT
> 3439                  * semantics.
> 3440                  */

Hm, I don't see why this would be required seeing as though the page cache
invalidation semantics pre and post write are handled by iomap_dio_rw() and
iomap_dio_complete(). But, I could be completely wrong here, so we may need to
wait for some others to provide comments on this.

> > +			WARN_ON(!(flags & IOMAP_DIRECT));
> > +			if (round_down(offset, i_blocksize(inode)) >=
> > +			    i_size_read(inode)) {
> > +				ret = ext4_map_blocks(handle, inode, &map,
> > +						      EXT4_GET_BLOCKS_CREATE);
> > +			} else if (!ext4_test_inode_flag(inode,
> > +							 EXT4_INODE_EXTENTS)) {
> > +				/*
> > +				 * We cannot fill holes in indirect
> > +				 * tree based inodes as that could
> > +				 * expose stale data in the case of a
> > +				 * crash. Use magic error code to
> > +				 * fallback to buffered IO.
> > +				 */
> > +				ret = ext4_map_blocks(handle, inode, &map, 0);
> > +				if (ret == 0)
> > +					ret = -ENOTBLK;
> > +			} else {
> > +				ret = ext4_map_blocks(handle, inode, &map,
> > +						      EXT4_GET_BLOCKS_IO_CREATE_EXT);
> > +			}
> > +		}
> 
> Could you please check & confirm on below points -
> 1. Do you see a problem @above in case of *overwrite* with extents mapping?
> It will fall into EXT4_GET_BLOCKS_IO_CREATE_EXT case.
> So are we piggy backing on the fact that ext4_map_blocks first call
> ext4_ext_map_blocks
> with flags & EXT4_GET_BLOCKS_KEEP_SIZE. And so for overwrite case since it
> will return
> val > 0 then we will anyway not create any blocks and so we don't need to
> check overwrite
> case specifically here?
> 
> 
> 2. For cases with flags passed is 0 to ext4_map_blocks (overwrite &
> fallocate without extent case),
> we need not start the journaling transaction. But in above we are doing
> ext4_journal_start/stop unconditionally
> & unnecessarily reserving dio_credits blocks.
> We need to take care of that right?

Hm, I think you raise valid points here.

Jan, do you have any comments on the above? I vaguely remember having a
discussion around dropping the overwrite checks in ext4_iomap_begin() as we're
removing the inode_lock() early on in ext4_dio_write_iter(), so it woudln't be
necessary to do so. But, now that Ritesh mentioned it again I'm thinking it
may actually be required...

> >   		if (ret < 0) {
> >   			ext4_journal_stop(handle);
> >   			if (ret == -ENOSPC &&
> > @@ -3581,10 +3611,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> >   		iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> >   		iomap->addr = IOMAP_NULL_ADDR;
> >   	} else {
> > -		if (map.m_flags & EXT4_MAP_MAPPED) {
> > -			iomap->type = IOMAP_MAPPED;
> > -		} else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> > +		if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> >   			iomap->type = IOMAP_UNWRITTEN;
> > +		} else if (map.m_flags & EXT4_MAP_MAPPED) {
> > +			iomap->type = IOMAP_MAPPED;
> Maybe a comment as to explaining why checking UNWRITTEN before is necessary
> for others.
> So in case of fallocate & DIO write case we may get extent which is both
> unwritten & mapped (right?).
> so we need to check if we have an unwritten extent first so that it will
> need the conversion in ->end_io
> callback.

Yes, that is essentially correct.

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-13 12:58     ` Matthew Bobrowski
@ 2019-08-13 14:35       ` Darrick J. Wong
  2019-08-14  9:51         ` Matthew Bobrowski
  0 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2019-08-13 14:35 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: RITESH HARJANI, linux-ext4, linux-fsdevel, jack, tytso

On Tue, Aug 13, 2019 at 10:58:42PM +1000, Matthew Bobrowski wrote:
> On Mon, Aug 12, 2019 at 10:34:29PM +0530, RITESH HARJANI wrote:
> > > +	if (offset + count > i_size_read(inode) ||
> > > +	    offset + count > EXT4_I(inode)->i_disksize) {
> > > +		ext4_update_i_disksize(inode, inode->i_size);
> > > +		extend = true;
> > > +	}
> > > +
> > > +	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> > > +
> > > +	/*
> > > +	 * Unaligned direct AIO must be the only IO in flight or else
> > > +	 * any overlapping aligned IO after unaligned IO might result
> > > +	 * in data corruption.
> > > +	 */
> > > +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> > > +		inode_dio_wait(inode);
> > 
> > Could you please add explain & add a comment about why we wait in AIO DIO
> > case
> > when extend is true? As I see without iomap code this case was not present
> > earlier.
> 
> Because while using the iomap infrastructure for AIO writes, we return to the
> caller prior to invoking the ->end_io() handler. This callback is responsible
> for performing the in-core/on-disk inode extension if it is deemed
> necessary. If we don't wait in the case of an extend, we run the risk of
> loosing inode size consistencies in addition to things leading to possible
> data corruption...
> 
> > > +
> > > +	if (ret >= 0 && iov_iter_count(from)) {
> > > +		overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> > > +		return ext4_buffered_write_iter(iocb, from);
> > > +	}
> > should not we copy code from "__generic_file_write_iter" which does below?
> > 
> > 3436                 /*
> > 3437                  * We need to ensure that the page cache pages are
> > written to
> > 3438                  * disk and invalidated to preserve the expected
> > O_DIRECT
> > 3439                  * semantics.
> > 3440                  */
> 
> Hm, I don't see why this would be required seeing as though the page cache
> invalidation semantics pre and post write are handled by iomap_dio_rw() and
> iomap_dio_complete(). But, I could be completely wrong here, so we may need to
> wait for some others to provide comments on this.

iomap_dio_rw is supposed to zap the page cache before the write and
again afterwards (and whine if someone is racing buffered and direct
writes to the same file location), so ext4 shouldn't need to do that
itself.

--D

> > > +			WARN_ON(!(flags & IOMAP_DIRECT));
> > > +			if (round_down(offset, i_blocksize(inode)) >=
> > > +			    i_size_read(inode)) {
> > > +				ret = ext4_map_blocks(handle, inode, &map,
> > > +						      EXT4_GET_BLOCKS_CREATE);
> > > +			} else if (!ext4_test_inode_flag(inode,
> > > +							 EXT4_INODE_EXTENTS)) {
> > > +				/*
> > > +				 * We cannot fill holes in indirect
> > > +				 * tree based inodes as that could
> > > +				 * expose stale data in the case of a
> > > +				 * crash. Use magic error code to
> > > +				 * fallback to buffered IO.
> > > +				 */
> > > +				ret = ext4_map_blocks(handle, inode, &map, 0);
> > > +				if (ret == 0)
> > > +					ret = -ENOTBLK;
> > > +			} else {
> > > +				ret = ext4_map_blocks(handle, inode, &map,
> > > +						      EXT4_GET_BLOCKS_IO_CREATE_EXT);
> > > +			}
> > > +		}
> > 
> > Could you please check & confirm on below points -
> > 1. Do you see a problem @above in case of *overwrite* with extents mapping?
> > It will fall into EXT4_GET_BLOCKS_IO_CREATE_EXT case.
> > So are we piggy backing on the fact that ext4_map_blocks first call
> > ext4_ext_map_blocks
> > with flags & EXT4_GET_BLOCKS_KEEP_SIZE. And so for overwrite case since it
> > will return
> > val > 0 then we will anyway not create any blocks and so we don't need to
> > check overwrite
> > case specifically here?
> > 
> > 
> > 2. For cases with flags passed is 0 to ext4_map_blocks (overwrite &
> > fallocate without extent case),
> > we need not start the journaling transaction. But in above we are doing
> > ext4_journal_start/stop unconditionally
> > & unnecessarily reserving dio_credits blocks.
> > We need to take care of that right?
> 
> Hm, I think you raise valid points here.
> 
> Jan, do you have any comments on the above? I vaguely remember having a
> discussion around dropping the overwrite checks in ext4_iomap_begin() as we're
> removing the inode_lock() early on in ext4_dio_write_iter(), so it woudln't be
> necessary to do so. But, now that Ritesh mentioned it again I'm thinking it
> may actually be required...
> 
> > >   		if (ret < 0) {
> > >   			ext4_journal_stop(handle);
> > >   			if (ret == -ENOSPC &&
> > > @@ -3581,10 +3611,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > >   		iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> > >   		iomap->addr = IOMAP_NULL_ADDR;
> > >   	} else {
> > > -		if (map.m_flags & EXT4_MAP_MAPPED) {
> > > -			iomap->type = IOMAP_MAPPED;
> > > -		} else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> > > +		if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> > >   			iomap->type = IOMAP_UNWRITTEN;
> > > +		} else if (map.m_flags & EXT4_MAP_MAPPED) {
> > > +			iomap->type = IOMAP_MAPPED;
> > Maybe a comment as to explaining why checking UNWRITTEN before is necessary
> > for others.
> > So in case of fallocate & DIO write case we may get extent which is both
> > unwritten & mapped (right?).
> > so we need to check if we have an unwritten extent first so that it will
> > need the conversion in ->end_io
> > callback.
> 
> Yes, that is essentially correct.
> 
> --M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-13 12:27     ` RITESH HARJANI
@ 2019-08-14  9:48       ` Matthew Bobrowski
  2019-08-14 11:58         ` RITESH HARJANI
  2019-08-21 13:14       ` Matthew Bobrowski
  1 sibling, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-14  9:48 UTC (permalink / raw)
  To: RITESH HARJANI; +Cc: linux-ext4, linux-fsdevel, jack, tytso

On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
> On 8/13/19 4:40 PM, Matthew Bobrowski wrote:
> > On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote:
> > > I was under the assumption that we need to maintain
> > > ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or
> > > atomic_read(&EXT4_I(inode)->i_unwritten))
> > > in case of non-AIO directIO or AIO directIO case as well (when we may
> > > allocate unwritten extents),
> > > to protect with some kind of race with other parts of code(maybe
> > > truncate/bufferedIO/fallocate not sure?) which may call for
> > > ext4_can_extents_be_merged()
> > > to check if extents can be merged or not.
> > > 
> > > Is it not the case?
> > > Now that directIO code has no way of specifying that this inode has
> > > unwritten extent, will it not race with any other path, where this info was
> > > necessary (like
> > > in above func ext4_can_extents_be_merged())?
> > Ah yes, I was under the same assumption when reviewing the code
> > initially and one of my first solutions was to also use this dynamic
> > 'state' flag in the ->end_io() handler. But, I fell flat on my face as
> > that deemed to be problematic... This is because there can be multiple
> > direct IOs to unwritten extents against the same inode, so you cannot
> > possibly get away with tracking them using this single inode flag. So,
> > hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use
> > IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks
> > whether _this_ particular IO has an underlying unwritten extent.
> 
> Thanks for taking time to explain this.
> 
> But what I meant was this (I may be wrong here since I haven't really looked
> into it),
> but for my understanding I would like to discuss this -
> 
> So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
> whether a newextent can be merged with ex1 in function
> ext4_extents_can_be_merged. But now since we have removed that flag we have
> no way of knowing that whether
> this inode has any unwritten extents or not from any DIO path.
> What I meant is isn't this removal of setting/unsetting of
> flag(EXT4_STATE_DIO_UNWRITTEN)
> changing the behavior of this func - ext4_extents_can_be_merged?

Ah yes, I see. I believe that what you're saying is correct and I
think we will need to account for this case. But, I haven't thought
about how to do this just yet.

> Also - could you please explain why this check returns 0 in the first place
> (line 1762 - 1766 below)?

I cannot explain why, because I myself don't know exactly why in this
particular case the extents cannot be merged. Perhaps `git blame` is
our friend and we can direct that question accordingly, or someone
else on this mailing list knows the answer. :-)

> 1733 int
> 1734 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent
> *ex1,
> 1735                                 struct ext4_extent *ex2)
> <...>
> 
> 1762         if (ext4_ext_is_unwritten(ex1) &&
> 1763             (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
> 1764              atomic_read(&EXT4_I(inode)->i_unwritten) ||
> 1765              (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
> 1766                 return 0;

--M


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-13 14:35       ` Darrick J. Wong
@ 2019-08-14  9:51         ` Matthew Bobrowski
  0 siblings, 0 replies; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-14  9:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: RITESH HARJANI, linux-ext4, linux-fsdevel, jack, tytso

On Tue, Aug 13, 2019 at 07:35:40AM -0700, Darrick J. Wong wrote:
> On Tue, Aug 13, 2019 at 10:58:42PM +1000, Matthew Bobrowski wrote:
> > On Mon, Aug 12, 2019 at 10:34:29PM +0530, RITESH HARJANI wrote:
> > > > +
> > > > +	if (ret >= 0 && iov_iter_count(from)) {
> > > > +		overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> > > > +		return ext4_buffered_write_iter(iocb, from);
> > > > +	}
> > > should not we copy code from "__generic_file_write_iter" which does below?
> > > 
> > > 3436                 /*
> > > 3437                  * We need to ensure that the page cache pages are
> > > written to
> > > 3438                  * disk and invalidated to preserve the expected
> > > O_DIRECT
> > > 3439                  * semantics.
> > > 3440                  */
> > 
> > Hm, I don't see why this would be required seeing as though the page cache
> > invalidation semantics pre and post write are handled by iomap_dio_rw() and
> > iomap_dio_complete(). But, I could be completely wrong here, so we may need to
> > wait for some others to provide comments on this.
> 
> iomap_dio_rw is supposed to zap the page cache before the write and
> again afterwards (and whine if someone is racing buffered and direct
> writes to the same file location), so ext4 shouldn't need to do that
> itself.

Thanks for confirming Darrick! I thought that was the case.

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-14  9:48       ` Matthew Bobrowski
@ 2019-08-14 11:58         ` RITESH HARJANI
  0 siblings, 0 replies; 48+ messages in thread
From: RITESH HARJANI @ 2019-08-14 11:58 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: linux-ext4, linux-fsdevel, jack, tytso, aneesh.kumar


On 8/14/19 3:18 PM, Matthew Bobrowski wrote:
> On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
>> On 8/13/19 4:40 PM, Matthew Bobrowski wrote:
>>> On Mon, Aug 12, 2019 at 11:01:50PM +0530, RITESH HARJANI wrote:
>>>> I was under the assumption that we need to maintain
>>>> ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) or
>>>> atomic_read(&EXT4_I(inode)->i_unwritten))
>>>> in case of non-AIO directIO or AIO directIO case as well (when we may
>>>> allocate unwritten extents),
>>>> to protect with some kind of race with other parts of code(maybe
>>>> truncate/bufferedIO/fallocate not sure?) which may call for
>>>> ext4_can_extents_be_merged()
>>>> to check if extents can be merged or not.
>>>>
>>>> Is it not the case?
>>>> Now that directIO code has no way of specifying that this inode has
>>>> unwritten extent, will it not race with any other path, where this info was
>>>> necessary (like
>>>> in above func ext4_can_extents_be_merged())?
>>> Ah yes, I was under the same assumption when reviewing the code
>>> initially and one of my first solutions was to also use this dynamic
>>> 'state' flag in the ->end_io() handler. But, I fell flat on my face as
>>> that deemed to be problematic... This is because there can be multiple
>>> direct IOs to unwritten extents against the same inode, so you cannot
>>> possibly get away with tracking them using this single inode flag. So,
>>> hence the reason why we drop using EXT4_STATE_DIO_UNWRITTEN and use
>>> IOMAP_DIO_UNWRITTEN instead in the ->end_io() handler, which tracks
>>> whether _this_ particular IO has an underlying unwritten extent.
>> Thanks for taking time to explain this.
>>
>> But what I meant was this (I may be wrong here since I haven't really looked
>> into it),
>> but for my understanding I would like to discuss this -
>>
>> So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
>> whether a newextent can be merged with ex1 in function
>> ext4_extents_can_be_merged. But now since we have removed that flag we have
>> no way of knowing that whether
>> this inode has any unwritten extents or not from any DIO path.
>> What I meant is isn't this removal of setting/unsetting of
>> flag(EXT4_STATE_DIO_UNWRITTEN)
>> changing the behavior of this func - ext4_extents_can_be_merged?
> Ah yes, I see. I believe that what you're saying is correct and I
> think we will need to account for this case. But, I haven't thought
> about how to do this just yet.

That's not a problem, we can surely discuss the possible approaches.


>> Also - could you please explain why this check returns 0 in the first place
>> (line 1762 - 1766 below)?
>>
>> 1733 int
>> 1734 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent
>> *ex1,
>> 1735                                 struct ext4_extent *ex2)
>> <...>
>>
>> 1762         if (ext4_ext_is_unwritten(ex1) &&
>> 1763             (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
>> 1764              atomic_read(&EXT4_I(inode)->i_unwritten) ||
>> 1765              (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
>> 1766                 return 0;
> I cannot explain why, because I myself don't know exactly why in this
> particular case the extents cannot be merged. Perhaps `git blame` is
> our friend and we can direct that question accordingly, or someone
> else on this mailing list knows the answer. :-)

git blame didn't help much. But I think I may know what the above 
condition means.
So I think if there is an ongoing IO to an unwritten extent, we may 
sometimes first split that extent
to match the exact range of the IO, then write data to it and then 
convert that *exact range* to written extent.

So this means while there is an ongoing IO to any inode which has 
unwritten extents, we should not allow
any other extent to merge with extents of this inode.

Now one race which I could think of it is, when we are doing AIO DIO and 
in parallel doing fallocate to the end of the file.
But since we wait for inode_dio_wait in fallocate, so that should not 
race with any DIO paths.

I think, here we can wait to get answers from others to understand, if 
there is any race with any of the DIO paths on removing this state 
flag(EXT4_STATE_DIO_UNWRITTEN) & not
setting "i_unwritten". Whether it can race with any other path and if 
this state flag is necessary(in DIO cases also) for the correct 
functionality?


Regards
Ritesh


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-13 12:27     ` RITESH HARJANI
  2019-08-14  9:48       ` Matthew Bobrowski
@ 2019-08-21 13:14       ` Matthew Bobrowski
  2019-08-22 12:00         ` Matthew Bobrowski
  1 sibling, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-21 13:14 UTC (permalink / raw)
  To: tytso, jack, adilger.kernel; +Cc: linux-ext4, linux-fsdevel, riteshh, hch

On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
> But what I meant was this (I may be wrong here since I haven't really looked
> into it), but for my understanding I would like to discuss this -
> 
> So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
> whether a newextent can be merged with ex1 in function
> ext4_extents_can_be_merged. But now since we have removed that flag we have
> no way of knowing that whether this inode has any unwritten extents or not
> from any DIO path.
> 
> What I meant is isn't this removal of setting/unsetting of
> flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func -
> ext4_extents_can_be_merged?

OK, I'm stuck and looking for either clarity, revalidation of my
thought process, or any input on how to solve this problem for that
matter.

In the current ext4 direct IO implementation, the dynamic state flag
EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO
writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset
for ext4_io_end->flag, and the value of i_unwritten is
incremented/decremented for asynchronous direct IO writes. All
mechanisms by which are used to track and determine whether the inode,
or an IO in flight against a particular inode have any pending
unwritten extents that need to be converted after the IO has
completed. In addition to this, we have ext4_can_extents_be_merged()
performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and
i_unwritten and using them to determine whether it can or cannot merge
a requested extent into an existing extent.

This is all fine for the current direct IO implementation. However,
while switching the direct IO code paths over to make use of the iomap
infrastructure, I believe that we can no longer simply track whether
an inode has unwritten extents needing to be converted by simply
setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the
inode. The reason being is that there can be multiple direct IO
operations to unwritten extents running against the inode and we don't
particularly distinguish synchronous from asynchronous writes within
ext4_iomap_begin() as there's really no need. So, the only way to
accurately determine whether extent conversion is deemed necessary for
an IO operation whether it'd be synchronous or asynchronous is by
checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io()
callback. I'm certain that this portion of the logic is correct, but
we're still left with some issues when it comes to the checks that I
previously mentioned in ext4_can_extents_be_merged(), which is the
part I need some input on.

I was doing some thinking and I don't believe that making use of the
EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not
only for reasons that I've briefly mentioned above, but also because
of the fact that it'll probably lead to a lot of inaccurate judgements
while taking particular code paths and some really ugly code that
creeps close to the definition of insanity. Rather, what if we solve
this problem by continuing to just use i_unwritten to keep track of
all the direct IOs to unwritten against running against an inode?
Within ext4_iomap_begin() post successful creation of unwritten
extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and
subsequently within the ->end_io() callback whether we take the
success or error path we'd call
atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can
still rely on this value to be used in the check within
ext4_can_extents_be_merged(). Open for alternate suggestions if anyone
has any...

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-21 13:14       ` Matthew Bobrowski
@ 2019-08-22 12:00         ` Matthew Bobrowski
  2019-08-22 14:11           ` Ritesh Harjani
  2019-08-23 13:43           ` [RFC 1/1] ext4: PoC implementation of option-1 Ritesh Harjani
  0 siblings, 2 replies; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-22 12:00 UTC (permalink / raw)
  To: tytso, jack, adilger.kernel; +Cc: linux-ext4, linux-fsdevel, riteshh, hch

On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote:
> On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
> > But what I meant was this (I may be wrong here since I haven't
> > really looked into it), but for my understanding I would like to
> > discuss this -
> > 
> > So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
> > whether a newextent can be merged with ex1 in function
> > ext4_extents_can_be_merged. But now since we have removed that flag we have
> > no way of knowing that whether this inode has any unwritten extents or not
> > from any DIO path.
> > 
> > What I meant is isn't this removal of setting/unsetting of
> > flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func -
> > ext4_extents_can_be_merged?
> 
> OK, I'm stuck and looking for either clarity, revalidation of my
> thought process, or any input on how to solve this problem for that
> matter.
> 
> In the current ext4 direct IO implementation, the dynamic state flag
> EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO
> writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset
> for ext4_io_end->flag, and the value of i_unwritten is
> incremented/decremented for asynchronous direct IO writes. All
> mechanisms by which are used to track and determine whether the inode,
> or an IO in flight against a particular inode have any pending
> unwritten extents that need to be converted after the IO has
> completed. In addition to this, we have ext4_can_extents_be_merged()
> performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and
> i_unwritten and using them to determine whether it can or cannot merge
> a requested extent into an existing extent.
> 
> This is all fine for the current direct IO implementation. However,
> while switching the direct IO code paths over to make use of the iomap
> infrastructure, I believe that we can no longer simply track whether
> an inode has unwritten extents needing to be converted by simply
> setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the
> inode. The reason being is that there can be multiple direct IO
> operations to unwritten extents running against the inode and we don't
> particularly distinguish synchronous from asynchronous writes within
> ext4_iomap_begin() as there's really no need. So, the only way to
> accurately determine whether extent conversion is deemed necessary for
> an IO operation whether it'd be synchronous or asynchronous is by
> checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io()
> callback. I'm certain that this portion of the logic is correct, but
> we're still left with some issues when it comes to the checks that I
> previously mentioned in ext4_can_extents_be_merged(), which is the
> part I need some input on.
> 
> I was doing some thinking and I don't believe that making use of the
> EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not
> only for reasons that I've briefly mentioned above, but also because
> of the fact that it'll probably lead to a lot of inaccurate judgements
> while taking particular code paths and some really ugly code that
> creeps close to the definition of insanity. Rather, what if we solve
> this problem by continuing to just use i_unwritten to keep track of
> all the direct IOs to unwritten against running against an inode?
> Within ext4_iomap_begin() post successful creation of unwritten
> extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and
> subsequently within the ->end_io() callback whether we take the
> success or error path we'd call
> atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can
> still rely on this value to be used in the check within
> ext4_can_extents_be_merged(). Open for alternate suggestions if anyone
> has any...

Actually, no...

I've done some more thinking and what I suggested above around the use
of i_unwritten will also not work properly. Using iomap
infrastructure, there is the possibility of calling into the
->iomap_begin() more than once for a single direct IO operation. This
means that by the time we even get to decrementing i_unwritten in the
->end_io() callback after converting the unwritten extents we're
already running the possibility of i_unwritten becoming unbalanced
really quickly and staying that way. This also means that the
statement checking i_unwritten in ext4_can_extents_be_merged() will be
affected and potentially result in it being evaluated incorrectly. I
was thinking that we could just decrement i_unwritten in
->iomap_end(), but that seems to me like it would be racy and also
lead to incorrect results. At this point I'm out of ideas on how to
solve this, so any other ideas would be appreciated!

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-22 12:00         ` Matthew Bobrowski
@ 2019-08-22 14:11           ` Ritesh Harjani
  2019-08-24  3:18             ` Matthew Bobrowski
  2019-08-23 13:43           ` [RFC 1/1] ext4: PoC implementation of option-1 Ritesh Harjani
  1 sibling, 1 reply; 48+ messages in thread
From: Ritesh Harjani @ 2019-08-22 14:11 UTC (permalink / raw)
  To: Matthew Bobrowski, tytso, jack, adilger.kernel
  Cc: linux-ext4, linux-fsdevel, hch, aneesh.kumar



On 8/22/19 5:30 PM, Matthew Bobrowski wrote:
> On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote:
>> On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
>>> But what I meant was this (I may be wrong here since I haven't
>>> really looked into it), but for my understanding I would like to
>>> discuss this -
>>>
>>> So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
>>> whether a newextent can be merged with ex1 in function
>>> ext4_extents_can_be_merged. But now since we have removed that flag we have
>>> no way of knowing that whether this inode has any unwritten extents or not
>>> from any DIO path.
>>>
>>> What I meant is isn't this removal of setting/unsetting of
>>> flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func -
>>> ext4_extents_can_be_merged?
>>
>> OK, I'm stuck and looking for either clarity, revalidation of my
>> thought process, or any input on how to solve this problem for that
>> matter.
>>
>> In the current ext4 direct IO implementation, the dynamic state flag
>> EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO
>> writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset
>> for ext4_io_end->flag, and the value of i_unwritten is
>> incremented/decremented for asynchronous direct IO writes. All
>> mechanisms by which are used to track and determine whether the inode,
>> or an IO in flight against a particular inode have any pending
>> unwritten extents that need to be converted after the IO has
>> completed. In addition to this, we have ext4_can_extents_be_merged()
>> performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and
>> i_unwritten and using them to determine whether it can or cannot merge
>> a requested extent into an existing extent.
>>
>> This is all fine for the current direct IO implementation. However,
>> while switching the direct IO code paths over to make use of the iomap
>> infrastructure, I believe that we can no longer simply track whether
>> an inode has unwritten extents needing to be converted by simply
>> setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the
>> inode. The reason being is that there can be multiple direct IO
>> operations to unwritten extents running against the inode and we don't
>> particularly distinguish synchronous from asynchronous writes within
>> ext4_iomap_begin() as there's really no need. So, the only way to
>> accurately determine whether extent conversion is deemed necessary for
>> an IO operation whether it'd be synchronous or asynchronous is by
>> checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io()
>> callback. I'm certain that this portion of the logic is correct, but
>> we're still left with some issues when it comes to the checks that I
>> previously mentioned in ext4_can_extents_be_merged(), which is the
>> part I need some input on.
>>
>> I was doing some thinking and I don't believe that making use of the
>> EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not
>> only for reasons that I've briefly mentioned above, but also because
>> of the fact that it'll probably lead to a lot of inaccurate judgements
>> while taking particular code paths and some really ugly code that
>> creeps close to the definition of insanity. Rather, what if we solve
>> this problem by continuing to just use i_unwritten to keep track of
>> all the direct IOs to unwritten against running against an inode?
>> Within ext4_iomap_begin() post successful creation of unwritten
>> extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and
>> subsequently within the ->end_io() callback whether we take the
>> success or error path we'd call
>> atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can
>> still rely on this value to be used in the check within
>> ext4_can_extents_be_merged(). Open for alternate suggestions if anyone
>> has any...
> 
> Actually, no...
> 
> I've done some more thinking and what I suggested above around the use
> of i_unwritten will also not work properly. Using iomap
> infrastructure, there is the possibility of calling into the
> ->iomap_begin() more than once for a single direct IO operation. This
> means that by the time we even get to decrementing i_unwritten in the
> ->end_io() callback after converting the unwritten extents we're
> already running the possibility of i_unwritten becoming unbalanced
> really quickly and staying that way. This also means that the
> statement checking i_unwritten in ext4_can_extents_be_merged() will be
> affected and potentially result in it being evaluated incorrectly. I
> was thinking that we could just decrement i_unwritten in
> ->iomap_end(), but that seems to me like it would be racy and also
> lead to incorrect results. At this point I'm out of ideas on how to
> solve this, so any other ideas would be appreciated!

I will let others also comment, if someone has any other better approach.

1. One approach is to add the infrastructure in iomap with 
iomap_dio->private which is filesystem specific pointer. This can be
updated by filesystem in ->iomap_begin call into iomap->private.
And in case of iomap_dio_rw, this iomap->private will be copied to 
iomap_dio->private if not already set.

But I think this will eventually become hacky in the sense when you will 
have to determine on whether the dio->private is already set or not when 
iomap_apply will be called second time. It will become a problem with 
AIO DIO in ext4 since we use i_unwritten which tells us whether there is 
any unwritten extent but it does not tell whether this unwritten extent 
is due to a DIRECT AIO DIO in progress or a buffered one.

So we can ignore this approach - unless you or someone else have some 
good design ideas to build on top of above.


2. Second approach which I was thinking is to track only those extents 
which are marked unwritten and are under IO. This can be done in 
ext4_map_blocks. This way we will not have to track a particular inode 
has any unwritten extents or not, but it will be extent based.
Something similar was also done a while ago. Do you think this approach 
will work in our case?

So with this extents will be scanned in extent status tree to see if any 
among those are under IO and are unwritten in func 
ext4_can_extents_be_merged.

https://patchwork.ozlabs.org/patch/1013837/


-Ritesh


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC 1/1] ext4: PoC implementation of option-1
  2019-08-22 12:00         ` Matthew Bobrowski
  2019-08-22 14:11           ` Ritesh Harjani
@ 2019-08-23 13:43           ` Ritesh Harjani
  2019-08-23 13:49             ` Ritesh Harjani
  1 sibling, 1 reply; 48+ messages in thread
From: Ritesh Harjani @ 2019-08-23 13:43 UTC (permalink / raw)
  To: mbobrowski
  Cc: riteshh, tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel,
	hch, aneesh.kumar

This is just a PoC implementation of the option-1 which I was
mentioning in the previous email.
As mentioned, it is adding some flags & private pointer
to iomap infrastructure.

I would let upto you and others to comment on this approach.
Please note that I have run only few basic tests.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/ext4.h        |  1 +
 fs/ext4/file.c        | 37 ++++++++++++++++++++++++++++++++++---
 fs/ext4/inode.c       | 31 +++++++++++++++++++++++++++++++
 fs/iomap/direct-io.c  | 16 +++++++++++++++-
 fs/xfs/xfs_file.c     |  3 ++-
 include/linux/iomap.h |  5 ++++-
 6 files changed, 87 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2ab91815f52d..93aa716c691e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1555,6 +1555,7 @@ enum {
 	EXT4_STATE_NO_EXPAND,		/* No space for expansion */
 	EXT4_STATE_DA_ALLOC_CLOSE,	/* Alloc DA blks on close */
 	EXT4_STATE_EXT_MIGRATE,		/* Inode is migrating */
+	EXT4_STATE_DIO_UNWRITTEN,	/* need convert on dio done*/
 	EXT4_STATE_NEWENTRY,		/* File just added to dir */
 	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
 	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3736dbf28d0a..adfb56401312 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -314,8 +314,30 @@ static int ext4_handle_inode_extension(struct inode *inode, loff_t size,
 	return ret;
 }
 
+static int ext4_dio_write_end_io_convert(struct inode *inode, loff_t offset,
+					 ssize_t size, void *private)
+{
+	int ret = 0;
+	ext4_io_end_t *io_end = private;
+
+	if (!io_end) {
+		WARN_ON(!ext4_test_inode_state(inode,
+				EXT4_STATE_DIO_UNWRITTEN));
+		ret = ext4_convert_unwritten_extents(NULL, inode, offset, size);
+		if (ret)
+			return ret;
+		ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
+	} else {
+		io_end->offset = offset;
+		io_end->size = size;
+		ext4_put_io_end(io_end);
+	}
+	return ret;
+}
+
 static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
-				 ssize_t error, unsigned int flags)
+				 ssize_t error, unsigned int flags,
+				 void *private)
 {
 	int ret = 0;
 	handle_t *handle;
@@ -345,12 +367,21 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
 				ext4_orphan_del(handle, inode);
 			ext4_journal_stop(handle);
 		}
+
+		if (flags & IOMAP_DIO_UNWRITTEN) {
+			ret = ext4_dio_write_end_io_convert(inode, offset,
+					size, private);
+			if (ret < 0)
+				return ret;
+		}
+
 		return error;
 	}
 
 	if (flags & IOMAP_DIO_UNWRITTEN) {
-		ret = ext4_convert_unwritten_extents(NULL, inode, offset, size);
-		if (ret)
+		ret = ext4_dio_write_end_io_convert(inode, offset,
+						    size, private);
+		if (ret < 0)
 			return ret;
 	}
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 15715d22808d..9d77ed2aa58c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3322,6 +3322,26 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
 	return inode->i_state & I_DIRTY_DATASYNC;
 }
 
+static int ext4_iomap_init_io_end(struct inode *inode, struct iomap *iomap,
+				  unsigned flags)
+{
+	ext4_io_end_t *io_end;
+
+	if (flags & IOMAP_DIRECT_AIO) {
+		if (flags & IOMAP_PRIVATE)
+			return 0;
+		io_end = ext4_init_io_end(inode, GFP_KERNEL);
+		if (!io_end)
+			return -ENOMEM;
+		iomap->private = io_end;
+		ext4_set_io_unwritten_flag(inode, io_end);
+	} else {
+		iomap->private = NULL;
+		ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
+	}
+	return 0;
+}
+
 static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 			    unsigned flags, struct iomap *iomap)
 {
@@ -3433,6 +3453,17 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 				ret = ext4_map_blocks(handle, inode, &map,
 						      EXT4_GET_BLOCKS_IO_CREATE_EXT);
 			}
+
+			if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+				int err;
+
+				err = ext4_iomap_init_io_end(inode, iomap,
+							     flags);
+				if (err < 0) {
+					ext4_journal_stop(handle);
+					return err;
+				}
+			}
 		}
 
 		if (ret < 0) {
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 2ccf1c6460d4..7d88683a0d93 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -17,6 +17,7 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_PRIVATE	(1 << 27)
 #define IOMAP_DIO_WRITE_FUA	(1 << 28)
 #define IOMAP_DIO_NEED_SYNC	(1 << 29)
 #define IOMAP_DIO_WRITE		(1 << 30)
@@ -31,6 +32,7 @@ struct iomap_dio {
 	unsigned		flags;
 	int			error;
 	bool			wait_for_completion;
+	void			*private;	/* FS specific */
 
 	union {
 		/* used during submission and for synchronous completion: */
@@ -78,7 +80,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	ssize_t ret;
 
 	if (dio->end_io)
-		ret = dio->end_io(iocb, dio->size, dio->error, dio->flags);
+		ret = dio->end_io(iocb, dio->size, dio->error, dio->flags,
+				  dio->private);
 	else
 		ret = dio->error;
 
@@ -363,6 +366,11 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 {
 	struct iomap_dio *dio = data;
 
+	if (!(dio->flags & IOMAP_DIO_PRIVATE) && iomap->private) {
+		dio->private = iomap->private;
+		dio->flags |= IOMAP_DIO_PRIVATE;
+	}
+
 	switch (iomap->type) {
 	case IOMAP_HOLE:
 		if (WARN_ON_ONCE(dio->flags & IOMAP_DIO_WRITE))
@@ -421,6 +429,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->end_io = end_io;
 	dio->error = 0;
 	dio->flags = 0;
+	dio->private = NULL;
 
 	dio->submit.iter = iter;
 	dio->submit.waiter = current;
@@ -458,6 +467,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		}
 		flags |= IOMAP_NOWAIT;
 	}
+	if (!is_sync_kiocb(iocb) && (flags & IOMAP_WRITE))
+		flags |= IOMAP_DIRECT_AIO;
 
 	ret = filemap_write_and_wait_range(mapping, start, end);
 	if (ret)
@@ -486,6 +497,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	blk_start_plug(&plug);
 	do {
+		if (dio->flags & IOMAP_DIO_PRIVATE)
+			flags |= IOMAP_PRIVATE;
+
 		ret = iomap_apply(inode, pos, count, flags, ops, dio,
 				iomap_dio_actor);
 		if (ret <= 0) {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f2bc3ac4a60e..205c4219986c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -370,7 +370,8 @@ xfs_dio_write_end_io(
 	struct kiocb		*iocb,
 	ssize_t			size,
 	ssize_t                 error,
-	unsigned		flags)
+	unsigned		flags,
+	void			*private)
 {
 	struct inode		*inode = file_inode(iocb->ki_filp);
 	struct xfs_inode	*ip = XFS_I(inode);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 900284e5c06c..57d3679442f9 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -102,6 +102,8 @@ struct iomap_page_ops {
 #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
 #define IOMAP_NOWAIT		(1 << 5) /* do not block */
+#define IOMAP_DIRECT_AIO	(1 << 6) /* AIO DIO */
+#define IOMAP_PRIVATE		(1 << 7) /* FS specific */
 
 struct iomap_ops {
 	/*
@@ -189,7 +191,8 @@ sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
 #define IOMAP_DIO_UNWRITTEN	(1 << 0)	/* covers unwritten extent(s) */
 #define IOMAP_DIO_COW		(1 << 1)	/* covers COW extent(s) */
 typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t size,
-				 ssize_t error, unsigned int flags);
+				 ssize_t error, unsigned int flags,
+				 void *private);
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
 int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 1/1] ext4: PoC implementation of option-1
  2019-08-23 13:43           ` [RFC 1/1] ext4: PoC implementation of option-1 Ritesh Harjani
@ 2019-08-23 13:49             ` Ritesh Harjani
  0 siblings, 0 replies; 48+ messages in thread
From: Ritesh Harjani @ 2019-08-23 13:49 UTC (permalink / raw)
  To: mbobrowski
  Cc: tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, hch,
	aneesh.kumar

FYI - This patch is built on top of Matthew's patch series just as a 
concept implementation for option-1 stated in previous email.


On 8/23/19 7:13 PM, Ritesh Harjani wrote:
> This is just a PoC implementation of the option-1 which I was
> mentioning in the previous email.
> As mentioned, it is adding some flags & private pointer
> to iomap infrastructure.
> 
> I would let upto you and others to comment on this approach.
> Please note that I have run only few basic tests.
> 
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> ---
>   fs/ext4/ext4.h        |  1 +
>   fs/ext4/file.c        | 37 ++++++++++++++++++++++++++++++++++---
>   fs/ext4/inode.c       | 31 +++++++++++++++++++++++++++++++
>   fs/iomap/direct-io.c  | 16 +++++++++++++++-
>   fs/xfs/xfs_file.c     |  3 ++-
>   include/linux/iomap.h |  5 ++++-
>   6 files changed, 87 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 2ab91815f52d..93aa716c691e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1555,6 +1555,7 @@ enum {
>   	EXT4_STATE_NO_EXPAND,		/* No space for expansion */
>   	EXT4_STATE_DA_ALLOC_CLOSE,	/* Alloc DA blks on close */
>   	EXT4_STATE_EXT_MIGRATE,		/* Inode is migrating */
> +	EXT4_STATE_DIO_UNWRITTEN,	/* need convert on dio done*/
>   	EXT4_STATE_NEWENTRY,		/* File just added to dir */
>   	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
>   	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 3736dbf28d0a..adfb56401312 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -314,8 +314,30 @@ static int ext4_handle_inode_extension(struct inode *inode, loff_t size,
>   	return ret;
>   }
>   
> +static int ext4_dio_write_end_io_convert(struct inode *inode, loff_t offset,
> +					 ssize_t size, void *private)
> +{
> +	int ret = 0;
> +	ext4_io_end_t *io_end = private;
> +
> +	if (!io_end) {
> +		WARN_ON(!ext4_test_inode_state(inode,
> +				EXT4_STATE_DIO_UNWRITTEN));
> +		ret = ext4_convert_unwritten_extents(NULL, inode, offset, size);
> +		if (ret)
> +			return ret;
> +		ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
> +	} else {
> +		io_end->offset = offset;
> +		io_end->size = size;
> +		ext4_put_io_end(io_end);
> +	}
> +	return ret;
> +}
> +
>   static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> -				 ssize_t error, unsigned int flags)
> +				 ssize_t error, unsigned int flags,
> +				 void *private)
>   {
>   	int ret = 0;
>   	handle_t *handle;
> @@ -345,12 +367,21 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
>   				ext4_orphan_del(handle, inode);
>   			ext4_journal_stop(handle);
>   		}
> +
> +		if (flags & IOMAP_DIO_UNWRITTEN) {
> +			ret = ext4_dio_write_end_io_convert(inode, offset,
> +					size, private);
> +			if (ret < 0)
> +				return ret;
> +		}
> +
>   		return error;
>   	}
>   
>   	if (flags & IOMAP_DIO_UNWRITTEN) {
> -		ret = ext4_convert_unwritten_extents(NULL, inode, offset, size);
> -		if (ret)
> +		ret = ext4_dio_write_end_io_convert(inode, offset,
> +						    size, private);
> +		if (ret < 0)
>   			return ret;
>   	}
>   
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 15715d22808d..9d77ed2aa58c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3322,6 +3322,26 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
>   	return inode->i_state & I_DIRTY_DATASYNC;
>   }
>   
> +static int ext4_iomap_init_io_end(struct inode *inode, struct iomap *iomap,
> +				  unsigned flags)
> +{
> +	ext4_io_end_t *io_end;
> +
> +	if (flags & IOMAP_DIRECT_AIO) {
> +		if (flags & IOMAP_PRIVATE)
> +			return 0;
> +		io_end = ext4_init_io_end(inode, GFP_KERNEL);
> +		if (!io_end)
> +			return -ENOMEM;
> +		iomap->private = io_end;
> +		ext4_set_io_unwritten_flag(inode, io_end);
> +	} else {
> +		iomap->private = NULL;
> +		ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
> +	}
> +	return 0;
> +}
> +
>   static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   			    unsigned flags, struct iomap *iomap)
>   {
> @@ -3433,6 +3453,17 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   				ret = ext4_map_blocks(handle, inode, &map,
>   						      EXT4_GET_BLOCKS_IO_CREATE_EXT);
>   			}
> +
> +			if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> +				int err;
> +
> +				err = ext4_iomap_init_io_end(inode, iomap,
> +							     flags);
> +				if (err < 0) {
> +					ext4_journal_stop(handle);
> +					return err;
> +				}
> +			}
>   		}
>   
>   		if (ret < 0) {
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 2ccf1c6460d4..7d88683a0d93 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -17,6 +17,7 @@
>    * Private flags for iomap_dio, must not overlap with the public ones in
>    * iomap.h:
>    */
> +#define IOMAP_DIO_PRIVATE	(1 << 27)
>   #define IOMAP_DIO_WRITE_FUA	(1 << 28)
>   #define IOMAP_DIO_NEED_SYNC	(1 << 29)
>   #define IOMAP_DIO_WRITE		(1 << 30)
> @@ -31,6 +32,7 @@ struct iomap_dio {
>   	unsigned		flags;
>   	int			error;
>   	bool			wait_for_completion;
> +	void			*private;	/* FS specific */
>   
>   	union {
>   		/* used during submission and for synchronous completion: */
> @@ -78,7 +80,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>   	ssize_t ret;
>   
>   	if (dio->end_io)
> -		ret = dio->end_io(iocb, dio->size, dio->error, dio->flags);
> +		ret = dio->end_io(iocb, dio->size, dio->error, dio->flags,
> +				  dio->private);
>   	else
>   		ret = dio->error;
>   
> @@ -363,6 +366,11 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>   {
>   	struct iomap_dio *dio = data;
>   
> +	if (!(dio->flags & IOMAP_DIO_PRIVATE) && iomap->private) {
> +		dio->private = iomap->private;
> +		dio->flags |= IOMAP_DIO_PRIVATE;
> +	}
> +
>   	switch (iomap->type) {
>   	case IOMAP_HOLE:
>   		if (WARN_ON_ONCE(dio->flags & IOMAP_DIO_WRITE))
> @@ -421,6 +429,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   	dio->end_io = end_io;
>   	dio->error = 0;
>   	dio->flags = 0;
> +	dio->private = NULL;
>   
>   	dio->submit.iter = iter;
>   	dio->submit.waiter = current;
> @@ -458,6 +467,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   		}
>   		flags |= IOMAP_NOWAIT;
>   	}
> +	if (!is_sync_kiocb(iocb) && (flags & IOMAP_WRITE))
> +		flags |= IOMAP_DIRECT_AIO;
>   
>   	ret = filemap_write_and_wait_range(mapping, start, end);
>   	if (ret)
> @@ -486,6 +497,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   
>   	blk_start_plug(&plug);
>   	do {
> +		if (dio->flags & IOMAP_DIO_PRIVATE)
> +			flags |= IOMAP_PRIVATE;
> +
>   		ret = iomap_apply(inode, pos, count, flags, ops, dio,
>   				iomap_dio_actor);
>   		if (ret <= 0) {
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index f2bc3ac4a60e..205c4219986c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -370,7 +370,8 @@ xfs_dio_write_end_io(
>   	struct kiocb		*iocb,
>   	ssize_t			size,
>   	ssize_t                 error,
> -	unsigned		flags)
> +	unsigned		flags,
> +	void			*private)
>   {
>   	struct inode		*inode = file_inode(iocb->ki_filp);
>   	struct xfs_inode	*ip = XFS_I(inode);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 900284e5c06c..57d3679442f9 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -102,6 +102,8 @@ struct iomap_page_ops {
>   #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
>   #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
>   #define IOMAP_NOWAIT		(1 << 5) /* do not block */
> +#define IOMAP_DIRECT_AIO	(1 << 6) /* AIO DIO */
> +#define IOMAP_PRIVATE		(1 << 7) /* FS specific */
>   
>   struct iomap_ops {
>   	/*
> @@ -189,7 +191,8 @@ sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
>   #define IOMAP_DIO_UNWRITTEN	(1 << 0)	/* covers unwritten extent(s) */
>   #define IOMAP_DIO_COW		(1 << 1)	/* covers COW extent(s) */
>   typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t size,
> -				 ssize_t error, unsigned int flags);
> +				 ssize_t error, unsigned int flags,
> +				 void *private);
>   ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   		const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
>   int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-22 14:11           ` Ritesh Harjani
@ 2019-08-24  3:18             ` Matthew Bobrowski
  2019-08-24  3:55               ` Darrick J. Wong
  0 siblings, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-24  3:18 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: tytso, jack, adilger.kernel, linux-ext4, linux-fsdevel, hch,
	aneesh.kumar, darrick.wong

On Thu, Aug 22, 2019 at 07:41:25PM +0530, Ritesh Harjani wrote:
> 
> 
> On 8/22/19 5:30 PM, Matthew Bobrowski wrote:
> > On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote:
> > > On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
> > > > But what I meant was this (I may be wrong here since I haven't
> > > > really looked into it), but for my understanding I would like to
> > > > discuss this -
> > > > 
> > > > So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
> > > > whether a newextent can be merged with ex1 in function
> > > > ext4_extents_can_be_merged. But now since we have removed that flag we have
> > > > no way of knowing that whether this inode has any unwritten extents or not
> > > > from any DIO path.
> > > > 
> > > > What I meant is isn't this removal of setting/unsetting of
> > > > flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func -
> > > > ext4_extents_can_be_merged?
> > > 
> > > OK, I'm stuck and looking for either clarity, revalidation of my
> > > thought process, or any input on how to solve this problem for that
> > > matter.
> > > 
> > > In the current ext4 direct IO implementation, the dynamic state flag
> > > EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO
> > > writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset
> > > for ext4_io_end->flag, and the value of i_unwritten is
> > > incremented/decremented for asynchronous direct IO writes. All
> > > mechanisms by which are used to track and determine whether the inode,
> > > or an IO in flight against a particular inode have any pending
> > > unwritten extents that need to be converted after the IO has
> > > completed. In addition to this, we have ext4_can_extents_be_merged()
> > > performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and
> > > i_unwritten and using them to determine whether it can or cannot merge
> > > a requested extent into an existing extent.
> > > 
> > > This is all fine for the current direct IO implementation. However,
> > > while switching the direct IO code paths over to make use of the iomap
> > > infrastructure, I believe that we can no longer simply track whether
> > > an inode has unwritten extents needing to be converted by simply
> > > setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the
> > > inode. The reason being is that there can be multiple direct IO
> > > operations to unwritten extents running against the inode and we don't
> > > particularly distinguish synchronous from asynchronous writes within
> > > ext4_iomap_begin() as there's really no need. So, the only way to
> > > accurately determine whether extent conversion is deemed necessary for
> > > an IO operation whether it'd be synchronous or asynchronous is by
> > > checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io()
> > > callback. I'm certain that this portion of the logic is correct, but
> > > we're still left with some issues when it comes to the checks that I
> > > previously mentioned in ext4_can_extents_be_merged(), which is the
> > > part I need some input on.
> > > 
> > > I was doing some thinking and I don't believe that making use of the
> > > EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not
> > > only for reasons that I've briefly mentioned above, but also because
> > > of the fact that it'll probably lead to a lot of inaccurate judgements
> > > while taking particular code paths and some really ugly code that
> > > creeps close to the definition of insanity. Rather, what if we solve
> > > this problem by continuing to just use i_unwritten to keep track of
> > > all the direct IOs to unwritten against running against an inode?
> > > Within ext4_iomap_begin() post successful creation of unwritten
> > > extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and
> > > subsequently within the ->end_io() callback whether we take the
> > > success or error path we'd call
> > > atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can
> > > still rely on this value to be used in the check within
> > > ext4_can_extents_be_merged(). Open for alternate suggestions if anyone
> > > has any...
> > 
> > Actually, no...
> > 
> > I've done some more thinking and what I suggested above around the use
> > of i_unwritten will also not work properly. Using iomap
> > infrastructure, there is the possibility of calling into the
> > ->iomap_begin() more than once for a single direct IO operation. This
> > means that by the time we even get to decrementing i_unwritten in the
> > ->end_io() callback after converting the unwritten extents we're
> > already running the possibility of i_unwritten becoming unbalanced
> > really quickly and staying that way. This also means that the
> > statement checking i_unwritten in ext4_can_extents_be_merged() will be
> > affected and potentially result in it being evaluated incorrectly. I
> > was thinking that we could just decrement i_unwritten in
> > ->iomap_end(), but that seems to me like it would be racy and also
> > lead to incorrect results. At this point I'm out of ideas on how to
> > solve this, so any other ideas would be appreciated!
> 
> I will let others also comment, if someone has any other better approach.
> 
> 1. One approach is to add the infrastructure in iomap with
> iomap_dio->private which is filesystem specific pointer. This can be
> updated by filesystem in ->iomap_begin call into iomap->private.
> And in case of iomap_dio_rw, this iomap->private will be copied to
> iomap_dio->private if not already set.
> 
> But I think this will eventually become hacky in the sense when you will
> have to determine on whether the dio->private is already set or not when
> iomap_apply will be called second time. It will become a problem with AIO
> DIO in ext4 since we use i_unwritten which tells us whether there is any
> unwritten extent but it does not tell whether this unwritten extent is due
> to a DIRECT AIO DIO in progress or a buffered one.
> 
> So we can ignore this approach - unless you or someone else have some good
> design ideas to build on top of above.

I'm not sure whether _this_ is the solution or not, so let's maybe
wait for others to comment. One thing that I and probably the iomap
maintainers would like to avoid is adding any special case code to
iomap infrastructure, if possible. I mean, from what you suggest it
seems to be rather generic and not overly intrusive, although I know
for a fact that iomap infrastructure exists because stuff like
buffer_heads and the old direct IO code ended up accommodating so many
different edge cases making it almost unmodifiable and unmaintainable.

> 2. Second approach which I was thinking is to track only those extents which
> are marked unwritten and are under IO. This can be done in ext4_map_blocks.
> This way we will not have to track a particular inode has any unwritten
> extents or not, but it will be extent based.
> Something similar was also done a while ago. Do you think this approach will
> work in our case?
> 
> So with this extents will be scanned in extent status tree to see if any
> among those are under IO and are unwritten in func
> ext4_can_extents_be_merged.
> 
> https://patchwork.ozlabs.org/patch/1013837/

Maybe this would be a better approach and I think that it'd
work. Based upon what I read within in that thread there weren't
really any objections to the idea, although I can't see that it made
it upstream, so I may be missing something?

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-24  3:18             ` Matthew Bobrowski
@ 2019-08-24  3:55               ` Darrick J. Wong
  2019-08-24 23:04                 ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2019-08-24  3:55 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: Ritesh Harjani, tytso, jack, adilger.kernel, linux-ext4,
	linux-fsdevel, hch, aneesh.kumar

On Sat, Aug 24, 2019 at 01:18:31PM +1000, Matthew Bobrowski wrote:
> On Thu, Aug 22, 2019 at 07:41:25PM +0530, Ritesh Harjani wrote:
> > 
> > 
> > On 8/22/19 5:30 PM, Matthew Bobrowski wrote:
> > > On Wed, Aug 21, 2019 at 11:14:07PM +1000, Matthew Bobrowski wrote:
> > > > On Tue, Aug 13, 2019 at 05:57:22PM +0530, RITESH HARJANI wrote:
> > > > > But what I meant was this (I may be wrong here since I haven't
> > > > > really looked into it), but for my understanding I would like to
> > > > > discuss this -
> > > > > 
> > > > > So earlier with this flag(EXT4_STATE_DIO_UNWRITTEN) we were determining on
> > > > > whether a newextent can be merged with ex1 in function
> > > > > ext4_extents_can_be_merged. But now since we have removed that flag we have
> > > > > no way of knowing that whether this inode has any unwritten extents or not
> > > > > from any DIO path.
> > > > > 
> > > > > What I meant is isn't this removal of setting/unsetting of
> > > > > flag(EXT4_STATE_DIO_UNWRITTEN) changing the behavior of this func -
> > > > > ext4_extents_can_be_merged?
> > > > 
> > > > OK, I'm stuck and looking for either clarity, revalidation of my
> > > > thought process, or any input on how to solve this problem for that
> > > > matter.
> > > > 
> > > > In the current ext4 direct IO implementation, the dynamic state flag
> > > > EXT4_STATE_DIO_UNWRITTEN is set/unset for synchronous direct IO
> > > > writes. On the other hand, the flag EXT4_IO_END_UNWRITTEN is set/unset
> > > > for ext4_io_end->flag, and the value of i_unwritten is
> > > > incremented/decremented for asynchronous direct IO writes. All
> > > > mechanisms by which are used to track and determine whether the inode,
> > > > or an IO in flight against a particular inode have any pending
> > > > unwritten extents that need to be converted after the IO has
> > > > completed. In addition to this, we have ext4_can_extents_be_merged()
> > > > performing explicit checks against both EXT4_STATE_DIO_UNWRITTEN and
> > > > i_unwritten and using them to determine whether it can or cannot merge
> > > > a requested extent into an existing extent.
> > > > 
> > > > This is all fine for the current direct IO implementation. However,
> > > > while switching the direct IO code paths over to make use of the iomap
> > > > infrastructure, I believe that we can no longer simply track whether
> > > > an inode has unwritten extents needing to be converted by simply
> > > > setting and checking the EXT4_STATE_DIO_UNWRITTEN flag on the
> > > > inode. The reason being is that there can be multiple direct IO
> > > > operations to unwritten extents running against the inode and we don't
> > > > particularly distinguish synchronous from asynchronous writes within
> > > > ext4_iomap_begin() as there's really no need. So, the only way to
> > > > accurately determine whether extent conversion is deemed necessary for
> > > > an IO operation whether it'd be synchronous or asynchronous is by
> > > > checking the IOMAP_DIO_UNWRITTEN flag within the ->end_io()
> > > > callback. I'm certain that this portion of the logic is correct, but
> > > > we're still left with some issues when it comes to the checks that I
> > > > previously mentioned in ext4_can_extents_be_merged(), which is the
> > > > part I need some input on.
> > > > 
> > > > I was doing some thinking and I don't believe that making use of the
> > > > EXT4_STATE_DIO_UNWRITTEN flag is the solution at all here. This is not
> > > > only for reasons that I've briefly mentioned above, but also because
> > > > of the fact that it'll probably lead to a lot of inaccurate judgements
> > > > while taking particular code paths and some really ugly code that
> > > > creeps close to the definition of insanity. Rather, what if we solve
> > > > this problem by continuing to just use i_unwritten to keep track of
> > > > all the direct IOs to unwritten against running against an inode?
> > > > Within ext4_iomap_begin() post successful creation of unwritten
> > > > extents we'd call atomic_inc(&EXT4_I(inode)->i_unwritten) and
> > > > subsequently within the ->end_io() callback whether we take the
> > > > success or error path we'd call
> > > > atomic_dec(&EXT4_I(inode)->i_unwritten) accordingly? This way we can
> > > > still rely on this value to be used in the check within
> > > > ext4_can_extents_be_merged(). Open for alternate suggestions if anyone
> > > > has any...
> > > 
> > > Actually, no...
> > > 
> > > I've done some more thinking and what I suggested above around the use
> > > of i_unwritten will also not work properly. Using iomap
> > > infrastructure, there is the possibility of calling into the
> > > ->iomap_begin() more than once for a single direct IO operation. This
> > > means that by the time we even get to decrementing i_unwritten in the
> > > ->end_io() callback after converting the unwritten extents we're
> > > already running the possibility of i_unwritten becoming unbalanced
> > > really quickly and staying that way. This also means that the
> > > statement checking i_unwritten in ext4_can_extents_be_merged() will be
> > > affected and potentially result in it being evaluated incorrectly. I
> > > was thinking that we could just decrement i_unwritten in
> > > ->iomap_end(), but that seems to me like it would be racy and also
> > > lead to incorrect results. At this point I'm out of ideas on how to
> > > solve this, so any other ideas would be appreciated!
> > 
> > I will let others also comment, if someone has any other better approach.
> > 
> > 1. One approach is to add the infrastructure in iomap with
> > iomap_dio->private which is filesystem specific pointer. This can be
> > updated by filesystem in ->iomap_begin call into iomap->private.
> > And in case of iomap_dio_rw, this iomap->private will be copied to
> > iomap_dio->private if not already set.
> > 
> > But I think this will eventually become hacky in the sense when you will
> > have to determine on whether the dio->private is already set or not when
> > iomap_apply will be called second time. It will become a problem with AIO
> > DIO in ext4 since we use i_unwritten which tells us whether there is any
> > unwritten extent but it does not tell whether this unwritten extent is due
> > to a DIRECT AIO DIO in progress or a buffered one.
> > 
> > So we can ignore this approach - unless you or someone else have some good
> > design ideas to build on top of above.
> 
> I'm not sure whether _this_ is the solution or not, so let's maybe
> wait for others to comment. One thing that I and probably the iomap
> maintainers would like to avoid is adding any special case code to
> iomap infrastructure, if possible. I mean, from what you suggest it
> seems to be rather generic and not overly intrusive, although I know
> for a fact that iomap infrastructure exists because stuff like
> buffer_heads and the old direct IO code ended up accommodating so many
> different edge cases making it almost unmodifiable and unmaintainable.

I'm probably misunderstanding the ext4 extent cache horribly, but I keep
wondering why any of this is necessary -- why can't ext4 track the
unwritten status in the extent records directly?  And why is there all
this strange "can merge" logic?  If you need to convert blocks X to Y
to written state because a write to those blocks completed, isn't that
just manipulation of a bunch of incore records?  And can't you just seek
back and forth in the extent cache to look for adjacent records to merge
with? <confuseD>

(I'd really prefer not to go adding private fields all over the
place...)

--D

> > 2. Second approach which I was thinking is to track only those extents which
> > are marked unwritten and are under IO. This can be done in ext4_map_blocks.
> > This way we will not have to track a particular inode has any unwritten
> > extents or not, but it will be extent based.
> > Something similar was also done a while ago. Do you think this approach will
> > work in our case?
> > 
> > So with this extents will be scanned in extent status tree to see if any
> > among those are under IO and are unwritten in func
> > ext4_can_extents_be_merged.
> > 
> > https://patchwork.ozlabs.org/patch/1013837/
> 
> Maybe this would be a better approach and I think that it'd
> work. Based upon what I read within in that thread there weren't
> really any objections to the idea, although I can't see that it made
> it upstream, so I may be missing something?
> 
> --M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-24  3:55               ` Darrick J. Wong
@ 2019-08-24 23:04                 ` Christoph Hellwig
  2019-08-27  9:52                   ` Matthew Bobrowski
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2019-08-24 23:04 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Bobrowski, Ritesh Harjani, tytso, jack, adilger.kernel,
	linux-ext4, linux-fsdevel, hch, aneesh.kumar

On Fri, Aug 23, 2019 at 08:55:54PM -0700, Darrick J. Wong wrote:
> I'm probably misunderstanding the ext4 extent cache horribly, but I keep
> wondering why any of this is necessary -- why can't ext4 track the
> unwritten status in the extent records directly?  And why is there all
> this strange "can merge" logic?  If you need to convert blocks X to Y
> to written state because a write to those blocks completed, isn't that
> just manipulation of a bunch of incore records?  And can't you just seek
> back and forth in the extent cache to look for adjacent records to merge
> with? <confuseD>

Same here.  I'm not an ext4 expert, but here is what we do in XFS, which
hopefully works in some form for ext4 a well:

 - when starting a direct I/O we allocate any needed blocks and do so
   as unwritten extent.  The extent tree code will merge them in
   whatever way that seems suitable
 - if the IOMAP_DIO_UNWRITTEN is set on the iomap at ->end_io time we
   call a function that walks the whole range covered by the ioend,
   and convert any unwritten extent to a normal written extent.  Any
   splitting and merging will be done as needed by the low-level
   extent tree code
 - this also means we don't need the xfs_ioen structure (which ext4)
   copied from for direct I/O at all (we used to have it initially,
   though including the time when ext4 copied this code).
 - we don't need the equivalent to the ext4_unwritten_wait call in
   ext4_file_write_iter because we serialize any non-aligned I/O
   instead of trying to optimize for weird corner cases


> (I'd really prefer not to go adding private fields all over the
> place...)

Agreed.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-24 23:04                 ` Christoph Hellwig
@ 2019-08-27  9:52                   ` Matthew Bobrowski
  2019-08-28 12:05                     ` Matthew Bobrowski
  0 siblings, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-27  9:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Ritesh Harjani, tytso, jack, adilger.kernel,
	linux-ext4, linux-fsdevel, aneesh.kumar

On Sat, Aug 24, 2019 at 04:04:27PM -0700, Christoph Hellwig wrote:
> On Fri, Aug 23, 2019 at 08:55:54PM -0700, Darrick J. Wong wrote:
> > I'm probably misunderstanding the ext4 extent cache horribly, but I keep
> > wondering why any of this is necessary -- why can't ext4 track the
> > unwritten status in the extent records directly?  And why is there all
> > this strange "can merge" logic?  If you need to convert blocks X to Y
> > to written state because a write to those blocks completed, isn't that
> > just manipulation of a bunch of incore records?  And can't you just seek
> > back and forth in the extent cache to look for adjacent records to merge
> > with? <confuseD>
> 
> Same here.  I'm not an ext4 expert, but here is what we do in XFS, which
> hopefully works in some form for ext4 a well:
> 
>  - when starting a direct I/O we allocate any needed blocks and do so
>    as unwritten extent.  The extent tree code will merge them in
>    whatever way that seems suitable
>  - if the IOMAP_DIO_UNWRITTEN is set on the iomap at ->end_io time we
>    call a function that walks the whole range covered by the ioend,
>    and convert any unwritten extent to a normal written extent.  Any
>    splitting and merging will be done as needed by the low-level
>    extent tree code
>  - this also means we don't need the xfs_ioen structure (which ext4)
>    copied from for direct I/O at all (we used to have it initially,
>    though including the time when ext4 copied this code).
>  - we don't need the equivalent to the ext4_unwritten_wait call in
>    ext4_file_write_iter because we serialize any non-aligned I/O
>    instead of trying to optimize for weird corner cases

Yeah, so what you've detailed above is essentially the approach I've
taken in my patch series...

What is not clear to me at this point though is whether it is still
necessary to explicitly track unwritten extents via in-core inode
attributes i.e. ->i_unwritten and ->i_state_flags under the new direct
IO code path implementation, which makes use of the iomap
infrastructure. Or, whether we can get away with simply not using
these in-core inode attributes and rely just on checks against the
extent record directly, as breifly mentioned by Darrick. I would think
that this type of check would be enough, however the checks around
whether the inode is currently undergoing direct IO were implemented
at some point, so there must be a reason for having them
(a9b8241594add)?

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-27  9:52                   ` Matthew Bobrowski
@ 2019-08-28 12:05                     ` Matthew Bobrowski
  2019-08-28 14:27                       ` Theodore Y. Ts'o
  0 siblings, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-28 12:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Ritesh Harjani, tytso, jack, adilger.kernel,
	linux-ext4, linux-fsdevel, aneesh.kumar

On Tue, Aug 27, 2019 at 07:52:23PM +1000, Matthew Bobrowski wrote:
> On Sat, Aug 24, 2019 at 04:04:27PM -0700, Christoph Hellwig wrote:
> > On Fri, Aug 23, 2019 at 08:55:54PM -0700, Darrick J. Wong wrote:
> > > I'm probably misunderstanding the ext4 extent cache horribly, but I keep
> > > wondering why any of this is necessary -- why can't ext4 track the
> > > unwritten status in the extent records directly?  And why is there all
> > > this strange "can merge" logic?  If you need to convert blocks X to Y
> > > to written state because a write to those blocks completed, isn't that
> > > just manipulation of a bunch of incore records?  And can't you just seek
> > > back and forth in the extent cache to look for adjacent records to merge
> > > with? <confuseD>
> > 
> > Same here.  I'm not an ext4 expert, but here is what we do in XFS, which
> > hopefully works in some form for ext4 a well:
> > 
> >  - when starting a direct I/O we allocate any needed blocks and do so
> >    as unwritten extent.  The extent tree code will merge them in
> >    whatever way that seems suitable
> >  - if the IOMAP_DIO_UNWRITTEN is set on the iomap at ->end_io time we
> >    call a function that walks the whole range covered by the ioend,
> >    and convert any unwritten extent to a normal written extent.  Any
> >    splitting and merging will be done as needed by the low-level
> >    extent tree code
> >  - this also means we don't need the xfs_ioen structure (which ext4)
> >    copied from for direct I/O at all (we used to have it initially,
> >    though including the time when ext4 copied this code).
> >  - we don't need the equivalent to the ext4_unwritten_wait call in
> >    ext4_file_write_iter because we serialize any non-aligned I/O
> >    instead of trying to optimize for weird corner cases
> 
> Yeah, so what you've detailed above is essentially the approach I've
> taken in my patch series...
> 
> What is not clear to me at this point though is whether it is still
> necessary to explicitly track unwritten extents via in-core inode
> attributes i.e. ->i_unwritten and ->i_state_flags under the new direct
> IO code path implementation, which makes use of the iomap
> infrastructure. Or, whether we can get away with simply not using
> these in-core inode attributes and rely just on checks against the
> extent record directly, as breifly mentioned by Darrick. I would think
> that this type of check would be enough, however the checks around
> whether the inode is currently undergoing direct IO were implemented
> at some point, so there must be a reason for having them
> (a9b8241594add)?

Maybe it's a silly question, although I'm wanting to clarify my
understanding around why it is that when we either try prepend or
append to an existing extent, we don't permit merging of extents if
the inode is undergoing direct IO?

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-28 12:05                     ` Matthew Bobrowski
@ 2019-08-28 14:27                       ` Theodore Y. Ts'o
  2019-08-28 18:02                         ` Jan Kara
  0 siblings, 1 reply; 48+ messages in thread
From: Theodore Y. Ts'o @ 2019-08-28 14:27 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: Christoph Hellwig, Darrick J. Wong, Ritesh Harjani, jack,
	adilger.kernel, linux-ext4, linux-fsdevel, aneesh.kumar

On Wed, Aug 28, 2019 at 10:05:11PM +1000, Matthew Bobrowski wrote:
> > What is not clear to me at this point though is whether it is still
> > necessary to explicitly track unwritten extents via in-core inode
> > attributes i.e. ->i_unwritten and ->i_state_flags under the new direct
> > IO code path implementation, which makes use of the iomap
> > infrastructure. Or, whether we can get away with simply not using
> > these in-core inode attributes and rely just on checks against the
> > extent record directly, as breifly mentioned by Darrick. I would think
> > that this type of check would be enough, however the checks around
> > whether the inode is currently undergoing direct IO were implemented
> > at some point, so there must be a reason for having them
> > (a9b8241594add)?

The original reason why we created the DIO_STATE_UNWRITTEN flag was a
fast path, where the common case is writing blocks to an existing
location in a file where the blocks are already allocated, and marked
as written.  So consulting the on-disk extent tree to determine
whether unwritten extents need to be converted and/or split is
certainly doable.  However, it's expensive for the common case.  So
having a hint whether we need to schedule a workqueue to possibly
convert an unwritten region is helpful.  If we can just free the bio
and exit the I/O completion handler without having to take shared
locks to examine the on-disk extent tree, so much the better.

> Maybe it's a silly question, although I'm wanting to clarify my
> understanding around why it is that when we either try prepend or
> append to an existing extent, we don't permit merging of extents if

If I recall correctly, the reason for this check was mainly the
concern that we would end up merging an extent that we would then have
to split later on (when the direct I/O completed).

To be honest, i'm not 100% sure what would happen if we removed that
restriction; it might be that things would work just fine (just slower
in some workloads), or whether there is some hidden dependency that
would explode.  I suspect we'd have to try the experiment to be sure.

      		  	       	    - Ted
				    

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-28 14:27                       ` Theodore Y. Ts'o
@ 2019-08-28 18:02                         ` Jan Kara
  2019-08-29  6:36                           ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Kara @ 2019-08-28 18:02 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Matthew Bobrowski, Christoph Hellwig, Darrick J. Wong,
	Ritesh Harjani, jack, adilger.kernel, linux-ext4, linux-fsdevel,
	aneesh.kumar

On Wed 28-08-19 10:27:29, Theodore Y. Ts'o wrote:
> On Wed, Aug 28, 2019 at 10:05:11PM +1000, Matthew Bobrowski wrote:
> > > What is not clear to me at this point though is whether it is still
> > > necessary to explicitly track unwritten extents via in-core inode
> > > attributes i.e. ->i_unwritten and ->i_state_flags under the new direct
> > > IO code path implementation, which makes use of the iomap
> > > infrastructure. Or, whether we can get away with simply not using
> > > these in-core inode attributes and rely just on checks against the
> > > extent record directly, as breifly mentioned by Darrick. I would think
> > > that this type of check would be enough, however the checks around
> > > whether the inode is currently undergoing direct IO were implemented
> > > at some point, so there must be a reason for having them
> > > (a9b8241594add)?
> 
> The original reason why we created the DIO_STATE_UNWRITTEN flag was a
> fast path, where the common case is writing blocks to an existing
> location in a file where the blocks are already allocated, and marked
> as written.  So consulting the on-disk extent tree to determine
> whether unwritten extents need to be converted and/or split is
> certainly doable.  However, it's expensive for the common case.  So
> having a hint whether we need to schedule a workqueue to possibly
> convert an unwritten region is helpful.  If we can just free the bio
> and exit the I/O completion handler without having to take shared
> locks to examine the on-disk extent tree, so much the better.

Yes, but for determining whether extent conversion on IO completion is
needed we now use IOMAP_DIO_UNWRITTEN flag iomap infrastructure provides to
us. So we don't have to track this internally in ext4 anymore.

> > Maybe it's a silly question, although I'm wanting to clarify my
> > understanding around why it is that when we either try prepend or
> > append to an existing extent, we don't permit merging of extents if
> 
> If I recall correctly, the reason for this check was mainly the
> concern that we would end up merging an extent that we would then have
> to split later on (when the direct I/O completed).
> 
> To be honest, i'm not 100% sure what would happen if we removed that
> restriction; it might be that things would work just fine (just slower
> in some workloads), or whether there is some hidden dependency that
> would explode.  I suspect we'd have to try the experiment to be sure.

As far as I remember the concern was that extent split may need block
allocation and we may not have enough free blocks to do it. These days we
have some blocks reserved in the filesystem to accomodate unexpected extent
splits so this shouldn't happen anymore so the only real concern is the
wasted performance due to unnecessary extent merge & split. Kind of a
stress test for this would be to fire of lots of sequential AIO DIO
requests against a hole in a file.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] ext4: move inode extension/truncate code out from ext4_iomap_end()
  2019-08-12 12:52 ` [PATCH 2/5] ext4: move inode extension/truncate code out from ext4_iomap_end() Matthew Bobrowski
  2019-08-12 17:18   ` Christoph Hellwig
@ 2019-08-28 19:59   ` Jan Kara
  2019-08-28 21:54     ` Matthew Bobrowski
  1 sibling, 1 reply; 48+ messages in thread
From: Jan Kara @ 2019-08-28 19:59 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: linux-ext4, linux-fsdevel, jack, tytso, riteshh

On Mon 12-08-19 22:52:53, Matthew Bobrowski wrote:
> In preparation for implementing the direct IO write code path modifications
> that make us of iomap infrastructure we need to move out the inode
> extension/truncate code from ext4_iomap_end() callback. For direct IO, if the
> current code remained it would behave incorrectly. If we update the inode size
> prior to converting unwritten extents we run the risk of allowing a racing
> direct IO read operation to find unwritten extents before they are converted.
> 
> The inode extension/truncate has been moved out into a new function
> ext4_handle_inode_extension(). This will be used by both direct IO and DAX
> code paths if the write results with the inode being extended.
> 
> Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
> ---
>  fs/ext4/file.c  | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/ext4/inode.c | 48 +--------------------------------------------
>  2 files changed, 60 insertions(+), 48 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 360eff7b6aa2..7470800c63b7 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -33,6 +33,7 @@
>  #include "ext4_jbd2.h"
>  #include "xattr.h"
>  #include "acl.h"
> +#include "truncate.h"
>  
>  static bool ext4_dio_checks(struct inode *inode)
>  {
> @@ -234,12 +235,62 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	return iov_iter_count(from);
>  }
>  
> +static int ext4_handle_inode_extension(struct inode *inode, loff_t size,
> +				       size_t count)
> +{
> +	handle_t *handle;
> +	bool truncate = false;
> +	ext4_lblk_t written_blk, end_blk;
> +	int ret = 0, blkbits = inode->i_blkbits;
> +
> +	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> +	if (IS_ERR(handle)) {
> +		ret = PTR_ERR(handle);
> +		goto orphan_del;
> +	}
> +
> +	if (ext4_update_inode_size(inode, size))
> +		ext4_mark_inode_dirty(handle, inode);
> +
> +	/*
> +	 * We may need truncate allocated but not written blocks
> +	 * beyond EOF.
> +	 */
> +	written_blk = ALIGN(size, 1 << blkbits);
> +	end_blk = ALIGN(size + count, 1 << blkbits);

So this seems to imply that 'size' is really offset where IO started but
ext4_update_inode_size(inode, size) above suggests 'size' is really where
IO has ended and that's indeed what you pass into
ext4_handle_inode_extension(). So I'd just make the calling convention for
ext4_handle_inode_extension() less confusing and pass 'offset' and 'len'
and fixup the math inside the function...

Otherwise the patch looks OK to me.

								Honza


> +	if (written_blk < end_blk && ext4_can_truncate(inode))
> +		truncate = true;
> +
> +	/*
> +	 * Remove the inode from the orphan list if it has been
> +	 * extended and everything went OK.
> +	 */
> +	if (!truncate && inode->i_nlink)
> +		ext4_orphan_del(handle, inode);
> +	ext4_journal_stop(handle);
> +
> +	if (truncate) {
> +		ext4_truncate_failed_write(inode);
> +orphan_del:
> +		/*
> +		 * If the truncate operation failed early the inode
> +		 * may still be on the orphan list. In that case, we
> +		 * need try remove the inode from the linked list in
> +		 * memory.
> +		 */
> +		if (inode->i_nlink)
> +			ext4_orphan_del(NULL, inode);
> +	}
> +	return ret;
> +}
> +
>  #ifdef CONFIG_FS_DAX
>  static ssize_t
>  ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  {
> -	struct inode *inode = file_inode(iocb->ki_filp);
> +	int err;
>  	ssize_t ret;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  
>  	if (!inode_trylock(inode)) {
>  		if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -257,6 +308,13 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		goto out;
>  
>  	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> +
> +	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> +		err = ext4_handle_inode_extension(inode, iocb->ki_pos,
> +						  iov_iter_count(from));
> +		if (err)
> +			ret = err;
> +	}
>  out:
>  	inode_unlock(inode);
>  	if (ret > 0)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 420fe3deed39..761ce6286b05 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3601,53 +3601,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>  			  ssize_t written, unsigned flags, struct iomap *iomap)
>  {
> -	int ret = 0;
> -	handle_t *handle;
> -	int blkbits = inode->i_blkbits;
> -	bool truncate = false;
> -
> -	if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
> -		return 0;
> -
> -	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> -	if (IS_ERR(handle)) {
> -		ret = PTR_ERR(handle);
> -		goto orphan_del;
> -	}
> -	if (ext4_update_inode_size(inode, offset + written))
> -		ext4_mark_inode_dirty(handle, inode);
> -	/*
> -	 * We may need to truncate allocated but not written blocks beyond EOF.
> -	 */
> -	if (iomap->offset + iomap->length > 
> -	    ALIGN(inode->i_size, 1 << blkbits)) {
> -		ext4_lblk_t written_blk, end_blk;
> -
> -		written_blk = (offset + written) >> blkbits;
> -		end_blk = (offset + length) >> blkbits;
> -		if (written_blk < end_blk && ext4_can_truncate(inode))
> -			truncate = true;
> -	}
> -	/*
> -	 * Remove inode from orphan list if we were extending a inode and
> -	 * everything went fine.
> -	 */
> -	if (!truncate && inode->i_nlink &&
> -	    !list_empty(&EXT4_I(inode)->i_orphan))
> -		ext4_orphan_del(handle, inode);
> -	ext4_journal_stop(handle);
> -	if (truncate) {
> -		ext4_truncate_failed_write(inode);
> -orphan_del:
> -		/*
> -		 * If truncate failed early the inode might still be on the
> -		 * orphan list; we need to make sure the inode is removed from
> -		 * the orphan list in that case.
> -		 */
> -		if (inode->i_nlink)
> -			ext4_orphan_del(NULL, inode);
> -	}
> -	return ret;
> +	return 0;
>  }
>  
>  const struct iomap_ops ext4_iomap_ops = {
> -- 
> 2.16.4
> 
> 
> -- 
> Matthew Bobrowski
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-12 12:53 ` [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure Matthew Bobrowski
  2019-08-12 17:04   ` RITESH HARJANI
  2019-08-12 17:34   ` Christoph Hellwig
@ 2019-08-28 20:26   ` Jan Kara
  2019-08-28 22:32     ` Dave Chinner
  2019-08-29 11:45     ` Matthew Bobrowski
  2 siblings, 2 replies; 48+ messages in thread
From: Jan Kara @ 2019-08-28 20:26 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: linux-ext4, linux-fsdevel, jack, tytso, riteshh

On Mon 12-08-19 22:53:26, Matthew Bobrowski wrote:
> This patch introduces a new direct IO write code path implementation
> that makes use of the iomap infrastructure.
> 
> All direct IO write operations are now passed from the ->write_iter() callback
> to the new function ext4_dio_write_iter(). This function is responsible for
> calling into iomap infrastructure via iomap_dio_rw(). Snippets of the direct
> IO code from within ext4_file_write_iter(), such as checking whether the IO
> request is unaligned asynchronous IO, or whether it will ber overwriting
> allocated and initialized blocks has been moved out and into
> ext4_dio_write_iter().
> 
> The block mapping flags that are passed to ext4_map_blocks() from within
> ext4_dio_get_block() and friends have effectively been taken out and
> introduced within the ext4_iomap_begin(). If ext4_map_blocks() happens to have
> instantiated blocks beyond the i_size, then we attempt to place the inode onto
> the orphan list. Despite being able to perform i_size extension checking
> earlier on in the direct IO code path, it makes most sense to perform this bit
> post successful block allocation.
> 
> The ->end_io() callback ext4_dio_write_end_io() is responsible for removing
> the inode from the orphan list and determining if we should truncate a failed
> write in the case of an error. We also convert a range of unwritten extents to
> written if IOMAP_DIO_UNWRITTEN is set and perform the necessary
> i_size/i_disksize extension if the iocb->ki_pos + dio->size > i_size_read(inode).
> 
> In the instance of a short write, we fallback to buffered IO and complete
> whatever is left the 'iter'. Any blocks that may have been allocated in
> preparation for direct IO will be reused by buffered IO, so there's no issue
> with leaving allocated blocks beyond EOF.
> 
> Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
> ---
>  fs/ext4/file.c  | 227 ++++++++++++++++++++++++++++++++++++++++----------------
>  fs/ext4/inode.c |  42 +++++++++--
>  2 files changed, 199 insertions(+), 70 deletions(-)

Overall this is very nice. Some smaller comments below.

> @@ -235,6 +244,34 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	return iov_iter_count(from);
>  }
>  
> +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> +					struct iov_iter *from)
> +{
> +	ssize_t ret;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (!inode_trylock(inode)) {
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EOPNOTSUPP;
> +		inode_lock(inode);
> +	}

Currently there's no support for IOCB_NOWAIT for buffered IO so you can
replace this with "inode_lock(inode)".

> @@ -284,6 +321,128 @@ static int ext4_handle_inode_extension(struct inode *inode, loff_t size,
>  	return ret;
>  }
>  

I'd mention here that for cases where inode size is extended,
ext4_dio_write_iter() waits for DIO to complete and thus we are protected
by inode_lock in that case.

> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> +				 ssize_t error, unsigned int flags)
> +{
> +	int ret = 0;
> +	handle_t *handle;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (error) {
> +		if (offset + size > i_size_read(inode))
> +			ext4_truncate_failed_write(inode);
> +
> +		/*
> +		 * The inode may have been placed onto the orphan list
> +		 * as a result of an extension. However, an error may
> +		 * have been encountered prior to being able to
> +		 * complete the write operation. Perform any necessary
> +		 * clean up in this case.
> +		 */
> +		if (!list_empty(&EXT4_I(inode)->i_orphan)) {
> +			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> +			if (IS_ERR(handle)) {
> +				if (inode->i_nlink)
> +					ext4_orphan_del(NULL, inode);
> +				return PTR_ERR(handle);
> +			}
> +
> +			if (inode->i_nlink)
> +				ext4_orphan_del(handle, inode);
> +			ext4_journal_stop(handle);
> +		}
> +		return error;
> +	}
> +
> +	if (flags & IOMAP_DIO_UNWRITTEN) {
> +		ret = ext4_convert_unwritten_extents(NULL, inode, offset, size);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (offset + size > i_size_read(inode)) {
> +		ret = ext4_handle_inode_extension(inode, offset + size, 0);
> +		if (ret)
> +			return ret;
> +	}
> +	return ret;
> +}
> +
> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	ssize_t ret;
> +	loff_t offset = iocb->ki_pos;
> +	size_t count = iov_iter_count(from);
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	bool extend = false, overwrite = false, unaligned_aio = false;
> +
> +	if (!inode_trylock(inode)) {
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EAGAIN;
> +		inode_lock(inode);
> +	}
> +
> +	if (!ext4_dio_checks(inode)) {
> +		inode_unlock(inode);
> +		/*
> +		 * Fallback to buffered IO if the operation on the
> +		 * inode is not supported by direct IO.
> +		 */
> +		return ext4_buffered_write_iter(iocb, from);
> +	}
> +
> +	ret = ext4_write_checks(iocb, from);
> +	if (ret <= 0)
> +		goto out;
> +
> +	/*
> +	 * Unaligned direct AIO must be serialized among each other as
> +	 * the zeroing of partial blocks of two competing unaligned
> +	 * AIOs can result in data corruption.
> +	 */
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> +	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> +		unaligned_aio = true;
> +		inode_dio_wait(inode);
> +	}
> +
> +	/*
> +	 * Determine whether the IO operation will overwrite allocated
> +	 * and initialized blocks. If so, check to see whether it is
> +	 * possible to take the dioread_nolock path.
> +	 */
> +	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> +	    ext4_should_dioread_nolock(inode)) {
> +		overwrite = true;
> +		downgrade_write(&inode->i_rwsem);
> +	}
> +
> +	if (offset + count > i_size_read(inode) ||
> +	    offset + count > EXT4_I(inode)->i_disksize) {
> +		ext4_update_i_disksize(inode, inode->i_size);
> +		extend = true;
> +	}
> +
> +	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> +
> +	/*
> +	 * Unaligned direct AIO must be the only IO in flight or else
> +	 * any overlapping aligned IO after unaligned IO might result
> +	 * in data corruption.
> +	 */

Here I'd expand the comment to explain that we wait in case inode is
extended so that inode extension in ext4_dio_write_end_io() is properly
covered by inode_lock.

> +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> +		inode_dio_wait(inode);
> +
> +	if (ret >= 0 && iov_iter_count(from)) {
> +		overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> +		return ext4_buffered_write_iter(iocb, from);
> +	}
> +out:
> +	overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> +	return ret;
> +}
> +
>  #ifdef CONFIG_FS_DAX
>  static ssize_t
>  ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)

...

> @@ -3581,10 +3611,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  		iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
>  		iomap->addr = IOMAP_NULL_ADDR;
>  	} else {
> -		if (map.m_flags & EXT4_MAP_MAPPED) {
> -			iomap->type = IOMAP_MAPPED;
> -		} else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> +		if (map.m_flags & EXT4_MAP_UNWRITTEN) {
>  			iomap->type = IOMAP_UNWRITTEN;
> +		} else if (map.m_flags & EXT4_MAP_MAPPED) {
> +			iomap->type = IOMAP_MAPPED;
>  		} else {
>  			WARN_ON_ONCE(1);
>  			return -EIO;

Possibly this hunk should go into a separate patch (since this is not
directly related with iomap conversion) with a changelog / comment
explaining why we need to check EXT4_MAP_UNWRITTEN first.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] ext4: move inode extension/truncate code out from ext4_iomap_end()
  2019-08-28 19:59   ` Jan Kara
@ 2019-08-28 21:54     ` Matthew Bobrowski
  2019-08-29  8:18       ` Jan Kara
  0 siblings, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-28 21:54 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, linux-fsdevel, tytso, riteshh

On Wed, Aug 28, 2019 at 09:59:14PM +0200, Jan Kara wrote:
> On Mon 12-08-19 22:52:53, Matthew Bobrowski wrote:
> > +static int ext4_handle_inode_extension(struct inode *inode, loff_t size,
> > +				       size_t count)
> > +{
> > +	handle_t *handle;
> > +	bool truncate = false;
> > +	ext4_lblk_t written_blk, end_blk;
> > +	int ret = 0, blkbits = inode->i_blkbits;
> > +
> > +	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > +	if (IS_ERR(handle)) {
> > +		ret = PTR_ERR(handle);
> > +		goto orphan_del;
> > +	}
> > +
> > +	if (ext4_update_inode_size(inode, size))
> > +		ext4_mark_inode_dirty(handle, inode);
> > +
> > +	/*
> > +	 * We may need truncate allocated but not written blocks
> > +	 * beyond EOF.
> > +	 */
> > +	written_blk = ALIGN(size, 1 << blkbits);
> > +	end_blk = ALIGN(size + count, 1 << blkbits);
> 
> So this seems to imply that 'size' is really offset where IO started but
> ext4_update_inode_size(inode, size) above suggests 'size' is really where
> IO has ended and that's indeed what you pass into
> ext4_handle_inode_extension(). So I'd just make the calling convention for
> ext4_handle_inode_extension() less confusing and pass 'offset' and 'len'
> and fixup the math inside the function...

Agree, that will help with making things more transparent.

Also, one other thing while looking at this patch again. See comment
below.

> > @@ -257,6 +308,13 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >  		goto out;
> >  
> >  	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> > +
> > +	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> > +		err = ext4_handle_inode_extension(inode, iocb->ki_pos,
> > +						  iov_iter_count(from));
> > +		if (err)
> > +			ret = err;
> > +	}

I noticed that within ext4_dax_write_iter() we're not accommodating
for error cases. Subsequently, there's no clean up code that goes with
that. So, for example, in the case where we've added the inode onto
the orphan list as a result of an extension and we bump into any error
i.e. -ENOSPC, we'll be left with inconsistencies. Perhaps it might be
worthwhile to introduce a helper here
i.e. ext4_dax_handle_failed_write(), which would be the path taken to
perform any necessary clean up routines in the case of a failed write?
I think it'd be better to have this rather than polluting
ext4_dax_write_iter(). What do you think?

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-28 20:26   ` Jan Kara
@ 2019-08-28 22:32     ` Dave Chinner
  2019-08-29  8:03       ` Jan Kara
  2019-08-29 11:47       ` Matthew Bobrowski
  2019-08-29 11:45     ` Matthew Bobrowski
  1 sibling, 2 replies; 48+ messages in thread
From: Dave Chinner @ 2019-08-28 22:32 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-ext4, linux-fsdevel, tytso, riteshh

On Wed, Aug 28, 2019 at 10:26:19PM +0200, Jan Kara wrote:
> On Mon 12-08-19 22:53:26, Matthew Bobrowski wrote:
> > This patch introduces a new direct IO write code path implementation
> > that makes use of the iomap infrastructure.
> > 
> > All direct IO write operations are now passed from the ->write_iter() callback
> > to the new function ext4_dio_write_iter(). This function is responsible for
> > calling into iomap infrastructure via iomap_dio_rw(). Snippets of the direct
> > IO code from within ext4_file_write_iter(), such as checking whether the IO
> > request is unaligned asynchronous IO, or whether it will ber overwriting
> > allocated and initialized blocks has been moved out and into
> > ext4_dio_write_iter().
> > 
> > The block mapping flags that are passed to ext4_map_blocks() from within
> > ext4_dio_get_block() and friends have effectively been taken out and
> > introduced within the ext4_iomap_begin(). If ext4_map_blocks() happens to have
> > instantiated blocks beyond the i_size, then we attempt to place the inode onto
> > the orphan list. Despite being able to perform i_size extension checking
> > earlier on in the direct IO code path, it makes most sense to perform this bit
> > post successful block allocation.
> > 
> > The ->end_io() callback ext4_dio_write_end_io() is responsible for removing
> > the inode from the orphan list and determining if we should truncate a failed
> > write in the case of an error. We also convert a range of unwritten extents to
> > written if IOMAP_DIO_UNWRITTEN is set and perform the necessary
> > i_size/i_disksize extension if the iocb->ki_pos + dio->size > i_size_read(inode).
> > 
> > In the instance of a short write, we fallback to buffered IO and complete
> > whatever is left the 'iter'. Any blocks that may have been allocated in
> > preparation for direct IO will be reused by buffered IO, so there's no issue
> > with leaving allocated blocks beyond EOF.
> > 
> > Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
> > ---
> >  fs/ext4/file.c  | 227 ++++++++++++++++++++++++++++++++++++++++----------------
> >  fs/ext4/inode.c |  42 +++++++++--
> >  2 files changed, 199 insertions(+), 70 deletions(-)
> 
> Overall this is very nice. Some smaller comments below.
> 
> > @@ -235,6 +244,34 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  	return iov_iter_count(from);
> >  }
> >  
> > +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > +					struct iov_iter *from)
> > +{
> > +	ssize_t ret;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > +	if (!inode_trylock(inode)) {
> > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > +			return -EOPNOTSUPP;
> > +		inode_lock(inode);
> > +	}
> 
> Currently there's no support for IOCB_NOWAIT for buffered IO so you can
> replace this with "inode_lock(inode)".

IOCB_NOWAIT is supported for buffered reads. It is not supported on
buffered writes (as yet), so this should return EOPNOTSUPP if
IOCB_NOWAIT is set, regardless of whether the lock can be grabbed or
not.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-28 18:02                         ` Jan Kara
@ 2019-08-29  6:36                           ` Christoph Hellwig
  2019-08-29 11:20                             ` Matthew Bobrowski
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2019-08-29  6:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Y. Ts'o, Matthew Bobrowski, Christoph Hellwig,
	Darrick J. Wong, Ritesh Harjani, adilger.kernel, linux-ext4,
	linux-fsdevel, aneesh.kumar

On Wed, Aug 28, 2019 at 08:02:15PM +0200, Jan Kara wrote:
> > The original reason why we created the DIO_STATE_UNWRITTEN flag was a
> > fast path, where the common case is writing blocks to an existing
> > location in a file where the blocks are already allocated, and marked
> > as written.  So consulting the on-disk extent tree to determine
> > whether unwritten extents need to be converted and/or split is
> > certainly doable.  However, it's expensive for the common case.  So
> > having a hint whether we need to schedule a workqueue to possibly
> > convert an unwritten region is helpful.  If we can just free the bio
> > and exit the I/O completion handler without having to take shared
> > locks to examine the on-disk extent tree, so much the better.
> 
> Yes, but for determining whether extent conversion on IO completion is
> needed we now use IOMAP_DIO_UNWRITTEN flag iomap infrastructure provides to
> us. So we don't have to track this internally in ext4 anymore.

Exactly.  As mentioned before the ioend to track unwritten thing was
in XFS by the time ext4 copied the ioend approach. but we actually got
rid of that long before the iomap conversion.  Maybe to make everything
easier to understand and bisect you might want to get rid of the ioend
for direct I/O in ext4 as a prep path as well.

The relevant commit is: 273dda76f757 ("xfs: don't use ioends for direct
write completions")

> > To be honest, i'm not 100% sure what would happen if we removed that
> > restriction; it might be that things would work just fine (just slower
> > in some workloads), or whether there is some hidden dependency that
> > would explode.  I suspect we'd have to try the experiment to be sure.
> 
> As far as I remember the concern was that extent split may need block
> allocation and we may not have enough free blocks to do it. These days we
> have some blocks reserved in the filesystem to accomodate unexpected extent
> splits so this shouldn't happen anymore so the only real concern is the
> wasted performance due to unnecessary extent merge & split. Kind of a
> stress test for this would be to fire of lots of sequential AIO DIO
> requests against a hole in a file.

Well, you can always add a don't merge flag to the actual allocation.
You might still get a merge for pathological case (fallocate adjacent
to a dio write just submitted), but if the merging is such a performance
over head here is easy ways to avoid it for the common case.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-28 22:32     ` Dave Chinner
@ 2019-08-29  8:03       ` Jan Kara
  2019-08-29 11:47       ` Matthew Bobrowski
  1 sibling, 0 replies; 48+ messages in thread
From: Jan Kara @ 2019-08-29  8:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Matthew Bobrowski, linux-ext4, linux-fsdevel, tytso, riteshh

On Thu 29-08-19 08:32:18, Dave Chinner wrote:
> On Wed, Aug 28, 2019 at 10:26:19PM +0200, Jan Kara wrote:
> > On Mon 12-08-19 22:53:26, Matthew Bobrowski wrote:
> > > This patch introduces a new direct IO write code path implementation
> > > that makes use of the iomap infrastructure.
> > > 
> > > All direct IO write operations are now passed from the ->write_iter() callback
> > > to the new function ext4_dio_write_iter(). This function is responsible for
> > > calling into iomap infrastructure via iomap_dio_rw(). Snippets of the direct
> > > IO code from within ext4_file_write_iter(), such as checking whether the IO
> > > request is unaligned asynchronous IO, or whether it will ber overwriting
> > > allocated and initialized blocks has been moved out and into
> > > ext4_dio_write_iter().
> > > 
> > > The block mapping flags that are passed to ext4_map_blocks() from within
> > > ext4_dio_get_block() and friends have effectively been taken out and
> > > introduced within the ext4_iomap_begin(). If ext4_map_blocks() happens to have
> > > instantiated blocks beyond the i_size, then we attempt to place the inode onto
> > > the orphan list. Despite being able to perform i_size extension checking
> > > earlier on in the direct IO code path, it makes most sense to perform this bit
> > > post successful block allocation.
> > > 
> > > The ->end_io() callback ext4_dio_write_end_io() is responsible for removing
> > > the inode from the orphan list and determining if we should truncate a failed
> > > write in the case of an error. We also convert a range of unwritten extents to
> > > written if IOMAP_DIO_UNWRITTEN is set and perform the necessary
> > > i_size/i_disksize extension if the iocb->ki_pos + dio->size > i_size_read(inode).
> > > 
> > > In the instance of a short write, we fallback to buffered IO and complete
> > > whatever is left the 'iter'. Any blocks that may have been allocated in
> > > preparation for direct IO will be reused by buffered IO, so there's no issue
> > > with leaving allocated blocks beyond EOF.
> > > 
> > > Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
> > > ---
> > >  fs/ext4/file.c  | 227 ++++++++++++++++++++++++++++++++++++++++----------------
> > >  fs/ext4/inode.c |  42 +++++++++--
> > >  2 files changed, 199 insertions(+), 70 deletions(-)
> > 
> > Overall this is very nice. Some smaller comments below.
> > 
> > > @@ -235,6 +244,34 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > >  	return iov_iter_count(from);
> > >  }
> > >  
> > > +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > > +					struct iov_iter *from)
> > > +{
> > > +	ssize_t ret;
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +
> > > +	if (!inode_trylock(inode)) {
> > > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > > +			return -EOPNOTSUPP;
> > > +		inode_lock(inode);
> > > +	}
> > 
> > Currently there's no support for IOCB_NOWAIT for buffered IO so you can
> > replace this with "inode_lock(inode)".
> 
> IOCB_NOWAIT is supported for buffered reads. It is not supported on
> buffered writes (as yet), so this should return EOPNOTSUPP if
> IOCB_NOWAIT is set, regardless of whether the lock can be grabbed or
> not.

Yeah, right. Thanks for correcting me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] ext4: move inode extension/truncate code out from ext4_iomap_end()
  2019-08-28 21:54     ` Matthew Bobrowski
@ 2019-08-29  8:18       ` Jan Kara
  0 siblings, 0 replies; 48+ messages in thread
From: Jan Kara @ 2019-08-29  8:18 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Jan Kara, linux-ext4, linux-fsdevel, tytso, riteshh

On Thu 29-08-19 07:54:21, Matthew Bobrowski wrote:
> On Wed, Aug 28, 2019 at 09:59:14PM +0200, Jan Kara wrote:
> > > @@ -257,6 +308,13 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > >  		goto out;
> > >  
> > >  	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> > > +
> > > +	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> > > +		err = ext4_handle_inode_extension(inode, iocb->ki_pos,
> > > +						  iov_iter_count(from));
> > > +		if (err)
> > > +			ret = err;
> > > +	}
> 
> I noticed that within ext4_dax_write_iter() we're not accommodating
> for error cases. Subsequently, there's no clean up code that goes with
> that. So, for example, in the case where we've added the inode onto
> the orphan list as a result of an extension and we bump into any error
> i.e. -ENOSPC, we'll be left with inconsistencies. Perhaps it might be
> worthwhile to introduce a helper here
> i.e. ext4_dax_handle_failed_write(), which would be the path taken to
> perform any necessary clean up routines in the case of a failed write?
> I think it'd be better to have this rather than polluting
> ext4_dax_write_iter(). What do you think?

Esentially you'll need the same error-handling code as you have in
ext4_dio_write_end_io(). So probably just factor that out into a common
helper like ext4_handle_failed_inode_extension() and call it from
ext4_dio_write_end_io() and ext4_dax_write_iter().

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-29  6:36                           ` Christoph Hellwig
@ 2019-08-29 11:20                             ` Matthew Bobrowski
  2019-08-29 14:41                               ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-29 11:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Theodore Y. Ts'o, Darrick J. Wong, Ritesh Harjani,
	adilger.kernel, linux-ext4, linux-fsdevel, aneesh.kumar

Awesome, and thank you *all* for your very valueable input.

On Wed, Aug 28, 2019 at 11:36:08PM -0700, Christoph Hellwig wrote:
> On Wed, Aug 28, 2019 at 08:02:15PM +0200, Jan Kara wrote:
> > > The original reason why we created the DIO_STATE_UNWRITTEN flag was a
> > > fast path, where the common case is writing blocks to an existing
> > > location in a file where the blocks are already allocated, and marked
> > > as written.  So consulting the on-disk extent tree to determine
> > > whether unwritten extents need to be converted and/or split is
> > > certainly doable.  However, it's expensive for the common case.  So
> > > having a hint whether we need to schedule a workqueue to possibly
> > > convert an unwritten region is helpful.  If we can just free the bio
> > > and exit the I/O completion handler without having to take shared
> > > locks to examine the on-disk extent tree, so much the better.
> > 
> > Yes, but for determining whether extent conversion on IO completion is
> > needed we now use IOMAP_DIO_UNWRITTEN flag iomap infrastructure provides to
> > us. So we don't have to track this internally in ext4 anymore.
> 
> Exactly.  As mentioned before the ioend to track unwritten thing was
> in XFS by the time ext4 copied the ioend approach. but we actually got
> rid of that long before the iomap conversion.  Maybe to make everything
> easier to understand and bisect you might want to get rid of the ioend
> for direct I/O in ext4 as a prep path as well.
> 
> The relevant commit is: 273dda76f757 ("xfs: don't use ioends for direct
> write completions")

Uh ha! So, we conclude that there's no need to muck around with hairy
ioend's, or the need to denote whether there's unwritten extents held
against the inode using tricky state flag for that matter.

> > > To be honest, i'm not 100% sure what would happen if we removed that
> > > restriction; it might be that things would work just fine (just slower
> > > in some workloads), or whether there is some hidden dependency that
> > > would explode.  I suspect we'd have to try the experiment to be sure.
> > 
> > As far as I remember the concern was that extent split may need block
> > allocation and we may not have enough free blocks to do it. These days we
> > have some blocks reserved in the filesystem to accomodate unexpected extent
> > splits so this shouldn't happen anymore so the only real concern is the
> > wasted performance due to unnecessary extent merge & split. Kind of a
> > stress test for this would be to fire of lots of sequential AIO DIO
> > requests against a hole in a file.
> 
> Well, you can always add a don't merge flag to the actual allocation.
> You might still get a merge for pathological case (fallocate adjacent
> to a dio write just submitted), but if the merging is such a performance
> over head here is easy ways to avoid it for the common case.

After I've posted through the next version of this patch series, I
will attempt to perform some stress testing to see what the
performance hit could potentially be.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-28 20:26   ` Jan Kara
  2019-08-28 22:32     ` Dave Chinner
@ 2019-08-29 11:45     ` Matthew Bobrowski
  2019-08-29 12:38       ` Jan Kara
  1 sibling, 1 reply; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-29 11:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, linux-fsdevel, tytso, riteshh

On Wed, Aug 28, 2019 at 10:26:19PM +0200, Jan Kara wrote:
> On Mon 12-08-19 22:53:26, Matthew Bobrowski wrote:
> Overall this is very nice. Some smaller comments below.

Awesome, thanks for the review Jan!

> > @@ -235,6 +244,34 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  	return iov_iter_count(from);
> >  }
> >  
> > +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > +					struct iov_iter *from)
> > +{
> > +	ssize_t ret;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > +	if (!inode_trylock(inode)) {
> > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > +			return -EOPNOTSUPP;
> > +		inode_lock(inode);
> > +	}
> 
> Currently there's no support for IOCB_NOWAIT for buffered IO so you can
> replace this with "inode_lock(inode)".

Noted. I've also taken into consideration what Dave mentioned in the
other thread around explicitly checking for IOCB_NOWAIT and returning
EOPTNOTSUPP irrespective whether we can acquire the lock or not.

> > @@ -284,6 +321,128 @@ static int ext4_handle_inode_extension(struct inode *inode, loff_t size,
> >  	return ret;
> >  }
> >  
> 
> I'd mention here that for cases where inode size is extended,
> ext4_dio_write_iter() waits for DIO to complete and thus we are protected
> by inode_lock in that case.

Easy.

> > +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> > +				 ssize_t error, unsigned int flags)
> 
> Here I'd expand the comment to explain that we wait in case inode is
> extended so that inode extension in ext4_dio_write_end_io() is properly
> covered by inode_lock.
>

Easy.

> > +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> > +		inode_dio_wait(inode);
> > +
> > +	if (ret >= 0 && iov_iter_count(from)) {
> > +		overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> > +		return ext4_buffered_write_iter(iocb, from);
> > +	}
> > +out:
> > +	overwrite ? inode_unlock_shared(inode) : inode_unlock(inode);
> > +	return ret;
> > +}
> > +
> >  #ifdef CONFIG_FS_DAX
> >  static ssize_t
> >  ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> 
> ...
> 
> > @@ -3581,10 +3611,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> >  		iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> >  		iomap->addr = IOMAP_NULL_ADDR;
> >  	} else {
> > -		if (map.m_flags & EXT4_MAP_MAPPED) {
> > -			iomap->type = IOMAP_MAPPED;
> > -		} else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> > +		if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> >  			iomap->type = IOMAP_UNWRITTEN;
> > +		} else if (map.m_flags & EXT4_MAP_MAPPED) {
> > +			iomap->type = IOMAP_MAPPED;
> >  		} else {
> >  			WARN_ON_ONCE(1);
> >  			return -EIO;
> 
> Possibly this hunk should go into a separate patch (since this is not
> directly related with iomap conversion) with a changelog / comment
> explaining why we need to check EXT4_MAP_UNWRITTEN first.

But wouldn't doing so break bisection? Seeing as though we needed to
change this statement specifically to accommodate for the weirdness
being returned from ext4_map_blocks()? i.e. map.m_flags being set to
either of the following:

	- (EXT4_MAP_NEW | EXT4_MAP_MAPPED)
        or
        - (EXT4_MAP_NEW | EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)

So, if we left the statement in its original form, we'd allocate
unwritten extents but never actually get around to converting them in
ext4_dio_write_end_io() as IOMAP_DIO_UNWRITTEN would never be set?

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-28 22:32     ` Dave Chinner
  2019-08-29  8:03       ` Jan Kara
@ 2019-08-29 11:47       ` Matthew Bobrowski
  1 sibling, 0 replies; 48+ messages in thread
From: Matthew Bobrowski @ 2019-08-29 11:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-ext4, linux-fsdevel, tytso, riteshh

On Thu, Aug 29, 2019 at 08:32:18AM +1000, Dave Chinner wrote:
> On Wed, Aug 28, 2019 at 10:26:19PM +0200, Jan Kara wrote:
> > On Mon 12-08-19 22:53:26, Matthew Bobrowski wrote:
> > > @@ -235,6 +244,34 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > >  	return iov_iter_count(from);
> > >  }
> > >  
> > > +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > > +					struct iov_iter *from)
> > > +{
> > > +	ssize_t ret;
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +
> > > +	if (!inode_trylock(inode)) {
> > > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > > +			return -EOPNOTSUPP;
> > > +		inode_lock(inode);
> > > +	}
> > 
> > Currently there's no support for IOCB_NOWAIT for buffered IO so you can
> > replace this with "inode_lock(inode)".
> 
> IOCB_NOWAIT is supported for buffered reads. It is not supported on
> buffered writes (as yet), so this should return EOPNOTSUPP if
> IOCB_NOWAIT is set, regardless of whether the lock can be grabbed or
> not.

Noted! Thank you Dave. ;-)

--M

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure
  2019-08-29 11:45     ` Matthew Bobrowski
@ 2019-08-29 12:38       ` Jan Kara
  0 siblings, 0 replies; 48+ messages in thread
From: Jan Kara @ 2019-08-29 12:38 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Jan Kara, linux-ext4, linux-fsdevel, tytso, riteshh

On Thu 29-08-19 21:45:17, Matthew Bobrowski wrote:
> > > @@ -3581,10 +3611,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > >  		iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> > >  		iomap->addr = IOMAP_NULL_ADDR;
> > >  	} else {
> > > -		if (map.m_flags & EXT4_MAP_MAPPED) {
> > > -			iomap->type = IOMAP_MAPPED;
> > > -		} else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> > > +		if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> > >  			iomap->type = IOMAP_UNWRITTEN;
> > > +		} else if (map.m_flags & EXT4_MAP_MAPPED) {
> > > +			iomap->type = IOMAP_MAPPED;
> > >  		} else {
> > >  			WARN_ON_ONCE(1);
> > >  			return -EIO;
> > 
> > Possibly this hunk should go into a separate patch (since this is not
> > directly related with iomap conversion) with a changelog / comment
> > explaining why we need to check EXT4_MAP_UNWRITTEN first.
> 
> But wouldn't doing so break bisection? Seeing as though we needed to
> change this statement specifically to accommodate for the weirdness
> being returned from ext4_map_blocks()? i.e. map.m_flags being set to
> either of the following:
> 
> 	- (EXT4_MAP_NEW | EXT4_MAP_MAPPED)
>         or
>         - (EXT4_MAP_NEW | EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)
> 
> So, if we left the statement in its original form, we'd allocate
> unwritten extents but never actually get around to converting them in
> ext4_dio_write_end_io() as IOMAP_DIO_UNWRITTEN would never be set?

By splitting into a separate patch, I meant that patch would go before this
one. Original code in ext4_iomap_begin() never called ext4_map_blocks()
with a set of flags that can return with both EXT4_MAP_MAPPED and
EXT4_MAP_UNWRITTEN set so that patch would be a no-op but would fix that
landmine you tripped over.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/5] ext4: direct IO via iomap infrastructure
  2019-08-29 11:20                             ` Matthew Bobrowski
@ 2019-08-29 14:41                               ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2019-08-29 14:41 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: Christoph Hellwig, Jan Kara, Theodore Y. Ts'o,
	Darrick J. Wong, Ritesh Harjani, adilger.kernel, linux-ext4,
	linux-fsdevel, aneesh.kumar

On Thu, Aug 29, 2019 at 09:20:50PM +1000, Matthew Bobrowski wrote:
> Uh ha! So, we conclude that there's no need to muck around with hairy
> ioend's, or the need to denote whether there's unwritten extents held
> against the inode using tricky state flag for that matter.

So in XFS we never had the i_unwritten counter, that is something
purely ext4 specific, and I can't really help with that unfortunately,
but maybe the people who implemented it might be able to help on that
one.

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, back to index

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-12 12:52 [PATCH 0/5] ext4: direct IO via iomap infrastructure Matthew Bobrowski
2019-08-12 12:52 ` [PATCH 1/5] ext4: introduce direct IO read code path using " Matthew Bobrowski
2019-08-12 17:18   ` Christoph Hellwig
2019-08-12 20:17     ` Matthew Wilcox
2019-08-13 10:45       ` Matthew Bobrowski
2019-08-12 12:52 ` [PATCH 2/5] ext4: move inode extension/truncate code out from ext4_iomap_end() Matthew Bobrowski
2019-08-12 17:18   ` Christoph Hellwig
2019-08-13 10:46     ` Matthew Bobrowski
2019-08-28 19:59   ` Jan Kara
2019-08-28 21:54     ` Matthew Bobrowski
2019-08-29  8:18       ` Jan Kara
2019-08-12 12:53 ` [PATCH 3/5] iomap: modify ->end_io() calling convention Matthew Bobrowski
2019-08-12 17:18   ` Christoph Hellwig
2019-08-13 10:43     ` Matthew Bobrowski
2019-08-12 12:53 ` [PATCH 4/5] ext4: introduce direct IO write code path using iomap infrastructure Matthew Bobrowski
2019-08-12 17:04   ` RITESH HARJANI
2019-08-13 12:58     ` Matthew Bobrowski
2019-08-13 14:35       ` Darrick J. Wong
2019-08-14  9:51         ` Matthew Bobrowski
2019-08-12 17:34   ` Christoph Hellwig
2019-08-13 10:45     ` Matthew Bobrowski
2019-08-28 20:26   ` Jan Kara
2019-08-28 22:32     ` Dave Chinner
2019-08-29  8:03       ` Jan Kara
2019-08-29 11:47       ` Matthew Bobrowski
2019-08-29 11:45     ` Matthew Bobrowski
2019-08-29 12:38       ` Jan Kara
2019-08-12 12:53 ` [PATCH 5/5] ext4: clean up redundant buffer_head direct IO code Matthew Bobrowski
2019-08-12 17:31 ` [PATCH 0/5] ext4: direct IO via iomap infrastructure RITESH HARJANI
2019-08-13 11:10   ` Matthew Bobrowski
2019-08-13 12:27     ` RITESH HARJANI
2019-08-14  9:48       ` Matthew Bobrowski
2019-08-14 11:58         ` RITESH HARJANI
2019-08-21 13:14       ` Matthew Bobrowski
2019-08-22 12:00         ` Matthew Bobrowski
2019-08-22 14:11           ` Ritesh Harjani
2019-08-24  3:18             ` Matthew Bobrowski
2019-08-24  3:55               ` Darrick J. Wong
2019-08-24 23:04                 ` Christoph Hellwig
2019-08-27  9:52                   ` Matthew Bobrowski
2019-08-28 12:05                     ` Matthew Bobrowski
2019-08-28 14:27                       ` Theodore Y. Ts'o
2019-08-28 18:02                         ` Jan Kara
2019-08-29  6:36                           ` Christoph Hellwig
2019-08-29 11:20                             ` Matthew Bobrowski
2019-08-29 14:41                               ` Christoph Hellwig
2019-08-23 13:43           ` [RFC 1/1] ext4: PoC implementation of option-1 Ritesh Harjani
2019-08-23 13:49             ` Ritesh Harjani

Linux-ext4 Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-ext4/0 linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ https://lore.kernel.org/linux-ext4 \
		linux-ext4@vger.kernel.org linux-ext4@archiver.kernel.org
	public-inbox-index linux-ext4


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-ext4


AGPL code for this site: git clone https://public-inbox.org/ public-inbox