* [PATCH 0/6 v10] btrfs direct-io using iomap
@ 2020-06-29 19:23 Goldwyn Rodrigues
  2020-06-29 19:23 ` [PATCH 1/6] iomap: Convert wait_for_completion to flags Goldwyn Rodrigues
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2020-06-29 19:23 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-btrfs, fdmanana, dsterba, david, darrick.wong, hch

This is an effort to use iomap for direct I/O in btrfs, changing the
call from __blockdev_direct_IO() to iomap_dio_rw(). These are the
remnants of the series that was reverted because of page invalidation
errors, plus new iomap patches for a proper buffered fallback.

The main objective is to get rid of buffer heads and use the bios
built by the iomap code, and hopefully to share more of the generic
filesystem codebase.

These patches are based on, and tested against, a vanilla kernel. I
have run them through xfstests.

The tree is available at
https://github.com/goldwynr/linux/tree/btrfs-iomap-dio

Changes since v1
- Incorporated back the efficiency change for inode locking
- Addressed review comments about coding style and commit messages
- Merged related patches into one
- Made direct reads go through btrfs_direct_IO()
- Removed the no-longer-used function dio_end_io()

Changes since v2
- Aligned the iomap offset/length to the position/length of the I/O
- Removed btrfs_dio_data
- Removed BTRFS_INODE_READDIO_NEED_LOCK
- Re-incorporating the write efficiency changes caused a
  lockdep_assert() in iomap to trigger; removed that code.

Changes since v3
- Fixed a freeze on generic/095. Use iomap_end() to account for
  failed/incomplete DIO instead of btrfs_dio_data.

Changes since v4
- Moved lockdep_assert_held() to the functions calling iomap_dio_rw().
  This may be called immediately after taking the inode lock, and may
  seem unnecessary, but it is important.
- Removed comments which are no longer required
- Reworded commit messages to make them more appropriate

Changes since v5
- Restored inode_dio_wait() in truncate
- Removed lockdep_assert_held() near callers

Changes since v6
- Fixed hangs due to underlying device failures
- Removed the patch to wait while releasing pages

Changes since v7
- Moved reservations into btrfs iomap_begin()/end() for correct
  reservation cleanup in case of a device error.
- Merged the switch-to-iomap patch with the one adding
  btrfs_iomap_end() for easier bisection. The switch to iomap alone
  would fail in the case of DIO after buffered writes.

Changes since v8
- Added a flag to iomap_dio_rw() to return zero for buffered fallback

Changes since v9
- flag name changes for iomap_dio_rw()

--
Goldwyn





* [PATCH 1/6] iomap: Convert wait_for_completion to flags
  2020-06-29 19:23 [PATCH 0/6 v10] btrfs direct-io using iomap Goldwyn Rodrigues
@ 2020-06-29 19:23 ` Goldwyn Rodrigues
  2020-06-29 23:03   ` David Sterba
                     ` (2 more replies)
  2020-06-29 19:23 ` [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails Goldwyn Rodrigues
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2020-06-29 19:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, fdmanana, dsterba, david, darrick.wong, hch,
	Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Convert the wait_for_completion boolean to a flags argument so we can
pass more flags to iomap_dio_rw().
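
A minimal sketch of the caller-side conversion, distilled from the
hunks below (all names are taken from the patch itself):

	/* before: the last argument was a bool */
	ret = iomap_dio_rw(iocb, to, ops, dops, is_sync_kiocb(iocb));

	/* after: callers build a flags word instead */
	int flags = 0;

	if (is_sync_kiocb(iocb))
		flags |= IOMAP_DIO_RWF_SYNCIO;

	ret = iomap_dio_rw(iocb, to, ops, dops, flags);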

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/ext4/file.c        | 11 +++++++++--
 fs/gfs2/file.c        | 14 ++++++++++----
 fs/iomap/direct-io.c  |  3 ++-
 fs/xfs/xfs_file.c     | 15 +++++++++++----
 fs/zonefs/super.c     | 16 ++++++++++++----
 include/linux/iomap.h | 11 ++++++++++-
 6 files changed, 54 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2a01e31a032c..0a123d8f0ce2 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -53,6 +53,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t ret;
 	struct inode *inode = file_inode(iocb->ki_filp);
+	int flags = 0;
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		if (!inode_trylock_shared(inode))
@@ -74,8 +75,11 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return generic_file_read_iter(iocb, to);
 	}
 
+	if (is_sync_kiocb(iocb))
+		flags |= IOMAP_DIO_RWF_SYNCIO;
+
 	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
-			   is_sync_kiocb(iocb));
+			   flags);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -457,6 +461,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	const struct iomap_ops *iomap_ops = &ext4_iomap_ops;
 	bool extend = false, unaligned_io = false;
 	bool ilock_shared = true;
+	int flags = 0;
 
 	/*
 	 * We initially start with shared inode lock unless it is
@@ -540,10 +545,12 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		ext4_journal_stop(handle);
 	}
 
+	if (is_sync_kiocb(iocb) || unaligned_io || extend)
+		flags |= IOMAP_DIO_RWF_SYNCIO;
 	if (ilock_shared)
 		iomap_ops = &ext4_iomap_overwrite_ops;
 	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
-			   is_sync_kiocb(iocb) || unaligned_io || extend);
+			   flags);
 
 	if (extend)
 		ret = ext4_handle_inode_extension(inode, offset, ret, count);
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index fe305e4bfd37..8e6ba6e7e528 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -767,6 +767,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to)
 	size_t count = iov_iter_count(to);
 	struct gfs2_holder gh;
 	ssize_t ret;
+	int flags = 0;
 
 	if (!count)
 		return 0; /* skip atime */
@@ -776,8 +777,10 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to)
 	if (ret)
 		goto out_uninit;
 
-	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
-			   is_sync_kiocb(iocb));
+	if (is_sync_kiocb(iocb))
+		flags |= IOMAP_DIO_RWF_SYNCIO;
+
+	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, flags);
 
 	gfs2_glock_dq(&gh);
 out_uninit:
@@ -794,6 +797,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	loff_t offset = iocb->ki_pos;
 	struct gfs2_holder gh;
 	ssize_t ret;
+	int flags = 0;
 
 	/*
 	 * Deferred lock, even if its a write, since we do no allocation on
@@ -812,8 +816,10 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
-	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
-			   is_sync_kiocb(iocb));
+	if (is_sync_kiocb(iocb))
+		flags |= IOMAP_DIO_RWF_SYNCIO;
+
+	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, flags);
 
 out:
 	gfs2_glock_dq(&gh);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ec7b78e6feca..fd22bff61569 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -405,7 +405,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		bool wait_for_completion)
+		int dio_flags)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -415,6 +415,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	unsigned int flags = IOMAP_DIRECT;
 	struct blk_plug plug;
 	struct iomap_dio *dio;
+	bool wait_for_completion = dio_flags & IOMAP_DIO_RWF_SYNCIO;
 
 	if (!count)
 		return 0;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00db81eac80d..072da01faa12 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -169,6 +169,8 @@ xfs_file_dio_aio_read(
 	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
 	size_t			count = iov_iter_count(to);
 	ssize_t			ret;
+	int			flags = 0;
+
 
 	trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
 
@@ -183,8 +185,11 @@ xfs_file_dio_aio_read(
 	} else {
 		xfs_ilock(ip, XFS_IOLOCK_SHARED);
 	}
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL,
-			is_sync_kiocb(iocb));
+
+	if (is_sync_kiocb(iocb))
+		flags |= IOMAP_DIO_RWF_SYNCIO;
+
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, flags);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -483,6 +488,7 @@ xfs_file_dio_aio_write(
 	int			iolock;
 	size_t			count = iov_iter_count(from);
 	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
+	int			flags = 0;
 
 	/* DIO must be aligned to device logical sector size */
 	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
@@ -546,9 +552,10 @@ xfs_file_dio_aio_write(
 	 * If unaligned, this is the only IO in-flight. Wait on it before we
 	 * release the iolock to prevent subsequent overlapping IO.
 	 */
+	if (is_sync_kiocb(iocb) || unaligned_io)
+		flags |= IOMAP_DIO_RWF_SYNCIO;
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
-			   &xfs_dio_write_ops,
-			   is_sync_kiocb(iocb) || unaligned_io);
+			   &xfs_dio_write_ops, flags);
 out:
 	xfs_iunlock(ip, iolock);
 
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 07bc42d62673..798e2e636887 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -670,6 +670,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 	bool append = false;
 	size_t count;
 	ssize_t ret;
+	int flags = 0;
 
 	/*
 	 * For async direct IOs to sequential zone files, refuse IOCB_NOWAIT
@@ -711,11 +712,15 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 		append = sync;
 	}
 
-	if (append)
+	if (append) {
 		ret = zonefs_file_dio_append(iocb, from);
-	else
+	} else {
+		if (is_sync_kiocb(iocb))
+			flags |= IOMAP_DIO_RWF_SYNCIO;
+
 		ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
-				   &zonefs_write_dio_ops, sync);
+				&zonefs_write_dio_ops, flags);
+	}
 	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
 	    (ret > 0 || ret == -EIOCBQUEUED)) {
 		if (ret > 0)
@@ -814,6 +819,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct super_block *sb = inode->i_sb;
 	loff_t isize;
 	ssize_t ret;
+	int flags = 0;
 
 	/* Offline zones cannot be read */
 	if (unlikely(IS_IMMUTABLE(inode) && !(inode->i_mode & 0777)))
@@ -848,8 +854,10 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			goto inode_unlock;
 		}
 		file_accessed(iocb->ki_filp);
+		if (is_sync_kiocb(iocb))
+			flags |= IOMAP_DIO_RWF_SYNCIO;
 		ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
-				   &zonefs_read_dio_ops, is_sync_kiocb(iocb));
+				   &zonefs_read_dio_ops, flags);
 	} else {
 		ret = generic_file_read_iter(iocb, to);
 		if (ret == -EIO)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..8a4ba1635202 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -255,9 +255,18 @@ struct iomap_dio_ops {
 			struct bio *bio, loff_t file_offset);
 };
 
+/*
+ * Flags to pass to iomap_dio_rw()
+ */
+
+/*
+ * Wait for completion of DIO
+ */
+#define IOMAP_DIO_RWF_SYNCIO			(1 << 0)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		bool wait_for_completion);
+		int flags);
 int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
 #ifdef CONFIG_SWAP
-- 
2.26.2



* [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-06-29 19:23 [PATCH 0/6 v10] btrfs direct-io using iomap Goldwyn Rodrigues
  2020-06-29 19:23 ` [PATCH 1/6] iomap: Convert wait_for_completion to flags Goldwyn Rodrigues
@ 2020-06-29 19:23 ` Goldwyn Rodrigues
  2020-07-01  7:53   ` always fall back to buffered I/O after invalidation failures, was: " Christoph Hellwig
  2020-06-29 19:23 ` [PATCH 3/6] btrfs: switch to iomap_dio_rw() for dio Goldwyn Rodrigues
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Goldwyn Rodrigues @ 2020-06-29 19:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, fdmanana, dsterba, david, darrick.wong, hch,
	Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to
indicate that, if page invalidation fails, control should be returned
to the filesystem so it may fall back to buffered mode.
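
A minimal sketch of how a caller might consume the new flag; the
buffered write helper here is hypothetical and stands in for whatever
fallback path the filesystem already has:

	ret = iomap_dio_rw(iocb, from, ops, dops,
			   IOMAP_DIO_RWF_NO_STALE_PAGECACHE);
	/*
	 * Invalidation failed: iomap_dio_rw() returned 0 without issuing
	 * any I/O, so resubmit the remaining bytes through the page cache.
	 */
	if (ret == 0 && iov_iter_count(from))
		ret = fs_buffered_write(iocb, from); /* hypothetical */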

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap/direct-io.c  |  8 +++++++-
 include/linux/iomap.h | 14 ++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index fd22bff61569..2459c76e41ab 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -484,8 +484,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	 */
 	ret = invalidate_inode_pages2_range(mapping,
 			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
-	if (ret)
+	if (ret) {
+		if (dio_flags & IOMAP_DIO_RWF_NO_STALE_PAGECACHE) {
+			if (ret == -EBUSY)
+				ret = 0;
+			goto out_free_dio;
+		}
 		dio_warn_stale_pagecache(iocb->ki_filp);
+	}
 	ret = 0;
 
 	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8a4ba1635202..2ebb8a298cd8 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -262,7 +262,21 @@ struct iomap_dio_ops {
 /*
  * Wait for completion of DIO
  */
+
 #define IOMAP_DIO_RWF_SYNCIO			(1 << 0)
+/*
+ * Direct IO will attempt to keep the page cache coherent by
+ * invalidating the inode's page cache over the range of the DIO.
+ * That can fail if something else is actively using the page cache.
+ * If this happens and the DIO continues, the data in the page
+ * cache will become stale.
+ *
+ * Set this flag if you want the DIO to abort without issuing any IO
+ * or error if it fails to invalidate the page cache successfully.
+ * This allows the IO submitter to fall back to buffered IO to
+ * resubmit the IO.
+ */
+#define IOMAP_DIO_RWF_NO_STALE_PAGECACHE	(1 << 1)
 
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-- 
2.26.2



* [PATCH 3/6] btrfs: switch to iomap_dio_rw() for dio
  2020-06-29 19:23 [PATCH 0/6 v10] btrfs direct-io using iomap Goldwyn Rodrigues
  2020-06-29 19:23 ` [PATCH 1/6] iomap: Convert wait_for_completion to flags Goldwyn Rodrigues
  2020-06-29 19:23 ` [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails Goldwyn Rodrigues
@ 2020-06-29 19:23 ` Goldwyn Rodrigues
  2020-06-29 19:23 ` [PATCH 4/6] fs: remove dio_end_io() Goldwyn Rodrigues
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2020-06-29 19:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, fdmanana, dsterba, david, darrick.wong, hch,
	Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Switch from __blockdev_direct_IO() to iomap_dio_rw().
Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
as the iomap_begin() hook for the iomap direct I/O functions. This
function allocates and locks all the blocks required for the I/O.
btrfs_submit_direct() is used as the submit_io() hook for the direct
I/O ops.

Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to btrfs_file_read_iter(), which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and for buffered
reads.

We don't need address_space.direct_IO() anymore, so we set it to
noop_direct_IO(). Similarly, we no longer need the flags used with
__blockdev_direct_IO(). iomap is capable of direct I/O reads from a
hole, so we don't need to return -ENOENT.

Btrfs direct I/O is now done under i_rwsem, taken shared for reads and
exclusive for writes. This guards against concurrent truncates.
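
As a sketch, the read side now looks like this (taken from
btrfs_file_read_iter() in the diff below; the write side already holds
i_rwsem exclusively in btrfs_file_write_iter()):

	inode_lock_shared(inode);
	ret = btrfs_direct_IO(iocb, to);
	inode_unlock_shared(inode);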

Use iomap->iomap_end() to check for failed or incomplete direct I/O:
 - for writes, call __endio_write_update_ordered()
 - for reads, unlock extents

btrfs_dio_data is now hooked into iomap->private instead of
current->journal_info. It carries the reservation variable and the
amount of data submitted, so we can calculate the amount of data on
which to call __endio_write_update_ordered() in case of an error.
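
For reference, the per-DIO state carried in iomap->private, copied from
the hunk below:

	struct btrfs_dio_data {
		u64 reserve;
		loff_t length;
		ssize_t submitted;
		struct extent_changeset *data_reserved;
	};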

This patch removes the last use of struct buffer_head from btrfs.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/Kconfig |   1 +
 fs/btrfs/ctree.h |   1 +
 fs/btrfs/file.c  |  21 +++-
 fs/btrfs/inode.c | 317 ++++++++++++++++++++++-------------------------
 4 files changed, 170 insertions(+), 170 deletions(-)

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 575636f6491e..68b95ad82126 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -14,6 +14,7 @@ config BTRFS_FS
 	select LZO_DECOMPRESS
 	select ZSTD_COMPRESS
 	select ZSTD_DECOMPRESS
+	select FS_IOMAP
 	select RAID6_PQ
 	select XOR_BLOCKS
 	select SRCU
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d404cce8ae40..677f170434e3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2935,6 +2935,7 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
 extern const struct dentry_operations btrfs_dentry_operations;
+ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
 
 /* ioctl.c */
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2520605afc25..9d486350f1bf 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1835,7 +1835,7 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	loff_t endbyte;
 	int err;
 
-	written = generic_file_direct_write(iocb, from);
+	written = btrfs_direct_IO(iocb, from);
 
 	if (written < 0 || !iov_iter_count(from))
 		return written;
@@ -3504,9 +3504,26 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	ssize_t ret = 0;
+
+	if (iocb->ki_flags & IOCB_DIRECT) {
+		struct inode *inode = file_inode(iocb->ki_filp);
+
+		inode_lock_shared(inode);
+		ret = btrfs_direct_IO(iocb, to);
+		inode_unlock_shared(inode);
+		if (ret < 0)
+			return ret;
+	}
+
+	return generic_file_buffered_read(iocb, to, ret);
+}
+
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
-	.read_iter      = generic_file_read_iter,
+	.read_iter      = btrfs_file_read_iter,
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 18d384f4af54..0fa75af35a1f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5,7 +5,6 @@
 
 #include <linux/kernel.h>
 #include <linux/bio.h>
-#include <linux/buffer_head.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
@@ -30,6 +29,7 @@
 #include <linux/swap.h>
 #include <linux/migrate.h>
 #include <linux/sched/mm.h>
+#include <linux/iomap.h>
 #include <asm/unaligned.h>
 #include "misc.h"
 #include "ctree.h"
@@ -58,9 +58,9 @@ struct btrfs_iget_args {
 
 struct btrfs_dio_data {
 	u64 reserve;
-	u64 unsubmitted_oe_range_start;
-	u64 unsubmitted_oe_range_end;
-	int overwrite;
+	loff_t length;
+	ssize_t submitted;
+	struct extent_changeset *data_reserved;
 };
 
 static const struct inode_operations btrfs_dir_inode_operations;
@@ -7069,7 +7069,7 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 }
 
 static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
-			      struct extent_state **cached_state, int writing)
+			      struct extent_state **cached_state, bool writing)
 {
 	struct btrfs_ordered_extent *ordered;
 	int ret = 0;
@@ -7207,30 +7207,7 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 }
 
 
-static int btrfs_get_blocks_direct_read(struct extent_map *em,
-					struct buffer_head *bh_result,
-					struct inode *inode,
-					u64 start, u64 len)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-
-	if (em->block_start == EXTENT_MAP_HOLE ||
-			test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
-		return -ENOENT;
-
-	len = min(len, em->len - (start - em->start));
-
-	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
-		inode->i_blkbits;
-	bh_result->b_size = len;
-	bh_result->b_bdev = fs_info->fs_devices->latest_bdev;
-	set_buffer_mapped(bh_result);
-
-	return 0;
-}
-
 static int btrfs_get_blocks_direct_write(struct extent_map **map,
-					 struct buffer_head *bh_result,
 					 struct inode *inode,
 					 struct btrfs_dio_data *dio_data,
 					 u64 start, u64 len)
@@ -7292,7 +7269,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	}
 
 	/* this will cow the extent */
-	len = bh_result->b_size;
 	free_extent_map(em);
 	*map = em = btrfs_new_extent_direct(inode, start, len);
 	if (IS_ERR(em)) {
@@ -7303,64 +7279,73 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	len = min(len, em->len - (start - em->start));
 
 skip_cow:
-	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
-		inode->i_blkbits;
-	bh_result->b_size = len;
-	bh_result->b_bdev = fs_info->fs_devices->latest_bdev;
-	set_buffer_mapped(bh_result);
-
-	if (!test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
-		set_buffer_new(bh_result);
-
 	/*
 	 * Need to update the i_size under the extent lock so buffered
 	 * readers will get the updated i_size when we unlock.
 	 */
-	if (!dio_data->overwrite && start + len > i_size_read(inode))
+	if (start + len > i_size_read(inode))
 		i_size_write(inode, start + len);
 
-	WARN_ON(dio_data->reserve < len);
 	dio_data->reserve -= len;
-	dio_data->unsubmitted_oe_range_end = start + len;
-	current->journal_info = dio_data;
 out:
 	return ret;
 }
 
-static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
-				   struct buffer_head *bh_result, int create)
+static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
+		loff_t length, unsigned int flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em;
 	struct extent_state *cached_state = NULL;
 	struct btrfs_dio_data *dio_data = NULL;
-	u64 start = iblock << inode->i_blkbits;
 	u64 lockstart, lockend;
-	u64 len = bh_result->b_size;
+	const bool write = !!(flags & IOMAP_WRITE);
 	int ret = 0;
+	u64 len = length;
+	bool unlock_extents = false;
 
-	if (!create)
+	if (!write)
 		len = min_t(u64, len, fs_info->sectorsize);
 
 	lockstart = start;
 	lockend = start + len - 1;
 
-	if (current->journal_info) {
-		/*
-		 * Need to pull our outstanding extents and set journal_info to NULL so
-		 * that anything that needs to check if there's a transaction doesn't get
-		 * confused.
-		 */
-		dio_data = current->journal_info;
-		current->journal_info = NULL;
+	/*
+	 * The generic stuff only does filemap_write_and_wait_range, which
+	 * isn't enough if we've written compressed pages to this area, so we
+	 * need to flush the dirty pages again to make absolutely sure that any
+	 * outstanding dirty pages are on disk.
+	 */
+	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+		     &BTRFS_I(inode)->runtime_flags))
+		ret = filemap_fdatawrite_range(inode->i_mapping, start,
+					       start + length - 1);
+
+	dio_data = kzalloc(sizeof(*dio_data), GFP_NOFS);
+	if (!dio_data)
+		return -ENOMEM;
+
+	dio_data->length = length;
+	if (write) {
+		dio_data->reserve = round_up(length, fs_info->sectorsize);
+		ret = btrfs_delalloc_reserve_space(inode,
+				&dio_data->data_reserved,
+				start, dio_data->reserve);
+		if (ret) {
+			extent_changeset_free(dio_data->data_reserved);
+			kfree(dio_data);
+			return ret;
+		}
 	}
+	iomap->private = dio_data;
+
 
 	/*
 	 * If this errors out it's because we couldn't invalidate pagecache for
 	 * this range and we need to fallback to buffered.
 	 */
-	if (lock_extent_direct(inode, lockstart, lockend, &cached_state,
-			       create)) {
+	if (lock_extent_direct(inode, lockstart, lockend, &cached_state, write)) {
 		ret = -ENOTBLK;
 		goto err;
 	}
@@ -7392,36 +7377,48 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 		goto unlock_err;
 	}
 
-	if (create) {
-		ret = btrfs_get_blocks_direct_write(&em, bh_result, inode,
-						    dio_data, start, len);
+	len = min(len, em->len - (start - em->start));
+	if (write) {
+		ret = btrfs_get_blocks_direct_write(&em, inode, dio_data,
+						    start, len);
 		if (ret < 0)
 			goto unlock_err;
-
-		unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
-				     lockend, &cached_state);
+		unlock_extents = true;
+		/* Recalc len in case the new em is smaller than requested */
+		len = min(len, em->len - (start - em->start));
 	} else {
-		ret = btrfs_get_blocks_direct_read(em, bh_result, inode,
-						   start, len);
-		/* Can be negative only if we read from a hole */
-		if (ret < 0) {
-			ret = 0;
-			free_extent_map(em);
-			goto unlock_err;
-		}
 		/*
 		 * We need to unlock only the end area that we aren't using.
 		 * The rest is going to be unlocked by the endio routine.
 		 */
-		lockstart = start + bh_result->b_size;
-		if (lockstart < lockend) {
-			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
-					     lockstart, lockend, &cached_state);
-		} else {
-			free_extent_state(cached_state);
-		}
+		lockstart = start + len;
+		if (lockstart < lockend)
+			unlock_extents = true;
 	}
 
+	if (unlock_extents)
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree,
+				     lockstart, lockend, &cached_state);
+	else
+		free_extent_state(cached_state);
+
+	/*
+	 * Translate extent map information to iomap.
+	 * We trim the extents (and move the addr) even though iomap code does
+	 * that, since we have locked only the parts we are performing I/O in.
+	 */
+	if ((em->block_start == EXTENT_MAP_HOLE) ||
+	    (test_bit(EXTENT_FLAG_PREALLOC, &em->flags) && !write)) {
+		iomap->addr = IOMAP_NULL_ADDR;
+		iomap->type = IOMAP_HOLE;
+	} else {
+		iomap->addr = em->block_start + (start - em->start);
+		iomap->type = IOMAP_MAPPED;
+	}
+	iomap->offset = start;
+	iomap->bdev = fs_info->fs_devices->latest_bdev;
+	iomap->length = len;
+
 	free_extent_map(em);
 
 	return 0;
@@ -7430,8 +7427,53 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
 			     &cached_state);
 err:
-	if (dio_data)
-		current->journal_info = dio_data;
+	if (dio_data) {
+		btrfs_delalloc_release_space(inode, dio_data->data_reserved,
+				start, dio_data->reserve, true);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), dio_data->reserve);
+		extent_changeset_free(dio_data->data_reserved);
+		kfree(dio_data);
+	}
+	return ret;
+}
+
+static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
+		ssize_t written, unsigned int flags, struct iomap *iomap)
+{
+	int ret = 0;
+	struct btrfs_dio_data *dio_data = iomap->private;
+	size_t submitted = dio_data->submitted;
+	const bool write = !!(flags & IOMAP_WRITE);
+
+	if (!write && (iomap->type == IOMAP_HOLE)) {
+		/* If reading from a hole, unlock and return */
+		unlock_extent(&BTRFS_I(inode)->io_tree, pos, pos + length - 1);
+		goto out;
+	}
+
+	if (submitted < length) {
+		pos += submitted;
+		length -= submitted;
+		if (write)
+			__endio_write_update_ordered(inode, pos, length, false);
+		else
+			unlock_extent(&BTRFS_I(inode)->io_tree, pos,
+				      pos + length - 1);
+		ret = -ENOTBLK;
+	}
+
+	if (write) {
+		if (dio_data->reserve)
+			btrfs_delalloc_release_space(inode,
+					dio_data->data_reserved, pos,
+					dio_data->reserve, true);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), dio_data->length);
+		extent_changeset_free(dio_data->data_reserved);
+	}
+out:
+	kfree(dio_data);
+	iomap->private = NULL;
+
 	return ret;
 }
 
@@ -7454,7 +7496,7 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 			      dip->logical_offset + dip->bytes - 1);
 	}
 
-	dio_end_io(dip->dio_bio);
+	bio_endio(dip->dio_bio);
 	kfree(dip);
 }
 
@@ -7690,24 +7732,11 @@ static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio,
 	dip->disk_bytenr = (u64)dio_bio->bi_iter.bi_sector << 9;
 	dip->dio_bio = dio_bio;
 	refcount_set(&dip->refs, 1);
-
-	if (write) {
-		struct btrfs_dio_data *dio_data = current->journal_info;
-
-		/*
-		 * Setting range start and end to the same value means that
-		 * no cleanup will happen in btrfs_direct_IO
-		 */
-		dio_data->unsubmitted_oe_range_end = dip->logical_offset +
-			dip->bytes;
-		dio_data->unsubmitted_oe_range_start =
-			dio_data->unsubmitted_oe_range_end;
-	}
 	return dip;
 }
 
-static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
-				loff_t file_offset)
+static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
+		struct bio *dio_bio, loff_t file_offset)
 {
 	const bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
@@ -7724,6 +7753,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	int ret;
 	blk_status_t status;
 	struct btrfs_io_geometry geom;
+	struct btrfs_dio_data *dio_data = iomap->private;
 
 	dip = btrfs_create_dio_private(dio_bio, inode, file_offset);
 	if (!dip) {
@@ -7732,8 +7762,8 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 				file_offset + dio_bio->bi_iter.bi_size - 1);
 		}
 		dio_bio->bi_status = BLK_STS_RESOURCE;
-		dio_end_io(dio_bio);
-		return;
+		bio_endio(dio_bio);
+		return BLK_QC_T_NONE;
 	}
 
 	if (!write && csum) {
@@ -7804,15 +7834,17 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 			goto out_err;
 		}
 
+		dio_data->submitted += clone_len;
 		clone_offset += clone_len;
 		start_sector += clone_len >> 9;
 		file_offset += clone_len;
 	} while (submit_len > 0);
-	return;
+	return BLK_QC_T_NONE;
 
 out_err:
 	dip->dio_bio->bi_status = status;
 	btrfs_dio_private_put(dip);
+	return BLK_QC_T_NONE;
 }
 
 static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
@@ -7848,37 +7880,31 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 	return retval;
 }
 
-static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+static const struct iomap_ops btrfs_dio_iomap_ops = {
+	.iomap_begin            = btrfs_dio_iomap_begin,
+	.iomap_end              = btrfs_dio_iomap_end,
+};
+
+static const struct iomap_dio_ops btrfs_dops = {
+	.submit_io		= btrfs_submit_direct,
+};
+
+ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct btrfs_dio_data dio_data = { 0 };
 	struct extent_changeset *data_reserved = NULL;
 	loff_t offset = iocb->ki_pos;
 	size_t count = 0;
-	int flags = 0;
-	bool wakeup = true;
 	bool relock = false;
 	ssize_t ret;
+	int flags = IOMAP_DIO_RWF_NO_STALE_PAGECACHE;
 
 	if (check_direct_IO(fs_info, iter, offset))
 		return 0;
 
-	inode_dio_begin(inode);
-
-	/*
-	 * The generic stuff only does filemap_write_and_wait_range, which
-	 * isn't enough if we've written compressed pages to this area, so
-	 * we need to flush the dirty pages again to make absolutely sure
-	 * that any outstanding dirty pages are on disk.
-	 */
 	count = iov_iter_count(iter);
-	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-		     &BTRFS_I(inode)->runtime_flags))
-		filemap_fdatawrite_range(inode->i_mapping, offset,
-					 offset + count - 1);
-
 	if (iov_iter_rw(iter) == WRITE) {
 		/*
 		 * If the write DIO is beyond the EOF, we need update
@@ -7886,68 +7912,23 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		 * not unlock the i_mutex at this case.
 		 */
 		if (offset + count <= inode->i_size) {
-			dio_data.overwrite = 1;
 			inode_unlock(inode);
 			relock = true;
 		}
-		ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
-						   offset, count);
-		if (ret)
-			goto out;
-
-		/*
-		 * We need to know how many extents we reserved so that we can
-		 * do the accounting properly if we go over the number we
-		 * originally calculated.  Abuse current->journal_info for this.
-		 */
-		dio_data.reserve = round_up(count,
-					    fs_info->sectorsize);
-		dio_data.unsubmitted_oe_range_start = (u64)offset;
-		dio_data.unsubmitted_oe_range_end = (u64)offset;
-		current->journal_info = &dio_data;
 		down_read(&BTRFS_I(inode)->dio_sem);
-	} else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
-				     &BTRFS_I(inode)->runtime_flags)) {
-		inode_dio_end(inode);
-		flags = DIO_LOCKING | DIO_SKIP_HOLES;
-		wakeup = false;
 	}
 
-	ret = __blockdev_direct_IO(iocb, inode,
-				   fs_info->fs_devices->latest_bdev,
-				   iter, btrfs_get_blocks_direct, NULL,
-				   btrfs_submit_direct, flags);
+	if (is_sync_kiocb(iocb))
+		flags |= IOMAP_DIO_RWF_SYNCIO;
+
+	ret = iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dops,
+			flags);
+
 	if (iov_iter_rw(iter) == WRITE) {
 		up_read(&BTRFS_I(inode)->dio_sem);
-		current->journal_info = NULL;
-		if (ret < 0 && ret != -EIOCBQUEUED) {
-			if (dio_data.reserve)
-				btrfs_delalloc_release_space(inode, data_reserved,
-					offset, dio_data.reserve, true);
-			/*
-			 * On error we might have left some ordered extents
-			 * without submitting corresponding bios for them, so
-			 * cleanup them up to avoid other tasks getting them
-			 * and waiting for them to complete forever.
-			 */
-			if (dio_data.unsubmitted_oe_range_start <
-			    dio_data.unsubmitted_oe_range_end)
-				__endio_write_update_ordered(inode,
-					dio_data.unsubmitted_oe_range_start,
-					dio_data.unsubmitted_oe_range_end -
-					dio_data.unsubmitted_oe_range_start,
-					false);
-		} else if (ret >= 0 && (size_t)ret < count)
-			btrfs_delalloc_release_space(inode, data_reserved,
-					offset, count - (size_t)ret, true);
-		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
 	}
-out:
-	if (wakeup)
-		inode_dio_end(inode);
 	if (relock)
 		inode_lock(inode);
-
 	extent_changeset_free(data_reserved);
 	return ret;
 }
@@ -10246,7 +10227,7 @@ static const struct address_space_operations btrfs_aops = {
 	.writepage	= btrfs_writepage,
 	.writepages	= btrfs_writepages,
 	.readahead	= btrfs_readahead,
-	.direct_IO	= btrfs_direct_IO,
+	.direct_IO	= noop_direct_IO,
 	.invalidatepage = btrfs_invalidatepage,
 	.releasepage	= btrfs_releasepage,
 #ifdef CONFIG_MIGRATION
-- 
2.26.2



* [PATCH 4/6] fs: remove dio_end_io()
  2020-06-29 19:23 [PATCH 0/6 v10] btrfs direct-io using iomap Goldwyn Rodrigues
                   ` (2 preceding siblings ...)
  2020-06-29 19:23 ` [PATCH 3/6] btrfs: switch to iomap_dio_rw() for dio Goldwyn Rodrigues
@ 2020-06-29 19:23 ` Goldwyn Rodrigues
  2020-06-29 19:23 ` [PATCH 5/6] btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK Goldwyn Rodrigues
  2020-06-29 19:23 ` [PATCH 6/6] btrfs: split btrfs_direct_IO to read and write part Goldwyn Rodrigues
  5 siblings, 0 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2020-06-29 19:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, fdmanana, dsterba, david, darrick.wong, hch,
	Goldwyn Rodrigues, Nikolay Borisov, Johannes Thumshirn

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Since we removed the last user of dio_end_io(), remove the helper
function itself.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/direct-io.c     | 19 -------------------
 include/linux/fs.h |  2 --
 2 files changed, 21 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 6d5370eac2a8..1543b5af400e 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,25 +386,6 @@ static void dio_bio_end_io(struct bio *bio)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 }
 
-/**
- * dio_end_io - handle the end io action for the given bio
- * @bio: The direct io bio thats being completed
- *
- * This is meant to be called by any filesystem that uses their own dio_submit_t
- * so that the DIO specific endio actions are dealt with after the filesystem
- * has done it's completion work.
- */
-void dio_end_io(struct bio *bio)
-{
-	struct dio *dio = bio->bi_private;
-
-	if (dio->is_async)
-		dio_bio_end_aio(bio);
-	else
-		dio_bio_end_io(bio);
-}
-EXPORT_SYMBOL_GPL(dio_end_io);
-
 static inline void
 dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	      struct block_device *bdev,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3f881a892ea7..9b3f250d634c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3202,8 +3202,6 @@ enum {
 	DIO_SKIP_HOLES	= 0x02,
 };
 
-void dio_end_io(struct bio *bio);
-
 ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 			     struct block_device *bdev, struct iov_iter *iter,
 			     get_block_t get_block,
-- 
2.26.2



* [PATCH 5/6] btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
  2020-06-29 19:23 [PATCH 0/6 v10] btrfs direct-io using iomap Goldwyn Rodrigues
                   ` (3 preceding siblings ...)
  2020-06-29 19:23 ` [PATCH 4/6] fs: remove dio_end_io() Goldwyn Rodrigues
@ 2020-06-29 19:23 ` Goldwyn Rodrigues
  2020-06-29 19:23 ` [PATCH 6/6] btrfs: split btrfs_direct_IO to read and write part Goldwyn Rodrigues
  5 siblings, 0 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2020-06-29 19:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, fdmanana, dsterba, david, darrick.wong, hch,
	Goldwyn Rodrigues, Nikolay Borisov, Johannes Thumshirn

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Since we now perform direct reads under i_rwsem, we can remove this
inode flag, which was used to coordinate unlocked reads.

The truncate call takes i_rwsem, which means it is now correctly
synchronized with concurrent direct reads.
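
A sketch of why this is sufficient, with the calls taken from
btrfs_setsize() in the diff below (i_rwsem is held exclusively here,
which excludes the inode_lock_shared() taken by direct reads):

	truncate_setsize(inode, newsize);
	inode_dio_wait(inode);	/* drain any in-flight async DIO */
	ret = btrfs_truncate(inode, newsize == oldsize);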

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jth@kernel.org>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/btrfs_inode.h | 18 ------------------
 fs/btrfs/inode.c       |  3 ---
 2 files changed, 21 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index e7d709505cb1..aeff56a0e105 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -28,7 +28,6 @@ enum {
 	BTRFS_INODE_NEEDS_FULL_SYNC,
 	BTRFS_INODE_COPY_EVERYTHING,
 	BTRFS_INODE_IN_DELALLOC_LIST,
-	BTRFS_INODE_READDIO_NEED_LOCK,
 	BTRFS_INODE_HAS_PROPS,
 	BTRFS_INODE_SNAPSHOT_FLUSH,
 };
@@ -313,23 +312,6 @@ struct btrfs_dio_private {
 	u8 csums[];
 };
 
-/*
- * Disable DIO read nolock optimization, so new dio readers will be forced
- * to grab i_mutex. It is used to avoid the endless truncate due to
- * nonlocked dio read.
- */
-static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
-{
-	set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
-	smp_mb();
-}
-
-static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
-{
-	smp_mb__before_atomic();
-	clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
-}
-
 /* Array of bytes with variable length, hexadecimal format 0x1234 */
 #define CSUM_FMT				"0x%*phN"
 #define CSUM_FMT_VALUE(size, bytes)		size, bytes
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0fa75af35a1f..264b676ebf29 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4835,10 +4835,7 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 
 		truncate_setsize(inode, newsize);
 
-		/* Disable nonlocked read DIO to avoid the endless truncate */
-		btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
 		inode_dio_wait(inode);
-		btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
 
 		ret = btrfs_truncate(inode, newsize == oldsize);
 		if (ret && inode->i_nlink) {
-- 
2.26.2



* [PATCH 6/6] btrfs: split btrfs_direct_IO to read and write part
  2020-06-29 19:23 [PATCH 0/6 v10] btrfs direct-io using iomap Goldwyn Rodrigues
                   ` (4 preceding siblings ...)
  2020-06-29 19:23 ` [PATCH 5/6] btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK Goldwyn Rodrigues
@ 2020-06-29 19:23 ` Goldwyn Rodrigues
  5 siblings, 0 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2020-06-29 19:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, fdmanana, dsterba, david, darrick.wong, hch,
	Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

The read and write versions don't have anything in common except for
the call to iomap_dio_rw(), so split this function and merge each half
into its only caller.

Originally proposed by Christoph Hellwig <hch@lst.de>.
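
In outline, the split looks like this (full versions in the diff
below):

	/* read side: takes i_rwsem shared around iomap_dio_rw() */
	static ssize_t btrfs_direct_read(struct kiocb *iocb,
					 struct iov_iter *to);

	/* write side: called with i_rwsem held from
	 * btrfs_file_write_iter(); drops it for non-extending writes
	 * and takes dio_sem around iomap_dio_rw() */
	static ssize_t btrfs_direct_write(struct kiocb *iocb,
					  struct iov_iter *from);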

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  4 +-
 fs/btrfs/file.c  | 95 +++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/inode.c | 82 +----------------------------------------
 3 files changed, 90 insertions(+), 91 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 677f170434e3..1037969cda63 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -28,6 +28,7 @@
 #include <linux/dynamic_debug.h>
 #include <linux/refcount.h>
 #include <linux/crc32c.h>
+#include <linux/iomap.h>
 #include "extent-io-tree.h"
 #include "extent_io.h"
 #include "extent_map.h"
@@ -2935,7 +2936,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
 extern const struct dentry_operations btrfs_dentry_operations;
-ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
+extern const struct iomap_ops btrfs_dio_iomap_ops;
+extern const struct iomap_dio_ops btrfs_dops;
 
 /* ioctl.c */
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9d486350f1bf..8c738f8101c4 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1825,21 +1825,65 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 	return num_written ? num_written : ret;
 }
 
-static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
+			       const struct iov_iter *iter, loff_t offset)
+{
+	const unsigned int blocksize_mask = fs_info->sectorsize - 1;
+
+	if (offset & blocksize_mask)
+		return -EINVAL;
+
+	if (iov_iter_alignment(iter) & blocksize_mask)
+		return -EINVAL;
+
+	return 0;
+}
+
+static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
-	loff_t pos;
-	ssize_t written;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	loff_t pos = iocb->ki_pos;
+	ssize_t written = 0;
 	ssize_t written_buffered;
 	loff_t endbyte;
 	int err;
+	size_t count = 0;
+	bool relock = false;
+	int flags = IOMAP_DIO_RWF_NO_STALE_PAGECACHE;
 
-	written = btrfs_direct_IO(iocb, from);
+	if (check_direct_IO(fs_info, from, pos))
+		goto buffered;
+
+	count = iov_iter_count(from);
+	/*
+	 * If the write DIO is beyond the EOF, we need to update the isize,
+	 * but it is protected by i_mutex. So we cannot unlock the i_mutex
+	 * in this case.
+	 */
+	if (pos + count <= inode->i_size) {
+		inode_unlock(inode);
+		relock = true;
+	} else if (iocb->ki_flags & IOCB_NOWAIT) {
+		return -EAGAIN;
+	}
+
+	if (is_sync_kiocb(iocb))
+		flags |= IOMAP_DIO_RWF_SYNCIO;
+
+	down_read(&BTRFS_I(inode)->dio_sem);
+	written = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dops,
+			       flags);
+	up_read(&BTRFS_I(inode)->dio_sem);
+
+	if (relock)
+		inode_lock(inode);
 
 	if (written < 0 || !iov_iter_count(from))
 		return written;
 
+buffered:
 	pos = iocb->ki_pos;
 	written_buffered = btrfs_buffered_write(iocb, from);
 	if (written_buffered < 0) {
@@ -1990,7 +2034,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
-		num_written = __btrfs_direct_write(iocb, from);
+		num_written = btrfs_direct_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_write(iocb, from);
 		if (num_written > 0)
@@ -3504,16 +3548,47 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static int check_direct_read(struct btrfs_fs_info *fs_info,
+			     const struct iov_iter *iter, loff_t offset)
+{
+	int ret;
+	int i, seg;
+
+	ret = check_direct_IO(fs_info, iter, offset);
+	if (ret < 0)
+		return ret;
+
+	for (seg = 0; seg < iter->nr_segs; seg++)
+		for (i = seg + 1; i < iter->nr_segs; i++)
+			if (iter->iov[seg].iov_base == iter->iov[i].iov_base)
+				return -EINVAL;
+	return 0;
+}
+
+static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+	int flags = IOMAP_DIO_RWF_NO_STALE_PAGECACHE;
+
+	if (check_direct_read(btrfs_sb(inode->i_sb), to, iocb->ki_pos))
+		return 0;
+
+	if (is_sync_kiocb(iocb))
+		flags |= IOMAP_DIO_RWF_SYNCIO;
+
+	inode_lock_shared(inode);
+	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dops, flags);
+	inode_unlock_shared(inode);
+	return ret;
+}
+
 static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t ret = 0;
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
-		struct inode *inode = file_inode(iocb->ki_filp);
-
-		inode_lock_shared(inode);
-		ret = btrfs_direct_IO(iocb, to);
-		inode_unlock_shared(inode);
+		ret = btrfs_direct_read(iocb, to);
 		if (ret < 0)
 			return ret;
 	}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 264b676ebf29..864415a17b1f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -29,7 +29,6 @@
 #include <linux/swap.h>
 #include <linux/migrate.h>
 #include <linux/sched/mm.h>
-#include <linux/iomap.h>
 #include <asm/unaligned.h>
 #include "misc.h"
 #include "ctree.h"
@@ -7844,92 +7843,15 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 	return BLK_QC_T_NONE;
 }
 
-static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
-			       const struct iov_iter *iter, loff_t offset)
-{
-	int seg;
-	int i;
-	unsigned int blocksize_mask = fs_info->sectorsize - 1;
-	ssize_t retval = -EINVAL;
-
-	if (offset & blocksize_mask)
-		goto out;
-
-	if (iov_iter_alignment(iter) & blocksize_mask)
-		goto out;
-
-	/* If this is a write we don't need to check anymore */
-	if (iov_iter_rw(iter) != READ || !iter_is_iovec(iter))
-		return 0;
-	/*
-	 * Check to make sure we don't have duplicate iov_base's in this
-	 * iovec, if so return EINVAL, otherwise we'll get csum errors
-	 * when reading back.
-	 */
-	for (seg = 0; seg < iter->nr_segs; seg++) {
-		for (i = seg + 1; i < iter->nr_segs; i++) {
-			if (iter->iov[seg].iov_base == iter->iov[i].iov_base)
-				goto out;
-		}
-	}
-	retval = 0;
-out:
-	return retval;
-}
-
-static const struct iomap_ops btrfs_dio_iomap_ops = {
+const struct iomap_ops btrfs_dio_iomap_ops = {
 	.iomap_begin            = btrfs_dio_iomap_begin,
 	.iomap_end              = btrfs_dio_iomap_end,
 };
 
-static const struct iomap_dio_ops btrfs_dops = {
+const struct iomap_dio_ops btrfs_dops = {
 	.submit_io		= btrfs_submit_direct,
 };
 
-ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct extent_changeset *data_reserved = NULL;
-	loff_t offset = iocb->ki_pos;
-	size_t count = 0;
-	bool relock = false;
-	ssize_t ret;
-	int flags = IOMAP_DIO_RWF_NO_STALE_PAGECACHE;
-
-	if (check_direct_IO(fs_info, iter, offset))
-		return 0;
-
-	count = iov_iter_count(iter);
-	if (iov_iter_rw(iter) == WRITE) {
-		/*
-		 * If the write DIO is beyond the EOF, we need update
-		 * the isize, but it is protected by i_mutex. So we can
-		 * not unlock the i_mutex at this case.
-		 */
-		if (offset + count <= inode->i_size) {
-			inode_unlock(inode);
-			relock = true;
-		}
-		down_read(&BTRFS_I(inode)->dio_sem);
-	}
-
-	if (is_sync_kiocb(iocb))
-		flags |= IOMAP_DIO_RWF_SYNCIO;
-
-	ret = iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dops,
-			flags);
-
-	if (iov_iter_rw(iter) == WRITE) {
-		up_read(&BTRFS_I(inode)->dio_sem);
-	}
-	if (relock)
-		inode_lock(inode);
-	extent_changeset_free(data_reserved);
-	return ret;
-}
-
 static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		__u64 start, __u64 len)
 {
-- 
2.26.2



* Re: [PATCH 1/6] iomap: Convert wait_for_completion to flags
  2020-06-29 19:23 ` [PATCH 1/6] iomap: Convert wait_for_completion to flags Goldwyn Rodrigues
@ 2020-06-29 23:03   ` David Sterba
  2020-06-30 16:35   ` David Sterba
  2020-07-01  7:50   ` Christoph Hellwig
  2 siblings, 0 replies; 30+ messages in thread
From: David Sterba @ 2020-06-29 23:03 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, fdmanana, dsterba, david,
	darrick.wong, hch, Goldwyn Rodrigues

On Mon, Jun 29, 2020 at 02:23:48PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Convert the wait_for_completion boolean to a flags argument so we can
> pass more flags to iomap_dio_rw().

> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -405,7 +405,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>  ssize_t
>  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		bool wait_for_completion)
> +		int dio_flags)

I think flags should be an unsigned type as it's for bitwise operations.
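
The prototype would then read something like this (a sketch, not part
of the posted patch):

	ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
			const struct iomap_ops *ops,
			const struct iomap_dio_ops *dops,
			unsigned int dio_flags);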


* Re: [PATCH 1/6] iomap: Convert wait_for_completion to flags
  2020-06-29 19:23 ` [PATCH 1/6] iomap: Convert wait_for_completion to flags Goldwyn Rodrigues
  2020-06-29 23:03   ` David Sterba
@ 2020-06-30 16:35   ` David Sterba
  2020-07-01  7:34     ` Johannes Thumshirn
  2020-07-01  7:50   ` Christoph Hellwig
  2 siblings, 1 reply; 30+ messages in thread
From: David Sterba @ 2020-06-30 16:35 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, fdmanana, dsterba, david,
	darrick.wong, hch, Goldwyn Rodrigues

On Mon, Jun 29, 2020 at 02:23:48PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Convert the wait_for_completion boolean to a flags argument so we can
> pass more flags to iomap_dio_rw().
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/ext4/file.c        | 11 +++++++++--
>  fs/gfs2/file.c        | 14 ++++++++++----
>  fs/iomap/direct-io.c  |  3 ++-
>  fs/xfs/xfs_file.c     | 15 +++++++++++----
>  fs/zonefs/super.c     | 16 ++++++++++++----
>  include/linux/iomap.h | 11 ++++++++++-

Though it's an API change, I think you should CC all the involved
subsystems' mailing lists too. I don't see GFS2 or zonefs.


* Re: [PATCH 1/6] iomap: Convert wait_for_completion to flags
  2020-06-30 16:35   ` David Sterba
@ 2020-07-01  7:34     ` Johannes Thumshirn
  0 siblings, 0 replies; 30+ messages in thread
From: Johannes Thumshirn @ 2020-07-01  7:34 UTC (permalink / raw)
  To: dsterba, Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, fdmanana, david, darrick.wong, hch,
	Goldwyn Rodrigues

On 30/06/2020 18:35, David Sterba wrote:
> Though it's an API change I think you should CC all involved subsystems'
> mailinglists too.  I don't see GFS2 or zonefs.

zonefs doesn't have a list of its own; we're using fsdevel. But both
Damien and I are subscribed to the btrfs list as well, so no biggie for
zonefs.


* Re: [PATCH 1/6] iomap: Convert wait_for_completion to flags
  2020-06-29 19:23 ` [PATCH 1/6] iomap: Convert wait_for_completion to flags Goldwyn Rodrigues
  2020-06-29 23:03   ` David Sterba
  2020-06-30 16:35   ` David Sterba
@ 2020-07-01  7:50   ` Christoph Hellwig
  2 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-01  7:50 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, fdmanana, dsterba, david,
	darrick.wong, hch, Goldwyn Rodrigues

On Mon, Jun 29, 2020 at 02:23:48PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Convert the wait_for_completion boolean to a flags argument so we can
> pass more flags to iomap_dio_rw().

FYI, when I did this a while ago for a stalled patch series, I used a
single namespace for the flags passed to iomap_dio_rw() and the flags
stored in dio->flags, which at that point seemed a bit cleaner to me:

http://git.infradead.org/users/hch/xfs.git/commitdiff/1733619fefab1b0020d3ab91ebd0a72ccec982af


* always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-06-29 19:23 ` [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails Goldwyn Rodrigues
@ 2020-07-01  7:53   ` Christoph Hellwig
  2020-07-07 12:43     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-01  7:53 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, fdmanana, dsterba, david,
	darrick.wong, hch, Goldwyn Rodrigues, cluster-devel, linux-ext4,
	linux-xfs

On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to
> indicate that, if page invalidation fails, control should be returned
> to the filesystem so it may fall back to buffered mode.
> 
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

I'd like to start a discussion of whether this shouldn't really be the
default behavior.  If we have page cache that can't be invalidated, it
actually makes a whole lot of sense to not do direct I/O, avoid the
warnings, etc.

Adding all the relevant lists.

> ---
>  fs/iomap/direct-io.c  |  8 +++++++-
>  include/linux/iomap.h | 14 ++++++++++++++
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index fd22bff61569..2459c76e41ab 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -484,8 +484,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	 */
>  	ret = invalidate_inode_pages2_range(mapping,
>  			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -	if (ret)
> +	if (ret) {
> +		if (dio_flags & IOMAP_DIO_RWF_NO_STALE_PAGECACHE) {
> +			if (ret == -EBUSY)
> +				ret = 0;
> +			goto out_free_dio;
> +		}
>  		dio_warn_stale_pagecache(iocb->ki_filp);
> +	}
>  	ret = 0;
>  
>  	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 8a4ba1635202..2ebb8a298cd8 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -262,7 +262,21 @@ struct iomap_dio_ops {
>  /*
>   * Wait for completion of DIO
>   */
> +
>  #define IOMAP_DIO_RWF_SYNCIO			(1 << 0)
> +/*
> + * Direct IO will attempt to keep the page cache coherent by
> + * invalidating the inode's page cache over the range of the DIO.
> + * That can fail if something else is actively using the page cache.
> + * If this happens and the DIO continues, the data in the page
> + * cache will become stale.
> + *
> + * Set this flag if you want the DIO to abort without issuing any IO
> + * or error if it fails to invalidate the page cache successfully.
> + * This allows the IO submitter to fallback to buffered IO to resubmit
> + * IO
> + */
> +#define IOMAP_DIO_RWF_NO_STALE_PAGECACHE	(1 << 1)
>  
>  ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -- 
> 2.26.2
---end quoted text---


* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-01  7:53   ` always fall back to buffered I/O after invalidation failures, was: " Christoph Hellwig
@ 2020-07-07 12:43     ` Goldwyn Rodrigues
  2020-07-07 12:57       ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Goldwyn Rodrigues @ 2020-07-07 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-btrfs, fdmanana, dsterba, david,
	darrick.wong, cluster-devel, linux-ext4, linux-xfs

On  9:53 01/07, Christoph Hellwig wrote:
> On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > that if the page invalidation fails, return control to the
> > filesystem so it may fall back to buffered mode.
> > 
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> I'd like to start a discussion of whether this shouldn't really be the
> default behavior.  If we have page cache that can't be invalidated it
> actually makes a whole lot of sense to not do direct I/O, avoid the
> warnings, etc.
> 
> Adding all the relevant lists.

Since no one has responded so far, let me see if I can stir the cauldron :)

What error should be returned in such a case? I think the
userspace process must be immediately informed if it is unable to
invalidate the page cache and complete the direct I/O. Currently, the
iomap code treats this as a writeback error and continues with the
direct I/O, and the userspace process only learns of the error at
file closure.

If such a change is incorporated, are the current userspace applications
prepared for it?
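
For reference, the fallback this flag enables on the filesystem side
would look roughly like the sketch below (a minimal sketch only;
fs_dio_write() and fs_buffered_write() are hypothetical placeholders,
not the actual btrfs functions):

static ssize_t fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	ssize_t written;

	/*
	 * Try the direct path first.  With
	 * IOMAP_DIO_RWF_NO_STALE_PAGECACHE set, iomap_dio_rw() returns 0
	 * without submitting any IO if the page cache invalidation fails.
	 */
	written = fs_dio_write(iocb, from);	/* wraps iomap_dio_rw() */
	if (written < 0 || !iov_iter_count(from))
		return written;

	/* Nothing (or not everything) was submitted; go through the cache. */
	return fs_buffered_write(iocb, from);
}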


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-07 12:43     ` Goldwyn Rodrigues
@ 2020-07-07 12:57       ` Matthew Wilcox
  2020-07-07 13:00         ` Christoph Hellwig
  2020-07-07 13:49         ` Goldwyn Rodrigues
  0 siblings, 2 replies; 30+ messages in thread
From: Matthew Wilcox @ 2020-07-07 12:57 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Christoph Hellwig, linux-fsdevel, linux-btrfs, fdmanana, dsterba,
	david, darrick.wong, cluster-devel, linux-ext4, linux-xfs

On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> On  9:53 01/07, Christoph Hellwig wrote:
> > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > 
> > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > that if the page invalidation fails, return control to the
> > > filesystem so it may fall back to buffered mode.
> > > 
> > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > I'd like to start a discussion of whether this shouldn't really be the
> > default behavior.  If we have page cache that can't be invalidated it
> > actually makes a whole lot of sense to not do direct I/O, avoid the
> > warnings, etc.
> > 
> > Adding all the relevant lists.
> 
> Since no one has responded so far, let me see if I can stir the cauldron :)
> 
> What error should be returned in such a case? I think the

Christoph's message is ambiguous.  I don't know if he means "fail the
I/O with an error" or "satisfy the I/O through the page cache".  I'm
strongly in favour of the latter.  Indeed, I'm in favour of not invalidating
the page cache at all for direct I/O.  For reads, I think the page cache
should be used to satisfy any portion of the read which is currently
cached.  For writes, I think we should write into the page cache pages
which currently exist, and then force those pages to be written back,
but left in cache.
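
A rough sketch of the write half of that idea, using existing helpers
and the usual iocb/iov_iter/mapping locals (illustrative only, and
ignoring the refinement of touching only pages already in the cache):

	/*
	 * Write through the page cache, then push the dirtied pages to
	 * storage but leave them cached.
	 */
	written = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
	if (written > 0)
		ret = filemap_write_and_wait_range(mapping,
				iocb->ki_pos, iocb->ki_pos + written - 1);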


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-07 12:57       ` Matthew Wilcox
@ 2020-07-07 13:00         ` Christoph Hellwig
  2020-07-08  6:51           ` Dave Chinner
  2020-07-07 13:49         ` Goldwyn Rodrigues
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-07 13:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Goldwyn Rodrigues, Christoph Hellwig, linux-fsdevel, linux-btrfs,
	fdmanana, dsterba, david, darrick.wong, cluster-devel,
	linux-ext4, linux-xfs

On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> > On  9:53 01/07, Christoph Hellwig wrote:
> > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > 
> > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > > that if the page invalidation fails, return control to the
> > > > filesystem so it may fall back to buffered mode.
> > > > 
> > > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > 
> > > I'd like to start a discussion of whether this shouldn't really be the
> > > default behavior.  If we have page cache that can't be invalidated it
> > > actually makes a whole lot of sense to not do direct I/O, avoid the
> > > warnings, etc.
> > > 
> > > Adding all the relevant lists.
> > 
> > Since no one has responded so far, let me see if I can stir the cauldron :)
> > 
> > What error should be returned in such a case? I think the
> 
> Christoph's message is ambiguous.  I don't know if he means "fail the
> I/O with an error" or "satisfy the I/O through the page cache".  I'm
> strongly in favour of the latter.

Same here.  Sorry if my previous mail was unclear.

> Indeed, I'm in favour of not invalidating
> the page cache at all for direct I/O.  For reads, I think the page cache
> should be used to satisfy any portion of the read which is currently
> cached.  For writes, I think we should write into the page cache pages
> which currently exist, and then force those pages to be written back,
> but left in cache.

Something like that, yes.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-07 12:57       ` Matthew Wilcox
  2020-07-07 13:00         ` Christoph Hellwig
@ 2020-07-07 13:49         ` Goldwyn Rodrigues
  2020-07-07 14:01           ` Darrick J. Wong
  1 sibling, 1 reply; 30+ messages in thread
From: Goldwyn Rodrigues @ 2020-07-07 13:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-fsdevel, linux-btrfs, fdmanana, dsterba,
	david, darrick.wong, cluster-devel, linux-ext4, linux-xfs

On 13:57 07/07, Matthew Wilcox wrote:
> On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> > On  9:53 01/07, Christoph Hellwig wrote:
> > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > 
> > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > > that if the page invalidation fails, return control to the
> > > > filesystem so it may fall back to buffered mode.
> > > > 
> > > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > 
> > > I'd like to start a discussion of whether this shouldn't really be the
> > > default behavior.  If we have page cache that can't be invalidated it
> > > actually makes a whole lot of sense to not do direct I/O, avoid the
> > > warnings, etc.
> > > 
> > > Adding all the relevant lists.
> > 
> > Since no one has responded so far, let me see if I can stir the cauldron :)
> > 
> > What error should be returned in such a case? I think the
> 
> Christoph's message is ambiguous.  I don't know if he means "fail the
> I/O with an error" or "satisfy the I/O through the page cache".  I'm
> strongly in favour of the latter.  Indeed, I'm in favour of not invalidating
> the page cache at all for direct I/O.  For reads, I think the page cache
> should be used to satisfy any portion of the read which is currently

That indeed would make reads faster. How about if the pages are dirty
during DIO reads?
Should a direct I/O read be responsible for making sure that the dirty
pages are written back? Technically, a direct I/O read means we are
reading from the device.

> cached.  For writes, I think we should write into the page cache pages
> which currently exist, and then force those pages to be written back,
> but left in cache.

Yes, that makes sense.
If this is implemented, what would be the difference between O_DIRECT
and O_DSYNC, if any?

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-07 13:49         ` Goldwyn Rodrigues
@ 2020-07-07 14:01           ` Darrick J. Wong
  2020-07-07 14:30             ` Goldwyn Rodrigues
  0 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2020-07-07 14:01 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Matthew Wilcox, Christoph Hellwig, linux-fsdevel, linux-btrfs,
	fdmanana, dsterba, david, cluster-devel, linux-ext4, linux-xfs

On Tue, Jul 07, 2020 at 08:49:52AM -0500, Goldwyn Rodrigues wrote:
> On 13:57 07/07, Matthew Wilcox wrote:
> > On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> > > On  9:53 01/07, Christoph Hellwig wrote:
> > > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > > 
> > > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > > > that if the page invalidation fails, return control to the
> > > > > filesystem so it may fall back to buffered mode.
> > > > > 
> > > > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > 
> > > > I'd like to start a discussion of whether this shouldn't really be the
> > > > default behavior.  If we have page cache that can't be invalidated it
> > > > actually makes a whole lot of sense to not do direct I/O, avoid the
> > > > warnings, etc.
> > > > 
> > > > Adding all the relevant lists.
> > > 
> > > Since no one has responded so far, let me see if I can stir the cauldron :)
> > > 
> > > What error should be returned in such a case? I think the
> > 
> > Christoph's message is ambiguous.  I don't know if he means "fail the
> > I/O with an error" or "satisfy the I/O through the page cache".  I'm
> > strongly in favour of the latter.  Indeed, I'm in favour of not invalidating
> > the page cache at all for direct I/O.  For reads, I think the page cache
> > should be used to satisfy any portion of the read which is currently
> 
> That indeed would make reads faster. How about if the pages are dirty
> during DIO reads?
> Should a direct I/O read be responsible for making sure that the dirty
> pages are written back? Technically, a direct I/O read means we are
> reading from the device.

The filemap_write_and_wait_range() call should persist that data, right?
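
For reference, this is roughly what iomap_dio_rw() already does before
submitting the direct IO (condensed from fs/iomap/direct-io.c of this
era):

	/* Flush dirty page cache over the DIO range before going to disk. */
	ret = filemap_write_and_wait_range(mapping, pos, end);
	if (ret)
		goto out_free_dio;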

> > cached.  For writes, I think we should write into the page cache pages
> > which currently exist, and then force those pages to be written back,
> > but left in cache.
> 
> Yes, that makes sense.
> If this is implemented, what would be the difference between O_DIRECT
> and O_DSYNC, if any?

Presumably a direct write would proceed as it does today if there's no
pagecache at all?

--D

> -- 
> Goldwyn

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-07 14:01           ` Darrick J. Wong
@ 2020-07-07 14:30             ` Goldwyn Rodrigues
  0 siblings, 0 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2020-07-07 14:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Christoph Hellwig, linux-fsdevel, linux-btrfs,
	fdmanana, dsterba, david, cluster-devel, linux-ext4, linux-xfs

On  7:01 07/07, Darrick J. Wong wrote:
> On Tue, Jul 07, 2020 at 08:49:52AM -0500, Goldwyn Rodrigues wrote:
> > On 13:57 07/07, Matthew Wilcox wrote:
> > > On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> > > > On  9:53 01/07, Christoph Hellwig wrote:
> > > > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > > > 
> > > > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > > > > that if the page invalidation fails, return control to the
> > > > > > filesystem so it may fall back to buffered mode.
> > > > > > 
> > > > > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > > 
> > > > > I'd like to start a discussion of whether this shouldn't really be the
> > > > > default behavior.  If we have page cache that can't be invalidated it
> > > > > actually makes a whole lot of sense to not do direct I/O, avoid the
> > > > > warnings, etc.
> > > > > 
> > > > > Adding all the relevant lists.
> > > > 
> > > > Since no one has responded so far, let me see if I can stir the cauldron :)
> > > > 
> > > > What error should be returned in such a case? I think the
> > > 
> > > Christoph's message is ambiguous.  I don't know if he means "fail the
> > > I/O with an error" or "satisfy the I/O through the page cache".  I'm
> > > strongly in favour of the latter.  Indeed, I'm in favour of not invalidating
> > > the page cache at all for direct I/O.  For reads, I think the page cache
> > > should be used to satisfy any portion of the read which is currently
> > 
> > That indeed would make reads faster. How about if the pages are dirty
> > during DIO reads?
> > Should a direct I/O read be responsible for making sure that the dirty
> > pages are written back? Technically, a direct I/O read means we are
> > reading from the device.
> 
> The filemap_write_and_wait_range() call should persist that data, right?

Right. filemap_write_and_wait_range() would not make sense for writes
though.

> 
> > > cached.  For writes, I think we should write into the page cache pages
> > > which currently exist, and then force those pages to be written back,
> > > but left in cache.
> > 
> > Yes, that makes sense.
> > If this is implemented, what would be the difference between O_DIRECT
> > and O_DSYNC, if any?
> 
> Presumably a direct write would proceed as it does today if there's no
> pagecache at all?
> 

Yes, correct. Just that it would leave pages in the cache instead of
invalidating them after the DIO write is complete.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-07 13:00         ` Christoph Hellwig
@ 2020-07-08  6:51           ` Dave Chinner
  2020-07-08 13:54             ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2020-07-08  6:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Goldwyn Rodrigues, linux-fsdevel, linux-btrfs,
	fdmanana, dsterba, darrick.wong, cluster-devel, linux-ext4,
	linux-xfs

On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote:
> > On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> > > On  9:53 01/07, Christoph Hellwig wrote:
> > > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > > 
> > > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> > > > > that if the page invalidation fails, return control to the
> > > > > filesystem so it may fall back to buffered mode.
> > > > > 
> > > > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > 
> > > > I'd like to start a discussion of whether this shouldn't really be the
> > > > default behavior.  If we have page cache that can't be invalidated it
> > > > actually makes a whole lot of sense to not do direct I/O, avoid the
> > > > warnings, etc.
> > > > 
> > > > Adding all the relevant lists.
> > > 
> > > Since no one has responded so far, let me see if I can stir the cauldron :)
> > > 
> > > What error should be returned in such a case? I think the
> > 
> > Christoph's message is ambiguous.  I don't know if he means "fail the
> > I/O with an error" or "satisfy the I/O through the page cache".  I'm
> > strongly in favour of the latter.
> 
> Same here.  Sorry if my previous mail was unclear.
> 
> > Indeed, I'm in favour of not invalidating
> > the page cache at all for direct I/O.  For reads, I think the page cache
> > should be used to satisfy any portion of the read which is currently
> > cached.  For writes, I think we should write into the page cache pages
> > which currently exist, and then force those pages to be written back,
> > but left in cache.
> 
> Something like that, yes.

So are we really willing to take the performance regression that
occurs from copying out of the page cache consuming lots more CPU
than an actual direct IO read? Or that direct IO writes suddenly
serialise because there are page cache pages and now we have to do
buffered IO?

Direct IO should be a deterministic, zero-copy IO path to/from
storage. Using the CPU to copy data during direct IO is the complete
opposite of the intended functionality, not to mention the behaviour
that many applications have been carefully designed and tuned for.

Hence I think that forcing iomap to use cached pages for DIO is a
non-starter. I have no problems with providing infrastructure that
allows filesystems to -opt in- to using buffered IO for the direct
IO path. However, the change in IO behaviour caused by unpredictably
switching between direct IO and buffered IO (e.g. suddenly DIO
writes serialise -all IO-) will cause unacceptable performance
regressions for many applications and be -very difficult to
diagnose- in production systems.

IOWs, we need to let the individual filesystems decide how they want
to use the page cache for direct IO. Just because we have new direct
IO infrastructure (i.e. iomap) it does not mean we can just make
wholesale changes to the direct IO path behaviour...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-08  6:51           ` Dave Chinner
@ 2020-07-08 13:54             ` Matthew Wilcox
  2020-07-08 16:54               ` Christoph Hellwig
  2020-07-09  2:25               ` Dave Chinner
  0 siblings, 2 replies; 30+ messages in thread
From: Matthew Wilcox @ 2020-07-08 13:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Goldwyn Rodrigues, linux-fsdevel, linux-btrfs,
	fdmanana, dsterba, darrick.wong, cluster-devel, linux-ext4,
	linux-xfs

On Wed, Jul 08, 2020 at 04:51:27PM +1000, Dave Chinner wrote:
> On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote:
> > > Indeed, I'm in favour of not invalidating
> > > the page cache at all for direct I/O.  For reads, I think the page cache
> > > should be used to satisfy any portion of the read which is currently
> > > cached.  For writes, I think we should write into the page cache pages
> > > which currently exist, and then force those pages to be written back,
> > > but left in cache.
> > 
> > Something like that, yes.
> 
> So are we really willing to take the performance regression that
> occurs from copying out of the page cache consuming lots more CPU
> than an actual direct IO read? Or that direct IO writes suddenly
> serialise because there are page cache pages and now we have to do
> buffered IO?
> 
> Direct IO should be a deterministic, zero-copy IO path to/from
> storage. Using the CPU to copy data during direct IO is the complete
> opposite of the intended functionality, not to mention the behaviour
> that many applications have been carefully designed and tuned for.

Direct I/O isn't deterministic though.  If the file isn't shared, then
it works great, but as soon as you get mixed buffered and direct I/O,
everything is already terrible.  Direct I/Os perform pagecache lookups
already, but instead of using the data that we found in the cache, we
(if it's dirty) write it back, wait for the write to complete, remove
the page from the pagecache and then perform another I/O to get the data
that we just wrote out!  And then the app that's using buffered I/O has
to read it back in again.
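
In iomap terms that dance is roughly the following sequence (condensed
from fs/iomap/direct-io.c):

	/* Write back any dirty cached data over the DIO range... */
	ret = filemap_write_and_wait_range(mapping, pos, end);

	/* ...then drop those now-clean pages from the cache... */
	ret = invalidate_inode_pages2_range(mapping,
			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);

	/* ...and finally the DIO reads the same data back from the device. */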

Nobody's proposing changing Direct I/O to exclusively work through the
pagecache.  The proposal is to behave less weirdly when there's already
data in the pagecache.

I have had an objection raised off-list.  In a scenario with a block
device shared between two systems and an application which does direct
I/O, everything is normally fine.  If one of the systems uses tar to
back up the contents of the block device then the application on that
system will no longer see the writes from the other system because
there's nothing to invalidate the pagecache on the first system.

Unfortunately, this is in direct conflict with the performance
problem caused by some little arsewipe deciding to do:

$ while true; do dd if=/lib/x86_64-linux-gnu/libc-2.30.so iflag=direct of=/dev/null; done

... doesn't hurt me because my root filesystem is on ext4 which doesn't
purge the cache.  But anything using iomap gets all the pages for libc
kicked out of the cache, and that's a lot of fun.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-08 13:54             ` Matthew Wilcox
@ 2020-07-08 16:54               ` Christoph Hellwig
  2020-07-08 17:11                 ` Matthew Wilcox
  2020-07-09  8:26                 ` [Cluster-devel] " Steven Whitehouse
  2020-07-09  2:25               ` Dave Chinner
  1 sibling, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-08 16:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Christoph Hellwig, Goldwyn Rodrigues,
	linux-fsdevel, linux-btrfs, fdmanana, dsterba, darrick.wong,
	cluster-devel, linux-ext4, linux-xfs

On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:
> Direct I/O isn't deterministic though.  If the file isn't shared, then
> it works great, but as soon as you get mixed buffered and direct I/O,
> everything is already terrible.  Direct I/Os perform pagecache lookups
> already, but instead of using the data that we found in the cache, we
> (if it's dirty) write it back, wait for the write to complete, remove
> the page from the pagecache and then perform another I/O to get the data
> that we just wrote out!  And then the app that's using buffered I/O has
> to read it back in again.

Mostly agreed.  That being said I suspect invalidating clean cache
might still be a good idea.  The original idea was mostly on how
to deal with invalidation failures of any kind, but falling back for
any kind of dirty cache also makes at least some sense.

> I have had an objection raised off-list.  In a scenario with a block
> device shared between two systems and an application which does direct
> I/O, everything is normally fine.  If one of the systems uses tar to
> back up the contents of the block device then the application on that
> system will no longer see the writes from the other system because
> there's nothing to invalidate the pagecache on the first system.

Err, WTF?  If someone accesses shared block devices with random
applications, all bets are off anyway.

> 
> Unfortunately, this is in direct conflict with the performance
> problem caused by some little arsewipe deciding to do:
> 
> $ while true; do dd if=/lib/x86_64-linux-gnu/libc-2.30.so iflag=direct of=/dev/null; done
> 
> ... doesn't hurt me because my root filesystem is on ext4 which doesn't
> purge the cache.  But anything using iomap gets all the pages for libc
> kicked out of the cache, and that's a lot of fun.

ext4 uses iomap..

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-08 16:54               ` Christoph Hellwig
@ 2020-07-08 17:11                 ` Matthew Wilcox
  2020-07-09  8:26                 ` [Cluster-devel] " Steven Whitehouse
  1 sibling, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2020-07-08 17:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Goldwyn Rodrigues, linux-fsdevel, linux-btrfs,
	fdmanana, dsterba, darrick.wong, cluster-devel, linux-ext4,
	linux-xfs

On Wed, Jul 08, 2020 at 06:54:12PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:
> > Direct I/O isn't deterministic though.  If the file isn't shared, then
> > it works great, but as soon as you get mixed buffered and direct I/O,
> > everything is already terrible.  Direct I/Os perform pagecache lookups
> > already, but instead of using the data that we found in the cache, we
> > (if it's dirty) write it back, wait for the write to complete, remove
> > the page from the pagecache and then perform another I/O to get the data
> > that we just wrote out!  And then the app that's using buffered I/O has
> > to read it back in again.
> 
> Mostly agreed.  That being said I suspect invalidating clean cache
> might still be a good idea.  The original idea was mostly on how
> to deal with invalidation failures of any kind, but falling back for
> any kind of dirty cache also makes at least some sense.

That's certainly the btrfs problem that needs to be solved, but I think
it's all part of the directio misdesign.

> > I have had an objection raised off-list.  In a scenario with a block
> > device shared between two systems and an application which does direct
> > I/O, everything is normally fine.  If one of the systems uses tar to
> > back up the contents of the block device then the application on that
> > system will no longer see the writes from the other system because
> > there's nothing to invalidate the pagecache on the first system.
> 
> Err, WTF?  If someone accesses shared block devices with random
> applications, all bets are off anyway.

That doesn't mean that customers don't do it.  It is, of course, not
recommended, but we suspect people do it anyway.  Because it does
work, unfortunately.  I'd be open to making this exact situation
deterministically not work (eg disallowing mixing O_DIRECT and
non-O_DIRECT openers of block devices), but making it suddenly
non-deterministically give you old data is a non-starter.

> > Unfortunately, this is in direct conflict with the performance
> > problem caused by some little arsewipe deciding to do:
> > 
> > $ while true; do dd if=/lib/x86_64-linux-gnu/libc-2.30.so iflag=direct of=/dev/null; done
> > 
> > ... doesn't hurt me because my root filesystem is on ext4 which doesn't
> > purge the cache.  But anything using iomap gets all the pages for libc
> > kicked out of the cache, and that's a lot of fun.
> 
> ext4 uses iomap..

I happen to be running an older kernel that doesn't on this laptop ;-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-08 13:54             ` Matthew Wilcox
  2020-07-08 16:54               ` Christoph Hellwig
@ 2020-07-09  2:25               ` Dave Chinner
  2020-07-09 16:09                 ` Darrick J. Wong
  2020-07-12 11:36                 ` Avi Kivity
  1 sibling, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2020-07-09  2:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Goldwyn Rodrigues, linux-fsdevel, linux-btrfs,
	fdmanana, dsterba, darrick.wong, cluster-devel, linux-ext4,
	linux-xfs

On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 08, 2020 at 04:51:27PM +1000, Dave Chinner wrote:
> > On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote:
> > > > Indeed, I'm in favour of not invalidating
> > > > the page cache at all for direct I/O.  For reads, I think the page cache
> > > > should be used to satisfy any portion of the read which is currently
> > > > cached.  For writes, I think we should write into the page cache pages
> > > > which currently exist, and then force those pages to be written back,
> > > > but left in cache.
> > > 
> > > Something like that, yes.
> > 
> > So are we really willing to take the performance regression that
> > occurs from copying out of the page cache consuming lots more CPU
> > than an actual direct IO read? Or that direct IO writes suddenly
> > serialise because there are page cache pages and now we have to do
> > buffered IO?
> > 
> > Direct IO should be a deterministic, zero-copy IO path to/from
> > storage. Using the CPU to copy data during direct IO is the complete
> > opposite of the intended functionality, not to mention the behaviour
> > that many applications have been carefully designed and tuned for.
> 
> Direct I/O isn't deterministic though.

When all the application is doing is direct IO, it is as
deterministic as the underlying storage behaviour. This is the best
we can possibly do from the perspective of the filesystem, and this
is largely what Direct IO requires from the filesystem.

Direct IO starts from delegating all responsibility for IO
synchronisation, data coherency and integrity to userspace, and then
follows up by requiring the filesystem and kernel to stay out of the
IO path except where it is absolutely necessary to read or write
data to/from the underlying storage hardware. Serving Direct IO from
the page cache violates the second of these requirements.

> If the file isn't shared, then
> it works great, but as soon as you get mixed buffered and direct I/O,
> everything is already terrible.

Right, but that's *the rare exception* for applications using direct
IO, not the common fast path. It is the slow path where -shit has
already gone wrong on the production machine-, and it most
definitely does not change the DIO requirements that the filesystem
needs to provide userspace via the direct IO path.

Optimising the slow exception path is fine if it does not affect the
guarantees we try to provide through the common/fast path. If it
does affect behaviour of the fast path, then we must be able to
either turn it off or provide our own alternative implementation
that does not have that cost.

> Direct I/Os perform pagecache lookups
> already, but instead of using the data that we found in the cache, we
> (if it's dirty) write it back, wait for the write to complete, remove
> the page from the pagecache and then perform another I/O to get the data
> that we just wrote out!  And then the app that's using buffered I/O has
> to read it back in again.

Yup, that's because we have a history going back 20+ years of people
finding performance regressions in applications using direct IO when
we incorrectly left pages in the page cache rather than
invalidating them and continuing to do direct IO.


> Nobody's proposing changing Direct I/O to exclusively work through the
> pagecache.  The proposal is to behave less weirdly when there's already
> data in the pagecache.

No, the proposal is to make direct IO behave *less
deterministically* if there is data in the page cache.

e.g. Instead of having a predictable submission CPU overhead and
read latency of 100us for your data, this proposal makes the claim
that it is always better to burn 10x the IO submission CPU for a
single IO to copy the data and give that specific IO 10x lower
latency than it is to submit 10 async IOs to keep the IO pipeline
full.

What it fails to take into account is that in spending that CPU time
to copy the data, we haven't submitted 10 other IOs and so the
actual in-flight IO for the application has decreased. If
performance comes from keeping the IO pipeline as close to 100% full
as possible, then copying the data out of the page cache will cause
performance regressions.

i.e. Hit 5 page cache pages in 5 IOs in a row, and the IO queue
depth craters because we've only fulfilled 5 complete IOs instead of
submitting 50 entire IOs. This is the hidden cost of synchronous IO
via CPU data copying vs async IO via hardware offload, and if we
take that into account we must look at future hardware performance
trends to determine if this cost is going to increase or decrease in
future.

That is: CPUs are not getting faster anytime soon. IO subsystems are
still deep in the "performance doubles every 2 years" part of the
technology curve (pcie 3.0->4.0 just happened, 4->5 is a year away,
5->6 is 3-4 years away, etc). Hence our reality is that we are deep
within a performance trend curve that tells us synchronous CPU
operations are not getting faster, but IO bandwidth and IOPS are
going to increase massively over the next 5-10 years. Hence putting
(already expensive) synchronous CPU operations in the asynchronous
zero-data-touch IO fast path is -exactly the wrong direction to be
moving-.

This is simple math. The gap between IO latency and bandwidth and
CPU addressable memory latency and bandwidth is closing all the
time, and the closer that gap gets the less sense it makes to use
CPU addressable memory for buffering syscall based read and write
IO. We are not quite yet at the cross-over point, but we really
aren't that far from it.

> I have had an objection raised off-list.  In a scenario with a block
> device shared between two systems and an application which does direct
> I/O, everything is normally fine.  If one of the systems uses tar to
> back up the contents of the block device then the application on that
> system will no longer see the writes from the other system because
> there's nothing to invalidate the pagecache on the first system.

I'm sorry you have to deal with that. :(

Back in the world of local filesystems, sharing a block device across
systems without using a clustered filesystem to maintain storage
device level data coherency across those multiple machines is not
supported in any way, shape or form, direct IO or not.

> Unfortunately, this is in direct conflict with the performance
> problem caused by some little arsewipe deciding to do:
> 
> $ while true; do dd if=/lib/x86_64-linux-gnu/libc-2.30.so iflag=direct of=/dev/null; done

This has come up in the past, and I'm pretty sure I sent a patch to
stop iomap from invalidating the cache on DIO reads because the same
data is on disk as is in memory and it solves this whole problem.  I
cannot find that patch in the archives - I must have sent it inline
to the discussion as I'm about to do again.

Yeah, now it comes back to me - the context was about using
RWF_NOWAIT to detect page cache residency for page cache timing
attacks and I mentioned doing exactly the above as a mechanism to
demand-trigger invalidation of the mapped page cache. The
solution was to only invalidate the page cache on DIO writes, but we
didn't do it straight away because I needed to audit the XFS code to
determine if there were still real reasons it was necessary.

The patch is attached below. The DIO read path is unchanged except
for the fact it skips the invalidation.  i.e. the dio read still
goes to disk, but we can leave the data in memory because it is the
same as what was on disk. This does not perturb DIO behaviour by
inserting synchronous page cache copies or exclusive inode locking
into the async DIO path and so will not cause DIO performance
regressions. However, it does allow mmap() and read() to be served
from cache at the same time as we issue overlapping DIO reads.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

iomap: Only invalidate page cache pages on direct IO writes

From: Dave Chinner <dchinner@redhat.com>

The historic requirement for XFS to invalidate cached pages on
direct IO reads has been lost in the twisty pages of history - it was
inherited from Irix, which implemented page cache invalidation on
read as a method of working around problems synchronising page
cache state with uncached IO.

XFS has carried this ever since. In the initial linux ports it was
necessary to get mmap and DIO to play "ok" together and not
immediately corrupt data. This was the state of play until the linux
kernel had infrastructure to track unwritten extents and synchronise
page faults with allocations and unwritten extent conversions
(->page_mkwrite infrastructure). IOWs, the page cache invalidation
on DIO read was necessary to prevent trivial data corruptions. This
didn't solve all the problems, though.

There were performance problems if we didn't invalidate the entire
page cache over the file on read - we couldn't easily determine if
the cached pages were over the range of the IO, and invalidation
required taking a serialising lock (i_mutex) on the inode. This
serialising lock was an issue for XFS, as it was the only exclusive
lock in the direct IO read path.

Hence if there were any cached pages, we'd just invalidate the
entire file in one go so that subsequent IOs didn't need to take the
serialising lock. This was a problem that prevented ranged
invalidation from being particularly useful for avoiding the
remaining coherency issues. This was solved with the conversion of
i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
use i_rwsem. Hence we could now just do ranged invalidation and the
performance problem went away.

However, page cache invalidation was still needed to serialise
sub-page/sub-block zeroing via direct IO against buffered IO because
bufferhead state attached to the cached page could get out of whack
when direct IOs were issued.  We've removed bufferheads from the
XFS code, and we don't carry any extent state on the cached pages
anymore, and so this problem has gone away, too.

IOWs, it would appear that we don't have any good reason to be
invalidating the page cache on DIO reads anymore. Hence remove the
invalidation on read because it is unnecessary overhead,
not needed to maintain coherency between mmap/buffered access and
direct IO anymore, and prevents anyone from using direct IO reads
to intentionally invalidate the page cache of a file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/direct-io.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ec7b78e6feca..ef0059eb34b5 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -475,23 +475,25 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (ret)
 		goto out_free_dio;
 
-	/*
-	 * Try to invalidate cache pages for the range we're direct
-	 * writing.  If this invalidation fails, tough, the write will
-	 * still work, but racing two incompatible write paths is a
-	 * pretty crazy thing to do, so we don't support it 100%.
-	 */
-	ret = invalidate_inode_pages2_range(mapping,
-			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
-	if (ret)
-		dio_warn_stale_pagecache(iocb->ki_filp);
-	ret = 0;
+	if (iov_iter_rw(iter) == WRITE) {
+		/*
+		 * Try to invalidate cache pages for the range we're direct
+		 * writing.  If this invalidation fails, tough, the write will
+		 * still work, but racing two incompatible write paths is a
+		 * pretty crazy thing to do, so we don't support it 100%.
+		 */
+		ret = invalidate_inode_pages2_range(mapping,
+				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
+		if (ret)
+			dio_warn_stale_pagecache(iocb->ki_filp);
+		ret = 0;
 
-	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
-	    !inode->i_sb->s_dio_done_wq) {
-		ret = sb_init_dio_done_wq(inode->i_sb);
-		if (ret < 0)
-			goto out_free_dio;
+		if (!wait_for_completion &&
+		    !inode->i_sb->s_dio_done_wq) {
+			ret = sb_init_dio_done_wq(inode->i_sb);
+			if (ret < 0)
+				goto out_free_dio;
+		}
 	}
 
 	inode_dio_begin(inode);

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-08 16:54               ` Christoph Hellwig
  2020-07-08 17:11                 ` Matthew Wilcox
@ 2020-07-09  8:26                 ` Steven Whitehouse
  1 sibling, 0 replies; 30+ messages in thread
From: Steven Whitehouse @ 2020-07-09  8:26 UTC (permalink / raw)
  To: Christoph Hellwig, Matthew Wilcox
  Cc: linux-xfs, fdmanana, darrick.wong, Goldwyn Rodrigues,
	Dave Chinner, dsterba, cluster-devel, linux-fsdevel, linux-ext4,
	linux-btrfs

Hi,

On 08/07/2020 17:54, Christoph Hellwig wrote:
> On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:
>> Direct I/O isn't deterministic though.  If the file isn't shared, then
>> it works great, but as soon as you get mixed buffered and direct I/O,
>> everything is already terrible.  Direct I/Os perform pagecache lookups
>> already, but instead of using the data that we found in the cache, we
>> (if it's dirty) write it back, wait for the write to complete, remove
>> the page from the pagecache and then perform another I/O to get the data
>> that we just wrote out!  And then the app that's using buffered I/O has
>> to read it back in again.
> Mostly agreed.  That being said I suspect invalidating clean cache
> might still be a good idea.  The original idea was mostly on how
> to deal with invalidation failures of any kind, but falling back for
> any kind of dirty cache also makes at least some sense.
>
>> I have had an objection raised off-list.  In a scenario with a block
>> device shared between two systems and an application which does direct
>> I/O, everything is normally fine.  If one of the systems uses tar to
>> back up the contents of the block device then the application on that
>> system will no longer see the writes from the other system because
>> there's nothing to invalidate the pagecache on the first system.
> Err, WTF?  If someone accesses shared block devices with random
> applications, all bets are off anyway.

On GFS2 the locking should take care of that. Not 100% sure about OCFS2 
without looking, but I'm fairly sure that they have a similar 
arrangement. So this shouldn't be a problem unless there is an 
additional cluster fs that I'm not aware of that they are using in this 
case. It would be good to confirm which fs they are using.

Steve.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-09  2:25               ` Dave Chinner
@ 2020-07-09 16:09                 ` Darrick J. Wong
  2020-07-09 17:05                   ` Matthew Wilcox
  2020-07-12 11:36                 ` Avi Kivity
  1 sibling, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2020-07-09 16:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Christoph Hellwig, Goldwyn Rodrigues,
	linux-fsdevel, linux-btrfs, fdmanana, dsterba, cluster-devel,
	linux-ext4, linux-xfs

On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:
> > On Wed, Jul 08, 2020 at 04:51:27PM +1000, Dave Chinner wrote:
> > > On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote:
> > > > On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote:
> > > > > Indeed, I'm in favour of not invalidating
> > > > > the page cache at all for direct I/O.  For reads, I think the page cache
> > > > > should be used to satisfy any portion of the read which is currently
> > > > > cached.  For writes, I think we should write into the page cache pages
> > > > > which currently exist, and then force those pages to be written back,
> > > > > but left in cache.
> > > > 
> > > > Something like that, yes.
> > > 
> > > So are we really willing to take the performance regression that
> > > occurs from copying out of the page cache consuming lots more CPU
> > > than an actual direct IO read? Or that direct IO writes suddenly
> > > serialise because there are page cache pages and now we have to do
> > > buffered IO?
> > > 
> > > Direct IO should be a deterministic, zero-copy IO path to/from
> > > storage. Using the CPU to copy data during direct IO is the complete
> > > opposite of the intended functionality, not to mention the behaviour
> > > that many applications have been carefully designed and tuned for.
> > 
> > Direct I/O isn't deterministic though.
> 
> When all the application is doing is direct IO, it is as
> deterministic as the underlying storage behaviour. This is the best
> we can possibly do from the perspective of the filesystem, and this
> is largely what Direct IO requires from the filesystem.
> 
> Direct IO starts from delegating all responsibility for IO
> synchronisation, data coherency and integrity to userspace, and then
> follows up by requiring the filesystem and kernel to stay out of the
> IO path except where it is absolutely necessary to read or write
> data to/from the underlying storage hardware. Serving Direct IO from
> the page cache violates the second of these requirements.
> 
> > If the file isn't shared, then
> > it works great, but as soon as you get mixed buffered and direct I/O,
> > everything is already terrible.
> 
> Right, but that's *the rare exception* for applications using direct
> IO, not the common fast path. It is the slow path where -shit has
> already gone wrong on the production machine-, and it most
> definitely does not change the DIO requirements that the filesystem
> needs to provide userspace via the direct IO path.
> 
> Optimising the slow exception path is fine if it does not affect the
> guarantees we try to provide through the common/fast path. If it
> does affect behaviour of the fast path, then we must be able to
> either turn it off or provide our own alternative implementation
> that does not have that cost.
> 
> > Direct I/Os perform pagecache lookups
> > already, but instead of using the data that we found in the cache, we
> > (if it's dirty) write it back, wait for the write to complete, remove
> > the page from the pagecache and then perform another I/O to get the data
> > that we just wrote out!  And then the app that's using buffered I/O has
> > to read it back in again.
> 
> Yup, that's because we have a history going back 20+ years of people
> finding performance regressions in applications using direct IO when
> we incorrectly left pages in the page cache rather than
> invalidating them and continuing to do direct IO.
> 
> 
> > Nobody's proposing changing Direct I/O to exclusively work through the
> > pagecache.  The proposal is to behave less weirdly when there's already
> > data in the pagecache.
> 
> No, the proposal is to make direct IO behave *less
> deterministically* if there is data in the page cache.
> 
> e.g. Instead of having a predictable submission CPU overhead and
> read latency of 100us for your data, this proposal makes the claim
> that it is always better to burn 10x the IO submission CPU for a
> single IO to copy the data and give that specific IO 10x lower
> latency than it is to submit 10 async IOs to keep the IO pipeline
> full.
> 
> What it fails to take into account is that in spending that CPU time
> to copy the data, we haven't submitted 10 other IOs and so the
> actual in-flight IO for the application has decreased. If
> performance comes from keeping the IO pipeline as close to 100% full
> as possible, then copying the data out of the page cache will cause
> performance regressions.
> 
> i.e. Hit 5 page cache pages in 5 IOs in a row, and the IO queue
> depth craters because we've only fulfilled 5 complete IOs instead of
> submitting 50 entire IOs. This is the hidden cost of synchronous IO
> via CPU data copying vs async IO via hardware offload, and if we
> take that into account we must look at future hardware performance
> trends to determine if this cost is going to increase or decrease in
> future.
> 
> That is: CPUs are not getting faster anytime soon. IO subsystems are
> still deep in the "performance doubles every 2 years" part of the
> technology curve (pcie 3.0->4.0 just happened, 4->5 is a year away,
> 5->6 is 3-4 years away, etc). Hence our reality is that we are deep
> within a performance trend curve that tells us synchronous CPU
> operations are not getting faster, but IO bandwidth and IOPS are
> going to increase massively over the next 5-10 years. Hence putting
> (already expensive) synchronous CPU operations in the asynchronous
> zero-data-touch IO fast path is -exactly the wrong direction to be
> moving-.
> 
> This is simple math. The gap between IO latency and bandwidth and
> CPU addressable memory latency and bandwidth is closing all the
> time, and the closer that gap gets the less sense it makes to use
> CPU addressable memory for buffering syscall based read and write
> IO. We are not quite yet at the cross-over point, but we really
> aren't that far from it.
> 
> > I have had an objection raised off-list.  In a scenario with a block
> > device shared between two systems and an application which does direct
> > I/O, everything is normally fine.  If one of the systems uses tar to
> > back up the contents of the block device then the application on that
> > system will no longer see the writes from the other system because
> > there's nothing to invalidate the pagecache on the first system.
> 
> I'm sorry you have to deal with that. :(
> 
> Back in the world of local filesystems, sharing a block device across
> systems without using a clustered filesystem to maintain storage
> device level data coherency across those multiple machines is not
> supported in any way, shape or form, direct IO or not.
> 
> > Unfortunately, this is in direct conflict with the performance
> > problem caused by some little arsewipe deciding to do:
> > 
> > $ while true; do dd if=/lib/x86_64-linux-gnu/libc-2.30.so iflag=direct of=/dev/null; done
> 
> This has come up in the past, and I'm pretty sure I sent a patch to
> stop iomap from invalidating the cache on DIO reads because the same
> data is on disk as is in memory and it solves this whole problem.  I
> cannot find that patch in the archives - I must have sent it inline
> to the discussion as I'm about to do again.

And, um, it'll get buried in the archives again. :(

Though I guess I could just suck it in and see what happens, and maybe
"the mists of time" won't be an issue this time around.  Particularly
since I was the one asking you about this very hunk of code last night
on irc. :P

Oh well, onto the review...

> Yeah, now it comes back to me - the context was about using
> RWF_NOWAIT to detect page cache residency for page cache timing
> attacks and I mentioned doing exactly the above as a mechanism to
> demand-trigger invalidation of the mapped page cache. The
> solution was to only invalidate the page cache on DIO writes, but we
> didn't do it straight away because I needed to audit the XFS code to
> determine if there were still real reasons it was necessary.
> 
> The patch is attached below. The DIO read path is unchanged except
> for the fact it skips the invalidation.  i.e. the dio read still
> goes to disk, but we can leave the data in memory because it is the
> same as what was on disk. This does not perturb DIO behaviour by
> inserting synchronous page cache copies or exclusive inode locking
> into the async DIO path and so will not cause DIO performance
> regressions. However, it does allow mmap() and read() to be served
> from cache at the same time as we issue overlapping DIO reads.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> iomap: Only invalidate page cache pages on direct IO writes
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> The historic requirement for XFS to invalidate cached pages on
> direct IO reads has been lost in the twisty pages of history - it was
> inherited from Irix, which implemented page cache invalidation on
> read as a method of working around problems synchronising page
> cache state with uncached IO.

Urk.

> XFS has carried this ever since. In the initial linux ports it was
> necessary to get mmap and DIO to play "ok" together and not
> immediately corrupt data. This was the state of play until the linux
> kernel had infrastructure to track unwritten extents and synchronise
> page faults with allocations and unwritten extent conversions
> (->page_mkwrite infrastructure). IOWs, the page cache invalidation
> on DIO read was necessary to prevent trivial data corruptions. This
> didn't solve all the problems, though.
> 
> There were performance problems if we didn't invalidate the entire
> page cache over the file on read - we couldn't easily determine if
> the cached pages were over the range of the IO, and invalidation
> required taking a serialising lock (i_mutex) on the inode. This
> serialising lock was an issue for XFS, as it was the only exclusive
> lock in the direct IO read path.
> 
> Hence if there were any cached pages, we'd just invalidate the
> entire file in one go so that subsequent IOs didn't need to take the
> serialising lock. This was a problem that prevented ranged
> invalidation from being particularly useful for avoiding the
> remaining coherency issues. This was solved with the conversion of
> i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
> use i_rwsem. Hence we could now just do ranged invalidation and the
> performance problem went away.
> 
> However, page cache invalidation was still needed to serialise
> sub-page/sub-block zeroing via direct IO against buffered IO because
> bufferhead state attached to the cached page could get out of whack
> when direct IOs were issued.  We've removed bufferheads from the
> XFS code, and we don't carry any extent state on the cached pages
> anymore, and so this problem has gone away, too.
> 
> IOWs, it would appear that we don't have any good reason to be
> invalidating the page cache on DIO reads anymore. Hence remove the
> invalidation on read because it is unnecessary overhead,
> not needed to maintain coherency between mmap/buffered access and
> direct IO anymore, and prevents anyone from using direct IO reads
> to intentionally invalidate the page cache of a file.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap/direct-io.c | 34 ++++++++++++++++++----------------
>  1 file changed, 18 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index ec7b78e6feca..ef0059eb34b5 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -475,23 +475,25 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (ret)
>  		goto out_free_dio;
>  
> -	/*
> -	 * Try to invalidate cache pages for the range we're direct
> -	 * writing.  If this invalidation fails, tough, the write will
> -	 * still work, but racing two incompatible write paths is a
> -	 * pretty crazy thing to do, so we don't support it 100%.

I always wondered about the repeated use of 'write' in this comment
despite the lack of any sort of WRITE check logic.  Seems fine to me,
let's throw it on the fstests pile and see what happens.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> -	 */
> -	ret = invalidate_inode_pages2_range(mapping,
> -			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -	if (ret)
> -		dio_warn_stale_pagecache(iocb->ki_filp);
> -	ret = 0;
> +	if (iov_iter_rw(iter) == WRITE) {
> +		/*
> +		 * Try to invalidate cache pages for the range we're direct
> +		 * writing.  If this invalidation fails, tough, the write will
> +		 * still work, but racing two incompatible write paths is a
> +		 * pretty crazy thing to do, so we don't support it 100%.
> +		 */
> +		ret = invalidate_inode_pages2_range(mapping,
> +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> +		if (ret)
> +			dio_warn_stale_pagecache(iocb->ki_filp);
> +		ret = 0;
>  
> -	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> -	    !inode->i_sb->s_dio_done_wq) {
> -		ret = sb_init_dio_done_wq(inode->i_sb);
> -		if (ret < 0)
> -			goto out_free_dio;
> +		if (!wait_for_completion &&
> +		    !inode->i_sb->s_dio_done_wq) {
> +			ret = sb_init_dio_done_wq(inode->i_sb);
> +			if (ret < 0)
> +				goto out_free_dio;
>  	}
>  
>  	inode_dio_begin(inode);

^ permalink raw reply	[flat|nested] 30+ messages in thread
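
The user-visible effect of the hunk above: an O_DIRECT read no longer
drops cached pages over the range it reads. A minimal userspace sketch
of such a read, assuming a 4096-byte logical block size and an existing
"testfile" (both illustrative, not taken from the patch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	ssize_t n;
	int fd;

	fd = open("testfile", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* O_DIRECT needs block-aligned buffer, offset and length. */
	if (posix_memalign(&buf, 4096, 4096)) {
		fprintf(stderr, "posix_memalign failed\n");
		close(fd);
		return 1;
	}
	n = pread(fd, buf, 4096, 0);
	if (n < 0)
		perror("pread");
	else
		/*
		 * Pre-patch kernels would also invalidate any cached
		 * pages over [0, 4096) here; post-patch they survive.
		 */
		printf("read %zd bytes directly\n", n);
	free(buf);
	close(fd);
	return n < 0;
}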

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-09 16:09                 ` Darrick J. Wong
@ 2020-07-09 17:05                   ` Matthew Wilcox
  2020-07-09 17:10                     ` Darrick J. Wong
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2020-07-09 17:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Christoph Hellwig, Goldwyn Rodrigues,
	linux-fsdevel, linux-btrfs, fdmanana, dsterba, cluster-devel,
	linux-ext4, linux-xfs

On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> > iomap: Only invalidate page cache pages on direct IO writes
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The historic requirement for XFS to invalidate cached pages on
> > direct IO reads has been lost in the twisty pages of history - it was
> > inherited from Irix, which implemented page cache invalidation on
> > read as a method of working around problems synchronising page
> > cache state with uncached IO.
> 
> Urk.
> 
> > XFS has carried this ever since. In the initial linux ports it was
> > necessary to get mmap and DIO to play "ok" together and not
> > immediately corrupt data. This was the state of play until the linux
> > kernel had infrastructure to track unwritten extents and synchronise
> > page faults with allocations and unwritten extent conversions
> > (->page_mkwrite infrastructure). IOWs, the page cache invalidation
> > on DIO read was necessary to prevent trivial data corruptions. This
> > didn't solve all the problems, though.
> > 
> > There were performance problems if we didn't invalidate the entire
> > page cache over the file on read - we couldn't easily determine if
> > the cached pages were over the range of the IO, and invalidation
> > required taking a serialising lock (i_mutex) on the inode. This
> > serialising lock was an issue for XFS, as it was the only exclusive
> > lock in the direct IO read path.
> > 
> > Hence if there were any cached pages, we'd just invalidate the
> > entire file in one go so that subsequent IOs didn't need to take the
> > serialising lock. This was a problem that prevented ranged
> > invalidation from being particularly useful for avoiding the
> > remaining coherency issues. This was solved with the conversion of
> > i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
> > use i_rwsem. Hence we could now just do ranged invalidation and the
> > performance problem went away.
> > 
> > However, page cache invalidation was still needed to serialise
> > sub-page/sub-block zeroing via direct IO against buffered IO because
> > bufferhead state attached to the cached page could get out of whack
> > when direct IOs were issued.  We've removed bufferheads from the
> > XFS code, and we don't carry any extent state on the cached pages
> > anymore, and so this problem has gone away, too.
> > 
> > IOWs, it would appear that we don't have any good reason to be
> > invalidating the page cache on DIO reads anymore. Hence remove the
> > invalidation on read because it is unnecessary overhead,
> > not needed to maintain coherency between mmap/buffered access and
> > direct IO anymore, and prevents anyone from using direct IO reads
> > to intentionally invalidate the page cache of a file.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/iomap/direct-io.c | 33 +++++++++++++++++----------------
> >  1 file changed, 17 insertions(+), 16 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index ec7b78e6feca..ef0059eb34b5 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -475,23 +475,24 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  	if (ret)
> >  		goto out_free_dio;
> >  
> > -	/*
> > -	 * Try to invalidate cache pages for the range we're direct
> > -	 * writing.  If this invalidation fails, tough, the write will
> > -	 * still work, but racing two incompatible write paths is a
> > -	 * pretty crazy thing to do, so we don't support it 100%.
> 
> I always wondered about the repeated use of 'write' in this comment
> despite the lack of any sort of WRITE check logic.  Seems fine to me,
> let's throw it on the fstests pile and see what happens.
> 
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

> --D
> 
> > -	 */
> > -	ret = invalidate_inode_pages2_range(mapping,
> > -			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > -	if (ret)
> > -		dio_warn_stale_pagecache(iocb->ki_filp);
> > -	ret = 0;
> > +	if (iov_iter_rw(iter) == WRITE) {
> > +		/*
> > +		 * Try to invalidate cache pages for the range we're direct
> > +		 * writing.  If this invalidation fails, tough, the write will
> > +		 * still work, but racing two incompatible write paths is a
> > +		 * pretty crazy thing to do, so we don't support it 100%.
> > +		 */
> > +		ret = invalidate_inode_pages2_range(mapping,
> > +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > +		if (ret)
> > +			dio_warn_stale_pagecache(iocb->ki_filp);
> > +		ret = 0;
> >  
> > -	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > -	    !inode->i_sb->s_dio_done_wq) {
> > -		ret = sb_init_dio_done_wq(inode->i_sb);
> > -		if (ret < 0)
> > -			goto out_free_dio;
> > +		if (!wait_for_completion &&
> > +		    !inode->i_sb->s_dio_done_wq) {
> > +			ret = sb_init_dio_done_wq(inode->i_sb);
> > +			if (ret < 0)
> > +				goto out_free_dio;
> >  	}
> >  
> >  	inode_dio_begin(inode);

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-09 17:05                   ` Matthew Wilcox
@ 2020-07-09 17:10                     ` Darrick J. Wong
  2020-07-09 22:59                       ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2020-07-09 17:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Christoph Hellwig, Goldwyn Rodrigues,
	linux-fsdevel, linux-btrfs, fdmanana, dsterba, cluster-devel,
	linux-ext4, linux-xfs

On Thu, Jul 09, 2020 at 06:05:19PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> > > iomap: Only invalidate page cache pages on direct IO writes
> > > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The historic requirement for XFS to invalidate cached pages on
> > > direct IO reads has been lost in the twisty pages of history - it was
> > > inherited from Irix, which implemented page cache invalidation on
> > > read as a method of working around problems synchronising page
> > > cache state with uncached IO.
> > 
> > Urk.
> > 
> > > XFS has carried this ever since. In the initial linux ports it was
> > > necessary to get mmap and DIO to play "ok" together and not
> > > immediately corrupt data. This was the state of play until the linux
> > > kernel had infrastructure to track unwritten extents and synchronise
> > > page faults with allocations and unwritten extent conversions
> > > (->page_mkwrite infrastructure). IOWs, the page cache invalidation
> > > on DIO read was necessary to prevent trivial data corruptions. This
> > > didn't solve all the problems, though.
> > > 
> > > There were performance problems if we didn't invalidate the entire
> > > page cache over the file on read - we couldn't easily determine if
> > > the cached pages were over the range of the IO, and invalidation
> > > required taking a serialising lock (i_mutex) on the inode. This
> > > serialising lock was an issue for XFS, as it was the only exclusive
> > > lock in the direct IO read path.
> > > 
> > > Hence if there were any cached pages, we'd just invalidate the
> > > entire file in one go so that subsequent IOs didn't need to take the
> > > serialising lock. This was a problem that prevented ranged
> > > invalidation from being particularly useful for avoiding the
> > > remaining coherency issues. This was solved with the conversion of
> > > i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
> > > use i_rwsem. Hence we could now just do ranged invalidation and the
> > > performance problem went away.
> > > 
> > > However, page cache invalidation was still needed to serialise
> > > sub-page/sub-block zeroing via direct IO against buffered IO because
> > > bufferhead state attached to the cached page could get out of whack
> > > when direct IOs were issued.  We've removed bufferheads from the
> > > XFS code, and we don't carry any extent state on the cached pages
> > > anymore, and so this problem has gone away, too.
> > > 
> > > IOWs, it would appear that we don't have any good reason to be
> > > invalidating the page cache on DIO reads anymore. Hence remove the
> > > invalidation on read because it is unnecessary overhead,
> > > not needed to maintain coherency between mmap/buffered access and
> > > direct IO anymore, and prevents anyone from using direct IO reads
> > > to intentionally invalidate the page cache of a file.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/iomap/direct-io.c | 33 +++++++++++++++++----------------
> > >  1 file changed, 17 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index ec7b78e6feca..ef0059eb34b5 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -475,23 +475,24 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >  	if (ret)
> > >  		goto out_free_dio;
> > >  
> > > -	/*
> > > -	 * Try to invalidate cache pages for the range we're direct
> > > -	 * writing.  If this invalidation fails, tough, the write will
> > > -	 * still work, but racing two incompatible write paths is a
> > > -	 * pretty crazy thing to do, so we don't support it 100%.
> > 
> > I always wondered about the repeated use of 'write' in this comment
> > despite the lack of any sort of WRITE check logic.  Seems fine to me,
> > let's throw it on the fstests pile and see what happens.
> > 
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 
> > --D
> > 
> > > -	 */
> > > -	ret = invalidate_inode_pages2_range(mapping,
> > > -			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > -	if (ret)
> > > -		dio_warn_stale_pagecache(iocb->ki_filp);
> > > -	ret = 0;
> > > +	if (iov_iter_rw(iter) == WRITE) {
> > > +		/*
> > > +		 * Try to invalidate cache pages for the range we're direct
> > > +		 * writing.  If this invalidation fails, tough, the write will
> > > +		 * still work, but racing two incompatible write paths is a
> > > +		 * pretty crazy thing to do, so we don't support it 100%.
> > > +		 */
> > > +		ret = invalidate_inode_pages2_range(mapping,
> > > +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > +		if (ret)
> > > +			dio_warn_stale_pagecache(iocb->ki_filp);
> > > +		ret = 0;
> > >  
> > > -	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > -	    !inode->i_sb->s_dio_done_wq) {
> > > -		ret = sb_init_dio_done_wq(inode->i_sb);
> > > -		if (ret < 0)
> > > -			goto out_free_dio;
> > > +		if (!wait_for_completion &&
> > > +		    !inode->i_sb->s_dio_done_wq) {
> > > +			ret = sb_init_dio_done_wq(inode->i_sb);
> > > +			if (ret < 0)
> > > +				goto out_free_dio;

...and yes I did add in the closing brace here. :P

--D

> > >  	}
> > >  
> > >  	inode_dio_begin(inode);

^ permalink raw reply	[flat|nested] 30+ messages in thread
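
For anyone reading the hunk by eye: as posted it opens
if (iov_iter_rw(iter) == WRITE) { but reuses the pre-existing } to
close the inner block, leaving the outer block unclosed. With the
brace Darrick mentions adding, the new logic reads roughly as below
(a sketch of the quoted hunk, not a substitute patch):

	if (iov_iter_rw(iter) == WRITE) {
		/* Best-effort invalidation; failure only earns a warning. */
		ret = invalidate_inode_pages2_range(mapping,
				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
		if (ret)
			dio_warn_stale_pagecache(iocb->ki_filp);
		ret = 0;

		/* The completion workqueue is only needed for async writes. */
		if (!wait_for_completion &&
		    !inode->i_sb->s_dio_done_wq) {
			ret = sb_init_dio_done_wq(inode->i_sb);
			if (ret < 0)
				goto out_free_dio;
		}
	}	/* the closing brace added on commit */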

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-09 17:10                     ` Darrick J. Wong
@ 2020-07-09 22:59                       ` Dave Chinner
  2020-07-10 16:03                         ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2020-07-09 22:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Christoph Hellwig, Goldwyn Rodrigues,
	linux-fsdevel, linux-btrfs, fdmanana, dsterba, cluster-devel,
	linux-ext4, linux-xfs

On Thu, Jul 09, 2020 at 10:10:38AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 09, 2020 at 06:05:19PM +0100, Matthew Wilcox wrote:
> > On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote:
> > > On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> > > > -	 */
> > > > -	ret = invalidate_inode_pages2_range(mapping,
> > > > -			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > -	if (ret)
> > > > -		dio_warn_stale_pagecache(iocb->ki_filp);
> > > > -	ret = 0;
> > > > +	if (iov_iter_rw(iter) == WRITE) {
> > > > +		/*
> > > > +		 * Try to invalidate cache pages for the range we're direct
> > > > +		 * writing.  If this invalidation fails, tough, the write will
> > > > +		 * still work, but racing two incompatible write paths is a
> > > > +		 * pretty crazy thing to do, so we don't support it 100%.
> > > > +		 */
> > > > +		ret = invalidate_inode_pages2_range(mapping,
> > > > +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > +		if (ret)
> > > > +			dio_warn_stale_pagecache(iocb->ki_filp);
> > > > +		ret = 0;
> > > >  
> > > > -	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > > -	    !inode->i_sb->s_dio_done_wq) {
> > > > -		ret = sb_init_dio_done_wq(inode->i_sb);
> > > > -		if (ret < 0)
> > > > -			goto out_free_dio;
> > > > +		if (!wait_for_completion &&
> > > > +		    !inode->i_sb->s_dio_done_wq) {
> > > > +			ret = sb_init_dio_done_wq(inode->i_sb);
> > > > +			if (ret < 0)
> > > > +				goto out_free_dio;
> 
> ...and yes I did add in the closing brace here. :P

Doh! I forgot to refresh the patch after fixing that. :/

Thanks!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-09 22:59                       ` Dave Chinner
@ 2020-07-10 16:03                         ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-10 16:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Matthew Wilcox, Christoph Hellwig,
	Goldwyn Rodrigues, linux-fsdevel, linux-btrfs, fdmanana, dsterba,
	cluster-devel, linux-ext4, linux-xfs

This looks sane - slightly updated version below to not bother with
the ret and a few tidyups.

That being said and to get back to the discussion in this thread:
I think it would be saner to give up on direct I/O in case of the
invalidation failure.  I've cooked up a patch on top of this one
(for which I had a few trivial cleanups).  It is still under testing,
but I'll send the two out in a new thread.

^ permalink raw reply	[flat|nested] 30+ messages in thread
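
A rough sketch of the direction Christoph describes (his actual patch
follows in the new thread he mentions): make iomap fail the direct
write when invalidation fails and let the filesystem retry through the
page cache. The -ENOTBLK sentinel below is an assumption for
illustration:

	ret = invalidate_inode_pages2_range(mapping,
			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
	if (ret) {
		/* Give up on direct I/O rather than warn and press on. */
		ret = -ENOTBLK;
		goto out_free_dio;
	}

The filesystem's ->write_iter() would then catch that sentinel and
redo the write buffered.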

* Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
  2020-07-09  2:25               ` Dave Chinner
  2020-07-09 16:09                 ` Darrick J. Wong
@ 2020-07-12 11:36                 ` Avi Kivity
  1 sibling, 0 replies; 30+ messages in thread
From: Avi Kivity @ 2020-07-12 11:36 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox
  Cc: Christoph Hellwig, Goldwyn Rodrigues, linux-fsdevel, linux-btrfs,
	fdmanana, dsterba, darrick.wong, cluster-devel, linux-ext4,
	linux-xfs


On 09/07/2020 05.25, Dave Chinner wrote:
>
>> Nobody's proposing changing Direct I/O to exclusively work through the
>> pagecache.  The proposal is to behave less weirdly when there's already
>> data in the pagecache.
> No, the proposal is to make direct IO behave *less
> deterministically* if there is data in the page cache.
>
> e.g. Instead of having a predictable submission CPU overhead and
> read latency of 100us for your data, this proposal makes the claim
> that it is always better to burn 10x the IO submission CPU for a
> single IO to copy the data and give that specific IO 10x lower
> latency than it is to submit 10 async IOs to keep the IO pipeline
> full.
>
> What it fails to take into account is that in spending that CPU time
> to copy the data, we haven't submitted 10 other IOs and so the
> actual in-flight IO for the application has decreased. If
> performance comes from keeping the IO pipeline as close to 100% full
> as possible, then copying the data out of the page cache will cause
> performance regressions.
>
> i.e. Hit 5 page cache pages in 5 IOs in a row, and the IO queue
> depth craters because we've only fulfilled 5 complete IOs instead of
> submitting 50 entire IOs. This is the hidden cost of synchronous IO
> via CPU data copying vs async IO via hardware offload, and if we
> take that into account we must look at future hardware performance
> trends to determine if this cost is going to increase or decrease in
> future.
>
> That is: CPUs are not getting faster anytime soon. IO subsystems are
> still deep in the "performance doubles every 2 years" part of the
> technology curve (pcie 3.0->4.0 just happened, 4->5 is a year away,
> 5->6 is 3-4 years away, etc). Hence our reality is that we are deep
> within a performance trend curve that tells us synchronous CPU
> operations are not getting faster, but IO bandwidth and IOPS are
> going to increase massively over the next 5-10 years. Hence putting
> (already expensive) synchronous CPU operations in the asynchronous
> zero-data-touch IO fast path is -exactly the wrong direction to be
> moving-.
>
> This is simple math. The gap between IO latency and bandwidth and
> CPU addressable memory latency and bandwidth is closing all the
> time, and the closer that gap gets the less sense it makes to use
> CPU addressable memory for buffering syscall based read and write
> IO. We are not quite yet at the cross-over point, but we really
> aren't that far from it.
>
>

My use-case supports this. The application uses AIO+DIO, but backup may 
bring pages into page cache. For me, it is best to ignore page cache (as 
long as it's clean, which it is for backup) and serve from disk as usual.



^ permalink raw reply	[flat|nested] 30+ messages in thread
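
Dave's queue-depth arithmetic can be made concrete with a toy model;
every constant below is invented purely to show the shape of the
trade-off, not measured:

#include <stdio.h>

int main(void)
{
	const double submit_us = 10;	/* CPU cost to submit one async IO */
	const double copy_us   = 100;	/* CPU cost to copy one IO from cache */
	const double hit       = 0.05;	/* fraction of IOs that hit the cache */
	const double budget_us = 10000;	/* CPU budget per interval */

	/* All-async: every IO costs only a submission. */
	double async_ios = budget_us / submit_us;

	/* Mixed: cache hits burn copy time instead of feeding the queue. */
	double mixed_ios = budget_us / (hit * copy_us + (1 - hit) * submit_us);

	printf("async-only IOs per interval: %.0f\n", async_ios);
	printf("with cache copies:           %.0f\n", mixed_ios);
	return 0;
}

With these made-up costs a 5% hit rate already cuts the IOs fed to the
device per CPU interval by roughly a third - the pipeline starvation
Dave describes.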

end of thread, other threads:[~2020-07-12 11:36 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-29 19:23 [PATCH 0/6 v10] btrfs direct-io using iomap Goldwyn Rodrigues
2020-06-29 19:23 ` [PATCH 1/6] iomap: Convert wait_for_completion to flags Goldwyn Rodrigues
2020-06-29 23:03   ` David Sterba
2020-06-30 16:35   ` David Sterba
2020-07-01  7:34     ` Johannes Thumshirn
2020-07-01  7:50   ` Christoph Hellwig
2020-06-29 19:23 ` [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails Goldwyn Rodrigues
2020-07-01  7:53   ` always fall back to buffered I/O after invalidation failures, was: " Christoph Hellwig
2020-07-07 12:43     ` Goldwyn Rodrigues
2020-07-07 12:57       ` Matthew Wilcox
2020-07-07 13:00         ` Christoph Hellwig
2020-07-08  6:51           ` Dave Chinner
2020-07-08 13:54             ` Matthew Wilcox
2020-07-08 16:54               ` Christoph Hellwig
2020-07-08 17:11                 ` Matthew Wilcox
2020-07-09  8:26                 ` [Cluster-devel] " Steven Whitehouse
2020-07-09  2:25               ` Dave Chinner
2020-07-09 16:09                 ` Darrick J. Wong
2020-07-09 17:05                   ` Matthew Wilcox
2020-07-09 17:10                     ` Darrick J. Wong
2020-07-09 22:59                       ` Dave Chinner
2020-07-10 16:03                         ` Christoph Hellwig
2020-07-12 11:36                 ` Avi Kivity
2020-07-07 13:49         ` Goldwyn Rodrigues
2020-07-07 14:01           ` Darrick J. Wong
2020-07-07 14:30             ` Goldwyn Rodrigues
2020-06-29 19:23 ` [PATCH 3/6] btrfs: switch to iomap_dio_rw() for dio Goldwyn Rodrigues
2020-06-29 19:23 ` [PATCH 4/6] fs: remove dio_end_io() Goldwyn Rodrigues
2020-06-29 19:23 ` [PATCH 5/6] btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK Goldwyn Rodrigues
2020-06-29 19:23 ` [PATCH 6/6] btrfs: split btrfs_direct_IO to read and write part Goldwyn Rodrigues
