Linux-ext4 Archive on lore.kernel.org
 help / color / Atom feed
* RFC: iomap write invalidation
@ 2020-07-13  7:46 Christoph Hellwig
  2020-07-13  7:46 ` [PATCH 1/2] iomap: Only invalidate page cache pages on direct IO writes Christoph Hellwig
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Christoph Hellwig @ 2020-07-13  7:46 UTC (permalink / raw)
  To: Dave Chinner, Goldwyn Rodrigues
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Matthew Wilcox,
	linux-btrfs, linux-fsdevel, cluster-devel, linux-ext4, linux-xfs

Hi all,

this series has two parts:  the first one picks up Dave's patch to avoid
invalidation entierly for reads, picked up deep down from the btrfs iomap
thread.  The second one falls back to buffered writes if invalidation fails
instead of leaving a stale cache around.  Let me know what you think about
this approch.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/2] iomap: Only invalidate page cache pages on direct IO writes
  2020-07-13  7:46 RFC: iomap write invalidation Christoph Hellwig
@ 2020-07-13  7:46 ` Christoph Hellwig
  2020-07-13  7:46 ` [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures Christoph Hellwig
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2020-07-13  7:46 UTC (permalink / raw)
  To: Dave Chinner, Goldwyn Rodrigues
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Matthew Wilcox,
	linux-btrfs, linux-fsdevel, cluster-devel, linux-ext4, linux-xfs,
	Dave Chinner, Darrick J . Wong

From: Dave Chinner <dchinner@redhat.com>

The historic requirement for XFS to invalidate cached pages on
direct IO reads has been lost in the twisty pages of history - it was
inherited from Irix, which implemented page cache invalidation on
read as a method of working around problems synchronising page
cache state with uncached IO.

XFS has carried this ever since. In the initial linux ports it was
necessary to get mmap and DIO to play "ok" together and not
immediately corrupt data. This was the state of play until the linux
kernel had infrastructure to track unwritten extents and synchronise
page faults with allocations and unwritten extent conversions
(->page_mkwrite infrastructure). IOws, the page cache invalidation
on DIO read was necessary to prevent trivial data corruptions. This
didn't solve all the problems, though.

There were peformance problems if we didn't invalidate the entire
page cache over the file on read - we couldn't easily determine if
the cached pages were over the range of the IO, and invalidation
required taking a serialising lock (i_mutex) on the inode. This
serialising lock was an issue for XFS, as it was the only exclusive
lock in the direct Io read path.

Hence if there were any cached pages, we'd just invalidate the
entire file in one go so that subsequent IOs didn't need to take the
serialising lock. This was a problem that prevented ranged
invalidation from being particularly useful for avoiding the
remaining coherency issues. This was solved with the conversion of
i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
use i_rwsem. Hence we could now just do ranged invalidation and the
performance problem went away.

However, page cache invalidation was still needed to serialise
sub-page/sub-block zeroing via direct IO against buffered IO because
bufferhead state attached to the cached page could get out of whack
when direct IOs were issued.  We've removed bufferheads from the
XFS code, and we don't carry any extent state on the cached pages
anymore, and so this problem has gone away, too.

IOWs, it would appear that we don't have any good reason to be
invalidating the page cache on DIO reads anymore. Hence remove the
invalidation on read because it is unnecessary overhead,
not needed to maintain coherency between mmap/buffered access and
direct IO anymore, and prevents anyone from using direct IO reads
from intentionally invalidating the page cache of a file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/direct-io.c | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ec7b78e6fecaf9..190967e87b69e4 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -475,23 +475,22 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (ret)
 		goto out_free_dio;
 
-	/*
-	 * Try to invalidate cache pages for the range we're direct
-	 * writing.  If this invalidation fails, tough, the write will
-	 * still work, but racing two incompatible write paths is a
-	 * pretty crazy thing to do, so we don't support it 100%.
-	 */
-	ret = invalidate_inode_pages2_range(mapping,
-			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
-	if (ret)
-		dio_warn_stale_pagecache(iocb->ki_filp);
-	ret = 0;
+	if (iov_iter_rw(iter) == WRITE) {
+		/*
+		 * Try to invalidate cache pages for the range we are writing.
+		 * If this invalidation fails, tough, the write will still work,
+		 * but racing two incompatible write paths is a pretty crazy
+		 * thing to do, so we don't support it 100%.
+		 */
+		if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
+				end >> PAGE_SHIFT))
+			dio_warn_stale_pagecache(iocb->ki_filp);
 
-	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
-	    !inode->i_sb->s_dio_done_wq) {
-		ret = sb_init_dio_done_wq(inode->i_sb);
-		if (ret < 0)
-			goto out_free_dio;
+		if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
+			ret = sb_init_dio_done_wq(inode->i_sb);
+			if (ret < 0)
+				goto out_free_dio;
+		}
 	}
 
 	inode_dio_begin(inode);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures
  2020-07-13  7:46 RFC: iomap write invalidation Christoph Hellwig
  2020-07-13  7:46 ` [PATCH 1/2] iomap: Only invalidate page cache pages on direct IO writes Christoph Hellwig
@ 2020-07-13  7:46 ` Christoph Hellwig
  2020-07-13 11:55   ` Matthew Wilcox
                     ` (3 more replies)
  2020-07-15  1:47 ` RFC: iomap write invalidation Dave Chinner
  2020-07-20 21:51 ` Goldwyn Rodrigues
  3 siblings, 4 replies; 27+ messages in thread
From: Christoph Hellwig @ 2020-07-13  7:46 UTC (permalink / raw)
  To: Dave Chinner, Goldwyn Rodrigues
  Cc: Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Matthew Wilcox,
	linux-btrfs, linux-fsdevel, cluster-devel, linux-ext4, linux-xfs

Failing to invalid the page cache means data in incoherent, which is
a very bad state for the system.  Always fall back to buffered I/O
through the page cache if we can't invalidate mappings.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/ext4/file.c       |  2 ++
 fs/gfs2/file.c       |  3 ++-
 fs/iomap/direct-io.c | 13 ++++++++-----
 fs/iomap/trace.h     |  1 +
 fs/xfs/xfs_file.c    |  4 ++--
 fs/zonefs/super.c    |  7 +++++--
 6 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2a01e31a032c4c..0da6c2a2c32c1e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -544,6 +544,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		iomap_ops = &ext4_iomap_overwrite_ops;
 	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
 			   is_sync_kiocb(iocb) || unaligned_io || extend);
+	if (ret == -EREMCHG)
+		ret = 0;
 
 	if (extend)
 		ret = ext4_handle_inode_extension(inode, offset, ret, count);
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index fe305e4bfd3734..c7907d40c61d17 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -814,7 +814,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 
 	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
 			   is_sync_kiocb(iocb));
-
+	if (ret == -EREMCHG)
+		ret = 0;
 out:
 	gfs2_glock_dq(&gh);
 out_uninit:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 190967e87b69e4..62626235cdbe8d 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -10,6 +10,7 @@
 #include <linux/backing-dev.h>
 #include <linux/uio.h>
 #include <linux/task_io_accounting_ops.h>
+#include "trace.h"
 
 #include "../internal.h"
 
@@ -478,13 +479,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (iov_iter_rw(iter) == WRITE) {
 		/*
 		 * Try to invalidate cache pages for the range we are writing.
-		 * If this invalidation fails, tough, the write will still work,
-		 * but racing two incompatible write paths is a pretty crazy
-		 * thing to do, so we don't support it 100%.
+		 * If this invalidation fails, let the caller fall back to
+		 * buffered I/O.
 		 */
 		if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
-				end >> PAGE_SHIFT))
-			dio_warn_stale_pagecache(iocb->ki_filp);
+				end >> PAGE_SHIFT)) {
+			trace_iomap_dio_invalidate_fail(inode, pos, count);
+			ret = -EREMCHG;
+			goto out_free_dio;
+		}
 
 		if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
 			ret = sb_init_dio_done_wq(inode->i_sb);
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 5693a39d52fb63..fdc7ae388476f5 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -74,6 +74,7 @@ DEFINE_EVENT(iomap_range_class, name,	\
 DEFINE_RANGE_EVENT(iomap_writepage);
 DEFINE_RANGE_EVENT(iomap_releasepage);
 DEFINE_RANGE_EVENT(iomap_invalidatepage);
+DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
 
 #define IOMAP_TYPE_STRINGS \
 	{ IOMAP_HOLE,		"HOLE" }, \
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00db81eac80d6c..551cca39fa3ba6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -553,8 +553,8 @@ xfs_file_dio_aio_write(
 	xfs_iunlock(ip, iolock);
 
 	/*
-	 * No fallback to buffered IO on errors for XFS, direct IO will either
-	 * complete fully or fail.
+	 * No partial fallback to buffered IO on errors for XFS, direct IO will
+	 * either complete fully or fail.
 	 */
 	ASSERT(ret < 0 || ret == count);
 	return ret;
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 07bc42d62673ce..793850454b752f 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -786,8 +786,11 @@ static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
 		return -EFBIG;
 
-	if (iocb->ki_flags & IOCB_DIRECT)
-		return zonefs_file_dio_write(iocb, from);
+	if (iocb->ki_flags & IOCB_DIRECT) {
+		ret = zonefs_file_dio_write(iocb, from);
+		if (ret != -EREMCHG)
+			return ret;
+	}
 
 	return zonefs_file_buffered_write(iocb, from);
 }
-- 
2.26.2


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures
  2020-07-13  7:46 ` [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures Christoph Hellwig
@ 2020-07-13 11:55   ` Matthew Wilcox
  2020-07-14 11:00     ` Christoph Hellwig
  2020-07-13 12:20   ` Goldwyn Rodrigues
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2020-07-13 11:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Goldwyn Rodrigues, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, linux-btrfs, linux-fsdevel, cluster-devel,
	linux-ext4, linux-xfs

On Mon, Jul 13, 2020 at 09:46:33AM +0200, Christoph Hellwig wrote:
> Failing to invalid the page cache means data in incoherent, which is
> a very bad state for the system.  Always fall back to buffered I/O
> through the page cache if we can't invalidate mappings.

Is that the right approach though?  I don't have a full picture in my head,
but wouldn't we be better off marking these pages as !Uptodate and doing
the direct I/O?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures
  2020-07-13  7:46 ` [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures Christoph Hellwig
  2020-07-13 11:55   ` Matthew Wilcox
@ 2020-07-13 12:20   ` Goldwyn Rodrigues
  2020-07-13 16:09     ` David Sterba
  2020-07-13 15:39   ` Darrick J. Wong
  2020-07-14  1:41   ` Damien Le Moal
  3 siblings, 1 reply; 27+ messages in thread
From: Goldwyn Rodrigues @ 2020-07-13 12:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Damien Le Moal, Naohiro Aota, Johannes Thumshirn,
	Matthew Wilcox, linux-btrfs, linux-fsdevel, cluster-devel,
	linux-ext4, linux-xfs

On  9:46 13/07, Christoph Hellwig wrote:
> Failing to invalid the page cache means data in incoherent, which is
> a very bad state for the system.  Always fall back to buffered I/O
> through the page cache if we can't invalidate mappings.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Thanks. This will help btrfs. The current next tree contains the iomap
changes I recomended and would need to be reverted in order to
incorporate this. Once this is in the next tree I will (re)format the
btrfs iomap dio patches.

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/ext4/file.c       |  2 ++
>  fs/gfs2/file.c       |  3 ++-
>  fs/iomap/direct-io.c | 13 ++++++++-----
>  fs/iomap/trace.h     |  1 +
>  fs/xfs/xfs_file.c    |  4 ++--
>  fs/zonefs/super.c    |  7 +++++--
>  6 files changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2a01e31a032c4c..0da6c2a2c32c1e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -544,6 +544,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		iomap_ops = &ext4_iomap_overwrite_ops;
>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>  			   is_sync_kiocb(iocb) || unaligned_io || extend);
> +	if (ret == -EREMCHG)
> +		ret = 0;
>  
>  	if (extend)
>  		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index fe305e4bfd3734..c7907d40c61d17 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -814,7 +814,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  
>  	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
>  			   is_sync_kiocb(iocb));
> -
> +	if (ret == -EREMCHG)
> +		ret = 0;
>  out:
>  	gfs2_glock_dq(&gh);
>  out_uninit:
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 190967e87b69e4..62626235cdbe8d 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -10,6 +10,7 @@
>  #include <linux/backing-dev.h>
>  #include <linux/uio.h>
>  #include <linux/task_io_accounting_ops.h>
> +#include "trace.h"
>  
>  #include "../internal.h"
>  
> @@ -478,13 +479,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (iov_iter_rw(iter) == WRITE) {
>  		/*
>  		 * Try to invalidate cache pages for the range we are writing.
> -		 * If this invalidation fails, tough, the write will still work,
> -		 * but racing two incompatible write paths is a pretty crazy
> -		 * thing to do, so we don't support it 100%.
> +		 * If this invalidation fails, let the caller fall back to
> +		 * buffered I/O.
>  		 */
>  		if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
> -				end >> PAGE_SHIFT))
> -			dio_warn_stale_pagecache(iocb->ki_filp);
> +				end >> PAGE_SHIFT)) {
> +			trace_iomap_dio_invalidate_fail(inode, pos, count);
> +			ret = -EREMCHG;
> +			goto out_free_dio;
> +		}
>  
>  		if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
>  			ret = sb_init_dio_done_wq(inode->i_sb);
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index 5693a39d52fb63..fdc7ae388476f5 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -74,6 +74,7 @@ DEFINE_EVENT(iomap_range_class, name,	\
>  DEFINE_RANGE_EVENT(iomap_writepage);
>  DEFINE_RANGE_EVENT(iomap_releasepage);
>  DEFINE_RANGE_EVENT(iomap_invalidatepage);
> +DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
>  
>  #define IOMAP_TYPE_STRINGS \
>  	{ IOMAP_HOLE,		"HOLE" }, \
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 00db81eac80d6c..551cca39fa3ba6 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -553,8 +553,8 @@ xfs_file_dio_aio_write(
>  	xfs_iunlock(ip, iolock);
>  
>  	/*
> -	 * No fallback to buffered IO on errors for XFS, direct IO will either
> -	 * complete fully or fail.
> +	 * No partial fallback to buffered IO on errors for XFS, direct IO will
> +	 * either complete fully or fail.
>  	 */
>  	ASSERT(ret < 0 || ret == count);
>  	return ret;
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 07bc42d62673ce..793850454b752f 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -786,8 +786,11 @@ static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
>  		return -EFBIG;
>  
> -	if (iocb->ki_flags & IOCB_DIRECT)
> -		return zonefs_file_dio_write(iocb, from);
> +	if (iocb->ki_flags & IOCB_DIRECT) {
> +		ret = zonefs_file_dio_write(iocb, from);
> +		if (ret != -EREMCHG)
> +			return ret;
> +	}
>  
>  	return zonefs_file_buffered_write(iocb, from);
>  }
> -- 
> 2.26.2
> 

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures
  2020-07-13  7:46 ` [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures Christoph Hellwig
  2020-07-13 11:55   ` Matthew Wilcox
  2020-07-13 12:20   ` Goldwyn Rodrigues
@ 2020-07-13 15:39   ` Darrick J. Wong
  2020-07-14 11:00     ` Christoph Hellwig
  2020-07-14  1:41   ` Damien Le Moal
  3 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2020-07-13 15:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Goldwyn Rodrigues, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Matthew Wilcox, linux-btrfs, linux-fsdevel,
	cluster-devel, linux-ext4, linux-xfs

On Mon, Jul 13, 2020 at 09:46:33AM +0200, Christoph Hellwig wrote:
> Failing to invalid the page cache means data in incoherent, which is
> a very bad state for the system.  Always fall back to buffered I/O
> through the page cache if we can't invalidate mappings.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/ext4/file.c       |  2 ++
>  fs/gfs2/file.c       |  3 ++-
>  fs/iomap/direct-io.c | 13 ++++++++-----
>  fs/iomap/trace.h     |  1 +
>  fs/xfs/xfs_file.c    |  4 ++--
>  fs/zonefs/super.c    |  7 +++++--
>  6 files changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2a01e31a032c4c..0da6c2a2c32c1e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -544,6 +544,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		iomap_ops = &ext4_iomap_overwrite_ops;
>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>  			   is_sync_kiocb(iocb) || unaligned_io || extend);
> +	if (ret == -EREMCHG)
> +		ret = 0;
>  
>  	if (extend)
>  		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index fe305e4bfd3734..c7907d40c61d17 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -814,7 +814,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  
>  	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
>  			   is_sync_kiocb(iocb));
> -
> +	if (ret == -EREMCHG)
> +		ret = 0;
>  out:
>  	gfs2_glock_dq(&gh);
>  out_uninit:
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 190967e87b69e4..62626235cdbe8d 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -10,6 +10,7 @@
>  #include <linux/backing-dev.h>
>  #include <linux/uio.h>
>  #include <linux/task_io_accounting_ops.h>
> +#include "trace.h"
>  
>  #include "../internal.h"
>  
> @@ -478,13 +479,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (iov_iter_rw(iter) == WRITE) {
>  		/*
>  		 * Try to invalidate cache pages for the range we are writing.
> -		 * If this invalidation fails, tough, the write will still work,
> -		 * but racing two incompatible write paths is a pretty crazy
> -		 * thing to do, so we don't support it 100%.
> +		 * If this invalidation fails, let the caller fall back to
> +		 * buffered I/O.
>  		 */
>  		if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
> -				end >> PAGE_SHIFT))
> -			dio_warn_stale_pagecache(iocb->ki_filp);
> +				end >> PAGE_SHIFT)) {
> +			trace_iomap_dio_invalidate_fail(inode, pos, count);
> +			ret = -EREMCHG;

-ENOTBLK is already being used as a "magic" return code that means
"retry this direct write as a buffered write".  Shouldn't we use that
instead?

-EREMCHG was a private hack we put in XFS for the one case where a
direct write had to be done through the page cache (non block-aligned
COW), but maybe it's time we put that to rest since the rest of the
world apparently thinks the magic fallback code is -ENOTBLK.

--D

> +			goto out_free_dio;
> +		}
>  
>  		if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
>  			ret = sb_init_dio_done_wq(inode->i_sb);
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index 5693a39d52fb63..fdc7ae388476f5 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -74,6 +74,7 @@ DEFINE_EVENT(iomap_range_class, name,	\
>  DEFINE_RANGE_EVENT(iomap_writepage);
>  DEFINE_RANGE_EVENT(iomap_releasepage);
>  DEFINE_RANGE_EVENT(iomap_invalidatepage);
> +DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
>  
>  #define IOMAP_TYPE_STRINGS \
>  	{ IOMAP_HOLE,		"HOLE" }, \
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 00db81eac80d6c..551cca39fa3ba6 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -553,8 +553,8 @@ xfs_file_dio_aio_write(
>  	xfs_iunlock(ip, iolock);
>  
>  	/*
> -	 * No fallback to buffered IO on errors for XFS, direct IO will either
> -	 * complete fully or fail.
> +	 * No partial fallback to buffered IO on errors for XFS, direct IO will
> +	 * either complete fully or fail.
>  	 */
>  	ASSERT(ret < 0 || ret == count);
>  	return ret;
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 07bc42d62673ce..793850454b752f 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -786,8 +786,11 @@ static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
>  		return -EFBIG;
>  
> -	if (iocb->ki_flags & IOCB_DIRECT)
> -		return zonefs_file_dio_write(iocb, from);
> +	if (iocb->ki_flags & IOCB_DIRECT) {
> +		ret = zonefs_file_dio_write(iocb, from);
> +		if (ret != -EREMCHG)
> +			return ret;
> +	}
>  
>  	return zonefs_file_buffered_write(iocb, from);
>  }
> -- 
> 2.26.2
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures
  2020-07-13 12:20   ` Goldwyn Rodrigues
@ 2020-07-13 16:09     ` David Sterba
  0 siblings, 0 replies; 27+ messages in thread
From: David Sterba @ 2020-07-13 16:09 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Christoph Hellwig, Dave Chinner, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Matthew Wilcox, linux-btrfs, linux-fsdevel,
	cluster-devel, linux-ext4, linux-xfs

On Mon, Jul 13, 2020 at 07:20:50AM -0500, Goldwyn Rodrigues wrote:
> On  9:46 13/07, Christoph Hellwig wrote:
> > Failing to invalid the page cache means data in incoherent, which is
> > a very bad state for the system.  Always fall back to buffered I/O
> > through the page cache if we can't invalidate mappings.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Thanks. This will help btrfs. The current next tree contains the iomap
> changes I recomended and would need to be reverted in order to
> incorporate this. Once this is in the next tree I will (re)format the
> btrfs iomap dio patches.

Patches in next don't need to be reverted, if you put together a branch
with any iomap or other prep patches + btrfs iomap-dio I'll replace it
in the for-next snapshot I push.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures
  2020-07-13  7:46 ` [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures Christoph Hellwig
                     ` (2 preceding siblings ...)
  2020-07-13 15:39   ` Darrick J. Wong
@ 2020-07-14  1:41   ` Damien Le Moal
  3 siblings, 0 replies; 27+ messages in thread
From: Damien Le Moal @ 2020-07-14  1:41 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner, Goldwyn Rodrigues
  Cc: Naohiro Aota, Johannes Thumshirn, Matthew Wilcox, linux-btrfs,
	linux-fsdevel, cluster-devel, linux-ext4, linux-xfs

On 2020/07/13 16:51, Christoph Hellwig wrote:
> Failing to invalid the page cache means data in incoherent, which is

s/in incoherent/is incoherent

> a very bad state for the system.  Always fall back to buffered I/O
> through the page cache if we can't invalidate mappings.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/ext4/file.c       |  2 ++
>  fs/gfs2/file.c       |  3 ++-
>  fs/iomap/direct-io.c | 13 ++++++++-----
>  fs/iomap/trace.h     |  1 +
>  fs/xfs/xfs_file.c    |  4 ++--
>  fs/zonefs/super.c    |  7 +++++--
>  6 files changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2a01e31a032c4c..0da6c2a2c32c1e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -544,6 +544,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		iomap_ops = &ext4_iomap_overwrite_ops;
>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>  			   is_sync_kiocb(iocb) || unaligned_io || extend);
> +	if (ret == -EREMCHG)
> +		ret = 0;
>  
>  	if (extend)
>  		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index fe305e4bfd3734..c7907d40c61d17 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -814,7 +814,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  
>  	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
>  			   is_sync_kiocb(iocb));
> -
> +	if (ret == -EREMCHG)
> +		ret = 0;
>  out:
>  	gfs2_glock_dq(&gh);
>  out_uninit:
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 190967e87b69e4..62626235cdbe8d 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -10,6 +10,7 @@
>  #include <linux/backing-dev.h>
>  #include <linux/uio.h>
>  #include <linux/task_io_accounting_ops.h>
> +#include "trace.h"
>  
>  #include "../internal.h"
>  
> @@ -478,13 +479,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (iov_iter_rw(iter) == WRITE) {
>  		/*
>  		 * Try to invalidate cache pages for the range we are writing.
> -		 * If this invalidation fails, tough, the write will still work,
> -		 * but racing two incompatible write paths is a pretty crazy
> -		 * thing to do, so we don't support it 100%.
> +		 * If this invalidation fails, let the caller fall back to
> +		 * buffered I/O.
>  		 */
>  		if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
> -				end >> PAGE_SHIFT))
> -			dio_warn_stale_pagecache(iocb->ki_filp);
> +				end >> PAGE_SHIFT)) {
> +			trace_iomap_dio_invalidate_fail(inode, pos, count);
> +			ret = -EREMCHG;

I am wondering if it is OK to unconditionally always return -EREMCHG here.
Shouldn't this depend on the return code of invalidate_inode_pages2_range() ?
ret may be the value returned by mapping->a_ops->launder_page(page) instead of
-EBUSY that invalidate_inode_pages2_range() otherwise returns for a failed
invalidation. Isn't their any error condition that would be better served by not
forcing the fallback to buffered write ? E.g. -ENOMEM ?



> +			goto out_free_dio;
> +		}
>  
>  		if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
>  			ret = sb_init_dio_done_wq(inode->i_sb);
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index 5693a39d52fb63..fdc7ae388476f5 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -74,6 +74,7 @@ DEFINE_EVENT(iomap_range_class, name,	\
>  DEFINE_RANGE_EVENT(iomap_writepage);
>  DEFINE_RANGE_EVENT(iomap_releasepage);
>  DEFINE_RANGE_EVENT(iomap_invalidatepage);
> +DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
>  
>  #define IOMAP_TYPE_STRINGS \
>  	{ IOMAP_HOLE,		"HOLE" }, \
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 00db81eac80d6c..551cca39fa3ba6 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -553,8 +553,8 @@ xfs_file_dio_aio_write(
>  	xfs_iunlock(ip, iolock);
>  
>  	/*
> -	 * No fallback to buffered IO on errors for XFS, direct IO will either
> -	 * complete fully or fail.
> +	 * No partial fallback to buffered IO on errors for XFS, direct IO will
> +	 * either complete fully or fail.
>  	 */
>  	ASSERT(ret < 0 || ret == count);
>  	return ret;
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 07bc42d62673ce..793850454b752f 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -786,8 +786,11 @@ static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
>  		return -EFBIG;
>  
> -	if (iocb->ki_flags & IOCB_DIRECT)
> -		return zonefs_file_dio_write(iocb, from);
> +	if (iocb->ki_flags & IOCB_DIRECT) {
> +		ret = zonefs_file_dio_write(iocb, from);
> +		if (ret != -EREMCHG)
> +			return ret;
> +	}

This looks fine. This would happen only for conventional zone writes since
sequential zone writes cannot ever issue direct IOs on top of cached data as
that would be a forbidden "overwrite" operation.

>  
>  	return zonefs_file_buffered_write(iocb, from);
>  }
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures
  2020-07-13 11:55   ` Matthew Wilcox
@ 2020-07-14 11:00     ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2020-07-14 11:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Dave Chinner, Goldwyn Rodrigues,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, linux-btrfs,
	linux-fsdevel, cluster-devel, linux-ext4, linux-xfs

On Mon, Jul 13, 2020 at 12:55:09PM +0100, Matthew Wilcox wrote:
> On Mon, Jul 13, 2020 at 09:46:33AM +0200, Christoph Hellwig wrote:
> > Failing to invalid the page cache means data in incoherent, which is
> > a very bad state for the system.  Always fall back to buffered I/O
> > through the page cache if we can't invalidate mappings.
> 
> Is that the right approach though?  I don't have a full picture in my head,
> but wouldn't we be better off marking these pages as !Uptodate and doing
> the direct I/O?

Isn't that a problem if e.g. pages are mapped into userspace and mlocked?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures
  2020-07-13 15:39   ` Darrick J. Wong
@ 2020-07-14 11:00     ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2020-07-14 11:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Dave Chinner, Goldwyn Rodrigues,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, Matthew Wilcox,
	linux-btrfs, linux-fsdevel, cluster-devel, linux-ext4, linux-xfs

On Mon, Jul 13, 2020 at 08:39:20AM -0700, Darrick J. Wong wrote:
> -ENOTBLK is already being used as a "magic" return code that means
> "retry this direct write as a buffered write".  Shouldn't we use that
> instead?
> 
> -EREMCHG was a private hack we put in XFS for the one case where a
> direct write had to be done through the page cache (non block-aligned
> COW), but maybe it's time we put that to rest since the rest of the
> world apparently thinks the magic fallback code is -ENOTBLK.

Sure, I can switch the error code.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-13  7:46 RFC: iomap write invalidation Christoph Hellwig
  2020-07-13  7:46 ` [PATCH 1/2] iomap: Only invalidate page cache pages on direct IO writes Christoph Hellwig
  2020-07-13  7:46 ` [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures Christoph Hellwig
@ 2020-07-15  1:47 ` Dave Chinner
  2020-07-20 21:51 ` Goldwyn Rodrigues
  3 siblings, 0 replies; 27+ messages in thread
From: Dave Chinner @ 2020-07-15  1:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Goldwyn Rodrigues, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Matthew Wilcox, linux-btrfs, linux-fsdevel,
	cluster-devel, linux-ext4, linux-xfs

On Mon, Jul 13, 2020 at 09:46:31AM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series has two parts:  the first one picks up Dave's patch to avoid
> invalidation entierly for reads, picked up deep down from the btrfs iomap
> thread.  The second one falls back to buffered writes if invalidation fails
> instead of leaving a stale cache around.  Let me know what you think about
> this approch.

Either we maintain application level concurrency for direct IOs and
ignore the stale data in the page cache, or we kill application IO
concurrency and keep the page cache coherent.

It's a lose-lose choice and I'm on the fence as to which is the
lesser of two evils.

The main factor is whether the buffered IO fallback can be
diagnosed. There's a new tracepoint for that case, so at least we
will be able to tell if the fallback co-incides with application
performance cratering. Hopefully this will only be a rare event.

So, to hoist myself on my own petard: correctness first, performance
second.

Acked-by: Dave Chinner <dchinner@redhat.com>

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-13  7:46 RFC: iomap write invalidation Christoph Hellwig
                   ` (2 preceding siblings ...)
  2020-07-15  1:47 ` RFC: iomap write invalidation Dave Chinner
@ 2020-07-20 21:51 ` Goldwyn Rodrigues
  2020-07-21 14:53   ` Christoph Hellwig
  3 siblings, 1 reply; 27+ messages in thread
From: Goldwyn Rodrigues @ 2020-07-20 21:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Damien Le Moal, Naohiro Aota, Johannes Thumshirn,
	Matthew Wilcox, linux-btrfs, linux-fsdevel, cluster-devel,
	linux-ext4, linux-xfs

Hi Christoph,

On  9:46 13/07, Christoph Hellwig wrote:
> Hi all,
> 
> this series has two parts:  the first one picks up Dave's patch to avoid
> invalidation entierly for reads, picked up deep down from the btrfs iomap
> thread.  The second one falls back to buffered writes if invalidation fails
> instead of leaving a stale cache around.  Let me know what you think about
> this approch.

Which kernel version are these changes expected?
btrfs dio switch to iomap depends on this.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-20 21:51 ` Goldwyn Rodrigues
@ 2020-07-21 14:53   ` Christoph Hellwig
  2020-07-21 14:59     ` Darrick J. Wong
  2020-07-21 15:04     ` Matthew Wilcox
  0 siblings, 2 replies; 27+ messages in thread
From: Christoph Hellwig @ 2020-07-21 14:53 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Christoph Hellwig, Dave Chinner, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Matthew Wilcox, linux-btrfs, linux-fsdevel,
	cluster-devel, linux-ext4, linux-xfs

On Mon, Jul 20, 2020 at 04:51:25PM -0500, Goldwyn Rodrigues wrote:
> Hi Christoph,
> 
> On  9:46 13/07, Christoph Hellwig wrote:
> > Hi all,
> > 
> > this series has two parts:  the first one picks up Dave's patch to avoid
> > invalidation entierly for reads, picked up deep down from the btrfs iomap
> > thread.  The second one falls back to buffered writes if invalidation fails
> > instead of leaving a stale cache around.  Let me know what you think about
> > this approch.
> 
> Which kernel version are these changes expected?
> btrfs dio switch to iomap depends on this.

No idea.  Darrick, are you ok with picking this up soon so that
Goldwyn can pull it in?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 14:53   ` Christoph Hellwig
@ 2020-07-21 14:59     ` Darrick J. Wong
  2020-07-21 15:04     ` Matthew Wilcox
  1 sibling, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2020-07-21 14:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Goldwyn Rodrigues, Dave Chinner, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, Matthew Wilcox, linux-btrfs, linux-fsdevel,
	cluster-devel, linux-ext4, linux-xfs

On Tue, Jul 21, 2020 at 04:53:13PM +0200, Christoph Hellwig wrote:
> On Mon, Jul 20, 2020 at 04:51:25PM -0500, Goldwyn Rodrigues wrote:
> > Hi Christoph,
> > 
> > On  9:46 13/07, Christoph Hellwig wrote:
> > > Hi all,
> > > 
> > > this series has two parts:  the first one picks up Dave's patch to avoid
> > > invalidation entierly for reads, picked up deep down from the btrfs iomap
> > > thread.  The second one falls back to buffered writes if invalidation fails
> > > instead of leaving a stale cache around.  Let me know what you think about
> > > this approch.
> > 
> > Which kernel version are these changes expected?
> > btrfs dio switch to iomap depends on this.
> 
> No idea.  Darrick, are you ok with picking this up soon so that
> Goldwyn can pull it in?

Yes; I was nearly about to ping you to see if you were going to have
time to push out a corrected patch 2 in time for ... 5.9?  5.10?

--D

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 14:53   ` Christoph Hellwig
  2020-07-21 14:59     ` Darrick J. Wong
@ 2020-07-21 15:04     ` Matthew Wilcox
  2020-07-21 15:06       ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2020-07-21 15:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Goldwyn Rodrigues, Dave Chinner, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, linux-btrfs, linux-fsdevel, cluster-devel,
	linux-ext4, linux-xfs

On Tue, Jul 21, 2020 at 04:53:13PM +0200, Christoph Hellwig wrote:
> On Mon, Jul 20, 2020 at 04:51:25PM -0500, Goldwyn Rodrigues wrote:
> > Hi Christoph,
> > 
> > On  9:46 13/07, Christoph Hellwig wrote:
> > > Hi all,
> > > 
> > > this series has two parts:  the first one picks up Dave's patch to avoid
> > > invalidation entierly for reads, picked up deep down from the btrfs iomap
> > > thread.  The second one falls back to buffered writes if invalidation fails
> > > instead of leaving a stale cache around.  Let me know what you think about
> > > this approch.
> > 
> > Which kernel version are these changes expected?
> > btrfs dio switch to iomap depends on this.
> 
> No idea.  Darrick, are you ok with picking this up soon so that
> Goldwyn can pull it in?

I thought you were going to respin this with EREMCHG changed to ENOTBLK?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 15:04     ` Matthew Wilcox
@ 2020-07-21 15:06       ` Christoph Hellwig
  2020-07-21 15:14         ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2020-07-21 15:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Goldwyn Rodrigues, Dave Chinner,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, linux-btrfs,
	linux-fsdevel, cluster-devel, linux-ext4, linux-xfs

On Tue, Jul 21, 2020 at 04:04:32PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 21, 2020 at 04:53:13PM +0200, Christoph Hellwig wrote:
> > On Mon, Jul 20, 2020 at 04:51:25PM -0500, Goldwyn Rodrigues wrote:
> > > Hi Christoph,
> > > 
> > > On  9:46 13/07, Christoph Hellwig wrote:
> > > > Hi all,
> > > > 
> > > > this series has two parts:  the first one picks up Dave's patch to avoid
> > > > invalidation entierly for reads, picked up deep down from the btrfs iomap
> > > > thread.  The second one falls back to buffered writes if invalidation fails
> > > > instead of leaving a stale cache around.  Let me know what you think about
> > > > this approch.
> > > 
> > > Which kernel version are these changes expected?
> > > btrfs dio switch to iomap depends on this.
> > 
> > No idea.  Darrick, are you ok with picking this up soon so that
> > Goldwyn can pull it in?
> 
> I thought you were going to respin this with EREMCHG changed to ENOTBLK?

Oh, true.  I'll do that ASAP.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 15:06       ` Christoph Hellwig
@ 2020-07-21 15:14         ` Matthew Wilcox
  2020-07-21 15:16           ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2020-07-21 15:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Goldwyn Rodrigues, Dave Chinner, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, linux-btrfs, linux-fsdevel, cluster-devel,
	linux-ext4, linux-xfs, Michael Kerrisk, linux-man

On Tue, Jul 21, 2020 at 05:06:15PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 21, 2020 at 04:04:32PM +0100, Matthew Wilcox wrote:
> > I thought you were going to respin this with EREMCHG changed to ENOTBLK?
> 
> Oh, true.  I'll do that ASAP.

Michael, could we add this to manpages?

--- write.2.old	2020-07-21 11:11:17.740491825 -0400
+++ write.2	2020-07-21 11:13:05.900389249 -0400
@@ -192,6 +192,12 @@
 .IR count ,
 or the file offset is not suitably aligned.
 .TP
+.B ENOTBLK
+The file was opened with the
+.B O_DIRECT
+flag, and the I/O could not be completed due to another process using
+the file.
+.TP
 .B EIO
 A low-level I/O error occurred while modifying the inode.
 This error may relate to the write-back of data written by an earlier


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 15:14         ` Matthew Wilcox
@ 2020-07-21 15:16           ` Christoph Hellwig
  2020-07-21 15:27             ` Darrick J. Wong
  2020-07-21 15:31             ` Matthew Wilcox
  0 siblings, 2 replies; 27+ messages in thread
From: Christoph Hellwig @ 2020-07-21 15:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Goldwyn Rodrigues, Dave Chinner,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, linux-btrfs,
	linux-fsdevel, cluster-devel, linux-ext4, linux-xfs,
	Michael Kerrisk, linux-man

On Tue, Jul 21, 2020 at 04:14:37PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 21, 2020 at 05:06:15PM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 21, 2020 at 04:04:32PM +0100, Matthew Wilcox wrote:
> > > I thought you were going to respin this with EREMCHG changed to ENOTBLK?
> > 
> > Oh, true.  I'll do that ASAP.
> 
> Michael, could we add this to manpages?

Umm, no.  -ENOTBLK is internal - the file systems will retry using
buffered I/O and the error shall never escape to userspace (or even the
VFS for that matter).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 15:16           ` Christoph Hellwig
@ 2020-07-21 15:27             ` Darrick J. Wong
  2020-07-21 15:41               ` Christoph Hellwig
  2020-07-21 15:31             ` Matthew Wilcox
  1 sibling, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2020-07-21 15:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Goldwyn Rodrigues, Dave Chinner, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, linux-btrfs, linux-fsdevel,
	cluster-devel, linux-ext4, linux-xfs, Michael Kerrisk, linux-man

On Tue, Jul 21, 2020 at 05:16:16PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 21, 2020 at 04:14:37PM +0100, Matthew Wilcox wrote:
> > On Tue, Jul 21, 2020 at 05:06:15PM +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 21, 2020 at 04:04:32PM +0100, Matthew Wilcox wrote:
> > > > I thought you were going to respin this with EREMCHG changed to ENOTBLK?
> > > 
> > > Oh, true.  I'll do that ASAP.
> > 
> > Michael, could we add this to manpages?
> 
> Umm, no.  -ENOTBLK is internal - the file systems will retry using
> buffered I/O and the error shall never escape to userspace (or even the
> VFS for that matter).

It's worth dropping a comment somewhere that ENOTBLK is the desired
"fall back to buffered" errcode, seeing as Dave and I missed that in
XFS...

--D

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 15:16           ` Christoph Hellwig
  2020-07-21 15:27             ` Darrick J. Wong
@ 2020-07-21 15:31             ` Matthew Wilcox
  2020-07-21 15:42               ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2020-07-21 15:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Goldwyn Rodrigues, Dave Chinner, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, linux-btrfs, linux-fsdevel, cluster-devel,
	linux-ext4, linux-xfs, Michael Kerrisk, linux-man

On Tue, Jul 21, 2020 at 05:16:16PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 21, 2020 at 04:14:37PM +0100, Matthew Wilcox wrote:
> > On Tue, Jul 21, 2020 at 05:06:15PM +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 21, 2020 at 04:04:32PM +0100, Matthew Wilcox wrote:
> > > > I thought you were going to respin this with EREMCHG changed to ENOTBLK?
> > > 
> > > Oh, true.  I'll do that ASAP.
> > 
> > Michael, could we add this to manpages?
> 
> Umm, no.  -ENOTBLK is internal - the file systems will retry using
> buffered I/O and the error shall never escape to userspace (or even the
> VFS for that matter).

Ah, I made the mistake of believing the comments that I could see in
your patch instead of reading the code.

Can I suggest deleting this comment:

        /*
         * No fallback to buffered IO on errors for XFS, direct IO will either
         * complete fully or fail.
         */

and rewording this one:

                /*
                 * Allow a directio write to fall back to a buffered
                 * write *only* in the case that we're doing a reflink
                 * CoW.  In all other directio scenarios we do not
                 * allow an operation to fall back to buffered mode.
                 */

as part of your revised patchset?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 15:27             ` Darrick J. Wong
@ 2020-07-21 15:41               ` Christoph Hellwig
  2020-07-21 15:59                 ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2020-07-21 15:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Matthew Wilcox, Goldwyn Rodrigues,
	Dave Chinner, Damien Le Moal, Naohiro Aota, Johannes Thumshirn,
	linux-btrfs, linux-fsdevel, cluster-devel, linux-ext4, linux-xfs,
	Michael Kerrisk, linux-man

On Tue, Jul 21, 2020 at 08:27:54AM -0700, Darrick J. Wong wrote:
> On Tue, Jul 21, 2020 at 05:16:16PM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 21, 2020 at 04:14:37PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jul 21, 2020 at 05:06:15PM +0200, Christoph Hellwig wrote:
> > > > On Tue, Jul 21, 2020 at 04:04:32PM +0100, Matthew Wilcox wrote:
> > > > > I thought you were going to respin this with EREMCHG changed to ENOTBLK?
> > > > 
> > > > Oh, true.  I'll do that ASAP.
> > > 
> > > Michael, could we add this to manpages?
> > 
> > Umm, no.  -ENOTBLK is internal - the file systems will retry using
> > buffered I/O and the error shall never escape to userspace (or even the
> > VFS for that matter).
> 
> It's worth dropping a comment somewhere that ENOTBLK is the desired
> "fall back to buffered" errcode, seeing as Dave and I missed that in
> XFS...

Sounds like a good idea, but what would a good place be?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 15:31             ` Matthew Wilcox
@ 2020-07-21 15:42               ` Christoph Hellwig
  2020-07-21 15:52                 ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2020-07-21 15:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Goldwyn Rodrigues, Dave Chinner,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, linux-btrfs,
	linux-fsdevel, cluster-devel, linux-ext4, linux-xfs,
	Michael Kerrisk, linux-man

On Tue, Jul 21, 2020 at 04:31:36PM +0100, Matthew Wilcox wrote:
> > Umm, no.  -ENOTBLK is internal - the file systems will retry using
> > buffered I/O and the error shall never escape to userspace (or even the
> > VFS for that matter).
> 
> Ah, I made the mistake of believing the comments that I could see in
> your patch instead of reading the code.
> 
> Can I suggest deleting this comment:
> 
>         /*
>          * No fallback to buffered IO on errors for XFS, direct IO will either
>          * complete fully or fail.
>          */
> 
> and rewording this one:
> 
>                 /*
>                  * Allow a directio write to fall back to a buffered
>                  * write *only* in the case that we're doing a reflink
>                  * CoW.  In all other directio scenarios we do not
>                  * allow an operation to fall back to buffered mode.
>                  */
> 
> as part of your revised patchset?

That isn't actually true.  In current mainline we only fallback on
reflink RMW cases, but with this series we also fall back for
invalidation failures.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 15:42               ` Christoph Hellwig
@ 2020-07-21 15:52                 ` Matthew Wilcox
  2020-07-21 16:03                   ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2020-07-21 15:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Goldwyn Rodrigues, Dave Chinner, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, linux-btrfs, linux-fsdevel, cluster-devel,
	linux-ext4, linux-xfs, Michael Kerrisk, linux-man

On Tue, Jul 21, 2020 at 05:42:40PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 21, 2020 at 04:31:36PM +0100, Matthew Wilcox wrote:
> > > Umm, no.  -ENOTBLK is internal - the file systems will retry using
> > > buffered I/O and the error shall never escape to userspace (or even the
> > > VFS for that matter).
> > 
> > Ah, I made the mistake of believing the comments that I could see in
> > your patch instead of reading the code.
> > 
> > Can I suggest deleting this comment:
> > 
> >         /*
> >          * No fallback to buffered IO on errors for XFS, direct IO will either
> >          * complete fully or fail.
> >          */
> > 
> > and rewording this one:
> > 
> >                 /*
> >                  * Allow a directio write to fall back to a buffered
> >                  * write *only* in the case that we're doing a reflink
> >                  * CoW.  In all other directio scenarios we do not
> >                  * allow an operation to fall back to buffered mode.
> >                  */
> > 
> > as part of your revised patchset?
> 
> That isn't actually true.  In current mainline we only fallback on
> reflink RMW cases, but with this series we also fall back for
> invalidation failures.

... that's why I'm suggesting that you delete the first one and rewrite
the second one.  Because they aren't true.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 15:41               ` Christoph Hellwig
@ 2020-07-21 15:59                 ` Darrick J. Wong
  2020-07-21 16:01                   ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2020-07-21 15:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Goldwyn Rodrigues, Dave Chinner, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, linux-btrfs, linux-fsdevel,
	cluster-devel, linux-ext4, linux-xfs, Michael Kerrisk, linux-man

On Tue, Jul 21, 2020 at 05:41:32PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 21, 2020 at 08:27:54AM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 21, 2020 at 05:16:16PM +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 21, 2020 at 04:14:37PM +0100, Matthew Wilcox wrote:
> > > > On Tue, Jul 21, 2020 at 05:06:15PM +0200, Christoph Hellwig wrote:
> > > > > On Tue, Jul 21, 2020 at 04:04:32PM +0100, Matthew Wilcox wrote:
> > > > > > I thought you were going to respin this with EREMCHG changed to ENOTBLK?
> > > > > 
> > > > > Oh, true.  I'll do that ASAP.
> > > > 
> > > > Michael, could we add this to manpages?
> > > 
> > > Umm, no.  -ENOTBLK is internal - the file systems will retry using
> > > buffered I/O and the error shall never escape to userspace (or even the
> > > VFS for that matter).
> > 
> > It's worth dropping a comment somewhere that ENOTBLK is the desired
> > "fall back to buffered" errcode, seeing as Dave and I missed that in
> > XFS...
> 
> Sounds like a good idea, but what would a good place be?

In the comment that precedes iomap_dio_rw() for the iomap version,
and...

...ye $deity, the old direct-io.c file is a mess of wrappers.  Uh...  a
new comment preceding __blockdev_direct_IO?  Or blockdev_direct_IO?  Or
both?

Or I guess the direct_IO documentation in vfs.rst...?

``direct_IO``
	called by the generic read/write routines to perform direct_IO -
	that is IO requests which bypass the page cache and transfer
	data directly between the storage and the application's address
	space.  This function can return -ENOTBLK to signal that it is
	necessary to fallback to buffered IO.  Note that
	blockdev_direct_IO and variants can also return -ENOTBLK.

--D

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 15:59                 ` Darrick J. Wong
@ 2020-07-21 16:01                   ` Christoph Hellwig
  2020-07-21 16:05                     ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2020-07-21 16:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Matthew Wilcox, Goldwyn Rodrigues,
	Dave Chinner, Damien Le Moal, Naohiro Aota, Johannes Thumshirn,
	linux-btrfs, linux-fsdevel, cluster-devel, linux-ext4, linux-xfs,
	Michael Kerrisk, linux-man

On Tue, Jul 21, 2020 at 08:59:25AM -0700, Darrick J. Wong wrote:
> In the comment that precedes iomap_dio_rw() for the iomap version,

maybe let's just do that..

> ``direct_IO``
> 	called by the generic read/write routines to perform direct_IO -
> 	that is IO requests which bypass the page cache and transfer
> 	data directly between the storage and the application's address
> 	space.  This function can return -ENOTBLK to signal that it is
> 	necessary to fallback to buffered IO.  Note that
> 	blockdev_direct_IO and variants can also return -ENOTBLK.

->direct_IO is not used for iomap and various other implementations.
In fact it is a horrible hack that I've been trying to get rid of
for a while.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 15:52                 ` Matthew Wilcox
@ 2020-07-21 16:03                   ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2020-07-21 16:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Goldwyn Rodrigues, Dave Chinner,
	Damien Le Moal, Naohiro Aota, Johannes Thumshirn, linux-btrfs,
	linux-fsdevel, cluster-devel, linux-ext4, linux-xfs,
	Michael Kerrisk, linux-man

On Tue, Jul 21, 2020 at 04:52:01PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 21, 2020 at 05:42:40PM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 21, 2020 at 04:31:36PM +0100, Matthew Wilcox wrote:
> > > > Umm, no.  -ENOTBLK is internal - the file systems will retry using
> > > > buffered I/O and the error shall never escape to userspace (or even the
> > > > VFS for that matter).
> > > 
> > > Ah, I made the mistake of believing the comments that I could see in
> > > your patch instead of reading the code.
> > > 
> > > Can I suggest deleting this comment:
> > > 
> > >         /*
> > >          * No fallback to buffered IO on errors for XFS, direct IO will either
> > >          * complete fully or fail.
> > >          */
> > > 
> > > and rewording this one:
> > > 
> > >                 /*
> > >                  * Allow a directio write to fall back to a buffered
> > >                  * write *only* in the case that we're doing a reflink
> > >                  * CoW.  In all other directio scenarios we do not
> > >                  * allow an operation to fall back to buffered mode.
> > >                  */
> > > 
> > > as part of your revised patchset?
> > 
> > That isn't actually true.  In current mainline we only fallback on
> > reflink RMW cases, but with this series we also fall back for
> > invalidation failures.
> 
> ... that's why I'm suggesting that you delete the first one and rewrite
> the second one.  Because they aren't true.

/*
 * We allow only three outcomes of a directio: (1) it succeeds
 * completely; (2) it fails with a negative error code; or (3) it
 * returns -ENOTBLK to signal that we should fall back to buffered IO.
 *
 * The third scenario should only happen if iomap refuses to perform the
 * direct IO, or the direct write request requires CoW but is not aligned
 * to the filesystem block size.
 */

?

--D

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: RFC: iomap write invalidation
  2020-07-21 16:01                   ` Christoph Hellwig
@ 2020-07-21 16:05                     ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2020-07-21 16:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Goldwyn Rodrigues, Dave Chinner, Damien Le Moal,
	Naohiro Aota, Johannes Thumshirn, linux-btrfs, linux-fsdevel,
	cluster-devel, linux-ext4, linux-xfs, Michael Kerrisk, linux-man

On Tue, Jul 21, 2020 at 06:01:43PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 21, 2020 at 08:59:25AM -0700, Darrick J. Wong wrote:
> > In the comment that precedes iomap_dio_rw() for the iomap version,
> 
> maybe let's just do that..
> 
> > ``direct_IO``
> > 	called by the generic read/write routines to perform direct_IO -
> > 	that is IO requests which bypass the page cache and transfer
> > 	data directly between the storage and the application's address
> > 	space.  This function can return -ENOTBLK to signal that it is
> > 	necessary to fallback to buffered IO.  Note that
> > 	blockdev_direct_IO and variants can also return -ENOTBLK.
> 
> ->direct_IO is not used for iomap and various other implementations.
> In fact it is a horrible hack that I've been trying to get rid of
> for a while.

Agreed, but for now there are still a number of fses who are still on
the old directio code; let's try to keep the drainbamage/confusion
potential to a minimum so it doesn't spread to our iomap shinyness. :)

--D

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, back to index

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-13  7:46 RFC: iomap write invalidation Christoph Hellwig
2020-07-13  7:46 ` [PATCH 1/2] iomap: Only invalidate page cache pages on direct IO writes Christoph Hellwig
2020-07-13  7:46 ` [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures Christoph Hellwig
2020-07-13 11:55   ` Matthew Wilcox
2020-07-14 11:00     ` Christoph Hellwig
2020-07-13 12:20   ` Goldwyn Rodrigues
2020-07-13 16:09     ` David Sterba
2020-07-13 15:39   ` Darrick J. Wong
2020-07-14 11:00     ` Christoph Hellwig
2020-07-14  1:41   ` Damien Le Moal
2020-07-15  1:47 ` RFC: iomap write invalidation Dave Chinner
2020-07-20 21:51 ` Goldwyn Rodrigues
2020-07-21 14:53   ` Christoph Hellwig
2020-07-21 14:59     ` Darrick J. Wong
2020-07-21 15:04     ` Matthew Wilcox
2020-07-21 15:06       ` Christoph Hellwig
2020-07-21 15:14         ` Matthew Wilcox
2020-07-21 15:16           ` Christoph Hellwig
2020-07-21 15:27             ` Darrick J. Wong
2020-07-21 15:41               ` Christoph Hellwig
2020-07-21 15:59                 ` Darrick J. Wong
2020-07-21 16:01                   ` Christoph Hellwig
2020-07-21 16:05                     ` Darrick J. Wong
2020-07-21 15:31             ` Matthew Wilcox
2020-07-21 15:42               ` Christoph Hellwig
2020-07-21 15:52                 ` Matthew Wilcox
2020-07-21 16:03                   ` Darrick J. Wong

Linux-ext4 Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-ext4/0 linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ https://lore.kernel.org/linux-ext4 \
		linux-ext4@vger.kernel.org
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-ext4


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git