All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] iomap: data corruption fixes and more
@ 2018-11-19 21:17 Dave Chinner
  2018-11-19 21:17 ` [PATCH 1/5] iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents Dave Chinner
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Dave Chinner @ 2018-11-19 21:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Hi folks,

Since clone/dedupe/copy_file_range support was added to fsx, I've
found several data corruptions and issues with these functions. The
patchset addresses some of the issues I've found, including a
problem with the FUA optimisation that I found by inspection when
looking at this code.

Several of the problems stem from do_splice_direct() and how it
behaves on files opened O_DIRECT. I'm pretty sure this has never
been tested on XFS because it just doesn't work and has stale data
exposure problems. And it's so slow it's not funny because it can
only shuffle 16 pages at a time through the splice pipe, which means
synchronous 64k IO for large data movement. do_splice_direct() is a
horrible data copying primitive.

The first 3 patches are data integrity/corruption fixes for the
iomap code, patch 4 increases the do_splice_direct() pipe size to
the maximum allowed without privilege to improve performance, and
the last patch fixes an error reporting inconsistency between
dedeupe and the documented errors in the man page.

Cheers,

Dave.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/5] iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents
  2018-11-19 21:17 [PATCH 0/5] iomap: data corruption fixes and more Dave Chinner
@ 2018-11-19 21:17 ` Dave Chinner
  2018-11-20  7:48   ` Christoph Hellwig
  2018-11-20 23:00   ` Darrick J. Wong
  2018-11-19 21:17 ` [PATCH 2/5] iomap: sub-block dio needs to zeroout beyond EOF Dave Chinner
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2018-11-19 21:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When we write into an unwritten extent via direct IO, we dirty
metadata on IO completion to convert the unwritten extent to
written. However, when we do the FUA optimisation checks, the inode
may be clean and so we issue a FUA write into the unwritten extent.
This means we then bypass the generic_write_sync() call after
unwritten extent conversion has ben done and we don't force the
modified metadata to stable storage.

This violates O_DSYNC semantics. The window of exposure is a single
IO, as the next DIO write will see the inode has dirty metadata and
hence will not use the FUA optimisation. Calling
generic_write_sync() after completion of the second IO will also
sync the first write and it's metadata.

Fix this by avoiding the FUA optimisation when writing to unwritten
extents.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 64ce240217a1..72f3864a2e6b 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1596,12 +1596,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 	if (iomap->flags & IOMAP_F_NEW) {
 		need_zeroout = true;
-	} else {
+	} else if (iomap->type == IOMAP_MAPPED) {
 		/*
-		 * Use a FUA write if we need datasync semantics, this
-		 * is a pure data IO that doesn't require any metadata
-		 * updates and the underlying device supports FUA. This
-		 * allows us to avoid cache flushes on IO completion.
+		 * Use a FUA write if we need datasync semantics, this is a pure
+		 * data IO that doesn't require any metadata updates (including
+		 * after IO completion such as unwritten extent conversion) and
+		 * the underlying device supports FUA. This allows us to avoid
+		 * cache flushes on IO completion.
 		 */
 		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
 		    (dio->flags & IOMAP_DIO_WRITE_FUA) &&
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/5] iomap: sub-block dio needs to zeroout beyond EOF
  2018-11-19 21:17 [PATCH 0/5] iomap: data corruption fixes and more Dave Chinner
  2018-11-19 21:17 ` [PATCH 1/5] iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents Dave Chinner
@ 2018-11-19 21:17 ` Dave Chinner
  2018-11-20 23:01   ` Darrick J. Wong
  2018-11-19 21:17 ` [PATCH 3/5] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2018-11-19 21:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

If we are soing sub-block dio that extends EOF, we need to zero
the unused tail of the block to initialise the data in it it. If we
do not zero the tail of the block, then an immediate mmap read of
the EOF block will expose stale data beyond EOF to userspace. Found
with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.

Fix this by detecting if the end of the DIO write is beyond EOF
and zeroing the tail if necessary.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 72f3864a2e6b..77c214194edf 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1677,7 +1677,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		dio->submit.cookie = submit_bio(bio);
 	} while (nr_pages);
 
-	if (need_zeroout) {
+	/*
+	 * We need to zeroout the tail of a sub-block write if the extent type
+	 * requires zeroing or the write extends beyond EOF. If we don't zero
+	 * the block tail in the latter case, we can expose stale data via mmap
+	 * reads of the EOF block.
+	 */
+	if (need_zeroout ||
+	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
 		/* zero out from the end of the write to the end of the block */
 		pad = pos & (fs_block_size - 1);
 		if (pad)
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/5] iomap: dio data corruption and spurious errors when pipes fill
  2018-11-19 21:17 [PATCH 0/5] iomap: data corruption fixes and more Dave Chinner
  2018-11-19 21:17 ` [PATCH 1/5] iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents Dave Chinner
  2018-11-19 21:17 ` [PATCH 2/5] iomap: sub-block dio needs to zeroout beyond EOF Dave Chinner
@ 2018-11-19 21:17 ` Dave Chinner
  2018-11-19 21:17 ` [PATCH 4/5] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2018-11-19 21:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When doing direct IO to a pipe for do_splice_direct(), then pipe is
trivial to fill up and overflow as it can only hold 16 pages. At
this point bio_iov_iter_get_pages() then returns -EFAULT, and we
abort the IO submission process. Unfortunately, iomap_dio_rw()
propagates the error back up the stack.

The error is converted from the EFAULT to EAGAIN in
generic_file_splice_read() to tell the splice layers that the pipe
is full. do_splice_direct() completely fails to handle EAGAIN errors
(it aborts on error) and returns EAGAIN to the caller.

copy_file_write() then completely fails to handle EAGAIN as well,
and so returns EAGAIN to userspace, having failed to copy the data
it was asked to.

Avoid this whole steaming pile of fail by having iomap_dio_rw()
silently swallow EFAULT errors and so do short reads.

To make matters worse, iomap_dio_actor() has a stale data exposure
bug bio_iov_iter_get_pages() fails - it does not zero the tail block
that it may have been left uncovered by partial IO. Fix the error
handling case to drop to the sub-block zeroing rather than
immmediately returning the -EFAULT error.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 77c214194edf..d51e7a2ae641 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1580,7 +1580,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct bio *bio;
 	bool need_zeroout = false;
 	bool use_fua = false;
-	int nr_pages, ret;
+	int nr_pages, ret = 0;
 	size_t copied = 0;
 
 	if ((pos | length | align) & ((1 << blkbits) - 1))
@@ -1645,8 +1645,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		ret = bio_iov_iter_get_pages(bio, &iter);
 		if (unlikely(ret)) {
+			/*
+			 * We have to stop part way through an IO. We must fall
+			 * through to the sub-block tail zeroing here, otherwise
+			 * this short IO may expose stale data in the tail of
+			 * the block we haven't written data to.
+			 */
 			bio_put(bio);
-			return copied ? copied : ret;
+			goto zero_tail;
 		}
 
 		n = bio->bi_iter.bi_size;
@@ -1683,6 +1689,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	 * the block tail in the latter case, we can expose stale data via mmap
 	 * reads of the EOF block.
 	 */
+zero_tail:
 	if (need_zeroout ||
 	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
 		/* zero out from the end of the write to the end of the block */
@@ -1690,7 +1697,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		if (pad)
 			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
 	}
-	return copied;
+	return copied ? copied : ret;
 }
 
 static loff_t
@@ -1865,6 +1872,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 				dio->wait_for_completion = true;
 				ret = 0;
 			}
+
+			/*
+			 * Splicing to pipes can fail on a full pipe. We have to
+			 * swallow this to make it look like a short IO
+			 * otherwise the higher splice layers will completely
+			 * mishandle the error and stop moving data.
+			 */
+			if (ret == -EFAULT)
+				ret = 0;
 			break;
 		}
 		pos += ret;
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/5] splice: increase pipe size in splice_direct_to_actor()
  2018-11-19 21:17 [PATCH 0/5] iomap: data corruption fixes and more Dave Chinner
                   ` (2 preceding siblings ...)
  2018-11-19 21:17 ` [PATCH 3/5] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
@ 2018-11-19 21:17 ` Dave Chinner
  2018-11-20  7:49   ` Christoph Hellwig
  2018-11-20 23:02   ` Darrick J. Wong
  2018-11-19 21:17 ` [PATCH 5/5] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
  2018-11-21  6:50 ` [PATCH 6/5] iomap: readpages doesn't zero page tail beyond EOF properly Dave Chinner
  5 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2018-11-19 21:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When copy_file_range() is called on files that have been opened with
O_DIRECT, do_splice_direct() does a manual copy of the range one
pipe buffer at a time. The default is 16 pages, which means on
x86_64 it is limited to 64kB IO. This is extremely slow - 64k
synchrnous read/write will run at maybe 5-10MB/s on a spinning disk
and be seek bound. It will be faster on SSDs, but still very
inefficient.

Increase the pipe size to the maximum allowed user size so that we
can get decent throughput for this highly sub-optimal copy loop.  We
also don't care if changing the pipe size fails - that will just
result in a slower copy.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/pipe.c                 | 2 +-
 fs/splice.c               | 8 ++++++++
 include/linux/pipe_fs_i.h | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c0977d..5885d29cdaa6 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1025,7 +1025,7 @@ unsigned int round_pipe_size(unsigned long size)
  * Allocate a new array of pipe buffers and copy the info over. Returns the
  * pipe size if successful, or return -ERROR on error.
  */
-static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
+long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 {
 	struct pipe_buffer *bufs;
 	unsigned int size, nr_pages;
diff --git a/fs/splice.c b/fs/splice.c
index 3553f1956508..64dcc622db8d 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -34,6 +34,7 @@
 #include <linux/socket.h>
 #include <linux/compat.h>
 #include <linux/sched/signal.h>
+#include <linux/pipe_fs_i.h>
 
 #include "internal.h"
 
@@ -931,6 +932,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		current->splice_pipe = pipe;
 	}
 
+	/*
+	 * Try to increase the data holding capacity of the pipe so we can do
+	 * larger IOs. This may not increase the size at all because maximum
+	 * user pipe size is administrator controlled, but we still should try.
+	 */
+	pipe_set_size(pipe, pipe_max_size);
+
 	/*
 	 * Do the splice.
 	 */
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 5a3bb3b7c9ad..e7d728e5b5f8 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -191,5 +191,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file);
 
 int create_pipe_files(struct file **, int);
 unsigned int round_pipe_size(unsigned long size);
+long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg);
 
 #endif
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/5] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
  2018-11-19 21:17 [PATCH 0/5] iomap: data corruption fixes and more Dave Chinner
                   ` (3 preceding siblings ...)
  2018-11-19 21:17 ` [PATCH 4/5] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
@ 2018-11-19 21:17 ` Dave Chinner
  2018-11-20  7:49   ` Christoph Hellwig
  2018-11-21  6:50 ` [PATCH 6/5] iomap: readpages doesn't zero page tail beyond EOF properly Dave Chinner
  5 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2018-11-19 21:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

It returns EINVAL when the operation is not supported by the
filesystem. Fix it to return EOPNOTSUPP to be consistent with
the man page and clone_file_range().

Clean up the inconsistent error return handling while I'm there.
(I know, lipstick on a pig, but every little bit helps...)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index bfcb4ced5664..4dae0399c75a 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2094,17 +2094,18 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 	off = same->src_offset;
 	len = same->src_length;
 
-	ret = -EISDIR;
 	if (S_ISDIR(src->i_mode))
-		goto out;
+		return -EISDIR;
 
-	ret = -EINVAL;
 	if (!S_ISREG(src->i_mode))
-		goto out;
+		return -EINVAL;
+
+	if (!file->f_op->remap_file_range)
+		return -EOPNOTSUPP;
 
 	ret = remap_verify_area(file, off, len, false);
 	if (ret < 0)
-		goto out;
+		return ret;
 	ret = 0;
 
 	if (off + len > i_size_read(src))
@@ -2147,10 +2148,8 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 		fdput(dst_fd);
 next_loop:
 		if (fatal_signal_pending(current))
-			goto out;
+			break;
 	}
-
-out:
 	return ret;
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range);
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents
  2018-11-19 21:17 ` [PATCH 1/5] iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents Dave Chinner
@ 2018-11-20  7:48   ` Christoph Hellwig
  2018-11-20 23:00   ` Darrick J. Wong
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2018-11-20  7:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Tue, Nov 20, 2018 at 08:17:38AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we write into an unwritten extent via direct IO, we dirty
> metadata on IO completion to convert the unwritten extent to
> written. However, when we do the FUA optimisation checks, the inode
> may be clean and so we issue a FUA write into the unwritten extent.
> This means we then bypass the generic_write_sync() call after
> unwritten extent conversion has ben done and we don't force the
> modified metadata to stable storage.
> 
> This violates O_DSYNC semantics. The window of exposure is a single
> IO, as the next DIO write will see the inode has dirty metadata and
> hence will not use the FUA optimisation. Calling
> generic_write_sync() after completion of the second IO will also
> sync the first write and it's metadata.
> 
> Fix this by avoiding the FUA optimisation when writing to unwritten
> extents.

Ouch, yes.  We can't skip the log force when converting unwritten
extent.  If we really cared we could try to use FUA and only do
the log force vs needing a full device flush, but that would require
a fair amount of of work.

So this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] splice: increase pipe size in splice_direct_to_actor()
  2018-11-19 21:17 ` [PATCH 4/5] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
@ 2018-11-20  7:49   ` Christoph Hellwig
  2018-11-20 23:02   ` Darrick J. Wong
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2018-11-20  7:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, viro

[adding Al for pipe bits]

On Tue, Nov 20, 2018 at 08:17:41AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When copy_file_range() is called on files that have been opened with
> O_DIRECT, do_splice_direct() does a manual copy of the range one
> pipe buffer at a time. The default is 16 pages, which means on
> x86_64 it is limited to 64kB IO. This is extremely slow - 64k
> synchrnous read/write will run at maybe 5-10MB/s on a spinning disk
> and be seek bound. It will be faster on SSDs, but still very
> inefficient.
> 
> Increase the pipe size to the maximum allowed user size so that we
> can get decent throughput for this highly sub-optimal copy loop.  We
> also don't care if changing the pipe size fails - that will just
> result in a slower copy.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/pipe.c                 | 2 +-
>  fs/splice.c               | 8 ++++++++
>  include/linux/pipe_fs_i.h | 1 +
>  3 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index bdc5d3c0977d..5885d29cdaa6 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1025,7 +1025,7 @@ unsigned int round_pipe_size(unsigned long size)
>   * Allocate a new array of pipe buffers and copy the info over. Returns the
>   * pipe size if successful, or return -ERROR on error.
>   */
> -static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
> +long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
>  {
>  	struct pipe_buffer *bufs;
>  	unsigned int size, nr_pages;
> diff --git a/fs/splice.c b/fs/splice.c
> index 3553f1956508..64dcc622db8d 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -34,6 +34,7 @@
>  #include <linux/socket.h>
>  #include <linux/compat.h>
>  #include <linux/sched/signal.h>
> +#include <linux/pipe_fs_i.h>
>  
>  #include "internal.h"
>  
> @@ -931,6 +932,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
>  		current->splice_pipe = pipe;
>  	}
>  
> +	/*
> +	 * Try to increase the data holding capacity of the pipe so we can do
> +	 * larger IOs. This may not increase the size at all because maximum
> +	 * user pipe size is administrator controlled, but we still should try.
> +	 */
> +	pipe_set_size(pipe, pipe_max_size);
> +
>  	/*
>  	 * Do the splice.
>  	 */
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index 5a3bb3b7c9ad..e7d728e5b5f8 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -191,5 +191,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file);
>  
>  int create_pipe_files(struct file **, int);
>  unsigned int round_pipe_size(unsigned long size);
> +long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg);
>  
>  #endif
> -- 
> 2.19.1
> 
---end quoted text---

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/5] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
  2018-11-19 21:17 ` [PATCH 5/5] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
@ 2018-11-20  7:49   ` Christoph Hellwig
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2018-11-20  7:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Tue, Nov 20, 2018 at 08:17:42AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It returns EINVAL when the operation is not supported by the
> filesystem. Fix it to return EOPNOTSUPP to be consistent with
> the man page and clone_file_range().
> 
> Clean up the inconsistent error return handling while I'm there.
> (I know, lipstick on a pig, but every little bit helps...)
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks good.  And I really like the removal of the pointless goto
out uses..

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents
  2018-11-19 21:17 ` [PATCH 1/5] iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents Dave Chinner
  2018-11-20  7:48   ` Christoph Hellwig
@ 2018-11-20 23:00   ` Darrick J. Wong
  1 sibling, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2018-11-20 23:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Tue, Nov 20, 2018 at 08:17:38AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we write into an unwritten extent via direct IO, we dirty
> metadata on IO completion to convert the unwritten extent to
> written. However, when we do the FUA optimisation checks, the inode
> may be clean and so we issue a FUA write into the unwritten extent.
> This means we then bypass the generic_write_sync() call after
> unwritten extent conversion has ben done and we don't force the
> modified metadata to stable storage.
> 
> This violates O_DSYNC semantics. The window of exposure is a single
> IO, as the next DIO write will see the inode has dirty metadata and
> hence will not use the FUA optimisation. Calling
> generic_write_sync() after completion of the second IO will also
> sync the first write and it's metadata.
> 
> Fix this by avoiding the FUA optimisation when writing to unwritten
> extents.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 64ce240217a1..72f3864a2e6b 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1596,12 +1596,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  	if (iomap->flags & IOMAP_F_NEW) {
>  		need_zeroout = true;
> -	} else {
> +	} else if (iomap->type == IOMAP_MAPPED) {
>  		/*
> -		 * Use a FUA write if we need datasync semantics, this
> -		 * is a pure data IO that doesn't require any metadata
> -		 * updates and the underlying device supports FUA. This
> -		 * allows us to avoid cache flushes on IO completion.
> +		 * Use a FUA write if we need datasync semantics, this is a pure
> +		 * data IO that doesn't require any metadata updates (including
> +		 * after IO completion such as unwritten extent conversion) and
> +		 * the underlying device supports FUA. This allows us to avoid
> +		 * cache flushes on IO completion.
>  		 */
>  		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
>  		    (dio->flags & IOMAP_DIO_WRITE_FUA) &&
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] iomap: sub-block dio needs to zeroout beyond EOF
  2018-11-19 21:17 ` [PATCH 2/5] iomap: sub-block dio needs to zeroout beyond EOF Dave Chinner
@ 2018-11-20 23:01   ` Darrick J. Wong
  0 siblings, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2018-11-20 23:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Tue, Nov 20, 2018 at 08:17:39AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If we are soing sub-block dio that extends EOF, we need to zero
> the unused tail of the block to initialise the data in it it. If we
> do not zero the tail of the block, then an immediate mmap read of
> the EOF block will expose stale data beyond EOF to userspace. Found
> with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
> 
> Fix this by detecting if the end of the DIO write is beyond EOF
> and zeroing the tail if necessary.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Looks ok, will s/soing/doing/ on the changelog before committing.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 72f3864a2e6b..77c214194edf 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1677,7 +1677,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		dio->submit.cookie = submit_bio(bio);
>  	} while (nr_pages);
>  
> -	if (need_zeroout) {
> +	/*
> +	 * We need to zeroout the tail of a sub-block write if the extent type
> +	 * requires zeroing or the write extends beyond EOF. If we don't zero
> +	 * the block tail in the latter case, we can expose stale data via mmap
> +	 * reads of the EOF block.
> +	 */
> +	if (need_zeroout ||
> +	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>  		/* zero out from the end of the write to the end of the block */
>  		pad = pos & (fs_block_size - 1);
>  		if (pad)
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] splice: increase pipe size in splice_direct_to_actor()
  2018-11-19 21:17 ` [PATCH 4/5] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
  2018-11-20  7:49   ` Christoph Hellwig
@ 2018-11-20 23:02   ` Darrick J. Wong
  1 sibling, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2018-11-20 23:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Tue, Nov 20, 2018 at 08:17:41AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When copy_file_range() is called on files that have been opened with
> O_DIRECT, do_splice_direct() does a manual copy of the range one
> pipe buffer at a time. The default is 16 pages, which means on
> x86_64 it is limited to 64kB IO. This is extremely slow - 64k
> synchrnous read/write will run at maybe 5-10MB/s on a spinning disk
> and be seek bound. It will be faster on SSDs, but still very
> inefficient.
> 
> Increase the pipe size to the maximum allowed user size so that we
> can get decent throughput for this highly sub-optimal copy loop.  We
> also don't care if changing the pipe size fails - that will just
> result in a slower copy.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks fine to me but what do I know about pipe administration policy? :)

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/pipe.c                 | 2 +-
>  fs/splice.c               | 8 ++++++++
>  include/linux/pipe_fs_i.h | 1 +
>  3 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index bdc5d3c0977d..5885d29cdaa6 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1025,7 +1025,7 @@ unsigned int round_pipe_size(unsigned long size)
>   * Allocate a new array of pipe buffers and copy the info over. Returns the
>   * pipe size if successful, or return -ERROR on error.
>   */
> -static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
> +long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
>  {
>  	struct pipe_buffer *bufs;
>  	unsigned int size, nr_pages;
> diff --git a/fs/splice.c b/fs/splice.c
> index 3553f1956508..64dcc622db8d 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -34,6 +34,7 @@
>  #include <linux/socket.h>
>  #include <linux/compat.h>
>  #include <linux/sched/signal.h>
> +#include <linux/pipe_fs_i.h>
>  
>  #include "internal.h"
>  
> @@ -931,6 +932,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
>  		current->splice_pipe = pipe;
>  	}
>  
> +	/*
> +	 * Try to increase the data holding capacity of the pipe so we can do
> +	 * larger IOs. This may not increase the size at all because maximum
> +	 * user pipe size is administrator controlled, but we still should try.
> +	 */
> +	pipe_set_size(pipe, pipe_max_size);
> +
>  	/*
>  	 * Do the splice.
>  	 */
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index 5a3bb3b7c9ad..e7d728e5b5f8 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -191,5 +191,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file);
>  
>  int create_pipe_files(struct file **, int);
>  unsigned int round_pipe_size(unsigned long size);
> +long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg);
>  
>  #endif
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 6/5] iomap: readpages doesn't zero page tail beyond EOF properly
  2018-11-19 21:17 [PATCH 0/5] iomap: data corruption fixes and more Dave Chinner
                   ` (4 preceding siblings ...)
  2018-11-19 21:17 ` [PATCH 5/5] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
@ 2018-11-21  6:50 ` Dave Chinner
  2018-11-21  8:27   ` [PATCH 6/5 V2] " Dave Chinner
  5 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2018-11-21  6:50 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel


From: Dave Chinner <dchinner@redhat.com>

When we read the EOF page of the file via readpages, we need
to zero the region beyond EOF that we either do not read or
should not contain data so that mmap does not expose stale data to
user applications.

However, iomap_adjust_read_range() fails to detect EOF correctly,
and so fsx on 1k block size filesystems fails very quickly with
mapreads exposing data beyond EOF. There are two problems here.

Firstly, when calculating the end block of the EOF byte, we have
to round the size by one to avoid a block aligned EOF from reporting
a block too large. i.e. a size of 1024 bytes is 1 block, which in
index terms is block 0. Therefore we have to calculate the end block
from (isize - 1), not isize.

The second bug is determining if the current page spans EOF, and so
whether we need split it into two half, one for the IO, and the
other for zeroing. Unfortunately, the code that checks whether
we should split the block doesn't actually check if we span EOF, it
just checks if the read spans the /offset in the page/ that EOF
sits on. So it splits every read into two if EOF is not page
aligned, regardless of whether we are reading the EOF block or not.

Hence we need to restrict the "does the read span EOF" check to
just the page that spans EOF, not every page we read.

This patch results in correct EOF detection through readpages:

xfs_vm_readpages:     dev 259:0 ino 0x43 nr_pages 24
xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
bprint:               iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864

As you can see, it now does full page reads until the last one which
is split correctly at the block aligned EOF, reading 3072 bytes and
zeroing the last 1024 bytes.

This fixes the stale data beyond EOF problem that fsx quickly
uncovers on 1k block size filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index d51e7a2ae641..17f71a7ae8be 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -142,13 +142,14 @@ static void
 iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
 {
+	loff_t orig_pos = *pos;
+	loff_t isize = i_size_read(inode);
 	unsigned block_bits = inode->i_blkbits;
 	unsigned block_size = (1 << block_bits);
 	unsigned poff = offset_in_page(*pos);
 	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
-	unsigned end = offset_in_page(i_size_read(inode)) >> block_bits;
 
 	/*
 	 * If the block size is smaller than the page size we need to check the
@@ -183,8 +184,11 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 	 * handle both halves separately so that we properly zero data in the
 	 * page cache for blocks that are entirely outside of i_size.
 	 */
-	if (first <= end && last > end)
-		plen -= (last - end) * block_size;
+	if (orig_pos <= isize && orig_pos + plen > isize) {
+		unsigned end = offset_in_page(isize - 1) >> block_bits;
+		if (first <= end && last > end)
+			plen -= (last - end) * block_size;
+	}
 
 	*offp = poff;
 	*lenp = plen;

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/5 V2] iomap: readpages doesn't zero page tail beyond EOF properly
  2018-11-21  6:50 ` [PATCH 6/5] iomap: readpages doesn't zero page tail beyond EOF properly Dave Chinner
@ 2018-11-21  8:27   ` Dave Chinner
  2018-11-21 16:20     ` Darrick J. Wong
  2018-11-21 16:34     ` Christoph Hellwig
  0 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2018-11-21  8:27 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel


From: Dave Chinner <dchinner@redhat.com>

When we read the EOF page of the file via readpages, we need
to zero the region beyond EOF that we either do not read or
should not contain data so that mmap does not expose stale data to
user applications.

However, iomap_adjust_read_range() fails to detect EOF correctly,
and so fsx on 1k block size filesystems fails very quickly with
mapreads exposing data beyond EOF. There are two problems here.

Firstly, when calculating the end block of the EOF byte, we have
to round the size by one to avoid a block aligned EOF from reporting
a block too large. i.e. a size of 1024 bytes is 1 block, which in
index terms is block 0. Therefore we have to calculate the end block
from (isize - 1), not isize.

The second bug is determining if the current page spans EOF, and so
whether we need split it into two half, one for the IO, and the
other for zeroing. Unfortunately, the code that checks whether
we should split the block doesn't actually check if we span EOF, it
just checks if the read spans the /offset in the page/ that EOF
sits on. So it splits every read into two if EOF is not page
aligned, regardless of whether we are reading the EOF block or not.

Hence we need to restrict the "does the read span EOF" check to
just the page that spans EOF, not every page we read.

This patch results in correct EOF detection through readpages:

xfs_vm_readpages:     dev 259:0 ino 0x43 nr_pages 24
xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864

As you can see, it now does full page reads until the last one which
is split correctly at the block aligned EOF, reading 3072 bytes and
zeroing the last 1024 bytes. The original version of the patch got
this right, but it got another case wrong.

The EOF detection crossing really needs to the the original length
as plen, while it starts at the end of the block, will be shortened
as up-to-date blocks are found on the page. This means "orig_pos +
plen" no longer points to the end of the page, and so will not
correctly detect EOF crossing. Hence we have to use the length
passed in to detect this partial page case:

xfs_filemap_fault:    dev 259:1 ino 0x43  write_fault 0
xfs_vm_readpage:      dev 259:1 ino 0x43 nr_pages 1
xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4
iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296
xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1
iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296

Heere we see a trace where the first block on the EOF page is up to
date, hence poff = 1024 bytes. The offset into the page of EOF is
3072, so the range we want to read is 1024 - 3071, and the range we
want to zero is 3072 - 4095. You can see this is split correctly
now.
 
This fixes the stale data beyond EOF problem that fsx quickly
uncovers on 1k block size filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---

Testing the first version of this patch failed the direct IO fsx
test case I've been using at ~16.5 million ops when it tripped over
a partially up to date EOF page. I used the wrong length to detect
EOF being crossed, this update fixes that and it no longer fails at
16.5m ops. We'll see how long it lasts this time....

 fs/iomap.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index d51e7a2ae641..b95110867eee 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -142,13 +142,14 @@ static void
 iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
 {
+	loff_t orig_pos = *pos;
+	loff_t isize = i_size_read(inode);
 	unsigned block_bits = inode->i_blkbits;
 	unsigned block_size = (1 << block_bits);
 	unsigned poff = offset_in_page(*pos);
 	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
-	unsigned end = offset_in_page(i_size_read(inode)) >> block_bits;
 
 	/*
 	 * If the block size is smaller than the page size we need to check the
@@ -183,8 +184,11 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 	 * handle both halves separately so that we properly zero data in the
 	 * page cache for blocks that are entirely outside of i_size.
 	 */
-	if (first <= end && last > end)
-		plen -= (last - end) * block_size;
+	if (orig_pos <= isize && orig_pos + length > isize) {
+		unsigned end = offset_in_page(isize - 1) >> block_bits;
+		if (first <= end && last > end)
+			plen -= (last - end) * block_size;
+	}
 
 	*offp = poff;
 	*lenp = plen;

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 6/5 V2] iomap: readpages doesn't zero page tail beyond EOF properly
  2018-11-21  8:27   ` [PATCH 6/5 V2] " Dave Chinner
@ 2018-11-21 16:20     ` Darrick J. Wong
  2018-11-21 16:34     ` Christoph Hellwig
  1 sibling, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2018-11-21 16:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Wed, Nov 21, 2018 at 07:27:36PM +1100, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we read the EOF page of the file via readpages, we need
> to zero the region beyond EOF that we either do not read or
> should not contain data so that mmap does not expose stale data to
> user applications.
> 
> However, iomap_adjust_read_range() fails to detect EOF correctly,
> and so fsx on 1k block size filesystems fails very quickly with
> mapreads exposing data beyond EOF. There are two problems here.
> 
> Firstly, when calculating the end block of the EOF byte, we have
> to round the size by one to avoid a block aligned EOF from reporting
> a block too large. i.e. a size of 1024 bytes is 1 block, which in
> index terms is block 0. Therefore we have to calculate the end block
> from (isize - 1), not isize.
> 
> The second bug is determining if the current page spans EOF, and so
> whether we need split it into two half, one for the IO, and the
> other for zeroing. Unfortunately, the code that checks whether
> we should split the block doesn't actually check if we span EOF, it
> just checks if the read spans the /offset in the page/ that EOF
> sits on. So it splits every read into two if EOF is not page
> aligned, regardless of whether we are reading the EOF block or not.
> 
> Hence we need to restrict the "does the read span EOF" check to
> just the page that spans EOF, not every page we read.
> 
> This patch results in correct EOF detection through readpages:
> 
> xfs_vm_readpages:     dev 259:0 ino 0x43 nr_pages 24
> xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
> iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
> xfs_iomap_found:      dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
> iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
> iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864
> 
> As you can see, it now does full page reads until the last one which
> is split correctly at the block aligned EOF, reading 3072 bytes and
> zeroing the last 1024 bytes. The original version of the patch got
> this right, but it got another case wrong.
> 
> The EOF detection crossing really needs to the the original length
> as plen, while it starts at the end of the block, will be shortened
> as up-to-date blocks are found on the page. This means "orig_pos +
> plen" no longer points to the end of the page, and so will not
> correctly detect EOF crossing. Hence we have to use the length
> passed in to detect this partial page case:
> 
> xfs_filemap_fault:    dev 259:1 ino 0x43  write_fault 0
> xfs_vm_readpage:      dev 259:1 ino 0x43 nr_pages 1
> xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4
> iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296
> xfs_iomap_found:      dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1
> iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296
> 
> Heere we see a trace where the first block on the EOF page is up to
> date, hence poff = 1024 bytes. The offset into the page of EOF is
> 3072, so the range we want to read is 1024 - 3071, and the range we
> want to zero is 3072 - 4095. You can see this is split correctly
> now.
>  
> This fixes the stale data beyond EOF problem that fsx quickly
> uncovers on 1k block size filesystems.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok, will test...
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
> 
> Testing the first version of this patch failed the direct IO fsx
> test case I've been using at ~16.5 million ops when it tripped over
> a partially up to date EOF page. I used the wrong length to detect
> EOF being crossed, this update fixes that and it no longer fails at
> 16.5m ops. We'll see how long it lasts this time....
> 
>  fs/iomap.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index d51e7a2ae641..b95110867eee 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -142,13 +142,14 @@ static void
>  iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
>  		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
>  {
> +	loff_t orig_pos = *pos;
> +	loff_t isize = i_size_read(inode);
>  	unsigned block_bits = inode->i_blkbits;
>  	unsigned block_size = (1 << block_bits);
>  	unsigned poff = offset_in_page(*pos);
>  	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
>  	unsigned first = poff >> block_bits;
>  	unsigned last = (poff + plen - 1) >> block_bits;
> -	unsigned end = offset_in_page(i_size_read(inode)) >> block_bits;
>  
>  	/*
>  	 * If the block size is smaller than the page size we need to check the
> @@ -183,8 +184,11 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
>  	 * handle both halves separately so that we properly zero data in the
>  	 * page cache for blocks that are entirely outside of i_size.
>  	 */
> -	if (first <= end && last > end)
> -		plen -= (last - end) * block_size;
> +	if (orig_pos <= isize && orig_pos + length > isize) {
> +		unsigned end = offset_in_page(isize - 1) >> block_bits;
> +		if (first <= end && last > end)
> +			plen -= (last - end) * block_size;
> +	}
>  
>  	*offp = poff;
>  	*lenp = plen;

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 6/5 V2] iomap: readpages doesn't zero page tail beyond EOF properly
  2018-11-21  8:27   ` [PATCH 6/5 V2] " Dave Chinner
  2018-11-21 16:20     ` Darrick J. Wong
@ 2018-11-21 16:34     ` Christoph Hellwig
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2018-11-21 16:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Wed, Nov 21, 2018 at 07:27:36PM +1100, Dave Chinner wrote:
> index d51e7a2ae641..b95110867eee 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -142,13 +142,14 @@ static void
>  iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
>  		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
>  {
> +	loff_t orig_pos = *pos;
> +	loff_t isize = i_size_read(inode);
>  	unsigned block_bits = inode->i_blkbits;
>  	unsigned block_size = (1 << block_bits);
>  	unsigned poff = offset_in_page(*pos);
>  	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
>  	unsigned first = poff >> block_bits;
>  	unsigned last = (poff + plen - 1) >> block_bits;
> -	unsigned end = offset_in_page(i_size_read(inode)) >> block_bits;
>  
>  	/*
>  	 * If the block size is smaller than the page size we need to check the
> @@ -183,8 +184,11 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
>  	 * handle both halves separately so that we properly zero data in the
>  	 * page cache for blocks that are entirely outside of i_size.
>  	 */
> -	if (first <= end && last > end)
> -		plen -= (last - end) * block_size;
> +	if (orig_pos <= isize && orig_pos + length > isize) {
> +		unsigned end = offset_in_page(isize - 1) >> block_bits;
> +		if (first <= end && last > end)
> +			plen -= (last - end) * block_size;

Please add an empty line afte the variable declaration.

Otherwise this looks good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-11-22  3:09 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-19 21:17 [PATCH 0/5] iomap: data corruption fixes and more Dave Chinner
2018-11-19 21:17 ` [PATCH 1/5] iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents Dave Chinner
2018-11-20  7:48   ` Christoph Hellwig
2018-11-20 23:00   ` Darrick J. Wong
2018-11-19 21:17 ` [PATCH 2/5] iomap: sub-block dio needs to zeroout beyond EOF Dave Chinner
2018-11-20 23:01   ` Darrick J. Wong
2018-11-19 21:17 ` [PATCH 3/5] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
2018-11-19 21:17 ` [PATCH 4/5] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
2018-11-20  7:49   ` Christoph Hellwig
2018-11-20 23:02   ` Darrick J. Wong
2018-11-19 21:17 ` [PATCH 5/5] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
2018-11-20  7:49   ` Christoph Hellwig
2018-11-21  6:50 ` [PATCH 6/5] iomap: readpages doesn't zero page tail beyond EOF properly Dave Chinner
2018-11-21  8:27   ` [PATCH 6/5 V2] " Dave Chinner
2018-11-21 16:20     ` Darrick J. Wong
2018-11-21 16:34     ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.