linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2]: dedupe/copy_file_range fixes
@ 2018-11-08 22:19 Dave Chinner
  2018-11-08 22:19 ` [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Dave Chinner @ 2018-11-08 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Hi folks,

Here's the first couple of fixes that fell out of the tree last
night after Darrick added and clone/dedupe_file_range to fsx and
copy_file_range to both fsx and fsstress yesterday. There will be
more coming.

FYI, do_splice_direct() is broken for direct IO on XFS. It's been
broken ever since the iomap infrastructure was introduced in 4.10
two years ago. Any code that can use splice to read data via direct
IO is definitely broken (copy_file_range demonstrates that). And my
initial testing on 64k block sizes indicates that direct IO writes
via splice are also broken because the short writes expose stale
data in partially written blocks.

As such, copy_file_range() is broken for direct IO on XFS w/o reflink
support (i.e. default configs). It has been broken since at least
4.10, and maybe as far back as when copy_file_range() was first
introduced in 4.5.

Yet in all this time, nobody has noticed that is is broken. That's
because nobody added test infrastructure support for the
copy_file_range() syscall. Sure we added a command to xfs_io to be
able to copy files via the interface, but that doesn't mean the
interface is exercised in any useful manner.

So, how long do you think it took fsx to find these problems? A few
hours? A few million operations? Nope. It took 32 operations and
0.3s - it failed on the second copy_file_range() call it made.
This is from generic/263:

skipping zero size read
truncating to largest ever: 0xe400
4 trunc from 0x0 to 0xe400
5 insert        from 0x6000 to 0x18000, (0x12000 bytes)
6 zero  from 0x91be to 0x1edf6, (0x15c38 bytes)
7 write 0x3ac00 thru    0x3cdff (0x2200 bytes)
8 mapread       0x36000 thru    0x3be19 (0x5e1a bytes)
9 mapwrite      0x73200 thru    0x7928c (0x608d bytes)
10 mapread      0x3d000 thru    0x3f9c2 (0x29c3 bytes)
11 collapse     from 0x2b000 to 0x45000, (0x1a000 bytes)
12 punch        from 0x495fa to 0x5f28d, (0x15c93 bytes)
13 falloc       from 0x2f42a to 0x4a8f4 (0x1b4ca bytes)
14 zero from 0x530b7 to 0x5f28d, (0xc1d6 bytes)
15 mapwrite     0x55e00 thru    0x70d6e (0x1af6f bytes)
16 read 0x2e000 thru    0x38fff (0xb000 bytes)
17 collapse     from 0x3f000 to 0x4f000, (0x10000 bytes)
18 copy from 0x28000 to 0x43000, (0x1b000 bytes) at 0x4400
19 collapse     from 0x2c000 to 0x45000, (0x19000 bytes)
20 write        0x54a00 thru    0x709ff (0x1c000 bytes)
21 read 0x53000 thru    0x69fff (0x17000 bytes)
22 mapwrite     0x1f200 thru    0x394bb (0x1a2bc bytes)
23 mapread      0x43000 thru    0x5a2d8 (0x172d9 bytes)
24 mapwrite     0x23000 thru    0x38812 (0x15813 bytes)
25 write        0x47800 thru    0x587ff (0x11000 bytes)
26 clone        from 0x3000 to 0x12000, (0xf000 bytes) at 0x61000
skipping unsupported clone range
27 read 0x6c000 thru    0x6efff (0x3000 bytes)
28 dedupe       from 0x12000 to 0x1e000, (0xc000 bytes) at 0x4000
skipping unsupported dedupe range
29 insert       from 0x31000 to 0x33000, (0x2000 bytes)
fallocating to largest ever: 0x49915
30 falloc       from 0x2deac to 0x49915 (0x1ba69 bytes)
31 dedupe       from 0x6f000 to 0x72000, (0x3000 bytes) at 0x25000
skipping unsupported dedupe range
32 copy from 0x4b000 to 0x64000, (0x19000 bytes) at 0x2800
copy range: 0x4b000 to 0x64000 at 0x2800
do_copy_range:: Resource temporarily unavailable

The ease of reproducing this bug indicates just how badly our
engineering processes have failed here. They simply haven't been
followed at all - we know that if we don't add test coverage with
new features and/or interfaces, then the new code will be buggy and
broken. Yet maintainers still seem to ignore whether code is
testable or even fit for purpose and merge new code that has not and
cannot be tested adequately now and into the future.

I'm not a happy camper right now.

-Dave.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
  2018-11-08 22:19 [PATCH 0/2]: dedupe/copy_file_range fixes Dave Chinner
@ 2018-11-08 22:19 ` Dave Chinner
  2018-11-08 23:03   ` Darrick J. Wong
  2018-11-08 22:19 ` [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
  2018-11-09  0:54 ` [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
  2 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2018-11-08 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

It returns EINVAL when the operation is not supported by the
filesystem. Fix it to return EOPNOTSUPP to be consistent with
the man page and clone_file_range().

Clean up the inconsistent error return handling while I'm there.
(I know, lipstick on a pig, but every little bit helps...)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/read_write.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index bfcb4ced5664..aa43224bcec6 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2094,17 +2094,18 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 	off = same->src_offset;
 	len = same->src_length;
 
-	ret = -EISDIR;
+	if (!file->f_op->remap_file_range)
+		return -EOPNOTSUPP;
+
 	if (S_ISDIR(src->i_mode))
-		goto out;
+		return -EISDIR;
 
-	ret = -EINVAL;
 	if (!S_ISREG(src->i_mode))
-		goto out;
+		return -EINVAL;
 
 	ret = remap_verify_area(file, off, len, false);
 	if (ret < 0)
-		goto out;
+		return ret;
 	ret = 0;
 
 	if (off + len > i_size_read(src))
@@ -2147,10 +2148,8 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 		fdput(dst_fd);
 next_loop:
 		if (fatal_signal_pending(current))
-			goto out;
+			break;
 	}
-
-out:
 	return ret;
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range);
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill
  2018-11-08 22:19 [PATCH 0/2]: dedupe/copy_file_range fixes Dave Chinner
  2018-11-08 22:19 ` [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
@ 2018-11-08 22:19 ` Dave Chinner
  2018-11-13 16:25   ` Darrick J. Wong
  2018-11-15 10:15   ` Christoph Hellwig
  2018-11-09  0:54 ` [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
  2 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2018-11-08 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When doing direct IO to a pipe for do_splice_direct(), then pipe is
trivial to fill up and overflow as it can only hold 16 pages. At
this point bio_iov_iter_get_pages() then returns -EFAULT, and we
abort the IO submission process. Unfortunately, iomap_dio_rw()
propagates the error back up the stack.

The error is converted from the EFAULT to EAGAIN in
generic_file_splice_read() to tell the splice layers that the pipe
is full. do_splice_direct() completely fails to handle EAGAIN errors
(it aborts on error) and returns EAGAIN to the caller.

copy_file_write() then compeltely fails to handle EAGAIN as well,
and so returns EAGAIN to userspace, having failed to copy the data
it was asked to.

Avoid this whole steaming pile of fail by having iomap_dio_rw()
silently swallow EFAULT errors and so do short reads.

To make matters worse, iomap_dio_actor() has a stale data exposure
bug bio_iov_iter_get_pages() fails - it does not zero the tail block
that it may have been left uncovered by partial IO. Fix the error
handling case to drop to the sub-block zeroing rather than
immmediately returning the -EFAULT error.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 7aacd48c593e..2fda13bc37d8 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1775,7 +1775,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct bio *bio;
 	bool need_zeroout = false;
 	bool use_fua = false;
-	int nr_pages, ret;
+	int nr_pages, ret = 0;
 	size_t copied = 0;
 
 	if ((pos | length | align) & ((1 << blkbits) - 1))
@@ -1839,8 +1839,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		ret = bio_iov_iter_get_pages(bio, &iter);
 		if (unlikely(ret)) {
+			/*
+			 * We have to stop part way through an IO. We must fall through
+			 * to the sub-block tail zeroing here, otherwise this
+			 * short IO may expose stale data in the tail of the
+			 * block we haven't written data to.
+			 */
 			bio_put(bio);
-			return copied ? copied : ret;
+			goto zero_tail;
 		}
 
 		n = bio->bi_iter.bi_size;
@@ -1877,6 +1883,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	 * the block tail in the latter case, we can expose stale data via mmap
 	 * reads of the EOF block.
 	 */
+zero_tail:
 	if (need_zeroout ||
 	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
 		/* zero out from the end of the write to the end of the block */
@@ -1884,7 +1891,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		if (pad)
 			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
 	}
-	return copied;
+	return copied ? copied : ret;
 }
 
 static loff_t
@@ -2079,6 +2086,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 				dio->wait_for_completion = true;
 				ret = 0;
 			}
+
+			/*
+			 * splicing to pipes can fail on a full pipe. We have to
+			 * swallow this to make it look like a short IO
+			 * otherwise the higher splice layers will completely
+			 * mishandle the error and stop moving data.
+			 */
+			if (ret == -EFAULT)
+				ret = 0;
 			break;
 		}
 		pos += ret;
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
  2018-11-08 22:19 ` [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
@ 2018-11-08 23:03   ` Darrick J. Wong
  2018-11-09  6:47     ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2018-11-08 23:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 09, 2018 at 09:19:08AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It returns EINVAL when the operation is not supported by the
> filesystem. Fix it to return EOPNOTSUPP to be consistent with
> the man page and clone_file_range().
> 
> Clean up the inconsistent error return handling while I'm there.
> (I know, lipstick on a pig, but every little bit helps...)
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok, would rather just shred this ioctl and make a better one. :P

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/read_write.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index bfcb4ced5664..aa43224bcec6 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -2094,17 +2094,18 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>  	off = same->src_offset;
>  	len = same->src_length;
>  
> -	ret = -EISDIR;
> +	if (!file->f_op->remap_file_range)
> +		return -EOPNOTSUPP;
> +
>  	if (S_ISDIR(src->i_mode))
> -		goto out;
> +		return -EISDIR;
>  
> -	ret = -EINVAL;
>  	if (!S_ISREG(src->i_mode))
> -		goto out;
> +		return -EINVAL;
>  
>  	ret = remap_verify_area(file, off, len, false);
>  	if (ret < 0)
> -		goto out;
> +		return ret;
>  	ret = 0;
>  
>  	if (off + len > i_size_read(src))
> @@ -2147,10 +2148,8 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>  		fdput(dst_fd);
>  next_loop:
>  		if (fatal_signal_pending(current))
> -			goto out;
> +			break;
>  	}
> -
> -out:
>  	return ret;
>  }
>  EXPORT_SYMBOL(vfs_dedupe_file_range);
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor()
  2018-11-08 22:19 [PATCH 0/2]: dedupe/copy_file_range fixes Dave Chinner
  2018-11-08 22:19 ` [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
  2018-11-08 22:19 ` [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
@ 2018-11-09  0:54 ` Dave Chinner
  2018-11-13 16:22   ` Darrick J. Wong
  2018-11-15 10:17   ` Christoph Hellwig
  2 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2018-11-09  0:54 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel


From: Dave Chinner <dchinner@redhat.com>

When copy_file_range() is called on files that have been opened with
O_DIRECT, do_splice_direct() does a manual copy of the range one
pipe buffer at a time. The default is 16 pages, which means on
x86_64 it is limited to 64kB IO. This is extremely slow - 64k
synchrnous read/write will run at maybe 5-10MB/s on a spinning disk
and be seek bound. It will be faster on SSDs, but still very
inefficient.

Increase the pipe size to the maximum allowed user size so that we
can get decent throughput for this highly sub-optimal copy loop. Add
a new function to the pipe code that lets us set the pipe size to
the maximum allowed without root permissions to keep things really
simple. We also don't care if changing the pipe size fails - that
will just result in a slower copy.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/pipe.c                 | 10 ++++++++++
 fs/splice.c               |  7 +++++++
 include/linux/pipe_fs_i.h |  1 +
 3 files changed, 18 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c0977d..436bc0464569 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1109,6 +1109,16 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 	return ret;
 }
 
+/*
+ * Set the pipe to the maximum allowable user size. Advisory only, will
+ * swallow any errors and return the resultant pipe size.
+ */
+long pipe_set_max_safe_size(struct pipe_inode_info *pipe)
+{
+	pipe_set_size(pipe, pipe_max_size);
+	return pipe->buffers * PAGE_SIZE;
+}
+
 /*
  * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
  * location, so checking ->i_pipe is not enough to verify that this is a
diff --git a/fs/splice.c b/fs/splice.c
index 3553f1956508..9749139da731 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -931,6 +931,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		current->splice_pipe = pipe;
 	}
 
+	/*
+	 * Try to increase the data holding capacity of the pipe so we can do
+	 * larger IOs. This may not increase the size at all because maximum
+	 * user pipe size is administrator controlled, but we still should try.
+	 */
+	pipe_set_max_safe_size(pipe);
+
 	/*
 	 * Do the splice.
 	 */
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 5a3bb3b7c9ad..962ba4cfcb74 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -191,5 +191,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file);
 
 int create_pipe_files(struct file **, int);
 unsigned int round_pipe_size(unsigned long size);
+long pipe_set_max_safe_size(struct pipe_inode_info *pipe);
 
 #endif

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
  2018-11-08 23:03   ` Darrick J. Wong
@ 2018-11-09  6:47     ` Darrick J. Wong
  2018-11-16  5:33       ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2018-11-09  6:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Thu, Nov 08, 2018 at 03:03:49PM -0800, Darrick J. Wong wrote:
> On Fri, Nov 09, 2018 at 09:19:08AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > It returns EINVAL when the operation is not supported by the
> > filesystem. Fix it to return EOPNOTSUPP to be consistent with
> > the man page and clone_file_range().
> > 
> > Clean up the inconsistent error return handling while I'm there.
> > (I know, lipstick on a pig, but every little bit helps...)
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Looks ok, would rather just shred this ioctl and make a better one. :P
> 
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> --D
> 
> > ---
> >  fs/read_write.c | 15 +++++++--------
> >  1 file changed, 7 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index bfcb4ced5664..aa43224bcec6 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -2094,17 +2094,18 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
> >  	off = same->src_offset;
> >  	len = same->src_length;
> >  
> > -	ret = -EISDIR;
> > +	if (!file->f_op->remap_file_range)
> > +		return -EOPNOTSUPP;

Minor downside: this causes regressions in generic/158 because
directories and device files don't have a ->remap_file_range
implementation.  I think it can be solved by moving this check after the
!S_ISREG check below.

--D

> > +
> >  	if (S_ISDIR(src->i_mode))
> > -		goto out;
> > +		return -EISDIR;
> >  
> > -	ret = -EINVAL;
> >  	if (!S_ISREG(src->i_mode))
> > -		goto out;
> > +		return -EINVAL;
> >  
> >  	ret = remap_verify_area(file, off, len, false);
> >  	if (ret < 0)
> > -		goto out;
> > +		return ret;
> >  	ret = 0;
> >  
> >  	if (off + len > i_size_read(src))
> > @@ -2147,10 +2148,8 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
> >  		fdput(dst_fd);
> >  next_loop:
> >  		if (fatal_signal_pending(current))
> > -			goto out;
> > +			break;
> >  	}
> > -
> > -out:
> >  	return ret;
> >  }
> >  EXPORT_SYMBOL(vfs_dedupe_file_range);
> > -- 
> > 2.19.1
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor()
  2018-11-09  0:54 ` [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
@ 2018-11-13 16:22   ` Darrick J. Wong
  2018-11-13 21:41     ` Dave Chinner
  2018-11-15 10:17   ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2018-11-13 16:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 09, 2018 at 11:54:10AM +1100, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> When copy_file_range() is called on files that have been opened with
> O_DIRECT, do_splice_direct() does a manual copy of the range one
> pipe buffer at a time. The default is 16 pages, which means on
> x86_64 it is limited to 64kB IO. This is extremely slow - 64k
> synchrnous read/write will run at maybe 5-10MB/s on a spinning disk
> and be seek bound. It will be faster on SSDs, but still very
> inefficient.
> 
> Increase the pipe size to the maximum allowed user size so that we
> can get decent throughput for this highly sub-optimal copy loop. Add
> a new function to the pipe code that lets us set the pipe size to
> the maximum allowed without root permissions to keep things really
> simple. We also don't care if changing the pipe size fails - that
> will just result in a slower copy.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/pipe.c                 | 10 ++++++++++
>  fs/splice.c               |  7 +++++++
>  include/linux/pipe_fs_i.h |  1 +
>  3 files changed, 18 insertions(+)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index bdc5d3c0977d..436bc0464569 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1109,6 +1109,16 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
>  	return ret;
>  }
>  
> +/*
> + * Set the pipe to the maximum allowable user size. Advisory only, will
> + * swallow any errors and return the resultant pipe size.
> + */
> +long pipe_set_max_safe_size(struct pipe_inode_info *pipe)
> +{
> +	pipe_set_size(pipe, pipe_max_size);
> +	return pipe->buffers * PAGE_SIZE;
> +}
> +
>  /*
>   * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
>   * location, so checking ->i_pipe is not enough to verify that this is a
> diff --git a/fs/splice.c b/fs/splice.c
> index 3553f1956508..9749139da731 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -931,6 +931,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
>  		current->splice_pipe = pipe;
>  	}
>  
> +	/*
> +	 * Try to increase the data holding capacity of the pipe so we can do
> +	 * larger IOs. This may not increase the size at all because maximum
> +	 * user pipe size is administrator controlled, but we still should try.
> +	 */
> +	pipe_set_max_safe_size(pipe);

I get where you're going with this, but I have two questions:

- Is it safe to be enlarging the pipe buffer size unconditionally?

- Especially if we didn't just create the splice pipe?  Suppose someone
  comes along later trying to splice things and doesn't realize the pipe
  is now 1MB...

Then I started wondering about the splice_pipe lifetime and couldn't
figure out if it ever gets detached from current prior to do_exit.
I don't think it does, which means that we're stuck with the 1MB
kernel memory allocation until the process dies.

--D

> +
>  	/*
>  	 * Do the splice.
>  	 */
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index 5a3bb3b7c9ad..962ba4cfcb74 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -191,5 +191,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file);
>  
>  int create_pipe_files(struct file **, int);
>  unsigned int round_pipe_size(unsigned long size);
> +long pipe_set_max_safe_size(struct pipe_inode_info *pipe);
>  
>  #endif

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill
  2018-11-08 22:19 ` [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
@ 2018-11-13 16:25   ` Darrick J. Wong
  2018-11-15 10:15   ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2018-11-13 16:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 09, 2018 at 09:19:09AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When doing direct IO to a pipe for do_splice_direct(), then pipe is
> trivial to fill up and overflow as it can only hold 16 pages. At
> this point bio_iov_iter_get_pages() then returns -EFAULT, and we
> abort the IO submission process. Unfortunately, iomap_dio_rw()
> propagates the error back up the stack.
> 
> The error is converted from the EFAULT to EAGAIN in
> generic_file_splice_read() to tell the splice layers that the pipe
> is full. do_splice_direct() completely fails to handle EAGAIN errors
> (it aborts on error) and returns EAGAIN to the caller.
> 
> copy_file_write() then compeltely fails to handle EAGAIN as well,

"completely"...

> and so returns EAGAIN to userspace, having failed to copy the data
> it was asked to.
> 
> Avoid this whole steaming pile of fail by having iomap_dio_rw()
> silently swallow EFAULT errors and so do short reads.
> 
> To make matters worse, iomap_dio_actor() has a stale data exposure
> bug bio_iov_iter_get_pages() fails - it does not zero the tail block
> that it may have been left uncovered by partial IO. Fix the error
> handling case to drop to the sub-block zeroing rather than
> immmediately returning the -EFAULT error.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Other than that it looks ok and I'll give it a spin,

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 7aacd48c593e..2fda13bc37d8 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1775,7 +1775,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	struct bio *bio;
>  	bool need_zeroout = false;
>  	bool use_fua = false;
> -	int nr_pages, ret;
> +	int nr_pages, ret = 0;
>  	size_t copied = 0;
>  
>  	if ((pos | length | align) & ((1 << blkbits) - 1))
> @@ -1839,8 +1839,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  		ret = bio_iov_iter_get_pages(bio, &iter);
>  		if (unlikely(ret)) {
> +			/*
> +			 * We have to stop part way through an IO. We must fall through
> +			 * to the sub-block tail zeroing here, otherwise this
> +			 * short IO may expose stale data in the tail of the
> +			 * block we haven't written data to.
> +			 */
>  			bio_put(bio);
> -			return copied ? copied : ret;
> +			goto zero_tail;
>  		}
>  
>  		n = bio->bi_iter.bi_size;
> @@ -1877,6 +1883,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	 * the block tail in the latter case, we can expose stale data via mmap
>  	 * reads of the EOF block.
>  	 */
> +zero_tail:
>  	if (need_zeroout ||
>  	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>  		/* zero out from the end of the write to the end of the block */
> @@ -1884,7 +1891,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		if (pad)
>  			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
>  	}
> -	return copied;
> +	return copied ? copied : ret;
>  }
>  
>  static loff_t
> @@ -2079,6 +2086,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  				dio->wait_for_completion = true;
>  				ret = 0;
>  			}
> +
> +			/*
> +			 * splicing to pipes can fail on a full pipe. We have to
> +			 * swallow this to make it look like a short IO
> +			 * otherwise the higher splice layers will completely
> +			 * mishandle the error and stop moving data.
> +			 */
> +			if (ret == -EFAULT)
> +				ret = 0;
>  			break;
>  		}
>  		pos += ret;
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor()
  2018-11-13 16:22   ` Darrick J. Wong
@ 2018-11-13 21:41     ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2018-11-13 21:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel

On Tue, Nov 13, 2018 at 08:22:37AM -0800, Darrick J. Wong wrote:
> On Fri, Nov 09, 2018 at 11:54:10AM +1100, Dave Chinner wrote:
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When copy_file_range() is called on files that have been opened with
> > O_DIRECT, do_splice_direct() does a manual copy of the range one
> > pipe buffer at a time. The default is 16 pages, which means on
> > x86_64 it is limited to 64kB IO. This is extremely slow - 64k
> > synchrnous read/write will run at maybe 5-10MB/s on a spinning disk
> > and be seek bound. It will be faster on SSDs, but still very
> > inefficient.
> > 
> > Increase the pipe size to the maximum allowed user size so that we
> > can get decent throughput for this highly sub-optimal copy loop. Add
> > a new function to the pipe code that lets us set the pipe size to
> > the maximum allowed without root permissions to keep things really
> > simple. We also don't care if changing the pipe size fails - that
> > will just result in a slower copy.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/pipe.c                 | 10 ++++++++++
> >  fs/splice.c               |  7 +++++++
> >  include/linux/pipe_fs_i.h |  1 +
> >  3 files changed, 18 insertions(+)
> > 
> > diff --git a/fs/pipe.c b/fs/pipe.c
> > index bdc5d3c0977d..436bc0464569 100644
> > --- a/fs/pipe.c
> > +++ b/fs/pipe.c
> > @@ -1109,6 +1109,16 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
> >  	return ret;
> >  }
> >  
> > +/*
> > + * Set the pipe to the maximum allowable user size. Advisory only, will
> > + * swallow any errors and return the resultant pipe size.
> > + */
> > +long pipe_set_max_safe_size(struct pipe_inode_info *pipe)
> > +{
> > +	pipe_set_size(pipe, pipe_max_size);
> > +	return pipe->buffers * PAGE_SIZE;
> > +}
> > +
> >  /*
> >   * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
> >   * location, so checking ->i_pipe is not enough to verify that this is a
> > diff --git a/fs/splice.c b/fs/splice.c
> > index 3553f1956508..9749139da731 100644
> > --- a/fs/splice.c
> > +++ b/fs/splice.c
> > @@ -931,6 +931,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
> >  		current->splice_pipe = pipe;
> >  	}
> >  
> > +	/*
> > +	 * Try to increase the data holding capacity of the pipe so we can do
> > +	 * larger IOs. This may not increase the size at all because maximum
> > +	 * user pipe size is administrator controlled, but we still should try.
> > +	 */
> > +	pipe_set_max_safe_size(pipe);
> 
> I get where you're going with this, but I have two questions:
> 
> - Is it safe to be enlarging the pipe buffer size unconditionally?

Don't see why it would be unsafe.

> - Especially if we didn't just create the splice pipe?  Suppose someone
>   comes along later trying to splice things and doesn't realize the pipe
>   is now 1MB...

The splice code is supposed to handle arbitrary pipe sizes
correctly. if something breaks because it has assumptions about how
much data a pipe can hold, it's already broken.

> Then I started wondering about the splice_pipe lifetime and couldn't
> figure out if it ever gets detached from current prior to do_exit.
> I don't think it does, which means that we're stuck with the 1MB
> kernel memory allocation until the process dies.

If you are using do_splice_direct(), you either have a short term
process (i.e. a cp type utility) or you are moving bulk data around,
in which case 1MB of extra memory isn't a big deal.

And given that the default pipe size is dependent on PAGE_SIZE (i.e.
the default is 16 pages, not 64kB) then on 64k page architectures we
are already using pipes of 1MB capacity by default.

I could make this contingent on O_DIRECT, but then we have the
problem that 64k pipes aren't big enough for efficient buffered IO
with 64k block size filesystems, either. IOWs, the pipe size in
do_splice_direct needs to be increased whichever way we look at it.

This all said, I really think we need to imlpement our own
->copy_file_range() code that uses iomap to iterate data-only
extents (i.e. hole-preserving) copying and does well formed IO.
IOWs, only fall back to do_splice_direct() when doing copies to/from
non-XFS filesystems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill
  2018-11-08 22:19 ` [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
  2018-11-13 16:25   ` Darrick J. Wong
@ 2018-11-15 10:15   ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-15 10:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 09, 2018 at 09:19:09AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When doing direct IO to a pipe for do_splice_direct(), then pipe is
> trivial to fill up and overflow as it can only hold 16 pages. At
> this point bio_iov_iter_get_pages() then returns -EFAULT, and we
> abort the IO submission process. Unfortunately, iomap_dio_rw()
> propagates the error back up the stack.
> 
> The error is converted from the EFAULT to EAGAIN in
> generic_file_splice_read() to tell the splice layers that the pipe
> is full. do_splice_direct() completely fails to handle EAGAIN errors
> (it aborts on error) and returns EAGAIN to the caller.
> 
> copy_file_write() then compeltely fails to handle EAGAIN as well,
> and so returns EAGAIN to userspace, having failed to copy the data
> it was asked to.
> 
> Avoid this whole steaming pile of fail by having iomap_dio_rw()
> silently swallow EFAULT errors and so do short reads.
> 
> To make matters worse, iomap_dio_actor() has a stale data exposure
> bug bio_iov_iter_get_pages() fails - it does not zero the tail block
> that it may have been left uncovered by partial IO. Fix the error
> handling case to drop to the sub-block zeroing rather than
> immmediately returning the -EFAULT error.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 7aacd48c593e..2fda13bc37d8 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1775,7 +1775,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	struct bio *bio;
>  	bool need_zeroout = false;
>  	bool use_fua = false;
> -	int nr_pages, ret;
> +	int nr_pages, ret = 0;
>  	size_t copied = 0;
>  
>  	if ((pos | length | align) & ((1 << blkbits) - 1))
> @@ -1839,8 +1839,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  		ret = bio_iov_iter_get_pages(bio, &iter);
>  		if (unlikely(ret)) {
> +			/*
> +			 * We have to stop part way through an IO. We must fall through

Overly long line.

> +			 * to the sub-block tail zeroing here, otherwise this
> +			 * short IO may expose stale data in the tail of the
> +			 * block we haven't written data to.
> +			 */
>  			bio_put(bio);
> -			return copied ? copied : ret;
> +			goto zero_tail;
>  		}
>  
>  		n = bio->bi_iter.bi_size;
> @@ -1877,6 +1883,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	 * the block tail in the latter case, we can expose stale data via mmap
>  	 * reads of the EOF block.
>  	 */
> +zero_tail:
>  	if (need_zeroout ||
>  	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>  		/* zero out from the end of the write to the end of the block */
> @@ -1884,7 +1891,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		if (pad)
>  			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
>  	}
> -	return copied;
> +	return copied ? copied : ret;
>  }
>  
>  static loff_t
> @@ -2079,6 +2086,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  				dio->wait_for_completion = true;
>  				ret = 0;
>  			}
> +
> +			/*
> +			 * splicing to pipes can fail on a full pipe. We have to

Please start comments that contains full sentences with an uppercase
letter.

Thse nitpicks aside this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor()
  2018-11-09  0:54 ` [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
  2018-11-13 16:22   ` Darrick J. Wong
@ 2018-11-15 10:17   ` Christoph Hellwig
  2018-11-16  5:42     ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-15 10:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

> +long pipe_set_max_safe_size(struct pipe_inode_info *pipe)
> +{
> +	pipe_set_size(pipe, pipe_max_size);
> +	return pipe->buffers * PAGE_SIZE;
> +}

This should probably return an unsigned value, given that we don't return
errors.  Then again the callers ignores the return value entirely.
Wouldn't it be easier to just call pipe_set_size from splice.c after
removing the static marker?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
  2018-11-09  6:47     ` Darrick J. Wong
@ 2018-11-16  5:33       ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2018-11-16  5:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel

On Thu, Nov 08, 2018 at 10:47:13PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 08, 2018 at 03:03:49PM -0800, Darrick J. Wong wrote:
> > On Fri, Nov 09, 2018 at 09:19:08AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > It returns EINVAL when the operation is not supported by the
> > > filesystem. Fix it to return EOPNOTSUPP to be consistent with
> > > the man page and clone_file_range().
> > > 
> > > Clean up the inconsistent error return handling while I'm there.
> > > (I know, lipstick on a pig, but every little bit helps...)
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > 
> > Looks ok, would rather just shred this ioctl and make a better one. :P
> > 
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > --D
> > 
> > > ---
> > >  fs/read_write.c | 15 +++++++--------
> > >  1 file changed, 7 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > index bfcb4ced5664..aa43224bcec6 100644
> > > --- a/fs/read_write.c
> > > +++ b/fs/read_write.c
> > > @@ -2094,17 +2094,18 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
> > >  	off = same->src_offset;
> > >  	len = same->src_length;
> > >  
> > > -	ret = -EISDIR;
> > > +	if (!file->f_op->remap_file_range)
> > > +		return -EOPNOTSUPP;
> 
> Minor downside: this causes regressions in generic/158 because
> directories and device files don't have a ->remap_file_range
> implementation.  I think it can be solved by moving this check after the
> !S_ISREG check below.

Done.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor()
  2018-11-15 10:17   ` Christoph Hellwig
@ 2018-11-16  5:42     ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2018-11-16  5:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Thu, Nov 15, 2018 at 02:17:35AM -0800, Christoph Hellwig wrote:
> > +long pipe_set_max_safe_size(struct pipe_inode_info *pipe)
> > +{
> > +	pipe_set_size(pipe, pipe_max_size);
> > +	return pipe->buffers * PAGE_SIZE;
> > +}
> 
> This should probably return an unsigned value, given that we don't return
> errors.  Then again the callers ignores the return value entirely.
> Wouldn't it be easier to just call pipe_set_size from splice.c after
> removing the static marker?

I didn't want to use pipe_max_size outside of the pipe code. it's
definitely simpler without a wrapper if using pipe_max_size directly
is acceptable.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-11-16 15:53 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-08 22:19 [PATCH 0/2]: dedupe/copy_file_range fixes Dave Chinner
2018-11-08 22:19 ` [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
2018-11-08 23:03   ` Darrick J. Wong
2018-11-09  6:47     ` Darrick J. Wong
2018-11-16  5:33       ` Dave Chinner
2018-11-08 22:19 ` [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
2018-11-13 16:25   ` Darrick J. Wong
2018-11-15 10:15   ` Christoph Hellwig
2018-11-09  0:54 ` [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
2018-11-13 16:22   ` Darrick J. Wong
2018-11-13 21:41     ` Dave Chinner
2018-11-15 10:17   ` Christoph Hellwig
2018-11-16  5:42     ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).