* [PATCH 0/2]: dedupe/copy_file_range fixes
@ 2018-11-08 22:19 Dave Chinner
  2018-11-08 22:19 ` [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread

From: Dave Chinner @ 2018-11-08 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Hi folks,

Here's the first couple of fixes that fell out of the tree last night
after Darrick added clone/dedupe_file_range to fsx and copy_file_range
to both fsx and fsstress yesterday. There will be more coming.

FYI, do_splice_direct() is broken for direct IO on XFS. It's been
broken ever since the iomap infrastructure was introduced in 4.10 two
years ago. Any code that can use splice to read data via direct IO is
definitely broken (copy_file_range demonstrates that). And my initial
testing on 64k block sizes indicates that direct IO writes via splice
are also broken, because the short writes expose stale data in
partially written blocks.

As such, copy_file_range() is broken for direct IO on XFS without
reflink support (i.e. default configs). It has been broken since at
least 4.10, and maybe as far back as when copy_file_range() was first
introduced in 4.5. Yet in all this time, nobody has noticed that it is
broken. That's because nobody added test infrastructure support for
the copy_file_range() syscall. Sure, we added a command to xfs_io to
be able to copy files via the interface, but that doesn't mean the
interface is exercised in any useful manner.

So, how long do you think it took fsx to find these problems? A few
hours? A few million operations? Nope. It took 32 operations and 0.3s
- it failed on the second copy_file_range() call it made.
This is from generic/263:

skipping zero size read
truncating to largest ever: 0xe400
4 trunc from 0x0 to 0xe400
5 insert from 0x6000 to 0x18000, (0x12000 bytes)
6 zero from 0x91be to 0x1edf6, (0x15c38 bytes)
7 write 0x3ac00 thru 0x3cdff (0x2200 bytes)
8 mapread 0x36000 thru 0x3be19 (0x5e1a bytes)
9 mapwrite 0x73200 thru 0x7928c (0x608d bytes)
10 mapread 0x3d000 thru 0x3f9c2 (0x29c3 bytes)
11 collapse from 0x2b000 to 0x45000, (0x1a000 bytes)
12 punch from 0x495fa to 0x5f28d, (0x15c93 bytes)
13 falloc from 0x2f42a to 0x4a8f4 (0x1b4ca bytes)
14 zero from 0x530b7 to 0x5f28d, (0xc1d6 bytes)
15 mapwrite 0x55e00 thru 0x70d6e (0x1af6f bytes)
16 read 0x2e000 thru 0x38fff (0xb000 bytes)
17 collapse from 0x3f000 to 0x4f000, (0x10000 bytes)
18 copy from 0x28000 to 0x43000, (0x1b000 bytes) at 0x4400
19 collapse from 0x2c000 to 0x45000, (0x19000 bytes)
20 write 0x54a00 thru 0x709ff (0x1c000 bytes)
21 read 0x53000 thru 0x69fff (0x17000 bytes)
22 mapwrite 0x1f200 thru 0x394bb (0x1a2bc bytes)
23 mapread 0x43000 thru 0x5a2d8 (0x172d9 bytes)
24 mapwrite 0x23000 thru 0x38812 (0x15813 bytes)
25 write 0x47800 thru 0x587ff (0x11000 bytes)
26 clone from 0x3000 to 0x12000, (0xf000 bytes) at 0x61000
skipping unsupported clone range
27 read 0x6c000 thru 0x6efff (0x3000 bytes)
28 dedupe from 0x12000 to 0x1e000, (0xc000 bytes) at 0x4000
skipping unsupported dedupe range
29 insert from 0x31000 to 0x33000, (0x2000 bytes)
fallocating to largest ever: 0x49915
30 falloc from 0x2deac to 0x49915 (0x1ba69 bytes)
31 dedupe from 0x6f000 to 0x72000, (0x3000 bytes) at 0x25000
skipping unsupported dedupe range
32 copy from 0x4b000 to 0x64000, (0x19000 bytes) at 0x2800
copy range: 0x4b000 to 0x64000 at 0x2800
do_copy_range:: Resource temporarily unavailable

The ease of reproducing this bug indicates just how badly our
engineering processes have failed here.
They simply haven't been followed at all - we know that if we don't
add test coverage with new features and/or interfaces, then the new
code will be buggy and broken. Yet maintainers still seem to ignore
whether code is testable or even fit for purpose, and merge new code
that has not been and cannot be tested adequately now and into the
future.

I'm not a happy camper right now.

-Dave.

^ permalink raw reply	[flat|nested] 13+ messages in thread
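For reference, the userspace-visible failure above ("Resource temporarily unavailable") is a raw EAGAIN leaking out of a single copy_file_range() call. A careful caller is expected to loop on short copies, but no looping helps when the kernel returns a hard error instead of a short count. A minimal sketch of such a retry loop follows; the helper name is invented for illustration, and this is not fsx's actual do_copy_range():

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy @len bytes from @fd_in/@off_in to @fd_out/@off_out, looping on
 * short copies. Hypothetical helper for illustration only; requires
 * glibc >= 2.27 for the copy_file_range() wrapper.
 */
ssize_t copy_range_all(int fd_in, loff_t off_in, int fd_out, loff_t off_out,
		       size_t len)
{
	size_t total = 0;

	while (total < len) {
		ssize_t n = copy_file_range(fd_in, &off_in, fd_out, &off_out,
					    len - total, 0);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			/* hard error: report progress if we made any */
			return total ? (ssize_t)total : -1;
		}
		if (n == 0)	/* hit EOF on the source */
			break;
		total += n;
	}
	return total;
}
```

Note that this loop only papers over legitimate short copies; the bug described in this thread surfaced EAGAIN outright, which even a caller like this would have to report as a failure.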
* [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
  2018-11-08 22:19 [PATCH 0/2]: dedupe/copy_file_range fixes Dave Chinner
@ 2018-11-08 22:19 ` Dave Chinner
  2018-11-08 23:03   ` Darrick J. Wong
  2018-11-08 22:19 ` [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
  2018-11-09  0:54 ` [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
  2 siblings, 1 reply; 13+ messages in thread

From: Dave Chinner @ 2018-11-08 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

It returns EINVAL when the operation is not supported by the
filesystem. Fix it to return EOPNOTSUPP to be consistent with
the man page and clone_file_range().

Clean up the inconsistent error return handling while I'm there.
(I know, lipstick on a pig, but every little bit helps...)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/read_write.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index bfcb4ced5664..aa43224bcec6 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2094,17 +2094,18 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 	off = same->src_offset;
 	len = same->src_length;
 
-	ret = -EISDIR;
+	if (!file->f_op->remap_file_range)
+		return -EOPNOTSUPP;
+
 	if (S_ISDIR(src->i_mode))
-		goto out;
+		return -EISDIR;
 
-	ret = -EINVAL;
 	if (!S_ISREG(src->i_mode))
-		goto out;
+		return -EINVAL;
 
 	ret = remap_verify_area(file, off, len, false);
 	if (ret < 0)
-		goto out;
+		return ret;
 	ret = 0;
 
 	if (off + len > i_size_read(src))
@@ -2147,10 +2148,8 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 		fdput(dst_fd);
 next_loop:
 		if (fatal_signal_pending(current))
-			goto out;
+			break;
 	}
-
-out:
 	return ret;
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range);
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
  2018-11-08 22:19 ` [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
@ 2018-11-08 23:03   ` Darrick J. Wong
  2018-11-09  6:47     ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread

From: Darrick J. Wong @ 2018-11-08 23:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 09, 2018 at 09:19:08AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It returns EINVAL when the operation is not supported by the
> filesystem. Fix it to return EOPNOTSUPP to be consistent with
> the man page and clone_file_range().
> 
> Clean up the inconsistent error return handling while I'm there.
> (I know, lipstick on a pig, but every little bit helps...)
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok, would rather just shred this ioctl and make a better one. :P

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/read_write.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index bfcb4ced5664..aa43224bcec6 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -2094,17 +2094,18 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>  	off = same->src_offset;
>  	len = same->src_length;
> 
> -	ret = -EISDIR;
> +	if (!file->f_op->remap_file_range)
> +		return -EOPNOTSUPP;
> +
>  	if (S_ISDIR(src->i_mode))
> -		goto out;
> +		return -EISDIR;
> 
> -	ret = -EINVAL;
>  	if (!S_ISREG(src->i_mode))
> -		goto out;
> +		return -EINVAL;
> 
>  	ret = remap_verify_area(file, off, len, false);
>  	if (ret < 0)
> -		goto out;
> +		return ret;
>  	ret = 0;
> 
>  	if (off + len > i_size_read(src))
> @@ -2147,10 +2148,8 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>  		fdput(dst_fd);
> next_loop:
>  		if (fatal_signal_pending(current))
> -			goto out;
> +			break;
>  	}
> -
> -out:
>  	return ret;
> }
> EXPORT_SYMBOL(vfs_dedupe_file_range);
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
  2018-11-08 23:03   ` Darrick J. Wong
@ 2018-11-09  6:47     ` Darrick J. Wong
  2018-11-16  5:33       ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread

From: Darrick J. Wong @ 2018-11-09 6:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Thu, Nov 08, 2018 at 03:03:49PM -0800, Darrick J. Wong wrote:
> On Fri, Nov 09, 2018 at 09:19:08AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > It returns EINVAL when the operation is not supported by the
> > filesystem. Fix it to return EOPNOTSUPP to be consistent with
> > the man page and clone_file_range().
> > 
> > Clean up the inconsistent error return handling while I'm there.
> > (I know, lipstick on a pig, but every little bit helps...)
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Looks ok, would rather just shred this ioctl and make a better one. :P
> 
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> --D
> 
> > ---
> >  fs/read_write.c | 15 +++++++--------
> >  1 file changed, 7 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index bfcb4ced5664..aa43224bcec6 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -2094,17 +2094,18 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
> >  	off = same->src_offset;
> >  	len = same->src_length;
> > 
> > -	ret = -EISDIR;
> > +	if (!file->f_op->remap_file_range)
> > +		return -EOPNOTSUPP;

Minor downside: this causes regressions in generic/158 because
directories and device files don't have a ->remap_file_range
implementation.  I think it can be solved by moving this check after
the !S_ISREG check below.

--D

> > +
> >  	if (S_ISDIR(src->i_mode))
> > -		goto out;
> > +		return -EISDIR;
> > 
> > -	ret = -EINVAL;
> >  	if (!S_ISREG(src->i_mode))
> > -		goto out;
> > +		return -EINVAL;
> > 
> >  	ret = remap_verify_area(file, off, len, false);
> >  	if (ret < 0)
> > -		goto out;
> > +		return ret;
> >  	ret = 0;
> > 
> >  	if (off + len > i_size_read(src))
> > @@ -2147,10 +2148,8 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
> >  		fdput(dst_fd);
> > next_loop:
> >  		if (fatal_signal_pending(current))
> > -			goto out;
> > +			break;
> >  	}
> > -
> > -out:
> >  	return ret;
> > }
> > EXPORT_SYMBOL(vfs_dedupe_file_range);
> > -- 
> > 2.19.1
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP
  2018-11-09  6:47     ` Darrick J. Wong
@ 2018-11-16  5:33       ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread

From: Dave Chinner @ 2018-11-16 5:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel

On Thu, Nov 08, 2018 at 10:47:13PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 08, 2018 at 03:03:49PM -0800, Darrick J. Wong wrote:
> > On Fri, Nov 09, 2018 at 09:19:08AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > It returns EINVAL when the operation is not supported by the
> > > filesystem. Fix it to return EOPNOTSUPP to be consistent with
> > > the man page and clone_file_range().
> > > 
> > > Clean up the inconsistent error return handling while I'm there.
> > > (I know, lipstick on a pig, but every little bit helps...)
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > 
> > Looks ok, would rather just shred this ioctl and make a better one. :P
> > 
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > --D
> > 
> > > ---
> > >  fs/read_write.c | 15 +++++++--------
> > >  1 file changed, 7 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > index bfcb4ced5664..aa43224bcec6 100644
> > > --- a/fs/read_write.c
> > > +++ b/fs/read_write.c
> > > @@ -2094,17 +2094,18 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
> > >  	off = same->src_offset;
> > >  	len = same->src_length;
> > > 
> > > -	ret = -EISDIR;
> > > +	if (!file->f_op->remap_file_range)
> > > +		return -EOPNOTSUPP;
> 
> Minor downside: this causes regressions in generic/158 because
> directories and device files don't have a ->remap_file_range
> implementation.  I think it can be solved by moving this check after
> the !S_ISREG check below.

Done.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill
  2018-11-08 22:19 [PATCH 0/2]: dedupe/copy_file_range fixes Dave Chinner
  2018-11-08 22:19 ` [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
@ 2018-11-08 22:19 ` Dave Chinner
  2018-11-13 16:25   ` Darrick J. Wong
  2018-11-15 10:15   ` Christoph Hellwig
  2018-11-09  0:54 ` [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
  2 siblings, 2 replies; 13+ messages in thread

From: Dave Chinner @ 2018-11-08 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When doing direct IO to a pipe for do_splice_direct(), the pipe is
trivial to fill up and overflow as it can only hold 16 pages. At
this point bio_iov_iter_get_pages() then returns -EFAULT, and we
abort the IO submission process. Unfortunately, iomap_dio_rw()
propagates the error back up the stack.

The error is converted from the EFAULT to EAGAIN in
generic_file_splice_read() to tell the splice layers that the pipe
is full. do_splice_direct() completely fails to handle EAGAIN errors
(it aborts on error) and returns EAGAIN to the caller.

copy_file_write() then compeltely fails to handle EAGAIN as well,
and so returns EAGAIN to userspace, having failed to copy the data
it was asked to.

Avoid this whole steaming pile of fail by having iomap_dio_rw()
silently swallow EFAULT errors and so do short reads.

To make matters worse, iomap_dio_actor() has a stale data exposure
bug when bio_iov_iter_get_pages() fails - it does not zero the tail
of the block that may have been left uncovered by the partial IO.
Fix the error handling case to drop to the sub-block zeroing rather
than immediately returning the -EFAULT error.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 7aacd48c593e..2fda13bc37d8 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1775,7 +1775,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct bio *bio;
 	bool need_zeroout = false;
 	bool use_fua = false;
-	int nr_pages, ret;
+	int nr_pages, ret = 0;
 	size_t copied = 0;
 
 	if ((pos | length | align) & ((1 << blkbits) - 1))
@@ -1839,8 +1839,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		ret = bio_iov_iter_get_pages(bio, &iter);
 		if (unlikely(ret)) {
+			/*
+			 * We have to stop part way through an IO. We must fall through
+			 * to the sub-block tail zeroing here, otherwise this
+			 * short IO may expose stale data in the tail of the
+			 * block we haven't written data to.
+			 */
 			bio_put(bio);
-			return copied ? copied : ret;
+			goto zero_tail;
 		}
 
 		n = bio->bi_iter.bi_size;
@@ -1877,6 +1883,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	 * the block tail in the latter case, we can expose stale data via mmap
 	 * reads of the EOF block.
 	 */
+zero_tail:
 	if (need_zeroout ||
 	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
 		/* zero out from the end of the write to the end of the block */
@@ -1884,7 +1891,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		if (pad)
 			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
 	}
-	return copied;
+	return copied ? copied : ret;
 }
 
 static loff_t
@@ -2079,6 +2086,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			dio->wait_for_completion = true;
 			ret = 0;
 		}
+
+		/*
+		 * splicing to pipes can fail on a full pipe. We have to
+		 * swallow this to make it look like a short IO
+		 * otherwise the higher splice layers will completely
+		 * mishandle the error and stop moving data.
+		 */
+		if (ret == -EFAULT)
+			ret = 0;
 		break;
 	}
 	pos += ret;
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
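The pipe-capacity limit at the heart of this patch is easy to observe from userspace: splicing a large file into an empty pipe transfers at most one pipe's worth of data and returns a short count, which is exactly the behaviour the patch makes the DIO path emulate by swallowing -EFAULT. A small sketch (the helper name is invented for illustration; this is userspace code, not kernel code):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Splice up to @want bytes from @src_fd into a freshly created pipe and
 * report how much actually moved. With the default 16-page pipe, asking
 * for more than 64KiB yields a short transfer, not an error - the
 * caller is expected to loop.
 */
ssize_t splice_once(int src_fd, size_t want)
{
	int pfd[2];
	ssize_t n;

	if (pipe(pfd) < 0)
		return -1;
	n = splice(src_fd, NULL, pfd[1], NULL, want, 0);
	close(pfd[0]);
	close(pfd[1]);
	return n;
}
```

With a 256KiB source file, splice_once(fd, 262144) typically moves only a single pipe's capacity and returns the short count; higher layers like do_splice_direct() loop to drain and refill the pipe.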
* Re: [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill
  2018-11-08 22:19 ` [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
@ 2018-11-13 16:25   ` Darrick J. Wong
  2018-11-15 10:15   ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread

From: Darrick J. Wong @ 2018-11-13 16:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 09, 2018 at 09:19:09AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When doing direct IO to a pipe for do_splice_direct(), the pipe is
> trivial to fill up and overflow as it can only hold 16 pages. At
> this point bio_iov_iter_get_pages() then returns -EFAULT, and we
> abort the IO submission process. Unfortunately, iomap_dio_rw()
> propagates the error back up the stack.
> 
> The error is converted from the EFAULT to EAGAIN in
> generic_file_splice_read() to tell the splice layers that the pipe
> is full. do_splice_direct() completely fails to handle EAGAIN errors
> (it aborts on error) and returns EAGAIN to the caller.
> 
> copy_file_write() then compeltely fails to handle EAGAIN as well,

"completely"...

> and so returns EAGAIN to userspace, having failed to copy the data
> it was asked to.
> 
> Avoid this whole steaming pile of fail by having iomap_dio_rw()
> silently swallow EFAULT errors and so do short reads.
> 
> To make matters worse, iomap_dio_actor() has a stale data exposure
> bug when bio_iov_iter_get_pages() fails - it does not zero the tail
> of the block that may have been left uncovered by the partial IO.
> Fix the error handling case to drop to the sub-block zeroing rather
> than immediately returning the -EFAULT error.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Other than that it looks ok and I'll give it a spin,

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 7aacd48c593e..2fda13bc37d8 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1775,7 +1775,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	struct bio *bio;
>  	bool need_zeroout = false;
>  	bool use_fua = false;
> -	int nr_pages, ret;
> +	int nr_pages, ret = 0;
>  	size_t copied = 0;
> 
>  	if ((pos | length | align) & ((1 << blkbits) - 1))
> @@ -1839,8 +1839,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> 
>  		ret = bio_iov_iter_get_pages(bio, &iter);
>  		if (unlikely(ret)) {
> +			/*
> +			 * We have to stop part way through an IO. We must fall through
> +			 * to the sub-block tail zeroing here, otherwise this
> +			 * short IO may expose stale data in the tail of the
> +			 * block we haven't written data to.
> +			 */
>  			bio_put(bio);
> -			return copied ? copied : ret;
> +			goto zero_tail;
>  		}
> 
>  		n = bio->bi_iter.bi_size;
> @@ -1877,6 +1883,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	 * the block tail in the latter case, we can expose stale data via mmap
>  	 * reads of the EOF block.
>  	 */
> +zero_tail:
>  	if (need_zeroout ||
>  	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>  		/* zero out from the end of the write to the end of the block */
> @@ -1884,7 +1891,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		if (pad)
>  			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
>  	}
> -	return copied;
> +	return copied ? copied : ret;
>  }
> 
>  static loff_t
> @@ -2079,6 +2086,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			dio->wait_for_completion = true;
>  			ret = 0;
>  		}
> +
> +		/*
> +		 * splicing to pipes can fail on a full pipe. We have to
> +		 * swallow this to make it look like a short IO
> +		 * otherwise the higher splice layers will completely
> +		 * mishandle the error and stop moving data.
> +		 */
> +		if (ret == -EFAULT)
> +			ret = 0;
>  		break;
>  	}
>  	pos += ret;
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill
  2018-11-08 22:19 ` [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
  2018-11-13 16:25   ` Darrick J. Wong
@ 2018-11-15 10:15   ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread

From: Christoph Hellwig @ 2018-11-15 10:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 09, 2018 at 09:19:09AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When doing direct IO to a pipe for do_splice_direct(), the pipe is
> trivial to fill up and overflow as it can only hold 16 pages. At
> this point bio_iov_iter_get_pages() then returns -EFAULT, and we
> abort the IO submission process. Unfortunately, iomap_dio_rw()
> propagates the error back up the stack.
> 
> The error is converted from the EFAULT to EAGAIN in
> generic_file_splice_read() to tell the splice layers that the pipe
> is full. do_splice_direct() completely fails to handle EAGAIN errors
> (it aborts on error) and returns EAGAIN to the caller.
> 
> copy_file_write() then compeltely fails to handle EAGAIN as well,
> and so returns EAGAIN to userspace, having failed to copy the data
> it was asked to.
> 
> Avoid this whole steaming pile of fail by having iomap_dio_rw()
> silently swallow EFAULT errors and so do short reads.
> 
> To make matters worse, iomap_dio_actor() has a stale data exposure
> bug when bio_iov_iter_get_pages() fails - it does not zero the tail
> of the block that may have been left uncovered by the partial IO.
> Fix the error handling case to drop to the sub-block zeroing rather
> than immediately returning the -EFAULT error.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 7aacd48c593e..2fda13bc37d8 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1775,7 +1775,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	struct bio *bio;
>  	bool need_zeroout = false;
>  	bool use_fua = false;
> -	int nr_pages, ret;
> +	int nr_pages, ret = 0;
>  	size_t copied = 0;
> 
>  	if ((pos | length | align) & ((1 << blkbits) - 1))
> @@ -1839,8 +1839,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
> 
>  		ret = bio_iov_iter_get_pages(bio, &iter);
>  		if (unlikely(ret)) {
> +			/*
> +			 * We have to stop part way through an IO. We must fall through

Overly long line.

> +			 * to the sub-block tail zeroing here, otherwise this
> +			 * short IO may expose stale data in the tail of the
> +			 * block we haven't written data to.
> +			 */
>  			bio_put(bio);
> -			return copied ? copied : ret;
> +			goto zero_tail;
>  		}
> 
>  		n = bio->bi_iter.bi_size;
> @@ -1877,6 +1883,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	 * the block tail in the latter case, we can expose stale data via mmap
>  	 * reads of the EOF block.
>  	 */
> +zero_tail:
>  	if (need_zeroout ||
>  	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>  		/* zero out from the end of the write to the end of the block */
> @@ -1884,7 +1891,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		if (pad)
>  			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
>  	}
> -	return copied;
> +	return copied ? copied : ret;
>  }
> 
>  static loff_t
> @@ -2079,6 +2086,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			dio->wait_for_completion = true;
>  			ret = 0;
>  		}
> +
> +		/*
> +		 * splicing to pipes can fail on a full pipe. We have to

Please start comments that contain full sentences with an uppercase
letter.

These nitpicks aside this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor()
  2018-11-08 22:19 [PATCH 0/2]: dedupe/copy_file_range fixes Dave Chinner
  2018-11-08 22:19 ` [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
  2018-11-08 22:19 ` [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
@ 2018-11-09  0:54 ` Dave Chinner
  2018-11-13 16:22   ` Darrick J. Wong
  2018-11-15 10:17   ` Christoph Hellwig
  2 siblings, 2 replies; 13+ messages in thread

From: Dave Chinner @ 2018-11-09 0:54 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When copy_file_range() is called on files that have been opened with
O_DIRECT, do_splice_direct() does a manual copy of the range one
pipe buffer at a time. The default is 16 pages, which means on
x86_64 it is limited to 64kB IO. This is extremely slow - 64k
synchronous read/write will run at maybe 5-10MB/s on a spinning disk
and be seek bound. It will be faster on SSDs, but still very
inefficient.

Increase the pipe size to the maximum allowed user size so that we
can get decent throughput for this highly sub-optimal copy loop. Add
a new function to the pipe code that lets us set the pipe size to
the maximum allowed without root permissions to keep things really
simple. We also don't care if changing the pipe size fails - that
will just result in a slower copy.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/pipe.c                 | 10 ++++++++++
 fs/splice.c               |  7 +++++++
 include/linux/pipe_fs_i.h |  1 +
 3 files changed, 18 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index bdc5d3c0977d..436bc0464569 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1109,6 +1109,16 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 	return ret;
 }
 
+/*
+ * Set the pipe to the maximum allowable user size. Advisory only, will
+ * swallow any errors and return the resultant pipe size.
+ */
+long pipe_set_max_safe_size(struct pipe_inode_info *pipe)
+{
+	pipe_set_size(pipe, pipe_max_size);
+	return pipe->buffers * PAGE_SIZE;
+}
+
 /*
  * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
  * location, so checking ->i_pipe is not enough to verify that this is a
diff --git a/fs/splice.c b/fs/splice.c
index 3553f1956508..9749139da731 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -931,6 +931,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		current->splice_pipe = pipe;
 	}
 
+	/*
+	 * Try to increase the data holding capacity of the pipe so we can do
+	 * larger IOs. This may not increase the size at all because maximum
+	 * user pipe size is administrator controlled, but we still should try.
+	 */
+	pipe_set_max_safe_size(pipe);
+
 	/*
 	 * Do the splice.
 	 */
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 5a3bb3b7c9ad..962ba4cfcb74 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -191,5 +191,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file);
 
 int create_pipe_files(struct file **, int);
 unsigned int round_pipe_size(unsigned long size);
+long pipe_set_max_safe_size(struct pipe_inode_info *pipe);
 
 #endif

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor()
  2018-11-09  0:54 ` [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
@ 2018-11-13 16:22   ` Darrick J. Wong
  2018-11-13 21:41     ` Dave Chinner
  2018-11-15 10:17   ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread

From: Darrick J. Wong @ 2018-11-13 16:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Fri, Nov 09, 2018 at 11:54:10AM +1100, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> When copy_file_range() is called on files that have been opened with
> O_DIRECT, do_splice_direct() does a manual copy of the range one
> pipe buffer at a time. The default is 16 pages, which means on
> x86_64 it is limited to 64kB IO. This is extremely slow - 64k
> synchronous read/write will run at maybe 5-10MB/s on a spinning disk
> and be seek bound. It will be faster on SSDs, but still very
> inefficient.
> 
> Increase the pipe size to the maximum allowed user size so that we
> can get decent throughput for this highly sub-optimal copy loop. Add
> a new function to the pipe code that lets us set the pipe size to
> the maximum allowed without root permissions to keep things really
> simple. We also don't care if changing the pipe size fails - that
> will just result in a slower copy.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/pipe.c                 | 10 ++++++++++
>  fs/splice.c               |  7 +++++++
>  include/linux/pipe_fs_i.h |  1 +
>  3 files changed, 18 insertions(+)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index bdc5d3c0977d..436bc0464569 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1109,6 +1109,16 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
>  	return ret;
>  }
> 
> +/*
> + * Set the pipe to the maximum allowable user size. Advisory only, will
> + * swallow any errors and return the resultant pipe size.
> + */
> +long pipe_set_max_safe_size(struct pipe_inode_info *pipe)
> +{
> +	pipe_set_size(pipe, pipe_max_size);
> +	return pipe->buffers * PAGE_SIZE;
> +}
> +
>  /*
>   * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
>   * location, so checking ->i_pipe is not enough to verify that this is a
> diff --git a/fs/splice.c b/fs/splice.c
> index 3553f1956508..9749139da731 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -931,6 +931,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
>  		current->splice_pipe = pipe;
>  	}
> 
> +	/*
> +	 * Try to increase the data holding capacity of the pipe so we can do
> +	 * larger IOs. This may not increase the size at all because maximum
> +	 * user pipe size is administrator controlled, but we still should try.
> +	 */
> +	pipe_set_max_safe_size(pipe);

I get where you're going with this, but I have two questions:

- Is it safe to be enlarging the pipe buffer size unconditionally?

- Especially if we didn't just create the splice pipe?  Suppose someone
  comes along later trying to splice things and doesn't realize the pipe
  is now 1MB...

Then I started wondering about the splice_pipe lifetime and couldn't
figure out if it ever gets detached from current prior to do_exit.
I don't think it does, which means that we're stuck with the 1MB
kernel memory allocation until the process dies.

--D

> +
>  	/*
>  	 * Do the splice.
>  	 */
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index 5a3bb3b7c9ad..962ba4cfcb74 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -191,5 +191,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file);
> 
>  int create_pipe_files(struct file **, int);
>  unsigned int round_pipe_size(unsigned long size);
> +long pipe_set_max_safe_size(struct pipe_inode_info *pipe);
> 
>  #endif

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor() 2018-11-13 16:22 ` Darrick J. Wong @ 2018-11-13 21:41 ` Dave Chinner 0 siblings, 0 replies; 13+ messages in thread From: Dave Chinner @ 2018-11-13 21:41 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel On Tue, Nov 13, 2018 at 08:22:37AM -0800, Darrick J. Wong wrote: > On Fri, Nov 09, 2018 at 11:54:10AM +1100, Dave Chinner wrote: > > > > From: Dave Chinner <dchinner@redhat.com> > > > > When copy_file_range() is called on files that have been opened with > > O_DIRECT, do_splice_direct() does a manual copy of the range one > > pipe buffer at a time. The default is 16 pages, which means on > > x86_64 it is limited to 64kB IO. This is extremely slow - 64k > > synchrnous read/write will run at maybe 5-10MB/s on a spinning disk > > and be seek bound. It will be faster on SSDs, but still very > > inefficient. > > > > Increase the pipe size to the maximum allowed user size so that we > > can get decent throughput for this highly sub-optimal copy loop. Add > > a new function to the pipe code that lets us set the pipe size to > > the maximum allowed without root permissions to keep things really > > simple. We also don't care if changing the pipe size fails - that > > will just result in a slower copy. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/pipe.c | 10 ++++++++++ > > fs/splice.c | 7 +++++++ > > include/linux/pipe_fs_i.h | 1 + > > 3 files changed, 18 insertions(+) > > > > diff --git a/fs/pipe.c b/fs/pipe.c > > index bdc5d3c0977d..436bc0464569 100644 > > --- a/fs/pipe.c > > +++ b/fs/pipe.c > > @@ -1109,6 +1109,16 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) > > return ret; > > } > > > > +/* > > + * Set the pipe to the maximum allowable user size. Advisory only, will > > + * swallow any errors and return the resultant pipe size. 
> > + */
> > +long pipe_set_max_safe_size(struct pipe_inode_info *pipe)
> > +{
> > +	pipe_set_size(pipe, pipe_max_size);
> > +	return pipe->buffers * PAGE_SIZE;
> > +}
> > +
> >  /*
> >   * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
> >   * location, so checking ->i_pipe is not enough to verify that this is a
> > diff --git a/fs/splice.c b/fs/splice.c
> > index 3553f1956508..9749139da731 100644
> > --- a/fs/splice.c
> > +++ b/fs/splice.c
> > @@ -931,6 +931,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
> >  		current->splice_pipe = pipe;
> >  	}
> >
> > +	/*
> > +	 * Try to increase the data holding capacity of the pipe so we can do
> > +	 * larger IOs. This may not increase the size at all because maximum
> > +	 * user pipe size is administrator controlled, but we still should try.
> > +	 */
> > +	pipe_set_max_safe_size(pipe);
>
> I get where you're going with this, but I have two questions:
>
> - Is it safe to be enlarging the pipe buffer size unconditionally?

Don't see why it would be unsafe.

> - Especially if we didn't just create the splice pipe?  Suppose someone
>   comes along later trying to splice things and doesn't realize the pipe
>   is now 1MB...

The splice code is supposed to handle arbitrary pipe sizes correctly.
If something breaks because it has assumptions about how much data a
pipe can hold, it's already broken.

> Then I started wondering about the splice_pipe lifetime and couldn't
> figure out if it ever gets detached from current prior to do_exit.
> I don't think it does, which means that we're stuck with the 1MB
> kernel memory allocation until the process dies.

If you are using do_splice_direct(), you either have a short term
process (i.e. a cp type utility) or you are moving bulk data around,
in which case 1MB of extra memory isn't a big deal.

And given that the default pipe size is dependent on PAGE_SIZE (i.e.
the default is 16 pages, not 64kB), on 64k page architectures we are
already using pipes of 1MB capacity by default.

I could make this contingent on O_DIRECT, but then we have the problem
that 64k pipes aren't big enough for efficient buffered IO with 64k
block size filesystems, either. IOWs, the pipe size in
do_splice_direct() needs to be increased whichever way we look at it.

This all said, I really think we need to implement our own
->copy_file_range() code that uses iomap to iterate data-only extents
(i.e. hole-preserving copying) and does well formed IO. IOWs, only
fall back to do_splice_direct() when doing copies to/from non-XFS
filesystems.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor()
  2018-11-09  0:54 ` [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
  2018-11-13 16:22   ` Darrick J. Wong
@ 2018-11-15 10:17   ` Christoph Hellwig
  2018-11-16  5:42     ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2018-11-15 10:17 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

> +long pipe_set_max_safe_size(struct pipe_inode_info *pipe)
> +{
> +	pipe_set_size(pipe, pipe_max_size);
> +	return pipe->buffers * PAGE_SIZE;
> +}

This should probably return an unsigned value, given that we don't
return errors. Then again the caller ignores the return value
entirely. Wouldn't it be easier to just call pipe_set_size from
splice.c after removing the static marker?

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor()
  2018-11-15 10:17   ` Christoph Hellwig
@ 2018-11-16  5:42     ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2018-11-16  5:42 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Thu, Nov 15, 2018 at 02:17:35AM -0800, Christoph Hellwig wrote:
> > +long pipe_set_max_safe_size(struct pipe_inode_info *pipe)
> > +{
> > +	pipe_set_size(pipe, pipe_max_size);
> > +	return pipe->buffers * PAGE_SIZE;
> > +}
>
> This should probably return an unsigned value, given that we don't return
> errors. Then again the callers ignores the return value entirely.
> Wouldn't it be easier to just call pipe_set_size from splice.c after
> removing the static marker?

I didn't want to use pipe_max_size outside of the pipe code. It's
definitely simpler without a wrapper if using pipe_max_size directly
is acceptable.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread
end of thread, other threads:[~2018-11-16 15:53 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-08 22:19 [PATCH 0/2]: dedupe/copy_file_range fixes Dave Chinner
2018-11-08 22:19 ` [PATCH 1/2] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
2018-11-08 23:03   ` Darrick J. Wong
2018-11-09  6:47     ` Darrick J. Wong
2018-11-16  5:33       ` Dave Chinner
2018-11-08 22:19 ` [PATCH 2/2] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
2018-11-13 16:25   ` Darrick J. Wong
2018-11-15 10:15     ` Christoph Hellwig
2018-11-09  0:54 ` [PATCH 3/2] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
2018-11-13 16:22   ` Darrick J. Wong
2018-11-13 21:41     ` Dave Chinner
2018-11-15 10:17   ` Christoph Hellwig
2018-11-16  5:42     ` Dave Chinner