[PATCH v4 00/25] fs: fixes for serious clone/dedupe problems

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems
@ 2018-10-13  0:05 Darrick J. Wong
  2018-10-13  0:05 ` [PATCH 01/25] xfs: add a per-xfs trace_printk macro Darrick J. Wong
                   ` (24 more replies)
  0 siblings, 25 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:05 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

Hi all,

Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
reflink implementation, and tracked it down to reflink forgetting to do
some of the file-extending activities that must happen for regular
writes.

We then started auditing the clone, dedupe, and copyfile code and
realized that from a file contents perspective, clonerange isn't any
different from a regular file write.  Unfortunately, we also noticed
that *unlike* a regular write, clonerange skips a ton of overflow
checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
and RLIMIT_FSIZE.  We also observed that cloning into a file did not
strip security privileges (suid, capabilities) like a regular write
would.  I also noticed that xfs and ocfs2 need to dump the page cache
before remapping blocks, not after.

In fixing the range checking problems I also realized that both dedupe
and copyfile tell userspace how much of the requested operation was
acted upon.  Since the range validation can shorten a clone request (or
we can ENOSPC midway through), we might as well plumb the short
operation reporting back through the VFS indirection code to userspace.

So, here's the whole giant pile of patches[1] that fix all the problems.
This branch is against current upstream (4.19-rc7+).  The patch
"generic: test reflink side effects" recently sent to fstests exercises
the fixes in this series.  Tests are in [2].

--D

[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 01/25] xfs: add a per-xfs trace_printk macro
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
@ 2018-10-13  0:05 ` Darrick J. Wong
  2018-10-13  0:05 ` [PATCH 02/25] vfs: vfs_clone_file_prep_inodes should return EINVAL for a clone from beyond EOF Darrick J. Wong
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:05 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Add a "xfs_tprintk" macro so that developers can use trace_printk to
print out arbitrary debugging information with the XFS device name
attached to the trace output.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_error.h |    6 ++++++
 1 file changed, 6 insertions(+)


diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 246d3e989c6c..5caa8bdf6c38 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -76,6 +76,11 @@ extern int xfs_errortag_set(struct xfs_mount *mp, unsigned int error_tag,
 		unsigned int tag_value);
 extern int xfs_errortag_add(struct xfs_mount *mp, unsigned int error_tag);
 extern int xfs_errortag_clearall(struct xfs_mount *mp);
+
+/* trace printk version of xfs_err and friends */
+#define xfs_tprintk(mp, fmt, args...) \
+	trace_printk("dev %d:%d " fmt, MAJOR((mp)->m_super->s_dev), \
+			MINOR((mp)->m_super->s_dev), ##args)
 #else
 #define xfs_errortag_init(mp)			(0)
 #define xfs_errortag_del(mp)
@@ -83,6 +88,7 @@ extern int xfs_errortag_clearall(struct xfs_mount *mp);
 #define xfs_errortag_set(mp, tag, val)		(ENOSYS)
 #define xfs_errortag_add(mp, tag)		(ENOSYS)
 #define xfs_errortag_clearall(mp)		(ENOSYS)
+#define xfs_tprintk(mp, fmt, args...)		do { } while (0)
 #endif /* DEBUG */
 
 /*


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 02/25] vfs: vfs_clone_file_prep_inodes should return EINVAL for a clone from beyond EOF
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
  2018-10-13  0:05 ` [PATCH 01/25] xfs: add a per-xfs trace_printk macro Darrick J. Wong
@ 2018-10-13  0:05 ` Darrick J. Wong
  2018-10-13  0:06 ` [PATCH 03/25] vfs: check file ranges before cloning files Darrick J. Wong
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:05 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, Christoph Hellwig,
	ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

vfs_clone_file_prep_inodes cannot return 0 if it is asked to remap from
a zero byte file because that's what btrfs does.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c |    3 ---
 1 file changed, 3 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 8a2737f0d61d..260797b01851 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1740,10 +1740,7 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
 		return -EINVAL;
 
-	/* Are we going all the way to the end? */
 	isize = i_size_read(inode_in);
-	if (isize == 0)
-		return 0;
 
 	/* Zero length dedupe exits immediately; reflink goes to EOF. */
 	if (*len == 0) {


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 03/25] vfs: check file ranges before cloning files
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
  2018-10-13  0:05 ` [PATCH 01/25] xfs: add a per-xfs trace_printk macro Darrick J. Wong
  2018-10-13  0:05 ` [PATCH 02/25] vfs: vfs_clone_file_prep_inodes should return EINVAL for a clone from beyond EOF Darrick J. Wong
@ 2018-10-13  0:06 ` Darrick J. Wong
  2018-10-13  0:06 ` [PATCH 04/25] vfs: strengthen checking of file range inputs to generic_remap_checks Darrick J. Wong
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:06 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	Christoph Hellwig, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Move the file range checks from vfs_clone_file_prep into a separate
generic_remap_checks function so that all the checks are collected in a
central location.  This forms the basis for adding more checks from
generic_write_checks that will make cloning's input checking more
consistent with write input checking.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/ocfs2/refcounttree.c |    2 +
 fs/read_write.c         |   55 +++++++++----------------------------
 fs/xfs/xfs_reflink.c    |    2 +
 include/linux/fs.h      |    9 ++++--
 mm/filemap.c            |   69 +++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 90 insertions(+), 47 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 7a5ee145c733..19e03936c5e1 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4850,7 +4850,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
 		goto out_unlock;
 
-	ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out,
+	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
 			&len, is_dedupe);
 	if (ret <= 0)
 		goto out_unlock;
diff --git a/fs/read_write.c b/fs/read_write.c
index 260797b01851..d6e8e242a15f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1717,13 +1717,12 @@ static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
  * Returns: 0 for "nothing to clone", 1 for "something to clone", or
  * the usual negative error code.
  */
-int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
-			       struct inode *inode_out, loff_t pos_out,
-			       u64 *len, bool is_dedupe)
+int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
+			struct file *file_out, loff_t pos_out,
+			u64 *len, bool is_dedupe)
 {
-	loff_t bs = inode_out->i_sb->s_blocksize;
-	loff_t blen;
-	loff_t isize;
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
 	bool same_inode = (inode_in == inode_out);
 	int ret;
 
@@ -1740,10 +1739,10 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
 		return -EINVAL;
 
-	isize = i_size_read(inode_in);
-
 	/* Zero length dedupe exits immediately; reflink goes to EOF. */
 	if (*len == 0) {
+		loff_t isize = i_size_read(inode_in);
+
 		if (is_dedupe || pos_in == isize)
 			return 0;
 		if (pos_in > isize)
@@ -1751,36 +1750,11 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 		*len = isize - pos_in;
 	}
 
-	/* Ensure offsets don't wrap and the input is inside i_size */
-	if (pos_in + *len < pos_in || pos_out + *len < pos_out ||
-	    pos_in + *len > isize)
-		return -EINVAL;
-
-	/* Don't allow dedupe past EOF in the dest file */
-	if (is_dedupe) {
-		loff_t	disize;
-
-		disize = i_size_read(inode_out);
-		if (pos_out >= disize || pos_out + *len > disize)
-			return -EINVAL;
-	}
-
-	/* If we're linking to EOF, continue to the block boundary. */
-	if (pos_in + *len == isize)
-		blen = ALIGN(isize, bs) - pos_in;
-	else
-		blen = *len;
-
-	/* Only reflink if we're aligned to block boundaries */
-	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
-	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
-		return -EINVAL;
-
-	/* Don't allow overlapped reflink within the same file */
-	if (same_inode) {
-		if (pos_out + blen > pos_in && pos_out < pos_in + blen)
-			return -EINVAL;
-	}
+	/* Check that we don't violate system file offset limits. */
+	ret = generic_remap_checks(file_in, pos_in, file_out, pos_out, len,
+			is_dedupe);
+	if (ret)
+		return ret;
 
 	/* Wait for the completion of any pending IOs on both files */
 	inode_dio_wait(inode_in);
@@ -1813,7 +1787,7 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 
 	return 1;
 }
-EXPORT_SYMBOL(vfs_clone_file_prep_inodes);
+EXPORT_SYMBOL(vfs_clone_file_prep);
 
 int do_clone_file_range(struct file *file_in, loff_t pos_in,
 			struct file *file_out, loff_t pos_out, u64 len)
@@ -1851,9 +1825,6 @@ int do_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	if (pos_in + len > i_size_read(inode_in))
-		return -EINVAL;
-
 	ret = file_in->f_op->clone_file_range(file_in, pos_in,
 			file_out, pos_out, len);
 	if (!ret) {
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 42ea7bab9144..281d5f53f2ec 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1326,7 +1326,7 @@ xfs_reflink_remap_prep(
 	if (IS_DAX(inode_in) || IS_DAX(inode_out))
 		goto out_unlock;
 
-	ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out,
+	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
 			len, is_dedupe);
 	if (ret <= 0)
 		goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 897eae8faee1..ba93a6e7dac4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1825,9 +1825,9 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
-extern int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
-				      struct inode *inode_out, loff_t pos_out,
-				      u64 *len, bool is_dedupe);
+extern int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
+			       struct file *file_out, loff_t pos_out,
+			       u64 *count, bool is_dedupe);
 extern int do_clone_file_range(struct file *file_in, loff_t pos_in,
 			       struct file *file_out, loff_t pos_out, u64 len);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
@@ -2967,6 +2967,9 @@ extern int sb_min_blocksize(struct super_block *, int);
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
+extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				uint64_t *count, bool is_dedupe);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 52517f28e6f4..47e6bfd45a91 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2974,6 +2974,75 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+/*
+ * Performs necessary checks before doing a clone.
+ *
+ * Can adjust amount of bytes to clone.
+ * Returns appropriate error code that caller should return or
+ * zero in case the clone should be allowed.
+ */
+int generic_remap_checks(struct file *file_in, loff_t pos_in,
+			 struct file *file_out, loff_t pos_out,
+			 uint64_t *req_count, bool is_dedupe)
+{
+	struct inode *inode_in = file_in->f_mapping->host;
+	struct inode *inode_out = file_out->f_mapping->host;
+	uint64_t count = *req_count;
+	uint64_t bcount;
+	loff_t size_in, size_out;
+	loff_t bs = inode_out->i_sb->s_blocksize;
+
+	/* The start of both ranges must be aligned to an fs block. */
+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs))
+		return -EINVAL;
+
+	/* Ensure offsets don't wrap. */
+	if (pos_in + count < pos_in || pos_out + count < pos_out)
+		return -EINVAL;
+
+	size_in = i_size_read(inode_in);
+	size_out = i_size_read(inode_out);
+
+	/* Dedupe requires both ranges to be within EOF. */
+	if (is_dedupe &&
+	    (pos_in >= size_in || pos_in + count > size_in ||
+	     pos_out >= size_out || pos_out + count > size_out))
+		return -EINVAL;
+
+	/* Ensure the infile range is within the infile. */
+	if (pos_in >= size_in)
+		return -EINVAL;
+	count = min(count, size_in - (uint64_t)pos_in);
+
+	/*
+	 * If the user wanted us to link to the infile's EOF, round up to the
+	 * next block boundary for this check.
+	 *
+	 * Otherwise, make sure the count is also block-aligned, having
+	 * already confirmed the starting offsets' block alignment.
+	 */
+	if (pos_in + count == size_in) {
+		bcount = ALIGN(size_in, bs) - pos_in;
+	} else {
+		if (!IS_ALIGNED(count, bs))
+			return -EINVAL;
+
+		bcount = count;
+	}
+
+	/* Don't allow overlapped cloning within the same file. */
+	if (inode_in == inode_out &&
+	    pos_out + bcount > pos_in &&
+	    pos_out < pos_in + bcount)
+		return -EINVAL;
+
+	/* For now we don't support changing the length. */
+	if (*req_count != count)
+		return -EINVAL;
+
+	return 0;
+}
+
 int pagecache_write_begin(struct file *file, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
 				struct page **pagep, void **fsdata)


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 04/25] vfs: strengthen checking of file range inputs to generic_remap_checks
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (2 preceding siblings ...)
  2018-10-13  0:06 ` [PATCH 03/25] vfs: check file ranges before cloning files Darrick J. Wong
@ 2018-10-13  0:06 ` Darrick J. Wong
  2018-10-13  0:06 ` [PATCH 05/25] vfs: avoid problematic remapping requests into partial EOF block Darrick J. Wong
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:06 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	Christoph Hellwig, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

File range remapping, if allowed to run past the destination file's EOF,
is an optimization on a regular file write.  Regular file writes that
extend the file length are subject to various constraints which are not
checked by range cloning.

This is a correctness problem because we're never allowed to touch
ranges that the page cache can't support (s_maxbytes); we're not
supposed to deal with large offsets (MAX_NON_LFS) if O_LARGEFILE isn't
set; and we must obey resource limits (RLIMIT_FSIZE).

Therefore, add these checks to the new generic_remap_checks function so
that we curtail unexpected behavior.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 mm/filemap.c |   91 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 59 insertions(+), 32 deletions(-)


diff --git a/mm/filemap.c b/mm/filemap.c
index 47e6bfd45a91..08ad210fee49 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2915,6 +2915,49 @@ struct page *read_cache_page_gfp(struct address_space *mapping,
 }
 EXPORT_SYMBOL(read_cache_page_gfp);
 
+static int generic_access_check_limits(struct file *file, loff_t pos,
+				       loff_t *count)
+{
+	struct inode *inode = file->f_mapping->host;
+
+	/* Don't exceed the LFS limits. */
+	if (unlikely(pos + *count > MAX_NON_LFS &&
+				!(file->f_flags & O_LARGEFILE))) {
+		if (pos >= MAX_NON_LFS)
+			return -EFBIG;
+		*count = min(*count, (loff_t)MAX_NON_LFS - pos);
+	}
+
+	/*
+	 * Don't operate on ranges the page cache doesn't support.
+	 *
+	 * If we have written data it becomes a short write.  If we have
+	 * exceeded without writing data we send a signal and return EFBIG.
+	 * Linus frestrict idea will clean these up nicely..
+	 */
+	if (unlikely(pos >= inode->i_sb->s_maxbytes))
+		return -EFBIG;
+
+	*count = min(*count, inode->i_sb->s_maxbytes - pos);
+	return 0;
+}
+
+static int generic_write_check_limits(struct file *file, loff_t pos,
+				      loff_t *count)
+{
+	unsigned long limit = rlimit(RLIMIT_FSIZE);
+
+	if (limit != RLIM_INFINITY) {
+		if (pos >= limit) {
+			send_sig(SIGXFSZ, current, 0);
+			return -EFBIG;
+		}
+		*count = min(*count, (loff_t)limit - pos);
+	}
+
+	return generic_access_check_limits(file, pos, count);
+}
+
 /*
  * Performs necessary checks before doing a write
  *
@@ -2926,8 +2969,8 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
-	unsigned long limit = rlimit(RLIMIT_FSIZE);
-	loff_t pos;
+	loff_t count;
+	int ret;
 
 	if (!iov_iter_count(from))
 		return 0;
@@ -2936,40 +2979,15 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_flags & IOCB_APPEND)
 		iocb->ki_pos = i_size_read(inode);
 
-	pos = iocb->ki_pos;
-
 	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
 		return -EINVAL;
 
-	if (limit != RLIM_INFINITY) {
-		if (iocb->ki_pos >= limit) {
-			send_sig(SIGXFSZ, current, 0);
-			return -EFBIG;
-		}
-		iov_iter_truncate(from, limit - (unsigned long)pos);
-	}
+	count = iov_iter_count(from);
+	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
+	if (ret)
+		return ret;
 
-	/*
-	 * LFS rule
-	 */
-	if (unlikely(pos + iov_iter_count(from) > MAX_NON_LFS &&
-				!(file->f_flags & O_LARGEFILE))) {
-		if (pos >= MAX_NON_LFS)
-			return -EFBIG;
-		iov_iter_truncate(from, MAX_NON_LFS - (unsigned long)pos);
-	}
-
-	/*
-	 * Are we about to exceed the fs block limit ?
-	 *
-	 * If we have written data it becomes a short write.  If we have
-	 * exceeded without writing data we send a signal and return EFBIG.
-	 * Linus frestrict idea will clean these up nicely..
-	 */
-	if (unlikely(pos >= inode->i_sb->s_maxbytes))
-		return -EFBIG;
-
-	iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos);
+	iov_iter_truncate(from, count);
 	return iov_iter_count(from);
 }
 EXPORT_SYMBOL(generic_write_checks);
@@ -2991,6 +3009,7 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	uint64_t bcount;
 	loff_t size_in, size_out;
 	loff_t bs = inode_out->i_sb->s_blocksize;
+	int ret;
 
 	/* The start of both ranges must be aligned to an fs block. */
 	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs))
@@ -3014,6 +3033,14 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
 		return -EINVAL;
 	count = min(count, size_in - (uint64_t)pos_in);
 
+	ret = generic_access_check_limits(file_in, pos_in, &count);
+	if (ret)
+		return ret;
+
+	ret = generic_write_check_limits(file_out, pos_out, &count);
+	if (ret)
+		return ret;
+
 	/*
 	 * If the user wanted us to link to the infile's EOF, round up to the
 	 * next block boundary for this check.


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 05/25] vfs: avoid problematic remapping requests into partial EOF block
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (3 preceding siblings ...)
  2018-10-13  0:06 ` [PATCH 04/25] vfs: strengthen checking of file range inputs to generic_remap_checks Darrick J. Wong
@ 2018-10-13  0:06 ` Darrick J. Wong
  2018-10-14 17:11   ` Christoph Hellwig
  2018-10-13  0:06 ` [PATCH 06/25] vfs: skip zero-length dedupe requests Darrick J. Wong
                   ` (19 subsequent siblings)
  24 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:06 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

A deduplication data corruption is exposed in XFS and btrfs. It is
caused by extending the block match range to include the partial EOF
block, but then allowing unknown data beyond EOF to be considered a
"match" to data in the destination file because the comparison is only
made to the end of the source file. This corrupts the destination file
when the source extent is shared with it.

The VFS remapping prep functions  only support whole block dedupe, but
we still need to appear to support whole file dedupe correctly.  Hence
if the dedupe request includes the last block of the souce file, don't
include it in the actual dedupe operation. If the rest of the range
dedupes successfully, then reject the entire request.  A subsequent
patch will enable us to shorten dedupe requests correctly.

When reflinking sub-file ranges, a data corruption can occur when the
source file range includes a partial EOF block. This shares the unknown
data beyond EOF into the second file at a position inside EOF, exposing
stale data in the second file.

If the reflink request includes the last block of the souce file, only
proceed with the reflink operation if it lands at or past the
destination file's current EOF. If it lands within the destination file
EOF, reject the entire request with -EINVAL and make the caller go the
hard way.  A subsequent patch will enable us to shorten reflink requests
correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)


diff --git a/fs/read_write.c b/fs/read_write.c
index d6e8e242a15f..067ff5698e0b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1723,6 +1723,7 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
+	u64 blkmask = i_blocksize(inode_in) - 1;
 	bool same_inode = (inode_in == inode_out);
 	int ret;
 
@@ -1785,6 +1786,22 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 			return -EBADE;
 	}
 
+	/* Are we doing a partial EOF block remapping of some kind? */
+	if (*len & blkmask) {
+		/*
+		 * If the dedupe data matches, chop off the partial EOF block
+		 * from the source file so we don't try to dedupe the partial
+		 * EOF block.
+		 *
+		 * If the user is attempting to remap a partial EOF block and
+		 * it's inside the destination EOF then reject it.
+		 */
+		if (is_dedupe)
+			*len &= ~blkmask;
+		else if (pos_out + *len < i_size_read(inode_out))
+			return -EINVAL;
+	}
+
 	return 1;
 }
 EXPORT_SYMBOL(vfs_clone_file_prep);


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 06/25] vfs: skip zero-length dedupe requests
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (4 preceding siblings ...)
  2018-10-13  0:06 ` [PATCH 05/25] vfs: avoid problematic remapping requests into partial EOF block Darrick J. Wong
@ 2018-10-13  0:06 ` Darrick J. Wong
  2018-10-13  0:06 ` [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range Darrick J. Wong
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:06 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	Christoph Hellwig, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Don't bother calling the filesystem for a zero-length dedupe request;
we can return zero and exit.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/read_write.c |    5 +++++
 1 file changed, 5 insertions(+)


diff --git a/fs/read_write.c b/fs/read_write.c
index 067ff5698e0b..2d84d18dc095 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1991,6 +1991,11 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 	if (!dst_file->f_op->dedupe_file_range)
 		goto out_drop_write;
 
+	if (len == 0) {
+		ret = 0;
+		goto out_drop_write;
+	}
+
 	ret = dst_file->f_op->dedupe_file_range(src_file, src_pos,
 						dst_file, dst_pos, len);
 out_drop_write:


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (5 preceding siblings ...)
  2018-10-13  0:06 ` [PATCH 06/25] vfs: skip zero-length dedupe requests Darrick J. Wong
@ 2018-10-13  0:06 ` Darrick J. Wong
  2018-10-14 17:19   ` Christoph Hellwig
  2018-10-13  0:06 ` [PATCH 08/25] vfs: rename vfs_clone_file_prep to be more descriptive Darrick J. Wong
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:06 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Combine the clone_file_range and dedupe_file_range operations into a
single remap_file_range file operation dispatch since they're
fundamentally the same operation.  The differences between the two can
be made in the prep functions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 Documentation/filesystems/vfs.txt |   12 ++++------
 fs/btrfs/ctree.h                  |    8 ++-----
 fs/btrfs/file.c                   |    3 +-
 fs/btrfs/ioctl.c                  |   45 +++++++++++++++++++------------------
 fs/cifs/cifsfs.c                  |   22 +++++++++++-------
 fs/nfs/nfs4file.c                 |   10 ++++++--
 fs/ocfs2/file.c                   |   24 +++++++-------------
 fs/overlayfs/file.c               |   30 ++++++++++++++-----------
 fs/read_write.c                   |   18 +++++++--------
 fs/xfs/xfs_file.c                 |   23 ++++++-------------
 include/linux/fs.h                |   27 +++++++++++++++++++---
 11 files changed, 116 insertions(+), 106 deletions(-)


diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index a6c6a8af48a2..2ec27203e4a6 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -883,8 +883,9 @@ struct file_operations {
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
 	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
-	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
-	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
+	int (*remap_file_range)(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				u64 len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 };
 
@@ -960,11 +961,8 @@ otherwise noted.
 
   copy_file_range: called by the copy_file_range(2) system call.
 
-  clone_file_range: called by the ioctl(2) system call for FICLONERANGE and
-	FICLONE commands.
-
-  dedupe_file_range: called by the ioctl(2) system call for FIDEDUPERANGE
-	command.
+  remap_file_range: called by the ioctl(2) system call for FICLONERANGE and
+	FICLONE and FIDEDUPERANGE commands to remap file ranges.
 
   fadvise: possibly called by the fadvise64() system call.
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..124a05662fc2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3218,9 +3218,6 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 				struct btrfs_ioctl_space_info *space);
 void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
 			       struct btrfs_ioctl_balance_args *bargs);
-int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
-			    struct file *dst_file, loff_t dst_loff,
-			    u64 olen);
 
 /* file.c */
 int __init btrfs_auto_defrag_init(void);
@@ -3250,8 +3247,9 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages,
 		      size_t num_pages, loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
-int btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
-			   struct file *file_out, loff_t pos_out, u64 len);
+int btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
+			   struct file *file_out, loff_t pos_out, u64 len,
+			   unsigned int remap_flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2be00e873e92..9a963f061393 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3269,8 +3269,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_compat_ioctl,
 #endif
-	.clone_file_range = btrfs_clone_file_range,
-	.dedupe_file_range = btrfs_dedupe_file_range,
+	.remap_file_range = btrfs_remap_file_range,
 };
 
 void __cold btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d60b6caf09e8..bed5b8f9ec09 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3627,26 +3627,6 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 	return ret;
 }
 
-int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
-			    struct file *dst_file, loff_t dst_loff,
-			    u64 olen)
-{
-	struct inode *src = file_inode(src_file);
-	struct inode *dst = file_inode(dst_file);
-	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
-
-	if (WARN_ON_ONCE(bs < PAGE_SIZE)) {
-		/*
-		 * Btrfs does not support blocksize < page_size. As a
-		 * result, btrfs_cmp_data() won't correctly handle
-		 * this situation without an update.
-		 */
-		return -EINVAL;
-	}
-
-	return btrfs_extent_same(src, src_loff, olen, dst, dst_loff);
-}
-
 static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
 				     struct inode *inode,
 				     u64 endoff,
@@ -4348,9 +4328,30 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	return ret;
 }
 
-int btrfs_clone_file_range(struct file *src_file, loff_t off,
-		struct file *dst_file, loff_t destoff, u64 len)
+int btrfs_remap_file_range(struct file *src_file, loff_t off,
+		struct file *dst_file, loff_t destoff, u64 len,
+		unsigned int remap_flags)
 {
+	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
+		return -EINVAL;
+
+	if (remap_flags & RFR_SAME_DATA) {
+		struct inode *src = file_inode(src_file);
+		struct inode *dst = file_inode(dst_file);
+		u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
+
+		if (WARN_ON_ONCE(bs < PAGE_SIZE)) {
+			/*
+			 * Btrfs does not support blocksize < page_size. As a
+			 * result, btrfs_cmp_data() won't correctly handle
+			 * this situation without an update.
+			 */
+			return -EINVAL;
+		}
+
+		return btrfs_extent_same(src, off, len, dst, destoff);
+	}
+
 	return btrfs_clone_files(dst_file, src_file, off, len, destoff);
 }
 
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 7065426b3280..06b2587fcc77 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -975,8 +975,9 @@ const struct inode_operations cifs_symlink_inode_ops = {
 	.listxattr = cifs_listxattr,
 };
 
-static int cifs_clone_file_range(struct file *src_file, loff_t off,
-		struct file *dst_file, loff_t destoff, u64 len)
+static int cifs_remap_file_range(struct file *src_file, loff_t off,
+		struct file *dst_file, loff_t destoff, u64 len,
+		unsigned int remap_flags)
 {
 	struct inode *src_inode = file_inode(src_file);
 	struct inode *target_inode = file_inode(dst_file);
@@ -986,6 +987,9 @@ static int cifs_clone_file_range(struct file *src_file, loff_t off,
 	unsigned int xid;
 	int rc;
 
+	if (!remap_check_flags(remap_flags, 0))
+		return -EINVAL;
+
 	cifs_dbg(FYI, "clone range\n");
 
 	xid = get_xid();
@@ -1134,7 +1138,7 @@ const struct file_operations cifs_file_ops = {
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
 	.copy_file_range = cifs_copy_file_range,
-	.clone_file_range = cifs_clone_file_range,
+	.remap_file_range = cifs_remap_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -1153,7 +1157,7 @@ const struct file_operations cifs_file_strict_ops = {
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
 	.copy_file_range = cifs_copy_file_range,
-	.clone_file_range = cifs_clone_file_range,
+	.remap_file_range = cifs_remap_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -1172,7 +1176,7 @@ const struct file_operations cifs_file_direct_ops = {
 	.splice_write = iter_file_splice_write,
 	.unlocked_ioctl  = cifs_ioctl,
 	.copy_file_range = cifs_copy_file_range,
-	.clone_file_range = cifs_clone_file_range,
+	.remap_file_range = cifs_remap_file_range,
 	.llseek = cifs_llseek,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
@@ -1191,7 +1195,7 @@ const struct file_operations cifs_file_nobrl_ops = {
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
 	.copy_file_range = cifs_copy_file_range,
-	.clone_file_range = cifs_clone_file_range,
+	.remap_file_range = cifs_remap_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -1209,7 +1213,7 @@ const struct file_operations cifs_file_strict_nobrl_ops = {
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
 	.copy_file_range = cifs_copy_file_range,
-	.clone_file_range = cifs_clone_file_range,
+	.remap_file_range = cifs_remap_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -1227,7 +1231,7 @@ const struct file_operations cifs_file_direct_nobrl_ops = {
 	.splice_write = iter_file_splice_write,
 	.unlocked_ioctl  = cifs_ioctl,
 	.copy_file_range = cifs_copy_file_range,
-	.clone_file_range = cifs_clone_file_range,
+	.remap_file_range = cifs_remap_file_range,
 	.llseek = cifs_llseek,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
@@ -1239,7 +1243,7 @@ const struct file_operations cifs_dir_ops = {
 	.read    = generic_read_dir,
 	.unlocked_ioctl  = cifs_ioctl,
 	.copy_file_range = cifs_copy_file_range,
-	.clone_file_range = cifs_clone_file_range,
+	.remap_file_range = cifs_remap_file_range,
 	.llseek = generic_file_llseek,
 	.fsync = cifs_dir_fsync,
 };
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 4288a6ecaf75..2452b1941f36 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -180,8 +180,9 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
 	return nfs42_proc_allocate(filep, offset, len);
 }
 
-static int nfs42_clone_file_range(struct file *src_file, loff_t src_off,
-		struct file *dst_file, loff_t dst_off, u64 count)
+static int nfs42_remap_file_range(struct file *src_file, loff_t src_off,
+		struct file *dst_file, loff_t dst_off, u64 count,
+		unsigned int remap_flags)
 {
 	struct inode *dst_inode = file_inode(dst_file);
 	struct nfs_server *server = NFS_SERVER(dst_inode);
@@ -190,6 +191,9 @@ static int nfs42_clone_file_range(struct file *src_file, loff_t src_off,
 	bool same_inode = false;
 	int ret;
 
+	if (!remap_check_flags(remap_flags, 0))
+		return -EINVAL;
+
 	/* check alignment w.r.t. clone_blksize */
 	ret = -EINVAL;
 	if (bs) {
@@ -262,7 +266,7 @@ const struct file_operations nfs4_file_operations = {
 	.copy_file_range = nfs4_copy_file_range,
 	.llseek		= nfs4_file_llseek,
 	.fallocate	= nfs42_fallocate,
-	.clone_file_range = nfs42_clone_file_range,
+	.remap_file_range = nfs42_remap_file_range,
 #else
 	.llseek		= nfs_file_llseek,
 #endif
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 9fa35cb6f6e0..852cdfaadd89 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2527,24 +2527,18 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
-static int ocfs2_file_clone_range(struct file *file_in,
+static int ocfs2_remap_file_range(struct file *file_in,
 				  loff_t pos_in,
 				  struct file *file_out,
 				  loff_t pos_out,
-				  u64 len)
+				  u64 len,
+				  unsigned int remap_flags)
 {
-	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					 len, false);
-}
+	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
+		return -EINVAL;
 
-static int ocfs2_file_dedupe_range(struct file *file_in,
-				   loff_t pos_in,
-				   struct file *file_out,
-				   loff_t pos_out,
-				   u64 len)
-{
 	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					  len, true);
+					 len, remap_flags & RFR_SAME_DATA);
 }
 
 const struct inode_operations ocfs2_file_iops = {
@@ -2586,8 +2580,7 @@ const struct file_operations ocfs2_fops = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
-	.clone_file_range = ocfs2_file_clone_range,
-	.dedupe_file_range = ocfs2_file_dedupe_range,
+	.remap_file_range = ocfs2_remap_file_range,
 };
 
 const struct file_operations ocfs2_dops = {
@@ -2633,8 +2626,7 @@ const struct file_operations ocfs2_fops_no_plocks = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
-	.clone_file_range = ocfs2_file_clone_range,
-	.dedupe_file_range = ocfs2_file_dedupe_range,
+	.remap_file_range = ocfs2_remap_file_range,
 };
 
 const struct file_operations ocfs2_dops_no_plocks = {
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 986313da0c88..455bf49bd07b 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -489,26 +489,31 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
 			    OVL_COPY);
 }
 
-static int ovl_clone_file_range(struct file *file_in, loff_t pos_in,
-				struct file *file_out, loff_t pos_out, u64 len)
+static int ovl_remap_file_range(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				u64 len, unsigned int remap_flags)
 {
-	return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
-			    OVL_CLONE);
-}
+	enum ovl_copyop op;
+
+	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
+		return -EINVAL;
+
+	if (remap_flags & RFR_SAME_DATA)
+		op = OVL_DEDUPE;
+	else
+		op = OVL_CLONE;
 
-static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
-				 struct file *file_out, loff_t pos_out, u64 len)
-{
 	/*
 	 * Don't copy up because of a dedupe request, this wouldn't make sense
 	 * most of the time (data would be duplicated instead of deduplicated).
 	 */
-	if (!ovl_inode_upper(file_inode(file_in)) ||
-	    !ovl_inode_upper(file_inode(file_out)))
+	if (op == OVL_DEDUPE &&
+	    (!ovl_inode_upper(file_inode(file_in)) ||
+	     !ovl_inode_upper(file_inode(file_out))))
 		return -EPERM;
 
 	return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
-			    OVL_DEDUPE);
+			    op);
 }
 
 const struct file_operations ovl_file_operations = {
@@ -525,6 +530,5 @@ const struct file_operations ovl_file_operations = {
 	.compat_ioctl	= ovl_compat_ioctl,
 
 	.copy_file_range	= ovl_copy_file_range,
-	.clone_file_range	= ovl_clone_file_range,
-	.dedupe_file_range	= ovl_dedupe_file_range,
+	.remap_file_range	= ovl_remap_file_range,
 };
diff --git a/fs/read_write.c b/fs/read_write.c
index 2d84d18dc095..fd3fe05060a4 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1588,9 +1588,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	 * Try cloning first, this is supported by more file systems, and
 	 * more efficient if both clone and copy are supported (e.g. NFS).
 	 */
-	if (file_in->f_op->clone_file_range) {
-		ret = file_in->f_op->clone_file_range(file_in, pos_in,
-				file_out, pos_out, len);
+	if (file_in->f_op->remap_file_range) {
+		ret = file_in->f_op->remap_file_range(file_in, pos_in,
+				file_out, pos_out, len, 0);
 		if (ret == 0) {
 			ret = len;
 			goto done;
@@ -1831,7 +1831,7 @@ int do_clone_file_range(struct file *file_in, loff_t pos_in,
 	    (file_out->f_flags & O_APPEND))
 		return -EBADF;
 
-	if (!file_in->f_op->clone_file_range)
+	if (!file_in->f_op->remap_file_range)
 		return -EOPNOTSUPP;
 
 	ret = clone_verify_area(file_in, pos_in, len, false);
@@ -1842,8 +1842,8 @@ int do_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_in->f_op->clone_file_range(file_in, pos_in,
-			file_out, pos_out, len);
+	ret = file_in->f_op->remap_file_range(file_in, pos_in,
+			file_out, pos_out, len, 0);
 	if (!ret) {
 		fsnotify_access(file_in);
 		fsnotify_modify(file_out);
@@ -1988,7 +1988,7 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 		goto out_drop_write;
 
 	ret = -EINVAL;
-	if (!dst_file->f_op->dedupe_file_range)
+	if (!dst_file->f_op->remap_file_range)
 		goto out_drop_write;
 
 	if (len == 0) {
@@ -1996,8 +1996,8 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 		goto out_drop_write;
 	}
 
-	ret = dst_file->f_op->dedupe_file_range(src_file, src_pos,
-						dst_file, dst_pos, len);
+	ret = dst_file->f_op->remap_file_range(src_file, src_pos, dst_file,
+			dst_pos, len, RFR_SAME_DATA);
 out_drop_write:
 	mnt_drop_write_file(dst_file);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 61a5ad2600e8..7cce438f856a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -920,27 +920,19 @@ xfs_file_fallocate(
 }
 
 STATIC int
-xfs_file_clone_range(
+xfs_file_remap_range(
 	struct file	*file_in,
 	loff_t		pos_in,
 	struct file	*file_out,
 	loff_t		pos_out,
-	u64		len)
+	u64		len,
+	unsigned int	remap_flags)
 {
-	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-				     len, false);
-}
+	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
+		return -EINVAL;
 
-STATIC int
-xfs_file_dedupe_range(
-	struct file	*file_in,
-	loff_t		pos_in,
-	struct file	*file_out,
-	loff_t		pos_out,
-	u64		len)
-{
 	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-				     len, true);
+			len, remap_flags & RFR_SAME_DATA);
 }
 
 STATIC int
@@ -1175,8 +1167,7 @@ const struct file_operations xfs_file_operations = {
 	.fsync		= xfs_file_fsync,
 	.get_unmapped_area = thp_get_unmapped_area,
 	.fallocate	= xfs_file_fallocate,
-	.clone_file_range = xfs_file_clone_range,
-	.dedupe_file_range = xfs_file_dedupe_range,
+	.remap_file_range = xfs_file_remap_range,
 };
 
 const struct file_operations xfs_dir_file_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ba93a6e7dac4..11fe36576d34 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1721,6 +1721,26 @@ struct block_device_operations;
 #define NOMMU_VMFLAGS \
 	(NOMMU_MAP_READ | NOMMU_MAP_WRITE | NOMMU_MAP_EXEC)
 
+/*
+ * These flags control the behavior of the remap_file_range function pointer.
+ *
+ * RFR_SAME_DATA: only remap if contents identical (i.e. deduplicate)
+ */
+#define RFR_SAME_DATA		(1 << 0)
+
+#define RFR_VALID_FLAGS		(RFR_SAME_DATA)
+
+/*
+ * Filesystem remapping implementations should call this helper on their
+ * remap flags to filter out flags that the implementation doesn't support.
+ *
+ * Returns true if the flags are ok, false otherwise.
+ */
+static inline bool remap_check_flags(unsigned int remap_flags,
+				     unsigned int supported_flags)
+{
+	return (remap_flags & ~(supported_flags & RFR_VALID_FLAGS)) == 0;
+}
 
 struct iov_iter;
 
@@ -1759,10 +1779,9 @@ struct file_operations {
 #endif
 	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
 			loff_t, size_t, unsigned int);
-	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
-			u64);
-	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t,
-			u64);
+	int (*remap_file_range)(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				u64 len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 } __randomize_layout;
 


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 08/25] vfs: rename vfs_clone_file_prep to be more descriptive
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (6 preceding siblings ...)
  2018-10-13  0:06 ` [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range Darrick J. Wong
@ 2018-10-13  0:06 ` Darrick J. Wong
  2018-10-13  0:06 ` [PATCH 09/25] vfs: rename clone_verify_area to remap_verify_area Darrick J. Wong
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:06 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

The vfs_clone_file_prep is a generic function to be called by filesystem
implementations only.  Rename the prefix to generic_ and make it more
clear that it applies to remap operations, not just clones.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/ocfs2/refcounttree.c |    2 +-
 fs/read_write.c         |    8 ++++----
 fs/xfs/xfs_reflink.c    |    2 +-
 include/linux/fs.h      |    6 +++---
 4 files changed, 9 insertions(+), 9 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 19e03936c5e1..36c56dfbe485 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4850,7 +4850,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
 		goto out_unlock;
 
-	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
+	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
 			&len, is_dedupe);
 	if (ret <= 0)
 		goto out_unlock;
diff --git a/fs/read_write.c b/fs/read_write.c
index fd3fe05060a4..65285524e4c3 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1717,9 +1717,9 @@ static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
  * Returns: 0 for "nothing to clone", 1 for "something to clone", or
  * the usual negative error code.
  */
-int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
-			struct file *file_out, loff_t pos_out,
-			u64 *len, bool is_dedupe)
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				  struct file *file_out, loff_t pos_out,
+				  u64 *len, bool is_dedupe)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
@@ -1804,7 +1804,7 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 
 	return 1;
 }
-EXPORT_SYMBOL(vfs_clone_file_prep);
+EXPORT_SYMBOL(generic_remap_file_range_prep);
 
 int do_clone_file_range(struct file *file_in, loff_t pos_in,
 			struct file *file_out, loff_t pos_out, u64 len)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 281d5f53f2ec..a7757a128a78 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1326,7 +1326,7 @@ xfs_reflink_remap_prep(
 	if (IS_DAX(inode_in) || IS_DAX(inode_out))
 		goto out_unlock;
 
-	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
+	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
 			len, is_dedupe);
 	if (ret <= 0)
 		goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11fe36576d34..686905be04c0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1844,9 +1844,9 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
-extern int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
-			       struct file *file_out, loff_t pos_out,
-			       u64 *count, bool is_dedupe);
+extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+					 struct file *file_out, loff_t pos_out,
+					 u64 *count, bool is_dedupe);
 extern int do_clone_file_range(struct file *file_in, loff_t pos_in,
 			       struct file *file_out, loff_t pos_out, u64 len);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 09/25] vfs: rename clone_verify_area to remap_verify_area
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (7 preceding siblings ...)
  2018-10-13  0:06 ` [PATCH 08/25] vfs: rename vfs_clone_file_prep to be more descriptive Darrick J. Wong
@ 2018-10-13  0:06 ` Darrick J. Wong
  2018-10-13  0:06 ` [PATCH 10/25] vfs: create generic_remap_file_range_touch to update inode metadata Darrick J. Wong
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:06 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Since we use clone_verify_area for both clone and dedupe range checks,
rename the function to make it clear that it's for both.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/read_write.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 65285524e4c3..ff6fcb3b99dd 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1686,7 +1686,7 @@ SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
 	return ret;
 }
 
-static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
+static int remap_verify_area(struct file *file, loff_t pos, u64 len, bool write)
 {
 	struct inode *inode = file_inode(file);
 
@@ -1834,11 +1834,11 @@ int do_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (!file_in->f_op->remap_file_range)
 		return -EOPNOTSUPP;
 
-	ret = clone_verify_area(file_in, pos_in, len, false);
+	ret = remap_verify_area(file_in, pos_in, len, false);
 	if (ret)
 		return ret;
 
-	ret = clone_verify_area(file_out, pos_out, len, true);
+	ret = remap_verify_area(file_out, pos_out, len, true);
 	if (ret)
 		return ret;
 
@@ -1971,7 +1971,7 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 	if (ret)
 		return ret;
 
-	ret = clone_verify_area(dst_file, dst_pos, len, true);
+	ret = remap_verify_area(dst_file, dst_pos, len, true);
 	if (ret < 0)
 		goto out_drop_write;
 
@@ -2033,7 +2033,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 	if (!S_ISREG(src->i_mode))
 		goto out;
 
-	ret = clone_verify_area(file, off, len, false);
+	ret = remap_verify_area(file, off, len, false);
 	if (ret < 0)
 		goto out;
 	ret = 0;


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 10/25] vfs: create generic_remap_file_range_touch to update inode metadata
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (8 preceding siblings ...)
  2018-10-13  0:06 ` [PATCH 09/25] vfs: rename clone_verify_area to remap_verify_area Darrick J. Wong
@ 2018-10-13  0:06 ` Darrick J. Wong
  2018-10-14 17:21   ` Christoph Hellwig
  2018-10-13  0:06 ` [PATCH 11/25] vfs: pass remap flags to generic_remap_file_range_prep Darrick J. Wong
                   ` (14 subsequent siblings)
  24 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:06 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a new VFS helper to handle inode metadata updates when remapping
into a file.  If the operation can possibly alter the file contents, we
must update the ctime and mtime and remove security privileges, just
like we do for regular file writes.  Wire up ocfs2 to ensure consistent
behavior.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/ocfs2/refcounttree.c |    8 ++++++++
 fs/read_write.c         |   24 ++++++++++++++++++++++++
 fs/xfs/xfs_reflink.c    |   29 +++++++----------------------
 include/linux/fs.h      |    1 +
 4 files changed, 40 insertions(+), 22 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 36c56dfbe485..ee1ed11379b3 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4855,6 +4855,14 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 	if (ret <= 0)
 		goto out_unlock;
 
+	/*
+	 * Update inode timestamps and remove security privileges before we
+	 * take the ilock.
+	 */
+	ret = generic_remap_file_range_touch(file_out, is_dedupe);
+	if (ret)
+		goto out_unlock;
+
 	/* Lock out changes to the allocation maps and remap. */
 	down_write(&OCFS2_I(inode_in)->ip_alloc_sem);
 	if (!same_inode)
diff --git a/fs/read_write.c b/fs/read_write.c
index ff6fcb3b99dd..7b837d12f75d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1806,6 +1806,30 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 }
 EXPORT_SYMBOL(generic_remap_file_range_prep);
 
+/* Update inode timestamps and remove security privileges when remapping. */
+int generic_remap_file_range_touch(struct file *file, bool is_dedupe)
+{
+	int ret;
+
+	/* If can't alter the file contents, we're done. */
+	if (is_dedupe)
+		return 0;
+
+	/* Update the timestamps, since we can alter file contents. */
+	if (!(file->f_mode & FMODE_NOCMTIME)) {
+		ret = file_update_time(file);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Clear the security bits if the process is not being run by root.
+	 * This keeps people from modifying setuid and setgid binaries.
+	 */
+	return file_remove_privs(file);
+}
+EXPORT_SYMBOL(generic_remap_file_range_touch);
+
 int do_clone_file_range(struct file *file_in, loff_t pos_in,
 			struct file *file_out, loff_t pos_out, u64 len)
 {
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index a7757a128a78..99f2ea4fcaba 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1371,28 +1371,13 @@ xfs_reflink_remap_prep(
 	truncate_inode_pages_range(&inode_out->i_data, pos_out,
 				   PAGE_ALIGN(pos_out + *len) - 1);
 
-	/* If we're altering the file contents... */
-	if (!is_dedupe) {
-		/*
-		 * ...update the timestamps (which will grab the ilock again
-		 * from xfs_fs_dirty_inode, so we have to call it before we
-		 * take the ilock).
-		 */
-		if (!(file_out->f_mode & FMODE_NOCMTIME)) {
-			ret = file_update_time(file_out);
-			if (ret)
-				goto out_unlock;
-		}
-
-		/*
-		 * ...clear the security bits if the process is not being run
-		 * by root.  This keeps people from modifying setuid and setgid
-		 * binaries.
-		 */
-		ret = file_remove_privs(file_out);
-		if (ret)
-			goto out_unlock;
-	}
+	/*
+	 * Update inode timestamps and remove security privileges before we
+	 * take the ilock.
+	 */
+	ret = generic_remap_file_range_touch(file_out, is_dedupe);
+	if (ret)
+		goto out_unlock;
 
 	return 1;
 out_unlock:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 686905be04c0..91fd3c77763b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1847,6 +1847,7 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					 struct file *file_out, loff_t pos_out,
 					 u64 *count, bool is_dedupe);
+extern int generic_remap_file_range_touch(struct file *file, bool is_dedupe);
 extern int do_clone_file_range(struct file *file_in, loff_t pos_in,
 			       struct file *file_out, loff_t pos_out, u64 len);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 11/25] vfs: pass remap flags to generic_remap_file_range_prep
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (9 preceding siblings ...)
  2018-10-13  0:06 ` [PATCH 10/25] vfs: create generic_remap_file_range_touch to update inode metadata Darrick J. Wong
@ 2018-10-13  0:06 ` Darrick J. Wong
  2018-10-14 17:22   ` Christoph Hellwig
  2018-10-14 17:37   ` Christoph Hellwig
  2018-10-13  0:07 ` [PATCH 12/25] vfs: pass remap flags to generic_remap_checks Darrick J. Wong
                   ` (13 subsequent siblings)
  24 siblings, 2 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:06 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Plumb the remap flags through the filesystem from the vfs function
dispatcher all the way to the prep function to prepare for behavior
changes in subsequent patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/ocfs2/file.c         |    2 +-
 fs/ocfs2/refcounttree.c |    6 +++---
 fs/ocfs2/refcounttree.h |    2 +-
 fs/read_write.c         |   10 ++++++----
 fs/xfs/xfs_file.c       |    2 +-
 fs/xfs/xfs_reflink.c    |   16 +++++++++-------
 fs/xfs/xfs_reflink.h    |    3 ++-
 include/linux/fs.h      |    5 +++--
 8 files changed, 26 insertions(+), 20 deletions(-)


diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 852cdfaadd89..53c8676a0daf 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2538,7 +2538,7 @@ static int ocfs2_remap_file_range(struct file *file_in,
 		return -EINVAL;
 
 	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					 len, remap_flags & RFR_SAME_DATA);
+					 len, remap_flags);
 }
 
 const struct inode_operations ocfs2_file_iops = {
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index ee1ed11379b3..270a5b1919f6 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4825,7 +4825,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 			      struct file *file_out,
 			      loff_t pos_out,
 			      u64 len,
-			      bool is_dedupe)
+			      unsigned int remap_flags)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
@@ -4851,7 +4851,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 		goto out_unlock;
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			&len, is_dedupe);
+			&len, remap_flags);
 	if (ret <= 0)
 		goto out_unlock;
 
@@ -4859,7 +4859,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 	 * Update inode timestamps and remove security privileges before we
 	 * take the ilock.
 	 */
-	ret = generic_remap_file_range_touch(file_out, is_dedupe);
+	ret = generic_remap_file_range_touch(file_out, remap_flags);
 	if (ret)
 		goto out_unlock;
 
diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
index 4af55bf4b35b..d2c5f526edff 100644
--- a/fs/ocfs2/refcounttree.h
+++ b/fs/ocfs2/refcounttree.h
@@ -120,6 +120,6 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 			      struct file *file_out,
 			      loff_t pos_out,
 			      u64 len,
-			      bool is_dedupe);
+			      unsigned int remap_flags);
 
 #endif /* OCFS2_REFCOUNTTREE_H */
diff --git a/fs/read_write.c b/fs/read_write.c
index 7b837d12f75d..5d24e9854765 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1712,18 +1712,20 @@ static int remap_verify_area(struct file *file, loff_t pos, u64 len, bool write)
 /*
  * Check that the two inodes are eligible for cloning, the ranges make
  * sense, and then flush all dirty data.  Caller must ensure that the
- * inodes have been locked against any other modifications.
+ * inodes have been locked against any other modifications.  This function
+ * takes RFR_* flags in remap_flags.
  *
  * Returns: 0 for "nothing to clone", 1 for "something to clone", or
  * the usual negative error code.
  */
 int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
-				  u64 *len, bool is_dedupe)
+				  u64 *len, unsigned int remap_flags)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
 	u64 blkmask = i_blocksize(inode_in) - 1;
+	bool is_dedupe = (remap_flags & RFR_SAME_DATA);
 	bool same_inode = (inode_in == inode_out);
 	int ret;
 
@@ -1807,12 +1809,12 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 EXPORT_SYMBOL(generic_remap_file_range_prep);
 
 /* Update inode timestamps and remove security privileges when remapping. */
-int generic_remap_file_range_touch(struct file *file, bool is_dedupe)
+int generic_remap_file_range_touch(struct file *file, unsigned int remap_flags)
 {
 	int ret;
 
 	/* If can't alter the file contents, we're done. */
-	if (is_dedupe)
+	if (remap_flags & RFR_SAME_DATA)
 		return 0;
 
 	/* Update the timestamps, since we can alter file contents. */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7cce438f856a..dce01729e522 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -932,7 +932,7 @@ xfs_file_remap_range(
 		return -EINVAL;
 
 	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-			len, remap_flags & RFR_SAME_DATA);
+			len, remap_flags);
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 99f2ea4fcaba..ada3b80267c6 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -921,10 +921,11 @@ xfs_reflink_update_dest(
 	struct xfs_inode	*dest,
 	xfs_off_t		newlen,
 	xfs_extlen_t		cowextsize,
-	bool			is_dedupe)
+	unsigned int		remap_flags)
 {
 	struct xfs_mount	*mp = dest->i_mount;
 	struct xfs_trans	*tp;
+	bool			is_dedupe = (remap_flags & RFR_SAME_DATA);
 	int			error;
 
 	if (is_dedupe && newlen <= i_size_read(VFS_I(dest)) && cowextsize == 0)
@@ -1296,13 +1297,14 @@ xfs_reflink_remap_prep(
 	struct file		*file_out,
 	loff_t			pos_out,
 	u64			*len,
-	bool			is_dedupe)
+	unsigned int		remap_flags)
 {
 	struct inode		*inode_in = file_inode(file_in);
 	struct xfs_inode	*src = XFS_I(inode_in);
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	bool			same_inode = (inode_in == inode_out);
+	bool			is_dedupe = (remap_flags & RFR_SAME_DATA);
 	u64			blkmask = i_blocksize(inode_in) - 1;
 	ssize_t			ret;
 
@@ -1327,7 +1329,7 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			len, is_dedupe);
+			len, remap_flags);
 	if (ret <= 0)
 		goto out_unlock;
 
@@ -1375,7 +1377,7 @@ xfs_reflink_remap_prep(
 	 * Update inode timestamps and remove security privileges before we
 	 * take the ilock.
 	 */
-	ret = generic_remap_file_range_touch(file_out, is_dedupe);
+	ret = generic_remap_file_range_touch(file_out, remap_flags);
 	if (ret)
 		goto out_unlock;
 
@@ -1395,7 +1397,7 @@ xfs_reflink_remap_range(
 	struct file		*file_out,
 	loff_t			pos_out,
 	u64			len,
-	bool			is_dedupe)
+	unsigned int		remap_flags)
 {
 	struct inode		*inode_in = file_inode(file_in);
 	struct xfs_inode	*src = XFS_I(inode_in);
@@ -1415,7 +1417,7 @@ xfs_reflink_remap_range(
 
 	/* Prepare and then clone file data. */
 	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
-			&len, is_dedupe);
+			&len, remap_flags);
 	if (ret <= 0)
 		return ret;
 
@@ -1442,7 +1444,7 @@ xfs_reflink_remap_range(
 		cowextsize = src->i_d.di_cowextsize;
 
 	ret = xfs_reflink_update_dest(dest, pos_out + len, cowextsize,
-			is_dedupe);
+			remap_flags);
 
 out_unlock:
 	xfs_reflink_remap_unlock(file_in, file_out);
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index c585ad9552b2..6f82d628bf17 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -28,7 +28,8 @@ extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 extern int xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
-		struct file *file_out, loff_t pos_out, u64 len, bool is_dedupe);
+		struct file *file_out, loff_t pos_out, u64 len,
+		unsigned int remap_flags);
 extern int xfs_reflink_inode_has_shared_extents(struct xfs_trans *tp,
 		struct xfs_inode *ip, bool *has_shared);
 extern int xfs_reflink_clear_inode_flag(struct xfs_inode *ip,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 91fd3c77763b..b67f108932a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1846,8 +1846,9 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
 extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					 struct file *file_out, loff_t pos_out,
-					 u64 *count, bool is_dedupe);
-extern int generic_remap_file_range_touch(struct file *file, bool is_dedupe);
+					 u64 *count, unsigned int remap_flags);
+extern int generic_remap_file_range_touch(struct file *file,
+					  unsigned int remap_flags);
 extern int do_clone_file_range(struct file *file_in, loff_t pos_in,
 			       struct file *file_out, loff_t pos_out, u64 len);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 12/25] vfs: pass remap flags to generic_remap_checks
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (10 preceding siblings ...)
  2018-10-13  0:06 ` [PATCH 11/25] vfs: pass remap flags to generic_remap_file_range_prep Darrick J. Wong
@ 2018-10-13  0:07 ` Darrick J. Wong
  2018-10-13  0:07 ` [PATCH 13/25] vfs: make remap_file_range functions take and return bytes completed Darrick J. Wong
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:07 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Pass the same remap flags to generic_remap_checks for consistency.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/read_write.c    |    2 +-
 include/linux/fs.h |    2 +-
 mm/filemap.c       |    4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 5d24e9854765..0c43997bd4a1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1755,7 +1755,7 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 
 	/* Check that we don't violate system file offset limits. */
 	ret = generic_remap_checks(file_in, pos_in, file_out, pos_out, len,
-			is_dedupe);
+			remap_flags);
 	if (ret)
 		return ret;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b67f108932a5..b59637b2f484 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2990,7 +2990,7 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
 extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
-				uint64_t *count, bool is_dedupe);
+				uint64_t *count, unsigned int remap_flags);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 08ad210fee49..c34a89a35d5a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3001,7 +3001,7 @@ EXPORT_SYMBOL(generic_write_checks);
  */
 int generic_remap_checks(struct file *file_in, loff_t pos_in,
 			 struct file *file_out, loff_t pos_out,
-			 uint64_t *req_count, bool is_dedupe)
+			 uint64_t *req_count, unsigned int remap_flags)
 {
 	struct inode *inode_in = file_in->f_mapping->host;
 	struct inode *inode_out = file_out->f_mapping->host;
@@ -3023,7 +3023,7 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	size_out = i_size_read(inode_out);
 
 	/* Dedupe requires both ranges to be within EOF. */
-	if (is_dedupe &&
+	if ((remap_flags & RFR_SAME_DATA) &&
 	    (pos_in >= size_in || pos_in + count > size_in ||
 	     pos_out >= size_out || pos_out + count > size_out))
 		return -EINVAL;


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 13/25] vfs: make remap_file_range functions take and return bytes completed
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (11 preceding siblings ...)
  2018-10-13  0:07 ` [PATCH 12/25] vfs: pass remap flags to generic_remap_checks Darrick J. Wong
@ 2018-10-13  0:07 ` Darrick J. Wong
  2018-10-13  0:07 ` [PATCH 14/25] vfs: plumb RFR_* remap flags through the vfs clone functions Darrick J. Wong
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:07 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on.  This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.

A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value.  For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an
alternative.

Neither clone ioctl can take advantage of this, alas.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 Documentation/filesystems/vfs.txt |    6 ++---
 fs/btrfs/ctree.h                  |    6 ++---
 fs/btrfs/ioctl.c                  |   13 ++++++----
 fs/cifs/cifsfs.c                  |    6 ++---
 fs/ioctl.c                        |   10 +++++++-
 fs/nfs/nfs4file.c                 |    6 ++---
 fs/nfsd/vfs.c                     |    8 +++++-
 fs/ocfs2/file.c                   |   16 ++++++-------
 fs/ocfs2/refcounttree.c           |    2 +-
 fs/ocfs2/refcounttree.h           |    2 +-
 fs/overlayfs/copy_up.c            |    6 ++---
 fs/overlayfs/file.c               |   12 +++++----
 fs/read_write.c                   |   47 ++++++++++++++++++++-----------------
 fs/xfs/xfs_file.c                 |    9 +++++--
 fs/xfs/xfs_reflink.c              |    4 ++-
 fs/xfs/xfs_reflink.h              |    2 +-
 include/linux/fs.h                |   27 ++++++++++++---------
 mm/filemap.c                      |    2 +-
 18 files changed, 105 insertions(+), 79 deletions(-)


diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 2ec27203e4a6..393909585bd8 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -883,9 +883,9 @@ struct file_operations {
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
 	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
-	int (*remap_file_range)(struct file *file_in, loff_t pos_in,
-				struct file *file_out, loff_t pos_out,
-				u64 len, unsigned int remap_flags);
+	loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
+				   struct file *file_out, loff_t pos_out,
+				   loff_t len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 };
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 124a05662fc2..771a961d77ad 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3247,9 +3247,9 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages,
 		      size_t num_pages, loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
-int btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
-			   struct file *file_out, loff_t pos_out, u64 len,
-			   unsigned int remap_flags);
+loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      loff_t len, unsigned int remap_flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index bed5b8f9ec09..3e0aaca9e072 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4328,10 +4328,12 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	return ret;
 }
 
-int btrfs_remap_file_range(struct file *src_file, loff_t off,
-		struct file *dst_file, loff_t destoff, u64 len,
+loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
+		struct file *dst_file, loff_t destoff, loff_t len,
 		unsigned int remap_flags)
 {
+	int ret;
+
 	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
 		return -EINVAL;
 
@@ -4349,10 +4351,11 @@ int btrfs_remap_file_range(struct file *src_file, loff_t off,
 			return -EINVAL;
 		}
 
-		return btrfs_extent_same(src, off, len, dst, destoff);
+		ret = btrfs_extent_same(src, off, len, dst, destoff);
+	} else {
+		ret = btrfs_clone_files(dst_file, src_file, off, len, destoff);
 	}
-
-	return btrfs_clone_files(dst_file, src_file, off, len, destoff);
+	return ret < 0 ? ret : len;
 }
 
 static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 06b2587fcc77..816a1b52767e 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -975,8 +975,8 @@ const struct inode_operations cifs_symlink_inode_ops = {
 	.listxattr = cifs_listxattr,
 };
 
-static int cifs_remap_file_range(struct file *src_file, loff_t off,
-		struct file *dst_file, loff_t destoff, u64 len,
+static loff_t cifs_remap_file_range(struct file *src_file, loff_t off,
+		struct file *dst_file, loff_t destoff, loff_t len,
 		unsigned int remap_flags)
 {
 	struct inode *src_inode = file_inode(src_file);
@@ -1029,7 +1029,7 @@ static int cifs_remap_file_range(struct file *src_file, loff_t off,
 	unlock_two_nondirectories(src_inode, target_inode);
 out:
 	free_xid(xid);
-	return rc;
+	return rc < 0 ? rc : len;
 }
 
 ssize_t cifs_file_copychunk_range(unsigned int xid,
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 2005529af560..72537b68c272 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -223,6 +223,7 @@ static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd,
 			     u64 off, u64 olen, u64 destoff)
 {
 	struct fd src_file = fdget(srcfd);
+	loff_t cloned;
 	int ret;
 
 	if (!src_file.file)
@@ -230,7 +231,14 @@ static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd,
 	ret = -EXDEV;
 	if (src_file.file->f_path.mnt != dst_file->f_path.mnt)
 		goto fdput;
-	ret = vfs_clone_file_range(src_file.file, off, dst_file, destoff, olen);
+	cloned = vfs_clone_file_range(src_file.file, off, dst_file, destoff,
+				      olen);
+	if (cloned < 0)
+		ret = cloned;
+	else if (olen && cloned != olen)
+		ret = -EINVAL;
+	else
+		ret = 0;
 fdput:
 	fdput(src_file);
 	return ret;
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 2452b1941f36..eafa39162cfc 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -180,8 +180,8 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
 	return nfs42_proc_allocate(filep, offset, len);
 }
 
-static int nfs42_remap_file_range(struct file *src_file, loff_t src_off,
-		struct file *dst_file, loff_t dst_off, u64 count,
+static loff_t nfs42_remap_file_range(struct file *src_file, loff_t src_off,
+		struct file *dst_file, loff_t dst_off, loff_t count,
 		unsigned int remap_flags)
 {
 	struct inode *dst_inode = file_inode(dst_file);
@@ -244,7 +244,7 @@ static int nfs42_remap_file_range(struct file *src_file, loff_t src_off,
 		inode_unlock(src_inode);
 	}
 out:
-	return ret;
+	return ret < 0 ? ret : count;
 }
 #endif /* CONFIG_NFS_V4_2 */
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index b53e76391e52..ac6cb6101cbe 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -541,8 +541,12 @@ __be32 nfsd4_set_nfs4_label(struct svc_rqst *rqstp, struct svc_fh *fhp,
 __be32 nfsd4_clone_file_range(struct file *src, u64 src_pos, struct file *dst,
 		u64 dst_pos, u64 count)
 {
-	return nfserrno(vfs_clone_file_range(src, src_pos, dst, dst_pos,
-					     count));
+	loff_t cloned;
+
+	cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos, count);
+	if (count && cloned != count)
+		cloned = -EINVAL;
+	return nfserrno(cloned < 0 ? cloned : 0);
 }
 
 ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 53c8676a0daf..e6ffed70398e 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2527,18 +2527,18 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
-static int ocfs2_remap_file_range(struct file *file_in,
-				  loff_t pos_in,
-				  struct file *file_out,
-				  loff_t pos_out,
-				  u64 len,
-				  unsigned int remap_flags)
+static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
+				     struct file *file_out, loff_t pos_out,
+				     loff_t len, unsigned int remap_flags)
 {
+	int ret;
+
 	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
 		return -EINVAL;
 
-	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					 len, remap_flags);
+	ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					len, remap_flags);
+	return ret < 0 ? ret : len;
 }
 
 const struct inode_operations ocfs2_file_iops = {
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 270a5b1919f6..a3df118bf3b9 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4824,7 +4824,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 			      loff_t pos_in,
 			      struct file *file_out,
 			      loff_t pos_out,
-			      u64 len,
+			      loff_t len,
 			      unsigned int remap_flags)
 {
 	struct inode *inode_in = file_inode(file_in);
diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
index d2c5f526edff..eb65c1d0843c 100644
--- a/fs/ocfs2/refcounttree.h
+++ b/fs/ocfs2/refcounttree.h
@@ -119,7 +119,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 			      loff_t pos_in,
 			      struct file *file_out,
 			      loff_t pos_out,
-			      u64 len,
+			      loff_t len,
 			      unsigned int remap_flags);
 
 #endif /* OCFS2_REFCOUNTTREE_H */
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 1cc797a08a5b..8750b7235516 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -125,6 +125,7 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
 	struct file *new_file;
 	loff_t old_pos = 0;
 	loff_t new_pos = 0;
+	loff_t cloned;
 	int error = 0;
 
 	if (len == 0)
@@ -141,11 +142,10 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
 	}
 
 	/* Try to use clone_file_range to clone up within the same fs */
-	error = do_clone_file_range(old_file, 0, new_file, 0, len);
-	if (!error)
+	cloned = do_clone_file_range(old_file, 0, new_file, 0, len);
+	if (cloned == len)
 		goto out;
 	/* Couldn't clone, so now we try to copy the data */
-	error = 0;
 
 	/* FIXME: copy up sparse files efficiently */
 	while (len) {
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 455bf49bd07b..177731b21bad 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -434,14 +434,14 @@ enum ovl_copyop {
 	OVL_DEDUPE,
 };
 
-static ssize_t ovl_copyfile(struct file *file_in, loff_t pos_in,
+static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 			    struct file *file_out, loff_t pos_out,
-			    u64 len, unsigned int flags, enum ovl_copyop op)
+			    loff_t len, unsigned int flags, enum ovl_copyop op)
 {
 	struct inode *inode_out = file_inode(file_out);
 	struct fd real_in, real_out;
 	const struct cred *old_cred;
-	ssize_t ret;
+	loff_t ret;
 
 	ret = ovl_real_fdget(file_out, &real_out);
 	if (ret)
@@ -489,9 +489,9 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
 			    OVL_COPY);
 }
 
-static int ovl_remap_file_range(struct file *file_in, loff_t pos_in,
-				struct file *file_out, loff_t pos_out,
-				u64 len, unsigned int remap_flags)
+static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
+				   struct file *file_out, loff_t pos_out,
+				   loff_t len, unsigned int remap_flags)
 {
 	enum ovl_copyop op;
 
diff --git a/fs/read_write.c b/fs/read_write.c
index 0c43997bd4a1..29ec0a6bfa2f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1589,10 +1589,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	 * more efficient if both clone and copy are supported (e.g. NFS).
 	 */
 	if (file_in->f_op->remap_file_range) {
-		ret = file_in->f_op->remap_file_range(file_in, pos_in,
-				file_out, pos_out, len, 0);
-		if (ret == 0) {
-			ret = len;
+		loff_t cloned;
+
+		cloned = file_in->f_op->remap_file_range(file_in, pos_in,
+				file_out, pos_out,
+				min_t(loff_t, MAX_RW_COUNT, len), 0);
+		if (cloned > 0) {
+			ret = cloned;
 			goto done;
 		}
 	}
@@ -1686,11 +1689,12 @@ SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
 	return ret;
 }
 
-static int remap_verify_area(struct file *file, loff_t pos, u64 len, bool write)
+static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
+			     bool write)
 {
 	struct inode *inode = file_inode(file);
 
-	if (unlikely(pos < 0))
+	if (unlikely(pos < 0 || len < 0))
 		return -EINVAL;
 
 	 if (unlikely((loff_t) (pos + len) < 0))
@@ -1720,7 +1724,7 @@ static int remap_verify_area(struct file *file, loff_t pos, u64 len, bool write)
  */
 int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
-				  u64 *len, unsigned int remap_flags)
+				  loff_t *len, unsigned int remap_flags)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
@@ -1832,12 +1836,12 @@ int generic_remap_file_range_touch(struct file *file, unsigned int remap_flags)
 }
 EXPORT_SYMBOL(generic_remap_file_range_touch);
 
-int do_clone_file_range(struct file *file_in, loff_t pos_in,
-			struct file *file_out, loff_t pos_out, u64 len)
+loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
+			   struct file *file_out, loff_t pos_out, loff_t len)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
-	int ret;
+	loff_t ret;
 
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
 		return -EISDIR;
@@ -1870,19 +1874,19 @@ int do_clone_file_range(struct file *file_in, loff_t pos_in,
 
 	ret = file_in->f_op->remap_file_range(file_in, pos_in,
 			file_out, pos_out, len, 0);
-	if (!ret) {
-		fsnotify_access(file_in);
-		fsnotify_modify(file_out);
-	}
+	if (ret < 0)
+		return ret;
 
+	fsnotify_access(file_in);
+	fsnotify_modify(file_out);
 	return ret;
 }
 EXPORT_SYMBOL(do_clone_file_range);
 
-int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
-			 struct file *file_out, loff_t pos_out, u64 len)
+loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in,
+			    struct file *file_out, loff_t pos_out, loff_t len)
 {
-	int ret;
+	loff_t ret;
 
 	file_start_write(file_out);
 	ret = do_clone_file_range(file_in, pos_in, file_out, pos_out, len);
@@ -1988,10 +1992,11 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
 
-int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
-			      struct file *dst_file, loff_t dst_pos, u64 len)
+loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
+				 struct file *dst_file, loff_t dst_pos,
+				 loff_t len)
 {
-	s64 ret;
+	loff_t ret;
 
 	ret = mnt_want_write_file(dst_file);
 	if (ret)
@@ -2040,7 +2045,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 	int i;
 	int ret;
 	u16 count = same->dest_count;
-	int deduped;
+	loff_t deduped;
 
 	if (!(file->f_mode & FMODE_READ))
 		return -EINVAL;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index dce01729e522..bc9e94bcb7a3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -919,20 +919,23 @@ xfs_file_fallocate(
 	return error;
 }
 
-STATIC int
+STATIC loff_t
 xfs_file_remap_range(
 	struct file	*file_in,
 	loff_t		pos_in,
 	struct file	*file_out,
 	loff_t		pos_out,
-	u64		len,
+	loff_t		len,
 	unsigned int	remap_flags)
 {
+	int		ret;
+
 	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
 		return -EINVAL;
 
-	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
 			len, remap_flags);
+	return ret < 0 ? ret : len;
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index ada3b80267c6..b24a2a1c4db1 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1296,7 +1296,7 @@ xfs_reflink_remap_prep(
 	loff_t			pos_in,
 	struct file		*file_out,
 	loff_t			pos_out,
-	u64			*len,
+	loff_t			*len,
 	unsigned int		remap_flags)
 {
 	struct inode		*inode_in = file_inode(file_in);
@@ -1396,7 +1396,7 @@ xfs_reflink_remap_range(
 	loff_t			pos_in,
 	struct file		*file_out,
 	loff_t			pos_out,
-	u64			len,
+	loff_t			len,
 	unsigned int		remap_flags)
 {
 	struct inode		*inode_in = file_inode(file_in);
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 6f82d628bf17..c3c46c276fe1 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -28,7 +28,7 @@ extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 extern int xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
-		struct file *file_out, loff_t pos_out, u64 len,
+		struct file *file_out, loff_t pos_out, loff_t len,
 		unsigned int remap_flags);
 extern int xfs_reflink_inode_has_shared_extents(struct xfs_trans *tp,
 		struct xfs_inode *ip, bool *has_shared);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b59637b2f484..035d8a88f633 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1779,9 +1779,9 @@ struct file_operations {
 #endif
 	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
 			loff_t, size_t, unsigned int);
-	int (*remap_file_range)(struct file *file_in, loff_t pos_in,
-				struct file *file_out, loff_t pos_out,
-				u64 len, unsigned int remap_flags);
+	loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
+				   struct file *file_out, loff_t pos_out,
+				   loff_t len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 } __randomize_layout;
 
@@ -1846,21 +1846,24 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
 extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					 struct file *file_out, loff_t pos_out,
-					 u64 *count, unsigned int remap_flags);
+					 loff_t *count,
+					 unsigned int remap_flags);
 extern int generic_remap_file_range_touch(struct file *file,
 					  unsigned int remap_flags);
-extern int do_clone_file_range(struct file *file_in, loff_t pos_in,
-			       struct file *file_out, loff_t pos_out, u64 len);
-extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
-				struct file *file_out, loff_t pos_out, u64 len);
+extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
+				  struct file *file_out, loff_t pos_out,
+				  loff_t len);
+extern loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in,
+				   struct file *file_out, loff_t pos_out,
+				   loff_t len);
 extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 					 struct inode *dest, loff_t destoff,
 					 loff_t len, bool *is_same);
 extern int vfs_dedupe_file_range(struct file *file,
 				 struct file_dedupe_range *same);
-extern int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
-				     struct file *dst_file, loff_t dst_pos,
-				     u64 len);
+extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
+					struct file *dst_file, loff_t dst_pos,
+					loff_t len);
 
 
 struct super_operations {
@@ -2990,7 +2993,7 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
 extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
-				uint64_t *count, unsigned int remap_flags);
+				loff_t *count, unsigned int remap_flags);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index c34a89a35d5a..369cfd164e90 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3001,7 +3001,7 @@ EXPORT_SYMBOL(generic_write_checks);
  */
 int generic_remap_checks(struct file *file_in, loff_t pos_in,
 			 struct file *file_out, loff_t pos_out,
-			 uint64_t *req_count, unsigned int remap_flags)
+			 loff_t *req_count, unsigned int remap_flags)
 {
 	struct inode *inode_in = file_in->f_mapping->host;
 	struct inode *inode_out = file_out->f_mapping->host;


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 14/25] vfs: plumb RFR_* remap flags through the vfs clone functions
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (12 preceding siblings ...)
  2018-10-13  0:07 ` [PATCH 13/25] vfs: make remap_file_range functions take and return bytes completed Darrick J. Wong
@ 2018-10-13  0:07 ` Darrick J. Wong
  2018-10-13  0:07 ` [PATCH 15/25] vfs: plumb RFR_* remap flags through the vfs dedupe functions Darrick J. Wong
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:07 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Plumb a remap_flags argument through the {do,vfs}_clone_file_range
functions so that clone can take advantage of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/ioctl.c             |    2 +-
 fs/nfsd/vfs.c          |    2 +-
 fs/overlayfs/copy_up.c |    2 +-
 fs/overlayfs/file.c    |    6 +++---
 fs/read_write.c        |   13 +++++++++----
 include/linux/fs.h     |    4 ++--
 6 files changed, 17 insertions(+), 12 deletions(-)


diff --git a/fs/ioctl.c b/fs/ioctl.c
index 72537b68c272..505275ec5596 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -232,7 +232,7 @@ static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd,
 	if (src_file.file->f_path.mnt != dst_file->f_path.mnt)
 		goto fdput;
 	cloned = vfs_clone_file_range(src_file.file, off, dst_file, destoff,
-				      olen);
+				      olen, 0);
 	if (cloned < 0)
 		ret = cloned;
 	else if (olen && cloned != olen)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ac6cb6101cbe..726fc5b2b27a 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -543,7 +543,7 @@ __be32 nfsd4_clone_file_range(struct file *src, u64 src_pos, struct file *dst,
 {
 	loff_t cloned;
 
-	cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos, count);
+	cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos, count, 0);
 	if (count && cloned != count)
 		cloned = -EINVAL;
 	return nfserrno(cloned < 0 ? cloned : 0);
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 8750b7235516..5f82fece64a0 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -142,7 +142,7 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
 	}
 
 	/* Try to use clone_file_range to clone up within the same fs */
-	cloned = do_clone_file_range(old_file, 0, new_file, 0, len);
+	cloned = do_clone_file_range(old_file, 0, new_file, 0, len, 0);
 	if (cloned == len)
 		goto out;
 	/* Couldn't clone, so now we try to copy the data */
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 177731b21bad..e5cc17281d0b 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -462,7 +462,7 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 
 	case OVL_CLONE:
 		ret = vfs_clone_file_range(real_in.file, pos_in,
-					   real_out.file, pos_out, len);
+					   real_out.file, pos_out, len, flags);
 		break;
 
 	case OVL_DEDUPE:
@@ -512,8 +512,8 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
 	     !ovl_inode_upper(file_inode(file_out))))
 		return -EPERM;
 
-	return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
-			    op);
+	return ovl_copyfile(file_in, pos_in, file_out, pos_out, len,
+			    remap_flags, op);
 }
 
 const struct file_operations ovl_file_operations = {
diff --git a/fs/read_write.c b/fs/read_write.c
index 29ec0a6bfa2f..30076ed38091 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1837,12 +1837,15 @@ int generic_remap_file_range_touch(struct file *file, unsigned int remap_flags)
 EXPORT_SYMBOL(generic_remap_file_range_touch);
 
 loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
-			   struct file *file_out, loff_t pos_out, loff_t len)
+			   struct file *file_out, loff_t pos_out,
+			   loff_t len, unsigned int remap_flags)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
 	loff_t ret;
 
+	WARN_ON_ONCE(remap_flags);
+
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
 		return -EISDIR;
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
@@ -1873,7 +1876,7 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 		return ret;
 
 	ret = file_in->f_op->remap_file_range(file_in, pos_in,
-			file_out, pos_out, len, 0);
+			file_out, pos_out, len, remap_flags);
 	if (ret < 0)
 		return ret;
 
@@ -1884,12 +1887,14 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 EXPORT_SYMBOL(do_clone_file_range);
 
 loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in,
-			    struct file *file_out, loff_t pos_out, loff_t len)
+			    struct file *file_out, loff_t pos_out,
+			    loff_t len, unsigned int remap_flags)
 {
 	loff_t ret;
 
 	file_start_write(file_out);
-	ret = do_clone_file_range(file_in, pos_in, file_out, pos_out, len);
+	ret = do_clone_file_range(file_in, pos_in, file_out, pos_out, len,
+				  remap_flags);
 	file_end_write(file_out);
 
 	return ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 035d8a88f633..4acda4809027 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1852,10 +1852,10 @@ extern int generic_remap_file_range_touch(struct file *file,
 					  unsigned int remap_flags);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
-				  loff_t len);
+				  loff_t len, unsigned int remap_flags);
 extern loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 				   struct file *file_out, loff_t pos_out,
-				   loff_t len);
+				   loff_t len, unsigned int remap_flags);
 extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 					 struct inode *dest, loff_t destoff,
 					 loff_t len, bool *is_same);


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 15/25] vfs: plumb RFR_* remap flags through the vfs dedupe functions
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (13 preceding siblings ...)
  2018-10-13  0:07 ` [PATCH 14/25] vfs: plumb RFR_* remap flags through the vfs clone functions Darrick J. Wong
@ 2018-10-13  0:07 ` Darrick J. Wong
  2018-10-13  0:07 ` [PATCH 16/25] vfs: make remapping to source file eof more explicit Darrick J. Wong
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:07 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Plumb a remap_flags argument through the vfs_dedupe_file_range_one
functions so that dedupe can take advantage of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/file.c |    3 ++-
 fs/read_write.c     |    9 ++++++---
 include/linux/fs.h  |    2 +-
 3 files changed, 9 insertions(+), 5 deletions(-)


diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index e5cc17281d0b..8f7a162768f2 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -467,7 +467,8 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 
 	case OVL_DEDUPE:
 		ret = vfs_dedupe_file_range_one(real_in.file, pos_in,
-						real_out.file, pos_out, len);
+						real_out.file, pos_out, len,
+						flags);
 		break;
 	}
 	revert_creds(old_cred);
diff --git a/fs/read_write.c b/fs/read_write.c
index 30076ed38091..81e0f969da59 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1999,10 +1999,12 @@ EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
 
 loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 				 struct file *dst_file, loff_t dst_pos,
-				 loff_t len)
+				 loff_t len, unsigned int remap_flags)
 {
 	loff_t ret;
 
+	WARN_ON_ONCE(remap_flags & ~(RFR_SAME_DATA));
+
 	ret = mnt_want_write_file(dst_file);
 	if (ret)
 		return ret;
@@ -2033,7 +2035,7 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 	}
 
 	ret = dst_file->f_op->remap_file_range(src_file, src_pos, dst_file,
-			dst_pos, len, RFR_SAME_DATA);
+			dst_pos, len, remap_flags | RFR_SAME_DATA);
 out_drop_write:
 	mnt_drop_write_file(dst_file);
 
@@ -2101,7 +2103,8 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 		}
 
 		deduped = vfs_dedupe_file_range_one(file, off, dst_file,
-						    info->dest_offset, len);
+						    info->dest_offset, len,
+						    0);
 		if (deduped == -EBADE)
 			info->status = FILE_DEDUPE_RANGE_DIFFERS;
 		else if (deduped < 0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4acda4809027..d77b8d90d65e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1863,7 +1863,7 @@ extern int vfs_dedupe_file_range(struct file *file,
 				 struct file_dedupe_range *same);
 extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 					struct file *dst_file, loff_t dst_pos,
-					loff_t len);
+					loff_t len, unsigned int remap_flags);
 
 
 struct super_operations {


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 16/25] vfs: make remapping to source file eof more explicit
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (14 preceding siblings ...)
  2018-10-13  0:07 ` [PATCH 15/25] vfs: plumb RFR_* remap flags through the vfs dedupe functions Darrick J. Wong
@ 2018-10-13  0:07 ` Darrick J. Wong
  2018-10-14 17:24   ` Christoph Hellwig
  2018-10-13  0:07 ` [PATCH 17/25] vfs: enable remap callers that can handle short operations Darrick J. Wong
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:07 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a RFR_TO_SRC_EOF flag to explicitly declare that the caller wants
the remap implementation to remap to the end of the source file, once
the files are locked.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/ioctl.c         |    3 ++-
 fs/nfsd/vfs.c      |    4 +++-
 fs/read_write.c    |   13 ++++++++-----
 include/linux/fs.h |    8 +++++++-
 4 files changed, 20 insertions(+), 8 deletions(-)


diff --git a/fs/ioctl.c b/fs/ioctl.c
index 505275ec5596..088cf240ca10 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -224,6 +224,7 @@ static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd,
 {
 	struct fd src_file = fdget(srcfd);
 	loff_t cloned;
+	unsigned int remap_flags = olen == 0 ? RFR_TO_SRC_EOF : 0;
 	int ret;
 
 	if (!src_file.file)
@@ -232,7 +233,7 @@ static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd,
 	if (src_file.file->f_path.mnt != dst_file->f_path.mnt)
 		goto fdput;
 	cloned = vfs_clone_file_range(src_file.file, off, dst_file, destoff,
-				      olen, 0);
+				      olen, remap_flags);
 	if (cloned < 0)
 		ret = cloned;
 	else if (olen && cloned != olen)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 726fc5b2b27a..0dc65047df1a 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -542,8 +542,10 @@ __be32 nfsd4_clone_file_range(struct file *src, u64 src_pos, struct file *dst,
 		u64 dst_pos, u64 count)
 {
 	loff_t cloned;
+	unsigned int remap_flags = count == 0 ? RFR_TO_SRC_EOF : 0;
 
-	cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos, count, 0);
+	cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos, count,
+				      remap_flags);
 	if (count && cloned != count)
 		cloned = -EINVAL;
 	return nfserrno(cloned < 0 ? cloned : 0);
diff --git a/fs/read_write.c b/fs/read_write.c
index 81e0f969da59..c02fc5144d15 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1746,15 +1746,18 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
 		return -EINVAL;
 
-	/* Zero length dedupe exits immediately; reflink goes to EOF. */
-	if (*len == 0) {
+	/*
+	 * If the caller asked to go all the way to the end of the source file,
+	 * set *len now that we have the file locked.
+	 */
+	if (remap_flags & RFR_TO_SRC_EOF) {
 		loff_t isize = i_size_read(inode_in);
 
-		if (is_dedupe || pos_in == isize)
-			return 0;
 		if (pos_in > isize)
 			return -EINVAL;
 		*len = isize - pos_in;
+		if (*len == 0)
+			return 0;
 	}
 
 	/* Check that we don't violate system file offset limits. */
@@ -1844,7 +1847,7 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 	struct inode *inode_out = file_inode(file_out);
 	loff_t ret;
 
-	WARN_ON_ONCE(remap_flags);
+	WARN_ON_ONCE(remap_flags & ~(RFR_TO_SRC_EOF));
 
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
 		return -EISDIR;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d77b8d90d65e..b9c314f9d5a4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1725,10 +1725,15 @@ struct block_device_operations;
  * These flags control the behavior of the remap_file_range function pointer.
  *
  * RFR_SAME_DATA: only remap if contents identical (i.e. deduplicate)
+ * RFR_TO_SRC_EOF: remap to the end of the source file
  */
 #define RFR_SAME_DATA		(1 << 0)
+#define RFR_TO_SRC_EOF		(1 << 1)
 
-#define RFR_VALID_FLAGS		(RFR_SAME_DATA)
+#define RFR_VALID_FLAGS		(RFR_SAME_DATA | RFR_TO_SRC_EOF)
+
+/* Implemented by the VFS, so these are advisory. */
+#define RFR_VFS_FLAGS		(RFR_TO_SRC_EOF)
 
 /*
  * Filesystem remapping implementations should call this helper on their
@@ -1739,6 +1744,7 @@ struct block_device_operations;
 static inline bool remap_check_flags(unsigned int remap_flags,
 				     unsigned int supported_flags)
 {
+	remap_flags &= ~RFR_VFS_FLAGS;
 	return (remap_flags & ~(supported_flags & RFR_VALID_FLAGS)) == 0;
 }
 


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 17/25] vfs: enable remap callers that can handle short operations
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (15 preceding siblings ...)
  2018-10-13  0:07 ` [PATCH 16/25] vfs: make remapping to source file eof more explicit Darrick J. Wong
@ 2018-10-13  0:07 ` Darrick J. Wong
  2018-10-13  0:07 ` [PATCH 18/25] vfs: hide file range comparison function Darrick J. Wong
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:07 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Plumb in a remap flag that enables the filesystem remap handler to
shorten remapping requests for callers that can handle it.  Now
copy_file_range can report partial success (in case we run up against
alignment problems, resource limits, etc.).

We also enable CAN_SHORTEN for fideduperange to maintain existing
userspace-visible behavior where xfs/btrfs shorten the dedupe range to
avoid stale post-eof data exposure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/read_write.c    |   22 +++++++++++++++-------
 include/linux/fs.h |    7 +++++--
 mm/filemap.c       |   16 ++++++++++++----
 3 files changed, 32 insertions(+), 13 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index c02fc5144d15..86a8c2557d18 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1593,7 +1593,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 
 		cloned = file_in->f_op->remap_file_range(file_in, pos_in,
 				file_out, pos_out,
-				min_t(loff_t, MAX_RW_COUNT, len), 0);
+				min_t(loff_t, MAX_RW_COUNT, len),
+				RFR_CAN_SHORTEN);
 		if (cloned > 0) {
 			ret = cloned;
 			goto done;
@@ -1797,6 +1798,8 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 
 	/* Are we doing a partial EOF block remapping of some kind? */
 	if (*len & blkmask) {
+		loff_t	new_len = *len;
+
 		/*
 		 * If the dedupe data matches, chop off the partial EOF block
 		 * from the source file so we don't try to dedupe the partial
@@ -1804,11 +1807,16 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 		 *
 		 * If the user is attempting to remap a partial EOF block and
 		 * it's inside the destination EOF then reject it.
+		 *
+		 * If possible, shorten the request instead of rejecting it.
 		 */
-		if (is_dedupe)
-			*len &= ~blkmask;
-		else if (pos_out + *len < i_size_read(inode_out))
-			return -EINVAL;
+		if (is_dedupe || pos_out + *len < i_size_read(inode_out))
+			new_len &= ~blkmask;
+
+		if (new_len != *len && !(remap_flags & RFR_CAN_SHORTEN))
+			return is_dedupe ? -EBADE : -EINVAL;
+
+		*len = new_len;
 	}
 
 	return 1;
@@ -2006,7 +2014,7 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 {
 	loff_t ret;
 
-	WARN_ON_ONCE(remap_flags & ~(RFR_SAME_DATA));
+	WARN_ON_ONCE(remap_flags & ~(RFR_SAME_DATA | RFR_CAN_SHORTEN));
 
 	ret = mnt_want_write_file(dst_file);
 	if (ret)
@@ -2107,7 +2115,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 
 		deduped = vfs_dedupe_file_range_one(file, off, dst_file,
 						    info->dest_offset, len,
-						    0);
+						    RFR_CAN_SHORTEN);
 		if (deduped == -EBADE)
 			info->status = FILE_DEDUPE_RANGE_DIFFERS;
 		else if (deduped < 0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b9c314f9d5a4..57cb56bbc30a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1726,14 +1726,17 @@ struct block_device_operations;
  *
  * RFR_SAME_DATA: only remap if contents identical (i.e. deduplicate)
  * RFR_TO_SRC_EOF: remap to the end of the source file
+ * RFR_CAN_SHORTEN: caller can handle a shortened request
  */
 #define RFR_SAME_DATA		(1 << 0)
 #define RFR_TO_SRC_EOF		(1 << 1)
+#define RFR_CAN_SHORTEN		(1 << 2)
 
-#define RFR_VALID_FLAGS		(RFR_SAME_DATA | RFR_TO_SRC_EOF)
+#define RFR_VALID_FLAGS		(RFR_SAME_DATA | RFR_TO_SRC_EOF | \
+				 RFR_CAN_SHORTEN)
 
 /* Implemented by the VFS, so these are advisory. */
-#define RFR_VFS_FLAGS		(RFR_TO_SRC_EOF)
+#define RFR_VFS_FLAGS		(RFR_TO_SRC_EOF | RFR_CAN_SHORTEN)
 
 /*
  * Filesystem remapping implementations should call this helper on their
diff --git a/mm/filemap.c b/mm/filemap.c
index 369cfd164e90..bccbd3621238 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3051,8 +3051,12 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	if (pos_in + count == size_in) {
 		bcount = ALIGN(size_in, bs) - pos_in;
 	} else {
-		if (!IS_ALIGNED(count, bs))
-			return -EINVAL;
+		if (!IS_ALIGNED(count, bs)) {
+			if (remap_flags & RFR_CAN_SHORTEN)
+				count = ALIGN_DOWN(count, bs);
+			else
+				return -EINVAL;
+		}
 
 		bcount = count;
 	}
@@ -3063,10 +3067,14 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	    pos_out < pos_in + bcount)
 		return -EINVAL;
 
-	/* For now we don't support changing the length. */
-	if (*req_count != count)
+	/*
+	 * We shortened the request but the caller can't deal with that, so
+	 * bounce the request back to userspace.
+	 */
+	if (*req_count != count && !(remap_flags & RFR_CAN_SHORTEN))
 		return -EINVAL;
 
+	*req_count = count;
 	return 0;
 }
 


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 18/25] vfs: hide file range comparison function
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (16 preceding siblings ...)
  2018-10-13  0:07 ` [PATCH 17/25] vfs: enable remap callers that can handle short operations Darrick J. Wong
@ 2018-10-13  0:07 ` Darrick J. Wong
  2018-10-14 17:43   ` Christoph Hellwig
  2018-10-13  0:07 ` [PATCH 19/25] vfs: implement opportunistic short dedupe Darrick J. Wong
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:07 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

There are no callers of vfs_dedupe_file_range_compare, so we might as
well make it a static helper and remove the export.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/read_write.c    |  191 ++++++++++++++++++++++++++--------------------------
 include/linux/fs.h |    3 -
 2 files changed, 95 insertions(+), 99 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 86a8c2557d18..ce3d5c4b1d34 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1714,6 +1714,101 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
 	return security_file_permission(file, write ? MAY_WRITE : MAY_READ);
 }
 
+/*
+ * Read a page's worth of file data into the page cache.  Return the page
+ * locked.
+ */
+static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
+{
+	struct address_space *mapping;
+	struct page *page;
+	pgoff_t n;
+
+	n = offset >> PAGE_SHIFT;
+	mapping = inode->i_mapping;
+	page = read_mapping_page(mapping, n, NULL);
+	if (IS_ERR(page))
+		return page;
+	if (!PageUptodate(page)) {
+		put_page(page);
+		return ERR_PTR(-EIO);
+	}
+	lock_page(page);
+	return page;
+}
+
+/*
+ * Compare extents of two files to see if they are the same.
+ * Caller must have locked both inodes to prevent write races.
+ */
+static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+					 struct inode *dest, loff_t destoff,
+					 loff_t len, bool *is_same)
+{
+	loff_t src_poff;
+	loff_t dest_poff;
+	void *src_addr;
+	void *dest_addr;
+	struct page *src_page;
+	struct page *dest_page;
+	loff_t cmp_len;
+	bool same;
+	int error;
+
+	error = -EINVAL;
+	same = true;
+	while (len) {
+		src_poff = srcoff & (PAGE_SIZE - 1);
+		dest_poff = destoff & (PAGE_SIZE - 1);
+		cmp_len = min(PAGE_SIZE - src_poff,
+			      PAGE_SIZE - dest_poff);
+		cmp_len = min(cmp_len, len);
+		if (cmp_len <= 0)
+			goto out_error;
+
+		src_page = vfs_dedupe_get_page(src, srcoff);
+		if (IS_ERR(src_page)) {
+			error = PTR_ERR(src_page);
+			goto out_error;
+		}
+		dest_page = vfs_dedupe_get_page(dest, destoff);
+		if (IS_ERR(dest_page)) {
+			error = PTR_ERR(dest_page);
+			unlock_page(src_page);
+			put_page(src_page);
+			goto out_error;
+		}
+		src_addr = kmap_atomic(src_page);
+		dest_addr = kmap_atomic(dest_page);
+
+		flush_dcache_page(src_page);
+		flush_dcache_page(dest_page);
+
+		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+			same = false;
+
+		kunmap_atomic(dest_addr);
+		kunmap_atomic(src_addr);
+		unlock_page(dest_page);
+		unlock_page(src_page);
+		put_page(dest_page);
+		put_page(src_page);
+
+		if (!same)
+			break;
+
+		srcoff += cmp_len;
+		destoff += cmp_len;
+		len -= cmp_len;
+	}
+
+	*is_same = same;
+	return 0;
+
+out_error:
+	return error;
+}
+
 /*
  * Check that the two inodes are eligible for cloning, the ranges make
  * sense, and then flush all dirty data.  Caller must ensure that the
@@ -1912,102 +2007,6 @@ loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 }
 EXPORT_SYMBOL(vfs_clone_file_range);
 
-/*
- * Read a page's worth of file data into the page cache.  Return the page
- * locked.
- */
-static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
-{
-	struct address_space *mapping;
-	struct page *page;
-	pgoff_t n;
-
-	n = offset >> PAGE_SHIFT;
-	mapping = inode->i_mapping;
-	page = read_mapping_page(mapping, n, NULL);
-	if (IS_ERR(page))
-		return page;
-	if (!PageUptodate(page)) {
-		put_page(page);
-		return ERR_PTR(-EIO);
-	}
-	lock_page(page);
-	return page;
-}
-
-/*
- * Compare extents of two files to see if they are the same.
- * Caller must have locked both inodes to prevent write races.
- */
-int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
-				  struct inode *dest, loff_t destoff,
-				  loff_t len, bool *is_same)
-{
-	loff_t src_poff;
-	loff_t dest_poff;
-	void *src_addr;
-	void *dest_addr;
-	struct page *src_page;
-	struct page *dest_page;
-	loff_t cmp_len;
-	bool same;
-	int error;
-
-	error = -EINVAL;
-	same = true;
-	while (len) {
-		src_poff = srcoff & (PAGE_SIZE - 1);
-		dest_poff = destoff & (PAGE_SIZE - 1);
-		cmp_len = min(PAGE_SIZE - src_poff,
-			      PAGE_SIZE - dest_poff);
-		cmp_len = min(cmp_len, len);
-		if (cmp_len <= 0)
-			goto out_error;
-
-		src_page = vfs_dedupe_get_page(src, srcoff);
-		if (IS_ERR(src_page)) {
-			error = PTR_ERR(src_page);
-			goto out_error;
-		}
-		dest_page = vfs_dedupe_get_page(dest, destoff);
-		if (IS_ERR(dest_page)) {
-			error = PTR_ERR(dest_page);
-			unlock_page(src_page);
-			put_page(src_page);
-			goto out_error;
-		}
-		src_addr = kmap_atomic(src_page);
-		dest_addr = kmap_atomic(dest_page);
-
-		flush_dcache_page(src_page);
-		flush_dcache_page(dest_page);
-
-		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
-			same = false;
-
-		kunmap_atomic(dest_addr);
-		kunmap_atomic(src_addr);
-		unlock_page(dest_page);
-		unlock_page(src_page);
-		put_page(dest_page);
-		put_page(src_page);
-
-		if (!same)
-			break;
-
-		srcoff += cmp_len;
-		destoff += cmp_len;
-		len -= cmp_len;
-	}
-
-	*is_same = same;
-	return 0;
-
-out_error:
-	return error;
-}
-EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
-
 loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 				 struct file *dst_file, loff_t dst_pos,
 				 loff_t len, unsigned int remap_flags)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 57cb56bbc30a..f0603ed007e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1865,9 +1865,6 @@ extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 extern loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 				   struct file *file_out, loff_t pos_out,
 				   loff_t len, unsigned int remap_flags);
-extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
-					 struct inode *dest, loff_t destoff,
-					 loff_t len, bool *is_same);
 extern int vfs_dedupe_file_range(struct file *file,
 				 struct file_dedupe_range *same);
 extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 19/25] vfs: implement opportunistic short dedupe
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (17 preceding siblings ...)
  2018-10-13  0:07 ` [PATCH 18/25] vfs: hide file range comparison function Darrick J. Wong
@ 2018-10-13  0:07 ` Darrick J. Wong
  2018-10-14 17:26   ` Christoph Hellwig
  2018-10-13  0:08 ` [PATCH 20/25] ocfs2: truncate page cache for clone destination file before remapping Darrick J. Wong
                   ` (5 subsequent siblings)
  24 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:07 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

For a given dedupe request, the bytes_deduped field in the control
structure tells userspace if we managed to deduplicate some, but not all
of, the requested regions starting from the file offsets supplied.
However, due to sloppy coding, the current dedupe code returns
FILE_DEDUPE_RANGE_DIFFERS if any part of the range is different.
Fix this so that we can actually support partial request completion.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/read_write.c    |   48 ++++++++++++++++++++++++++++++++++++++----------
 include/linux/fs.h |    7 +++++--
 2 files changed, 43 insertions(+), 12 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index ce3d5c4b1d34..edd2e9ceb71b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1737,13 +1737,26 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
 	return page;
 }
 
+static unsigned int vfs_dedupe_memcmp(const char *s1, const char *s2,
+				      unsigned int len)
+{
+	const char *orig_s1;
+
+	for (orig_s1 = s1; len > 0; s1++, s2++, len--)
+		if (*s1 != *s2)
+			break;
+
+	return s1 - orig_s1;
+}
+
 /*
  * Compare extents of two files to see if they are the same.
  * Caller must have locked both inodes to prevent write races.
  */
 static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 					 struct inode *dest, loff_t destoff,
-					 loff_t len, bool *is_same)
+					 loff_t *req_len,
+					 unsigned int remap_flags)
 {
 	loff_t src_poff;
 	loff_t dest_poff;
@@ -1751,8 +1764,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 	void *dest_addr;
 	struct page *src_page;
 	struct page *dest_page;
-	loff_t cmp_len;
+	loff_t len = *req_len;
+	loff_t same_len = 0;
 	bool same;
+	unsigned int cmp_len;
+	unsigned int cmp_same;
 	int error;
 
 	error = -EINVAL;
@@ -1762,7 +1778,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		dest_poff = destoff & (PAGE_SIZE - 1);
 		cmp_len = min(PAGE_SIZE - src_poff,
 			      PAGE_SIZE - dest_poff);
-		cmp_len = min(cmp_len, len);
+		cmp_len = min_t(loff_t, cmp_len, len);
 		if (cmp_len <= 0)
 			goto out_error;
 
@@ -1784,7 +1800,10 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		flush_dcache_page(src_page);
 		flush_dcache_page(dest_page);
 
-		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+		cmp_same = vfs_dedupe_memcmp(src_addr + src_poff,
+					     dest_addr + dest_poff, cmp_len);
+		same_len += cmp_same;
+		if (cmp_same != cmp_len)
 			same = false;
 
 		kunmap_atomic(dest_addr);
@@ -1802,7 +1821,17 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		len -= cmp_len;
 	}
 
-	*is_same = same;
+	/*
+	 * If less than the whole range matched, we have to back down to the
+	 * nearest block boundary.
+	 */
+	if (*req_len != same_len) {
+		if (!(remap_flags & RFR_SHORT_DEDUPE))
+			return -EBADE;
+
+		*req_len = ALIGN_DOWN(same_len, dest->i_sb->s_blocksize);
+	}
+
 	return 0;
 
 out_error:
@@ -1881,13 +1910,11 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	 * Check that the extents are the same.
 	 */
 	if (is_dedupe) {
-		bool		is_same = false;
-
 		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
-				inode_out, pos_out, *len, &is_same);
+				inode_out, pos_out, len, remap_flags);
 		if (ret)
 			return ret;
-		if (!is_same)
+		if (*len == 0)
 			return -EBADE;
 	}
 
@@ -2013,7 +2040,8 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 {
 	loff_t ret;
 
-	WARN_ON_ONCE(remap_flags & ~(RFR_SAME_DATA | RFR_CAN_SHORTEN));
+	WARN_ON_ONCE(remap_flags & ~(RFR_SAME_DATA | RFR_CAN_SHORTEN |
+				     RFR_SHORT_DEDUPE));
 
 	ret = mnt_want_write_file(dst_file);
 	if (ret)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f0603ed007e9..18b6db85ab64 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1727,16 +1727,19 @@ struct block_device_operations;
  * RFR_SAME_DATA: only remap if contents identical (i.e. deduplicate)
  * RFR_TO_SRC_EOF: remap to the end of the source file
  * RFR_CAN_SHORTEN: caller can handle a shortened request
+ * RFR_SHORT_DEDUPE: deduplicate from byte 0 until the file data don't match
  */
 #define RFR_SAME_DATA		(1 << 0)
 #define RFR_TO_SRC_EOF		(1 << 1)
 #define RFR_CAN_SHORTEN		(1 << 2)
+#define RFR_SHORT_DEDUPE	(1 << 3)
 
 #define RFR_VALID_FLAGS		(RFR_SAME_DATA | RFR_TO_SRC_EOF | \
-				 RFR_CAN_SHORTEN)
+				 RFR_CAN_SHORTEN | RFR_SHORT_DEDUPE)
 
 /* Implemented by the VFS, so these are advisory. */
-#define RFR_VFS_FLAGS		(RFR_TO_SRC_EOF | RFR_CAN_SHORTEN)
+#define RFR_VFS_FLAGS		(RFR_TO_SRC_EOF | RFR_CAN_SHORTEN | \
+				 RFR_SHORT_DEDUPE)
 
 /*
  * Filesystem remapping implementations should call this helper on their


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 20/25] ocfs2: truncate page cache for clone destination file before remapping
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (18 preceding siblings ...)
  2018-10-13  0:07 ` [PATCH 19/25] vfs: implement opportunistic short dedupe Darrick J. Wong
@ 2018-10-13  0:08 ` Darrick J. Wong
  2018-10-13  0:08 ` [PATCH 21/25] ocfs2: fix pagecache truncation prior to reflink Darrick J. Wong
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:08 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

When cloning blocks into another file, truncate the page cache before we
start remapping blocks so that concurrent reads wait for us to finish.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/refcounttree.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index a3df118bf3b9..851ba3ae7ce8 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4869,14 +4869,12 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 		down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
 				  SINGLE_DEPTH_NESTING);
 
-	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
-					 out_bh, pos_out, len);
-
 	/* Zap any page cache for the destination file's range. */
-	if (!ret)
-		truncate_inode_pages_range(&inode_out->i_data, pos_out,
-					   PAGE_ALIGN(pos_out + len) - 1);
+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
+				   PAGE_ALIGN(pos_out + len) - 1);
 
+	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
+					 out_bh, pos_out, len);
 	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
 	if (!same_inode)
 		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 21/25] ocfs2: fix pagecache truncation prior to reflink
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (19 preceding siblings ...)
  2018-10-13  0:08 ` [PATCH 20/25] ocfs2: truncate page cache for clone destination file before remapping Darrick J. Wong
@ 2018-10-13  0:08 ` Darrick J. Wong
  2018-10-13  0:08 ` [PATCH 22/25] ocfs2: support partial clone range and dedupe range Darrick J. Wong
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:08 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Prior to remapping blocks, it is necessary to remove pages from the
destination file's page cache.  Unfortunately, the truncation is not
aggressive enough -- if page size > block size, we'll end up zeroing
subpage blocks instead of removing them.  So, round the start offset
down and the end offset up to page boundaries.  We already wrote all
the dirty data so the larger range should be fine.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/refcounttree.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 851ba3ae7ce8..b9e0418a1974 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4870,8 +4870,9 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 				  SINGLE_DEPTH_NESTING);
 
 	/* Zap any page cache for the destination file's range. */
-	truncate_inode_pages_range(&inode_out->i_data, pos_out,
-				   PAGE_ALIGN(pos_out + len) - 1);
+	truncate_inode_pages_range(&inode_out->i_data,
+				   round_down(pos_out, PAGE_SIZE),
+				   round_up(pos_out + len, PAGE_SIZE) - 1);
 
 	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
 					 out_bh, pos_out, len);


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 22/25] ocfs2: support partial clone range and dedupe range
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (20 preceding siblings ...)
  2018-10-13  0:08 ` [PATCH 21/25] ocfs2: fix pagecache truncation prior to reflink Darrick J. Wong
@ 2018-10-13  0:08 ` Darrick J. Wong
  2018-10-14 17:41   ` Christoph Hellwig
  2018-10-13  0:08 ` [PATCH 23/25] xfs: fix pagecache truncation prior to reflink Darrick J. Wong
                   ` (2 subsequent siblings)
  24 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:08 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Change the ocfs2 remap code to allow for returning partial results.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/file.c         |    7 +----
 fs/ocfs2/refcounttree.c |   73 ++++++++++++++++++++++++++---------------------
 fs/ocfs2/refcounttree.h |   12 ++++----
 3 files changed, 48 insertions(+), 44 deletions(-)


diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index e6ffed70398e..061ae2c4bd4a 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2531,14 +2531,11 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
 				     struct file *file_out, loff_t pos_out,
 				     loff_t len, unsigned int remap_flags)
 {
-	int ret;
-
 	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
 		return -EINVAL;
 
-	ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					len, remap_flags);
-	return ret < 0 ? ret : len;
+	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, remap_flags);
 }
 
 const struct inode_operations ocfs2_file_iops = {
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index b9e0418a1974..4eacdd703874 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4507,14 +4507,14 @@ static int ocfs2_reflink_update_dest(struct inode *dest,
 }
 
 /* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
-static int ocfs2_reflink_remap_extent(struct inode *s_inode,
-				      struct buffer_head *s_bh,
-				      loff_t pos_in,
-				      struct inode *t_inode,
-				      struct buffer_head *t_bh,
-				      loff_t pos_out,
-				      loff_t len,
-				      struct ocfs2_cached_dealloc_ctxt *dealloc)
+static loff_t ocfs2_reflink_remap_extent(struct inode *s_inode,
+					 struct buffer_head *s_bh,
+					 loff_t pos_in,
+					 struct inode *t_inode,
+					 struct buffer_head *t_bh,
+					 loff_t pos_out,
+					 loff_t len,
+					 struct ocfs2_cached_dealloc_ctxt *dealloc)
 {
 	struct ocfs2_extent_tree s_et;
 	struct ocfs2_extent_tree t_et;
@@ -4522,6 +4522,7 @@ static int ocfs2_reflink_remap_extent(struct inode *s_inode,
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *ref_tree;
 	struct ocfs2_super *osb;
+	loff_t remapped = 0;
 	loff_t pstart, plen;
 	u32 p_cluster, num_clusters, slast, spos, tpos;
 	unsigned int ext_flags;
@@ -4605,30 +4606,32 @@ static int ocfs2_reflink_remap_extent(struct inode *s_inode,
 next_loop:
 		spos += num_clusters;
 		tpos += num_clusters;
+		remapped += ocfs2_clusters_to_bytes(t_inode->i_sb,
+				num_clusters);
 	}
 
-out:
-	return ret;
+	return remapped;
 out_unlock_refcount:
 	ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
 	brelse(ref_root_bh);
-	return ret;
+out:
+	return remapped > 0 ? remapped : ret;
 }
 
 /* Set up refcount tree and remap s_inode to t_inode. */
-static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
-				      struct buffer_head *s_bh,
-				      loff_t pos_in,
-				      struct inode *t_inode,
-				      struct buffer_head *t_bh,
-				      loff_t pos_out,
-				      loff_t len)
+static loff_t ocfs2_reflink_remap_blocks(struct inode *s_inode,
+					 struct buffer_head *s_bh,
+					 loff_t pos_in,
+					 struct inode *t_inode,
+					 struct buffer_head *t_bh,
+					 loff_t pos_out,
+					 loff_t len)
 {
 	struct ocfs2_cached_dealloc_ctxt dealloc;
 	struct ocfs2_super *osb;
 	struct ocfs2_dinode *dis;
 	struct ocfs2_dinode *dit;
-	int ret;
+	loff_t ret;
 
 	osb = OCFS2_SB(s_inode->i_sb);
 	dis = (struct ocfs2_dinode *)s_bh->b_data;
@@ -4700,7 +4703,7 @@ static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
 	/* Actually remap extents now. */
 	ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh,
 					 pos_out, len, &dealloc);
-	if (ret) {
+	if (ret < 0) {
 		mlog_errno(ret);
 		goto out;
 	}
@@ -4820,18 +4823,19 @@ static void ocfs2_reflink_inodes_unlock(struct inode *s_inode,
 }
 
 /* Link a range of blocks from one file to another. */
-int ocfs2_reflink_remap_range(struct file *file_in,
-			      loff_t pos_in,
-			      struct file *file_out,
-			      loff_t pos_out,
-			      loff_t len,
-			      unsigned int remap_flags)
+loff_t ocfs2_reflink_remap_range(struct file *file_in,
+				 loff_t pos_in,
+				 struct file *file_out,
+				 loff_t pos_out,
+				 loff_t len,
+				 unsigned int remap_flags)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
 	struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
 	struct buffer_head *in_bh = NULL, *out_bh = NULL;
 	bool same_inode = (inode_in == inode_out);
+	loff_t remapped = 0;
 	ssize_t ret;
 
 	if (!ocfs2_refcount_tree(osb))
@@ -4855,6 +4859,11 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 	if (ret <= 0)
 		goto out_unlock;
 
+	if (len == 0) {
+		ret = 0;
+		goto out_unlock;
+	}
+
 	/*
 	 * Update inode timestamps and remove security privileges before we
 	 * take the ilock.
@@ -4874,12 +4883,13 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 				   round_down(pos_out, PAGE_SIZE),
 				   round_up(pos_out + len, PAGE_SIZE) - 1);
 
-	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
-					 out_bh, pos_out, len);
+	remapped = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in,
+			inode_out, out_bh, pos_out, len);
 	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
 	if (!same_inode)
 		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
-	if (ret) {
+	if (remapped < 0) {
+		ret = remapped;
 		mlog_errno(ret);
 		goto out_unlock;
 	}
@@ -4897,10 +4907,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 		goto out_unlock;
 	}
 
-	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
-	return 0;
-
 out_unlock:
 	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
-	return ret;
+	return remapped > 0 ? remapped : ret;
 }
diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
index eb65c1d0843c..9e64daba395d 100644
--- a/fs/ocfs2/refcounttree.h
+++ b/fs/ocfs2/refcounttree.h
@@ -115,11 +115,11 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 			const char __user *oldname,
 			const char __user *newname,
 			bool preserve);
-int ocfs2_reflink_remap_range(struct file *file_in,
-			      loff_t pos_in,
-			      struct file *file_out,
-			      loff_t pos_out,
-			      loff_t len,
-			      unsigned int remap_flags);
+loff_t ocfs2_reflink_remap_range(struct file *file_in,
+				 loff_t pos_in,
+				 struct file *file_out,
+				 loff_t pos_out,
+				 loff_t len,
+				 unsigned int remap_flags);
 
 #endif /* OCFS2_REFCOUNTTREE_H */


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 23/25] xfs: fix pagecache truncation prior to reflink
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (21 preceding siblings ...)
  2018-10-13  0:08 ` [PATCH 22/25] ocfs2: support partial clone range and dedupe range Darrick J. Wong
@ 2018-10-13  0:08 ` Darrick J. Wong
  2018-10-13  0:08 ` [PATCH 24/25] xfs: support returning partial reflink results Darrick J. Wong
  2018-10-13  0:08 ` [PATCH 25/25] xfs: remove redundant remap partial EOF block checks Darrick J. Wong
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:08 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, Dave Chinner, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Prior to remapping blocks, it is necessary to remove pages from the
destination file's page cache.  Unfortunately, the truncation is not
aggressive enough -- if page size > block size, we'll end up zeroing
subpage blocks instead of removing them.  So, round the start offset
down and the end offset up to page boundaries.  We already wrote all
the dirty data so the larger range shouldn't be a problem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_reflink.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index b24a2a1c4db1..e1592e751cc2 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1370,8 +1370,9 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	/* Zap any page cache for the destination file's range. */
-	truncate_inode_pages_range(&inode_out->i_data, pos_out,
-				   PAGE_ALIGN(pos_out + *len) - 1);
+	truncate_inode_pages_range(&inode_out->i_data,
+			round_down(pos_out, PAGE_SIZE),
+			round_up(pos_out + *len, PAGE_SIZE) - 1);
 
 	/*
 	 * Update inode timestamps and remove security privileges before we


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 24/25] xfs: support returning partial reflink results
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (22 preceding siblings ...)
  2018-10-13  0:08 ` [PATCH 23/25] xfs: fix pagecache truncation prior to reflink Darrick J. Wong
@ 2018-10-13  0:08 ` Darrick J. Wong
  2018-10-14 17:35   ` Christoph Hellwig
  2018-10-13  0:08 ` [PATCH 25/25] xfs: remove redundant remap partial EOF block checks Darrick J. Wong
  24 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:08 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Back when the XFS reflink code only supported clone_file_range, we were
only able to return zero or negative error codes to userspace.  However,
now that copy_file_range (which returns bytes copied) can use XFS'
clone_file_range, we have the opportunity to return partial results.
For example, if userspace sends a 1GB clone request and we run out of
space halfway through, we at least can tell userspace that we completed
512M of that request like a regular write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_file.c    |    5 +----
 fs/xfs/xfs_reflink.c |   20 +++++++++++++++-----
 fs/xfs/xfs_reflink.h |    2 +-
 3 files changed, 17 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index bc9e94bcb7a3..b2b15b8dc4a1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -928,14 +928,11 @@ xfs_file_remap_range(
 	loff_t		len,
 	unsigned int	remap_flags)
 {
-	int		ret;
-
 	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
 		return -EINVAL;
 
-	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
 			len, remap_flags);
-	return ret < 0 ? ret : len;
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index e1592e751cc2..66a8ddb9c058 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1123,6 +1123,7 @@ xfs_reflink_remap_blocks(
 	struct xfs_inode	*dest,
 	xfs_fileoff_t		destoff,
 	xfs_filblks_t		len,
+	xfs_filblks_t		*remapped_len,
 	xfs_off_t		new_isize)
 {
 	struct xfs_bmbt_irec	imap;
@@ -1130,6 +1131,7 @@ xfs_reflink_remap_blocks(
 	int			error = 0;
 	xfs_filblks_t		range_len;
 
+	*remapped_len = 0;
 	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
 	while (len) {
 		uint		lock_mode;
@@ -1168,6 +1170,7 @@ xfs_reflink_remap_blocks(
 		srcoff += range_len;
 		destoff += range_len;
 		len -= range_len;
+		*remapped_len += range_len;
 	}
 
 	return 0;
@@ -1391,7 +1394,7 @@ xfs_reflink_remap_prep(
 /*
  * Link a range of blocks from one file to another.
  */
-int
+loff_t
 xfs_reflink_remap_range(
 	struct file		*file_in,
 	loff_t			pos_in,
@@ -1406,9 +1409,10 @@ xfs_reflink_remap_range(
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	struct xfs_mount	*mp = src->i_mount;
 	xfs_fileoff_t		sfsbno, dfsbno;
-	xfs_filblks_t		fsblen;
+	xfs_filblks_t		fsblen, remappedfsb = 0;
+	loff_t			remapped_bytes = 0;
 	xfs_extlen_t		cowextsize;
-	ssize_t			ret;
+	int			ret;
 
 	if (!xfs_sb_version_hasreflink(&mp->m_sb))
 		return -EOPNOTSUPP;
@@ -1424,11 +1428,17 @@ xfs_reflink_remap_range(
 
 	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
 
+	if (len == 0) {
+		ret = 0;
+		goto out_unlock;
+	}
+
 	dfsbno = XFS_B_TO_FSBT(mp, pos_out);
 	sfsbno = XFS_B_TO_FSBT(mp, pos_in);
 	fsblen = XFS_B_TO_FSB(mp, len);
 	ret = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
-			pos_out + len);
+			&remappedfsb, pos_out + len);
+	remapped_bytes = min_t(int64_t, len, XFS_FSB_TO_B(mp, remappedfsb));
 	if (ret)
 		goto out_unlock;
 
@@ -1451,7 +1461,7 @@ xfs_reflink_remap_range(
 	xfs_reflink_remap_unlock(file_in, file_out);
 	if (ret)
 		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
-	return ret;
+	return remapped_bytes > 0 ? remapped_bytes : ret;
 }
 
 /*
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index c3c46c276fe1..cbc26ff79a8f 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -27,7 +27,7 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
-extern int xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
+extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, loff_t len,
 		unsigned int remap_flags);
 extern int xfs_reflink_inode_has_shared_extents(struct xfs_trans *tp,


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 25/25] xfs: remove redundant remap partial EOF block checks
  2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
                   ` (23 preceding siblings ...)
  2018-10-13  0:08 ` [PATCH 24/25] xfs: support returning partial reflink results Darrick J. Wong
@ 2018-10-13  0:08 ` Darrick J. Wong
  24 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-13  0:08 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, Dave Chinner, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that we've moved the partial EOF block checks to the VFS helpers, we
can remove the redundantn functionality from XFS.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_reflink.c |   20 --------------------
 1 file changed, 20 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 66a8ddb9c058..709735880efb 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1307,8 +1307,6 @@ xfs_reflink_remap_prep(
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	bool			same_inode = (inode_in == inode_out);
-	bool			is_dedupe = (remap_flags & RFR_SAME_DATA);
-	u64			blkmask = i_blocksize(inode_in) - 1;
 	ssize_t			ret;
 
 	/* Lock both files against IO */
@@ -1336,24 +1334,6 @@ xfs_reflink_remap_prep(
 	if (ret <= 0)
 		goto out_unlock;
 
-	/*
-	 * If the dedupe data matches, chop off the partial EOF block
-	 * from the source file so we don't try to dedupe the partial
-	 * EOF block.
-	 */
-	if (is_dedupe) {
-		*len &= ~blkmask;
-	} else if (*len & blkmask) {
-		/*
-		 * The user is attempting to share a partial EOF block,
-		 * if it's inside the destination EOF then reject it.
-		 */
-		if (pos_out + *len < i_size_read(inode_out)) {
-			ret = -EINVAL;
-			goto out_unlock;
-		}
-	}
-
 	/* Attach dquots to dest inode before changing block map */
 	ret = xfs_qm_dqattach(dest);
 	if (ret)


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/25] vfs: avoid problematic remapping requests into partial EOF block
  2018-10-13  0:06 ` [PATCH 05/25] vfs: avoid problematic remapping requests into partial EOF block Darrick J. Wong
@ 2018-10-14 17:11   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-14 17:11 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range
  2018-10-13  0:06 ` [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range Darrick J. Wong
@ 2018-10-14 17:19   ` Christoph Hellwig
  2018-10-15  6:04     ` Amir Goldstein
                       ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-14 17:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, linux-nfs, linux-cifs, Amir Goldstein,
	linux-unionfs, linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	ocfs2-devel

>  	unsigned (*mmap_capabilities)(struct file *);
>  #endif
>  	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
> -	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
> -	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
> +	int (*remap_file_range)(struct file *file_in, loff_t pos_in,
> +				struct file *file_out, loff_t pos_out,
> +				u64 len, unsigned int remap_flags);

None of the other methods in this file name their parameters.  While
I generally don't like people leaving them out, in the end consistency
is even more important.

> +int btrfs_remap_file_range(struct file *src_file, loff_t off,
> +		struct file *dst_file, loff_t destoff, u64 len,
> +		unsigned int remap_flags)
>  {
> +	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
> +		return -EINVAL;
> +
> +	if (remap_flags & RFR_SAME_DATA) {

So at least for btrfs there seems to be no shared code at all below
the function calls.  This kinda speaks against the argument that
they fundamentally are the same..

> +/*
> + * These flags control the behavior of the remap_file_range function pointer.
> + *
> + * RFR_SAME_DATA: only remap if contents identical (i.e. deduplicate)
> + */
> +#define RFR_SAME_DATA		(1 << 0)
> +
> +#define RFR_VALID_FLAGS		(RFR_SAME_DATA)

RFR?  Why not REMAP_FILE_*  Also why not the well understood
REMAP_FILE_DEDUP instead of the odd SAME_DATA?

> +
> +/*
> + * Filesystem remapping implementations should call this helper on their
> + * remap flags to filter out flags that the implementation doesn't support.
> + *
> + * Returns true if the flags are ok, false otherwise.
> + */
> +static inline bool remap_check_flags(unsigned int remap_flags,
> +				     unsigned int supported_flags)
> +{
> +	return (remap_flags & ~(supported_flags & RFR_VALID_FLAGS)) == 0;
> +}

Any reason to even bother with a helper for this?  ->fallocate
seems to be doing fine without the helper, and the resulting code
seems a lot easier to understand to me.

> @@ -1759,10 +1779,9 @@ struct file_operations {
>  #endif
>  	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
>  			loff_t, size_t, unsigned int);
> -	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
> -			u64);
> -	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t,
> -			u64);
> +	int (*remap_file_range)(struct file *file_in, loff_t pos_in,
> +				struct file *file_out, loff_t pos_out,
> +				u64 len, unsigned int remap_flags);

Same comment here.  Didn't we have some nice doc tools to avoid this
duplication? :)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/25] vfs: create generic_remap_file_range_touch to update inode metadata
  2018-10-13  0:06 ` [PATCH 10/25] vfs: create generic_remap_file_range_touch to update inode metadata Darrick J. Wong
@ 2018-10-14 17:21   ` Christoph Hellwig
  2018-10-15 16:30     ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-14 17:21 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, linux-nfs, linux-cifs, Amir Goldstein,
	linux-unionfs, linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	ocfs2-devel

> +/* Update inode timestamps and remove security privileges when remapping. */
> +int generic_remap_file_range_touch(struct file *file, bool is_dedupe)
> +{
> +	int ret;
> +
> +	/* If can't alter the file contents, we're done. */
> +	if (is_dedupe)
> +		return 0;
> +
> +	/* Update the timestamps, since we can alter file contents. */
> +	if (!(file->f_mode & FMODE_NOCMTIME)) {
> +		ret = file_update_time(file);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	/*
> +	 * Clear the security bits if the process is not being run by root.
> +	 * This keeps people from modifying setuid and setgid binaries.
> +	 */
> +	return file_remove_privs(file);
> +}
> +EXPORT_SYMBOL(generic_remap_file_range_touch);

The name seems a little out of touch with what it actually does.
Also why a bool argument instead of the more descriptive flags which
introduced a few patches ago?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/25] vfs: pass remap flags to generic_remap_file_range_prep
  2018-10-13  0:06 ` [PATCH 11/25] vfs: pass remap flags to generic_remap_file_range_prep Darrick J. Wong
@ 2018-10-14 17:22   ` Christoph Hellwig
  2018-10-14 17:37   ` Christoph Hellwig
  1 sibling, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-14 17:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, linux-nfs, linux-cifs, Amir Goldstein,
	linux-unionfs, linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	ocfs2-devel

On Fri, Oct 12, 2018 at 05:06:58PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Plumb the remap flags through the filesystem from the vfs function
> dispatcher all the way to the prep function to prepare for behavior
> changes in subsequent patches.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>

It seems like we should have passed this down earlier before all
the renaing and adding new helpers that could have easily started
out with the flags.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 16/25] vfs: make remapping to source file eof more explicit
  2018-10-13  0:07 ` [PATCH 16/25] vfs: make remapping to source file eof more explicit Darrick J. Wong
@ 2018-10-14 17:24   ` Christoph Hellwig
  2018-10-15 15:32     ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-14 17:24 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, linux-nfs, linux-cifs, Amir Goldstein,
	linux-unionfs, linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	ocfs2-devel

On Fri, Oct 12, 2018 at 05:07:37PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Create a RFR_TO_SRC_EOF flag to explicitly declare that the caller wants
> the remap implementation to remap to the end of the source file, once
> the files are locked.

The name looks like a cat threw up on your keyboard :)

From reading the code this seems to ask for a whole file remap, right?
Why not put that in the name to make it more descriptive?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 19/25] vfs: implement opportunistic short dedupe
  2018-10-13  0:07 ` [PATCH 19/25] vfs: implement opportunistic short dedupe Darrick J. Wong
@ 2018-10-14 17:26   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-14 17:26 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, linux-nfs, linux-cifs, Amir Goldstein,
	linux-unionfs, linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	ocfs2-devel

How is RFR_SHORT_DEDUPE so different from RFR_SAME_DATA + RFR_CAN_SHORTEN
that we need another flag for it?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 24/25] xfs: support returning partial reflink results
  2018-10-13  0:08 ` [PATCH 24/25] xfs: support returning partial reflink results Darrick J. Wong
@ 2018-10-14 17:35   ` Christoph Hellwig
  2018-10-14 23:05     ` Dave Chinner
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-14 17:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

On Fri, Oct 12, 2018 at 05:08:32PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Back when the XFS reflink code only supported clone_file_range, we were
> only able to return zero or negative error codes to userspace.  However,
> now that copy_file_range (which returns bytes copied) can use XFS'
> clone_file_range, we have the opportunity to return partial results.
> For example, if userspace sends a 1GB clone request and we run out of
> space halfway through, we at least can tell userspace that we completed
> 512M of that request like a regular write.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_file.c    |    5 +----
>  fs/xfs/xfs_reflink.c |   20 +++++++++++++++-----
>  fs/xfs/xfs_reflink.h |    2 +-
>  3 files changed, 17 insertions(+), 10 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index bc9e94bcb7a3..b2b15b8dc4a1 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -928,14 +928,11 @@ xfs_file_remap_range(
>  	loff_t		len,
>  	unsigned int	remap_flags)
>  {
> -	int		ret;
> -
>  	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
>  		return -EINVAL;
>  
> -	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> +	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
>  			len, remap_flags);

Is there any reason not to merge xfs_file_remap_range and
xfs_reflink_remap_range at this point?

>  STATIC int
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index e1592e751cc2..66a8ddb9c058 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1123,6 +1123,7 @@ xfs_reflink_remap_blocks(
>  	struct xfs_inode	*dest,
>  	xfs_fileoff_t		destoff,
>  	xfs_filblks_t		len,
> +	xfs_filblks_t		*remapped_len,
>  	xfs_off_t		new_isize)
>  {
>  	struct xfs_bmbt_irec	imap;
> @@ -1130,6 +1131,7 @@ xfs_reflink_remap_blocks(
>  	int			error = 0;
>  	xfs_filblks_t		range_len;
>  
> +	*remapped_len = 0;
>  	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
>  	while (len) {
>  		uint		lock_mode;
> @@ -1168,6 +1170,7 @@ xfs_reflink_remap_blocks(
>  		srcoff += range_len;
>  		destoff += range_len;
>  		len -= range_len;
> +		*remapped_len += range_len;
>  	}
>  
>  	return 0;
> @@ -1391,7 +1394,7 @@ xfs_reflink_remap_prep(
>  /*
>   * Link a range of blocks from one file to another.
>   */
> -int
> +loff_t
>  xfs_reflink_remap_range(
>  	struct file		*file_in,
>  	loff_t			pos_in,
> @@ -1406,9 +1409,10 @@ xfs_reflink_remap_range(
>  	struct xfs_inode	*dest = XFS_I(inode_out);
>  	struct xfs_mount	*mp = src->i_mount;
>  	xfs_fileoff_t		sfsbno, dfsbno;
> -	xfs_filblks_t		fsblen;
> +	xfs_filblks_t		fsblen, remappedfsb = 0;
> +	loff_t			remapped_bytes = 0;
>  	xfs_extlen_t		cowextsize;
> -	ssize_t			ret;
> +	int			ret;
>  
>  	if (!xfs_sb_version_hasreflink(&mp->m_sb))
>  		return -EOPNOTSUPP;
> @@ -1424,11 +1428,17 @@ xfs_reflink_remap_range(
>  
>  	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
>  
> +	if (len == 0) {
> +		ret = 0;
> +		goto out_unlock;
> +	}

Looking at the final tree this looks like dead (and bogus) code:

	if (ret <= 0)
	        return ret;

        trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);

        if (len == 0) {
                ret = 0;
                goto out_unlock;
        }


> +
>  	dfsbno = XFS_B_TO_FSBT(mp, pos_out);
>  	sfsbno = XFS_B_TO_FSBT(mp, pos_in);
>  	fsblen = XFS_B_TO_FSB(mp, len);
>  	ret = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
> +			&remappedfsb, pos_out + len);
> +	remapped_bytes = min_t(int64_t, len, XFS_FSB_TO_B(mp, remappedfsb));
>  	if (ret)
>  		goto out_unlock;

Shouldn't we just follow the calling convention of the method here:

 negative return value:	error
 positive:		number of bytes handled
 
 Something like:

	done = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno,
			fsblen, pos_out + len);
	if (done < 0) {
		xfs_reflink_remap_unlock(file_in, file_out);
		trace_xfs_reflink_remap_range_error(dest, done, _RET_IP_);
		return done;
	}

>  
> @@ -1451,7 +1461,7 @@ xfs_reflink_remap_range(
>  	xfs_reflink_remap_unlock(file_in, file_out);
>  	if (ret)
>  		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);

And then we can drop this conditional here.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/25] vfs: pass remap flags to generic_remap_file_range_prep
  2018-10-13  0:06 ` [PATCH 11/25] vfs: pass remap flags to generic_remap_file_range_prep Darrick J. Wong
  2018-10-14 17:22   ` Christoph Hellwig
@ 2018-10-14 17:37   ` Christoph Hellwig
  2018-10-15 15:42     ` Darrick J. Wong
  1 sibling, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-14 17:37 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, linux-nfs, linux-cifs, Amir Goldstein,
	linux-unionfs, linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	ocfs2-devel

> +	bool is_dedupe = (remap_flags & RFR_SAME_DATA);

Btw, I think the code would be cleaner if we dropped this variable.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 22/25] ocfs2: support partial clone range and dedupe range
  2018-10-13  0:08 ` [PATCH 22/25] ocfs2: support partial clone range and dedupe range Darrick J. Wong
@ 2018-10-14 17:41   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-14 17:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, linux-nfs, linux-cifs, linux-unionfs, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

> @@ -2531,14 +2531,11 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
>  				     struct file *file_out, loff_t pos_out,
>  				     loff_t len, unsigned int remap_flags)
>  {
> -	int ret;
> -
>  	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
>  		return -EINVAL;
>  
> -	ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> -					len, remap_flags);
> -	return ret < 0 ? ret : len;
> +	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> +			len, remap_flags);

Seems like ocfs2_remap_file_range and ocfs2_reflink_remap_range should
be merged now.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 18/25] vfs: hide file range comparison function
  2018-10-13  0:07 ` [PATCH 18/25] vfs: hide file range comparison function Darrick J. Wong
@ 2018-10-14 17:43   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-14 17:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, linux-nfs, linux-cifs, Amir Goldstein,
	linux-unionfs, linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	ocfs2-devel

> +static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
> +{
> +	struct address_space *mapping;
> +	struct page *page;
> +	pgoff_t n;
> +
> +	n = offset >> PAGE_SHIFT;
> +	mapping = inode->i_mapping;
> +	page = read_mapping_page(mapping, n, NULL);
> +	if (IS_ERR(page))
> +		return page;
> +	if (!PageUptodate(page)) {
> +		put_page(page);
> +		return ERR_PTR(-EIO);
> +	}
> +	lock_page(page);
> +	return page;
> +}

Might be worth to clean ths up a bit while you are at it:

+static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
{
	struct page *page;

	page = read_mapping_page(inode->i_mapping, offset >> PAGE_SHIFT, NULL);
	...

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 24/25] xfs: support returning partial reflink results
  2018-10-14 17:35   ` Christoph Hellwig
@ 2018-10-14 23:05     ` Dave Chinner
  2018-10-15 15:49       ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2018-10-14 23:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, sandeen, linux-nfs, linux-cifs, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

On Sun, Oct 14, 2018 at 10:35:46AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 12, 2018 at 05:08:32PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Back when the XFS reflink code only supported clone_file_range, we were
> > only able to return zero or negative error codes to userspace.  However,
> > now that copy_file_range (which returns bytes copied) can use XFS'
> > clone_file_range, we have the opportunity to return partial results.
> > For example, if userspace sends a 1GB clone request and we run out of
> > space halfway through, we at least can tell userspace that we completed
> > 512M of that request like a regular write.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_file.c    |    5 +----
> >  fs/xfs/xfs_reflink.c |   20 +++++++++++++++-----
> >  fs/xfs/xfs_reflink.h |    2 +-
> >  3 files changed, 17 insertions(+), 10 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index bc9e94bcb7a3..b2b15b8dc4a1 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -928,14 +928,11 @@ xfs_file_remap_range(
> >  	loff_t		len,
> >  	unsigned int	remap_flags)
> >  {
> > -	int		ret;
> > -
> >  	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
> >  		return -EINVAL;
> >  
> > -	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> > +	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> >  			len, remap_flags);
> 
> Is there any reason not to merge xfs_file_remap_range and
> xfs_reflink_remap_range at this point?

Yeah, that seems like a good idea to me - pulling all the
vfs/generic code interactions back up into xfs_file.c would match
how the rest of the file operations are layered w.r.t. external and
internal XFS code...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range
  2018-10-14 17:19   ` Christoph Hellwig
@ 2018-10-15  6:04     ` Amir Goldstein
  2018-10-15 12:47       ` Christoph Hellwig
  2018-10-15 13:18     ` Matthew Wilcox
  2018-10-15 16:42     ` Darrick J. Wong
  2 siblings, 1 reply; 53+ messages in thread
From: Amir Goldstein @ 2018-10-15  6:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Dave Chinner, Eric Sandeen,
	Linux NFS Mailing List, linux-cifs, overlayfs, linux-xfs,
	Linux MM, Linux Btrfs, linux-fsdevel, ocfs2-devel

> > +/*
> > + * These flags control the behavior of the remap_file_range function pointer.
> > + *
> > + * RFR_SAME_DATA: only remap if contents identical (i.e. deduplicate)
> > + */
> > +#define RFR_SAME_DATA                (1 << 0)
> > +
> > +#define RFR_VALID_FLAGS              (RFR_SAME_DATA)
>
> RFR?  Why not REMAP_FILE_*  Also why not the well understood
> REMAP_FILE_DEDUP instead of the odd SAME_DATA?
>
> > +
> > +/*
> > + * Filesystem remapping implementations should call this helper on their
> > + * remap flags to filter out flags that the implementation doesn't support.
> > + *
> > + * Returns true if the flags are ok, false otherwise.
> > + */
> > +static inline bool remap_check_flags(unsigned int remap_flags,
> > +                                  unsigned int supported_flags)
> > +{
> > +     return (remap_flags & ~(supported_flags & RFR_VALID_FLAGS)) == 0;
> > +}
>
> Any reason to even bother with a helper for this?  ->fallocate
> seems to be doing fine without the helper, and the resulting code
> seems a lot easier to understand to me.

I supposed you figured out the reason already.
It makes it appearance in patch 16/25 as RFR_VFS_FLAGS.
All those "advisory" flags, we want to pass them in to filesystem as FYI,
but we don't want to explicitly add support for e.g. RFR_CAN_SHORTEN
to every filesystem, when vfs has already taken care of the advice.

The reason a similar helper doesn't make sense for ->fallocate()
is because vfs does not take any action on behalf of filesystem
nor does vfs pass any internal flags to filesystem.

I argued that fiemap_check_flags() should similarly mask out
FIEMAP_FLAG_SYNC before checking supported fs_flags,
because ioctl_fiemap() respects this flag regardless if filesystem
(or generic helper) declares support for FIEMAP_FLAG_SYNC.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range
  2018-10-15  6:04     ` Amir Goldstein
@ 2018-10-15 12:47       ` Christoph Hellwig
  2018-10-15 12:54         ` Amir Goldstein
  2018-10-15 17:13         ` Darrick J. Wong
  0 siblings, 2 replies; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-15 12:47 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, Eric Sandeen,
	Linux NFS Mailing List, linux-cifs, overlayfs, linux-xfs,
	Linux MM, Linux Btrfs, linux-fsdevel, ocfs2-devel

On Mon, Oct 15, 2018 at 09:04:13AM +0300, Amir Goldstein wrote:
> I supposed you figured out the reason already.

No, I hadn't.

> It makes it appearance in patch 16/25 as RFR_VFS_FLAGS.
> All those "advisory" flags, we want to pass them in to filesystem as FYI,
> but we don't want to explicitly add support for e.g. RFR_CAN_SHORTEN
> to every filesystem, when vfs has already taken care of the advice.

I don't think this model makes sense.  If they really are purely
handled in the VFS we can mask them before passing them to the file
system, if not we need to check them, or the they are avisory and
we can have a simple #define instead of the helper.

RFR_TO_SRC_EOF is checked in generic_remap_file_range_prep,
so the file system should know about it  Also looking at it again now
it seems entirely superflous - we can just pass down then len == we
use in higher level code instead of having a flag and will side step
the issue here.

RFR_CAN_SHORTEN is advisory as no one has to shorten, but that can
easily be solved by including it everywhere.

RFR_SHORT_DEDUPE is as far as I can tell entirely superflous to
start with, as RFR_CAN_SHORTEN can be used instead.

So something like this in fs.h:

#define REMAP_FILE_ADVISORY_FLAGS	REMAP_FILE_CAN_SHORTEN

And then in the file system:

	if (flags & ~REMAP_FILE_ADVISORY_FLAGS)
		-EINVAL;

or

	if (flags & ~(REMAP_FILE_ADVISORY_FLAGS | REMAP_FILE_DEDUP))
		-EINVAL;

should be all that is needed.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range
  2018-10-15 12:47       ` Christoph Hellwig
@ 2018-10-15 12:54         ` Amir Goldstein
  2018-10-15 17:13         ` Darrick J. Wong
  1 sibling, 0 replies; 53+ messages in thread
From: Amir Goldstein @ 2018-10-15 12:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Dave Chinner, Eric Sandeen,
	Linux NFS Mailing List, linux-cifs, overlayfs, linux-xfs,
	Linux MM, Linux Btrfs, linux-fsdevel, ocfs2-devel

On Mon, Oct 15, 2018 at 3:47 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Oct 15, 2018 at 09:04:13AM +0300, Amir Goldstein wrote:
> > I supposed you figured out the reason already.
>
> No, I hadn't.
>
> > It makes it appearance in patch 16/25 as RFR_VFS_FLAGS.
> > All those "advisory" flags, we want to pass them in to filesystem as FYI,
> > but we don't want to explicitly add support for e.g. RFR_CAN_SHORTEN
> > to every filesystem, when vfs has already taken care of the advice.
>
> I don't think this model makes sense.  If they really are purely
> handled in the VFS we can mask them before passing them to the file
> system, if not we need to check them, or the they are avisory and
> we can have a simple #define instead of the helper.
>
> RFR_TO_SRC_EOF is checked in generic_remap_file_range_prep,
> so the file system should know about it  Also looking at it again now
> it seems entirely superflous - we can just pass down then len == we
> use in higher level code instead of having a flag and will side step
> the issue here.
>
> RFR_CAN_SHORTEN is advisory as no one has to shorten, but that can
> easily be solved by including it everywhere.
>
> RFR_SHORT_DEDUPE is as far as I can tell entirely superflous to
> start with, as RFR_CAN_SHORTEN can be used instead.
>
> So something like this in fs.h:
>
> #define REMAP_FILE_ADVISORY_FLAGS       REMAP_FILE_CAN_SHORTEN
>
> And then in the file system:
>
>         if (flags & ~REMAP_FILE_ADVISORY_FLAGS)
>                 -EINVAL;
>
> or
>
>         if (flags & ~(REMAP_FILE_ADVISORY_FLAGS | REMAP_FILE_DEDUP))
>                 -EINVAL;
>
> should be all that is needed.

Yeh, I suppose that makes sense.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range
  2018-10-14 17:19   ` Christoph Hellwig
  2018-10-15  6:04     ` Amir Goldstein
@ 2018-10-15 13:18     ` Matthew Wilcox
  2018-10-15 16:42     ` Darrick J. Wong
  2 siblings, 0 replies; 53+ messages in thread
From: Matthew Wilcox @ 2018-10-15 13:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, david, sandeen, linux-nfs, linux-cifs,
	Amir Goldstein, linux-unionfs, linux-xfs, linux-mm, linux-btrfs,
	linux-fsdevel, ocfs2-devel

On Sun, Oct 14, 2018 at 10:19:27AM -0700, Christoph Hellwig wrote:
> >  	unsigned (*mmap_capabilities)(struct file *);
> >  #endif
> >  	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
> > -	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
> > -	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
> > +	int (*remap_file_range)(struct file *file_in, loff_t pos_in,
> > +				struct file *file_out, loff_t pos_out,
> > +				u64 len, unsigned int remap_flags);
> 
> None of the other methods in this file name their parameters.  While
> I generally don't like people leaving them out, in the end consistency
> is even more important.

I would agree with you *except* that the parameters do not follow memcpy()
traditional order (dst, src, len).  Instead they are (src, dst, len), so we
should probably name them to advise the poor sod who has to implement this
that we've chosen an inconsistent API.

Or we could fix it.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 16/25] vfs: make remapping to source file eof more explicit
  2018-10-14 17:24   ` Christoph Hellwig
@ 2018-10-15 15:32     ` Darrick J. Wong
  2018-10-15 18:28       ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-15 15:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: david, sandeen, linux-nfs, linux-cifs, Amir Goldstein,
	linux-unionfs, linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	ocfs2-devel

On Sun, Oct 14, 2018 at 10:24:33AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 12, 2018 at 05:07:37PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Create a RFR_TO_SRC_EOF flag to explicitly declare that the caller wants
> > the remap implementation to remap to the end of the source file, once
> > the files are locked.
> 
> The name looks like a cat threw up on your keyboard :)

Yeah... :(

> From reading the code this seems to ask for a whole file remap, right?

Nope.  In the original btrfs clonerange ioctl, length == 0 meant "to EOF".
If you made a call like this:

struct btrfs_ioctl_clone_range_args x = {
	.src_offset = 16384,
	.src_length = 0,
	.dest_offset = 0,
	.src_fd = whatever,
};
ftruncate(dest_fd, 0);
ioctl(dest_fd, BTRFS_IOC_CLONE, &x);

Then dest_fd ends up with the contents of [16k...EOF] from src_fd.
It's annoying to have the magic length number (and no flags?) but we're
stuck with this quirk of the interface.

> Why not put that in the name to make it more descriptive?

I'm all ears for better suggestions. :)

--D

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/25] vfs: pass remap flags to generic_remap_file_range_prep
  2018-10-14 17:37   ` Christoph Hellwig
@ 2018-10-15 15:42     ` Darrick J. Wong
  0 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-15 15:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: david, sandeen, linux-nfs, linux-cifs, Amir Goldstein,
	linux-unionfs, linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	ocfs2-devel

On Sun, Oct 14, 2018 at 10:37:38AM -0700, Christoph Hellwig wrote:
> > +	bool is_dedupe = (remap_flags & RFR_SAME_DATA);
> 
> Btw, I think the code would be cleaner if we dropped this variable.

Ok to both.  I'll move up the patch to replace is_dedupe with
remap_flags to avoid churning the _touch function too, btw.

--D

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 24/25] xfs: support returning partial reflink results
  2018-10-14 23:05     ` Dave Chinner
@ 2018-10-15 15:49       ` Darrick J. Wong
  0 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-15 15:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, sandeen, linux-nfs, linux-cifs, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

On Mon, Oct 15, 2018 at 10:05:36AM +1100, Dave Chinner wrote:
> On Sun, Oct 14, 2018 at 10:35:46AM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 12, 2018 at 05:08:32PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Back when the XFS reflink code only supported clone_file_range, we were
> > > only able to return zero or negative error codes to userspace.  However,
> > > now that copy_file_range (which returns bytes copied) can use XFS'
> > > clone_file_range, we have the opportunity to return partial results.
> > > For example, if userspace sends a 1GB clone request and we run out of
> > > space halfway through, we at least can tell userspace that we completed
> > > 512M of that request like a regular write.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/xfs_file.c    |    5 +----
> > >  fs/xfs/xfs_reflink.c |   20 +++++++++++++++-----
> > >  fs/xfs/xfs_reflink.h |    2 +-
> > >  3 files changed, 17 insertions(+), 10 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index bc9e94bcb7a3..b2b15b8dc4a1 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -928,14 +928,11 @@ xfs_file_remap_range(
> > >  	loff_t		len,
> > >  	unsigned int	remap_flags)
> > >  {
> > > -	int		ret;
> > > -
> > >  	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
> > >  		return -EINVAL;
> > >  
> > > -	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> > > +	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
> > >  			len, remap_flags);
> > 
> > Is there any reason not to merge xfs_file_remap_range and
> > xfs_reflink_remap_range at this point?
> 
> Yeah, that seems like a good idea to me - pulling all the
> vfs/generic code interactions back up into xfs_file.c would match
> how the rest of the file operations are layered w.r.t. external and
> internal XFS code...

Yeah, ditto ocfs2.  I'll look into throwing a few more refactors onto
the end of this series... 8-D

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/25] vfs: create generic_remap_file_range_touch to update inode metadata
  2018-10-14 17:21   ` Christoph Hellwig
@ 2018-10-15 16:30     ` Darrick J. Wong
  2018-10-15 18:19       ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-15 16:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: david, sandeen, linux-nfs, linux-cifs, Amir Goldstein,
	linux-unionfs, linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	ocfs2-devel

On Sun, Oct 14, 2018 at 10:21:31AM -0700, Christoph Hellwig wrote:
> > +/* Update inode timestamps and remove security privileges when remapping. */
> > +int generic_remap_file_range_touch(struct file *file, bool is_dedupe)
> > +{
> > +	int ret;
> > +
> > +	/* If can't alter the file contents, we're done. */
> > +	if (is_dedupe)
> > +		return 0;
> > +
> > +	/* Update the timestamps, since we can alter file contents. */
> > +	if (!(file->f_mode & FMODE_NOCMTIME)) {
> > +		ret = file_update_time(file);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	/*
> > +	 * Clear the security bits if the process is not being run by root.
> > +	 * This keeps people from modifying setuid and setgid binaries.
> > +	 */
> > +	return file_remove_privs(file);
> > +}
> > +EXPORT_SYMBOL(generic_remap_file_range_touch);
> 
> The name seems a little out of touch with what it actually does.

I originally thought "touch" because it updates [cm]time. :)

Though looking at the final code, I think this can just be called from
the end of generic_remap_file_range_prep, so we can skip the export and
all that other stuff.

> Also why a bool argument instead of the more descriptive flags which
> introduced a few patches ago?

Hmm, yes, the remap_flags transition can move up to before this patch.

--D

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range
  2018-10-14 17:19   ` Christoph Hellwig
  2018-10-15  6:04     ` Amir Goldstein
  2018-10-15 13:18     ` Matthew Wilcox
@ 2018-10-15 16:42     ` Darrick J. Wong
  2 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-15 16:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: david, sandeen, linux-nfs, linux-cifs, Amir Goldstein,
	linux-unionfs, linux-xfs, linux-mm, linux-btrfs, linux-fsdevel,
	ocfs2-devel

On Sun, Oct 14, 2018 at 10:19:27AM -0700, Christoph Hellwig wrote:
> >  	unsigned (*mmap_capabilities)(struct file *);
> >  #endif
> >  	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
> > -	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
> > -	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
> > +	int (*remap_file_range)(struct file *file_in, loff_t pos_in,
> > +				struct file *file_out, loff_t pos_out,
> > +				u64 len, unsigned int remap_flags);
> 
> None of the other methods in this file name their parameters.  While
> I generally don't like people leaving them out, in the end consistency
> is even more important.
> 
> > +int btrfs_remap_file_range(struct file *src_file, loff_t off,
> > +		struct file *dst_file, loff_t destoff, u64 len,
> > +		unsigned int remap_flags)
> >  {
> > +	if (!remap_check_flags(remap_flags, RFR_SAME_DATA))
> > +		return -EINVAL;
> > +
> > +	if (remap_flags & RFR_SAME_DATA) {
> 
> So at least for btrfs there seems to be no shared code at all below
> the function calls.  This kinda speaks against the argument that
> they fundamentally are the same..

They /do/ share/ code -- eventually both btrfs_extent_same and
btrfs_clone_files call btrfs_clone.  xfs and ocfs2 call the same paths
internally too; it's only the vfs helpers that have the extra page cache
comparisons if it's a dedup operation.

> > +/*
> > + * These flags control the behavior of the remap_file_range function pointer.
> > + *
> > + * RFR_SAME_DATA: only remap if contents identical (i.e. deduplicate)
> > + */
> > +#define RFR_SAME_DATA		(1 << 0)
> > +
> > +#define RFR_VALID_FLAGS		(RFR_SAME_DATA)
> 
> RFR?  Why not REMAP_FILE_*  Also why not the well understood
> REMAP_FILE_DEDUP instead of the odd SAME_DATA?

Sure.  I had begin to dislike typing RFR anyway.

> > +
> > +/*
> > + * Filesystem remapping implementations should call this helper on their
> > + * remap flags to filter out flags that the implementation doesn't support.
> > + *
> > + * Returns true if the flags are ok, false otherwise.
> > + */
> > +static inline bool remap_check_flags(unsigned int remap_flags,
> > +				     unsigned int supported_flags)
> > +{
> > +	return (remap_flags & ~(supported_flags & RFR_VALID_FLAGS)) == 0;
> > +}
> 
> Any reason to even bother with a helper for this?  ->fallocate
> seems to be doing fine without the helper, and the resulting code
> seems a lot easier to understand to me.

(Will respond to these at the current end of the flags thread.)

> > @@ -1759,10 +1779,9 @@ struct file_operations {
> >  #endif
> >  	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
> >  			loff_t, size_t, unsigned int);
> > -	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
> > -			u64);
> > -	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t,
> > -			u64);
> > +	int (*remap_file_range)(struct file *file_in, loff_t pos_in,
> > +				struct file *file_out, loff_t pos_out,
> > +				u64 len, unsigned int remap_flags);
> 
> Same comment here.  Didn't we have some nice doc tools to avoid this
> duplication? :)

We do, but vfs.txt hasn't been ported to any of that.

--D

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range
  2018-10-15 12:47       ` Christoph Hellwig
  2018-10-15 12:54         ` Amir Goldstein
@ 2018-10-15 17:13         ` Darrick J. Wong
  2018-10-15 18:32           ` Christoph Hellwig
  1 sibling, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-15 17:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Amir Goldstein, Dave Chinner, Eric Sandeen,
	Linux NFS Mailing List, linux-cifs, overlayfs, linux-xfs,
	Linux MM, Linux Btrfs, linux-fsdevel, ocfs2-devel

On Mon, Oct 15, 2018 at 05:47:19AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 15, 2018 at 09:04:13AM +0300, Amir Goldstein wrote:
> > I supposed you figured out the reason already.
> 
> No, I hadn't.
> 
> > It makes it appearance in patch 16/25 as RFR_VFS_FLAGS.
> > All those "advisory" flags, we want to pass them in to filesystem as FYI,
> > but we don't want to explicitly add support for e.g. RFR_CAN_SHORTEN
> > to every filesystem, when vfs has already taken care of the advice.
> 
> I don't think this model makes sense.  If they really are purely
> handled in the VFS we can mask them before passing them to the file
> system, if not we need to check them, or the they are avisory and
> we can have a simple #define instead of the helper.
> 
> RFR_TO_SRC_EOF is checked in generic_remap_file_range_prep,
> so the file system should know about it  Also looking at it again now
> it seems entirely superflous - we can just pass down then len == we
> use in higher level code instead of having a flag and will side step
> the issue here.

I'm not a fan of hidden behaviors like that, particularly when we
already have a flags field where callers can explicitly ask for the
to-eof behavior.

> RFR_CAN_SHORTEN is advisory as no one has to shorten, but that can
> easily be solved by including it everywhere.

CAN_SHORTEN isn't included everywhere -- FICLONE{,RANGE} don't enable it
because they have no way to communicate the number of bytes cloned back
to userspace.  Either we can clone every byte the user asked for, or we
send back -EINVAL.  (Maybe I'm misinterpreting what you meant by 'solved
by including it everywhere'?)

> RFR_SHORT_DEDUPE is as far as I can tell entirely superflous to
> start with, as RFR_CAN_SHORTEN can be used instead.

For now it's superfluous.  At first I was thinking that we could return
a short bytes_deduped if, say, the first part of the range actually did
match, but it became pretty obvious via shared/010 that duperemove can't
handle that, so we really must stick to the existing btrfs behavior.

The existing btrfs behavior is that we can round the length down to
avoid deduping partial EOF blocks, but we return the original length
(i.e. lie) in bytes_deduped when we do that.

I sort of thought about introducing a new copy_file_range flag that
would just do deduplication and allow for opportunistic "dedup as much
as you can" but ... meh.  Maybe I'll just drop the patch instead; we can
revisit that when anyone wants a better dedupe interface.

> So something like this in fs.h:
> 
> #define REMAP_FILE_ADVISORY_FLAGS	REMAP_FILE_CAN_SHORTEN
> 
> And then in the file system:
> 
> 	if (flags & ~REMAP_FILE_ADVISORY_FLAGS)
> 		-EINVAL;
> 
> or
> 
> 	if (flags & ~(REMAP_FILE_ADVISORY_FLAGS | REMAP_FILE_DEDUP))
> 		-EINVAL;
> 
> should be all that is needed.

Sounds good to me.

--D

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/25] vfs: create generic_remap_file_range_touch to update inode metadata
  2018-10-15 16:30     ` Darrick J. Wong
@ 2018-10-15 18:19       ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-15 18:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, david, sandeen, linux-nfs, linux-cifs,
	Amir Goldstein, linux-unionfs, linux-xfs, linux-mm, linux-btrfs,
	linux-fsdevel, ocfs2-devel

On Mon, Oct 15, 2018 at 09:30:01AM -0700, Darrick J. Wong wrote:
> I originally thought "touch" because it updates [cm]time. :)
> 
> Though looking at the final code, I think this can just be called from
> the end of generic_remap_file_range_prep, so we can skip the export and
> all that other stuff.

I though about that, but the locking didn't seem to quite work out
between xfs and ocfs.

Nevermind the big elephant of actually converting btrfs to the "VFS"
helper - I think if that doesn't work out it is rather questionable
how generic they actually are.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 16/25] vfs: make remapping to source file eof more explicit
  2018-10-15 15:32     ` Darrick J. Wong
@ 2018-10-15 18:28       ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-15 18:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, david, sandeen, linux-nfs, linux-cifs,
	Amir Goldstein, linux-unionfs, linux-xfs, linux-mm, linux-btrfs,
	linux-fsdevel, ocfs2-devel

On Mon, Oct 15, 2018 at 08:32:19AM -0700, Darrick J. Wong wrote:
> > Why not put that in the name to make it more descriptive?
> 
> I'm all ears for better suggestions. :)

I think the best idea is no flag - just carry through the special
zero..

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range
  2018-10-15 17:13         ` Darrick J. Wong
@ 2018-10-15 18:32           ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2018-10-15 18:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Amir Goldstein, Dave Chinner, Eric Sandeen,
	Linux NFS Mailing List, linux-cifs, overlayfs, linux-xfs,
	Linux MM, Linux Btrfs, linux-fsdevel, ocfs2-devel

On Mon, Oct 15, 2018 at 10:13:17AM -0700, Darrick J. Wong wrote:
> > RFR_TO_SRC_EOF is checked in generic_remap_file_range_prep,
> > so the file system should know about it  Also looking at it again now
> > it seems entirely superflous - we can just pass down then len == we
> > use in higher level code instead of having a flag and will side step
> > the issue here.
> 
> I'm not a fan of hidden behaviors like that, particularly when we
> already have a flags field where callers can explicitly ask for the
> to-eof behavior.

This just means we have a flag to mean ken is 0 and needs to be filled,
rather than encoding that in the field itself.  If you fell better we
can replace 0 with 0xffffffff and still encode it in the field.

> > RFR_CAN_SHORTEN is advisory as no one has to shorten, but that can
> > easily be solved by including it everywhere.
> 
> CAN_SHORTEN isn't included everywhere

Include it everywhere as in allow it in ever ->remap_file instance.

> I sort of thought about introducing a new copy_file_range flag that
> would just do deduplication and allow for opportunistic "dedup as much
> as you can" but ... meh.  Maybe I'll just drop the patch instead; we can
> revisit that when anyone wants a better dedupe interface.

Sounds fine to me.  The btrfs ioctl is really ugly, but then again
there is no pressing need for something better.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 17/25] vfs: enable remap callers that can handle short operations
  2018-10-11  5:15   ` Amir Goldstein
@ 2018-10-11 16:04     ` Darrick J. Wong
  0 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-11 16:04 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Eric Sandeen, Linux NFS Mailing List, linux-cifs,
	overlayfs, linux-xfs, Linux MM, Linux Btrfs, linux-fsdevel,
	ocfs2-devel

On Thu, Oct 11, 2018 at 08:15:42AM +0300, Amir Goldstein wrote:
> On Thu, Oct 11, 2018 at 7:14 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Plumb in a remap flag that enables the filesystem remap handler to
> > shorten remapping requests for callers that can handle it.  Now
> > copy_file_range can report partial success (in case we run up against
> > alignment problems, resource limits, etc.).
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >  fs/read_write.c    |   15 +++++++++------
> >  include/linux/fs.h |    7 +++++--
> >  mm/filemap.c       |   16 ++++++++++++----
> >  3 files changed, 26 insertions(+), 12 deletions(-)
> >
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 6ec908f9a69b..3713893b7e38 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1593,7 +1593,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> >
> >                 cloned = file_in->f_op->remap_file_range(file_in, pos_in,
> >                                 file_out, pos_out,
> > -                               min_t(loff_t, MAX_RW_COUNT, len), 0);
> > +                               min_t(loff_t, MAX_RW_COUNT, len),
> > +                               RFR_CAN_SHORTEN);
> >                 if (cloned > 0) {
> >                         ret = cloned;
> >                         goto done;
> > @@ -1804,16 +1805,18 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> >                  * If the user is attempting to remap a partial EOF block and
> >                  * it's inside the destination EOF then reject it.
> >                  *
> > -                * We don't support shortening requests, so we can only reject
> > -                * them.
> > +                * If possible, shorten the request instead of rejecting it.
> >                  */
> >                 if (is_dedupe)
> >                         ret = -EBADE;
> >                 else if (pos_out + *len < i_size_read(inode_out))
> >                         ret = -EINVAL;
> >
> > -               if (ret)
> > -                       return ret;
> > +               if (ret) {
> > +                       if (!(remap_flags & RFR_CAN_SHORTEN))
> > +                               return ret;
> > +                       *len &= ~blkmask;
> > +               }
> >         }
> >
> >         return 1;
> > @@ -2112,7 +2115,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
> >
> >                 deduped = vfs_dedupe_file_range_one(file, off, dst_file,
> >                                                     info->dest_offset, len,
> > -                                                   0);
> > +                                                   RFR_CAN_SHORTEN);
> 
> You did not update WARN_ON_ONCE in vfs_dedupe_file_range_one()
> to allow this flag and did not mention dedupe in commit message.
> Was that change intentional in this patch?
> 
> After RFR_SHORT_DEDUPE patch the end result in fine.

Heh, oops, sorry about that.  I'll reissue the patch with the two
corrections.

--D

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 17/25] vfs: enable remap callers that can handle short operations
  2018-10-11  4:14 ` [PATCH 17/25] vfs: enable remap callers that can handle short operations Darrick J. Wong
@ 2018-10-11  5:15   ` Amir Goldstein
  2018-10-11 16:04     ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Amir Goldstein @ 2018-10-11  5:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Eric Sandeen, Linux NFS Mailing List, linux-cifs,
	overlayfs, linux-xfs, Linux MM, Linux Btrfs, linux-fsdevel,
	ocfs2-devel

On Thu, Oct 11, 2018 at 7:14 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> Plumb in a remap flag that enables the filesystem remap handler to
> shorten remapping requests for callers that can handle it.  Now
> copy_file_range can report partial success (in case we run up against
> alignment problems, resource limits, etc.).
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/read_write.c    |   15 +++++++++------
>  include/linux/fs.h |    7 +++++--
>  mm/filemap.c       |   16 ++++++++++++----
>  3 files changed, 26 insertions(+), 12 deletions(-)
>
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 6ec908f9a69b..3713893b7e38 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1593,7 +1593,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>
>                 cloned = file_in->f_op->remap_file_range(file_in, pos_in,
>                                 file_out, pos_out,
> -                               min_t(loff_t, MAX_RW_COUNT, len), 0);
> +                               min_t(loff_t, MAX_RW_COUNT, len),
> +                               RFR_CAN_SHORTEN);
>                 if (cloned > 0) {
>                         ret = cloned;
>                         goto done;
> @@ -1804,16 +1805,18 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>                  * If the user is attempting to remap a partial EOF block and
>                  * it's inside the destination EOF then reject it.
>                  *
> -                * We don't support shortening requests, so we can only reject
> -                * them.
> +                * If possible, shorten the request instead of rejecting it.
>                  */
>                 if (is_dedupe)
>                         ret = -EBADE;
>                 else if (pos_out + *len < i_size_read(inode_out))
>                         ret = -EINVAL;
>
> -               if (ret)
> -                       return ret;
> +               if (ret) {
> +                       if (!(remap_flags & RFR_CAN_SHORTEN))
> +                               return ret;
> +                       *len &= ~blkmask;
> +               }
>         }
>
>         return 1;
> @@ -2112,7 +2115,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>
>                 deduped = vfs_dedupe_file_range_one(file, off, dst_file,
>                                                     info->dest_offset, len,
> -                                                   0);
> +                                                   RFR_CAN_SHORTEN);

You did not update WARN_ON_ONCE in vfs_dedupe_file_range_one()
to allow this flag and did not mention dedupe in commit message.
Was that change intentional in this patch?

After RFR_SHORT_DEDUPE patch the end result in fine.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 17/25] vfs: enable remap callers that can handle short operations
  2018-10-11  4:12 [PATCH v3 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
@ 2018-10-11  4:14 ` Darrick J. Wong
  2018-10-11  5:15   ` Amir Goldstein
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-10-11  4:14 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, linux-nfs, linux-cifs, Amir Goldstein, linux-unionfs,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, ocfs2-devel

From: Darrick J. Wong <darrick.wong@oracle.com>

Plumb in a remap flag that enables the filesystem remap handler to
shorten remapping requests for callers that can handle it.  Now
copy_file_range can report partial success (in case we run up against
alignment problems, resource limits, etc.).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/read_write.c    |   15 +++++++++------
 include/linux/fs.h |    7 +++++--
 mm/filemap.c       |   16 ++++++++++++----
 3 files changed, 26 insertions(+), 12 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 6ec908f9a69b..3713893b7e38 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1593,7 +1593,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 
 		cloned = file_in->f_op->remap_file_range(file_in, pos_in,
 				file_out, pos_out,
-				min_t(loff_t, MAX_RW_COUNT, len), 0);
+				min_t(loff_t, MAX_RW_COUNT, len),
+				RFR_CAN_SHORTEN);
 		if (cloned > 0) {
 			ret = cloned;
 			goto done;
@@ -1804,16 +1805,18 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 		 * If the user is attempting to remap a partial EOF block and
 		 * it's inside the destination EOF then reject it.
 		 *
-		 * We don't support shortening requests, so we can only reject
-		 * them.
+		 * If possible, shorten the request instead of rejecting it.
 		 */
 		if (is_dedupe)
 			ret = -EBADE;
 		else if (pos_out + *len < i_size_read(inode_out))
 			ret = -EINVAL;
 
-		if (ret)
-			return ret;
+		if (ret) {
+			if (!(remap_flags & RFR_CAN_SHORTEN))
+				return ret;
+			*len &= ~blkmask;
+		}
 	}
 
 	return 1;
@@ -2112,7 +2115,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 
 		deduped = vfs_dedupe_file_range_one(file, off, dst_file,
 						    info->dest_offset, len,
-						    0);
+						    RFR_CAN_SHORTEN);
 		if (deduped == -EBADE)
 			info->status = FILE_DEDUPE_RANGE_DIFFERS;
 		else if (deduped < 0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b9c314f9d5a4..57cb56bbc30a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1726,14 +1726,17 @@ struct block_device_operations;
  *
  * RFR_SAME_DATA: only remap if contents identical (i.e. deduplicate)
  * RFR_TO_SRC_EOF: remap to the end of the source file
+ * RFR_CAN_SHORTEN: caller can handle a shortened request
  */
 #define RFR_SAME_DATA		(1 << 0)
 #define RFR_TO_SRC_EOF		(1 << 1)
+#define RFR_CAN_SHORTEN		(1 << 2)
 
-#define RFR_VALID_FLAGS		(RFR_SAME_DATA | RFR_TO_SRC_EOF)
+#define RFR_VALID_FLAGS		(RFR_SAME_DATA | RFR_TO_SRC_EOF | \
+				 RFR_CAN_SHORTEN)
 
 /* Implemented by the VFS, so these are advisory. */
-#define RFR_VFS_FLAGS		(RFR_TO_SRC_EOF)
+#define RFR_VFS_FLAGS		(RFR_TO_SRC_EOF | RFR_CAN_SHORTEN)
 
 /*
  * Filesystem remapping implementations should call this helper on their
diff --git a/mm/filemap.c b/mm/filemap.c
index 369cfd164e90..bccbd3621238 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3051,8 +3051,12 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	if (pos_in + count == size_in) {
 		bcount = ALIGN(size_in, bs) - pos_in;
 	} else {
-		if (!IS_ALIGNED(count, bs))
-			return -EINVAL;
+		if (!IS_ALIGNED(count, bs)) {
+			if (remap_flags & RFR_CAN_SHORTEN)
+				count = ALIGN_DOWN(count, bs);
+			else
+				return -EINVAL;
+		}
 
 		bcount = count;
 	}
@@ -3063,10 +3067,14 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	    pos_out < pos_in + bcount)
 		return -EINVAL;
 
-	/* For now we don't support changing the length. */
-	if (*req_count != count)
+	/*
+	 * We shortened the request but the caller can't deal with that, so
+	 * bounce the request back to userspace.
+	 */
+	if (*req_count != count && !(remap_flags & RFR_CAN_SHORTEN))
 		return -EINVAL;
 
+	*req_count = count;
 	return 0;
 }
 


^ permalink raw reply related	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2018-10-15 18:32 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-13  0:05 [PATCH v4 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
2018-10-13  0:05 ` [PATCH 01/25] xfs: add a per-xfs trace_printk macro Darrick J. Wong
2018-10-13  0:05 ` [PATCH 02/25] vfs: vfs_clone_file_prep_inodes should return EINVAL for a clone from beyond EOF Darrick J. Wong
2018-10-13  0:06 ` [PATCH 03/25] vfs: check file ranges before cloning files Darrick J. Wong
2018-10-13  0:06 ` [PATCH 04/25] vfs: strengthen checking of file range inputs to generic_remap_checks Darrick J. Wong
2018-10-13  0:06 ` [PATCH 05/25] vfs: avoid problematic remapping requests into partial EOF block Darrick J. Wong
2018-10-14 17:11   ` Christoph Hellwig
2018-10-13  0:06 ` [PATCH 06/25] vfs: skip zero-length dedupe requests Darrick J. Wong
2018-10-13  0:06 ` [PATCH 07/25] vfs: combine the clone and dedupe into a single remap_file_range Darrick J. Wong
2018-10-14 17:19   ` Christoph Hellwig
2018-10-15  6:04     ` Amir Goldstein
2018-10-15 12:47       ` Christoph Hellwig
2018-10-15 12:54         ` Amir Goldstein
2018-10-15 17:13         ` Darrick J. Wong
2018-10-15 18:32           ` Christoph Hellwig
2018-10-15 13:18     ` Matthew Wilcox
2018-10-15 16:42     ` Darrick J. Wong
2018-10-13  0:06 ` [PATCH 08/25] vfs: rename vfs_clone_file_prep to be more descriptive Darrick J. Wong
2018-10-13  0:06 ` [PATCH 09/25] vfs: rename clone_verify_area to remap_verify_area Darrick J. Wong
2018-10-13  0:06 ` [PATCH 10/25] vfs: create generic_remap_file_range_touch to update inode metadata Darrick J. Wong
2018-10-14 17:21   ` Christoph Hellwig
2018-10-15 16:30     ` Darrick J. Wong
2018-10-15 18:19       ` Christoph Hellwig
2018-10-13  0:06 ` [PATCH 11/25] vfs: pass remap flags to generic_remap_file_range_prep Darrick J. Wong
2018-10-14 17:22   ` Christoph Hellwig
2018-10-14 17:37   ` Christoph Hellwig
2018-10-15 15:42     ` Darrick J. Wong
2018-10-13  0:07 ` [PATCH 12/25] vfs: pass remap flags to generic_remap_checks Darrick J. Wong
2018-10-13  0:07 ` [PATCH 13/25] vfs: make remap_file_range functions take and return bytes completed Darrick J. Wong
2018-10-13  0:07 ` [PATCH 14/25] vfs: plumb RFR_* remap flags through the vfs clone functions Darrick J. Wong
2018-10-13  0:07 ` [PATCH 15/25] vfs: plumb RFR_* remap flags through the vfs dedupe functions Darrick J. Wong
2018-10-13  0:07 ` [PATCH 16/25] vfs: make remapping to source file eof more explicit Darrick J. Wong
2018-10-14 17:24   ` Christoph Hellwig
2018-10-15 15:32     ` Darrick J. Wong
2018-10-15 18:28       ` Christoph Hellwig
2018-10-13  0:07 ` [PATCH 17/25] vfs: enable remap callers that can handle short operations Darrick J. Wong
2018-10-13  0:07 ` [PATCH 18/25] vfs: hide file range comparison function Darrick J. Wong
2018-10-14 17:43   ` Christoph Hellwig
2018-10-13  0:07 ` [PATCH 19/25] vfs: implement opportunistic short dedupe Darrick J. Wong
2018-10-14 17:26   ` Christoph Hellwig
2018-10-13  0:08 ` [PATCH 20/25] ocfs2: truncate page cache for clone destination file before remapping Darrick J. Wong
2018-10-13  0:08 ` [PATCH 21/25] ocfs2: fix pagecache truncation prior to reflink Darrick J. Wong
2018-10-13  0:08 ` [PATCH 22/25] ocfs2: support partial clone range and dedupe range Darrick J. Wong
2018-10-14 17:41   ` Christoph Hellwig
2018-10-13  0:08 ` [PATCH 23/25] xfs: fix pagecache truncation prior to reflink Darrick J. Wong
2018-10-13  0:08 ` [PATCH 24/25] xfs: support returning partial reflink results Darrick J. Wong
2018-10-14 17:35   ` Christoph Hellwig
2018-10-14 23:05     ` Dave Chinner
2018-10-15 15:49       ` Darrick J. Wong
2018-10-13  0:08 ` [PATCH 25/25] xfs: remove redundant remap partial EOF block checks Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2018-10-11  4:12 [PATCH v3 00/25] fs: fixes for serious clone/dedupe problems Darrick J. Wong
2018-10-11  4:14 ` [PATCH 17/25] vfs: enable remap callers that can handle short operations Darrick J. Wong
2018-10-11  5:15   ` Amir Goldstein
2018-10-11 16:04     ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).