All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/15] fs: fixes for serious clone/dedupe problems
@ 2018-10-05  0:44 ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:44 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

Hi all,

Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
reflink implementation, and tracked it down to reflink forgetting to do
some of the file-extending activities that must happen for regular
writes.

We then started auditing the clone, dedupe, and copyfile code and
realized that from a file contents perspective, clonerange isn't any
different from a regular file write.  Unfortunately, we also noticed
that *unlike* a regular write, clonerange skips a ton of overflow
checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
and RLIMIT_FSIZE.  We also observed that cloning into a file did not
strip security privileges (suid, capabilities) like a regular write
would.  I also noticed that xfs and ocfs2 need to dump the page cache
before remapping blocks, not after.

In fixing the range checking problems I also realized that both dedupe
and copyfile tell userspace how much of the requested operation was
acted upon.  Since the range validation can shorten a clone request (or
we can ENOSPC midway through), we might as well plumb the short
operation reporting back through the VFS indirection code to userspace.

So, here's the whole giant pile of patches[1] that fix all the problems.
The patch "generic: test reflink side effects" recently sent to fstests
exercises the fixes in this series.  Tests are in [2].

--D

[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 00/15] fs: fixes for serious clone/dedupe problems
@ 2018-10-05  0:44 ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:44 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

Hi all,

Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
reflink implementation, and tracked it down to reflink forgetting to do
some of the file-extending activities that must happen for regular
writes.

We then started auditing the clone, dedupe, and copyfile code and
realized that from a file contents perspective, clonerange isn't any
different from a regular file write.  Unfortunately, we also noticed
that *unlike* a regular write, clonerange skips a ton of overflow
checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
and RLIMIT_FSIZE.  We also observed that cloning into a file did not
strip security privileges (suid, capabilities) like a regular write
would.  I also noticed that xfs and ocfs2 need to dump the page cache
before remapping blocks, not after.

In fixing the range checking problems I also realized that both dedupe
and copyfile tell userspace how much of the requested operation was
acted upon.  Since the range validation can shorten a clone request (or
we can ENOSPC midway through), we might as well plumb the short
operation reporting back through the VFS indirection code to userspace.

So, here's the whole giant pile of patches[1] that fix all the problems.
The patch "generic: test reflink side effects" recently sent to fstests
exercises the fixes in this series.  Tests are in [2].

--D

[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH 01/15] xfs: add a per-xfs trace_printk macro
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:44   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:44 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Add a "xfs_tprintk" macro so that developers can use trace_printk to
print out arbitrary debugging information with the XFS device name
attached to the trace output.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_error.h |    5 +++++
 1 file changed, 5 insertions(+)


diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 246d3e989c6c..c3d9546b138c 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -99,4 +99,9 @@ extern int xfs_errortag_clearall(struct xfs_mount *mp);
 #define		XFS_PTAG_SHUTDOWN_LOGERROR	0x00000040
 #define		XFS_PTAG_FSBLOCK_ZERO		0x00000080
 
+/* trace printk version of xfs_err and friends */
+#define xfs_tprintk(mp, fmt, args...) \
+	trace_printk("dev %d:%d " fmt, MAJOR((mp)->m_super->s_dev), \
+			MINOR((mp)->m_super->s_dev), ##args)
+
 #endif	/* __XFS_ERROR_H__ */


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 01/15] xfs: add a per-xfs trace_printk macro
@ 2018-10-05  0:44   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:44 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Add a "xfs_tprintk" macro so that developers can use trace_printk to
print out arbitrary debugging information with the XFS device name
attached to the trace output.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_error.h |    5 +++++
 1 file changed, 5 insertions(+)


diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 246d3e989c6c..c3d9546b138c 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -99,4 +99,9 @@ extern int xfs_errortag_clearall(struct xfs_mount *mp);
 #define		XFS_PTAG_SHUTDOWN_LOGERROR	0x00000040
 #define		XFS_PTAG_FSBLOCK_ZERO		0x00000080
 
+/* trace printk version of xfs_err and friends */
+#define xfs_tprintk(mp, fmt, args...) \
+	trace_printk("dev %d:%d " fmt, MAJOR((mp)->m_super->s_dev), \
+			MINOR((mp)->m_super->s_dev), ##args)
+
 #endif	/* __XFS_ERROR_H__ */

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:44   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:44 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor all the reflink preparation steps into a separate helper that
we'll use to land all the upcoming fixes for insufficient input checks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   96 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 71 insertions(+), 25 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 38f405415b88..80ca9b6793cd 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1195,11 +1195,33 @@ xfs_iolock_two_inodes_and_break_layout(
 	return 0;
 }
 
+/* Unlock both inodes after they've been prepped for a range clone. */
+STATIC void
+xfs_reflink_remap_unlock(
+	struct file		*file_in,
+	struct file		*file_out)
+{
+	struct inode		*inode_in = file_inode(file_in);
+	struct xfs_inode	*src = XFS_I(inode_in);
+	struct inode		*inode_out = file_inode(file_out);
+	struct xfs_inode	*dest = XFS_I(inode_out);
+	bool			same_inode = (inode_in == inode_out);
+
+	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
+	if (!same_inode)
+		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
+	inode_unlock(inode_out);
+	if (!same_inode)
+		inode_unlock_shared(inode_in);
+}
+
 /*
- * Link a range of blocks from one file to another.
+ * Prepare two files for range cloning.  Upon a successful return both inodes
+ * will have the iolock and mmaplock held, the page cache of the out file
+ * will be truncated, and any leases on the out file will have been broken.
  */
-int
-xfs_reflink_remap_range(
+STATIC int
+xfs_reflink_remap_prep(
 	struct file		*file_in,
 	loff_t			pos_in,
 	struct file		*file_out,
@@ -1211,19 +1233,9 @@ xfs_reflink_remap_range(
 	struct xfs_inode	*src = XFS_I(inode_in);
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
-	struct xfs_mount	*mp = src->i_mount;
 	bool			same_inode = (inode_in == inode_out);
-	xfs_fileoff_t		sfsbno, dfsbno;
-	xfs_filblks_t		fsblen;
-	xfs_extlen_t		cowextsize;
 	ssize_t			ret;
 
-	if (!xfs_sb_version_hasreflink(&mp->m_sb))
-		return -EOPNOTSUPP;
-
-	if (XFS_FORCED_SHUTDOWN(mp))
-		return -EIO;
-
 	/* Lock both files against IO */
 	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
 	if (ret)
@@ -1254,8 +1266,6 @@ xfs_reflink_remap_range(
 	if (ret)
 		goto out_unlock;
 
-	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
-
 	/*
 	 * Clear out post-eof preallocations because we don't have page cache
 	 * backing the delayed allocations and they'll never get freed on
@@ -1272,6 +1282,51 @@ xfs_reflink_remap_range(
 	if (ret)
 		goto out_unlock;
 
+	/* Zap any page cache for the destination file's range. */
+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
+				   PAGE_ALIGN(pos_out + len) - 1);
+	return 0;
+out_unlock:
+	xfs_reflink_remap_unlock(file_in, file_out);
+	return ret;
+}
+
+/*
+ * Link a range of blocks from one file to another.
+ */
+int
+xfs_reflink_remap_range(
+	struct file		*file_in,
+	loff_t			pos_in,
+	struct file		*file_out,
+	loff_t			pos_out,
+	u64			len,
+	bool			is_dedupe)
+{
+	struct inode		*inode_in = file_inode(file_in);
+	struct xfs_inode	*src = XFS_I(inode_in);
+	struct inode		*inode_out = file_inode(file_out);
+	struct xfs_inode	*dest = XFS_I(inode_out);
+	struct xfs_mount	*mp = src->i_mount;
+	xfs_fileoff_t		sfsbno, dfsbno;
+	xfs_filblks_t		fsblen;
+	xfs_extlen_t		cowextsize;
+	ssize_t			ret;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	/* Prepare and then clone file data. */
+	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
+			len, is_dedupe);
+	if (ret)
+		return ret;
+
+	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
+
 	dfsbno = XFS_B_TO_FSBT(mp, pos_out);
 	sfsbno = XFS_B_TO_FSBT(mp, pos_in);
 	fsblen = XFS_B_TO_FSB(mp, len);
@@ -1280,10 +1335,6 @@ xfs_reflink_remap_range(
 	if (ret)
 		goto out_unlock;
 
-	/* Zap any page cache for the destination file's range. */
-	truncate_inode_pages_range(&inode_out->i_data, pos_out,
-				   PAGE_ALIGN(pos_out + len) - 1);
-
 	/*
 	 * Carry the cowextsize hint from src to dest if we're sharing the
 	 * entire source file to the entire destination file, the source file
@@ -1300,12 +1351,7 @@ xfs_reflink_remap_range(
 			is_dedupe);
 
 out_unlock:
-	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
-	if (!same_inode)
-		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
-	inode_unlock(inode_out);
-	if (!same_inode)
-		inode_unlock_shared(inode_in);
+	xfs_reflink_remap_unlock(file_in, file_out);
 	if (ret)
 		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
 	return ret;


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
@ 2018-10-05  0:44   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:44 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor all the reflink preparation steps into a separate helper that
we'll use to land all the upcoming fixes for insufficient input checks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   96 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 71 insertions(+), 25 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 38f405415b88..80ca9b6793cd 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1195,11 +1195,33 @@ xfs_iolock_two_inodes_and_break_layout(
 	return 0;
 }
 
+/* Unlock both inodes after they've been prepped for a range clone. */
+STATIC void
+xfs_reflink_remap_unlock(
+	struct file		*file_in,
+	struct file		*file_out)
+{
+	struct inode		*inode_in = file_inode(file_in);
+	struct xfs_inode	*src = XFS_I(inode_in);
+	struct inode		*inode_out = file_inode(file_out);
+	struct xfs_inode	*dest = XFS_I(inode_out);
+	bool			same_inode = (inode_in == inode_out);
+
+	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
+	if (!same_inode)
+		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
+	inode_unlock(inode_out);
+	if (!same_inode)
+		inode_unlock_shared(inode_in);
+}
+
 /*
- * Link a range of blocks from one file to another.
+ * Prepare two files for range cloning.  Upon a successful return both inodes
+ * will have the iolock and mmaplock held, the page cache of the out file
+ * will be truncated, and any leases on the out file will have been broken.
  */
-int
-xfs_reflink_remap_range(
+STATIC int
+xfs_reflink_remap_prep(
 	struct file		*file_in,
 	loff_t			pos_in,
 	struct file		*file_out,
@@ -1211,19 +1233,9 @@ xfs_reflink_remap_range(
 	struct xfs_inode	*src = XFS_I(inode_in);
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
-	struct xfs_mount	*mp = src->i_mount;
 	bool			same_inode = (inode_in == inode_out);
-	xfs_fileoff_t		sfsbno, dfsbno;
-	xfs_filblks_t		fsblen;
-	xfs_extlen_t		cowextsize;
 	ssize_t			ret;
 
-	if (!xfs_sb_version_hasreflink(&mp->m_sb))
-		return -EOPNOTSUPP;
-
-	if (XFS_FORCED_SHUTDOWN(mp))
-		return -EIO;
-
 	/* Lock both files against IO */
 	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
 	if (ret)
@@ -1254,8 +1266,6 @@ xfs_reflink_remap_range(
 	if (ret)
 		goto out_unlock;
 
-	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
-
 	/*
 	 * Clear out post-eof preallocations because we don't have page cache
 	 * backing the delayed allocations and they'll never get freed on
@@ -1272,6 +1282,51 @@ xfs_reflink_remap_range(
 	if (ret)
 		goto out_unlock;
 
+	/* Zap any page cache for the destination file's range. */
+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
+				   PAGE_ALIGN(pos_out + len) - 1);
+	return 0;
+out_unlock:
+	xfs_reflink_remap_unlock(file_in, file_out);
+	return ret;
+}
+
+/*
+ * Link a range of blocks from one file to another.
+ */
+int
+xfs_reflink_remap_range(
+	struct file		*file_in,
+	loff_t			pos_in,
+	struct file		*file_out,
+	loff_t			pos_out,
+	u64			len,
+	bool			is_dedupe)
+{
+	struct inode		*inode_in = file_inode(file_in);
+	struct xfs_inode	*src = XFS_I(inode_in);
+	struct inode		*inode_out = file_inode(file_out);
+	struct xfs_inode	*dest = XFS_I(inode_out);
+	struct xfs_mount	*mp = src->i_mount;
+	xfs_fileoff_t		sfsbno, dfsbno;
+	xfs_filblks_t		fsblen;
+	xfs_extlen_t		cowextsize;
+	ssize_t			ret;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	/* Prepare and then clone file data. */
+	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
+			len, is_dedupe);
+	if (ret)
+		return ret;
+
+	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
+
 	dfsbno = XFS_B_TO_FSBT(mp, pos_out);
 	sfsbno = XFS_B_TO_FSBT(mp, pos_in);
 	fsblen = XFS_B_TO_FSB(mp, len);
@@ -1280,10 +1335,6 @@ xfs_reflink_remap_range(
 	if (ret)
 		goto out_unlock;
 
-	/* Zap any page cache for the destination file's range. */
-	truncate_inode_pages_range(&inode_out->i_data, pos_out,
-				   PAGE_ALIGN(pos_out + len) - 1);
-
 	/*
 	 * Carry the cowextsize hint from src to dest if we're sharing the
 	 * entire source file to the entire destination file, the source file
@@ -1300,12 +1351,7 @@ xfs_reflink_remap_range(
 			is_dedupe);
 
 out_unlock:
-	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
-	if (!same_inode)
-		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
-	inode_unlock(inode_out);
-	if (!same_inode)
-		inode_unlock_shared(inode_in);
+	xfs_reflink_remap_unlock(file_in, file_out);
 	if (ret)
 		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
 	return ret;

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 03/15] xfs: zero posteof blocks when cloning above eof
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:44   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:44 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, Zorro Lang, linux-xfs, ocfs2-devel, linux-fsdevel, linux-btrfs

From: Darrick J. Wong <darrick.wong@oracle.com>

When we're reflinking between two files and the destination file range
is well beyond the destination file's EOF marker, zero any posteof
speculative preallocations in the destination file so that we don't
expose stale disk contents.  The previous strategy of trying to clear
the preallocations does not work if the destination file has the
PREALLOC flag set.

Uncovered by shared/010.

Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 80ca9b6793cd..55da7e1154f4 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1215,6 +1215,26 @@ xfs_reflink_remap_unlock(
 		inode_unlock_shared(inode_in);
 }
 
+/*
+ * If we're reflinking to a point past the destination file's EOF, we must
+ * zero any speculative post-EOF preallocations that sit between the old EOF
+ * and the destination file offset.
+ */
+static int
+xfs_reflink_zero_posteof(
+	struct xfs_inode	*ip,
+	loff_t			pos)
+{
+	loff_t			isize = i_size_read(VFS_I(ip));
+
+	if (pos <= isize)
+		return 0;
+
+	trace_xfs_zero_eof(ip, isize, pos - isize);
+	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
+			&xfs_iomap_ops);
+}
+
 /*
  * Prepare two files for range cloning.  Upon a successful return both inodes
  * will have the iolock and mmaplock held, the page cache of the out file
@@ -1267,15 +1287,12 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	/*
-	 * Clear out post-eof preallocations because we don't have page cache
-	 * backing the delayed allocations and they'll never get freed on
-	 * their own.
+	 * Zero existing post-eof speculative preallocations in the destination
+	 * file.
 	 */
-	if (xfs_can_free_eofblocks(dest, true)) {
-		ret = xfs_free_eofblocks(dest);
-		if (ret)
-			goto out_unlock;
-	}
+	ret = xfs_reflink_zero_posteof(dest, pos_out);
+	if (ret)
+		goto out_unlock;
 
 	/* Set flags and remap blocks. */
 	ret = xfs_reflink_set_inode_flag(src, dest);


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 03/15] xfs: zero posteof blocks when cloning above eof
@ 2018-10-05  0:44   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:44 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: sandeen, Zorro Lang, linux-xfs, ocfs2-devel, linux-fsdevel, linux-btrfs

From: Darrick J. Wong <darrick.wong@oracle.com>

When we're reflinking between two files and the destination file range
is well beyond the destination file's EOF marker, zero any posteof
speculative preallocations in the destination file so that we don't
expose stale disk contents.  The previous strategy of trying to clear
the preallocations does not work if the destination file has the
PREALLOC flag set.

Uncovered by shared/010.

Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 80ca9b6793cd..55da7e1154f4 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1215,6 +1215,26 @@ xfs_reflink_remap_unlock(
 		inode_unlock_shared(inode_in);
 }
 
+/*
+ * If we're reflinking to a point past the destination file's EOF, we must
+ * zero any speculative post-EOF preallocations that sit between the old EOF
+ * and the destination file offset.
+ */
+static int
+xfs_reflink_zero_posteof(
+	struct xfs_inode	*ip,
+	loff_t			pos)
+{
+	loff_t			isize = i_size_read(VFS_I(ip));
+
+	if (pos <= isize)
+		return 0;
+
+	trace_xfs_zero_eof(ip, isize, pos - isize);
+	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
+			&xfs_iomap_ops);
+}
+
 /*
  * Prepare two files for range cloning.  Upon a successful return both inodes
  * will have the iolock and mmaplock held, the page cache of the out file
@@ -1267,15 +1287,12 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	/*
-	 * Clear out post-eof preallocations because we don't have page cache
-	 * backing the delayed allocations and they'll never get freed on
-	 * their own.
+	 * Zero existing post-eof speculative preallocations in the destination
+	 * file.
 	 */
-	if (xfs_can_free_eofblocks(dest, true)) {
-		ret = xfs_free_eofblocks(dest);
-		if (ret)
-			goto out_unlock;
-	}
+	ret = xfs_reflink_zero_posteof(dest, pos_out);
+	if (ret)
+		goto out_unlock;
 
 	/* Set flags and remap blocks. */
 	ret = xfs_reflink_set_inode_flag(src, dest);

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 04/15] xfs: update ctime and remove suid before cloning files
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:45   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Before cloning into a file, update the ctime and remove sensitive
attributes like suid, just like we'd do for a regular file write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 55da7e1154f4..d8f209bc8937 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1239,6 +1239,7 @@ xfs_reflink_zero_posteof(
  * Prepare two files for range cloning.  Upon a successful return both inodes
  * will have the iolock and mmaplock held, the page cache of the out file
  * will be truncated, and any leases on the out file will have been broken.
+ * This function borrows heavily from xfs_file_aio_write_checks.
  */
 STATIC int
 xfs_reflink_remap_prep(
@@ -1302,6 +1303,30 @@ xfs_reflink_remap_prep(
 	/* Zap any page cache for the destination file's range. */
 	truncate_inode_pages_range(&inode_out->i_data, pos_out,
 				   PAGE_ALIGN(pos_out + len) - 1);
+
+	/* If we're altering the file contents... */
+	if (!is_dedupe) {
+		/*
+		 * ...update the timestamps (which will grab the ilock again
+		 * from xfs_fs_dirty_inode, so we have to call it before we
+		 * take the ilock).
+		 */
+		if (!(file_out->f_mode & FMODE_NOCMTIME)) {
+			ret = file_update_time(file_out);
+			if (ret)
+				goto out_unlock;
+		}
+
+		/*
+		 * ...clear the security bits if the process is not being run
+		 * by root.  This keeps people from modifying setuid and setgid
+		 * binaries.
+		 */
+		ret = file_remove_privs(file_out);
+		if (ret)
+			goto out_unlock;
+	}
+
 	return 0;
 out_unlock:
 	xfs_reflink_remap_unlock(file_in, file_out);


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 04/15] xfs: update ctime and remove suid before cloning files
@ 2018-10-05  0:45   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Before cloning into a file, update the ctime and remove sensitive
attributes like suid, just like we'd do for a regular file write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 55da7e1154f4..d8f209bc8937 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1239,6 +1239,7 @@ xfs_reflink_zero_posteof(
  * Prepare two files for range cloning.  Upon a successful return both inodes
  * will have the iolock and mmaplock held, the page cache of the out file
  * will be truncated, and any leases on the out file will have been broken.
+ * This function borrows heavily from xfs_file_aio_write_checks.
  */
 STATIC int
 xfs_reflink_remap_prep(
@@ -1302,6 +1303,30 @@ xfs_reflink_remap_prep(
 	/* Zap any page cache for the destination file's range. */
 	truncate_inode_pages_range(&inode_out->i_data, pos_out,
 				   PAGE_ALIGN(pos_out + len) - 1);
+
+	/* If we're altering the file contents... */
+	if (!is_dedupe) {
+		/*
+		 * ...update the timestamps (which will grab the ilock again
+		 * from xfs_fs_dirty_inode, so we have to call it before we
+		 * take the ilock).
+		 */
+		if (!(file_out->f_mode & FMODE_NOCMTIME)) {
+			ret = file_update_time(file_out);
+			if (ret)
+				goto out_unlock;
+		}
+
+		/*
+		 * ...clear the security bits if the process is not being run
+		 * by root.  This keeps people from modifying setuid and setgid
+		 * binaries.
+		 */
+		ret = file_remove_privs(file_out);
+		if (ret)
+			goto out_unlock;
+	}
+
 	return 0;
 out_unlock:
 	xfs_reflink_remap_unlock(file_in, file_out);

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 05/15] vfs: check file ranges before cloning files
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:45   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Move the file range checks from vfs_clone_file_prep into a separate
generic_clone_checks function so that all the checks are collected in a
central location.  This forms the basis for adding more checks from
generic_write_checks that will make cloning's input checking more
consistent with write input checking.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/refcounttree.c |    2 +
 fs/read_write.c         |   55 ++++++++++----------------------------
 fs/xfs/xfs_reflink.c    |    2 +
 include/linux/fs.h      |    9 ++++--
 mm/filemap.c            |   68 +++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 90 insertions(+), 46 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 7869622af22a..11e4aad7b783 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4842,7 +4842,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
 		goto out_unlock;
 
-	ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out,
+	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
 			&len, is_dedupe);
 	if (ret <= 0)
 		goto out_unlock;
diff --git a/fs/read_write.c b/fs/read_write.c
index 39b4a21dd933..973d3da78c09 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1717,12 +1717,13 @@ static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
  * Returns: 0 for "nothing to clone", 1 for "something to clone", or
  * the usual negative error code.
  */
-int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
-			       struct inode *inode_out, loff_t pos_out,
-			       u64 *len, bool is_dedupe)
+int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
+			struct file *file_out, loff_t pos_out,
+			u64 *len, bool is_dedupe)
 {
-	loff_t bs = inode_out->i_sb->s_blocksize;
-	loff_t blen;
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	uint64_t nlen;
 	loff_t isize;
 	bool same_inode = (inode_in == inode_out);
 	int ret;
@@ -1740,10 +1741,7 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
 		return -EINVAL;
 
-	/* Are we going all the way to the end? */
 	isize = i_size_read(inode_in);
-	if (isize == 0)
-		return 0;
 
 	/* Zero length dedupe exits immediately; reflink goes to EOF. */
 	if (*len == 0) {
@@ -1754,37 +1752,15 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 		*len = isize - pos_in;
 	}
 
-	/* Ensure offsets don't wrap and the input is inside i_size */
-	if (pos_in + *len < pos_in || pos_out + *len < pos_out ||
-	    pos_in + *len > isize)
-		return -EINVAL;
-
-	/* Don't allow dedupe past EOF in the dest file */
-	if (is_dedupe) {
-		loff_t	disize;
-
-		disize = i_size_read(inode_out);
-		if (pos_out >= disize || pos_out + *len > disize)
-			return -EINVAL;
-	}
-
-	/* If we're linking to EOF, continue to the block boundary. */
-	if (pos_in + *len == isize)
-		blen = ALIGN(isize, bs) - pos_in;
-	else
-		blen = *len;
-
-	/* Only reflink if we're aligned to block boundaries */
-	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
-	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
+	/* Check that we don't violate system file offset limits. */
+	nlen = *len;
+	ret = generic_clone_checks(file_in, pos_in, file_out, pos_out, &nlen,
+			is_dedupe);
+	if (ret)
+		return ret;
+	if (nlen != *len)
 		return -EINVAL;
 
-	/* Don't allow overlapped reflink within the same file */
-	if (same_inode) {
-		if (pos_out + blen > pos_in && pos_out < pos_in + blen)
-			return -EINVAL;
-	}
-
 	/* Wait for the completion of any pending IOs on both files */
 	inode_dio_wait(inode_in);
 	if (!same_inode)
@@ -1816,7 +1792,7 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 
 	return 1;
 }
-EXPORT_SYMBOL(vfs_clone_file_prep_inodes);
+EXPORT_SYMBOL(vfs_clone_file_prep);
 
 int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len)
@@ -1854,9 +1830,6 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	if (pos_in + len > i_size_read(inode_in))
-		return -EINVAL;
-
 	ret = file_in->f_op->clone_file_range(file_in, pos_in,
 			file_out, pos_out, len);
 	if (!ret) {
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d8f209bc8937..1955e093e9ea 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1277,7 +1277,7 @@ xfs_reflink_remap_prep(
 	if (IS_DAX(inode_in) || IS_DAX(inode_out))
 		goto out_unlock;
 
-	ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out,
+	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
 			&len, is_dedupe);
 	if (ret <= 0)
 		goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6c0b4a1c22ff..2a4141d36ebf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1825,9 +1825,9 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
-extern int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
-				      struct inode *inode_out, loff_t pos_out,
-				      u64 *len, bool is_dedupe);
+extern int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
+			       struct file *file_out, loff_t pos_out,
+			       u64 *count, bool is_dedupe);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len);
 extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
@@ -2978,6 +2978,9 @@ extern int sb_min_blocksize(struct super_block *, int);
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
+extern int generic_clone_checks(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				uint64_t *count, bool is_dedupe);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 52517f28e6f4..68ec91d05c7b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2974,6 +2974,74 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+/*
+ * Performs necessary checks before doing a clone.
+ *
+ * Can adjust amount of bytes to clone.
+ * Returns appropriate error code that caller should return or
+ * zero in case the clone should be allowed.
+ */
+int generic_clone_checks(struct file *file_in, loff_t pos_in,
+			 struct file *file_out, loff_t pos_out,
+			 uint64_t *req_count, bool is_dedupe)
+{
+	struct inode *inode_in = file_in->f_mapping->host;
+	struct inode *inode_out = file_out->f_mapping->host;
+	unsigned long limit = rlimit(RLIMIT_FSIZE);
+	uint64_t count = *req_count;
+	uint64_t bcount;
+	loff_t size_in, size_out;
+	loff_t bs = inode_out->i_sb->s_blocksize;
+
+	/* The start of both ranges must be aligned to an fs block. */
+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs))
+		return -EINVAL;
+
+	/* Ensure offsets don't wrap. */
+	if (pos_in + count < pos_in || pos_out + count < pos_out)
+		return -EINVAL;
+
+	size_in = i_size_read(inode_in);
+	size_out = i_size_read(inode_out);
+
+	/* Dedupe requires both ranges to be within EOF. */
+	if (is_dedupe &&
+	    (pos_in >= size_in || pos_in + count > size_in ||
+	     pos_out >= size_out || pos_out + count > size_out))
+		return -EINVAL;
+
+	/* Ensure the infile range is within the infile. */
+	if (pos_in >= size_in)
+		return -EINVAL;
+	count = min(count, size_in - (uint64_t)pos_in);
+
+	/*
+	 * If the user wanted us to link to the infile's EOF, round up to the
+	 * next block boundary for this check.
+	 *
+	 * Otherwise, make sure the count is also block-aligned, having
+	 * already confirmed the starting offsets' block alignment.
+	 */
+	if (pos_in + count == size_in) {
+		bcount = ALIGN(size_in, bs) - pos_in;
+	} else {
+		if (!IS_ALIGNED(count, bs))
+			return -EINVAL;
+
+		bcount = count;
+	}
+
+	/* Don't allow overlapped cloning within the same file. */
+	if (inode_in == inode_out &&
+	    pos_out + bcount > pos_in &&
+	    pos_out < pos_in + bcount)
+		return -EINVAL;
+
+	*req_count = count;
+	return 0;
+}
+EXPORT_SYMBOL(generic_clone_checks);
+
 int pagecache_write_begin(struct file *file, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
 				struct page **pagep, void **fsdata)


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 05/15] vfs: check file ranges before cloning files
@ 2018-10-05  0:45   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Move the file range checks from vfs_clone_file_prep into a separate
generic_clone_checks function so that all the checks are collected in a
central location.  This forms the basis for adding more checks from
generic_write_checks that will make cloning's input checking more
consistent with write input checking.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/refcounttree.c |    2 +
 fs/read_write.c         |   55 ++++++++++----------------------------
 fs/xfs/xfs_reflink.c    |    2 +
 include/linux/fs.h      |    9 ++++--
 mm/filemap.c            |   68 +++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 90 insertions(+), 46 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 7869622af22a..11e4aad7b783 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4842,7 +4842,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 	    (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
 		goto out_unlock;
 
-	ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out,
+	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
 			&len, is_dedupe);
 	if (ret <= 0)
 		goto out_unlock;
diff --git a/fs/read_write.c b/fs/read_write.c
index 39b4a21dd933..973d3da78c09 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1717,12 +1717,13 @@ static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
  * Returns: 0 for "nothing to clone", 1 for "something to clone", or
  * the usual negative error code.
  */
-int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
-			       struct inode *inode_out, loff_t pos_out,
-			       u64 *len, bool is_dedupe)
+int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
+			struct file *file_out, loff_t pos_out,
+			u64 *len, bool is_dedupe)
 {
-	loff_t bs = inode_out->i_sb->s_blocksize;
-	loff_t blen;
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	uint64_t nlen;
 	loff_t isize;
 	bool same_inode = (inode_in == inode_out);
 	int ret;
@@ -1740,10 +1741,7 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
 		return -EINVAL;
 
-	/* Are we going all the way to the end? */
 	isize = i_size_read(inode_in);
-	if (isize == 0)
-		return 0;
 
 	/* Zero length dedupe exits immediately; reflink goes to EOF. */
 	if (*len == 0) {
@@ -1754,37 +1752,15 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 		*len = isize - pos_in;
 	}
 
-	/* Ensure offsets don't wrap and the input is inside i_size */
-	if (pos_in + *len < pos_in || pos_out + *len < pos_out ||
-	    pos_in + *len > isize)
-		return -EINVAL;
-
-	/* Don't allow dedupe past EOF in the dest file */
-	if (is_dedupe) {
-		loff_t	disize;
-
-		disize = i_size_read(inode_out);
-		if (pos_out >= disize || pos_out + *len > disize)
-			return -EINVAL;
-	}
-
-	/* If we're linking to EOF, continue to the block boundary. */
-	if (pos_in + *len == isize)
-		blen = ALIGN(isize, bs) - pos_in;
-	else
-		blen = *len;
-
-	/* Only reflink if we're aligned to block boundaries */
-	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
-	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
+	/* Check that we don't violate system file offset limits. */
+	nlen = *len;
+	ret = generic_clone_checks(file_in, pos_in, file_out, pos_out, &nlen,
+			is_dedupe);
+	if (ret)
+		return ret;
+	if (nlen != *len)
 		return -EINVAL;
 
-	/* Don't allow overlapped reflink within the same file */
-	if (same_inode) {
-		if (pos_out + blen > pos_in && pos_out < pos_in + blen)
-			return -EINVAL;
-	}
-
 	/* Wait for the completion of any pending IOs on both files */
 	inode_dio_wait(inode_in);
 	if (!same_inode)
@@ -1816,7 +1792,7 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 
 	return 1;
 }
-EXPORT_SYMBOL(vfs_clone_file_prep_inodes);
+EXPORT_SYMBOL(vfs_clone_file_prep);
 
 int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len)
@@ -1854,9 +1830,6 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	if (pos_in + len > i_size_read(inode_in))
-		return -EINVAL;
-
 	ret = file_in->f_op->clone_file_range(file_in, pos_in,
 			file_out, pos_out, len);
 	if (!ret) {
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d8f209bc8937..1955e093e9ea 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1277,7 +1277,7 @@ xfs_reflink_remap_prep(
 	if (IS_DAX(inode_in) || IS_DAX(inode_out))
 		goto out_unlock;
 
-	ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out,
+	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
 			&len, is_dedupe);
 	if (ret <= 0)
 		goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6c0b4a1c22ff..2a4141d36ebf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1825,9 +1825,9 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
-extern int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
-				      struct inode *inode_out, loff_t pos_out,
-				      u64 *len, bool is_dedupe);
+extern int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
+			       struct file *file_out, loff_t pos_out,
+			       u64 *count, bool is_dedupe);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len);
 extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
@@ -2978,6 +2978,9 @@ extern int sb_min_blocksize(struct super_block *, int);
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
+extern int generic_clone_checks(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				uint64_t *count, bool is_dedupe);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 52517f28e6f4..68ec91d05c7b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2974,6 +2974,74 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+/*
+ * Performs necessary checks before doing a clone.
+ *
+ * Can adjust amount of bytes to clone.
+ * Returns appropriate error code that caller should return or
+ * zero in case the clone should be allowed.
+ */
+int generic_clone_checks(struct file *file_in, loff_t pos_in,
+			 struct file *file_out, loff_t pos_out,
+			 uint64_t *req_count, bool is_dedupe)
+{
+	struct inode *inode_in = file_in->f_mapping->host;
+	struct inode *inode_out = file_out->f_mapping->host;
+	unsigned long limit = rlimit(RLIMIT_FSIZE);
+	uint64_t count = *req_count;
+	uint64_t bcount;
+	loff_t size_in, size_out;
+	loff_t bs = inode_out->i_sb->s_blocksize;
+
+	/* The start of both ranges must be aligned to an fs block. */
+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs))
+		return -EINVAL;
+
+	/* Ensure offsets don't wrap. */
+	if (pos_in + count < pos_in || pos_out + count < pos_out)
+		return -EINVAL;
+
+	size_in = i_size_read(inode_in);
+	size_out = i_size_read(inode_out);
+
+	/* Dedupe requires both ranges to be within EOF. */
+	if (is_dedupe &&
+	    (pos_in >= size_in || pos_in + count > size_in ||
+	     pos_out >= size_out || pos_out + count > size_out))
+		return -EINVAL;
+
+	/* Ensure the infile range is within the infile. */
+	if (pos_in >= size_in)
+		return -EINVAL;
+	count = min(count, size_in - (uint64_t)pos_in);
+
+	/*
+	 * If the user wanted us to link to the infile's EOF, round up to the
+	 * next block boundary for this check.
+	 *
+	 * Otherwise, make sure the count is also block-aligned, having
+	 * already confirmed the starting offsets' block alignment.
+	 */
+	if (pos_in + count == size_in) {
+		bcount = ALIGN(size_in, bs) - pos_in;
+	} else {
+		if (!IS_ALIGNED(count, bs))
+			return -EINVAL;
+
+		bcount = count;
+	}
+
+	/* Don't allow overlapped cloning within the same file. */
+	if (inode_in == inode_out &&
+	    pos_out + bcount > pos_in &&
+	    pos_out < pos_in + bcount)
+		return -EINVAL;
+
+	*req_count = count;
+	return 0;
+}
+EXPORT_SYMBOL(generic_clone_checks);
+
 int pagecache_write_begin(struct file *file, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
 				struct page **pagep, void **fsdata)

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 06/15] vfs: strengthen checking of file range inputs to clone/dedupe range
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:45   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Clone range is an optimization on a regular file write.  File writes
that extend the file length are subject to various constraints which are
not checked by clonerange.  This is a correctness problem, because we're
never allowed to touch ranges that the page cache can't support
(s_maxbytes); we're not supposed to deal with large offsets
(MAX_NON_LFS) if O_LARGEFILE isn't set; and we must obey resource limits
(RLIMIT_FSIZE).

Therefore, add these checks to the new generic_clone_checks function so
that we curtail unexpected behavior.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 mm/filemap.c |   31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)


diff --git a/mm/filemap.c b/mm/filemap.c
index 68ec91d05c7b..f74391721234 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3015,6 +3015,37 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in,
 		return -EINVAL;
 	count = min(count, size_in - (uint64_t)pos_in);
 
+	/* Don't exceed RLMIT_FSIZE in the file we're writing into. */
+	if (limit != RLIM_INFINITY) {
+		if (pos_out >= limit) {
+			send_sig(SIGXFSZ, current, 0);
+			return -EFBIG;
+		}
+		count = min(count, limit - (uint64_t)pos_out);
+	}
+
+	/* Don't exceed the LFS limits. */
+	if (unlikely(pos_out + count > MAX_NON_LFS &&
+				!(file_out->f_flags & O_LARGEFILE))) {
+		if (pos_out >= MAX_NON_LFS)
+			return -EFBIG;
+		count = min(count, MAX_NON_LFS - (uint64_t)pos_out);
+	}
+	if (unlikely(pos_in + count > MAX_NON_LFS &&
+				!(file_in->f_flags & O_LARGEFILE))) {
+		if (pos_in >= MAX_NON_LFS)
+			return -EFBIG;
+		count = min(count, MAX_NON_LFS - (uint64_t)pos_in);
+	}
+
+	/* Don't operate on ranges the page cache doesn't support. */
+	if (unlikely(pos_out >= inode_out->i_sb->s_maxbytes ||
+		     pos_in >= inode_in->i_sb->s_maxbytes))
+		return -EFBIG;
+
+	count = min(count, inode_out->i_sb->s_maxbytes - (uint64_t)pos_out);
+	count = min(count, inode_in->i_sb->s_maxbytes - (uint64_t)pos_in);
+
 	/*
 	 * If the user wanted us to link to the infile's EOF, round up to the
 	 * next block boundary for this check.


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 06/15] vfs: strengthen checking of file range inputs to clone/dedupe range
@ 2018-10-05  0:45   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Clone range is an optimization on a regular file write.  File writes
that extend the file length are subject to various constraints which are
not checked by clonerange.  This is a correctness problem, because we're
never allowed to touch ranges that the page cache can't support
(s_maxbytes); we're not supposed to deal with large offsets
(MAX_NON_LFS) if O_LARGEFILE isn't set; and we must obey resource limits
(RLIMIT_FSIZE).

Therefore, add these checks to the new generic_clone_checks function so
that we curtail unexpected behavior.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 mm/filemap.c |   31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)


diff --git a/mm/filemap.c b/mm/filemap.c
index 68ec91d05c7b..f74391721234 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3015,6 +3015,37 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in,
 		return -EINVAL;
 	count = min(count, size_in - (uint64_t)pos_in);
 
+	/* Don't exceed RLMIT_FSIZE in the file we're writing into. */
+	if (limit != RLIM_INFINITY) {
+		if (pos_out >= limit) {
+			send_sig(SIGXFSZ, current, 0);
+			return -EFBIG;
+		}
+		count = min(count, limit - (uint64_t)pos_out);
+	}
+
+	/* Don't exceed the LFS limits. */
+	if (unlikely(pos_out + count > MAX_NON_LFS &&
+				!(file_out->f_flags & O_LARGEFILE))) {
+		if (pos_out >= MAX_NON_LFS)
+			return -EFBIG;
+		count = min(count, MAX_NON_LFS - (uint64_t)pos_out);
+	}
+	if (unlikely(pos_in + count > MAX_NON_LFS &&
+				!(file_in->f_flags & O_LARGEFILE))) {
+		if (pos_in >= MAX_NON_LFS)
+			return -EFBIG;
+		count = min(count, MAX_NON_LFS - (uint64_t)pos_in);
+	}
+
+	/* Don't operate on ranges the page cache doesn't support. */
+	if (unlikely(pos_out >= inode_out->i_sb->s_maxbytes ||
+		     pos_in >= inode_in->i_sb->s_maxbytes))
+		return -EFBIG;
+
+	count = min(count, inode_out->i_sb->s_maxbytes - (uint64_t)pos_out);
+	count = min(count, inode_in->i_sb->s_maxbytes - (uint64_t)pos_in);
+
 	/*
 	 * If the user wanted us to link to the infile's EOF, round up to the
 	 * next block boundary for this check.

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 07/15] vfs: skip zero-length dedupe requests
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:45   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Don't bother calling the filesystem for a zero-length dedupe request;
we can return zero and exit.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c |    5 +++++
 1 file changed, 5 insertions(+)


diff --git a/fs/read_write.c b/fs/read_write.c
index 973d3da78c09..99b2f809180c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1966,6 +1966,11 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 	if (!dst_file->f_op->dedupe_file_range)
 		goto out_drop_write;
 
+	if (len == 0) {
+		ret = 0;
+		goto out_drop_write;
+	}
+
 	ret = dst_file->f_op->dedupe_file_range(src_file, src_pos,
 						dst_file, dst_pos, len);
 out_drop_write:


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 07/15] vfs: skip zero-length dedupe requests
@ 2018-10-05  0:45   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Don't bother calling the filesystem for a zero-length dedupe request;
we can return zero and exit.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c |    5 +++++
 1 file changed, 5 insertions(+)


diff --git a/fs/read_write.c b/fs/read_write.c
index 973d3da78c09..99b2f809180c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1966,6 +1966,11 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 	if (!dst_file->f_op->dedupe_file_range)
 		goto out_drop_write;
 
+	if (len == 0) {
+		ret = 0;
+		goto out_drop_write;
+	}
+
 	ret = dst_file->f_op->dedupe_file_range(src_file, src_pos,
 						dst_file, dst_pos, len);
 out_drop_write:

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 08/15] vfs: change clone and dedupe range function pointers to return bytes completed
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:45   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Change the clone_file_range and dedupe_file_range functions to return
the number of bytes they operated on.  This is the precursor to allowing
fs implementations to return short clone/dedupe results to the user,
which will enable us to obey resource limits in a graceful manner.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/btrfs/ctree.h    |    4 ++--
 fs/btrfs/ioctl.c    |   13 +++++++++----
 fs/nfs/nfs4file.c   |    4 ++--
 fs/ocfs2/file.c     |   18 ++++++++++++------
 fs/overlayfs/file.c |   18 ++++++++++++------
 fs/read_write.c     |   31 ++++++++++++++++++-------------
 fs/xfs/xfs_file.c   |   18 ++++++++++++------
 include/linux/fs.h  |   12 +++++++-----
 8 files changed, 74 insertions(+), 44 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..864651257142 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3218,7 +3218,7 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 				struct btrfs_ioctl_space_info *space);
 void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
 			       struct btrfs_ioctl_balance_args *bargs);
-int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
+s64 btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
 			    struct file *dst_file, loff_t dst_loff,
 			    u64 olen);
 
@@ -3250,7 +3250,7 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages,
 		      size_t num_pages, loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
-int btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
+s64 btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
 			   struct file *file_out, loff_t pos_out, u64 len);
 
 /* tree-defrag.c */
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d60b6caf09e8..35ba974f1333 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3627,13 +3627,14 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 	return ret;
 }
 
-int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
+s64 btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
 			    struct file *dst_file, loff_t dst_loff,
 			    u64 olen)
 {
 	struct inode *src = file_inode(src_file);
 	struct inode *dst = file_inode(dst_file);
 	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
+	int ret;
 
 	if (WARN_ON_ONCE(bs < PAGE_SIZE)) {
 		/*
@@ -3644,7 +3645,8 @@ int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
 		return -EINVAL;
 	}
 
-	return btrfs_extent_same(src, src_loff, olen, dst, dst_loff);
+	ret = btrfs_extent_same(src, src_loff, olen, dst, dst_loff);
+	return ret < 0 ? ret : olen;
 }
 
 static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
@@ -4348,10 +4350,13 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	return ret;
 }
 
-int btrfs_clone_file_range(struct file *src_file, loff_t off,
+s64 btrfs_clone_file_range(struct file *src_file, loff_t off,
 		struct file *dst_file, loff_t destoff, u64 len)
 {
-	return btrfs_clone_files(dst_file, src_file, off, len, destoff);
+	int ret;
+
+	ret = btrfs_clone_files(dst_file, src_file, off, len, destoff);
+	return ret < 0 ? ret : len;
 }
 
 static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 4288a6ecaf75..f914861f844f 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -180,7 +180,7 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
 	return nfs42_proc_allocate(filep, offset, len);
 }
 
-static int nfs42_clone_file_range(struct file *src_file, loff_t src_off,
+static s64 nfs42_clone_file_range(struct file *src_file, loff_t src_off,
 		struct file *dst_file, loff_t dst_off, u64 count)
 {
 	struct inode *dst_inode = file_inode(dst_file);
@@ -240,7 +240,7 @@ static int nfs42_clone_file_range(struct file *src_file, loff_t src_off,
 		inode_unlock(src_inode);
 	}
 out:
-	return ret;
+	return ret < 0 ? ret : count;
 }
 #endif /* CONFIG_NFS_V4_2 */
 
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 9fa35cb6f6e0..c4b78ee4a593 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2527,24 +2527,30 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
-static int ocfs2_file_clone_range(struct file *file_in,
+static s64 ocfs2_file_clone_range(struct file *file_in,
 				  loff_t pos_in,
 				  struct file *file_out,
 				  loff_t pos_out,
 				  u64 len)
 {
-	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					 len, false);
+	int ret;
+
+	ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					len, false);
+	return ret < 0 ? ret : len;
 }
 
-static int ocfs2_file_dedupe_range(struct file *file_in,
+static s64 ocfs2_file_dedupe_range(struct file *file_in,
 				   loff_t pos_in,
 				   struct file *file_out,
 				   loff_t pos_out,
 				   u64 len)
 {
-	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					  len, true);
+	int ret;
+
+	ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					len, true);
+	return ret < 0 ? ret : len;
 }
 
 const struct inode_operations ocfs2_file_iops = {
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index aeaefd2a551b..6d792d817538 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -487,16 +487,21 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
 			    OVL_COPY);
 }
 
-static int ovl_clone_file_range(struct file *file_in, loff_t pos_in,
+static s64 ovl_clone_file_range(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out, u64 len)
 {
-	return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
-			    OVL_CLONE);
+	int ret;
+
+	ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
+			   OVL_CLONE);
+	return ret < 0 ? ret : len;
 }
 
-static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
+static s64 ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
 				 struct file *file_out, loff_t pos_out, u64 len)
 {
+	int ret;
+
 	/*
 	 * Don't copy up because of a dedupe request, this wouldn't make sense
 	 * most of the time (data would be duplicated instead of deduplicated).
@@ -505,8 +510,9 @@ static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
 	    !ovl_inode_upper(file_inode(file_out)))
 		return -EPERM;
 
-	return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
-			    OVL_DEDUPE);
+	ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
+			   OVL_DEDUPE);
+	return ret < 0 ? ret : len;
 }
 
 const struct file_operations ovl_file_operations = {
diff --git a/fs/read_write.c b/fs/read_write.c
index 99b2f809180c..f51751281454 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1589,10 +1589,12 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	 * more efficient if both clone and copy are supported (e.g. NFS).
 	 */
 	if (file_in->f_op->clone_file_range) {
-		ret = file_in->f_op->clone_file_range(file_in, pos_in,
-				file_out, pos_out, len);
-		if (ret == 0) {
-			ret = len;
+		s64 cloned;
+
+		cloned = file_in->f_op->clone_file_range(file_in, pos_in,
+				file_out, pos_out, min(MAX_RW_COUNT, len));
+		if (cloned >= 0) {
+			ret = cloned;
 			goto done;
 		}
 	}
@@ -1799,6 +1801,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
+	s64 cloned;
 	int ret;
 
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
@@ -1830,14 +1833,16 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_in->f_op->clone_file_range(file_in, pos_in,
+	cloned = file_in->f_op->clone_file_range(file_in, pos_in,
 			file_out, pos_out, len);
-	if (!ret) {
-		fsnotify_access(file_in);
-		fsnotify_modify(file_out);
-	}
+	if (cloned < 0)
+		return cloned;
+	else if (len && cloned != len)
+		return -EINVAL;
 
-	return ret;
+	fsnotify_access(file_in);
+	fsnotify_modify(file_out);
+	return 0;
 }
 EXPORT_SYMBOL(vfs_clone_file_range);
 
@@ -1937,7 +1942,7 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
 
-int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
+s64 vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 			      struct file *dst_file, loff_t dst_pos, u64 len)
 {
 	s64 ret;
@@ -1989,7 +1994,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 	int i;
 	int ret;
 	u16 count = same->dest_count;
-	int deduped;
+	s64 deduped;
 
 	if (!(file->f_mode & FMODE_READ))
 		return -EINVAL;
@@ -2046,7 +2051,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 		else if (deduped < 0)
 			info->status = deduped;
 		else
-			info->bytes_deduped = len;
+			info->bytes_deduped = deduped;
 
 next_fdput:
 		fdput(dst_fd);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 61a5ad2600e8..efa95e0d8cee 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -919,7 +919,7 @@ xfs_file_fallocate(
 	return error;
 }
 
-STATIC int
+STATIC s64
 xfs_file_clone_range(
 	struct file	*file_in,
 	loff_t		pos_in,
@@ -927,11 +927,14 @@ xfs_file_clone_range(
 	loff_t		pos_out,
 	u64		len)
 {
-	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-				     len, false);
+	int		ret;
+
+	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, false);
+	return ret < 0 ? ret : len;
 }
 
-STATIC int
+STATIC s64
 xfs_file_dedupe_range(
 	struct file	*file_in,
 	loff_t		pos_in,
@@ -939,8 +942,11 @@ xfs_file_dedupe_range(
 	loff_t		pos_out,
 	u64		len)
 {
-	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-				     len, true);
+	int		ret;
+
+	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, true);
+	return ret < 0 ? ret : len;
 }
 
 STATIC int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a4141d36ebf..e5755340e825 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1759,10 +1759,12 @@ struct file_operations {
 #endif
 	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
 			loff_t, size_t, unsigned int);
-	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
-			u64);
-	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t,
-			u64);
+	s64 (*clone_file_range)(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				u64 count);
+	s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in,
+				 struct file *file_out, loff_t pos_out,
+				 u64 count);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 } __randomize_layout;
 
@@ -1835,7 +1837,7 @@ extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 					 loff_t len, bool *is_same);
 extern int vfs_dedupe_file_range(struct file *file,
 				 struct file_dedupe_range *same);
-extern int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
+extern s64 vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 				     struct file *dst_file, loff_t dst_pos,
 				     u64 len);
 


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 08/15] vfs: change clone and dedupe range function pointers to return bytes completed
@ 2018-10-05  0:45   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Change the clone_file_range and dedupe_file_range functions to return
the number of bytes they operated on.  This is the precursor to allowing
fs implementations to return short clone/dedupe results to the user,
which will enable us to obey resource limits in a graceful manner.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/btrfs/ctree.h    |    4 ++--
 fs/btrfs/ioctl.c    |   13 +++++++++----
 fs/nfs/nfs4file.c   |    4 ++--
 fs/ocfs2/file.c     |   18 ++++++++++++------
 fs/overlayfs/file.c |   18 ++++++++++++------
 fs/read_write.c     |   31 ++++++++++++++++++-------------
 fs/xfs/xfs_file.c   |   18 ++++++++++++------
 include/linux/fs.h  |   12 +++++++-----
 8 files changed, 74 insertions(+), 44 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..864651257142 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3218,7 +3218,7 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 				struct btrfs_ioctl_space_info *space);
 void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
 			       struct btrfs_ioctl_balance_args *bargs);
-int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
+s64 btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
 			    struct file *dst_file, loff_t dst_loff,
 			    u64 olen);
 
@@ -3250,7 +3250,7 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages,
 		      size_t num_pages, loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
-int btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
+s64 btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
 			   struct file *file_out, loff_t pos_out, u64 len);
 
 /* tree-defrag.c */
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d60b6caf09e8..35ba974f1333 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3627,13 +3627,14 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 	return ret;
 }
 
-int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
+s64 btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
 			    struct file *dst_file, loff_t dst_loff,
 			    u64 olen)
 {
 	struct inode *src = file_inode(src_file);
 	struct inode *dst = file_inode(dst_file);
 	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
+	int ret;
 
 	if (WARN_ON_ONCE(bs < PAGE_SIZE)) {
 		/*
@@ -3644,7 +3645,8 @@ int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
 		return -EINVAL;
 	}
 
-	return btrfs_extent_same(src, src_loff, olen, dst, dst_loff);
+	ret = btrfs_extent_same(src, src_loff, olen, dst, dst_loff);
+	return ret < 0 ? ret : olen;
 }
 
 static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
@@ -4348,10 +4350,13 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	return ret;
 }
 
-int btrfs_clone_file_range(struct file *src_file, loff_t off,
+s64 btrfs_clone_file_range(struct file *src_file, loff_t off,
 		struct file *dst_file, loff_t destoff, u64 len)
 {
-	return btrfs_clone_files(dst_file, src_file, off, len, destoff);
+	int ret;
+
+	ret = btrfs_clone_files(dst_file, src_file, off, len, destoff);
+	return ret < 0 ? ret : len;
 }
 
 static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 4288a6ecaf75..f914861f844f 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -180,7 +180,7 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
 	return nfs42_proc_allocate(filep, offset, len);
 }
 
-static int nfs42_clone_file_range(struct file *src_file, loff_t src_off,
+static s64 nfs42_clone_file_range(struct file *src_file, loff_t src_off,
 		struct file *dst_file, loff_t dst_off, u64 count)
 {
 	struct inode *dst_inode = file_inode(dst_file);
@@ -240,7 +240,7 @@ static int nfs42_clone_file_range(struct file *src_file, loff_t src_off,
 		inode_unlock(src_inode);
 	}
 out:
-	return ret;
+	return ret < 0 ? ret : count;
 }
 #endif /* CONFIG_NFS_V4_2 */
 
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 9fa35cb6f6e0..c4b78ee4a593 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2527,24 +2527,30 @@ static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
-static int ocfs2_file_clone_range(struct file *file_in,
+static s64 ocfs2_file_clone_range(struct file *file_in,
 				  loff_t pos_in,
 				  struct file *file_out,
 				  loff_t pos_out,
 				  u64 len)
 {
-	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					 len, false);
+	int ret;
+
+	ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					len, false);
+	return ret < 0 ? ret : len;
 }
 
-static int ocfs2_file_dedupe_range(struct file *file_in,
+static s64 ocfs2_file_dedupe_range(struct file *file_in,
 				   loff_t pos_in,
 				   struct file *file_out,
 				   loff_t pos_out,
 				   u64 len)
 {
-	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					  len, true);
+	int ret;
+
+	ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+					len, true);
+	return ret < 0 ? ret : len;
 }
 
 const struct inode_operations ocfs2_file_iops = {
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index aeaefd2a551b..6d792d817538 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -487,16 +487,21 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
 			    OVL_COPY);
 }
 
-static int ovl_clone_file_range(struct file *file_in, loff_t pos_in,
+static s64 ovl_clone_file_range(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out, u64 len)
 {
-	return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
-			    OVL_CLONE);
+	int ret;
+
+	ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
+			   OVL_CLONE);
+	return ret < 0 ? ret : len;
 }
 
-static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
+static s64 ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
 				 struct file *file_out, loff_t pos_out, u64 len)
 {
+	int ret;
+
 	/*
 	 * Don't copy up because of a dedupe request, this wouldn't make sense
 	 * most of the time (data would be duplicated instead of deduplicated).
@@ -505,8 +510,9 @@ static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
 	    !ovl_inode_upper(file_inode(file_out)))
 		return -EPERM;
 
-	return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
-			    OVL_DEDUPE);
+	ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
+			   OVL_DEDUPE);
+	return ret < 0 ? ret : len;
 }
 
 const struct file_operations ovl_file_operations = {
diff --git a/fs/read_write.c b/fs/read_write.c
index 99b2f809180c..f51751281454 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1589,10 +1589,12 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	 * more efficient if both clone and copy are supported (e.g. NFS).
 	 */
 	if (file_in->f_op->clone_file_range) {
-		ret = file_in->f_op->clone_file_range(file_in, pos_in,
-				file_out, pos_out, len);
-		if (ret == 0) {
-			ret = len;
+		s64 cloned;
+
+		cloned = file_in->f_op->clone_file_range(file_in, pos_in,
+				file_out, pos_out, min(MAX_RW_COUNT, len));
+		if (cloned >= 0) {
+			ret = cloned;
 			goto done;
 		}
 	}
@@ -1799,6 +1801,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
+	s64 cloned;
 	int ret;
 
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
@@ -1830,14 +1833,16 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_in->f_op->clone_file_range(file_in, pos_in,
+	cloned = file_in->f_op->clone_file_range(file_in, pos_in,
 			file_out, pos_out, len);
-	if (!ret) {
-		fsnotify_access(file_in);
-		fsnotify_modify(file_out);
-	}
+	if (cloned < 0)
+		return cloned;
+	else if (len && cloned != len)
+		return -EINVAL;
 
-	return ret;
+	fsnotify_access(file_in);
+	fsnotify_modify(file_out);
+	return 0;
 }
 EXPORT_SYMBOL(vfs_clone_file_range);
 
@@ -1937,7 +1942,7 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
 
-int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
+s64 vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 			      struct file *dst_file, loff_t dst_pos, u64 len)
 {
 	s64 ret;
@@ -1989,7 +1994,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 	int i;
 	int ret;
 	u16 count = same->dest_count;
-	int deduped;
+	s64 deduped;
 
 	if (!(file->f_mode & FMODE_READ))
 		return -EINVAL;
@@ -2046,7 +2051,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 		else if (deduped < 0)
 			info->status = deduped;
 		else
-			info->bytes_deduped = len;
+			info->bytes_deduped = deduped;
 
 next_fdput:
 		fdput(dst_fd);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 61a5ad2600e8..efa95e0d8cee 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -919,7 +919,7 @@ xfs_file_fallocate(
 	return error;
 }
 
-STATIC int
+STATIC s64
 xfs_file_clone_range(
 	struct file	*file_in,
 	loff_t		pos_in,
@@ -927,11 +927,14 @@ xfs_file_clone_range(
 	loff_t		pos_out,
 	u64		len)
 {
-	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-				     len, false);
+	int		ret;
+
+	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, false);
+	return ret < 0 ? ret : len;
 }
 
-STATIC int
+STATIC s64
 xfs_file_dedupe_range(
 	struct file	*file_in,
 	loff_t		pos_in,
@@ -939,8 +942,11 @@ xfs_file_dedupe_range(
 	loff_t		pos_out,
 	u64		len)
 {
-	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-				     len, true);
+	int		ret;
+
+	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, true);
+	return ret < 0 ? ret : len;
 }
 
 STATIC int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a4141d36ebf..e5755340e825 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1759,10 +1759,12 @@ struct file_operations {
 #endif
 	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
 			loff_t, size_t, unsigned int);
-	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
-			u64);
-	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t,
-			u64);
+	s64 (*clone_file_range)(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				u64 count);
+	s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in,
+				 struct file *file_out, loff_t pos_out,
+				 u64 count);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 } __randomize_layout;
 
@@ -1835,7 +1837,7 @@ extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 					 loff_t len, bool *is_same);
 extern int vfs_dedupe_file_range(struct file *file,
 				 struct file_dedupe_range *same);
-extern int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
+extern s64 vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 				     struct file *dst_file, loff_t dst_pos,
 				     u64 len);
 

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 09/15] vfs: pass operation flags to {clone, dedupe}_file_range implementations
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:45   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Pass operational flags to the per-filesystem clone and dedupe
implementations.  This enables the vfs to signal when it can deal with
short clone and short dedupe operations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/btrfs/ctree.h        |    3 ++-
 fs/btrfs/ioctl.c        |    3 ++-
 fs/nfs/nfs4file.c       |    3 ++-
 fs/ocfs2/file.c         |    3 ++-
 fs/ocfs2/refcounttree.c |    2 +-
 fs/overlayfs/file.c     |    3 ++-
 fs/read_write.c         |    9 ++++++---
 fs/xfs/xfs_file.c       |    3 ++-
 fs/xfs/xfs_reflink.c    |    2 +-
 include/linux/fs.h      |   10 ++++++++--
 10 files changed, 28 insertions(+), 13 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 864651257142..e8c9b871709d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3251,7 +3251,8 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 s64 btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
-			   struct file *file_out, loff_t pos_out, u64 len);
+			   struct file *file_out, loff_t pos_out, u64 len,
+			   unsigned int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 35ba974f1333..b41a65622b93 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4351,7 +4351,8 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 }
 
 s64 btrfs_clone_file_range(struct file *src_file, loff_t off,
-		struct file *dst_file, loff_t destoff, u64 len)
+		struct file *dst_file, loff_t destoff, u64 len,
+		unsigned int flags)
 {
 	int ret;
 
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index f914861f844f..f8ff06fc1c73 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -181,7 +181,8 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
 }
 
 static s64 nfs42_clone_file_range(struct file *src_file, loff_t src_off,
-		struct file *dst_file, loff_t dst_off, u64 count)
+		struct file *dst_file, loff_t dst_off, u64 count,
+		unsigned int flags)
 {
 	struct inode *dst_inode = file_inode(dst_file);
 	struct nfs_server *server = NFS_SERVER(dst_inode);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index c4b78ee4a593..1ee6d3ecdac6 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2531,7 +2531,8 @@ static s64 ocfs2_file_clone_range(struct file *file_in,
 				  loff_t pos_in,
 				  struct file *file_out,
 				  loff_t pos_out,
-				  u64 len)
+				  u64 len,
+				  unsigned int flags)
 {
 	int ret;
 
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 11e4aad7b783..3758954f2377 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4843,7 +4843,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 		goto out_unlock;
 
 	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
-			&len, is_dedupe);
+			&len, is_dedupe ? CLONERANGE_DEDUPE : 0);
 	if (ret <= 0)
 		goto out_unlock;
 
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 6d792d817538..440cb7a82834 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -488,7 +488,8 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
 }
 
 static s64 ovl_clone_file_range(struct file *file_in, loff_t pos_in,
-				struct file *file_out, loff_t pos_out, u64 len)
+				struct file *file_out, loff_t pos_out, u64 len,
+				unsigned int flags)
 {
 	int ret;
 
diff --git a/fs/read_write.c b/fs/read_write.c
index f51751281454..7cfff497263b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1592,7 +1592,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 		s64 cloned;
 
 		cloned = file_in->f_op->clone_file_range(file_in, pos_in,
-				file_out, pos_out, min(MAX_RW_COUNT, len));
+				file_out, pos_out, min(MAX_RW_COUNT, len),
+				CLONERANGE_SHORT);
 		if (cloned >= 0) {
 			ret = cloned;
 			goto done;
@@ -1721,13 +1722,14 @@ static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
  */
 int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 			struct file *file_out, loff_t pos_out,
-			u64 *len, bool is_dedupe)
+			u64 *len, unsigned int flags)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
 	uint64_t nlen;
 	loff_t isize;
 	bool same_inode = (inode_in == inode_out);
+	bool is_dedupe = (flags & CLONERANGE_DEDUPE);
 	int ret;
 
 	/* Don't touch certain kinds of inodes */
@@ -1802,6 +1804,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
 	s64 cloned;
+	unsigned int flags = 0;
 	int ret;
 
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
@@ -1834,7 +1837,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		return ret;
 
 	cloned = file_in->f_op->clone_file_range(file_in, pos_in,
-			file_out, pos_out, len);
+			file_out, pos_out, len, flags);
 	if (cloned < 0)
 		return cloned;
 	else if (len && cloned != len)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index efa95e0d8cee..d5d6681ca714 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -925,7 +925,8 @@ xfs_file_clone_range(
 	loff_t		pos_in,
 	struct file	*file_out,
 	loff_t		pos_out,
-	u64		len)
+	u64		len,
+	unsigned int	flags)
 {
 	int		ret;
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 1955e093e9ea..40684dd011ee 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1278,7 +1278,7 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
-			&len, is_dedupe);
+			&len, is_dedupe ? CLONERANGE_DEDUPE : 0);
 	if (ret <= 0)
 		goto out_unlock;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e5755340e825..ae5685c31270 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1761,7 +1761,7 @@ struct file_operations {
 			loff_t, size_t, unsigned int);
 	s64 (*clone_file_range)(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
-				u64 count);
+				u64 count, unsigned int flags);
 	s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in,
 				 struct file *file_out, loff_t pos_out,
 				 u64 count);
@@ -1827,9 +1827,15 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+/* Caller can handle a shortened operation. */
+#define CLONERANGE_SHORT	(1 << 0)
+/* End operation at the source file's EOF. */
+#define CLONERANGE_EOF		(1 << 1)
+/* Operation is actually dedupe, not clone. */
+#define CLONERANGE_DEDUPE	(1 << 2)
 extern int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 			       struct file *file_out, loff_t pos_out,
-			       u64 *count, bool is_dedupe);
+			       u64 *count, unsigned int flags);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len);
 extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 09/15] vfs: pass operation flags to {clone, dedupe}_file_range implementations
@ 2018-10-05  0:45   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Pass operational flags to the per-filesystem clone and dedupe
implementations.  This enables the vfs to signal when it can deal with
short clone and short dedupe operations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/btrfs/ctree.h        |    3 ++-
 fs/btrfs/ioctl.c        |    3 ++-
 fs/nfs/nfs4file.c       |    3 ++-
 fs/ocfs2/file.c         |    3 ++-
 fs/ocfs2/refcounttree.c |    2 +-
 fs/overlayfs/file.c     |    3 ++-
 fs/read_write.c         |    9 ++++++---
 fs/xfs/xfs_file.c       |    3 ++-
 fs/xfs/xfs_reflink.c    |    2 +-
 include/linux/fs.h      |   10 ++++++++--
 10 files changed, 28 insertions(+), 13 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 864651257142..e8c9b871709d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3251,7 +3251,8 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 s64 btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
-			   struct file *file_out, loff_t pos_out, u64 len);
+			   struct file *file_out, loff_t pos_out, u64 len,
+			   unsigned int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 35ba974f1333..b41a65622b93 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4351,7 +4351,8 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 }
 
 s64 btrfs_clone_file_range(struct file *src_file, loff_t off,
-		struct file *dst_file, loff_t destoff, u64 len)
+		struct file *dst_file, loff_t destoff, u64 len,
+		unsigned int flags)
 {
 	int ret;
 
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index f914861f844f..f8ff06fc1c73 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -181,7 +181,8 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
 }
 
 static s64 nfs42_clone_file_range(struct file *src_file, loff_t src_off,
-		struct file *dst_file, loff_t dst_off, u64 count)
+		struct file *dst_file, loff_t dst_off, u64 count,
+		unsigned int flags)
 {
 	struct inode *dst_inode = file_inode(dst_file);
 	struct nfs_server *server = NFS_SERVER(dst_inode);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index c4b78ee4a593..1ee6d3ecdac6 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2531,7 +2531,8 @@ static s64 ocfs2_file_clone_range(struct file *file_in,
 				  loff_t pos_in,
 				  struct file *file_out,
 				  loff_t pos_out,
-				  u64 len)
+				  u64 len,
+				  unsigned int flags)
 {
 	int ret;
 
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 11e4aad7b783..3758954f2377 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4843,7 +4843,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 		goto out_unlock;
 
 	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
-			&len, is_dedupe);
+			&len, is_dedupe ? CLONERANGE_DEDUPE : 0);
 	if (ret <= 0)
 		goto out_unlock;
 
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 6d792d817538..440cb7a82834 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -488,7 +488,8 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
 }
 
 static s64 ovl_clone_file_range(struct file *file_in, loff_t pos_in,
-				struct file *file_out, loff_t pos_out, u64 len)
+				struct file *file_out, loff_t pos_out, u64 len,
+				unsigned int flags)
 {
 	int ret;
 
diff --git a/fs/read_write.c b/fs/read_write.c
index f51751281454..7cfff497263b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1592,7 +1592,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 		s64 cloned;
 
 		cloned = file_in->f_op->clone_file_range(file_in, pos_in,
-				file_out, pos_out, min(MAX_RW_COUNT, len));
+				file_out, pos_out, min(MAX_RW_COUNT, len),
+				CLONERANGE_SHORT);
 		if (cloned >= 0) {
 			ret = cloned;
 			goto done;
@@ -1721,13 +1722,14 @@ static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
  */
 int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 			struct file *file_out, loff_t pos_out,
-			u64 *len, bool is_dedupe)
+			u64 *len, unsigned int flags)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
 	uint64_t nlen;
 	loff_t isize;
 	bool same_inode = (inode_in == inode_out);
+	bool is_dedupe = (flags & CLONERANGE_DEDUPE);
 	int ret;
 
 	/* Don't touch certain kinds of inodes */
@@ -1802,6 +1804,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
 	s64 cloned;
+	unsigned int flags = 0;
 	int ret;
 
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
@@ -1834,7 +1837,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		return ret;
 
 	cloned = file_in->f_op->clone_file_range(file_in, pos_in,
-			file_out, pos_out, len);
+			file_out, pos_out, len, flags);
 	if (cloned < 0)
 		return cloned;
 	else if (len && cloned != len)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index efa95e0d8cee..d5d6681ca714 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -925,7 +925,8 @@ xfs_file_clone_range(
 	loff_t		pos_in,
 	struct file	*file_out,
 	loff_t		pos_out,
-	u64		len)
+	u64		len,
+	unsigned int	flags)
 {
 	int		ret;
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 1955e093e9ea..40684dd011ee 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1278,7 +1278,7 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
-			&len, is_dedupe);
+			&len, is_dedupe ? CLONERANGE_DEDUPE : 0);
 	if (ret <= 0)
 		goto out_unlock;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e5755340e825..ae5685c31270 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1761,7 +1761,7 @@ struct file_operations {
 			loff_t, size_t, unsigned int);
 	s64 (*clone_file_range)(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
-				u64 count);
+				u64 count, unsigned int flags);
 	s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in,
 				 struct file *file_out, loff_t pos_out,
 				 u64 count);
@@ -1827,9 +1827,15 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+/* Caller can handle a shortened operation. */
+#define CLONERANGE_SHORT	(1 << 0)
+/* End operation@the source file's EOF. */
+#define CLONERANGE_EOF		(1 << 1)
+/* Operation is actually dedupe, not clone. */
+#define CLONERANGE_DEDUPE	(1 << 2)
 extern int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 			       struct file *file_out, loff_t pos_out,
-			       u64 *count, bool is_dedupe);
+			       u64 *count, unsigned int flags);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len);
 extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 10/15] vfs: make cloning to source file eof more explicit
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:45   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Use the new CLONE_RANGE_EOF flag to explicitly declare that the caller
wants the clone implementation to set *len appropriately once the files
are locked.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 7cfff497263b..4eaea52f70a8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1747,11 +1747,12 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 
 	isize = i_size_read(inode_in);
 
-	/* Zero length dedupe exits immediately; reflink goes to EOF. */
-	if (*len == 0) {
-		if (is_dedupe || pos_in == isize)
-			return 0;
-		if (pos_in > isize)
+	/*
+	 * If the caller asked to go all the way to the end of the source file,
+	 * set *len now that we have the file locked.
+	 */
+	if ((flags & CLONERANGE_EOF) && *len == 0) {
+		if (pos_in >= isize)
 			return -EINVAL;
 		*len = isize - pos_in;
 	}
@@ -1836,6 +1837,9 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
+	if (len == 0)
+		flags |= CLONERANGE_EOF;
+
 	cloned = file_in->f_op->clone_file_range(file_in, pos_in,
 			file_out, pos_out, len, flags);
 	if (cloned < 0)


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 10/15] vfs: make cloning to source file eof more explicit
@ 2018-10-05  0:45   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Use the new CLONE_RANGE_EOF flag to explicitly declare that the caller
wants the clone implementation to set *len appropriately once the files
are locked.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 7cfff497263b..4eaea52f70a8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1747,11 +1747,12 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 
 	isize = i_size_read(inode_in);
 
-	/* Zero length dedupe exits immediately; reflink goes to EOF. */
-	if (*len == 0) {
-		if (is_dedupe || pos_in == isize)
-			return 0;
-		if (pos_in > isize)
+	/*
+	 * If the caller asked to go all the way to the end of the source file,
+	 * set *len now that we have the file locked.
+	 */
+	if ((flags & CLONERANGE_EOF) && *len == 0) {
+		if (pos_in >= isize)
 			return -EINVAL;
 		*len = isize - pos_in;
 	}
@@ -1836,6 +1837,9 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
+	if (len == 0)
+		flags |= CLONERANGE_EOF;
+
 	cloned = file_in->f_op->clone_file_range(file_in, pos_in,
 			file_out, pos_out, len, flags);
 	if (cloned < 0)

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 11/15] vfs: allow short clone and dedupe operations
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:45   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Allow the clone and dedupe prep function to shorten the request if the
caller can handle it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c    |    8 ++------
 include/linux/fs.h |    2 +-
 mm/filemap.c       |   19 +++++++++++++++----
 3 files changed, 18 insertions(+), 11 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 4eaea52f70a8..292d68c2f47c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1726,7 +1726,6 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
-	uint64_t nlen;
 	loff_t isize;
 	bool same_inode = (inode_in == inode_out);
 	bool is_dedupe = (flags & CLONERANGE_DEDUPE);
@@ -1758,13 +1757,10 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 	}
 
 	/* Check that we don't violate system file offset limits. */
-	nlen = *len;
-	ret = generic_clone_checks(file_in, pos_in, file_out, pos_out, &nlen,
-			is_dedupe);
+	ret = generic_clone_checks(file_in, pos_in, file_out, pos_out, len,
+			flags);
 	if (ret)
 		return ret;
-	if (nlen != *len)
-		return -EINVAL;
 
 	/* Wait for the completion of any pending IOs on both files */
 	inode_dio_wait(inode_in);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae5685c31270..eb35363478e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2988,7 +2988,7 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
 extern int generic_clone_checks(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
-				uint64_t *count, bool is_dedupe);
+				uint64_t *count, unsigned int flags);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index f74391721234..013451b8017f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2983,7 +2983,7 @@ EXPORT_SYMBOL(generic_write_checks);
  */
 int generic_clone_checks(struct file *file_in, loff_t pos_in,
 			 struct file *file_out, loff_t pos_out,
-			 uint64_t *req_count, bool is_dedupe)
+			 uint64_t *req_count, unsigned int flags)
 {
 	struct inode *inode_in = file_in->f_mapping->host;
 	struct inode *inode_out = file_out->f_mapping->host;
@@ -3005,7 +3005,7 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in,
 	size_out = i_size_read(inode_out);
 
 	/* Dedupe requires both ranges to be within EOF. */
-	if (is_dedupe &&
+	if ((flags & CLONERANGE_DEDUPE) &&
 	    (pos_in >= size_in || pos_in + count > size_in ||
 	     pos_out >= size_out || pos_out + count > size_out))
 		return -EINVAL;
@@ -3056,8 +3056,12 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in,
 	if (pos_in + count == size_in) {
 		bcount = ALIGN(size_in, bs) - pos_in;
 	} else {
-		if (!IS_ALIGNED(count, bs))
-			return -EINVAL;
+		if (!IS_ALIGNED(count, bs)) {
+			if (flags & CLONERANGE_SHORT)
+				count = ALIGN_DOWN(count, bs);
+			else
+				return -EINVAL;
+		}
 
 		bcount = count;
 	}
@@ -3068,6 +3072,13 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in,
 	    pos_out < pos_in + bcount)
 		return -EINVAL;
 
+	/*
+	 * We shortened the request but the caller can't deal with that, so
+	 * bounce the request back to userspace.
+	 */
+	if (*req_count != count && !(flags & CLONERANGE_SHORT))
+		return -EINVAL;
+
 	*req_count = count;
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 11/15] vfs: allow short clone and dedupe operations
@ 2018-10-05  0:45   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:45 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Allow the clone and dedupe prep function to shorten the request if the
caller can handle it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c    |    8 ++------
 include/linux/fs.h |    2 +-
 mm/filemap.c       |   19 +++++++++++++++----
 3 files changed, 18 insertions(+), 11 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 4eaea52f70a8..292d68c2f47c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1726,7 +1726,6 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
-	uint64_t nlen;
 	loff_t isize;
 	bool same_inode = (inode_in == inode_out);
 	bool is_dedupe = (flags & CLONERANGE_DEDUPE);
@@ -1758,13 +1757,10 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 	}
 
 	/* Check that we don't violate system file offset limits. */
-	nlen = *len;
-	ret = generic_clone_checks(file_in, pos_in, file_out, pos_out, &nlen,
-			is_dedupe);
+	ret = generic_clone_checks(file_in, pos_in, file_out, pos_out, len,
+			flags);
 	if (ret)
 		return ret;
-	if (nlen != *len)
-		return -EINVAL;
 
 	/* Wait for the completion of any pending IOs on both files */
 	inode_dio_wait(inode_in);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae5685c31270..eb35363478e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2988,7 +2988,7 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
 extern int generic_clone_checks(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
-				uint64_t *count, bool is_dedupe);
+				uint64_t *count, unsigned int flags);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index f74391721234..013451b8017f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2983,7 +2983,7 @@ EXPORT_SYMBOL(generic_write_checks);
  */
 int generic_clone_checks(struct file *file_in, loff_t pos_in,
 			 struct file *file_out, loff_t pos_out,
-			 uint64_t *req_count, bool is_dedupe)
+			 uint64_t *req_count, unsigned int flags)
 {
 	struct inode *inode_in = file_in->f_mapping->host;
 	struct inode *inode_out = file_out->f_mapping->host;
@@ -3005,7 +3005,7 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in,
 	size_out = i_size_read(inode_out);
 
 	/* Dedupe requires both ranges to be within EOF. */
-	if (is_dedupe &&
+	if ((flags & CLONERANGE_DEDUPE) &&
 	    (pos_in >= size_in || pos_in + count > size_in ||
 	     pos_out >= size_out || pos_out + count > size_out))
 		return -EINVAL;
@@ -3056,8 +3056,12 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in,
 	if (pos_in + count == size_in) {
 		bcount = ALIGN(size_in, bs) - pos_in;
 	} else {
-		if (!IS_ALIGNED(count, bs))
-			return -EINVAL;
+		if (!IS_ALIGNED(count, bs)) {
+			if (flags & CLONERANGE_SHORT)
+				count = ALIGN_DOWN(count, bs);
+			else
+				return -EINVAL;
+		}
 
 		bcount = count;
 	}
@@ -3068,6 +3072,13 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in,
 	    pos_out < pos_in + bcount)
 		return -EINVAL;
 
+	/*
+	 * We shortened the request but the caller can't deal with that, so
+	 * bounce the request back to userspace.
+	 */
+	if (*req_count != count && !(flags & CLONERANGE_SHORT))
+		return -EINVAL;
+
 	*req_count = count;
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 12/15] vfs: implement opportunistic short dedupe
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:46   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:46 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

For a given dedupe request, the bytes_deduped field in the control
structure tells userspace if we managed to deduplicate some, but not all
of, the requested regions starting from the file offsets supplied.
However, due to sloppy coding, the current dedupe code returns
FILE_DEDUPE_RANGE_DIFFERS if any part of the range is different.
Fix this so that we can actually support partial request completion.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c    |   44 +++++++++++++++++++++++++++++++++++---------
 include/linux/fs.h |    2 +-
 2 files changed, 36 insertions(+), 10 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 292d68c2f47c..9be9f261edd2 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1781,13 +1781,11 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 	 * Check that the extents are the same.
 	 */
 	if (is_dedupe) {
-		bool		is_same = false;
-
 		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
-				inode_out, pos_out, *len, &is_same);
+				inode_out, pos_out, len);
 		if (ret)
 			return ret;
-		if (!is_same)
+		if (*len == 0)
 			return -EBADE;
 	}
 
@@ -1872,13 +1870,30 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
 	return page;
 }
 
+static unsigned int vfs_dedupe_memcmp(const char *s1, const char *s2,
+				      unsigned int cmp_len)
+{
+	const char *orig_s1 = s1;
+	const char *e1 = s1 + cmp_len;
+	const char *e2 = s2 + cmp_len;
+
+	while (s1 < e1 && s2 < e2) {
+		if (*s1 != *s2)
+			break;
+		s1++;
+		s2++;
+	}
+
+	return s1 - orig_s1;
+}
+
 /*
  * Compare extents of two files to see if they are the same.
  * Caller must have locked both inodes to prevent write races.
  */
 int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 				  struct inode *dest, loff_t destoff,
-				  loff_t len, bool *is_same)
+				  loff_t *req_len)
 {
 	loff_t src_poff;
 	loff_t dest_poff;
@@ -1886,8 +1901,11 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 	void *dest_addr;
 	struct page *src_page;
 	struct page *dest_page;
-	loff_t cmp_len;
+	loff_t len = *req_len;
+	loff_t same_len = 0;
 	bool same;
+	unsigned int cmp_len;
+	unsigned int cmp_same;
 	int error;
 
 	error = -EINVAL;
@@ -1897,7 +1915,7 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		dest_poff = destoff & (PAGE_SIZE - 1);
 		cmp_len = min(PAGE_SIZE - src_poff,
 			      PAGE_SIZE - dest_poff);
-		cmp_len = min(cmp_len, len);
+		cmp_len = min_t(loff_t, cmp_len, len);
 		if (cmp_len <= 0)
 			goto out_error;
 
@@ -1919,7 +1937,10 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		flush_dcache_page(src_page);
 		flush_dcache_page(dest_page);
 
-		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+		cmp_same = vfs_dedupe_memcmp(src_addr + src_poff,
+					     dest_addr + dest_poff, cmp_len);
+		same_len += cmp_same;
+		if (cmp_same != cmp_len)
 			same = false;
 
 		kunmap_atomic(dest_addr);
@@ -1937,7 +1958,12 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		len -= cmp_len;
 	}
 
-	*is_same = same;
+	/*
+	 * If less than the whole range matched, we have to back down to the
+	 * nearest block boundary.
+	 */
+	if (*req_len != same_len)
+		*req_len = ALIGN_DOWN(same_len, dest->i_sb->s_blocksize);
 	return 0;
 
 out_error:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index eb35363478e5..490128b84d10 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1840,7 +1840,7 @@ extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len);
 extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 					 struct inode *dest, loff_t destoff,
-					 loff_t len, bool *is_same);
+					 loff_t *len);
 extern int vfs_dedupe_file_range(struct file *file,
 				 struct file_dedupe_range *same);
 extern s64 vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 12/15] vfs: implement opportunistic short dedupe
@ 2018-10-05  0:46   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:46 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

For a given dedupe request, the bytes_deduped field in the control
structure tells userspace if we managed to deduplicate some, but not all
of, the requested regions starting from the file offsets supplied.
However, due to sloppy coding, the current dedupe code returns
FILE_DEDUPE_RANGE_DIFFERS if any part of the range is different.
Fix this so that we can actually support partial request completion.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c    |   44 +++++++++++++++++++++++++++++++++++---------
 include/linux/fs.h |    2 +-
 2 files changed, 36 insertions(+), 10 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 292d68c2f47c..9be9f261edd2 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1781,13 +1781,11 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 	 * Check that the extents are the same.
 	 */
 	if (is_dedupe) {
-		bool		is_same = false;
-
 		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
-				inode_out, pos_out, *len, &is_same);
+				inode_out, pos_out, len);
 		if (ret)
 			return ret;
-		if (!is_same)
+		if (*len == 0)
 			return -EBADE;
 	}
 
@@ -1872,13 +1870,30 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
 	return page;
 }
 
+static unsigned int vfs_dedupe_memcmp(const char *s1, const char *s2,
+				      unsigned int cmp_len)
+{
+	const char *orig_s1 = s1;
+	const char *e1 = s1 + cmp_len;
+	const char *e2 = s2 + cmp_len;
+
+	while (s1 < e1 && s2 < e2) {
+		if (*s1 != *s2)
+			break;
+		s1++;
+		s2++;
+	}
+
+	return s1 - orig_s1;
+}
+
 /*
  * Compare extents of two files to see if they are the same.
  * Caller must have locked both inodes to prevent write races.
  */
 int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 				  struct inode *dest, loff_t destoff,
-				  loff_t len, bool *is_same)
+				  loff_t *req_len)
 {
 	loff_t src_poff;
 	loff_t dest_poff;
@@ -1886,8 +1901,11 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 	void *dest_addr;
 	struct page *src_page;
 	struct page *dest_page;
-	loff_t cmp_len;
+	loff_t len = *req_len;
+	loff_t same_len = 0;
 	bool same;
+	unsigned int cmp_len;
+	unsigned int cmp_same;
 	int error;
 
 	error = -EINVAL;
@@ -1897,7 +1915,7 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		dest_poff = destoff & (PAGE_SIZE - 1);
 		cmp_len = min(PAGE_SIZE - src_poff,
 			      PAGE_SIZE - dest_poff);
-		cmp_len = min(cmp_len, len);
+		cmp_len = min_t(loff_t, cmp_len, len);
 		if (cmp_len <= 0)
 			goto out_error;
 
@@ -1919,7 +1937,10 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		flush_dcache_page(src_page);
 		flush_dcache_page(dest_page);
 
-		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+		cmp_same = vfs_dedupe_memcmp(src_addr + src_poff,
+					     dest_addr + dest_poff, cmp_len);
+		same_len += cmp_same;
+		if (cmp_same != cmp_len)
 			same = false;
 
 		kunmap_atomic(dest_addr);
@@ -1937,7 +1958,12 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		len -= cmp_len;
 	}
 
-	*is_same = same;
+	/*
+	 * If less than the whole range matched, we have to back down to the
+	 * nearest block boundary.
+	 */
+	if (*req_len != same_len)
+		*req_len = ALIGN_DOWN(same_len, dest->i_sb->s_blocksize);
 	return 0;
 
 out_error:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index eb35363478e5..490128b84d10 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1840,7 +1840,7 @@ extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len);
 extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 					 struct inode *dest, loff_t destoff,
-					 loff_t len, bool *is_same);
+					 loff_t *len);
 extern int vfs_dedupe_file_range(struct file *file,
 				 struct file_dedupe_range *same);
 extern s64 vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 13/15] ocfs2: truncate page cache for clone destination file before remapping
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:46   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:46 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

When cloning blocks into another file, truncate the page cache before we
start remapping blocks so that concurrent reads wait for us to finish.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/refcounttree.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 3758954f2377..02283c38b5a4 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4853,14 +4853,12 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 		down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
 				  SINGLE_DEPTH_NESTING);
 
-	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
-					 out_bh, pos_out, len);
-
 	/* Zap any page cache for the destination file's range. */
-	if (!ret)
-		truncate_inode_pages_range(&inode_out->i_data, pos_out,
-					   PAGE_ALIGN(pos_out + len) - 1);
+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
+				   PAGE_ALIGN(pos_out + len) - 1);
 
+	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
+					 out_bh, pos_out, len);
 	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
 	if (!same_inode)
 		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 13/15] ocfs2: truncate page cache for clone destination file before remapping
@ 2018-10-05  0:46   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:46 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

When cloning blocks into another file, truncate the page cache before we
start remapping blocks so that concurrent reads wait for us to finish.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/refcounttree.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 3758954f2377..02283c38b5a4 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4853,14 +4853,12 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 		down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
 				  SINGLE_DEPTH_NESTING);
 
-	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
-					 out_bh, pos_out, len);
-
 	/* Zap any page cache for the destination file's range. */
-	if (!ret)
-		truncate_inode_pages_range(&inode_out->i_data, pos_out,
-					   PAGE_ALIGN(pos_out + len) - 1);
+	truncate_inode_pages_range(&inode_out->i_data, pos_out,
+				   PAGE_ALIGN(pos_out + len) - 1);
 
+	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
+					 out_bh, pos_out, len);
 	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
 	if (!same_inode)
 		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 14/15] ocfs2: support partial clone range and dedupe range
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:46   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:46 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Change the ocfs2 remap code to allow for returning partial results.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/file.c         |   14 ++++----------
 fs/ocfs2/refcounttree.c |   35 ++++++++++++++++++++---------------
 fs/ocfs2/refcounttree.h |    4 ++--
 3 files changed, 26 insertions(+), 27 deletions(-)


diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 1ee6d3ecdac6..4562111259b4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2534,11 +2534,8 @@ static s64 ocfs2_file_clone_range(struct file *file_in,
 				  u64 len,
 				  unsigned int flags)
 {
-	int ret;
-
-	ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					len, false);
-	return ret < 0 ? ret : len;
+	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, flags);
 }
 
 static s64 ocfs2_file_dedupe_range(struct file *file_in,
@@ -2547,11 +2544,8 @@ static s64 ocfs2_file_dedupe_range(struct file *file_in,
 				   loff_t pos_out,
 				   u64 len)
 {
-	int ret;
-
-	ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					len, true);
-	return ret < 0 ? ret : len;
+	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, CLONERANGE_DEDUPE | CLONERANGE_SHORT);
 }
 
 const struct inode_operations ocfs2_file_iops = {
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 02283c38b5a4..60be5c76db6d 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4499,7 +4499,7 @@ static int ocfs2_reflink_update_dest(struct inode *dest,
 }
 
 /* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
-static int ocfs2_reflink_remap_extent(struct inode *s_inode,
+static s64 ocfs2_reflink_remap_extent(struct inode *s_inode,
 				      struct buffer_head *s_bh,
 				      loff_t pos_in,
 				      struct inode *t_inode,
@@ -4514,6 +4514,7 @@ static int ocfs2_reflink_remap_extent(struct inode *s_inode,
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *ref_tree;
 	struct ocfs2_super *osb;
+	s64 remapped = 0;
 	loff_t pstart, plen;
 	u32 p_cluster, num_clusters, slast, spos, tpos;
 	unsigned int ext_flags;
@@ -4597,18 +4598,20 @@ static int ocfs2_reflink_remap_extent(struct inode *s_inode,
 next_loop:
 		spos += num_clusters;
 		tpos += num_clusters;
+		remapped += ocfs2_clusters_to_bytes(t_inode->i_sb,
+				num_clusters);
 	}
 
-out:
-	return ret;
+	return remapped;
 out_unlock_refcount:
 	ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
 	brelse(ref_root_bh);
-	return ret;
+out:
+	return remapped > 0 ? remapped : ret;
 }
 
 /* Set up refcount tree and remap s_inode to t_inode. */
-static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
+static s64 ocfs2_reflink_remap_blocks(struct inode *s_inode,
 				      struct buffer_head *s_bh,
 				      loff_t pos_in,
 				      struct inode *t_inode,
@@ -4620,7 +4623,7 @@ static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
 	struct ocfs2_super *osb;
 	struct ocfs2_dinode *dis;
 	struct ocfs2_dinode *dit;
-	int ret;
+	s64 ret;
 
 	osb = OCFS2_SB(s_inode->i_sb);
 	dis = (struct ocfs2_dinode *)s_bh->b_data;
@@ -4692,7 +4695,7 @@ static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
 	/* Actually remap extents now. */
 	ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh,
 					 pos_out, len, &dealloc);
-	if (ret) {
+	if (ret < 0) {
 		mlog_errno(ret);
 		goto out;
 	}
@@ -4812,18 +4815,19 @@ static void ocfs2_reflink_inodes_unlock(struct inode *s_inode,
 }
 
 /* Link a range of blocks from one file to another. */
-int ocfs2_reflink_remap_range(struct file *file_in,
+s64 ocfs2_reflink_remap_range(struct file *file_in,
 			      loff_t pos_in,
 			      struct file *file_out,
 			      loff_t pos_out,
 			      u64 len,
-			      bool is_dedupe)
+			      unsigned int flags)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
 	struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
 	struct buffer_head *in_bh = NULL, *out_bh = NULL;
 	bool same_inode = (inode_in == inode_out);
+	s64 remapped = 0;
 	ssize_t ret;
 
 	if (!ocfs2_refcount_tree(osb))
@@ -4843,7 +4847,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 		goto out_unlock;
 
 	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
-			&len, is_dedupe ? CLONERANGE_DEDUPE : 0);
+			&len, flags);
 	if (ret <= 0)
 		goto out_unlock;
 
@@ -4857,12 +4861,13 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 	truncate_inode_pages_range(&inode_out->i_data, pos_out,
 				   PAGE_ALIGN(pos_out + len) - 1);
 
-	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
-					 out_bh, pos_out, len);
+	remapped = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in,
+			inode_out, out_bh, pos_out, len);
 	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
 	if (!same_inode)
 		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
-	if (ret) {
+	if (remapped < 0) {
+		ret = remapped;
 		mlog_errno(ret);
 		goto out_unlock;
 	}
@@ -4881,9 +4886,9 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 	}
 
 	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
-	return 0;
+	return remapped;
 
 out_unlock:
 	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
-	return ret;
+	return remapped > 0 ? remapped : ret;
 }
diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
index 4af55bf4b35b..410a3145a06a 100644
--- a/fs/ocfs2/refcounttree.h
+++ b/fs/ocfs2/refcounttree.h
@@ -115,11 +115,11 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 			const char __user *oldname,
 			const char __user *newname,
 			bool preserve);
-int ocfs2_reflink_remap_range(struct file *file_in,
+s64 ocfs2_reflink_remap_range(struct file *file_in,
 			      loff_t pos_in,
 			      struct file *file_out,
 			      loff_t pos_out,
 			      u64 len,
-			      bool is_dedupe);
+			      unsigned int flags);
 
 #endif /* OCFS2_REFCOUNTTREE_H */


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 14/15] ocfs2: support partial clone range and dedupe range
@ 2018-10-05  0:46   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:46 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Change the ocfs2 remap code to allow for returning partial results.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ocfs2/file.c         |   14 ++++----------
 fs/ocfs2/refcounttree.c |   35 ++++++++++++++++++++---------------
 fs/ocfs2/refcounttree.h |    4 ++--
 3 files changed, 26 insertions(+), 27 deletions(-)


diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 1ee6d3ecdac6..4562111259b4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2534,11 +2534,8 @@ static s64 ocfs2_file_clone_range(struct file *file_in,
 				  u64 len,
 				  unsigned int flags)
 {
-	int ret;
-
-	ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					len, false);
-	return ret < 0 ? ret : len;
+	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, flags);
 }
 
 static s64 ocfs2_file_dedupe_range(struct file *file_in,
@@ -2547,11 +2544,8 @@ static s64 ocfs2_file_dedupe_range(struct file *file_in,
 				   loff_t pos_out,
 				   u64 len)
 {
-	int ret;
-
-	ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-					len, true);
-	return ret < 0 ? ret : len;
+	return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, CLONERANGE_DEDUPE | CLONERANGE_SHORT);
 }
 
 const struct inode_operations ocfs2_file_iops = {
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 02283c38b5a4..60be5c76db6d 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4499,7 +4499,7 @@ static int ocfs2_reflink_update_dest(struct inode *dest,
 }
 
 /* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
-static int ocfs2_reflink_remap_extent(struct inode *s_inode,
+static s64 ocfs2_reflink_remap_extent(struct inode *s_inode,
 				      struct buffer_head *s_bh,
 				      loff_t pos_in,
 				      struct inode *t_inode,
@@ -4514,6 +4514,7 @@ static int ocfs2_reflink_remap_extent(struct inode *s_inode,
 	struct buffer_head *ref_root_bh = NULL;
 	struct ocfs2_refcount_tree *ref_tree;
 	struct ocfs2_super *osb;
+	s64 remapped = 0;
 	loff_t pstart, plen;
 	u32 p_cluster, num_clusters, slast, spos, tpos;
 	unsigned int ext_flags;
@@ -4597,18 +4598,20 @@ static int ocfs2_reflink_remap_extent(struct inode *s_inode,
 next_loop:
 		spos += num_clusters;
 		tpos += num_clusters;
+		remapped += ocfs2_clusters_to_bytes(t_inode->i_sb,
+				num_clusters);
 	}
 
-out:
-	return ret;
+	return remapped;
 out_unlock_refcount:
 	ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
 	brelse(ref_root_bh);
-	return ret;
+out:
+	return remapped > 0 ? remapped : ret;
 }
 
 /* Set up refcount tree and remap s_inode to t_inode. */
-static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
+static s64 ocfs2_reflink_remap_blocks(struct inode *s_inode,
 				      struct buffer_head *s_bh,
 				      loff_t pos_in,
 				      struct inode *t_inode,
@@ -4620,7 +4623,7 @@ static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
 	struct ocfs2_super *osb;
 	struct ocfs2_dinode *dis;
 	struct ocfs2_dinode *dit;
-	int ret;
+	s64 ret;
 
 	osb = OCFS2_SB(s_inode->i_sb);
 	dis = (struct ocfs2_dinode *)s_bh->b_data;
@@ -4692,7 +4695,7 @@ static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
 	/* Actually remap extents now. */
 	ret = ocfs2_reflink_remap_extent(s_inode, s_bh, pos_in, t_inode, t_bh,
 					 pos_out, len, &dealloc);
-	if (ret) {
+	if (ret < 0) {
 		mlog_errno(ret);
 		goto out;
 	}
@@ -4812,18 +4815,19 @@ static void ocfs2_reflink_inodes_unlock(struct inode *s_inode,
 }
 
 /* Link a range of blocks from one file to another. */
-int ocfs2_reflink_remap_range(struct file *file_in,
+s64 ocfs2_reflink_remap_range(struct file *file_in,
 			      loff_t pos_in,
 			      struct file *file_out,
 			      loff_t pos_out,
 			      u64 len,
-			      bool is_dedupe)
+			      unsigned int flags)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
 	struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
 	struct buffer_head *in_bh = NULL, *out_bh = NULL;
 	bool same_inode = (inode_in == inode_out);
+	s64 remapped = 0;
 	ssize_t ret;
 
 	if (!ocfs2_refcount_tree(osb))
@@ -4843,7 +4847,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 		goto out_unlock;
 
 	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
-			&len, is_dedupe ? CLONERANGE_DEDUPE : 0);
+			&len, flags);
 	if (ret <= 0)
 		goto out_unlock;
 
@@ -4857,12 +4861,13 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 	truncate_inode_pages_range(&inode_out->i_data, pos_out,
 				   PAGE_ALIGN(pos_out + len) - 1);
 
-	ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
-					 out_bh, pos_out, len);
+	remapped = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in,
+			inode_out, out_bh, pos_out, len);
 	up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
 	if (!same_inode)
 		up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
-	if (ret) {
+	if (remapped < 0) {
+		ret = remapped;
 		mlog_errno(ret);
 		goto out_unlock;
 	}
@@ -4881,9 +4886,9 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 	}
 
 	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
-	return 0;
+	return remapped;
 
 out_unlock:
 	ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
-	return ret;
+	return remapped > 0 ? remapped : ret;
 }
diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
index 4af55bf4b35b..410a3145a06a 100644
--- a/fs/ocfs2/refcounttree.h
+++ b/fs/ocfs2/refcounttree.h
@@ -115,11 +115,11 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 			const char __user *oldname,
 			const char __user *newname,
 			bool preserve);
-int ocfs2_reflink_remap_range(struct file *file_in,
+s64 ocfs2_reflink_remap_range(struct file *file_in,
 			      loff_t pos_in,
 			      struct file *file_out,
 			      loff_t pos_out,
 			      u64 len,
-			      bool is_dedupe);
+			      unsigned int flags);
 
 #endif /* OCFS2_REFCOUNTTREE_H */

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 15/15] xfs: support returning partial reflink results
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  0:46   ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:46 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Back when the XFS reflink code only supported clone_file_range, we were
only able to return zero or negative error codes to userspace.  However,
now that copy_file_range (which returns bytes copied) can use XFS'
clone_file_range, we have the opportunity to return partial results.
For example, if userspace sends a 1GB clone request and we run out of
space halfway through, we at least can tell userspace that we completed
512M of that request like a regular write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_file.c    |   14 ++++----------
 fs/xfs/xfs_reflink.c |   46 ++++++++++++++++++++++++++++++----------------
 fs/xfs/xfs_reflink.h |    5 +++--
 3 files changed, 37 insertions(+), 28 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d5d6681ca714..4f7aff6d3360 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -928,11 +928,8 @@ xfs_file_clone_range(
 	u64		len,
 	unsigned int	flags)
 {
-	int		ret;
-
-	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-			len, false);
-	return ret < 0 ? ret : len;
+	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, flags);
 }
 
 STATIC s64
@@ -943,11 +940,8 @@ xfs_file_dedupe_range(
 	loff_t		pos_out,
 	u64		len)
 {
-	int		ret;
-
-	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-			len, true);
-	return ret < 0 ? ret : len;
+	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, CLONERANGE_DEDUPE | CLONERANGE_SHORT);
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 40684dd011ee..784cebbd0157 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1090,7 +1090,7 @@ xfs_reflink_remap_extent(
 /*
  * Iteratively remap one file's extents (and holes) to another's.
  */
-STATIC int
+STATIC int64_t
 xfs_reflink_remap_blocks(
 	struct xfs_inode	*src,
 	xfs_fileoff_t		srcoff,
@@ -1100,6 +1100,7 @@ xfs_reflink_remap_blocks(
 	xfs_off_t		new_isize)
 {
 	struct xfs_bmbt_irec	imap;
+	int64_t			remapped = 0;
 	int			nimaps;
 	int			error = 0;
 	xfs_filblks_t		range_len;
@@ -1142,13 +1143,14 @@ xfs_reflink_remap_blocks(
 		srcoff += range_len;
 		destoff += range_len;
 		len -= range_len;
+		remapped += range_len;
 	}
 
-	return 0;
+	return remapped;
 
 err:
 	trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
-	return error;
+	return remapped > 0 ? remapped : error;
 }
 
 /*
@@ -1247,14 +1249,15 @@ xfs_reflink_remap_prep(
 	loff_t			pos_in,
 	struct file		*file_out,
 	loff_t			pos_out,
-	u64			len,
-	bool			is_dedupe)
+	u64			*len,
+	unsigned int		flags)
 {
 	struct inode		*inode_in = file_inode(file_in);
 	struct xfs_inode	*src = XFS_I(inode_in);
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	bool			same_inode = (inode_in == inode_out);
+	bool			is_dedupe = (flags & CLONERANGE_DEDUPE);
 	ssize_t			ret;
 
 	/* Lock both files against IO */
@@ -1278,7 +1281,7 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
-			&len, is_dedupe ? CLONERANGE_DEDUPE : 0);
+			len, flags);
 	if (ret <= 0)
 		goto out_unlock;
 
@@ -1302,7 +1305,7 @@ xfs_reflink_remap_prep(
 
 	/* Zap any page cache for the destination file's range. */
 	truncate_inode_pages_range(&inode_out->i_data, pos_out,
-				   PAGE_ALIGN(pos_out + len) - 1);
+				   PAGE_ALIGN(pos_out + *len) - 1);
 
 	/* If we're altering the file contents... */
 	if (!is_dedupe) {
@@ -1336,14 +1339,14 @@ xfs_reflink_remap_prep(
 /*
  * Link a range of blocks from one file to another.
  */
-int
+s64
 xfs_reflink_remap_range(
 	struct file		*file_in,
 	loff_t			pos_in,
 	struct file		*file_out,
 	loff_t			pos_out,
 	u64			len,
-	bool			is_dedupe)
+	unsigned int		flags)
 {
 	struct inode		*inode_in = file_inode(file_in);
 	struct xfs_inode	*src = XFS_I(inode_in);
@@ -1352,8 +1355,10 @@ xfs_reflink_remap_range(
 	struct xfs_mount	*mp = src->i_mount;
 	xfs_fileoff_t		sfsbno, dfsbno;
 	xfs_filblks_t		fsblen;
+	s64			remapped;
 	xfs_extlen_t		cowextsize;
-	ssize_t			ret;
+	int			ret;
+	bool			is_dedupe = (flags & CLONERANGE_DEDUPE);
 
 	if (!xfs_sb_version_hasreflink(&mp->m_sb))
 		return -EOPNOTSUPP;
@@ -1363,19 +1368,25 @@ xfs_reflink_remap_range(
 
 	/* Prepare and then clone file data. */
 	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
-			len, is_dedupe);
+			&len, flags);
 	if (ret)
 		return ret;
 
 	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
 
+	if (len == 0)
+		goto done;
+
 	dfsbno = XFS_B_TO_FSBT(mp, pos_out);
 	sfsbno = XFS_B_TO_FSBT(mp, pos_in);
 	fsblen = XFS_B_TO_FSB(mp, len);
-	ret = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
+	remapped = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
 			pos_out + len);
-	if (ret)
+	if (remapped < 0) {
+		ret = remapped;
 		goto out_unlock;
+	}
+	remapped = min_t(int64_t, len, XFS_FSB_TO_B(mp, remapped));
 
 	/*
 	 * Carry the cowextsize hint from src to dest if we're sharing the
@@ -1391,11 +1402,14 @@ xfs_reflink_remap_range(
 
 	ret = xfs_reflink_update_dest(dest, pos_out + len, cowextsize,
 			is_dedupe);
-
+	if (ret)
+		goto out_unlock;
+done:
+	xfs_reflink_remap_unlock(file_in, file_out);
+	return remapped;
 out_unlock:
 	xfs_reflink_remap_unlock(file_in, file_out);
-	if (ret)
-		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
+	trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index c585ad9552b2..b53470904373 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -27,8 +27,9 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
-extern int xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
-		struct file *file_out, loff_t pos_out, u64 len, bool is_dedupe);
+extern s64 xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
+		struct file *file_out, loff_t pos_out, u64 len,
+		unsigned int flags);
 extern int xfs_reflink_inode_has_shared_extents(struct xfs_trans *tp,
 		struct xfs_inode *ip, bool *has_shared);
 extern int xfs_reflink_clear_inode_flag(struct xfs_inode *ip,


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 15/15] xfs: support returning partial reflink results
@ 2018-10-05  0:46   ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  0:46 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

From: Darrick J. Wong <darrick.wong@oracle.com>

Back when the XFS reflink code only supported clone_file_range, we were
only able to return zero or negative error codes to userspace.  However,
now that copy_file_range (which returns bytes copied) can use XFS'
clone_file_range, we have the opportunity to return partial results.
For example, if userspace sends a 1GB clone request and we run out of
space halfway through, we at least can tell userspace that we completed
512M of that request like a regular write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_file.c    |   14 ++++----------
 fs/xfs/xfs_reflink.c |   46 ++++++++++++++++++++++++++++++----------------
 fs/xfs/xfs_reflink.h |    5 +++--
 3 files changed, 37 insertions(+), 28 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d5d6681ca714..4f7aff6d3360 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -928,11 +928,8 @@ xfs_file_clone_range(
 	u64		len,
 	unsigned int	flags)
 {
-	int		ret;
-
-	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-			len, false);
-	return ret < 0 ? ret : len;
+	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, flags);
 }
 
 STATIC s64
@@ -943,11 +940,8 @@ xfs_file_dedupe_range(
 	loff_t		pos_out,
 	u64		len)
 {
-	int		ret;
-
-	ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-			len, true);
-	return ret < 0 ? ret : len;
+	return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+			len, CLONERANGE_DEDUPE | CLONERANGE_SHORT);
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 40684dd011ee..784cebbd0157 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1090,7 +1090,7 @@ xfs_reflink_remap_extent(
 /*
  * Iteratively remap one file's extents (and holes) to another's.
  */
-STATIC int
+STATIC int64_t
 xfs_reflink_remap_blocks(
 	struct xfs_inode	*src,
 	xfs_fileoff_t		srcoff,
@@ -1100,6 +1100,7 @@ xfs_reflink_remap_blocks(
 	xfs_off_t		new_isize)
 {
 	struct xfs_bmbt_irec	imap;
+	int64_t			remapped = 0;
 	int			nimaps;
 	int			error = 0;
 	xfs_filblks_t		range_len;
@@ -1142,13 +1143,14 @@ xfs_reflink_remap_blocks(
 		srcoff += range_len;
 		destoff += range_len;
 		len -= range_len;
+		remapped += range_len;
 	}
 
-	return 0;
+	return remapped;
 
 err:
 	trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
-	return error;
+	return remapped > 0 ? remapped : error;
 }
 
 /*
@@ -1247,14 +1249,15 @@ xfs_reflink_remap_prep(
 	loff_t			pos_in,
 	struct file		*file_out,
 	loff_t			pos_out,
-	u64			len,
-	bool			is_dedupe)
+	u64			*len,
+	unsigned int		flags)
 {
 	struct inode		*inode_in = file_inode(file_in);
 	struct xfs_inode	*src = XFS_I(inode_in);
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	bool			same_inode = (inode_in == inode_out);
+	bool			is_dedupe = (flags & CLONERANGE_DEDUPE);
 	ssize_t			ret;
 
 	/* Lock both files against IO */
@@ -1278,7 +1281,7 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
-			&len, is_dedupe ? CLONERANGE_DEDUPE : 0);
+			len, flags);
 	if (ret <= 0)
 		goto out_unlock;
 
@@ -1302,7 +1305,7 @@ xfs_reflink_remap_prep(
 
 	/* Zap any page cache for the destination file's range. */
 	truncate_inode_pages_range(&inode_out->i_data, pos_out,
-				   PAGE_ALIGN(pos_out + len) - 1);
+				   PAGE_ALIGN(pos_out + *len) - 1);
 
 	/* If we're altering the file contents... */
 	if (!is_dedupe) {
@@ -1336,14 +1339,14 @@ xfs_reflink_remap_prep(
 /*
  * Link a range of blocks from one file to another.
  */
-int
+s64
 xfs_reflink_remap_range(
 	struct file		*file_in,
 	loff_t			pos_in,
 	struct file		*file_out,
 	loff_t			pos_out,
 	u64			len,
-	bool			is_dedupe)
+	unsigned int		flags)
 {
 	struct inode		*inode_in = file_inode(file_in);
 	struct xfs_inode	*src = XFS_I(inode_in);
@@ -1352,8 +1355,10 @@ xfs_reflink_remap_range(
 	struct xfs_mount	*mp = src->i_mount;
 	xfs_fileoff_t		sfsbno, dfsbno;
 	xfs_filblks_t		fsblen;
+	s64			remapped;
 	xfs_extlen_t		cowextsize;
-	ssize_t			ret;
+	int			ret;
+	bool			is_dedupe = (flags & CLONERANGE_DEDUPE);
 
 	if (!xfs_sb_version_hasreflink(&mp->m_sb))
 		return -EOPNOTSUPP;
@@ -1363,19 +1368,25 @@ xfs_reflink_remap_range(
 
 	/* Prepare and then clone file data. */
 	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
-			len, is_dedupe);
+			&len, flags);
 	if (ret)
 		return ret;
 
 	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
 
+	if (len == 0)
+		goto done;
+
 	dfsbno = XFS_B_TO_FSBT(mp, pos_out);
 	sfsbno = XFS_B_TO_FSBT(mp, pos_in);
 	fsblen = XFS_B_TO_FSB(mp, len);
-	ret = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
+	remapped = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
 			pos_out + len);
-	if (ret)
+	if (remapped < 0) {
+		ret = remapped;
 		goto out_unlock;
+	}
+	remapped = min_t(int64_t, len, XFS_FSB_TO_B(mp, remapped));
 
 	/*
 	 * Carry the cowextsize hint from src to dest if we're sharing the
@@ -1391,11 +1402,14 @@ xfs_reflink_remap_range(
 
 	ret = xfs_reflink_update_dest(dest, pos_out + len, cowextsize,
 			is_dedupe);
-
+	if (ret)
+		goto out_unlock;
+done:
+	xfs_reflink_remap_unlock(file_in, file_out);
+	return remapped;
 out_unlock:
 	xfs_reflink_remap_unlock(file_in, file_out);
-	if (ret)
-		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
+	trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index c585ad9552b2..b53470904373 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -27,8 +27,9 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
-extern int xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
-		struct file *file_out, loff_t pos_out, u64 len, bool is_dedupe);
+extern s64 xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
+		struct file *file_out, loff_t pos_out, u64 len,
+		unsigned int flags);
 extern int xfs_reflink_inode_has_shared_extents(struct xfs_trans *tp,
 		struct xfs_inode *ip, bool *has_shared);
 extern int xfs_reflink_clear_inode_flag(struct xfs_inode *ip,

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 00/15] fs: fixes for serious clone/dedupe problems
  2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  1:17   ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  1:17 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:44:34PM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
> reflink implementation, and tracked it down to reflink forgetting to do
> some of the file-extending activities that must happen for regular
> writes.
> 
> We then started auditing the clone, dedupe, and copyfile code and
> realized that from a file contents perspective, clonerange isn't any
> different from a regular file write.  Unfortunately, we also noticed
> that *unlike* a regular write, clonerange skips a ton of overflow
> checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
> and RLIMIT_FSIZE.  We also observed that cloning into a file did not
> strip security privileges (suid, capabilities) like a regular write
> would.  I also noticed that xfs and ocfs2 need to dump the page cache
> before remapping blocks, not after.
> 
> In fixing the range checking problems I also realized that both dedupe
> and copyfile tell userspace how much of the requested operation was
> acted upon.  Since the range validation can shorten a clone request (or
> we can ENOSPC midway through), we might as well plumb the short
> operation reporting back through the VFS indirection code to userspace.
> 
> So, here's the whole giant pile of patches[1] that fix all the problems.
> The patch "generic: test reflink side effects" recently sent to fstests
> exercises the fixes in this series.  Tests are in [2].

Hmmm. I've got a couple of patches to fix dedupe/reflink partial EOF
block data corruptions, too. I'll have to see how they fit into this
new series - combined they add this code just after the call to
vfs_clone_file_prep_inodes():

....
+       u64                     blkmask = i_blocksize(inode_in) - 1;
....
+       /*
+        * If the dedupe data matches, chop off the partial EOF block
+        * from the source file so we don't try to dedupe the partial
+        * EOF block.
+        */
+       if (is_dedupe) {
+               len &= ~blkmask;
+       } else if (len & blkmask) {
+               /*
+                * The user is attempting to share a partial EOF block,
+                * if it's inside the destination EOF then reject it
+                */
+               if (pos_out + len < i_size_read(inode_out)) {
+                       ret = -EINVAL;
+                       goto out_unlock;
+               }
+       }

It might be better to put these in with the eof-zeroing patch then
add all the other changes on top? Let me post them separately,
as they may be candidates for 4.19-rc7 along with the eof zeroing.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 00/15] fs: fixes for serious clone/dedupe problems
@ 2018-10-05  1:17   ` Dave Chinner
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  1:17 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:44:34PM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
> reflink implementation, and tracked it down to reflink forgetting to do
> some of the file-extending activities that must happen for regular
> writes.
> 
> We then started auditing the clone, dedupe, and copyfile code and
> realized that from a file contents perspective, clonerange isn't any
> different from a regular file write.  Unfortunately, we also noticed
> that *unlike* a regular write, clonerange skips a ton of overflow
> checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
> and RLIMIT_FSIZE.  We also observed that cloning into a file did not
> strip security privileges (suid, capabilities) like a regular write
> would.  I also noticed that xfs and ocfs2 need to dump the page cache
> before remapping blocks, not after.
> 
> In fixing the range checking problems I also realized that both dedupe
> and copyfile tell userspace how much of the requested operation was
> acted upon.  Since the range validation can shorten a clone request (or
> we can ENOSPC midway through), we might as well plumb the short
> operation reporting back through the VFS indirection code to userspace.
> 
> So, here's the whole giant pile of patches[1] that fix all the problems.
> The patch "generic: test reflink side effects" recently sent to fstests
> exercises the fixes in this series.  Tests are in [2].

Hmmm. I've got a couple of patches to fix dedupe/reflink partial EOF
block data corruptions, too. I'll have to see how they fit into this
new series - combined they add this code just after the call to
vfs_clone_file_prep_inodes():

....
+       u64                     blkmask = i_blocksize(inode_in) - 1;
....
+       /*
+        * If the dedupe data matches, chop off the partial EOF block
+        * from the source file so we don't try to dedupe the partial
+        * EOF block.
+        */
+       if (is_dedupe) {
+               len &= ~blkmask;
+       } else if (len & blkmask) {
+               /*
+                * The user is attempting to share a partial EOF block,
+                * if it's inside the destination EOF then reject it
+                */
+               if (pos_out + len < i_size_read(inode_out)) {
+                       ret = -EINVAL;
+                       goto out_unlock;
+               }
+       }

It might be better to put these in with the eof-zeroing patch then
add all the other changes on top? Let me post them separately,
as they may be candidates for 4.19-rc7 along with the eof zeroing.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 00/15] fs: fixes for serious clone/dedupe problems
  2018-10-05  1:17   ` [Ocfs2-devel] " Dave Chinner
@ 2018-10-05  1:24     ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  1:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 11:17:18AM +1000, Dave Chinner wrote:
> On Thu, Oct 04, 2018 at 05:44:34PM -0700, Darrick J. Wong wrote:
> > Hi all,
> > 
> > Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
> > reflink implementation, and tracked it down to reflink forgetting to do
> > some of the file-extending activities that must happen for regular
> > writes.
> > 
> > We then started auditing the clone, dedupe, and copyfile code and
> > realized that from a file contents perspective, clonerange isn't any
> > different from a regular file write.  Unfortunately, we also noticed
> > that *unlike* a regular write, clonerange skips a ton of overflow
> > checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
> > and RLIMIT_FSIZE.  We also observed that cloning into a file did not
> > strip security privileges (suid, capabilities) like a regular write
> > would.  I also noticed that xfs and ocfs2 need to dump the page cache
> > before remapping blocks, not after.
> > 
> > In fixing the range checking problems I also realized that both dedupe
> > and copyfile tell userspace how much of the requested operation was
> > acted upon.  Since the range validation can shorten a clone request (or
> > we can ENOSPC midway through), we might as well plumb the short
> > operation reporting back through the VFS indirection code to userspace.
> > 
> > So, here's the whole giant pile of patches[1] that fix all the problems.
> > The patch "generic: test reflink side effects" recently sent to fstests
> > exercises the fixes in this series.  Tests are in [2].
> 
> Hmmm. I've got a couple of patches to fix dedupe/reflink partial EOF
> block data corruptions, too. I'll have to see how they fit into this
> new series - combined they add this code just after the call to
> vfs_clone_file_prep_inodes():
> 
> ....
> +       u64                     blkmask = i_blocksize(inode_in) - 1;
> ....
> +       /*
> +        * If the dedupe data matches, chop off the partial EOF block
> +        * from the source file so we don't try to dedupe the partial
> +        * EOF block.
> +        */
> +       if (is_dedupe) {
> +               len &= ~blkmask;
> +       } else if (len & blkmask) {
> +               /*
> +                * The user is attempting to share a partial EOF block,
> +                * if it's inside the destination EOF then reject it
> +                */
> +               if (pos_out + len < i_size_read(inode_out)) {
> +                       ret = -EINVAL;
> +                       goto out_unlock;
> +               }
> +       }
> 
> It might be better to put these in with the eof-zeroing patch then
> add all the other changes on top? Let me post them separately,
> as they may be candidates for 4.19-rc7 along with the eof zeroing.

Yeah, maybe we want to push the first two for 4.19 and leave the rest
for 4.20/5.0.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 00/15] fs: fixes for serious clone/dedupe problems
@ 2018-10-05  1:24     ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05  1:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 11:17:18AM +1000, Dave Chinner wrote:
> On Thu, Oct 04, 2018 at 05:44:34PM -0700, Darrick J. Wong wrote:
> > Hi all,
> > 
> > Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
> > reflink implementation, and tracked it down to reflink forgetting to do
> > some of the file-extending activities that must happen for regular
> > writes.
> > 
> > We then started auditing the clone, dedupe, and copyfile code and
> > realized that from a file contents perspective, clonerange isn't any
> > different from a regular file write.  Unfortunately, we also noticed
> > that *unlike* a regular write, clonerange skips a ton of overflow
> > checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
> > and RLIMIT_FSIZE.  We also observed that cloning into a file did not
> > strip security privileges (suid, capabilities) like a regular write
> > would.  I also noticed that xfs and ocfs2 need to dump the page cache
> > before remapping blocks, not after.
> > 
> > In fixing the range checking problems I also realized that both dedupe
> > and copyfile tell userspace how much of the requested operation was
> > acted upon.  Since the range validation can shorten a clone request (or
> > we can ENOSPC midway through), we might as well plumb the short
> > operation reporting back through the VFS indirection code to userspace.
> > 
> > So, here's the whole giant pile of patches[1] that fix all the problems.
> > The patch "generic: test reflink side effects" recently sent to fstests
> > exercises the fixes in this series.  Tests are in [2].
> 
> Hmmm. I've got a couple of patches to fix dedupe/reflink partial EOF
> block data corruptions, too. I'll have to see how they fit into this
> new series - combined they add this code just after the call to
> vfs_clone_file_prep_inodes():
> 
> ....
> +       u64                     blkmask = i_blocksize(inode_in) - 1;
> ....
> +       /*
> +        * If the dedupe data matches, chop off the partial EOF block
> +        * from the source file so we don't try to dedupe the partial
> +        * EOF block.
> +        */
> +       if (is_dedupe) {
> +               len &= ~blkmask;
> +       } else if (len & blkmask) {
> +               /*
> +                * The user is attempting to share a partial EOF block,
> +                * if it's inside the destination EOF then reject it
> +                */
> +               if (pos_out + len < i_size_read(inode_out)) {
> +                       ret = -EINVAL;
> +                       goto out_unlock;
> +               }
> +       }
> 
> It might be better to put these in with the eof-zeroing patch then
> add all the other changes on top? Let me post them separately,
> as they may be candidates for 4.19-rc7 along with the eof zeroing.

Yeah, maybe we want to push the first two for 4.19 and leave the rest
for 4.20/5.0.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david at fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
  2018-10-05  0:44   ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  5:28     ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  5:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor all the reflink preparation steps into a separate helper that
> we'll use to land all the upcoming fixes for insufficient input checks.

If I've read the patch right, this also changes the location of the
page cache truncation, right?  i.e. it now happens in the
xfs_reflink_remap_prep() function instead of after the remap?  I
think the commit message needs to mention that because it's a fix to
incorrect behaviour....

I've added:

--
This rework also moves the invalidation of the destination range to
the prep function so that it is done before the range is remapped.
This ensures that nobody can access the data in range being remapped
until the remap is complete.
--

Sound OK?

Otherwise this looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-Dave.

> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c |   96 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 71 insertions(+), 25 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 38f405415b88..80ca9b6793cd 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1195,11 +1195,33 @@ xfs_iolock_two_inodes_and_break_layout(
>  	return 0;
>  }
>  
> +/* Unlock both inodes after they've been prepped for a range clone. */
> +STATIC void
> +xfs_reflink_remap_unlock(
> +	struct file		*file_in,
> +	struct file		*file_out)
> +{
> +	struct inode		*inode_in = file_inode(file_in);
> +	struct xfs_inode	*src = XFS_I(inode_in);
> +	struct inode		*inode_out = file_inode(file_out);
> +	struct xfs_inode	*dest = XFS_I(inode_out);
> +	bool			same_inode = (inode_in == inode_out);
> +
> +	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> +	if (!same_inode)
> +		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> +	inode_unlock(inode_out);
> +	if (!same_inode)
> +		inode_unlock_shared(inode_in);
> +}
> +
>  /*
> - * Link a range of blocks from one file to another.
> + * Prepare two files for range cloning.  Upon a successful return both inodes
> + * will have the iolock and mmaplock held, the page cache of the out file
> + * will be truncated, and any leases on the out file will have been broken.
>   */
> -int
> -xfs_reflink_remap_range(
> +STATIC int
> +xfs_reflink_remap_prep(
>  	struct file		*file_in,
>  	loff_t			pos_in,
>  	struct file		*file_out,
> @@ -1211,19 +1233,9 @@ xfs_reflink_remap_range(
>  	struct xfs_inode	*src = XFS_I(inode_in);
>  	struct inode		*inode_out = file_inode(file_out);
>  	struct xfs_inode	*dest = XFS_I(inode_out);
> -	struct xfs_mount	*mp = src->i_mount;
>  	bool			same_inode = (inode_in == inode_out);
> -	xfs_fileoff_t		sfsbno, dfsbno;
> -	xfs_filblks_t		fsblen;
> -	xfs_extlen_t		cowextsize;
>  	ssize_t			ret;
>  
> -	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> -		return -EOPNOTSUPP;
> -
> -	if (XFS_FORCED_SHUTDOWN(mp))
> -		return -EIO;
> -
>  	/* Lock both files against IO */
>  	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
>  	if (ret)
> @@ -1254,8 +1266,6 @@ xfs_reflink_remap_range(
>  	if (ret)
>  		goto out_unlock;
>  
> -	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> -
>  	/*
>  	 * Clear out post-eof preallocations because we don't have page cache
>  	 * backing the delayed allocations and they'll never get freed on
> @@ -1272,6 +1282,51 @@ xfs_reflink_remap_range(
>  	if (ret)
>  		goto out_unlock;
>  
> +	/* Zap any page cache for the destination file's range. */
> +	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> +				   PAGE_ALIGN(pos_out + len) - 1);
> +	return 0;
> +out_unlock:
> +	xfs_reflink_remap_unlock(file_in, file_out);
> +	return ret;
> +}
> +
> +/*
> + * Link a range of blocks from one file to another.
> + */
> +int
> +xfs_reflink_remap_range(
> +	struct file		*file_in,
> +	loff_t			pos_in,
> +	struct file		*file_out,
> +	loff_t			pos_out,
> +	u64			len,
> +	bool			is_dedupe)
> +{
> +	struct inode		*inode_in = file_inode(file_in);
> +	struct xfs_inode	*src = XFS_I(inode_in);
> +	struct inode		*inode_out = file_inode(file_out);
> +	struct xfs_inode	*dest = XFS_I(inode_out);
> +	struct xfs_mount	*mp = src->i_mount;
> +	xfs_fileoff_t		sfsbno, dfsbno;
> +	xfs_filblks_t		fsblen;
> +	xfs_extlen_t		cowextsize;
> +	ssize_t			ret;
> +
> +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return -EIO;
> +
> +	/* Prepare and then clone file data. */
> +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> +			len, is_dedupe);
> +	if (ret)
> +		return ret;
> +
> +	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> +
>  	dfsbno = XFS_B_TO_FSBT(mp, pos_out);
>  	sfsbno = XFS_B_TO_FSBT(mp, pos_in);
>  	fsblen = XFS_B_TO_FSB(mp, len);
> @@ -1280,10 +1335,6 @@ xfs_reflink_remap_range(
>  	if (ret)
>  		goto out_unlock;
>  
> -	/* Zap any page cache for the destination file's range. */
> -	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> -				   PAGE_ALIGN(pos_out + len) - 1);
> -
>  	/*
>  	 * Carry the cowextsize hint from src to dest if we're sharing the
>  	 * entire source file to the entire destination file, the source file
> @@ -1300,12 +1351,7 @@ xfs_reflink_remap_range(
>  			is_dedupe);
>  
>  out_unlock:
> -	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> -	if (!same_inode)
> -		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> -	inode_unlock(inode_out);
> -	if (!same_inode)
> -		inode_unlock_shared(inode_in);
> +	xfs_reflink_remap_unlock(file_in, file_out);
>  	if (ret)
>  		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
>  	return ret;
> 
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
@ 2018-10-05  5:28     ` Dave Chinner
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  5:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor all the reflink preparation steps into a separate helper that
> we'll use to land all the upcoming fixes for insufficient input checks.

If I've read the patch right, this also changes the location of the
page cache truncation, right?  i.e. it now happens in the
xfs_reflink_remap_prep() function instead of after the remap?  I
think the commit message needs to mention that because it's a fix to
incorrect behaviour....

I've added:

--
This rework also moves the invalidation of the destination range to
the prep function so that it is done before the range is remapped.
This ensures that nobody can access the data in range being remapped
until the remap is complete.
--

Sound OK?

Otherwise this looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-Dave.

> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c |   96 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 71 insertions(+), 25 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 38f405415b88..80ca9b6793cd 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1195,11 +1195,33 @@ xfs_iolock_two_inodes_and_break_layout(
>  	return 0;
>  }
>  
> +/* Unlock both inodes after they've been prepped for a range clone. */
> +STATIC void
> +xfs_reflink_remap_unlock(
> +	struct file		*file_in,
> +	struct file		*file_out)
> +{
> +	struct inode		*inode_in = file_inode(file_in);
> +	struct xfs_inode	*src = XFS_I(inode_in);
> +	struct inode		*inode_out = file_inode(file_out);
> +	struct xfs_inode	*dest = XFS_I(inode_out);
> +	bool			same_inode = (inode_in == inode_out);
> +
> +	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> +	if (!same_inode)
> +		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> +	inode_unlock(inode_out);
> +	if (!same_inode)
> +		inode_unlock_shared(inode_in);
> +}
> +
>  /*
> - * Link a range of blocks from one file to another.
> + * Prepare two files for range cloning.  Upon a successful return both inodes
> + * will have the iolock and mmaplock held, the page cache of the out file
> + * will be truncated, and any leases on the out file will have been broken.
>   */
> -int
> -xfs_reflink_remap_range(
> +STATIC int
> +xfs_reflink_remap_prep(
>  	struct file		*file_in,
>  	loff_t			pos_in,
>  	struct file		*file_out,
> @@ -1211,19 +1233,9 @@ xfs_reflink_remap_range(
>  	struct xfs_inode	*src = XFS_I(inode_in);
>  	struct inode		*inode_out = file_inode(file_out);
>  	struct xfs_inode	*dest = XFS_I(inode_out);
> -	struct xfs_mount	*mp = src->i_mount;
>  	bool			same_inode = (inode_in == inode_out);
> -	xfs_fileoff_t		sfsbno, dfsbno;
> -	xfs_filblks_t		fsblen;
> -	xfs_extlen_t		cowextsize;
>  	ssize_t			ret;
>  
> -	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> -		return -EOPNOTSUPP;
> -
> -	if (XFS_FORCED_SHUTDOWN(mp))
> -		return -EIO;
> -
>  	/* Lock both files against IO */
>  	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
>  	if (ret)
> @@ -1254,8 +1266,6 @@ xfs_reflink_remap_range(
>  	if (ret)
>  		goto out_unlock;
>  
> -	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> -
>  	/*
>  	 * Clear out post-eof preallocations because we don't have page cache
>  	 * backing the delayed allocations and they'll never get freed on
> @@ -1272,6 +1282,51 @@ xfs_reflink_remap_range(
>  	if (ret)
>  		goto out_unlock;
>  
> +	/* Zap any page cache for the destination file's range. */
> +	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> +				   PAGE_ALIGN(pos_out + len) - 1);
> +	return 0;
> +out_unlock:
> +	xfs_reflink_remap_unlock(file_in, file_out);
> +	return ret;
> +}
> +
> +/*
> + * Link a range of blocks from one file to another.
> + */
> +int
> +xfs_reflink_remap_range(
> +	struct file		*file_in,
> +	loff_t			pos_in,
> +	struct file		*file_out,
> +	loff_t			pos_out,
> +	u64			len,
> +	bool			is_dedupe)
> +{
> +	struct inode		*inode_in = file_inode(file_in);
> +	struct xfs_inode	*src = XFS_I(inode_in);
> +	struct inode		*inode_out = file_inode(file_out);
> +	struct xfs_inode	*dest = XFS_I(inode_out);
> +	struct xfs_mount	*mp = src->i_mount;
> +	xfs_fileoff_t		sfsbno, dfsbno;
> +	xfs_filblks_t		fsblen;
> +	xfs_extlen_t		cowextsize;
> +	ssize_t			ret;
> +
> +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return -EIO;
> +
> +	/* Prepare and then clone file data. */
> +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> +			len, is_dedupe);
> +	if (ret)
> +		return ret;
> +
> +	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> +
>  	dfsbno = XFS_B_TO_FSBT(mp, pos_out);
>  	sfsbno = XFS_B_TO_FSBT(mp, pos_in);
>  	fsblen = XFS_B_TO_FSB(mp, len);
> @@ -1280,10 +1335,6 @@ xfs_reflink_remap_range(
>  	if (ret)
>  		goto out_unlock;
>  
> -	/* Zap any page cache for the destination file's range. */
> -	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> -				   PAGE_ALIGN(pos_out + len) - 1);
> -
>  	/*
>  	 * Carry the cowextsize hint from src to dest if we're sharing the
>  	 * entire source file to the entire destination file, the source file
> @@ -1300,12 +1351,7 @@ xfs_reflink_remap_range(
>  			is_dedupe);
>  
>  out_unlock:
> -	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> -	if (!same_inode)
> -		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> -	inode_unlock(inode_out);
> -	if (!same_inode)
> -		inode_unlock_shared(inode_in);
> +	xfs_reflink_remap_unlock(file_in, file_out);
>  	if (ret)
>  		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
>  	return ret;
> 
> 

-- 
Dave Chinner
david at fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 03/15] xfs: zero posteof blocks when cloning above eof
  2018-10-05  0:44   ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  5:28     ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  5:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: sandeen, Zorro Lang, linux-xfs, ocfs2-devel, linux-fsdevel, linux-btrfs

On Thu, Oct 04, 2018 at 05:44:54PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we're reflinking between two files and the destination file range
> is well beyond the destination file's EOF marker, zero any posteof
> speculative preallocations in the destination file so that we don't
> expose stale disk contents.  The previous strategy of trying to clear
> the preallocations does not work if the destination file has the
> PREALLOC flag set.
> 
> Uncovered by shared/010.
> 
> Reported-by: Zorro Lang <zlang@redhat.com>
> Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c |   33 +++++++++++++++++++++++++--------
>  1 file changed, 25 insertions(+), 8 deletions(-)

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 03/15] xfs: zero posteof blocks when cloning above eof
@ 2018-10-05  5:28     ` Dave Chinner
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  5:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: sandeen, Zorro Lang, linux-xfs, ocfs2-devel, linux-fsdevel, linux-btrfs

On Thu, Oct 04, 2018 at 05:44:54PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we're reflinking between two files and the destination file range
> is well beyond the destination file's EOF marker, zero any posteof
> speculative preallocations in the destination file so that we don't
> expose stale disk contents.  The previous strategy of trying to clear
> the preallocations does not work if the destination file has the
> PREALLOC flag set.
> 
> Uncovered by shared/010.
> 
> Reported-by: Zorro Lang <zlang@redhat.com>
> Bugzilla-id: https://urldefense.proofpoint.com/v2/url?u=https-3A__bugzilla.kernel.org_show-5Fbug.cgi-3Fid-3D201259&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=C7gAd4uDxlAvTdc0vmU6X8CMk6L2iDY8-HD0qT6Fo7Y&m=Jt3V3HpXf-NaJTVVKEWK881PgcOTureFbFgSqu7fhDQ&s=S0cjIrf21ndOPm5VlFKcHb9vya2G2W-ZcpF0qJjBHM4&e=
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c |   33 +++++++++++++++++++++++++--------
>  1 file changed, 25 insertions(+), 8 deletions(-)

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david at fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 04/15] xfs: update ctime and remove suid before cloning files
  2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  5:30     ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  5:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:45:00PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Before cloning into a file, update the ctime and remove sensitive
> attributes like suid, just like we'd do for a regular file write.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c |   25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

Because this fixes a security related problem, I'm going to push
this with the data corruption fixes.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 04/15] xfs: update ctime and remove suid before cloning files
@ 2018-10-05  5:30     ` Dave Chinner
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  5:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:45:00PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Before cloning into a file, update the ctime and remove sensitive
> attributes like suid, just like we'd do for a regular file write.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c |   25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

Because this fixes a security related problem, I'm going to push
this with the data corruption fixes.

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 06/15] vfs: strengthen checking of file range inputs to clone/dedupe range
  2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
  (?)
@ 2018-10-05  6:10   ` Amir Goldstein
  2018-10-05 17:36       ` [Ocfs2-devel] " Darrick J. Wong
  -1 siblings, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2018-10-05  6:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen

On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> Clone range is an optimization on a regular file write.  File writes
> that extend the file length are subject to various constraints which are
> not checked by clonerange.  This is a correctness problem, because we're
> never allowed to touch ranges that the page cache can't support
> (s_maxbytes); we're not supposed to deal with large offsets
> (MAX_NON_LFS) if O_LARGEFILE isn't set; and we must obey resource limits
> (RLIMIT_FSIZE).
>
> Therefore, add these checks to the new generic_clone_checks function so
> that we curtail unexpected behavior.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  mm/filemap.c |   31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 68ec91d05c7b..f74391721234 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3015,6 +3015,37 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in,
>                 return -EINVAL;
>         count = min(count, size_in - (uint64_t)pos_in);
>
> +       /* Don't exceed RLMIT_FSIZE in the file we're writing into. */
> +       if (limit != RLIM_INFINITY) {
> +               if (pos_out >= limit) {
> +                       send_sig(SIGXFSZ, current, 0);
> +                       return -EFBIG;
> +               }
> +               count = min(count, limit - (uint64_t)pos_out);
> +       }
> +
> +       /* Don't exceed the LFS limits. */
> +       if (unlikely(pos_out + count > MAX_NON_LFS &&
> +                               !(file_out->f_flags & O_LARGEFILE))) {
> +               if (pos_out >= MAX_NON_LFS)
> +                       return -EFBIG;
> +               count = min(count, MAX_NON_LFS - (uint64_t)pos_out);
> +       }
> +       if (unlikely(pos_in + count > MAX_NON_LFS &&
> +                               !(file_in->f_flags & O_LARGEFILE))) {
> +               if (pos_in >= MAX_NON_LFS)
> +                       return -EFBIG;
> +               count = min(count, MAX_NON_LFS - (uint64_t)pos_in);
> +       }
> +
> +       /* Don't operate on ranges the page cache doesn't support. */
> +       if (unlikely(pos_out >= inode_out->i_sb->s_maxbytes ||
> +                    pos_in >= inode_in->i_sb->s_maxbytes))
> +               return -EFBIG;
> +

Forget my standards, this doesn't abide by your own standards ;-)
Please factor out generic_write_checks() and use it instead of
duplicating the code. The in/out variant doesn't justify not calling
the helper twice IMO.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 12/15] vfs: implement opportunistic short dedupe
  2018-10-05  0:46   ` [Ocfs2-devel] " Darrick J. Wong
  (?)
@ 2018-10-05  6:40   ` Amir Goldstein
  2018-10-05 17:42       ` [Ocfs2-devel] " Darrick J. Wong
  -1 siblings, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2018-10-05  6:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen

On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> For a given dedupe request, the bytes_deduped field in the control
> structure tells userspace if we managed to deduplicate some, but not all
> of, the requested regions starting from the file offsets supplied.
> However, due to sloppy coding, the current dedupe code returns
> FILE_DEDUPE_RANGE_DIFFERS if any part of the range is different.
> Fix this so that we can actually support partial request completion.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/read_write.c    |   44 +++++++++++++++++++++++++++++++++++---------
>  include/linux/fs.h |    2 +-
>  2 files changed, 36 insertions(+), 10 deletions(-)
>
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 292d68c2f47c..9be9f261edd2 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1781,13 +1781,11 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
>          * Check that the extents are the same.
>          */
>         if (is_dedupe) {
> -               bool            is_same = false;
> -
>                 ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> -                               inode_out, pos_out, *len, &is_same);
> +                               inode_out, pos_out, len);
>                 if (ret)
>                         return ret;
> -               if (!is_same)
> +               if (*len == 0)
>                         return -EBADE;
>         }
>
> @@ -1872,13 +1870,30 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
>         return page;
>  }
>
> +static unsigned int vfs_dedupe_memcmp(const char *s1, const char *s2,
> +                                     unsigned int cmp_len)
> +{
> +       const char *orig_s1 = s1;
> +       const char *e1 = s1 + cmp_len;
> +       const char *e2 = s2 + cmp_len;
> +
> +       while (s1 < e1 && s2 < e2) {
> +               if (*s1 != *s2)
> +                       break;
> +               s1++;
> +               s2++;
> +       }
> +
> +       return s1 - orig_s1;
> +}
> +

A few nits:
'len' wouldn't have been ambiguous in this context.
I find the for loop in memcmp more elegant. It is definitely shorter.
Not sure how differently the variants compile, but decrementing
count/len seems much more sane then checking 2 conditions that
always have the same result.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 10/15] vfs: make cloning to source file eof more explicit
  2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
  (?)
@ 2018-10-05  6:47   ` Amir Goldstein
  -1 siblings, 0 replies; 82+ messages in thread
From: Amir Goldstein @ 2018-10-05  6:47 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen

On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> Use the new CLONE_RANGE_EOF flag to explicitly declare that the caller
> wants the clone implementation to set *len appropriately once the files
> are locked.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/read_write.c |   14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 7cfff497263b..4eaea52f70a8 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1747,11 +1747,12 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
>
>         isize = i_size_read(inode_in);
>
> -       /* Zero length dedupe exits immediately; reflink goes to EOF. */
> -       if (*len == 0) {
> -               if (is_dedupe || pos_in == isize)
> -                       return 0;
> -               if (pos_in > isize)
> +       /*
> +        * If the caller asked to go all the way to the end of the source file,
> +        * set *len now that we have the file locked.
> +        */
> +       if ((flags & CLONERANGE_EOF) && *len == 0) {
> +               if (pos_in >= isize)
>                         return -EINVAL;

Slight change of behavior with clone of 0 bytes at EOF.
Is that intentional?
Does not change behavior through the entire series?
Please drop a work about this in commit message.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
  2018-10-05  0:44   ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05  7:02     ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  7:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor all the reflink preparation steps into a separate helper that
> we'll use to land all the upcoming fixes for insufficient input checks.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
.....

> +xfs_reflink_remap_range(
> +	struct file		*file_in,
> +	loff_t			pos_in,
> +	struct file		*file_out,
> +	loff_t			pos_out,
> +	u64			len,
> +	bool			is_dedupe)
> +{
> +	struct inode		*inode_in = file_inode(file_in);
> +	struct xfs_inode	*src = XFS_I(inode_in);
> +	struct inode		*inode_out = file_inode(file_out);
> +	struct xfs_inode	*dest = XFS_I(inode_out);
> +	struct xfs_mount	*mp = src->i_mount;
> +	xfs_fileoff_t		sfsbno, dfsbno;
> +	xfs_filblks_t		fsblen;
> +	xfs_extlen_t		cowextsize;
> +	ssize_t			ret;
> +
> +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return -EIO;
> +
> +	/* Prepare and then clone file data. */
> +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> +			len, is_dedupe);
> +	if (ret)
> +		return ret;

generic/013 indicates there's a double unlock bug here.	

vfs_clone_file_prep_inodes() can return zero (do nothing, but don't
fail!), and when that happens xfs_reflink_remap_prep() unlocks the
inodes and returns 0.  This new code doesn't catch it, we do the
remap on unlocked inodes, and then trip lock debugging bugs 

> @@ -1300,12 +1351,7 @@ xfs_reflink_remap_range(
>  			is_dedupe);
>  
>  out_unlock:
> -	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> -	if (!same_inode)
> -		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> -	inode_unlock(inode_out);
> -	if (!same_inode)
> -		inode_unlock_shared(inode_in);
> +	xfs_reflink_remap_unlock(file_in, file_out);

here:

 DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
 WARNING: CPU: 3 PID: 4766 at kernel/locking/rwsem.c:133 up_write+0x66/0x70
 CPU: 3 PID: 4766 Comm: fsstress Not tainted 4.19.0-rc6-dgc+ #671
....
 Call Trace:
  xfs_iunlock+0x152/0x220
  xfs_reflink_remap_unlock+0x22/0x70
  xfs_reflink_remap_range+0x129/0x2a0
  do_clone_file_range+0x119/0x200
  vfs_clone_file_range+0x35/0xa0
  ioctl_file_clone+0x8a/0xa0
  do_vfs_ioctl+0x2e1/0x6c0
  ksys_ioctl+0x70/0x80
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x5a/0x180
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

I'll fix it for the moment by making xfs_reflink_remap_prep() behave
like vfs_clone_file_prep_inodes() - it will return 1 on success,
0 for nothing to do and < 0 for an error and catch it in this code.

I note that later patches in the series change the
vfs_clone_file_prep_inodes() behaviour so this behaviour is probably
masked by those later changes. It's still a nasty bisect landmine,
though, so I'll fix it here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
@ 2018-10-05  7:02     ` Dave Chinner
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  7:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor all the reflink preparation steps into a separate helper that
> we'll use to land all the upcoming fixes for insufficient input checks.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
.....

> +xfs_reflink_remap_range(
> +	struct file		*file_in,
> +	loff_t			pos_in,
> +	struct file		*file_out,
> +	loff_t			pos_out,
> +	u64			len,
> +	bool			is_dedupe)
> +{
> +	struct inode		*inode_in = file_inode(file_in);
> +	struct xfs_inode	*src = XFS_I(inode_in);
> +	struct inode		*inode_out = file_inode(file_out);
> +	struct xfs_inode	*dest = XFS_I(inode_out);
> +	struct xfs_mount	*mp = src->i_mount;
> +	xfs_fileoff_t		sfsbno, dfsbno;
> +	xfs_filblks_t		fsblen;
> +	xfs_extlen_t		cowextsize;
> +	ssize_t			ret;
> +
> +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return -EIO;
> +
> +	/* Prepare and then clone file data. */
> +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> +			len, is_dedupe);
> +	if (ret)
> +		return ret;

generic/013 indicates there's a double unlock bug here.	

vfs_clone_file_prep_inodes() can return zero (do nothing, but don't
fail!), and when that happens xfs_reflink_remap_prep() unlocks the
inodes and returns 0.  This new code doesn't catch it, we do the
remap on unlocked inodes, and then trip lock debugging bugs 

> @@ -1300,12 +1351,7 @@ xfs_reflink_remap_range(
>  			is_dedupe);
>  
>  out_unlock:
> -	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> -	if (!same_inode)
> -		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> -	inode_unlock(inode_out);
> -	if (!same_inode)
> -		inode_unlock_shared(inode_in);
> +	xfs_reflink_remap_unlock(file_in, file_out);

here:

 DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
 WARNING: CPU: 3 PID: 4766 at kernel/locking/rwsem.c:133 up_write+0x66/0x70
 CPU: 3 PID: 4766 Comm: fsstress Not tainted 4.19.0-rc6-dgc+ #671
....
 Call Trace:
  xfs_iunlock+0x152/0x220
  xfs_reflink_remap_unlock+0x22/0x70
  xfs_reflink_remap_range+0x129/0x2a0
  do_clone_file_range+0x119/0x200
  vfs_clone_file_range+0x35/0xa0
  ioctl_file_clone+0x8a/0xa0
  do_vfs_ioctl+0x2e1/0x6c0
  ksys_ioctl+0x70/0x80
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x5a/0x180
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

I'll fix it for the moment by making xfs_reflink_remap_prep() behave
like vfs_clone_file_prep_inodes() - it will return 1 on success,
0 for nothing to do and < 0 for an error and catch it in this code.

I note that later patches in the series change the
vfs_clone_file_prep_inodes() behaviour so this behaviour is probably
masked by those later changes. It's still a nasty bisect landmine,
though, so I'll fix it here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 09/15] vfs: pass operation flags to {clone, dedupe}_file_range implementations
  2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
  (?)
@ 2018-10-05  7:07   ` Amir Goldstein
  2018-10-05 17:50       ` [Ocfs2-devel] " Darrick J. Wong
  -1 siblings, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2018-10-05  7:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen, Miklos Szeredi

On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> Pass operational flags to the per-filesystem clone and dedupe
> implementations.  This enables the vfs to signal when it can deal with
> short clone and short dedupe operations.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
[...]
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1761,7 +1761,7 @@ struct file_operations {
>                         loff_t, size_t, unsigned int);
>         s64 (*clone_file_range)(struct file *file_in, loff_t pos_in,
>                                 struct file *file_out, loff_t pos_out,
> -                               u64 count);
> +                               u64 count, unsigned int flags);
>         s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in,
>                                  struct file *file_out, loff_t pos_out,
>                                  u64 count);
> @@ -1827,9 +1827,15 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
>                 unsigned long, loff_t *, rwf_t);
>  extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
>                                    loff_t, size_t, unsigned int);
> +/* Caller can handle a shortened operation. */
> +#define CLONERANGE_SHORT       (1 << 0)
> +/* End operation at the source file's EOF. */
> +#define CLONERANGE_EOF         (1 << 1)
> +/* Operation is actually dedupe, not clone. */
> +#define CLONERANGE_DEDUPE      (1 << 2)

That's cool. But you know what's going to be the next step, right?
Merging the 3 file operation interfaces into a single one.
copy_file_range() already has the flags arg for future extensions
and as you wrote somewhere, clone is really an optimized copy.
ovl_copyfile() already does that internally.

So the only take away for this patch series, please use constant
names COPYRANGE_* and also explicitly define:

/* Operation is actually clone, not copy. */
#define COPYRANGE_CLONE      (1 << 2)
/* Operation is actually dedupe, not copy. */
#define COPYRANGE_DEDUPE      (1 << 3)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 08/15] vfs: change clone and dedupe range function pointers to return bytes completed
  2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
  (?)
@ 2018-10-05  8:06   ` Amir Goldstein
  2018-10-05 21:47       ` [Ocfs2-devel] " Darrick J. Wong
  -1 siblings, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2018-10-05  8:06 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen, Matthew Wilcox, Miklos Szeredi

On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> Change the clone_file_range and dedupe_file_range functions to return
> the number of bytes they operated on.  This is the precursor to allowing
> fs implementations to return short clone/dedupe results to the user,
> which will enable us to obey resource limits in a graceful manner.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

[...]

> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index aeaefd2a551b..6d792d817538 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -487,16 +487,21 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
>                             OVL_COPY);
>  }
>
> -static int ovl_clone_file_range(struct file *file_in, loff_t pos_in,
> +static s64 ovl_clone_file_range(struct file *file_in, loff_t pos_in,
>                                 struct file *file_out, loff_t pos_out, u64 len)
>  {
> -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> -                           OVL_CLONE);
> +       int ret;
> +
> +       ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> +                          OVL_CLONE);
> +       return ret < 0 ? ret : len;
>  }
>
> -static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
> +static s64 ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
>                                  struct file *file_out, loff_t pos_out, u64 len)
>  {
> +       int ret;
> +
>         /*
>          * Don't copy up because of a dedupe request, this wouldn't make sense
>          * most of the time (data would be duplicated instead of deduplicated).
> @@ -505,8 +510,9 @@ static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
>             !ovl_inode_upper(file_inode(file_out)))
>                 return -EPERM;
>
> -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> -                           OVL_DEDUPE);
> +       ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> +                          OVL_DEDUPE);
> +       return ret < 0 ? ret : len;
>  }
>

This is not pretty at all.
You are blocking the propagation of partial dedupe/clone result
of files that are accessed via overlay over xfs.

Please extend the interface change to the vfs helpers
(i.e. vfs_clone_file_range()) and then the change above is not needed.

Of course you would need to change the 3 callers of
vfs_clone_file_range() that expect 0 is ok.

Please take a look at commit
a725356b6659 ("vfs: swap names of {do,vfs}_clone_file_range()")

That was just merged for rc7.

I do apologize for the churn, but it's a semantic mistake that
I made that needed fixing, so please rebase your work on top
of that and take care not to trip over it.

ioctl_file_clone() and ovl_copy_up_data() just need to interpret
positive return value correctly.
nfsd4_clone_file_range() should have the same return value as
vfs_clone_file_range() to be interpreted in nfsd4_clone(), following
same practice as nfsd4_copy_file_range().

[...]

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2a4141d36ebf..e5755340e825 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1759,10 +1759,12 @@ struct file_operations {
>  #endif
>         ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
>                         loff_t, size_t, unsigned int);
> -       int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
> -                       u64);
> -       int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t,
> -                       u64);
> +       s64 (*clone_file_range)(struct file *file_in, loff_t pos_in,
> +                               struct file *file_out, loff_t pos_out,
> +                               u64 count);
> +       s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in,
> +                                struct file *file_out, loff_t pos_out,
> +                                u64 count);

Matthew has objected a similar interface change when it was proposed by Miklos:
https://marc.info/?l=linux-fsdevel&m=152570317110292&w=2
https://marc.info/?l=linux-fsdevel&m=152569298704781&w=2

He claimed that the interface should look like this:
+       loff_t (*dedupe_file_range)(struct file *src, loff_t src_off,
+                       struct file *dst, loff_t dst_off, loff_t len);

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 07/15] vfs: skip zero-length dedupe requests
  2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
  (?)
@ 2018-10-05  8:39   ` Amir Goldstein
  -1 siblings, 0 replies; 82+ messages in thread
From: Amir Goldstein @ 2018-10-05  8:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen

On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> Don't bother calling the filesystem for a zero-length dedupe request;
> we can return zero and exit.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/read_write.c |    5 +++++
>  1 file changed, 5 insertions(+)
>
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 973d3da78c09..99b2f809180c 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1966,6 +1966,11 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
>         if (!dst_file->f_op->dedupe_file_range)
>                 goto out_drop_write;
>
> +       if (len == 0) {
> +               ret = 0;
> +               goto out_drop_write;
> +       }
> +
>         ret = dst_file->f_op->dedupe_file_range(src_file, src_pos,
>                                                 dst_file, dst_pos, len);

I suppose it is best to let user get the same errors he would get
before this change rather than skipping zero length copy at
vfs_dedupe_file_range() level, so looks fine.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
  2018-10-05  7:02     ` [Ocfs2-devel] " Dave Chinner
@ 2018-10-05  9:02       ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  9:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 05:02:28PM +1000, Dave Chinner wrote:
> On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Refactor all the reflink preparation steps into a separate helper that
> > we'll use to land all the upcoming fixes for insufficient input checks.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> .....
> 
> > +xfs_reflink_remap_range(
> > +	struct file		*file_in,
> > +	loff_t			pos_in,
> > +	struct file		*file_out,
> > +	loff_t			pos_out,
> > +	u64			len,
> > +	bool			is_dedupe)
> > +{
> > +	struct inode		*inode_in = file_inode(file_in);
> > +	struct xfs_inode	*src = XFS_I(inode_in);
> > +	struct inode		*inode_out = file_inode(file_out);
> > +	struct xfs_inode	*dest = XFS_I(inode_out);
> > +	struct xfs_mount	*mp = src->i_mount;
> > +	xfs_fileoff_t		sfsbno, dfsbno;
> > +	xfs_filblks_t		fsblen;
> > +	xfs_extlen_t		cowextsize;
> > +	ssize_t			ret;
> > +
> > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > +		return -EOPNOTSUPP;
> > +
> > +	if (XFS_FORCED_SHUTDOWN(mp))
> > +		return -EIO;
> > +
> > +	/* Prepare and then clone file data. */
> > +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> > +			len, is_dedupe);

More than one bug. vfs_clone_file_prep_inodes() modifes the length
parameter in the case of whole file reflink by way of "len == 0"
on a non-zero length file. So I fixed this, too.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
@ 2018-10-05  9:02       ` Dave Chinner
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05  9:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 05:02:28PM +1000, Dave Chinner wrote:
> On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Refactor all the reflink preparation steps into a separate helper that
> > we'll use to land all the upcoming fixes for insufficient input checks.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> .....
> 
> > +xfs_reflink_remap_range(
> > +	struct file		*file_in,
> > +	loff_t			pos_in,
> > +	struct file		*file_out,
> > +	loff_t			pos_out,
> > +	u64			len,
> > +	bool			is_dedupe)
> > +{
> > +	struct inode		*inode_in = file_inode(file_in);
> > +	struct xfs_inode	*src = XFS_I(inode_in);
> > +	struct inode		*inode_out = file_inode(file_out);
> > +	struct xfs_inode	*dest = XFS_I(inode_out);
> > +	struct xfs_mount	*mp = src->i_mount;
> > +	xfs_fileoff_t		sfsbno, dfsbno;
> > +	xfs_filblks_t		fsblen;
> > +	xfs_extlen_t		cowextsize;
> > +	ssize_t			ret;
> > +
> > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > +		return -EOPNOTSUPP;
> > +
> > +	if (XFS_FORCED_SHUTDOWN(mp))
> > +		return -EIO;
> > +
> > +	/* Prepare and then clone file data. */
> > +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> > +			len, is_dedupe);

More than one bug. vfs_clone_file_prep_inodes() modifes the length
parameter in the case of whole file reflink by way of "len == 0"
on a non-zero length file. So I fixed this, too.

-Dave.
-- 
Dave Chinner
david at fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
  2018-10-05  5:28     ` [Ocfs2-devel] " Dave Chinner
@ 2018-10-05 17:06       ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 17:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 03:28:09PM +1000, Dave Chinner wrote:
> On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Refactor all the reflink preparation steps into a separate helper that
> > we'll use to land all the upcoming fixes for insufficient input checks.
> 
> If I've read the patch right, this also changes the location of the
> page cache truncation, right?  i.e. it now happens in the
> xfs_reflink_remap_prep() function instead of after the remap?  I
> think the commit message needs to mention that because it's a fix to
> incorrect behaviour....

Right.  Sorry I forgot to put that in the changelog.

> I've added:
> 
> --
> This rework also moves the invalidation of the destination range to
> the prep function so that it is done before the range is remapped.
> This ensures that nobody can access the data in range being remapped
> until the remap is complete.
> --
> 
> Sound OK?

Yep.

--D


> Otherwise this looks fine.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> -Dave.
> 
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_reflink.c |   96 +++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 71 insertions(+), 25 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 38f405415b88..80ca9b6793cd 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1195,11 +1195,33 @@ xfs_iolock_two_inodes_and_break_layout(
> >  	return 0;
> >  }
> >  
> > +/* Unlock both inodes after they've been prepped for a range clone. */
> > +STATIC void
> > +xfs_reflink_remap_unlock(
> > +	struct file		*file_in,
> > +	struct file		*file_out)
> > +{
> > +	struct inode		*inode_in = file_inode(file_in);
> > +	struct xfs_inode	*src = XFS_I(inode_in);
> > +	struct inode		*inode_out = file_inode(file_out);
> > +	struct xfs_inode	*dest = XFS_I(inode_out);
> > +	bool			same_inode = (inode_in == inode_out);
> > +
> > +	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> > +	if (!same_inode)
> > +		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> > +	inode_unlock(inode_out);
> > +	if (!same_inode)
> > +		inode_unlock_shared(inode_in);
> > +}
> > +
> >  /*
> > - * Link a range of blocks from one file to another.
> > + * Prepare two files for range cloning.  Upon a successful return both inodes
> > + * will have the iolock and mmaplock held, the page cache of the out file
> > + * will be truncated, and any leases on the out file will have been broken.
> >   */
> > -int
> > -xfs_reflink_remap_range(
> > +STATIC int
> > +xfs_reflink_remap_prep(
> >  	struct file		*file_in,
> >  	loff_t			pos_in,
> >  	struct file		*file_out,
> > @@ -1211,19 +1233,9 @@ xfs_reflink_remap_range(
> >  	struct xfs_inode	*src = XFS_I(inode_in);
> >  	struct inode		*inode_out = file_inode(file_out);
> >  	struct xfs_inode	*dest = XFS_I(inode_out);
> > -	struct xfs_mount	*mp = src->i_mount;
> >  	bool			same_inode = (inode_in == inode_out);
> > -	xfs_fileoff_t		sfsbno, dfsbno;
> > -	xfs_filblks_t		fsblen;
> > -	xfs_extlen_t		cowextsize;
> >  	ssize_t			ret;
> >  
> > -	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > -		return -EOPNOTSUPP;
> > -
> > -	if (XFS_FORCED_SHUTDOWN(mp))
> > -		return -EIO;
> > -
> >  	/* Lock both files against IO */
> >  	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
> >  	if (ret)
> > @@ -1254,8 +1266,6 @@ xfs_reflink_remap_range(
> >  	if (ret)
> >  		goto out_unlock;
> >  
> > -	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> > -
> >  	/*
> >  	 * Clear out post-eof preallocations because we don't have page cache
> >  	 * backing the delayed allocations and they'll never get freed on
> > @@ -1272,6 +1282,51 @@ xfs_reflink_remap_range(
> >  	if (ret)
> >  		goto out_unlock;
> >  
> > +	/* Zap any page cache for the destination file's range. */
> > +	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> > +				   PAGE_ALIGN(pos_out + len) - 1);
> > +	return 0;
> > +out_unlock:
> > +	xfs_reflink_remap_unlock(file_in, file_out);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Link a range of blocks from one file to another.
> > + */
> > +int
> > +xfs_reflink_remap_range(
> > +	struct file		*file_in,
> > +	loff_t			pos_in,
> > +	struct file		*file_out,
> > +	loff_t			pos_out,
> > +	u64			len,
> > +	bool			is_dedupe)
> > +{
> > +	struct inode		*inode_in = file_inode(file_in);
> > +	struct xfs_inode	*src = XFS_I(inode_in);
> > +	struct inode		*inode_out = file_inode(file_out);
> > +	struct xfs_inode	*dest = XFS_I(inode_out);
> > +	struct xfs_mount	*mp = src->i_mount;
> > +	xfs_fileoff_t		sfsbno, dfsbno;
> > +	xfs_filblks_t		fsblen;
> > +	xfs_extlen_t		cowextsize;
> > +	ssize_t			ret;
> > +
> > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > +		return -EOPNOTSUPP;
> > +
> > +	if (XFS_FORCED_SHUTDOWN(mp))
> > +		return -EIO;
> > +
> > +	/* Prepare and then clone file data. */
> > +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> > +			len, is_dedupe);
> > +	if (ret)
> > +		return ret;
> > +
> > +	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> > +
> >  	dfsbno = XFS_B_TO_FSBT(mp, pos_out);
> >  	sfsbno = XFS_B_TO_FSBT(mp, pos_in);
> >  	fsblen = XFS_B_TO_FSB(mp, len);
> > @@ -1280,10 +1335,6 @@ xfs_reflink_remap_range(
> >  	if (ret)
> >  		goto out_unlock;
> >  
> > -	/* Zap any page cache for the destination file's range. */
> > -	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> > -				   PAGE_ALIGN(pos_out + len) - 1);
> > -
> >  	/*
> >  	 * Carry the cowextsize hint from src to dest if we're sharing the
> >  	 * entire source file to the entire destination file, the source file
> > @@ -1300,12 +1351,7 @@ xfs_reflink_remap_range(
> >  			is_dedupe);
> >  
> >  out_unlock:
> > -	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> > -	if (!same_inode)
> > -		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> > -	inode_unlock(inode_out);
> > -	if (!same_inode)
> > -		inode_unlock_shared(inode_in);
> > +	xfs_reflink_remap_unlock(file_in, file_out);
> >  	if (ret)
> >  		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
> >  	return ret;
> > 
> > 
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
@ 2018-10-05 17:06       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 17:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 03:28:09PM +1000, Dave Chinner wrote:
> On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Refactor all the reflink preparation steps into a separate helper that
> > we'll use to land all the upcoming fixes for insufficient input checks.
> 
> If I've read the patch right, this also changes the location of the
> page cache truncation, right?  i.e. it now happens in the
> xfs_reflink_remap_prep() function instead of after the remap?  I
> think the commit message needs to mention that because it's a fix to
> incorrect behaviour....

Right.  Sorry I forgot to put that in the changelog.

> I've added:
> 
> --
> This rework also moves the invalidation of the destination range to
> the prep function so that it is done before the range is remapped.
> This ensures that nobody can access the data in range being remapped
> until the remap is complete.
> --
> 
> Sound OK?

Yep.

--D


> Otherwise this looks fine.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> -Dave.
> 
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_reflink.c |   96 +++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 71 insertions(+), 25 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 38f405415b88..80ca9b6793cd 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1195,11 +1195,33 @@ xfs_iolock_two_inodes_and_break_layout(
> >  	return 0;
> >  }
> >  
> > +/* Unlock both inodes after they've been prepped for a range clone. */
> > +STATIC void
> > +xfs_reflink_remap_unlock(
> > +	struct file		*file_in,
> > +	struct file		*file_out)
> > +{
> > +	struct inode		*inode_in = file_inode(file_in);
> > +	struct xfs_inode	*src = XFS_I(inode_in);
> > +	struct inode		*inode_out = file_inode(file_out);
> > +	struct xfs_inode	*dest = XFS_I(inode_out);
> > +	bool			same_inode = (inode_in == inode_out);
> > +
> > +	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> > +	if (!same_inode)
> > +		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> > +	inode_unlock(inode_out);
> > +	if (!same_inode)
> > +		inode_unlock_shared(inode_in);
> > +}
> > +
> >  /*
> > - * Link a range of blocks from one file to another.
> > + * Prepare two files for range cloning.  Upon a successful return both inodes
> > + * will have the iolock and mmaplock held, the page cache of the out file
> > + * will be truncated, and any leases on the out file will have been broken.
> >   */
> > -int
> > -xfs_reflink_remap_range(
> > +STATIC int
> > +xfs_reflink_remap_prep(
> >  	struct file		*file_in,
> >  	loff_t			pos_in,
> >  	struct file		*file_out,
> > @@ -1211,19 +1233,9 @@ xfs_reflink_remap_range(
> >  	struct xfs_inode	*src = XFS_I(inode_in);
> >  	struct inode		*inode_out = file_inode(file_out);
> >  	struct xfs_inode	*dest = XFS_I(inode_out);
> > -	struct xfs_mount	*mp = src->i_mount;
> >  	bool			same_inode = (inode_in == inode_out);
> > -	xfs_fileoff_t		sfsbno, dfsbno;
> > -	xfs_filblks_t		fsblen;
> > -	xfs_extlen_t		cowextsize;
> >  	ssize_t			ret;
> >  
> > -	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > -		return -EOPNOTSUPP;
> > -
> > -	if (XFS_FORCED_SHUTDOWN(mp))
> > -		return -EIO;
> > -
> >  	/* Lock both files against IO */
> >  	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
> >  	if (ret)
> > @@ -1254,8 +1266,6 @@ xfs_reflink_remap_range(
> >  	if (ret)
> >  		goto out_unlock;
> >  
> > -	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> > -
> >  	/*
> >  	 * Clear out post-eof preallocations because we don't have page cache
> >  	 * backing the delayed allocations and they'll never get freed on
> > @@ -1272,6 +1282,51 @@ xfs_reflink_remap_range(
> >  	if (ret)
> >  		goto out_unlock;
> >  
> > +	/* Zap any page cache for the destination file's range. */
> > +	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> > +				   PAGE_ALIGN(pos_out + len) - 1);
> > +	return 0;
> > +out_unlock:
> > +	xfs_reflink_remap_unlock(file_in, file_out);
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Link a range of blocks from one file to another.
> > + */
> > +int
> > +xfs_reflink_remap_range(
> > +	struct file		*file_in,
> > +	loff_t			pos_in,
> > +	struct file		*file_out,
> > +	loff_t			pos_out,
> > +	u64			len,
> > +	bool			is_dedupe)
> > +{
> > +	struct inode		*inode_in = file_inode(file_in);
> > +	struct xfs_inode	*src = XFS_I(inode_in);
> > +	struct inode		*inode_out = file_inode(file_out);
> > +	struct xfs_inode	*dest = XFS_I(inode_out);
> > +	struct xfs_mount	*mp = src->i_mount;
> > +	xfs_fileoff_t		sfsbno, dfsbno;
> > +	xfs_filblks_t		fsblen;
> > +	xfs_extlen_t		cowextsize;
> > +	ssize_t			ret;
> > +
> > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > +		return -EOPNOTSUPP;
> > +
> > +	if (XFS_FORCED_SHUTDOWN(mp))
> > +		return -EIO;
> > +
> > +	/* Prepare and then clone file data. */
> > +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> > +			len, is_dedupe);
> > +	if (ret)
> > +		return ret;
> > +
> > +	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> > +
> >  	dfsbno = XFS_B_TO_FSBT(mp, pos_out);
> >  	sfsbno = XFS_B_TO_FSBT(mp, pos_in);
> >  	fsblen = XFS_B_TO_FSB(mp, len);
> > @@ -1280,10 +1335,6 @@ xfs_reflink_remap_range(
> >  	if (ret)
> >  		goto out_unlock;
> >  
> > -	/* Zap any page cache for the destination file's range. */
> > -	truncate_inode_pages_range(&inode_out->i_data, pos_out,
> > -				   PAGE_ALIGN(pos_out + len) - 1);
> > -
> >  	/*
> >  	 * Carry the cowextsize hint from src to dest if we're sharing the
> >  	 * entire source file to the entire destination file, the source file
> > @@ -1300,12 +1351,7 @@ xfs_reflink_remap_range(
> >  			is_dedupe);
> >  
> >  out_unlock:
> > -	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> > -	if (!same_inode)
> > -		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> > -	inode_unlock(inode_out);
> > -	if (!same_inode)
> > -		inode_unlock_shared(inode_in);
> > +	xfs_reflink_remap_unlock(file_in, file_out);
> >  	if (ret)
> >  		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
> >  	return ret;
> > 
> > 
> 
> -- 
> Dave Chinner
> david at fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
  2018-10-05  9:02       ` [Ocfs2-devel] " Dave Chinner
@ 2018-10-05 17:21         ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 17:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 07:02:42PM +1000, Dave Chinner wrote:
> On Fri, Oct 05, 2018 at 05:02:28PM +1000, Dave Chinner wrote:
> > On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Refactor all the reflink preparation steps into a separate helper that
> > > we'll use to land all the upcoming fixes for insufficient input checks.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > .....
> > 
> > > +xfs_reflink_remap_range(
> > > +	struct file		*file_in,
> > > +	loff_t			pos_in,
> > > +	struct file		*file_out,
> > > +	loff_t			pos_out,
> > > +	u64			len,
> > > +	bool			is_dedupe)
> > > +{
> > > +	struct inode		*inode_in = file_inode(file_in);
> > > +	struct xfs_inode	*src = XFS_I(inode_in);
> > > +	struct inode		*inode_out = file_inode(file_out);
> > > +	struct xfs_inode	*dest = XFS_I(inode_out);
> > > +	struct xfs_mount	*mp = src->i_mount;
> > > +	xfs_fileoff_t		sfsbno, dfsbno;
> > > +	xfs_filblks_t		fsblen;
> > > +	xfs_extlen_t		cowextsize;
> > > +	ssize_t			ret;
> > > +
> > > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (XFS_FORCED_SHUTDOWN(mp))
> > > +		return -EIO;
> > > +
> > > +	/* Prepare and then clone file data. */
> > > +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> > > +			len, is_dedupe);
> 
> More than one bug. vfs_clone_file_prep_inodes() modifes the length
> parameter in the case of whole file reflink by way of "len == 0"
> on a non-zero length file. So I fixed this, too.

Did your patch look something like the attached?

--D

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 80ca9b6793cd..53158bdb1105 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1219,6 +1219,7 @@ xfs_reflink_remap_unlock(
  * Prepare two files for range cloning.  Upon a successful return both inodes
  * will have the iolock and mmaplock held, the page cache of the out file
  * will be truncated, and any leases on the out file will have been broken.
+ * Returns negative for error, 0 for nothing to do, and 1 for success.
  */
 STATIC int
 xfs_reflink_remap_prep(
@@ -1226,7 +1227,7 @@ xfs_reflink_remap_prep(
 	loff_t			pos_in,
 	struct file		*file_out,
 	loff_t			pos_out,
-	u64			len,
+	u64			*len,
 	bool			is_dedupe)
 {
 	struct inode		*inode_in = file_inode(file_in);
@@ -1257,7 +1258,7 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out,
-			&len, is_dedupe);
+			len, is_dedupe);
 	if (ret <= 0)
 		goto out_unlock;
 
@@ -1284,8 +1285,8 @@ xfs_reflink_remap_prep(
 
 	/* Zap any page cache for the destination file's range. */
 	truncate_inode_pages_range(&inode_out->i_data, pos_out,
-				   PAGE_ALIGN(pos_out + len) - 1);
-	return 0;
+				   PAGE_ALIGN(pos_out + *len) - 1);
+	return 1;
 out_unlock:
 	xfs_reflink_remap_unlock(file_in, file_out);
 	return ret;
@@ -1321,8 +1322,8 @@ xfs_reflink_remap_range(
 
 	/* Prepare and then clone file data. */
 	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
-			len, is_dedupe);
-	if (ret)
+			&len, is_dedupe);
+	if (ret <= 0)
 		return ret;
 
 	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
@ 2018-10-05 17:21         ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 17:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 07:02:42PM +1000, Dave Chinner wrote:
> On Fri, Oct 05, 2018 at 05:02:28PM +1000, Dave Chinner wrote:
> > On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Refactor all the reflink preparation steps into a separate helper that
> > > we'll use to land all the upcoming fixes for insufficient input checks.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > .....
> > 
> > > +xfs_reflink_remap_range(
> > > +	struct file		*file_in,
> > > +	loff_t			pos_in,
> > > +	struct file		*file_out,
> > > +	loff_t			pos_out,
> > > +	u64			len,
> > > +	bool			is_dedupe)
> > > +{
> > > +	struct inode		*inode_in = file_inode(file_in);
> > > +	struct xfs_inode	*src = XFS_I(inode_in);
> > > +	struct inode		*inode_out = file_inode(file_out);
> > > +	struct xfs_inode	*dest = XFS_I(inode_out);
> > > +	struct xfs_mount	*mp = src->i_mount;
> > > +	xfs_fileoff_t		sfsbno, dfsbno;
> > > +	xfs_filblks_t		fsblen;
> > > +	xfs_extlen_t		cowextsize;
> > > +	ssize_t			ret;
> > > +
> > > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (XFS_FORCED_SHUTDOWN(mp))
> > > +		return -EIO;
> > > +
> > > +	/* Prepare and then clone file data. */
> > > +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> > > +			len, is_dedupe);
> 
> More than one bug. vfs_clone_file_prep_inodes() modifes the length
> parameter in the case of whole file reflink by way of "len == 0"
> on a non-zero length file. So I fixed this, too.

Did your patch look something like the attached?

--D

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 80ca9b6793cd..53158bdb1105 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1219,6 +1219,7 @@ xfs_reflink_remap_unlock(
  * Prepare two files for range cloning.  Upon a successful return both inodes
  * will have the iolock and mmaplock held, the page cache of the out file
  * will be truncated, and any leases on the out file will have been broken.
+ * Returns negative for error, 0 for nothing to do, and 1 for success.
  */
 STATIC int
 xfs_reflink_remap_prep(
@@ -1226,7 +1227,7 @@ xfs_reflink_remap_prep(
 	loff_t			pos_in,
 	struct file		*file_out,
 	loff_t			pos_out,
-	u64			len,
+	u64			*len,
 	bool			is_dedupe)
 {
 	struct inode		*inode_in = file_inode(file_in);
@@ -1257,7 +1258,7 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out,
-			&len, is_dedupe);
+			len, is_dedupe);
 	if (ret <= 0)
 		goto out_unlock;
 
@@ -1284,8 +1285,8 @@ xfs_reflink_remap_prep(
 
 	/* Zap any page cache for the destination file's range. */
 	truncate_inode_pages_range(&inode_out->i_data, pos_out,
-				   PAGE_ALIGN(pos_out + len) - 1);
-	return 0;
+				   PAGE_ALIGN(pos_out + *len) - 1);
+	return 1;
 out_unlock:
 	xfs_reflink_remap_unlock(file_in, file_out);
 	return ret;
@@ -1321,8 +1322,8 @@ xfs_reflink_remap_range(
 
 	/* Prepare and then clone file data. */
 	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
-			len, is_dedupe);
-	if (ret)
+			&len, is_dedupe);
+	if (ret <= 0)
 		return ret;
 
 	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> -Dave.
> -- 
> Dave Chinner
> david at fromorbit.com

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 06/15] vfs: strengthen checking of file range inputs to clone/dedupe range
  2018-10-05  6:10   ` Amir Goldstein
@ 2018-10-05 17:36       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 17:36 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen

On Fri, Oct 05, 2018 at 09:10:12AM +0300, Amir Goldstein wrote:
> On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Clone range is an optimization on a regular file write.  File writes
> > that extend the file length are subject to various constraints which are
> > not checked by clonerange.  This is a correctness problem, because we're
> > never allowed to touch ranges that the page cache can't support
> > (s_maxbytes); we're not supposed to deal with large offsets
> > (MAX_NON_LFS) if O_LARGEFILE isn't set; and we must obey resource limits
> > (RLIMIT_FSIZE).
> >
> > Therefore, add these checks to the new generic_clone_checks function so
> > that we curtail unexpected behavior.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  mm/filemap.c |   31 +++++++++++++++++++++++++++++++
> >  1 file changed, 31 insertions(+)
> >
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 68ec91d05c7b..f74391721234 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3015,6 +3015,37 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in,
> >                 return -EINVAL;
> >         count = min(count, size_in - (uint64_t)pos_in);
> >
> > +       /* Don't exceed RLMIT_FSIZE in the file we're writing into. */
> > +       if (limit != RLIM_INFINITY) {
> > +               if (pos_out >= limit) {
> > +                       send_sig(SIGXFSZ, current, 0);
> > +                       return -EFBIG;
> > +               }
> > +               count = min(count, limit - (uint64_t)pos_out);
> > +       }
> > +
> > +       /* Don't exceed the LFS limits. */
> > +       if (unlikely(pos_out + count > MAX_NON_LFS &&
> > +                               !(file_out->f_flags & O_LARGEFILE))) {
> > +               if (pos_out >= MAX_NON_LFS)
> > +                       return -EFBIG;
> > +               count = min(count, MAX_NON_LFS - (uint64_t)pos_out);
> > +       }
> > +       if (unlikely(pos_in + count > MAX_NON_LFS &&
> > +                               !(file_in->f_flags & O_LARGEFILE))) {
> > +               if (pos_in >= MAX_NON_LFS)
> > +                       return -EFBIG;
> > +               count = min(count, MAX_NON_LFS - (uint64_t)pos_in);
> > +       }
> > +
> > +       /* Don't operate on ranges the page cache doesn't support. */
> > +       if (unlikely(pos_out >= inode_out->i_sb->s_maxbytes ||
> > +                    pos_in >= inode_in->i_sb->s_maxbytes))
> > +               return -EFBIG;
> > +
> 
> Forget my standards, this doesn't abide by your own standards ;-)
> Please factor out generic_write_checks() and use it instead of
> duplicating the code. The in/out variant doesn't justify not calling
> the helper twice IMO.

Factor generic_write_checks and generic_clone_checks how?  They operate
on very different parameter types.

Or were you suggeseting refactoring just the "Dont' exceed LFS limits"
and "Don't operate on ranges the page cache..." sections of
generic_clone_checks to reduce copy paste?  That I'll do.

--D

> 
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 06/15] vfs: strengthen checking of file range inputs to clone/dedupe range
@ 2018-10-05 17:36       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 17:36 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen

On Fri, Oct 05, 2018 at 09:10:12AM +0300, Amir Goldstein wrote:
> On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Clone range is an optimization on a regular file write.  File writes
> > that extend the file length are subject to various constraints which are
> > not checked by clonerange.  This is a correctness problem, because we're
> > never allowed to touch ranges that the page cache can't support
> > (s_maxbytes); we're not supposed to deal with large offsets
> > (MAX_NON_LFS) if O_LARGEFILE isn't set; and we must obey resource limits
> > (RLIMIT_FSIZE).
> >
> > Therefore, add these checks to the new generic_clone_checks function so
> > that we curtail unexpected behavior.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  mm/filemap.c |   31 +++++++++++++++++++++++++++++++
> >  1 file changed, 31 insertions(+)
> >
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 68ec91d05c7b..f74391721234 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3015,6 +3015,37 @@ int generic_clone_checks(struct file *file_in, loff_t pos_in,
> >                 return -EINVAL;
> >         count = min(count, size_in - (uint64_t)pos_in);
> >
> > +       /* Don't exceed RLMIT_FSIZE in the file we're writing into. */
> > +       if (limit != RLIM_INFINITY) {
> > +               if (pos_out >= limit) {
> > +                       send_sig(SIGXFSZ, current, 0);
> > +                       return -EFBIG;
> > +               }
> > +               count = min(count, limit - (uint64_t)pos_out);
> > +       }
> > +
> > +       /* Don't exceed the LFS limits. */
> > +       if (unlikely(pos_out + count > MAX_NON_LFS &&
> > +                               !(file_out->f_flags & O_LARGEFILE))) {
> > +               if (pos_out >= MAX_NON_LFS)
> > +                       return -EFBIG;
> > +               count = min(count, MAX_NON_LFS - (uint64_t)pos_out);
> > +       }
> > +       if (unlikely(pos_in + count > MAX_NON_LFS &&
> > +                               !(file_in->f_flags & O_LARGEFILE))) {
> > +               if (pos_in >= MAX_NON_LFS)
> > +                       return -EFBIG;
> > +               count = min(count, MAX_NON_LFS - (uint64_t)pos_in);
> > +       }
> > +
> > +       /* Don't operate on ranges the page cache doesn't support. */
> > +       if (unlikely(pos_out >= inode_out->i_sb->s_maxbytes ||
> > +                    pos_in >= inode_in->i_sb->s_maxbytes))
> > +               return -EFBIG;
> > +
> 
> Forget my standards, this doesn't abide by your own standards ;-)
> Please factor out generic_write_checks() and use it instead of
> duplicating the code. The in/out variant doesn't justify not calling
> the helper twice IMO.

Factor generic_write_checks and generic_clone_checks how?  They operate
on very different parameter types.

Or were you suggeseting refactoring just the "Dont' exceed LFS limits"
and "Don't operate on ranges the page cache..." sections of
generic_clone_checks to reduce copy paste?  That I'll do.

--D

> 
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 12/15] vfs: implement opportunistic short dedupe
  2018-10-05  6:40   ` Amir Goldstein
@ 2018-10-05 17:42       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 17:42 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen

On Fri, Oct 05, 2018 at 09:40:44AM +0300, Amir Goldstein wrote:
> On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > For a given dedupe request, the bytes_deduped field in the control
> > structure tells userspace if we managed to deduplicate some, but not all
> > of, the requested regions starting from the file offsets supplied.
> > However, due to sloppy coding, the current dedupe code returns
> > FILE_DEDUPE_RANGE_DIFFERS if any part of the range is different.
> > Fix this so that we can actually support partial request completion.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/read_write.c    |   44 +++++++++++++++++++++++++++++++++++---------
> >  include/linux/fs.h |    2 +-
> >  2 files changed, 36 insertions(+), 10 deletions(-)
> >
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 292d68c2f47c..9be9f261edd2 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1781,13 +1781,11 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
> >          * Check that the extents are the same.
> >          */
> >         if (is_dedupe) {
> > -               bool            is_same = false;
> > -
> >                 ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> > -                               inode_out, pos_out, *len, &is_same);
> > +                               inode_out, pos_out, len);
> >                 if (ret)
> >                         return ret;
> > -               if (!is_same)
> > +               if (*len == 0)
> >                         return -EBADE;
> >         }
> >
> > @@ -1872,13 +1870,30 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
> >         return page;
> >  }
> >
> > +static unsigned int vfs_dedupe_memcmp(const char *s1, const char *s2,
> > +                                     unsigned int cmp_len)
> > +{
> > +       const char *orig_s1 = s1;
> > +       const char *e1 = s1 + cmp_len;
> > +       const char *e2 = s2 + cmp_len;
> > +
> > +       while (s1 < e1 && s2 < e2) {
> > +               if (*s1 != *s2)
> > +                       break;
> > +               s1++;
> > +               s2++;
> > +       }
> > +
> > +       return s1 - orig_s1;
> > +}
> > +
> 
> A few nits:
> 'len' wouldn't have been ambiguous in this context.
> I find the for loop in memcmp more elegant. It is definitely shorter.
> Not sure how differently the variants compile, but decrementing
> count/len seems much more sane then checking 2 conditions that
> always have the same result.

Fair enough; will fix.

--D

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 12/15] vfs: implement opportunistic short dedupe
@ 2018-10-05 17:42       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 17:42 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen

On Fri, Oct 05, 2018 at 09:40:44AM +0300, Amir Goldstein wrote:
> On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > For a given dedupe request, the bytes_deduped field in the control
> > structure tells userspace if we managed to deduplicate some, but not all
> > of, the requested regions starting from the file offsets supplied.
> > However, due to sloppy coding, the current dedupe code returns
> > FILE_DEDUPE_RANGE_DIFFERS if any part of the range is different.
> > Fix this so that we can actually support partial request completion.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/read_write.c    |   44 +++++++++++++++++++++++++++++++++++---------
> >  include/linux/fs.h |    2 +-
> >  2 files changed, 36 insertions(+), 10 deletions(-)
> >
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 292d68c2f47c..9be9f261edd2 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1781,13 +1781,11 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
> >          * Check that the extents are the same.
> >          */
> >         if (is_dedupe) {
> > -               bool            is_same = false;
> > -
> >                 ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> > -                               inode_out, pos_out, *len, &is_same);
> > +                               inode_out, pos_out, len);
> >                 if (ret)
> >                         return ret;
> > -               if (!is_same)
> > +               if (*len == 0)
> >                         return -EBADE;
> >         }
> >
> > @@ -1872,13 +1870,30 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
> >         return page;
> >  }
> >
> > +static unsigned int vfs_dedupe_memcmp(const char *s1, const char *s2,
> > +                                     unsigned int cmp_len)
> > +{
> > +       const char *orig_s1 = s1;
> > +       const char *e1 = s1 + cmp_len;
> > +       const char *e2 = s2 + cmp_len;
> > +
> > +       while (s1 < e1 && s2 < e2) {
> > +               if (*s1 != *s2)
> > +                       break;
> > +               s1++;
> > +               s2++;
> > +       }
> > +
> > +       return s1 - orig_s1;
> > +}
> > +
> 
> A few nits:
> 'len' wouldn't have been ambiguous in this context.
> I find the for loop in memcmp more elegant. It is definitely shorter.
> Not sure how differently the variants compile, but decrementing
> count/len seems much more sane then checking 2 conditions that
> always have the same result.

Fair enough; will fix.

--D

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 09/15] vfs: pass operation flags to {clone, dedupe}_file_range implementations
  2018-10-05  7:07   ` Amir Goldstein
@ 2018-10-05 17:50       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 17:50 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen, Miklos Szeredi

On Fri, Oct 05, 2018 at 10:07:43AM +0300, Amir Goldstein wrote:
> On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Pass operational flags to the per-filesystem clone and dedupe
> > implementations.  This enables the vfs to signal when it can deal with
> > short clone and short dedupe operations.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> [...]
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1761,7 +1761,7 @@ struct file_operations {
> >                         loff_t, size_t, unsigned int);
> >         s64 (*clone_file_range)(struct file *file_in, loff_t pos_in,
> >                                 struct file *file_out, loff_t pos_out,
> > -                               u64 count);
> > +                               u64 count, unsigned int flags);
> >         s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in,
> >                                  struct file *file_out, loff_t pos_out,
> >                                  u64 count);
> > @@ -1827,9 +1827,15 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> >                 unsigned long, loff_t *, rwf_t);
> >  extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> >                                    loff_t, size_t, unsigned int);
> > +/* Caller can handle a shortened operation. */
> > +#define CLONERANGE_SHORT       (1 << 0)
> > +/* End operation at the source file's EOF. */
> > +#define CLONERANGE_EOF         (1 << 1)
> > +/* Operation is actually dedupe, not clone. */
> > +#define CLONERANGE_DEDUPE      (1 << 2)
> 
> That's cool. But you know what's going to be the next step, right?
> Merging the 3 file operation interfaces into a single one.
> copy_file_range() already has the flags arg for future extensions
> and as you wrote somewhere, clone is really an optimized copy.
> ovl_copyfile() already does that internally.
> 
> So the only take away for this patch series, please use constant
> names COPYRANGE_* and also explicitly define:
> 
> /* Operation is actually clone, not copy. */
> #define COPYRANGE_CLONE      (1 << 2)
> /* Operation is actually dedupe, not copy. */
> #define COPYRANGE_DEDUPE      (1 << 3)

Yeah, I was too tired to try to throw that one on top of the flaming
garbage pile.  But I guess since I have a bunch more work to do to the
previous patch I might as well do that...

--D

> 
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 09/15] vfs: pass operation flags to {clone, dedupe}_file_range implementations
@ 2018-10-05 17:50       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 17:50 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen, Miklos Szeredi

On Fri, Oct 05, 2018 at 10:07:43AM +0300, Amir Goldstein wrote:
> On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Pass operational flags to the per-filesystem clone and dedupe
> > implementations.  This enables the vfs to signal when it can deal with
> > short clone and short dedupe operations.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> [...]
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1761,7 +1761,7 @@ struct file_operations {
> >                         loff_t, size_t, unsigned int);
> >         s64 (*clone_file_range)(struct file *file_in, loff_t pos_in,
> >                                 struct file *file_out, loff_t pos_out,
> > -                               u64 count);
> > +                               u64 count, unsigned int flags);
> >         s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in,
> >                                  struct file *file_out, loff_t pos_out,
> >                                  u64 count);
> > @@ -1827,9 +1827,15 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> >                 unsigned long, loff_t *, rwf_t);
> >  extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> >                                    loff_t, size_t, unsigned int);
> > +/* Caller can handle a shortened operation. */
> > +#define CLONERANGE_SHORT       (1 << 0)
> > +/* End operation at the source file's EOF. */
> > +#define CLONERANGE_EOF         (1 << 1)
> > +/* Operation is actually dedupe, not clone. */
> > +#define CLONERANGE_DEDUPE      (1 << 2)
> 
> That's cool. But you know what's going to be the next step, right?
> Merging the 3 file operation interfaces into a single one.
> copy_file_range() already has the flags arg for future extensions
> and as you wrote somewhere, clone is really an optimized copy.
> ovl_copyfile() already does that internally.
> 
> So the only take away for this patch series, please use constant
> names COPYRANGE_* and also explicitly define:
> 
> /* Operation is actually clone, not copy. */
> #define COPYRANGE_CLONE      (1 << 2)
> /* Operation is actually dedupe, not copy. */
> #define COPYRANGE_DEDUPE      (1 << 3)

Yeah, I was too tired to try to throw that one on top of the flaming
garbage pile.  But I guess since I have a bunch more work to do to the
previous patch I might as well do that...

--D

> 
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 08/15] vfs: change clone and dedupe range function pointers to return bytes completed
  2018-10-05  8:06   ` Amir Goldstein
@ 2018-10-05 21:47       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 21:47 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen, Matthew Wilcox, Miklos Szeredi

On Fri, Oct 05, 2018 at 11:06:54AM +0300, Amir Goldstein wrote:
> On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Change the clone_file_range and dedupe_file_range functions to return
> > the number of bytes they operated on.  This is the precursor to allowing
> > fs implementations to return short clone/dedupe results to the user,
> > which will enable us to obey resource limits in a graceful manner.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> [...]
> 
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index aeaefd2a551b..6d792d817538 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -487,16 +487,21 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
> >                             OVL_COPY);
> >  }
> >
> > -static int ovl_clone_file_range(struct file *file_in, loff_t pos_in,
> > +static s64 ovl_clone_file_range(struct file *file_in, loff_t pos_in,
> >                                 struct file *file_out, loff_t pos_out, u64 len)
> >  {
> > -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> > -                           OVL_CLONE);
> > +       int ret;
> > +
> > +       ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> > +                          OVL_CLONE);
> > +       return ret < 0 ? ret : len;
> >  }
> >
> > -static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
> > +static s64 ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
> >                                  struct file *file_out, loff_t pos_out, u64 len)
> >  {
> > +       int ret;
> > +
> >         /*
> >          * Don't copy up because of a dedupe request, this wouldn't make sense
> >          * most of the time (data would be duplicated instead of deduplicated).
> > @@ -505,8 +510,9 @@ static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
> >             !ovl_inode_upper(file_inode(file_out)))
> >                 return -EPERM;
> >
> > -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> > -                           OVL_DEDUPE);
> > +       ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> > +                          OVL_DEDUPE);
> > +       return ret < 0 ? ret : len;
> >  }
> >
> 
> This is not pretty at all.
> You are blocking the propagation of partial dedupe/clone result
> of files that are accessed via overlay over xfs.
> 
> Please extend the interface change to the vfs helpers
> (i.e. vfs_clone_file_range()) and then the change above is not needed.
> 
> Of course you would need to change the 3 callers of
> vfs_clone_file_range() that expect 0 is ok.

Ok, I'll plumb the bytes-finished return value all the way through the
internal APIs.

> Please take a look at commit
> a725356b6659 ("vfs: swap names of {do,vfs}_clone_file_range()")
> 
> That was just merged for rc7.
> 
> I do apologize for the churn, but it's a semantic mistake that
> I made that needed fixing, so please rebase your work on top
> of that and take care not to trip over it.

Err... ok.  That makes working on this a little messy, we'll see if I
can get this mess rebased in time for 5.0.

> ioctl_file_clone() and ovl_copy_up_data() just need to interpret
> positive return value correctly.
> nfsd4_clone_file_range() should have the same return value as
> vfs_clone_file_range() to be interpreted in nfsd4_clone(), following
> same practice as nfsd4_copy_file_range().
> 
> [...]
> 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 2a4141d36ebf..e5755340e825 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1759,10 +1759,12 @@ struct file_operations {
> >  #endif
> >         ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
> >                         loff_t, size_t, unsigned int);
> > -       int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
> > -                       u64);
> > -       int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t,
> > -                       u64);
> > +       s64 (*clone_file_range)(struct file *file_in, loff_t pos_in,
> > +                               struct file *file_out, loff_t pos_out,
> > +                               u64 count);
> > +       s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in,
> > +                                struct file *file_out, loff_t pos_out,
> > +                                u64 count);
> 
> Matthew has objected a similar interface change when it was proposed by Miklos:
> https://marc.info/?l=linux-fsdevel&m=152570317110292&w=2
> https://marc.info/?l=linux-fsdevel&m=152569298704781&w=2
> 
> He claimed that the interface should look like this:
> +       loff_t (*dedupe_file_range)(struct file *src, loff_t src_off,
> +                       struct file *dst, loff_t dst_off, loff_t len);

I don't really like loff_t (why does the typename for a size include
"offset" in the name??) but I guess that's not horrible.  I've never
liked how functions take size_t (unsigned) but return ssize_t (signed)
anyway.

--D

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 08/15] vfs: change clone and dedupe range function pointers to return bytes completed
@ 2018-10-05 21:47       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-05 21:47 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, linux-xfs, linux-fsdevel, Linux Btrfs, ocfs2-devel,
	Eric Sandeen, Matthew Wilcox, Miklos Szeredi

On Fri, Oct 05, 2018 at 11:06:54AM +0300, Amir Goldstein wrote:
> On Fri, Oct 5, 2018 at 3:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Change the clone_file_range and dedupe_file_range functions to return
> > the number of bytes they operated on.  This is the precursor to allowing
> > fs implementations to return short clone/dedupe results to the user,
> > which will enable us to obey resource limits in a graceful manner.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> [...]
> 
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index aeaefd2a551b..6d792d817538 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -487,16 +487,21 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
> >                             OVL_COPY);
> >  }
> >
> > -static int ovl_clone_file_range(struct file *file_in, loff_t pos_in,
> > +static s64 ovl_clone_file_range(struct file *file_in, loff_t pos_in,
> >                                 struct file *file_out, loff_t pos_out, u64 len)
> >  {
> > -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> > -                           OVL_CLONE);
> > +       int ret;
> > +
> > +       ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> > +                          OVL_CLONE);
> > +       return ret < 0 ? ret : len;
> >  }
> >
> > -static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
> > +static s64 ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
> >                                  struct file *file_out, loff_t pos_out, u64 len)
> >  {
> > +       int ret;
> > +
> >         /*
> >          * Don't copy up because of a dedupe request, this wouldn't make sense
> >          * most of the time (data would be duplicated instead of deduplicated).
> > @@ -505,8 +510,9 @@ static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in,
> >             !ovl_inode_upper(file_inode(file_out)))
> >                 return -EPERM;
> >
> > -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> > -                           OVL_DEDUPE);
> > +       ret = ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
> > +                          OVL_DEDUPE);
> > +       return ret < 0 ? ret : len;
> >  }
> >
> 
> This is not pretty at all.
> You are blocking the propagation of partial dedupe/clone result
> of files that are accessed via overlay over xfs.
> 
> Please extend the interface change to the vfs helpers
> (i.e. vfs_clone_file_range()) and then the change above is not needed.
> 
> Of course you would need to change the 3 callers of
> vfs_clone_file_range() that expect 0 is ok.

Ok, I'll plumb the bytes-finished return value all the way through the
internal APIs.

> Please take a look at commit
> a725356b6659 ("vfs: swap names of {do,vfs}_clone_file_range()")
> 
> That was just merged for rc7.
> 
> I do apologize for the churn, but it's a semantic mistake that
> I made that needed fixing, so please rebase your work on top
> of that and take care not to trip over it.

Err... ok.  That makes working on this a little messy, we'll see if I
can get this mess rebased in time for 5.0.

> ioctl_file_clone() and ovl_copy_up_data() just need to interpret
> positive return value correctly.
> nfsd4_clone_file_range() should have the same return value as
> vfs_clone_file_range() to be interpreted in nfsd4_clone(), following
> same practice as nfsd4_copy_file_range().
> 
> [...]
> 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 2a4141d36ebf..e5755340e825 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1759,10 +1759,12 @@ struct file_operations {
> >  #endif
> >         ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
> >                         loff_t, size_t, unsigned int);
> > -       int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
> > -                       u64);
> > -       int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t,
> > -                       u64);
> > +       s64 (*clone_file_range)(struct file *file_in, loff_t pos_in,
> > +                               struct file *file_out, loff_t pos_out,
> > +                               u64 count);
> > +       s64 (*dedupe_file_range)(struct file *file_in, loff_t pos_in,
> > +                                struct file *file_out, loff_t pos_out,
> > +                                u64 count);
> 
> Matthew has objected a similar interface change when it was proposed by Miklos:
> https://marc.info/?l=linux-fsdevel&m=152570317110292&w=2
> https://marc.info/?l=linux-fsdevel&m=152569298704781&w=2
> 
> He claimed that the interface should look like this:
> +       loff_t (*dedupe_file_range)(struct file *src, loff_t src_off,
> +                       struct file *dst, loff_t dst_off, loff_t len);

I don't really like loff_t (why does the typename for a size include
"offset" in the name??) but I guess that's not horrible.  I've never
liked how functions take size_t (unsigned) but return ssize_t (signed)
anyway.

--D

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
  2018-10-05 17:21         ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-05 23:42           ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05 23:42 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 10:21:59AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 05, 2018 at 07:02:42PM +1000, Dave Chinner wrote:
> > On Fri, Oct 05, 2018 at 05:02:28PM +1000, Dave Chinner wrote:
> > > On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Refactor all the reflink preparation steps into a separate helper that
> > > > we'll use to land all the upcoming fixes for insufficient input checks.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > .....
> > > 
> > > > +xfs_reflink_remap_range(
> > > > +	struct file		*file_in,
> > > > +	loff_t			pos_in,
> > > > +	struct file		*file_out,
> > > > +	loff_t			pos_out,
> > > > +	u64			len,
> > > > +	bool			is_dedupe)
> > > > +{
> > > > +	struct inode		*inode_in = file_inode(file_in);
> > > > +	struct xfs_inode	*src = XFS_I(inode_in);
> > > > +	struct inode		*inode_out = file_inode(file_out);
> > > > +	struct xfs_inode	*dest = XFS_I(inode_out);
> > > > +	struct xfs_mount	*mp = src->i_mount;
> > > > +	xfs_fileoff_t		sfsbno, dfsbno;
> > > > +	xfs_filblks_t		fsblen;
> > > > +	xfs_extlen_t		cowextsize;
> > > > +	ssize_t			ret;
> > > > +
> > > > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	if (XFS_FORCED_SHUTDOWN(mp))
> > > > +		return -EIO;
> > > > +
> > > > +	/* Prepare and then clone file data. */
> > > > +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> > > > +			len, is_dedupe);
> > 
> > More than one bug. vfs_clone_file_prep_inodes() modifes the length
> > parameter in the case of whole file reflink by way of "len == 0"
> > on a non-zero length file. So I fixed this, too.
> 
> Did your patch look something like the attached?

Identical. I folded it back in once it tested ok, though.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
@ 2018-10-05 23:42           ` Dave Chinner
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2018-10-05 23:42 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 10:21:59AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 05, 2018 at 07:02:42PM +1000, Dave Chinner wrote:
> > On Fri, Oct 05, 2018 at 05:02:28PM +1000, Dave Chinner wrote:
> > > On Thu, Oct 04, 2018 at 05:44:47PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Refactor all the reflink preparation steps into a separate helper that
> > > > we'll use to land all the upcoming fixes for insufficient input checks.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > .....
> > > 
> > > > +xfs_reflink_remap_range(
> > > > +	struct file		*file_in,
> > > > +	loff_t			pos_in,
> > > > +	struct file		*file_out,
> > > > +	loff_t			pos_out,
> > > > +	u64			len,
> > > > +	bool			is_dedupe)
> > > > +{
> > > > +	struct inode		*inode_in = file_inode(file_in);
> > > > +	struct xfs_inode	*src = XFS_I(inode_in);
> > > > +	struct inode		*inode_out = file_inode(file_out);
> > > > +	struct xfs_inode	*dest = XFS_I(inode_out);
> > > > +	struct xfs_mount	*mp = src->i_mount;
> > > > +	xfs_fileoff_t		sfsbno, dfsbno;
> > > > +	xfs_filblks_t		fsblen;
> > > > +	xfs_extlen_t		cowextsize;
> > > > +	ssize_t			ret;
> > > > +
> > > > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	if (XFS_FORCED_SHUTDOWN(mp))
> > > > +		return -EIO;
> > > > +
> > > > +	/* Prepare and then clone file data. */
> > > > +	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
> > > > +			len, is_dedupe);
> > 
> > More than one bug. vfs_clone_file_prep_inodes() modifes the length
> > parameter in the case of whole file reflink by way of "len == 0"
> > on a non-zero length file. So I fixed this, too.
> 
> Did your patch look something like the attached?

Identical. I folded it back in once it tested ok, though.

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
  2018-10-05  5:28     ` [Ocfs2-devel] " Dave Chinner
@ 2018-10-06 10:30       ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, linux-btrfs,
	ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 03:28:09PM +1000, Dave Chinner wrote:
> I've added:
> 
> --
> This rework also moves the invalidation of the destination range to
> the prep function so that it is done before the range is remapped.
> This ensures that nobody can access the data in range being remapped
> until the remap is complete.
> --
> 
> Sound OK?
> 
> Otherwise this looks fine.

Given that the patch will need a respin anyway I'd rather split the
move of the pagecache invalidation into a separate patch.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper
@ 2018-10-06 10:30       ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, linux-btrfs,
	ocfs2-devel, sandeen

On Fri, Oct 05, 2018 at 03:28:09PM +1000, Dave Chinner wrote:
> I've added:
> 
> --
> This rework also moves the invalidation of the destination range to
> the prep function so that it is done before the range is remapped.
> This ensures that nobody can access the data in range being remapped
> until the remap is complete.
> --
> 
> Sound OK?
> 
> Otherwise this looks fine.

Given that the patch will need a respin anyway I'd rather split the
move of the pagecache invalidation into a separate patch.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 03/15] xfs: zero posteof blocks when cloning above eof
  2018-10-05  0:44   ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-06 10:34     ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:34 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, Zorro Lang, linux-xfs, ocfs2-devel,
	linux-fsdevel, linux-btrfs

On Thu, Oct 04, 2018 at 05:44:54PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we're reflinking between two files and the destination file range
> is well beyond the destination file's EOF marker, zero any posteof
> speculative preallocations in the destination file so that we don't
> expose stale disk contents.  The previous strategy of trying to clear
> the preallocations does not work if the destination file has the
> PREALLOC flag set.

But I think we should still drop speculative delalloc preallocations 
instead of zeroing them in addition to zeroing of any real blocks in
the data fork.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 03/15] xfs: zero posteof blocks when cloning above eof
@ 2018-10-06 10:34     ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:34 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, sandeen, Zorro Lang, linux-xfs, ocfs2-devel,
	linux-fsdevel, linux-btrfs

On Thu, Oct 04, 2018 at 05:44:54PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we're reflinking between two files and the destination file range
> is well beyond the destination file's EOF marker, zero any posteof
> speculative preallocations in the destination file so that we don't
> expose stale disk contents.  The previous strategy of trying to clear
> the preallocations does not work if the destination file has the
> PREALLOC flag set.

But I think we should still drop speculative delalloc preallocations 
instead of zeroing them in addition to zeroing of any real blocks in
the data fork.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 04/15] xfs: update ctime and remove suid before cloning files
  2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-06 10:35     ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

>  STATIC int
>  xfs_reflink_remap_prep(
> @@ -1302,6 +1303,30 @@ xfs_reflink_remap_prep(
>  	/* Zap any page cache for the destination file's range. */
>  	truncate_inode_pages_range(&inode_out->i_data, pos_out,
>  				   PAGE_ALIGN(pos_out + len) - 1);
> +
> +	/* If we're altering the file contents... */
> +	if (!is_dedupe) {

Nipick - even a clone might not alter the file content, we just have
no guarantee.  So maybe change the comment to:

	/* If we may alter the file contents.. */

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 04/15] xfs: update ctime and remove suid before cloning files
@ 2018-10-06 10:35     ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

>  STATIC int
>  xfs_reflink_remap_prep(
> @@ -1302,6 +1303,30 @@ xfs_reflink_remap_prep(
>  	/* Zap any page cache for the destination file's range. */
>  	truncate_inode_pages_range(&inode_out->i_data, pos_out,
>  				   PAGE_ALIGN(pos_out + len) - 1);
> +
> +	/* If we're altering the file contents... */
> +	if (!is_dedupe) {

Nipick - even a clone might not alter the file content, we just have
no guarantee.  So maybe change the comment to:

	/* If we may alter the file contents.. */

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 05/15] vfs: check file ranges before cloning files
  2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-06 10:38     ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

> +EXPORT_SYMBOL(generic_clone_checks);

There doesn't seem to be any need to export this function.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 05/15] vfs: check file ranges before cloning files
@ 2018-10-06 10:38     ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

> +EXPORT_SYMBOL(generic_clone_checks);

There doesn't seem to be any need to export this function.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 07/15] vfs: skip zero-length dedupe requests
  2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-06 10:39     ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:45:21PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Don't bother calling the filesystem for a zero-length dedupe request;
> we can return zero and exit.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 07/15] vfs: skip zero-length dedupe requests
@ 2018-10-06 10:39     ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:45:21PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Don't bother calling the filesystem for a zero-length dedupe request;
> we can return zero and exit.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 08/15] vfs: change clone and dedupe range function pointers to return bytes completed
  2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-06 10:41     ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:45:35PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Change the clone_file_range and dedupe_file_range functions to return
> the number of bytes they operated on.  This is the precursor to allowing
> fs implementations to return short clone/dedupe results to the user,
> which will enable us to obey resource limits in a graceful manner.

For clone this doesn't make any sense, as we don't allow 'short clones'.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 08/15] vfs: change clone and dedupe range function pointers to return bytes completed
@ 2018-10-06 10:41     ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Thu, Oct 04, 2018 at 05:45:35PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Change the clone_file_range and dedupe_file_range functions to return
> the number of bytes they operated on.  This is the precursor to allowing
> fs implementations to return short clone/dedupe results to the user,
> which will enable us to obey resource limits in a graceful manner.

For clone this doesn't make any sense, as we don't allow 'short clones'.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 09/15] vfs: pass operation flags to {clone, dedupe}_file_range implementations
  2018-10-05 17:50       ` [Ocfs2-devel] " Darrick J. Wong
@ 2018-10-06 10:44         ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:44 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Dave Chinner, linux-xfs, linux-fsdevel,
	Linux Btrfs, ocfs2-devel, Eric Sandeen, Miklos Szeredi

On Fri, Oct 05, 2018 at 10:50:08AM -0700, Darrick J. Wong wrote:
> > That's cool. But you know what's going to be the next step, right?
> > Merging the 3 file operation interfaces into a single one.
> > copy_file_range() already has the flags arg for future extensions
> > and as you wrote somewhere, clone is really an optimized copy.
> > ovl_copyfile() already does that internally.
> > 
> > So the only take away for this patch series, please use constant
> > names COPYRANGE_* and also explicitly define:
> > 
> > /* Operation is actually clone, not copy. */
> > #define COPYRANGE_CLONE      (1 << 2)
> > /* Operation is actually dedupe, not copy. */
> > #define COPYRANGE_DEDUPE      (1 << 3)
> 
> Yeah, I was too tired to try to throw that one on top of the flaming
> garbage pile.  But I guess since I have a bunch more work to do to the
> previous patch I might as well do that...

I'm not totally sold on just merging everything, but I very much despise
what is done in this patch, as it creates a completely confusing
interface.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 09/15] vfs: pass operation flags to {clone, dedupe}_file_range implementations
@ 2018-10-06 10:44         ` Christoph Hellwig
  0 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2018-10-06 10:44 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, Dave Chinner, linux-xfs, linux-fsdevel,
	Linux Btrfs, ocfs2-devel, Eric Sandeen, Miklos Szeredi

On Fri, Oct 05, 2018 at 10:50:08AM -0700, Darrick J. Wong wrote:
> > That's cool. But you know what's going to be the next step, right?
> > Merging the 3 file operation interfaces into a single one.
> > copy_file_range() already has the flags arg for future extensions
> > and as you wrote somewhere, clone is really an optimized copy.
> > ovl_copyfile() already does that internally.
> > 
> > So the only take away for this patch series, please use constant
> > names COPYRANGE_* and also explicitly define:
> > 
> > /* Operation is actually clone, not copy. */
> > #define COPYRANGE_CLONE      (1 << 2)
> > /* Operation is actually dedupe, not copy. */
> > #define COPYRANGE_DEDUPE      (1 << 3)
> 
> Yeah, I was too tired to try to throw that one on top of the flaming
> garbage pile.  But I guess since I have a bunch more work to do to the
> previous patch I might as well do that...

I'm not totally sold on just merging everything, but I very much despise
what is done in this patch, as it creates a completely confusing
interface.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 08/15] vfs: change clone and dedupe range function pointers to return bytes completed
  2018-10-06 10:41     ` [Ocfs2-devel] " Christoph Hellwig
@ 2018-10-08 18:59       ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-08 18:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: david, linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Sat, Oct 06, 2018 at 03:41:34AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 04, 2018 at 05:45:35PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Change the clone_file_range and dedupe_file_range functions to return
> > the number of bytes they operated on.  This is the precursor to allowing
> > fs implementations to return short clone/dedupe results to the user,
> > which will enable us to obey resource limits in a graceful manner.
> 
> For clone this doesn't make any sense, as we don't allow 'short clones'.

That's correct.  The overall goal is that copy_file_range will be able
to call ->clone_file_range with a "I can handle a short clone" flag so
that the fs (and the prep functions) know that they're allowed to
shorten the request length, and we have a way to communicate the actual
results back to userspace.

Neither FS_IOC_CLONE nor FS_IOC_CLONERANGE will be able to take
advantage of this, of course.  They won't pass in the flag, and any
implementation that wants to shorten the length will pass back -EINVAL
instead.

--D

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Ocfs2-devel] [PATCH 08/15] vfs: change clone and dedupe range function pointers to return bytes completed
@ 2018-10-08 18:59       ` Darrick J. Wong
  0 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2018-10-08 18:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: david, linux-xfs, linux-fsdevel, linux-btrfs, ocfs2-devel, sandeen

On Sat, Oct 06, 2018 at 03:41:34AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 04, 2018 at 05:45:35PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Change the clone_file_range and dedupe_file_range functions to return
> > the number of bytes they operated on.  This is the precursor to allowing
> > fs implementations to return short clone/dedupe results to the user,
> > which will enable us to obey resource limits in a graceful manner.
> 
> For clone this doesn't make any sense, as we don't allow 'short clones'.

That's correct.  The overall goal is that copy_file_range will be able
to call ->clone_file_range with a "I can handle a short clone" flag so
that the fs (and the prep functions) know that they're allowed to
shorten the request length, and we have a way to communicate the actual
results back to userspace.

Neither FS_IOC_CLONE nor FS_IOC_CLONERANGE will be able to take
advantage of this, of course.  They won't pass in the flag, and any
implementation that wants to shorten the length will pass back -EINVAL
instead.

--D

^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2018-10-08 19:00 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-05  0:44 [PATCH 00/15] fs: fixes for serious clone/dedupe problems Darrick J. Wong
2018-10-05  0:44 ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  0:44 ` [PATCH 01/15] xfs: add a per-xfs trace_printk macro Darrick J. Wong
2018-10-05  0:44   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  0:44 ` [PATCH 02/15] xfs: refactor clonerange preparation into a separate helper Darrick J. Wong
2018-10-05  0:44   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  5:28   ` Dave Chinner
2018-10-05  5:28     ` [Ocfs2-devel] " Dave Chinner
2018-10-05 17:06     ` Darrick J. Wong
2018-10-05 17:06       ` [Ocfs2-devel] " Darrick J. Wong
2018-10-06 10:30     ` Christoph Hellwig
2018-10-06 10:30       ` [Ocfs2-devel] " Christoph Hellwig
2018-10-05  7:02   ` Dave Chinner
2018-10-05  7:02     ` [Ocfs2-devel] " Dave Chinner
2018-10-05  9:02     ` Dave Chinner
2018-10-05  9:02       ` [Ocfs2-devel] " Dave Chinner
2018-10-05 17:21       ` Darrick J. Wong
2018-10-05 17:21         ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05 23:42         ` Dave Chinner
2018-10-05 23:42           ` [Ocfs2-devel] " Dave Chinner
2018-10-05  0:44 ` [PATCH 03/15] xfs: zero posteof blocks when cloning above eof Darrick J. Wong
2018-10-05  0:44   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  5:28   ` Dave Chinner
2018-10-05  5:28     ` [Ocfs2-devel] " Dave Chinner
2018-10-06 10:34   ` Christoph Hellwig
2018-10-06 10:34     ` [Ocfs2-devel] " Christoph Hellwig
2018-10-05  0:45 ` [PATCH 04/15] xfs: update ctime and remove suid before cloning files Darrick J. Wong
2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  5:30   ` Dave Chinner
2018-10-05  5:30     ` [Ocfs2-devel] " Dave Chinner
2018-10-06 10:35   ` Christoph Hellwig
2018-10-06 10:35     ` [Ocfs2-devel] " Christoph Hellwig
2018-10-05  0:45 ` [PATCH 05/15] vfs: check file ranges " Darrick J. Wong
2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-06 10:38   ` Christoph Hellwig
2018-10-06 10:38     ` [Ocfs2-devel] " Christoph Hellwig
2018-10-05  0:45 ` [PATCH 06/15] vfs: strengthen checking of file range inputs to clone/dedupe range Darrick J. Wong
2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  6:10   ` Amir Goldstein
2018-10-05 17:36     ` Darrick J. Wong
2018-10-05 17:36       ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  0:45 ` [PATCH 07/15] vfs: skip zero-length dedupe requests Darrick J. Wong
2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  8:39   ` Amir Goldstein
2018-10-06 10:39   ` Christoph Hellwig
2018-10-06 10:39     ` [Ocfs2-devel] " Christoph Hellwig
2018-10-05  0:45 ` [PATCH 08/15] vfs: change clone and dedupe range function pointers to return bytes completed Darrick J. Wong
2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  8:06   ` Amir Goldstein
2018-10-05 21:47     ` Darrick J. Wong
2018-10-05 21:47       ` [Ocfs2-devel] " Darrick J. Wong
2018-10-06 10:41   ` Christoph Hellwig
2018-10-06 10:41     ` [Ocfs2-devel] " Christoph Hellwig
2018-10-08 18:59     ` Darrick J. Wong
2018-10-08 18:59       ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  0:45 ` [PATCH 09/15] vfs: pass operation flags to {clone, dedupe}_file_range implementations Darrick J. Wong
2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  7:07   ` Amir Goldstein
2018-10-05 17:50     ` Darrick J. Wong
2018-10-05 17:50       ` [Ocfs2-devel] " Darrick J. Wong
2018-10-06 10:44       ` Christoph Hellwig
2018-10-06 10:44         ` [Ocfs2-devel] " Christoph Hellwig
2018-10-05  0:45 ` [PATCH 10/15] vfs: make cloning to source file eof more explicit Darrick J. Wong
2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  6:47   ` Amir Goldstein
2018-10-05  0:45 ` [PATCH 11/15] vfs: allow short clone and dedupe operations Darrick J. Wong
2018-10-05  0:45   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  0:46 ` [PATCH 12/15] vfs: implement opportunistic short dedupe Darrick J. Wong
2018-10-05  0:46   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  6:40   ` Amir Goldstein
2018-10-05 17:42     ` Darrick J. Wong
2018-10-05 17:42       ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  0:46 ` [PATCH 13/15] ocfs2: truncate page cache for clone destination file before remapping Darrick J. Wong
2018-10-05  0:46   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  0:46 ` [PATCH 14/15] ocfs2: support partial clone range and dedupe range Darrick J. Wong
2018-10-05  0:46   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  0:46 ` [PATCH 15/15] xfs: support returning partial reflink results Darrick J. Wong
2018-10-05  0:46   ` [Ocfs2-devel] " Darrick J. Wong
2018-10-05  1:17 ` [PATCH 00/15] fs: fixes for serious clone/dedupe problems Dave Chinner
2018-10-05  1:17   ` [Ocfs2-devel] " Dave Chinner
2018-10-05  1:24   ` Darrick J. Wong
2018-10-05  1:24     ` [Ocfs2-devel] " Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.