All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS
@ 2015-12-19  8:55 ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, linux-api, xfs

Hi all,

This patch set goes along with the fourth revision of an RFC adding to
XFS kernel support for tracking reverse-mappings of physical blocks to
file and metadata; and support for mapping multiple file logical
blocks to the same physical block, more commonly known as reflinking.

The first four patches are taken verbatim from Anna Schumaker's
patches adding a physical copy call to the kernel.  The next two
patches are from Christoph Hellwig, and hoist the clone and
clone_range ioctls into the VFS.  These patches are a bit old at
this point; they're in here solely to demonstrate how this (long)
patchset diverges from upstream.

The third patch fixes some bugs in the first four patches, and the
fourth patch hoists the extent_same ioctl into the VFS as the dedupe
ioctl.

The patch set is based on the current (4.4-rc5) upstream kernel.

If you're going to start using this mess, you probably ought to just
pull from my github trees for kernel[1], xfsprogs[2], and xfstests[3].
See also the xfs-docs[4] and manpage[5] updates.

This is an extraordinary way to eat your data.  Enjoy!

Comments and questions are, as always, welcome.

--D

[1] https://github.com/djwong/linux/tree/for-dave
[2] https://github.com/djwong/xfsprogs/tree/for-dave
[3] https://github.com/djwong/xfstests/tree/for-dave
[4] https://github.com/djwong/xfs-documentation/tree/for-dave
[5] https://github.com/djwong/man-pages/commits/for-mtk

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS
@ 2015-12-19  8:55 ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, linux-api, xfs

Hi all,

This patch set goes along with the fourth revision of an RFC adding to
XFS kernel support for tracking reverse-mappings of physical blocks to
file and metadata; and support for mapping multiple file logical
blocks to the same physical block, more commonly known as reflinking.

The first four patches are taken verbatim from Anna Schumaker's
patches adding a physical copy call to the kernel.  The next two
patches are from Christoph Hellwig, and hoist the clone and
clone_range ioctls into the VFS.  These patches are a bit old at
this point; they're in here solely to demonstrate how this (long)
patchset diverges from upstream.

The third patch fixes some bugs in the first four patches, and the
fourth patch hoists the extent_same ioctl into the VFS as the dedupe
ioctl.

The patch set is based on the current (4.4-rc5) upstream kernel.

If you're going to start using this mess, you probably ought to just
pull from my github trees for kernel[1], xfsprogs[2], and xfstests[3].
See also the xfs-docs[4] and manpage[5] updates.

This is an extraordinary way to eat your data.  Enjoy!

Comments and questions are, as always, welcome.

--D

[1] https://github.com/djwong/linux/tree/for-dave
[2] https://github.com/djwong/xfsprogs/tree/for-dave
[3] https://github.com/djwong/xfstests/tree/for-dave
[4] https://github.com/djwong/xfs-documentation/tree/for-dave
[5] https://github.com/djwong/man-pages/commits/for-mtk

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS
@ 2015-12-19  8:55 ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david-FqsqvQoI3Ljby3iVrkZq2A, darrick.wong-QHcLZuEGTsvQT0dZR+AlfA
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ

Hi all,

This patch set goes along with the fourth revision of an RFC adding to
XFS kernel support for tracking reverse-mappings of physical blocks to
file and metadata; and support for mapping multiple file logical
blocks to the same physical block, more commonly known as reflinking.

The first four patches are taken verbatim from Anna Schumaker's
patches adding a physical copy call to the kernel.  The next two
patches are from Christoph Hellwig, and hoist the clone and
clone_range ioctls into the VFS.  These patches are a bit old at
this point; they're in here solely to demonstrate how this (long)
patchset diverges from upstream.

The third patch fixes some bugs in the first four patches, and the
fourth patch hoists the extent_same ioctl into the VFS as the dedupe
ioctl.

The patch set is based on the current (4.4-rc5) upstream kernel.

If you're going to start using this mess, you probably ought to just
pull from my github trees for kernel[1], xfsprogs[2], and xfstests[3].
See also the xfs-docs[4] and manpage[5] updates.

This is an extraordinary way to eat your data.  Enjoy!

Comments and questions are, as always, welcome.

--D

[1] https://github.com/djwong/linux/tree/for-dave
[2] https://github.com/djwong/xfsprogs/tree/for-dave
[3] https://github.com/djwong/xfstests/tree/for-dave
[4] https://github.com/djwong/xfs-documentation/tree/for-dave
[5] https://github.com/djwong/man-pages/commits/for-mtk

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 1/9] vfs: add copy_file_range syscall and vfs helper
  2015-12-19  8:55 ` Darrick J. Wong
  (?)
@ 2015-12-19  8:55   ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: Zach Brown, linux-api, xfs, linux-fsdevel, Christoph Hellwig,
	Anna Schumaker

>>From : Anna Schumaker <Anna.Schumaker@Netapp.com>
>>From : Zach Brown <zab@redhat.com>

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file).  This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification,
                 Change flags parameter from int to unsigned int,
                 Add function to include/linux/syscalls.h,
                 Check copy len after file open mode,
                 Don't forbid ranges inside the same file,
                 Use rw_verify_area() to veriy ranges,
                 Use file_out rather than file_in,
                 Add COPY_FR_REFLINK flag]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c                   |  120 +++++++++++++++++++++++++++++++++++++
 include/linux/fs.h                |    3 +
 include/linux/syscalls.h          |    3 +
 include/uapi/asm-generic/unistd.h |    4 +
 kernel/sys_ni.c                   |    1 
 5 files changed, 130 insertions(+), 1 deletion(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..97c15ca 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/pagemap.h>
 #include <linux/splice.h>
 #include <linux/compat.h>
+#include <linux/mount.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -1327,3 +1328,122 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 	return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success.  When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			    struct file *file_out, loff_t pos_out,
+			    size_t len, unsigned int flags)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	ssize_t ret;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0)
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+	if (ret < 0)
+		return ret;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_out->f_op->copy_file_range)
+		return -EBADF;
+
+	/* this could be relaxed once a method supports cross-fs copies */
+	if (inode_in->i_sb != inode_out->i_sb)
+		return -EXDEV;
+
+	if (len == 0)
+		return 0;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+					      len, flags);
+	if (ret > 0) {
+		fsnotify_access(file_in);
+		add_rchar(current, ret);
+		fsnotify_modify(file_out);
+		add_wchar(current, ret);
+	}
+	inc_syscr(current);
+	inc_syscw(current);
+
+	mnt_drop_write_file(file_out);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+		int, fd_out, loff_t __user *, off_out,
+		size_t, len, unsigned int, flags)
+{
+	loff_t pos_in;
+	loff_t pos_out;
+	struct fd f_in;
+	struct fd f_out;
+	ssize_t ret = -EBADF;
+
+	f_in = fdget(fd_in);
+	if (!f_in.file)
+		goto out2;
+
+	f_out = fdget(fd_out);
+	if (!f_out.file)
+		goto out1;
+
+	ret = -EFAULT;
+	if (off_in) {
+		if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_in = f_in.file->f_pos;
+	}
+
+	if (off_out) {
+		if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_out = f_out.file->f_pos;
+	}
+
+	ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+				  flags);
+	if (ret > 0) {
+		pos_in += ret;
+		pos_out += ret;
+
+		if (off_in) {
+			if (copy_to_user(off_in, &pos_in, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_in.file->f_pos = pos_in;
+		}
+
+		if (off_out) {
+			if (copy_to_user(off_out, &pos_out, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_out.file->f_pos = pos_out;
+		}
+	}
+
+out:
+	fdput(f_in);
+out1:
+	fdput(f_out);
+out2:
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa5142..e8a7362 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1629,6 +1629,7 @@ struct file_operations {
 #ifndef CONFIG_MMU
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
+	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
 };
 
 struct inode_operations {
@@ -1680,6 +1681,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
+extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
+				   loff_t, size_t, unsigned int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c2b66a2..185815c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -886,6 +886,9 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *envp, int flags);
 
 asmlinkage long sys_membarrier(int cmd, int flags);
+asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
+				    int fd_out, loff_t __user *off_out,
+				    size_t len, unsigned int flags);
 
 asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
 
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1324b02..2622b33 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -715,9 +715,11 @@ __SYSCALL(__NR_userfaultfd, sys_userfaultfd)
 __SYSCALL(__NR_membarrier, sys_membarrier)
 #define __NR_mlock2 284
 __SYSCALL(__NR_mlock2, sys_mlock2)
+#define __NR_copy_file_range 285
+__SYSCALL(__NR_copy_file_range, sys_copy_file_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 285
+#define __NR_syscalls 286
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0623787..2c5e3a8 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,6 +174,7 @@ cond_syscall(sys_setfsuid);
 cond_syscall(sys_setfsgid);
 cond_syscall(sys_capget);
 cond_syscall(sys_capset);
+cond_syscall(sys_copy_file_range);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 1/9] vfs: add copy_file_range syscall and vfs helper
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: Zach Brown, linux-api, Christoph Hellwig, linux-fsdevel, xfs,
	Anna Schumaker

>From : Anna Schumaker <Anna.Schumaker@Netapp.com>
>From : Zach Brown <zab@redhat.com>

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file).  This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification,
                 Change flags parameter from int to unsigned int,
                 Add function to include/linux/syscalls.h,
                 Check copy len after file open mode,
                 Don't forbid ranges inside the same file,
                 Use rw_verify_area() to veriy ranges,
                 Use file_out rather than file_in,
                 Add COPY_FR_REFLINK flag]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c                   |  120 +++++++++++++++++++++++++++++++++++++
 include/linux/fs.h                |    3 +
 include/linux/syscalls.h          |    3 +
 include/uapi/asm-generic/unistd.h |    4 +
 kernel/sys_ni.c                   |    1 
 5 files changed, 130 insertions(+), 1 deletion(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..97c15ca 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/pagemap.h>
 #include <linux/splice.h>
 #include <linux/compat.h>
+#include <linux/mount.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -1327,3 +1328,122 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 	return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success.  When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			    struct file *file_out, loff_t pos_out,
+			    size_t len, unsigned int flags)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	ssize_t ret;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0)
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+	if (ret < 0)
+		return ret;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_out->f_op->copy_file_range)
+		return -EBADF;
+
+	/* this could be relaxed once a method supports cross-fs copies */
+	if (inode_in->i_sb != inode_out->i_sb)
+		return -EXDEV;
+
+	if (len == 0)
+		return 0;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+					      len, flags);
+	if (ret > 0) {
+		fsnotify_access(file_in);
+		add_rchar(current, ret);
+		fsnotify_modify(file_out);
+		add_wchar(current, ret);
+	}
+	inc_syscr(current);
+	inc_syscw(current);
+
+	mnt_drop_write_file(file_out);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+		int, fd_out, loff_t __user *, off_out,
+		size_t, len, unsigned int, flags)
+{
+	loff_t pos_in;
+	loff_t pos_out;
+	struct fd f_in;
+	struct fd f_out;
+	ssize_t ret = -EBADF;
+
+	f_in = fdget(fd_in);
+	if (!f_in.file)
+		goto out2;
+
+	f_out = fdget(fd_out);
+	if (!f_out.file)
+		goto out1;
+
+	ret = -EFAULT;
+	if (off_in) {
+		if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_in = f_in.file->f_pos;
+	}
+
+	if (off_out) {
+		if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_out = f_out.file->f_pos;
+	}
+
+	ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+				  flags);
+	if (ret > 0) {
+		pos_in += ret;
+		pos_out += ret;
+
+		if (off_in) {
+			if (copy_to_user(off_in, &pos_in, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_in.file->f_pos = pos_in;
+		}
+
+		if (off_out) {
+			if (copy_to_user(off_out, &pos_out, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_out.file->f_pos = pos_out;
+		}
+	}
+
+out:
+	fdput(f_in);
+out1:
+	fdput(f_out);
+out2:
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa5142..e8a7362 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1629,6 +1629,7 @@ struct file_operations {
 #ifndef CONFIG_MMU
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
+	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
 };
 
 struct inode_operations {
@@ -1680,6 +1681,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
+extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
+				   loff_t, size_t, unsigned int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c2b66a2..185815c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -886,6 +886,9 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *envp, int flags);
 
 asmlinkage long sys_membarrier(int cmd, int flags);
+asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
+				    int fd_out, loff_t __user *off_out,
+				    size_t len, unsigned int flags);
 
 asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
 
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1324b02..2622b33 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -715,9 +715,11 @@ __SYSCALL(__NR_userfaultfd, sys_userfaultfd)
 __SYSCALL(__NR_membarrier, sys_membarrier)
 #define __NR_mlock2 284
 __SYSCALL(__NR_mlock2, sys_mlock2)
+#define __NR_copy_file_range 285
+__SYSCALL(__NR_copy_file_range, sys_copy_file_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 285
+#define __NR_syscalls 286
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0623787..2c5e3a8 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,6 +174,7 @@ cond_syscall(sys_setfsuid);
 cond_syscall(sys_setfsgid);
 cond_syscall(sys_capget);
 cond_syscall(sys_capset);
+cond_syscall(sys_copy_file_range);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 1/9] vfs: add copy_file_range syscall and vfs helper
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david-FqsqvQoI3Ljby3iVrkZq2A, darrick.wong-QHcLZuEGTsvQT0dZR+AlfA
  Cc: Zach Brown, linux-api-u79uwXL29TY76Z2rM5mHXA,
	xfs-VZNHf3L845pBDgjK7y7TUQ, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Christoph Hellwig, Anna Schumaker

>From : Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
>From : Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file).  This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification,
                 Change flags parameter from int to unsigned int,
                 Add function to include/linux/syscalls.h,
                 Check copy len after file open mode,
                 Don't forbid ranges inside the same file,
                 Use rw_verify_area() to veriy ranges,
                 Use file_out rather than file_in,
                 Add COPY_FR_REFLINK flag]
Signed-off-by: Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 fs/read_write.c                   |  120 +++++++++++++++++++++++++++++++++++++
 include/linux/fs.h                |    3 +
 include/linux/syscalls.h          |    3 +
 include/uapi/asm-generic/unistd.h |    4 +
 kernel/sys_ni.c                   |    1 
 5 files changed, 130 insertions(+), 1 deletion(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..97c15ca 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/pagemap.h>
 #include <linux/splice.h>
 #include <linux/compat.h>
+#include <linux/mount.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -1327,3 +1328,122 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 	return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success.  When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			    struct file *file_out, loff_t pos_out,
+			    size_t len, unsigned int flags)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	ssize_t ret;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0)
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+	if (ret < 0)
+		return ret;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_out->f_op->copy_file_range)
+		return -EBADF;
+
+	/* this could be relaxed once a method supports cross-fs copies */
+	if (inode_in->i_sb != inode_out->i_sb)
+		return -EXDEV;
+
+	if (len == 0)
+		return 0;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+					      len, flags);
+	if (ret > 0) {
+		fsnotify_access(file_in);
+		add_rchar(current, ret);
+		fsnotify_modify(file_out);
+		add_wchar(current, ret);
+	}
+	inc_syscr(current);
+	inc_syscw(current);
+
+	mnt_drop_write_file(file_out);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+		int, fd_out, loff_t __user *, off_out,
+		size_t, len, unsigned int, flags)
+{
+	loff_t pos_in;
+	loff_t pos_out;
+	struct fd f_in;
+	struct fd f_out;
+	ssize_t ret = -EBADF;
+
+	f_in = fdget(fd_in);
+	if (!f_in.file)
+		goto out2;
+
+	f_out = fdget(fd_out);
+	if (!f_out.file)
+		goto out1;
+
+	ret = -EFAULT;
+	if (off_in) {
+		if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_in = f_in.file->f_pos;
+	}
+
+	if (off_out) {
+		if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_out = f_out.file->f_pos;
+	}
+
+	ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+				  flags);
+	if (ret > 0) {
+		pos_in += ret;
+		pos_out += ret;
+
+		if (off_in) {
+			if (copy_to_user(off_in, &pos_in, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_in.file->f_pos = pos_in;
+		}
+
+		if (off_out) {
+			if (copy_to_user(off_out, &pos_out, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_out.file->f_pos = pos_out;
+		}
+	}
+
+out:
+	fdput(f_in);
+out1:
+	fdput(f_out);
+out2:
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa5142..e8a7362 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1629,6 +1629,7 @@ struct file_operations {
 #ifndef CONFIG_MMU
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
+	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
 };
 
 struct inode_operations {
@@ -1680,6 +1681,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
+extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
+				   loff_t, size_t, unsigned int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c2b66a2..185815c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -886,6 +886,9 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *envp, int flags);
 
 asmlinkage long sys_membarrier(int cmd, int flags);
+asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
+				    int fd_out, loff_t __user *off_out,
+				    size_t len, unsigned int flags);
 
 asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
 
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1324b02..2622b33 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -715,9 +715,11 @@ __SYSCALL(__NR_userfaultfd, sys_userfaultfd)
 __SYSCALL(__NR_membarrier, sys_membarrier)
 #define __NR_mlock2 284
 __SYSCALL(__NR_mlock2, sys_mlock2)
+#define __NR_copy_file_range 285
+__SYSCALL(__NR_copy_file_range, sys_copy_file_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 285
+#define __NR_syscalls 286
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0623787..2c5e3a8 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,6 +174,7 @@ cond_syscall(sys_setfsuid);
 cond_syscall(sys_setfsgid);
 cond_syscall(sys_capget);
 cond_syscall(sys_capset);
+cond_syscall(sys_copy_file_range);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 2/9] x86: add sys_copy_file_range to syscall tables
  2015-12-19  8:55 ` Darrick J. Wong
@ 2015-12-19  8:55   ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: Zach Brown, linux-api, xfs, linux-fsdevel, Christoph Hellwig,
	Anna Schumaker

Add sys_copy_file_range to the x86 syscall tables.

>>From : Anna Schumaker <Anna.Schumaker@Netapp.com>
>>From : Zach Brown <zab@redhat.com>

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 2 files changed, 2 insertions(+)


diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f17705e..cb713df 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -383,3 +383,4 @@
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
 376	i386	mlock2			sys_mlock2
+377	i386	copy_file_range		sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 314a90b..dc1040a 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -332,6 +332,7 @@
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
 325	common	mlock2			sys_mlock2
+326	common	copy_file_range		sys_copy_file_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 2/9] x86: add sys_copy_file_range to syscall tables
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: Zach Brown, linux-api, Christoph Hellwig, linux-fsdevel, xfs,
	Anna Schumaker

Add sys_copy_file_range to the x86 syscall tables.

>From : Anna Schumaker <Anna.Schumaker@Netapp.com>
>From : Zach Brown <zab@redhat.com>

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 2 files changed, 2 insertions(+)


diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f17705e..cb713df 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -383,3 +383,4 @@
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
 376	i386	mlock2			sys_mlock2
+377	i386	copy_file_range		sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 314a90b..dc1040a 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -332,6 +332,7 @@
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
 325	common	mlock2			sys_mlock2
+326	common	copy_file_range		sys_copy_file_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 3/9] btrfs: add .copy_file_range file operation
  2015-12-19  8:55 ` Darrick J. Wong
  (?)
@ 2015-12-19  8:55   ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: Zach Brown, Josef Bacik, linux-api, xfs, linux-fsdevel,
	David Sterba, Christoph Hellwig, Anna Schumaker

>>From : Anna Schumaker <Anna.Schumaker@Netapp.com>
>>From : Zach Brown <zab@redhat.com>

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Make flags an unsigned int,
                 Check for COPY_FR_REFLINK]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ctree.h |    3 ++
 fs/btrfs/file.c  |    1 +
 fs/btrfs/ioctl.c |   91 +++++++++++++++++++++++++++++++-----------------------
 3 files changed, 56 insertions(+), 39 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 35489e7..ede7277 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4055,6 +4055,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 		      loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 72e7346..e67fe6a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2924,6 +2924,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index da94138..0f92735 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3779,17 +3779,16 @@ out:
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+					u64 off, u64 olen, u64 destoff)
 {
 	struct inode *inode = file_inode(file);
+	struct inode *src = file_inode(file_src);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct fd src_file;
-	struct inode *src;
 	int ret;
 	u64 len = olen;
 	u64 bs = root->fs_info->sb->s_blocksize;
-	int same_inode = 0;
+	int same_inode = src == inode;
 
 	/*
 	 * TODO:
@@ -3802,49 +3801,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	 *   be either compressed or non-compressed.
 	 */
 
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != file->f_path.mnt)
-		goto out_fput;
-
-	src = file_inode(src_file.file);
-
-	ret = -EINVAL;
-	if (src == inode)
-		same_inode = 1;
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
+	if (file_src->f_path.mnt != file->f_path.mnt ||
+	    src->i_sb != inode->i_sb)
+		return -EXDEV;
 
 	/* don't make the dst file partly checksummed */
 	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
 	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto out_fput;
+		return -EINVAL;
 
-	ret = -EISDIR;
 	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src->i_sb != inode->i_sb)
-		goto out_fput;
+		return -EISDIR;
 
 	if (!same_inode) {
 		btrfs_double_inode_lock(src, inode);
@@ -3921,6 +3891,49 @@ out_unlock:
 		btrfs_double_inode_unlock(src, inode);
 	else
 		mutex_unlock(&src->i_mutex);
+	return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags)
+{
+	ssize_t ret;
+
+	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+	if (ret == 0)
+		ret = len;
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+				       u64 off, u64 olen, u64 destoff)
+{
+	struct fd src_file;
+	int ret;
+
+	/* the destination must be opened for writing */
+	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		ret = -EBADF;
+		goto out_drop_write;
+	}
+
+	/* the src must be open for reading */
+	if (!(src_file.file->f_mode & FMODE_READ)) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
+
 out_fput:
 	fdput(src_file);
 out_drop_write:


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 3/9] btrfs: add .copy_file_range file operation
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: Zach Brown, Josef Bacik, linux-api, xfs, David Sterba,
	Anna Schumaker, linux-fsdevel, Christoph Hellwig

>From : Anna Schumaker <Anna.Schumaker@Netapp.com>
>From : Zach Brown <zab@redhat.com>

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Make flags an unsigned int,
                 Check for COPY_FR_REFLINK]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ctree.h |    3 ++
 fs/btrfs/file.c  |    1 +
 fs/btrfs/ioctl.c |   91 +++++++++++++++++++++++++++++++-----------------------
 3 files changed, 56 insertions(+), 39 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 35489e7..ede7277 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4055,6 +4055,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 		      loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 72e7346..e67fe6a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2924,6 +2924,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index da94138..0f92735 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3779,17 +3779,16 @@ out:
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+					u64 off, u64 olen, u64 destoff)
 {
 	struct inode *inode = file_inode(file);
+	struct inode *src = file_inode(file_src);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct fd src_file;
-	struct inode *src;
 	int ret;
 	u64 len = olen;
 	u64 bs = root->fs_info->sb->s_blocksize;
-	int same_inode = 0;
+	int same_inode = src == inode;
 
 	/*
 	 * TODO:
@@ -3802,49 +3801,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	 *   be either compressed or non-compressed.
 	 */
 
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != file->f_path.mnt)
-		goto out_fput;
-
-	src = file_inode(src_file.file);
-
-	ret = -EINVAL;
-	if (src == inode)
-		same_inode = 1;
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
+	if (file_src->f_path.mnt != file->f_path.mnt ||
+	    src->i_sb != inode->i_sb)
+		return -EXDEV;
 
 	/* don't make the dst file partly checksummed */
 	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
 	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto out_fput;
+		return -EINVAL;
 
-	ret = -EISDIR;
 	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src->i_sb != inode->i_sb)
-		goto out_fput;
+		return -EISDIR;
 
 	if (!same_inode) {
 		btrfs_double_inode_lock(src, inode);
@@ -3921,6 +3891,49 @@ out_unlock:
 		btrfs_double_inode_unlock(src, inode);
 	else
 		mutex_unlock(&src->i_mutex);
+	return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags)
+{
+	ssize_t ret;
+
+	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+	if (ret == 0)
+		ret = len;
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+				       u64 off, u64 olen, u64 destoff)
+{
+	struct fd src_file;
+	int ret;
+
+	/* the destination must be opened for writing */
+	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		ret = -EBADF;
+		goto out_drop_write;
+	}
+
+	/* the src must be open for reading */
+	if (!(src_file.file->f_mode & FMODE_READ)) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
+
 out_fput:
 	fdput(src_file);
 out_drop_write:

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 3/9] btrfs: add .copy_file_range file operation
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david-FqsqvQoI3Ljby3iVrkZq2A, darrick.wong-QHcLZuEGTsvQT0dZR+AlfA
  Cc: Zach Brown, Josef Bacik, linux-api-u79uwXL29TY76Z2rM5mHXA,
	xfs-VZNHf3L845pBDgjK7y7TUQ, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	David Sterba, Christoph Hellwig, Anna Schumaker

>From : Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
>From : Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
[Anna Schumaker: Make flags an unsigned int,
                 Check for COPY_FR_REFLINK]
Signed-off-by: Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Josef Bacik <jbacik-b10kYP2dOMg@public.gmane.org>
Reviewed-by: David Sterba <dsterba-IBi9RG/b67k@public.gmane.org>
Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 fs/btrfs/ctree.h |    3 ++
 fs/btrfs/file.c  |    1 +
 fs/btrfs/ioctl.c |   91 +++++++++++++++++++++++++++++++-----------------------
 3 files changed, 56 insertions(+), 39 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 35489e7..ede7277 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4055,6 +4055,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 		      loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 72e7346..e67fe6a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2924,6 +2924,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index da94138..0f92735 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3779,17 +3779,16 @@ out:
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+					u64 off, u64 olen, u64 destoff)
 {
 	struct inode *inode = file_inode(file);
+	struct inode *src = file_inode(file_src);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct fd src_file;
-	struct inode *src;
 	int ret;
 	u64 len = olen;
 	u64 bs = root->fs_info->sb->s_blocksize;
-	int same_inode = 0;
+	int same_inode = src == inode;
 
 	/*
 	 * TODO:
@@ -3802,49 +3801,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	 *   be either compressed or non-compressed.
 	 */
 
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != file->f_path.mnt)
-		goto out_fput;
-
-	src = file_inode(src_file.file);
-
-	ret = -EINVAL;
-	if (src == inode)
-		same_inode = 1;
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
+	if (file_src->f_path.mnt != file->f_path.mnt ||
+	    src->i_sb != inode->i_sb)
+		return -EXDEV;
 
 	/* don't make the dst file partly checksummed */
 	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
 	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto out_fput;
+		return -EINVAL;
 
-	ret = -EISDIR;
 	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src->i_sb != inode->i_sb)
-		goto out_fput;
+		return -EISDIR;
 
 	if (!same_inode) {
 		btrfs_double_inode_lock(src, inode);
@@ -3921,6 +3891,49 @@ out_unlock:
 		btrfs_double_inode_unlock(src, inode);
 	else
 		mutex_unlock(&src->i_mutex);
+	return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags)
+{
+	ssize_t ret;
+
+	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+	if (ret == 0)
+		ret = len;
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+				       u64 off, u64 olen, u64 destoff)
+{
+	struct fd src_file;
+	int ret;
+
+	/* the destination must be opened for writing */
+	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		ret = -EBADF;
+		goto out_drop_write;
+	}
+
+	/* the src must be open for reading */
+	if (!(src_file.file->f_mode & FMODE_READ)) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
+
 out_fput:
 	fdput(src_file);
 out_drop_write:

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 4/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-12-19  8:55 ` Darrick J. Wong
  (?)
@ 2015-12-19  8:55   ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-api, Christoph Hellwig, xfs, linux-fsdevel, Padraig Brady,
	Anna Schumaker

>>From : Anna Schumaker <Anna.Schumaker@Netapp.com>

This allows us to have an in-kernel copy mechanism that avoids frequent
switches between kernel and user space.  This is especially useful so
NFSD can support server-side copies.

The default (flags=0) means to first attempt copy acceleration, but use
the pagecache if that fails.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Padraig Brady <P@draigBrady.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 97c15ca..6f73af1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1354,8 +1354,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
-	    (file_out->f_flags & O_APPEND) ||
-	    !file_out->f_op->copy_file_range)
+	    (file_out->f_flags & O_APPEND))
 		return -EBADF;
 
 	/* this could be relaxed once a method supports cross-fs copies */
@@ -1369,8 +1368,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					      len, flags);
+	ret = -EOPNOTSUPP;
+	if (file_out->f_op->copy_file_range)
+		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+						      pos_out, len, flags);
+	if (ret == -EOPNOTSUPP)
+		ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
+				len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
+
 	if (ret > 0) {
 		fsnotify_access(file_in);
 		add_rchar(current, ret);


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 4/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-api, Christoph Hellwig, linux-fsdevel, xfs, Padraig Brady,
	Anna Schumaker

>From : Anna Schumaker <Anna.Schumaker@Netapp.com>

This allows us to have an in-kernel copy mechanism that avoids frequent
switches between kernel and user space.  This is especially useful so
NFSD can support server-side copies.

The default (flags=0) means to first attempt copy acceleration, but use
the pagecache if that fails.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Padraig Brady <P@draigBrady.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 97c15ca..6f73af1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1354,8 +1354,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
-	    (file_out->f_flags & O_APPEND) ||
-	    !file_out->f_op->copy_file_range)
+	    (file_out->f_flags & O_APPEND))
 		return -EBADF;
 
 	/* this could be relaxed once a method supports cross-fs copies */
@@ -1369,8 +1368,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					      len, flags);
+	ret = -EOPNOTSUPP;
+	if (file_out->f_op->copy_file_range)
+		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+						      pos_out, len, flags);
+	if (ret == -EOPNOTSUPP)
+		ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
+				len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
+
 	if (ret > 0) {
 		fsnotify_access(file_in);
 		add_rchar(current, ret);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 4/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-api, Christoph Hellwig, xfs, linux-fsdevel, Padraig Brady,
	Anna Schumaker

>From : Anna Schumaker <Anna.Schumaker@Netapp.com>

This allows us to have an in-kernel copy mechanism that avoids frequent
switches between kernel and user space.  This is especially useful so
NFSD can support server-side copies.

The default (flags=0) means to first attempt copy acceleration, but use
the pagecache if that fails.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Padraig Brady <P@draigBrady.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 97c15ca..6f73af1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1354,8 +1354,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
-	    (file_out->f_flags & O_APPEND) ||
-	    !file_out->f_op->copy_file_range)
+	    (file_out->f_flags & O_APPEND))
 		return -EBADF;
 
 	/* this could be relaxed once a method supports cross-fs copies */
@@ -1369,8 +1368,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					      len, flags);
+	ret = -EOPNOTSUPP;
+	if (file_out->f_op->copy_file_range)
+		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+						      pos_out, len, flags);
+	if (ret == -EOPNOTSUPP)
+		ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
+				len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
+
 	if (ret > 0) {
 		fsnotify_access(file_in);
 		add_rchar(current, ret);


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 5/9] locks: new locks_mandatory_area calling convention
  2015-12-19  8:55 ` Darrick J. Wong
  (?)
@ 2015-12-19  8:55   ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: J. Bruce Fields, linux-api, linux-fsdevel, Christoph Hellwig, xfs

>>From : Christoph Hellwig <hch@lst.de>

Pass a loff_t end for the last byte instead of the 32-bit count
parameter to allow full file clones even on 32-bit architectures.
While we're at it also drop the pointless inode argument and simplify
the read/write selection.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
---
 fs/locks.c         |   22 +++++++++-------------
 fs/read_write.c    |    5 ++---
 include/linux/fs.h |   28 +++++++++++++---------------
 3 files changed, 24 insertions(+), 31 deletions(-)


diff --git a/fs/locks.c b/fs/locks.c
index 0d2b326..ab2ea2e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1227,21 +1227,17 @@ int locks_mandatory_locked(struct file *file)
 
 /**
  * locks_mandatory_area - Check for a conflicting lock
- * @read_write: %FLOCK_VERIFY_WRITE for exclusive access, %FLOCK_VERIFY_READ
- *		for shared
- * @inode:      the file to check
  * @filp:       how the file was opened (if it was)
- * @offset:     start of area to check
- * @count:      length of area to check
+ * @start:	first byte in the file to check
+ * @end:	lastbyte in the file to check
+ * @type:	%F_WRLCK for a write lock, else %F_RDLCK
  *
  * Searches the inode's list of locks to find any POSIX locks which conflict.
- * This function is called from rw_verify_area() and
- * locks_verify_truncate().
  */
-int locks_mandatory_area(int read_write, struct inode *inode,
-			 struct file *filp, loff_t offset,
-			 size_t count)
+int locks_mandatory_area(struct file *filp, loff_t start, loff_t end,
+		unsigned char type)
 {
+	struct inode *inode = file_inode(filp);
 	struct file_lock fl;
 	int error;
 	bool sleep = false;
@@ -1252,9 +1248,9 @@ int locks_mandatory_area(int read_write, struct inode *inode,
 	fl.fl_flags = FL_POSIX | FL_ACCESS;
 	if (filp && !(filp->f_flags & O_NONBLOCK))
 		sleep = true;
-	fl.fl_type = (read_write == FLOCK_VERIFY_WRITE) ? F_WRLCK : F_RDLCK;
-	fl.fl_start = offset;
-	fl.fl_end = offset + count - 1;
+	fl.fl_type = type;
+	fl.fl_start = start;
+	fl.fl_end = end;
 
 	for (;;) {
 		if (filp) {
diff --git a/fs/read_write.c b/fs/read_write.c
index 6f73af1..e4c09dc 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -396,9 +396,8 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	}
 
 	if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
-		retval = locks_mandatory_area(
-			read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE,
-			inode, file, pos, count);
+		retval = locks_mandatory_area(file, pos, pos + count - 1,
+				read_write == READ ? F_RDLCK : F_WRLCK);
 		if (retval < 0)
 			return retval;
 	}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e8a7362..73874cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2030,12 +2030,9 @@ extern struct kobject *fs_kobj;
 
 #define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK)
 
-#define FLOCK_VERIFY_READ  1
-#define FLOCK_VERIFY_WRITE 2
-
 #ifdef CONFIG_FILE_LOCKING
 extern int locks_mandatory_locked(struct file *);
-extern int locks_mandatory_area(int, struct inode *, struct file *, loff_t, size_t);
+extern int locks_mandatory_area(struct file *, loff_t, loff_t, unsigned char);
 
 /*
  * Candidates for mandatory locking have the setgid bit set
@@ -2068,14 +2065,16 @@ static inline int locks_verify_truncate(struct inode *inode,
 				    struct file *filp,
 				    loff_t size)
 {
-	if (inode->i_flctx && mandatory_lock(inode))
-		return locks_mandatory_area(
-			FLOCK_VERIFY_WRITE, inode, filp,
-			size < inode->i_size ? size : inode->i_size,
-			(size < inode->i_size ? inode->i_size - size
-			 : size - inode->i_size)
-		);
-	return 0;
+	if (!inode->i_flctx || !mandatory_lock(inode))
+		return 0;
+
+	if (size < inode->i_size) {
+		return locks_mandatory_area(filp, size, inode->i_size - 1,
+				F_WRLCK);
+	} else {
+		return locks_mandatory_area(filp, inode->i_size, size - 1,
+				F_WRLCK);
+	}
 }
 
 static inline int break_lease(struct inode *inode, unsigned int mode)
@@ -2144,9 +2143,8 @@ static inline int locks_mandatory_locked(struct file *file)
 	return 0;
 }
 
-static inline int locks_mandatory_area(int rw, struct inode *inode,
-				       struct file *filp, loff_t offset,
-				       size_t count)
+static inline int locks_mandatory_area(struct file *filp, loff_t start,
+		loff_t end, unsigned char type)
 {
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 5/9] locks: new locks_mandatory_area calling convention
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: J. Bruce Fields, linux-api, linux-fsdevel, Christoph Hellwig, xfs

>From : Christoph Hellwig <hch@lst.de>

Pass a loff_t end for the last byte instead of the 32-bit count
parameter to allow full file clones even on 32-bit architectures.
While we're at it also drop the pointless inode argument and simplify
the read/write selection.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
---
 fs/locks.c         |   22 +++++++++-------------
 fs/read_write.c    |    5 ++---
 include/linux/fs.h |   28 +++++++++++++---------------
 3 files changed, 24 insertions(+), 31 deletions(-)


diff --git a/fs/locks.c b/fs/locks.c
index 0d2b326..ab2ea2e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1227,21 +1227,17 @@ int locks_mandatory_locked(struct file *file)
 
 /**
  * locks_mandatory_area - Check for a conflicting lock
- * @read_write: %FLOCK_VERIFY_WRITE for exclusive access, %FLOCK_VERIFY_READ
- *		for shared
- * @inode:      the file to check
  * @filp:       how the file was opened (if it was)
- * @offset:     start of area to check
- * @count:      length of area to check
+ * @start:	first byte in the file to check
+ * @end:	lastbyte in the file to check
+ * @type:	%F_WRLCK for a write lock, else %F_RDLCK
  *
  * Searches the inode's list of locks to find any POSIX locks which conflict.
- * This function is called from rw_verify_area() and
- * locks_verify_truncate().
  */
-int locks_mandatory_area(int read_write, struct inode *inode,
-			 struct file *filp, loff_t offset,
-			 size_t count)
+int locks_mandatory_area(struct file *filp, loff_t start, loff_t end,
+		unsigned char type)
 {
+	struct inode *inode = file_inode(filp);
 	struct file_lock fl;
 	int error;
 	bool sleep = false;
@@ -1252,9 +1248,9 @@ int locks_mandatory_area(int read_write, struct inode *inode,
 	fl.fl_flags = FL_POSIX | FL_ACCESS;
 	if (filp && !(filp->f_flags & O_NONBLOCK))
 		sleep = true;
-	fl.fl_type = (read_write == FLOCK_VERIFY_WRITE) ? F_WRLCK : F_RDLCK;
-	fl.fl_start = offset;
-	fl.fl_end = offset + count - 1;
+	fl.fl_type = type;
+	fl.fl_start = start;
+	fl.fl_end = end;
 
 	for (;;) {
 		if (filp) {
diff --git a/fs/read_write.c b/fs/read_write.c
index 6f73af1..e4c09dc 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -396,9 +396,8 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	}
 
 	if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
-		retval = locks_mandatory_area(
-			read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE,
-			inode, file, pos, count);
+		retval = locks_mandatory_area(file, pos, pos + count - 1,
+				read_write == READ ? F_RDLCK : F_WRLCK);
 		if (retval < 0)
 			return retval;
 	}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e8a7362..73874cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2030,12 +2030,9 @@ extern struct kobject *fs_kobj;
 
 #define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK)
 
-#define FLOCK_VERIFY_READ  1
-#define FLOCK_VERIFY_WRITE 2
-
 #ifdef CONFIG_FILE_LOCKING
 extern int locks_mandatory_locked(struct file *);
-extern int locks_mandatory_area(int, struct inode *, struct file *, loff_t, size_t);
+extern int locks_mandatory_area(struct file *, loff_t, loff_t, unsigned char);
 
 /*
  * Candidates for mandatory locking have the setgid bit set
@@ -2068,14 +2065,16 @@ static inline int locks_verify_truncate(struct inode *inode,
 				    struct file *filp,
 				    loff_t size)
 {
-	if (inode->i_flctx && mandatory_lock(inode))
-		return locks_mandatory_area(
-			FLOCK_VERIFY_WRITE, inode, filp,
-			size < inode->i_size ? size : inode->i_size,
-			(size < inode->i_size ? inode->i_size - size
-			 : size - inode->i_size)
-		);
-	return 0;
+	if (!inode->i_flctx || !mandatory_lock(inode))
+		return 0;
+
+	if (size < inode->i_size) {
+		return locks_mandatory_area(filp, size, inode->i_size - 1,
+				F_WRLCK);
+	} else {
+		return locks_mandatory_area(filp, inode->i_size, size - 1,
+				F_WRLCK);
+	}
 }
 
 static inline int break_lease(struct inode *inode, unsigned int mode)
@@ -2144,9 +2143,8 @@ static inline int locks_mandatory_locked(struct file *file)
 	return 0;
 }
 
-static inline int locks_mandatory_area(int rw, struct inode *inode,
-				       struct file *filp, loff_t offset,
-				       size_t count)
+static inline int locks_mandatory_area(struct file *filp, loff_t start,
+		loff_t end, unsigned char type)
 {
 	return 0;
 }

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 5/9] locks: new locks_mandatory_area calling convention
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david-FqsqvQoI3Ljby3iVrkZq2A, darrick.wong-QHcLZuEGTsvQT0dZR+AlfA
  Cc: J. Bruce Fields, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	xfs-VZNHf3L845pBDgjK7y7TUQ

>From : Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>

Pass a loff_t end for the last byte instead of the 32-bit count
parameter to allow full file clones even on 32-bit architectures.
While we're at it also drop the pointless inode argument and simplify
the read/write selection.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Acked-by: J. Bruce Fields <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
---
 fs/locks.c         |   22 +++++++++-------------
 fs/read_write.c    |    5 ++---
 include/linux/fs.h |   28 +++++++++++++---------------
 3 files changed, 24 insertions(+), 31 deletions(-)


diff --git a/fs/locks.c b/fs/locks.c
index 0d2b326..ab2ea2e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1227,21 +1227,17 @@ int locks_mandatory_locked(struct file *file)
 
 /**
  * locks_mandatory_area - Check for a conflicting lock
- * @read_write: %FLOCK_VERIFY_WRITE for exclusive access, %FLOCK_VERIFY_READ
- *		for shared
- * @inode:      the file to check
  * @filp:       how the file was opened (if it was)
- * @offset:     start of area to check
- * @count:      length of area to check
+ * @start:	first byte in the file to check
+ * @end:	lastbyte in the file to check
+ * @type:	%F_WRLCK for a write lock, else %F_RDLCK
  *
  * Searches the inode's list of locks to find any POSIX locks which conflict.
- * This function is called from rw_verify_area() and
- * locks_verify_truncate().
  */
-int locks_mandatory_area(int read_write, struct inode *inode,
-			 struct file *filp, loff_t offset,
-			 size_t count)
+int locks_mandatory_area(struct file *filp, loff_t start, loff_t end,
+		unsigned char type)
 {
+	struct inode *inode = file_inode(filp);
 	struct file_lock fl;
 	int error;
 	bool sleep = false;
@@ -1252,9 +1248,9 @@ int locks_mandatory_area(int read_write, struct inode *inode,
 	fl.fl_flags = FL_POSIX | FL_ACCESS;
 	if (filp && !(filp->f_flags & O_NONBLOCK))
 		sleep = true;
-	fl.fl_type = (read_write == FLOCK_VERIFY_WRITE) ? F_WRLCK : F_RDLCK;
-	fl.fl_start = offset;
-	fl.fl_end = offset + count - 1;
+	fl.fl_type = type;
+	fl.fl_start = start;
+	fl.fl_end = end;
 
 	for (;;) {
 		if (filp) {
diff --git a/fs/read_write.c b/fs/read_write.c
index 6f73af1..e4c09dc 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -396,9 +396,8 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	}
 
 	if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
-		retval = locks_mandatory_area(
-			read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE,
-			inode, file, pos, count);
+		retval = locks_mandatory_area(file, pos, pos + count - 1,
+				read_write == READ ? F_RDLCK : F_WRLCK);
 		if (retval < 0)
 			return retval;
 	}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e8a7362..73874cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2030,12 +2030,9 @@ extern struct kobject *fs_kobj;
 
 #define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK)
 
-#define FLOCK_VERIFY_READ  1
-#define FLOCK_VERIFY_WRITE 2
-
 #ifdef CONFIG_FILE_LOCKING
 extern int locks_mandatory_locked(struct file *);
-extern int locks_mandatory_area(int, struct inode *, struct file *, loff_t, size_t);
+extern int locks_mandatory_area(struct file *, loff_t, loff_t, unsigned char);
 
 /*
  * Candidates for mandatory locking have the setgid bit set
@@ -2068,14 +2065,16 @@ static inline int locks_verify_truncate(struct inode *inode,
 				    struct file *filp,
 				    loff_t size)
 {
-	if (inode->i_flctx && mandatory_lock(inode))
-		return locks_mandatory_area(
-			FLOCK_VERIFY_WRITE, inode, filp,
-			size < inode->i_size ? size : inode->i_size,
-			(size < inode->i_size ? inode->i_size - size
-			 : size - inode->i_size)
-		);
-	return 0;
+	if (!inode->i_flctx || !mandatory_lock(inode))
+		return 0;
+
+	if (size < inode->i_size) {
+		return locks_mandatory_area(filp, size, inode->i_size - 1,
+				F_WRLCK);
+	} else {
+		return locks_mandatory_area(filp, inode->i_size, size - 1,
+				F_WRLCK);
+	}
 }
 
 static inline int break_lease(struct inode *inode, unsigned int mode)
@@ -2144,9 +2143,8 @@ static inline int locks_mandatory_locked(struct file *file)
 	return 0;
 }
 
-static inline int locks_mandatory_area(int rw, struct inode *inode,
-				       struct file *filp, loff_t offset,
-				       size_t count)
+static inline int locks_mandatory_area(struct file *filp, loff_t start,
+		loff_t end, unsigned char type)
 {
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 6/9] vfs: pull btrfs clone API to vfs layer
  2015-12-19  8:55 ` Darrick J. Wong
  (?)
@ 2015-12-19  8:55   ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, linux-api, Christoph Hellwig, xfs

>>From : Christoph Hellwig <hch@lst.de>

The btrfs clone ioctls are now adopted by other file systems, with NFS
and CIFS already having support for them, and XFS being under active
development.  To avoid growth of various slightly incompatible
implementations, add one to the VFS.  Note that clones are different from
file copies in several ways:

 - they are atomic vs other writers
 - they support whole file clones
 - they support 64-bit legth clones
 - they do not allow partial success (aka short writes)
 - clones are expected to be a fast metadata operation

Because of that it would be rather cumbersome to try to piggyback them on
top of the recent clone_file_range infrastructure.  The converse isn't
true and the clone_file_range system call could try clone file range as
a first attempt to copy, something that further patches will enable.

Based on earlier work from Peng Tao.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ctree.h        |    3 +
 fs/btrfs/file.c         |    1 
 fs/btrfs/ioctl.c        |   49 +-----------------
 fs/cifs/cifsfs.c        |   63 ++++++++++++++++++++++++
 fs/cifs/cifsfs.h        |    1 
 fs/cifs/ioctl.c         |  126 ++++++++++++++++++++++-------------------------
 fs/ioctl.c              |   29 +++++++++++
 fs/nfs/nfs4file.c       |   87 ++++----------------------------
 fs/read_write.c         |   72 +++++++++++++++++++++++++++
 include/linux/fs.h      |    7 ++-
 include/uapi/linux/fs.h |    9 +++
 11 files changed, 254 insertions(+), 193 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ede7277..dd4733f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4025,7 +4025,6 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
 			       struct btrfs_ioctl_balance_args *bargs);
 
-
 /* file.c */
 int btrfs_auto_defrag_init(void);
 void btrfs_auto_defrag_exit(void);
@@ -4058,6 +4057,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			      struct file *file_out, loff_t pos_out,
 			      size_t len, unsigned int flags);
+int btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
+			   struct file *file_out, loff_t pos_out, u64 len);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e67fe6a..232e300 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2925,6 +2925,7 @@ const struct file_operations btrfs_file_operations = {
 	.compat_ioctl	= btrfs_ioctl,
 #endif
 	.copy_file_range = btrfs_copy_file_range,
+	.clone_file_range = btrfs_clone_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0f92735..85b1cae 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3906,49 +3906,10 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+int btrfs_clone_file_range(struct file *src_file, loff_t off,
+		struct file *dst_file, loff_t destoff, u64 len)
 {
-	struct fd src_file;
-	int ret;
-
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ)) {
-		ret = -EINVAL;
-		goto out_fput;
-	}
-
-	ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
-
-out_fput:
-	fdput(src_file);
-out_drop_write:
-	mnt_drop_write_file(file);
-	return ret;
-}
-
-static long btrfs_ioctl_clone_range(struct file *file, void __user *argp)
-{
-	struct btrfs_ioctl_clone_range_args args;
-
-	if (copy_from_user(&args, argp, sizeof(args)))
-		return -EFAULT;
-	return btrfs_ioctl_clone(file, args.src_fd, args.src_offset,
-				 args.src_length, args.dest_offset);
+	return btrfs_clone_files(dst_file, src_file, off, len, destoff);
 }
 
 /*
@@ -5498,10 +5459,6 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_dev_info(root, argp);
 	case BTRFS_IOC_BALANCE:
 		return btrfs_ioctl_balance(file, NULL);
-	case BTRFS_IOC_CLONE:
-		return btrfs_ioctl_clone(file, arg, 0, 0, 0);
-	case BTRFS_IOC_CLONE_RANGE:
-		return btrfs_ioctl_clone_range(file, argp);
 	case BTRFS_IOC_TRANS_START:
 		return btrfs_ioctl_trans_start(file);
 	case BTRFS_IOC_TRANS_END:
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index cbc0f4b..e9b978f 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -914,6 +914,61 @@ const struct inode_operations cifs_symlink_inode_ops = {
 #endif
 };
 
+static int cifs_clone_file_range(struct file *src_file, loff_t off,
+		struct file *dst_file, loff_t destoff, u64 len)
+{
+	struct inode *src_inode = file_inode(src_file);
+	struct inode *target_inode = file_inode(dst_file);
+	struct cifsFileInfo *smb_file_src = src_file->private_data;
+	struct cifsFileInfo *smb_file_target = dst_file->private_data;
+	struct cifs_tcon *src_tcon = tlink_tcon(smb_file_src->tlink);
+	struct cifs_tcon *target_tcon = tlink_tcon(smb_file_target->tlink);
+	unsigned int xid;
+	int rc;
+
+	cifs_dbg(FYI, "clone range\n");
+
+	xid = get_xid();
+
+	if (!src_file->private_data || !dst_file->private_data) {
+		rc = -EBADF;
+		cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
+		goto out;
+	}
+
+	/*
+	 * Note: cifs case is easier than btrfs since server responsible for
+	 * checks for proper open modes and file type and if it wants
+	 * server could even support copy of range where source = target
+	 */
+	lock_two_nondirectories(target_inode, src_inode);
+
+	if (len == 0)
+		len = src_inode->i_size - off;
+
+	cifs_dbg(FYI, "about to flush pages\n");
+	/* should we flush first and last page first */
+	truncate_inode_pages_range(&target_inode->i_data, destoff,
+				   PAGE_CACHE_ALIGN(destoff + len)-1);
+
+	if (target_tcon->ses->server->ops->duplicate_extents)
+		rc = target_tcon->ses->server->ops->duplicate_extents(xid,
+			smb_file_src, smb_file_target, off, len, destoff);
+	else
+		rc = -EOPNOTSUPP;
+
+	/* force revalidate of size and timestamps of target file now
+	   that target is updated on the server */
+	CIFS_I(target_inode)->time = 0;
+out_unlock:
+	/* although unlocking in the reverse order from locking is not
+	   strictly necessary here it is a little cleaner to be consistent */
+	unlock_two_nondirectories(src_inode, target_inode);
+out:
+	free_xid(xid);
+	return rc;
+}
+
 const struct file_operations cifs_file_ops = {
 	.read_iter = cifs_loose_read_iter,
 	.write_iter = cifs_file_write_iter,
@@ -926,6 +981,7 @@ const struct file_operations cifs_file_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -942,6 +998,8 @@ const struct file_operations cifs_file_strict_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -958,6 +1016,7 @@ const struct file_operations cifs_file_direct_ops = {
 	.mmap = cifs_file_mmap,
 	.splice_read = generic_file_splice_read,
 	.unlocked_ioctl  = cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.llseek = cifs_llseek,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
@@ -974,6 +1033,7 @@ const struct file_operations cifs_file_nobrl_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -989,6 +1049,7 @@ const struct file_operations cifs_file_strict_nobrl_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -1004,6 +1065,7 @@ const struct file_operations cifs_file_direct_nobrl_ops = {
 	.mmap = cifs_file_mmap,
 	.splice_read = generic_file_splice_read,
 	.unlocked_ioctl  = cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.llseek = cifs_llseek,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
@@ -1014,6 +1076,7 @@ const struct file_operations cifs_dir_ops = {
 	.release = cifs_closedir,
 	.read    = generic_read_dir,
 	.unlocked_ioctl  = cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.llseek = generic_file_llseek,
 };
 
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index c3cc160..c399513 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -131,7 +131,6 @@ extern int	cifs_setxattr(struct dentry *, const char *, const void *,
 extern ssize_t	cifs_getxattr(struct dentry *, const char *, void *, size_t);
 extern ssize_t	cifs_listxattr(struct dentry *, char *, size_t);
 extern long cifs_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
-
 #ifdef CONFIG_CIFS_NFSD_EXPORT
 extern const struct export_operations cifs_export_ops;
 #endif /* CONFIG_CIFS_NFSD_EXPORT */
diff --git a/fs/cifs/ioctl.c b/fs/cifs/ioctl.c
index 35cf990..7a3b84e 100644
--- a/fs/cifs/ioctl.c
+++ b/fs/cifs/ioctl.c
@@ -34,73 +34,36 @@
 #include "cifs_ioctl.h"
 #include <linux/btrfs.h>
 
-static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
-			unsigned long srcfd, u64 off, u64 len, u64 destoff,
-			bool dup_extents)
+static int cifs_file_clone_range(unsigned int xid, struct file *src_file,
+			  struct file *dst_file)
 {
-	int rc;
-	struct cifsFileInfo *smb_file_target = dst_file->private_data;
+	struct inode *src_inode = file_inode(src_file);
 	struct inode *target_inode = file_inode(dst_file);
-	struct cifs_tcon *target_tcon;
-	struct fd src_file;
 	struct cifsFileInfo *smb_file_src;
-	struct inode *src_inode;
+	struct cifsFileInfo *smb_file_target;
 	struct cifs_tcon *src_tcon;
+	struct cifs_tcon *target_tcon;
+	int rc;
 
 	cifs_dbg(FYI, "ioctl clone range\n");
-	/* the destination must be opened for writing */
-	if (!(dst_file->f_mode & FMODE_WRITE)) {
-		cifs_dbg(FYI, "file target not open for write\n");
-		return -EINVAL;
-	}
 
-	/* check if target volume is readonly and take reference */
-	rc = mnt_want_write_file(dst_file);
-	if (rc) {
-		cifs_dbg(FYI, "mnt_want_write failed with rc %d\n", rc);
-		return rc;
-	}
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		rc = -EBADF;
-		goto out_drop_write;
-	}
-
-	if (src_file.file->f_op->unlocked_ioctl != cifs_ioctl) {
-		rc = -EBADF;
-		cifs_dbg(VFS, "src file seems to be from a different filesystem type\n");
-		goto out_fput;
-	}
-
-	if ((!src_file.file->private_data) || (!dst_file->private_data)) {
+	if (!src_file->private_data || !dst_file->private_data) {
 		rc = -EBADF;
 		cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
-		goto out_fput;
+		goto out;
 	}
 
 	rc = -EXDEV;
 	smb_file_target = dst_file->private_data;
-	smb_file_src = src_file.file->private_data;
+	smb_file_src = src_file->private_data;
 	src_tcon = tlink_tcon(smb_file_src->tlink);
 	target_tcon = tlink_tcon(smb_file_target->tlink);
 
-	/* check source and target on same server (or volume if dup_extents) */
-	if (dup_extents && (src_tcon != target_tcon)) {
-		cifs_dbg(VFS, "source and target of copy not on same share\n");
-		goto out_fput;
-	}
-
-	if (!dup_extents && (src_tcon->ses != target_tcon->ses)) {
+	if (src_tcon->ses != target_tcon->ses) {
 		cifs_dbg(VFS, "source and target of copy not on same server\n");
-		goto out_fput;
+		goto out;
 	}
 
-	src_inode = file_inode(src_file.file);
-	rc = -EINVAL;
-	if (S_ISDIR(src_inode->i_mode))
-		goto out_fput;
-
 	/*
 	 * Note: cifs case is easier than btrfs since server responsible for
 	 * checks for proper open modes and file type and if it wants
@@ -108,34 +71,66 @@ static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
 	 */
 	lock_two_nondirectories(target_inode, src_inode);
 
-	/* determine range to clone */
-	rc = -EINVAL;
-	if (off + len > src_inode->i_size || off + len < off)
-		goto out_unlock;
-	if (len == 0)
-		len = src_inode->i_size - off;
-
 	cifs_dbg(FYI, "about to flush pages\n");
 	/* should we flush first and last page first */
-	truncate_inode_pages_range(&target_inode->i_data, destoff,
-				   PAGE_CACHE_ALIGN(destoff + len)-1);
+	truncate_inode_pages(&target_inode->i_data, 0);
 
-	if (dup_extents && target_tcon->ses->server->ops->duplicate_extents)
-		rc = target_tcon->ses->server->ops->duplicate_extents(xid,
-			smb_file_src, smb_file_target, off, len, destoff);
-	else if (!dup_extents && target_tcon->ses->server->ops->clone_range)
+	if (target_tcon->ses->server->ops->clone_range)
 		rc = target_tcon->ses->server->ops->clone_range(xid,
-			smb_file_src, smb_file_target, off, len, destoff);
+			smb_file_src, smb_file_target, 0, src_inode->i_size, 0);
 	else
 		rc = -EOPNOTSUPP;
 
 	/* force revalidate of size and timestamps of target file now
 	   that target is updated on the server */
 	CIFS_I(target_inode)->time = 0;
-out_unlock:
 	/* although unlocking in the reverse order from locking is not
 	   strictly necessary here it is a little cleaner to be consistent */
 	unlock_two_nondirectories(src_inode, target_inode);
+out:
+	return rc;
+}
+
+static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
+			unsigned long srcfd)
+{
+	int rc;
+	struct fd src_file;
+	struct inode *src_inode;
+
+	cifs_dbg(FYI, "ioctl clone range\n");
+	/* the destination must be opened for writing */
+	if (!(dst_file->f_mode & FMODE_WRITE)) {
+		cifs_dbg(FYI, "file target not open for write\n");
+		return -EINVAL;
+	}
+
+	/* check if target volume is readonly and take reference */
+	rc = mnt_want_write_file(dst_file);
+	if (rc) {
+		cifs_dbg(FYI, "mnt_want_write failed with rc %d\n", rc);
+		return rc;
+	}
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		rc = -EBADF;
+		goto out_drop_write;
+	}
+
+	if (src_file.file->f_op->unlocked_ioctl != cifs_ioctl) {
+		rc = -EBADF;
+		cifs_dbg(VFS, "src file seems to be from a different filesystem type\n");
+		goto out_fput;
+	}
+
+	src_inode = file_inode(src_file.file);
+	rc = -EINVAL;
+	if (S_ISDIR(src_inode->i_mode))
+		goto out_fput;
+
+	rc = cifs_file_clone_range(xid, src_file.file, dst_file);
+
 out_fput:
 	fdput(src_file);
 out_drop_write:
@@ -256,10 +251,7 @@ long cifs_ioctl(struct file *filep, unsigned int command, unsigned long arg)
 			}
 			break;
 		case CIFS_IOC_COPYCHUNK_FILE:
-			rc = cifs_ioctl_clone(xid, filep, arg, 0, 0, 0, false);
-			break;
-		case BTRFS_IOC_CLONE:
-			rc = cifs_ioctl_clone(xid, filep, arg, 0, 0, 0, true);
+			rc = cifs_ioctl_clone(xid, filep, arg);
 			break;
 		case CIFS_IOC_SET_INTEGRITY:
 			if (pSMBFile == NULL)
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5d01d26..84c6e79 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -215,6 +215,29 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg)
 	return error;
 }
 
+static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd,
+			     u64 off, u64 olen, u64 destoff)
+{
+	struct fd src_file = fdget(srcfd);
+	int ret;
+
+	if (!src_file.file)
+		return -EBADF;
+	ret = vfs_clone_file_range(src_file.file, off, dst_file, destoff, olen);
+	fdput(src_file);
+	return ret;
+}
+
+static long ioctl_file_clone_range(struct file *file, void __user *argp)
+{
+	struct file_clone_range args;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+	return ioctl_file_clone(file, args.src_fd, args.src_offset,
+				args.src_length, args.dest_offset);
+}
+
 #ifdef CONFIG_BLOCK
 
 static inline sector_t logical_to_blk(struct inode *inode, loff_t offset)
@@ -600,6 +623,12 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FIGETBSZ:
 		return put_user(inode->i_sb->s_blocksize, argp);
 
+	case FICLONE:
+		return ioctl_file_clone(filp, arg, 0, 0, 0);
+
+	case FICLONERANGE:
+		return ioctl_file_clone_range(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index db9b5fe..26f9a23 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -195,65 +195,27 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
 	return nfs42_proc_allocate(filep, offset, len);
 }
 
-static noinline long
-nfs42_ioctl_clone(struct file *dst_file, unsigned long srcfd,
-		  u64 src_off, u64 dst_off, u64 count)
+static int nfs42_clone_file_range(struct file *src_file, loff_t src_off,
+		struct file *dst_file, loff_t dst_off, u64 count)
 {
 	struct inode *dst_inode = file_inode(dst_file);
 	struct nfs_server *server = NFS_SERVER(dst_inode);
-	struct fd src_file;
-	struct inode *src_inode;
+	struct inode *src_inode = file_inode(src_file);
 	unsigned int bs = server->clone_blksize;
 	bool same_inode = false;
 	int ret;
 
-	/* dst file must be opened for writing */
-	if (!(dst_file->f_mode & FMODE_WRITE))
-		return -EINVAL;
-
-	ret = mnt_want_write_file(dst_file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	src_inode = file_inode(src_file.file);
-
-	if (src_inode == dst_inode)
-		same_inode = true;
-
-	/* src file must be opened for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
-
-	/* src and dst must be regular files */
-	ret = -EISDIR;
-	if (!S_ISREG(src_inode->i_mode) || !S_ISREG(dst_inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != dst_file->f_path.mnt ||
-	    src_inode->i_sb != dst_inode->i_sb)
-		goto out_fput;
-
 	/* check alignment w.r.t. clone_blksize */
 	ret = -EINVAL;
 	if (bs) {
 		if (!IS_ALIGNED(src_off, bs) || !IS_ALIGNED(dst_off, bs))
-			goto out_fput;
+			goto out;
 		if (!IS_ALIGNED(count, bs) && i_size_read(src_inode) != (src_off + count))
-			goto out_fput;
+			goto out;
 	}
 
-	/* verify if ranges are overlapped within the same file */
-	if (same_inode) {
-		if (dst_off + count > src_off && dst_off < src_off + count)
-			goto out_fput;
-	}
+	if (src_inode == dst_inode)
+		same_inode = true;
 
 	/* XXX: do we lock at all? what if server needs CB_RECALL_LAYOUT? */
 	if (same_inode) {
@@ -275,7 +237,7 @@ nfs42_ioctl_clone(struct file *dst_file, unsigned long srcfd,
 	if (ret)
 		goto out_unlock;
 
-	ret = nfs42_proc_clone(src_file.file, dst_file, src_off, dst_off, count);
+	ret = nfs42_proc_clone(src_file, dst_file, src_off, dst_off, count);
 
 	/* truncate inode page cache of the dst range so that future reads can fetch
 	 * new data from server */
@@ -292,37 +254,9 @@ out_unlock:
 		mutex_unlock(&dst_inode->i_mutex);
 		mutex_unlock(&src_inode->i_mutex);
 	}
-out_fput:
-	fdput(src_file);
-out_drop_write:
-	mnt_drop_write_file(dst_file);
+out:
 	return ret;
 }
-
-static long nfs42_ioctl_clone_range(struct file *dst_file, void __user *argp)
-{
-	struct btrfs_ioctl_clone_range_args args;
-
-	if (copy_from_user(&args, argp, sizeof(args)))
-		return -EFAULT;
-
-	return nfs42_ioctl_clone(dst_file, args.src_fd, args.src_offset,
-				 args.dest_offset, args.src_length);
-}
-
-long nfs4_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
-{
-	void __user *argp = (void __user *)arg;
-
-	switch (cmd) {
-	case BTRFS_IOC_CLONE:
-		return nfs42_ioctl_clone(file, arg, 0, 0, 0);
-	case BTRFS_IOC_CLONE_RANGE:
-		return nfs42_ioctl_clone_range(file, argp);
-	}
-
-	return -ENOTTY;
-}
 #endif /* CONFIG_NFS_V4_2 */
 
 const struct file_operations nfs4_file_operations = {
@@ -342,8 +276,7 @@ const struct file_operations nfs4_file_operations = {
 #ifdef CONFIG_NFS_V4_2
 	.llseek		= nfs4_file_llseek,
 	.fallocate	= nfs42_fallocate,
-	.unlocked_ioctl = nfs4_ioctl,
-	.compat_ioctl	= nfs4_ioctl,
+	.clone_file_range = nfs42_clone_file_range,
 #else
 	.llseek		= nfs_file_llseek,
 #endif
diff --git a/fs/read_write.c b/fs/read_write.c
index e4c09dc..6395c733 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1451,3 +1451,75 @@ out1:
 out2:
 	return ret;
 }
+
+static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
+{
+	struct inode *inode = file_inode(file);
+
+	if (unlikely(pos < 0))
+		return -EINVAL;
+
+	 if (unlikely((loff_t) (pos + len) < 0))
+		return -EINVAL;
+
+	if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
+		loff_t end = len ? pos + len - 1 : OFFSET_MAX;
+		int retval;
+
+		retval = locks_mandatory_area(file, pos, end,
+				write ? F_WRLCK : F_RDLCK);
+		if (retval < 0)
+			return retval;
+	}
+
+	return security_file_permission(file, write ? MAY_WRITE : MAY_READ);
+}
+
+int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
+		struct file *file_out, loff_t pos_out, u64 len)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	int ret;
+
+	if (inode_in->i_sb != inode_out->i_sb ||
+	    file_in->f_path.mnt != file_out->f_path.mnt)
+		return -EXDEV;
+
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		return -EISDIR;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		return -EOPNOTSUPP;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_in->f_op->clone_file_range)
+		return -EBADF;
+
+	ret = clone_verify_area(file_in, pos_in, len, false);
+	if (ret)
+		return ret;
+
+	ret = clone_verify_area(file_out, pos_out, len, true);
+	if (ret)
+		return ret;
+
+	if (pos_in + len > i_size_read(inode_in))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_in->f_op->clone_file_range(file_in, pos_in,
+			file_out, pos_out, len);
+	if (!ret) {
+		fsnotify_access(file_in);
+		fsnotify_modify(file_out);
+	}
+
+	mnt_drop_write_file(file_out);
+	return ret;
+}
+EXPORT_SYMBOL(vfs_clone_file_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73874cd..26b7607 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1629,7 +1629,10 @@ struct file_operations {
 #ifndef CONFIG_MMU
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
-	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
+	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
+			loff_t, size_t, unsigned int);
+	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
+			u64);
 };
 
 struct inode_operations {
@@ -1683,6 +1686,8 @@ extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
+		struct file *file_out, loff_t pos_out, u64 len);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 6decbac..ad60359 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -39,6 +39,13 @@
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
 
+struct file_clone_range {
+	__s64 src_fd;
+	__u64 src_offset;
+	__u64 src_length;
+	__u64 dest_offset;
+};
+
 struct fstrim_range {
 	__u64 start;
 	__u64 len;
@@ -168,6 +175,8 @@ struct blkzeroout2 {
 #define FIFREEZE	_IOWR('X', 119, int)	/* Freeze */
 #define FITHAW		_IOWR('X', 120, int)	/* Thaw */
 #define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
+#define FICLONE		_IOW(0x94, 9, int)
+#define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 6/9] vfs: pull btrfs clone API to vfs layer
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, linux-api, Christoph Hellwig, xfs

>From : Christoph Hellwig <hch@lst.de>

The btrfs clone ioctls are now adopted by other file systems, with NFS
and CIFS already having support for them, and XFS being under active
development.  To avoid growth of various slightly incompatible
implementations, add one to the VFS.  Note that clones are different from
file copies in several ways:

 - they are atomic vs other writers
 - they support whole file clones
 - they support 64-bit legth clones
 - they do not allow partial success (aka short writes)
 - clones are expected to be a fast metadata operation

Because of that it would be rather cumbersome to try to piggyback them on
top of the recent clone_file_range infrastructure.  The converse isn't
true and the clone_file_range system call could try clone file range as
a first attempt to copy, something that further patches will enable.

Based on earlier work from Peng Tao.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ctree.h        |    3 +
 fs/btrfs/file.c         |    1 
 fs/btrfs/ioctl.c        |   49 +-----------------
 fs/cifs/cifsfs.c        |   63 ++++++++++++++++++++++++
 fs/cifs/cifsfs.h        |    1 
 fs/cifs/ioctl.c         |  126 ++++++++++++++++++++++-------------------------
 fs/ioctl.c              |   29 +++++++++++
 fs/nfs/nfs4file.c       |   87 ++++----------------------------
 fs/read_write.c         |   72 +++++++++++++++++++++++++++
 include/linux/fs.h      |    7 ++-
 include/uapi/linux/fs.h |    9 +++
 11 files changed, 254 insertions(+), 193 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ede7277..dd4733f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4025,7 +4025,6 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
 			       struct btrfs_ioctl_balance_args *bargs);
 
-
 /* file.c */
 int btrfs_auto_defrag_init(void);
 void btrfs_auto_defrag_exit(void);
@@ -4058,6 +4057,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			      struct file *file_out, loff_t pos_out,
 			      size_t len, unsigned int flags);
+int btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
+			   struct file *file_out, loff_t pos_out, u64 len);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e67fe6a..232e300 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2925,6 +2925,7 @@ const struct file_operations btrfs_file_operations = {
 	.compat_ioctl	= btrfs_ioctl,
 #endif
 	.copy_file_range = btrfs_copy_file_range,
+	.clone_file_range = btrfs_clone_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0f92735..85b1cae 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3906,49 +3906,10 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+int btrfs_clone_file_range(struct file *src_file, loff_t off,
+		struct file *dst_file, loff_t destoff, u64 len)
 {
-	struct fd src_file;
-	int ret;
-
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ)) {
-		ret = -EINVAL;
-		goto out_fput;
-	}
-
-	ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
-
-out_fput:
-	fdput(src_file);
-out_drop_write:
-	mnt_drop_write_file(file);
-	return ret;
-}
-
-static long btrfs_ioctl_clone_range(struct file *file, void __user *argp)
-{
-	struct btrfs_ioctl_clone_range_args args;
-
-	if (copy_from_user(&args, argp, sizeof(args)))
-		return -EFAULT;
-	return btrfs_ioctl_clone(file, args.src_fd, args.src_offset,
-				 args.src_length, args.dest_offset);
+	return btrfs_clone_files(dst_file, src_file, off, len, destoff);
 }
 
 /*
@@ -5498,10 +5459,6 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_dev_info(root, argp);
 	case BTRFS_IOC_BALANCE:
 		return btrfs_ioctl_balance(file, NULL);
-	case BTRFS_IOC_CLONE:
-		return btrfs_ioctl_clone(file, arg, 0, 0, 0);
-	case BTRFS_IOC_CLONE_RANGE:
-		return btrfs_ioctl_clone_range(file, argp);
 	case BTRFS_IOC_TRANS_START:
 		return btrfs_ioctl_trans_start(file);
 	case BTRFS_IOC_TRANS_END:
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index cbc0f4b..e9b978f 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -914,6 +914,61 @@ const struct inode_operations cifs_symlink_inode_ops = {
 #endif
 };
 
+static int cifs_clone_file_range(struct file *src_file, loff_t off,
+		struct file *dst_file, loff_t destoff, u64 len)
+{
+	struct inode *src_inode = file_inode(src_file);
+	struct inode *target_inode = file_inode(dst_file);
+	struct cifsFileInfo *smb_file_src = src_file->private_data;
+	struct cifsFileInfo *smb_file_target = dst_file->private_data;
+	struct cifs_tcon *src_tcon = tlink_tcon(smb_file_src->tlink);
+	struct cifs_tcon *target_tcon = tlink_tcon(smb_file_target->tlink);
+	unsigned int xid;
+	int rc;
+
+	cifs_dbg(FYI, "clone range\n");
+
+	xid = get_xid();
+
+	if (!src_file->private_data || !dst_file->private_data) {
+		rc = -EBADF;
+		cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
+		goto out;
+	}
+
+	/*
+	 * Note: cifs case is easier than btrfs since server responsible for
+	 * checks for proper open modes and file type and if it wants
+	 * server could even support copy of range where source = target
+	 */
+	lock_two_nondirectories(target_inode, src_inode);
+
+	if (len == 0)
+		len = src_inode->i_size - off;
+
+	cifs_dbg(FYI, "about to flush pages\n");
+	/* should we flush first and last page first */
+	truncate_inode_pages_range(&target_inode->i_data, destoff,
+				   PAGE_CACHE_ALIGN(destoff + len)-1);
+
+	if (target_tcon->ses->server->ops->duplicate_extents)
+		rc = target_tcon->ses->server->ops->duplicate_extents(xid,
+			smb_file_src, smb_file_target, off, len, destoff);
+	else
+		rc = -EOPNOTSUPP;
+
+	/* force revalidate of size and timestamps of target file now
+	   that target is updated on the server */
+	CIFS_I(target_inode)->time = 0;
+out_unlock:
+	/* although unlocking in the reverse order from locking is not
+	   strictly necessary here it is a little cleaner to be consistent */
+	unlock_two_nondirectories(src_inode, target_inode);
+out:
+	free_xid(xid);
+	return rc;
+}
+
 const struct file_operations cifs_file_ops = {
 	.read_iter = cifs_loose_read_iter,
 	.write_iter = cifs_file_write_iter,
@@ -926,6 +981,7 @@ const struct file_operations cifs_file_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -942,6 +998,8 @@ const struct file_operations cifs_file_strict_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -958,6 +1016,7 @@ const struct file_operations cifs_file_direct_ops = {
 	.mmap = cifs_file_mmap,
 	.splice_read = generic_file_splice_read,
 	.unlocked_ioctl  = cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.llseek = cifs_llseek,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
@@ -974,6 +1033,7 @@ const struct file_operations cifs_file_nobrl_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -989,6 +1049,7 @@ const struct file_operations cifs_file_strict_nobrl_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -1004,6 +1065,7 @@ const struct file_operations cifs_file_direct_nobrl_ops = {
 	.mmap = cifs_file_mmap,
 	.splice_read = generic_file_splice_read,
 	.unlocked_ioctl  = cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.llseek = cifs_llseek,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
@@ -1014,6 +1076,7 @@ const struct file_operations cifs_dir_ops = {
 	.release = cifs_closedir,
 	.read    = generic_read_dir,
 	.unlocked_ioctl  = cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.llseek = generic_file_llseek,
 };
 
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index c3cc160..c399513 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -131,7 +131,6 @@ extern int	cifs_setxattr(struct dentry *, const char *, const void *,
 extern ssize_t	cifs_getxattr(struct dentry *, const char *, void *, size_t);
 extern ssize_t	cifs_listxattr(struct dentry *, char *, size_t);
 extern long cifs_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
-
 #ifdef CONFIG_CIFS_NFSD_EXPORT
 extern const struct export_operations cifs_export_ops;
 #endif /* CONFIG_CIFS_NFSD_EXPORT */
diff --git a/fs/cifs/ioctl.c b/fs/cifs/ioctl.c
index 35cf990..7a3b84e 100644
--- a/fs/cifs/ioctl.c
+++ b/fs/cifs/ioctl.c
@@ -34,73 +34,36 @@
 #include "cifs_ioctl.h"
 #include <linux/btrfs.h>
 
-static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
-			unsigned long srcfd, u64 off, u64 len, u64 destoff,
-			bool dup_extents)
+static int cifs_file_clone_range(unsigned int xid, struct file *src_file,
+			  struct file *dst_file)
 {
-	int rc;
-	struct cifsFileInfo *smb_file_target = dst_file->private_data;
+	struct inode *src_inode = file_inode(src_file);
 	struct inode *target_inode = file_inode(dst_file);
-	struct cifs_tcon *target_tcon;
-	struct fd src_file;
 	struct cifsFileInfo *smb_file_src;
-	struct inode *src_inode;
+	struct cifsFileInfo *smb_file_target;
 	struct cifs_tcon *src_tcon;
+	struct cifs_tcon *target_tcon;
+	int rc;
 
 	cifs_dbg(FYI, "ioctl clone range\n");
-	/* the destination must be opened for writing */
-	if (!(dst_file->f_mode & FMODE_WRITE)) {
-		cifs_dbg(FYI, "file target not open for write\n");
-		return -EINVAL;
-	}
 
-	/* check if target volume is readonly and take reference */
-	rc = mnt_want_write_file(dst_file);
-	if (rc) {
-		cifs_dbg(FYI, "mnt_want_write failed with rc %d\n", rc);
-		return rc;
-	}
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		rc = -EBADF;
-		goto out_drop_write;
-	}
-
-	if (src_file.file->f_op->unlocked_ioctl != cifs_ioctl) {
-		rc = -EBADF;
-		cifs_dbg(VFS, "src file seems to be from a different filesystem type\n");
-		goto out_fput;
-	}
-
-	if ((!src_file.file->private_data) || (!dst_file->private_data)) {
+	if (!src_file->private_data || !dst_file->private_data) {
 		rc = -EBADF;
 		cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
-		goto out_fput;
+		goto out;
 	}
 
 	rc = -EXDEV;
 	smb_file_target = dst_file->private_data;
-	smb_file_src = src_file.file->private_data;
+	smb_file_src = src_file->private_data;
 	src_tcon = tlink_tcon(smb_file_src->tlink);
 	target_tcon = tlink_tcon(smb_file_target->tlink);
 
-	/* check source and target on same server (or volume if dup_extents) */
-	if (dup_extents && (src_tcon != target_tcon)) {
-		cifs_dbg(VFS, "source and target of copy not on same share\n");
-		goto out_fput;
-	}
-
-	if (!dup_extents && (src_tcon->ses != target_tcon->ses)) {
+	if (src_tcon->ses != target_tcon->ses) {
 		cifs_dbg(VFS, "source and target of copy not on same server\n");
-		goto out_fput;
+		goto out;
 	}
 
-	src_inode = file_inode(src_file.file);
-	rc = -EINVAL;
-	if (S_ISDIR(src_inode->i_mode))
-		goto out_fput;
-
 	/*
 	 * Note: cifs case is easier than btrfs since server responsible for
 	 * checks for proper open modes and file type and if it wants
@@ -108,34 +71,66 @@ static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
 	 */
 	lock_two_nondirectories(target_inode, src_inode);
 
-	/* determine range to clone */
-	rc = -EINVAL;
-	if (off + len > src_inode->i_size || off + len < off)
-		goto out_unlock;
-	if (len == 0)
-		len = src_inode->i_size - off;
-
 	cifs_dbg(FYI, "about to flush pages\n");
 	/* should we flush first and last page first */
-	truncate_inode_pages_range(&target_inode->i_data, destoff,
-				   PAGE_CACHE_ALIGN(destoff + len)-1);
+	truncate_inode_pages(&target_inode->i_data, 0);
 
-	if (dup_extents && target_tcon->ses->server->ops->duplicate_extents)
-		rc = target_tcon->ses->server->ops->duplicate_extents(xid,
-			smb_file_src, smb_file_target, off, len, destoff);
-	else if (!dup_extents && target_tcon->ses->server->ops->clone_range)
+	if (target_tcon->ses->server->ops->clone_range)
 		rc = target_tcon->ses->server->ops->clone_range(xid,
-			smb_file_src, smb_file_target, off, len, destoff);
+			smb_file_src, smb_file_target, 0, src_inode->i_size, 0);
 	else
 		rc = -EOPNOTSUPP;
 
 	/* force revalidate of size and timestamps of target file now
 	   that target is updated on the server */
 	CIFS_I(target_inode)->time = 0;
-out_unlock:
 	/* although unlocking in the reverse order from locking is not
 	   strictly necessary here it is a little cleaner to be consistent */
 	unlock_two_nondirectories(src_inode, target_inode);
+out:
+	return rc;
+}
+
+static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
+			unsigned long srcfd)
+{
+	int rc;
+	struct fd src_file;
+	struct inode *src_inode;
+
+	cifs_dbg(FYI, "ioctl clone range\n");
+	/* the destination must be opened for writing */
+	if (!(dst_file->f_mode & FMODE_WRITE)) {
+		cifs_dbg(FYI, "file target not open for write\n");
+		return -EINVAL;
+	}
+
+	/* check if target volume is readonly and take reference */
+	rc = mnt_want_write_file(dst_file);
+	if (rc) {
+		cifs_dbg(FYI, "mnt_want_write failed with rc %d\n", rc);
+		return rc;
+	}
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		rc = -EBADF;
+		goto out_drop_write;
+	}
+
+	if (src_file.file->f_op->unlocked_ioctl != cifs_ioctl) {
+		rc = -EBADF;
+		cifs_dbg(VFS, "src file seems to be from a different filesystem type\n");
+		goto out_fput;
+	}
+
+	src_inode = file_inode(src_file.file);
+	rc = -EINVAL;
+	if (S_ISDIR(src_inode->i_mode))
+		goto out_fput;
+
+	rc = cifs_file_clone_range(xid, src_file.file, dst_file);
+
 out_fput:
 	fdput(src_file);
 out_drop_write:
@@ -256,10 +251,7 @@ long cifs_ioctl(struct file *filep, unsigned int command, unsigned long arg)
 			}
 			break;
 		case CIFS_IOC_COPYCHUNK_FILE:
-			rc = cifs_ioctl_clone(xid, filep, arg, 0, 0, 0, false);
-			break;
-		case BTRFS_IOC_CLONE:
-			rc = cifs_ioctl_clone(xid, filep, arg, 0, 0, 0, true);
+			rc = cifs_ioctl_clone(xid, filep, arg);
 			break;
 		case CIFS_IOC_SET_INTEGRITY:
 			if (pSMBFile == NULL)
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5d01d26..84c6e79 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -215,6 +215,29 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg)
 	return error;
 }
 
+static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd,
+			     u64 off, u64 olen, u64 destoff)
+{
+	struct fd src_file = fdget(srcfd);
+	int ret;
+
+	if (!src_file.file)
+		return -EBADF;
+	ret = vfs_clone_file_range(src_file.file, off, dst_file, destoff, olen);
+	fdput(src_file);
+	return ret;
+}
+
+static long ioctl_file_clone_range(struct file *file, void __user *argp)
+{
+	struct file_clone_range args;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+	return ioctl_file_clone(file, args.src_fd, args.src_offset,
+				args.src_length, args.dest_offset);
+}
+
 #ifdef CONFIG_BLOCK
 
 static inline sector_t logical_to_blk(struct inode *inode, loff_t offset)
@@ -600,6 +623,12 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FIGETBSZ:
 		return put_user(inode->i_sb->s_blocksize, argp);
 
+	case FICLONE:
+		return ioctl_file_clone(filp, arg, 0, 0, 0);
+
+	case FICLONERANGE:
+		return ioctl_file_clone_range(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index db9b5fe..26f9a23 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -195,65 +195,27 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
 	return nfs42_proc_allocate(filep, offset, len);
 }
 
-static noinline long
-nfs42_ioctl_clone(struct file *dst_file, unsigned long srcfd,
-		  u64 src_off, u64 dst_off, u64 count)
+static int nfs42_clone_file_range(struct file *src_file, loff_t src_off,
+		struct file *dst_file, loff_t dst_off, u64 count)
 {
 	struct inode *dst_inode = file_inode(dst_file);
 	struct nfs_server *server = NFS_SERVER(dst_inode);
-	struct fd src_file;
-	struct inode *src_inode;
+	struct inode *src_inode = file_inode(src_file);
 	unsigned int bs = server->clone_blksize;
 	bool same_inode = false;
 	int ret;
 
-	/* dst file must be opened for writing */
-	if (!(dst_file->f_mode & FMODE_WRITE))
-		return -EINVAL;
-
-	ret = mnt_want_write_file(dst_file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	src_inode = file_inode(src_file.file);
-
-	if (src_inode == dst_inode)
-		same_inode = true;
-
-	/* src file must be opened for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
-
-	/* src and dst must be regular files */
-	ret = -EISDIR;
-	if (!S_ISREG(src_inode->i_mode) || !S_ISREG(dst_inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != dst_file->f_path.mnt ||
-	    src_inode->i_sb != dst_inode->i_sb)
-		goto out_fput;
-
 	/* check alignment w.r.t. clone_blksize */
 	ret = -EINVAL;
 	if (bs) {
 		if (!IS_ALIGNED(src_off, bs) || !IS_ALIGNED(dst_off, bs))
-			goto out_fput;
+			goto out;
 		if (!IS_ALIGNED(count, bs) && i_size_read(src_inode) != (src_off + count))
-			goto out_fput;
+			goto out;
 	}
 
-	/* verify if ranges are overlapped within the same file */
-	if (same_inode) {
-		if (dst_off + count > src_off && dst_off < src_off + count)
-			goto out_fput;
-	}
+	if (src_inode == dst_inode)
+		same_inode = true;
 
 	/* XXX: do we lock at all? what if server needs CB_RECALL_LAYOUT? */
 	if (same_inode) {
@@ -275,7 +237,7 @@ nfs42_ioctl_clone(struct file *dst_file, unsigned long srcfd,
 	if (ret)
 		goto out_unlock;
 
-	ret = nfs42_proc_clone(src_file.file, dst_file, src_off, dst_off, count);
+	ret = nfs42_proc_clone(src_file, dst_file, src_off, dst_off, count);
 
 	/* truncate inode page cache of the dst range so that future reads can fetch
 	 * new data from server */
@@ -292,37 +254,9 @@ out_unlock:
 		mutex_unlock(&dst_inode->i_mutex);
 		mutex_unlock(&src_inode->i_mutex);
 	}
-out_fput:
-	fdput(src_file);
-out_drop_write:
-	mnt_drop_write_file(dst_file);
+out:
 	return ret;
 }
-
-static long nfs42_ioctl_clone_range(struct file *dst_file, void __user *argp)
-{
-	struct btrfs_ioctl_clone_range_args args;
-
-	if (copy_from_user(&args, argp, sizeof(args)))
-		return -EFAULT;
-
-	return nfs42_ioctl_clone(dst_file, args.src_fd, args.src_offset,
-				 args.dest_offset, args.src_length);
-}
-
-long nfs4_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
-{
-	void __user *argp = (void __user *)arg;
-
-	switch (cmd) {
-	case BTRFS_IOC_CLONE:
-		return nfs42_ioctl_clone(file, arg, 0, 0, 0);
-	case BTRFS_IOC_CLONE_RANGE:
-		return nfs42_ioctl_clone_range(file, argp);
-	}
-
-	return -ENOTTY;
-}
 #endif /* CONFIG_NFS_V4_2 */
 
 const struct file_operations nfs4_file_operations = {
@@ -342,8 +276,7 @@ const struct file_operations nfs4_file_operations = {
 #ifdef CONFIG_NFS_V4_2
 	.llseek		= nfs4_file_llseek,
 	.fallocate	= nfs42_fallocate,
-	.unlocked_ioctl = nfs4_ioctl,
-	.compat_ioctl	= nfs4_ioctl,
+	.clone_file_range = nfs42_clone_file_range,
 #else
 	.llseek		= nfs_file_llseek,
 #endif
diff --git a/fs/read_write.c b/fs/read_write.c
index e4c09dc..6395c733 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1451,3 +1451,75 @@ out1:
 out2:
 	return ret;
 }
+
+static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
+{
+	struct inode *inode = file_inode(file);
+
+	if (unlikely(pos < 0))
+		return -EINVAL;
+
+	 if (unlikely((loff_t) (pos + len) < 0))
+		return -EINVAL;
+
+	if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
+		loff_t end = len ? pos + len - 1 : OFFSET_MAX;
+		int retval;
+
+		retval = locks_mandatory_area(file, pos, end,
+				write ? F_WRLCK : F_RDLCK);
+		if (retval < 0)
+			return retval;
+	}
+
+	return security_file_permission(file, write ? MAY_WRITE : MAY_READ);
+}
+
+int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
+		struct file *file_out, loff_t pos_out, u64 len)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	int ret;
+
+	if (inode_in->i_sb != inode_out->i_sb ||
+	    file_in->f_path.mnt != file_out->f_path.mnt)
+		return -EXDEV;
+
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		return -EISDIR;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		return -EOPNOTSUPP;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_in->f_op->clone_file_range)
+		return -EBADF;
+
+	ret = clone_verify_area(file_in, pos_in, len, false);
+	if (ret)
+		return ret;
+
+	ret = clone_verify_area(file_out, pos_out, len, true);
+	if (ret)
+		return ret;
+
+	if (pos_in + len > i_size_read(inode_in))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_in->f_op->clone_file_range(file_in, pos_in,
+			file_out, pos_out, len);
+	if (!ret) {
+		fsnotify_access(file_in);
+		fsnotify_modify(file_out);
+	}
+
+	mnt_drop_write_file(file_out);
+	return ret;
+}
+EXPORT_SYMBOL(vfs_clone_file_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73874cd..26b7607 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1629,7 +1629,10 @@ struct file_operations {
 #ifndef CONFIG_MMU
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
-	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
+	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
+			loff_t, size_t, unsigned int);
+	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
+			u64);
 };
 
 struct inode_operations {
@@ -1683,6 +1686,8 @@ extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
+		struct file *file_out, loff_t pos_out, u64 len);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 6decbac..ad60359 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -39,6 +39,13 @@
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
 
+struct file_clone_range {
+	__s64 src_fd;
+	__u64 src_offset;
+	__u64 src_length;
+	__u64 dest_offset;
+};
+
 struct fstrim_range {
 	__u64 start;
 	__u64 len;
@@ -168,6 +175,8 @@ struct blkzeroout2 {
 #define FIFREEZE	_IOWR('X', 119, int)	/* Freeze */
 #define FITHAW		_IOWR('X', 120, int)	/* Thaw */
 #define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
+#define FICLONE		_IOW(0x94, 9, int)
+#define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 6/9] vfs: pull btrfs clone API to vfs layer
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, linux-api, Christoph Hellwig, xfs

>From : Christoph Hellwig <hch@lst.de>

The btrfs clone ioctls are now adopted by other file systems, with NFS
and CIFS already having support for them, and XFS being under active
development.  To avoid growth of various slightly incompatible
implementations, add one to the VFS.  Note that clones are different from
file copies in several ways:

 - they are atomic vs other writers
 - they support whole file clones
 - they support 64-bit legth clones
 - they do not allow partial success (aka short writes)
 - clones are expected to be a fast metadata operation

Because of that it would be rather cumbersome to try to piggyback them on
top of the recent clone_file_range infrastructure.  The converse isn't
true and the clone_file_range system call could try clone file range as
a first attempt to copy, something that further patches will enable.

Based on earlier work from Peng Tao.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ctree.h        |    3 +
 fs/btrfs/file.c         |    1 
 fs/btrfs/ioctl.c        |   49 +-----------------
 fs/cifs/cifsfs.c        |   63 ++++++++++++++++++++++++
 fs/cifs/cifsfs.h        |    1 
 fs/cifs/ioctl.c         |  126 ++++++++++++++++++++++-------------------------
 fs/ioctl.c              |   29 +++++++++++
 fs/nfs/nfs4file.c       |   87 ++++----------------------------
 fs/read_write.c         |   72 +++++++++++++++++++++++++++
 include/linux/fs.h      |    7 ++-
 include/uapi/linux/fs.h |    9 +++
 11 files changed, 254 insertions(+), 193 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ede7277..dd4733f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4025,7 +4025,6 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
 			       struct btrfs_ioctl_balance_args *bargs);
 
-
 /* file.c */
 int btrfs_auto_defrag_init(void);
 void btrfs_auto_defrag_exit(void);
@@ -4058,6 +4057,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			      struct file *file_out, loff_t pos_out,
 			      size_t len, unsigned int flags);
+int btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
+			   struct file *file_out, loff_t pos_out, u64 len);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e67fe6a..232e300 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2925,6 +2925,7 @@ const struct file_operations btrfs_file_operations = {
 	.compat_ioctl	= btrfs_ioctl,
 #endif
 	.copy_file_range = btrfs_copy_file_range,
+	.clone_file_range = btrfs_clone_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0f92735..85b1cae 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3906,49 +3906,10 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+int btrfs_clone_file_range(struct file *src_file, loff_t off,
+		struct file *dst_file, loff_t destoff, u64 len)
 {
-	struct fd src_file;
-	int ret;
-
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ)) {
-		ret = -EINVAL;
-		goto out_fput;
-	}
-
-	ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
-
-out_fput:
-	fdput(src_file);
-out_drop_write:
-	mnt_drop_write_file(file);
-	return ret;
-}
-
-static long btrfs_ioctl_clone_range(struct file *file, void __user *argp)
-{
-	struct btrfs_ioctl_clone_range_args args;
-
-	if (copy_from_user(&args, argp, sizeof(args)))
-		return -EFAULT;
-	return btrfs_ioctl_clone(file, args.src_fd, args.src_offset,
-				 args.src_length, args.dest_offset);
+	return btrfs_clone_files(dst_file, src_file, off, len, destoff);
 }
 
 /*
@@ -5498,10 +5459,6 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_dev_info(root, argp);
 	case BTRFS_IOC_BALANCE:
 		return btrfs_ioctl_balance(file, NULL);
-	case BTRFS_IOC_CLONE:
-		return btrfs_ioctl_clone(file, arg, 0, 0, 0);
-	case BTRFS_IOC_CLONE_RANGE:
-		return btrfs_ioctl_clone_range(file, argp);
 	case BTRFS_IOC_TRANS_START:
 		return btrfs_ioctl_trans_start(file);
 	case BTRFS_IOC_TRANS_END:
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index cbc0f4b..e9b978f 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -914,6 +914,61 @@ const struct inode_operations cifs_symlink_inode_ops = {
 #endif
 };
 
+static int cifs_clone_file_range(struct file *src_file, loff_t off,
+		struct file *dst_file, loff_t destoff, u64 len)
+{
+	struct inode *src_inode = file_inode(src_file);
+	struct inode *target_inode = file_inode(dst_file);
+	struct cifsFileInfo *smb_file_src = src_file->private_data;
+	struct cifsFileInfo *smb_file_target = dst_file->private_data;
+	struct cifs_tcon *src_tcon = tlink_tcon(smb_file_src->tlink);
+	struct cifs_tcon *target_tcon = tlink_tcon(smb_file_target->tlink);
+	unsigned int xid;
+	int rc;
+
+	cifs_dbg(FYI, "clone range\n");
+
+	xid = get_xid();
+
+	if (!src_file->private_data || !dst_file->private_data) {
+		rc = -EBADF;
+		cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
+		goto out;
+	}
+
+	/*
+	 * Note: cifs case is easier than btrfs since server responsible for
+	 * checks for proper open modes and file type and if it wants
+	 * server could even support copy of range where source = target
+	 */
+	lock_two_nondirectories(target_inode, src_inode);
+
+	if (len == 0)
+		len = src_inode->i_size - off;
+
+	cifs_dbg(FYI, "about to flush pages\n");
+	/* should we flush first and last page first */
+	truncate_inode_pages_range(&target_inode->i_data, destoff,
+				   PAGE_CACHE_ALIGN(destoff + len)-1);
+
+	if (target_tcon->ses->server->ops->duplicate_extents)
+		rc = target_tcon->ses->server->ops->duplicate_extents(xid,
+			smb_file_src, smb_file_target, off, len, destoff);
+	else
+		rc = -EOPNOTSUPP;
+
+	/* force revalidate of size and timestamps of target file now
+	   that target is updated on the server */
+	CIFS_I(target_inode)->time = 0;
+out_unlock:
+	/* although unlocking in the reverse order from locking is not
+	   strictly necessary here it is a little cleaner to be consistent */
+	unlock_two_nondirectories(src_inode, target_inode);
+out:
+	free_xid(xid);
+	return rc;
+}
+
 const struct file_operations cifs_file_ops = {
 	.read_iter = cifs_loose_read_iter,
 	.write_iter = cifs_file_write_iter,
@@ -926,6 +981,7 @@ const struct file_operations cifs_file_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -942,6 +998,8 @@ const struct file_operations cifs_file_strict_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -958,6 +1016,7 @@ const struct file_operations cifs_file_direct_ops = {
 	.mmap = cifs_file_mmap,
 	.splice_read = generic_file_splice_read,
 	.unlocked_ioctl  = cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.llseek = cifs_llseek,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
@@ -974,6 +1033,7 @@ const struct file_operations cifs_file_nobrl_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -989,6 +1049,7 @@ const struct file_operations cifs_file_strict_nobrl_ops = {
 	.splice_read = generic_file_splice_read,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
 };
@@ -1004,6 +1065,7 @@ const struct file_operations cifs_file_direct_nobrl_ops = {
 	.mmap = cifs_file_mmap,
 	.splice_read = generic_file_splice_read,
 	.unlocked_ioctl  = cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.llseek = cifs_llseek,
 	.setlease = cifs_setlease,
 	.fallocate = cifs_fallocate,
@@ -1014,6 +1076,7 @@ const struct file_operations cifs_dir_ops = {
 	.release = cifs_closedir,
 	.read    = generic_read_dir,
 	.unlocked_ioctl  = cifs_ioctl,
+	.clone_file_range = cifs_clone_file_range,
 	.llseek = generic_file_llseek,
 };
 
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index c3cc160..c399513 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -131,7 +131,6 @@ extern int	cifs_setxattr(struct dentry *, const char *, const void *,
 extern ssize_t	cifs_getxattr(struct dentry *, const char *, void *, size_t);
 extern ssize_t	cifs_listxattr(struct dentry *, char *, size_t);
 extern long cifs_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
-
 #ifdef CONFIG_CIFS_NFSD_EXPORT
 extern const struct export_operations cifs_export_ops;
 #endif /* CONFIG_CIFS_NFSD_EXPORT */
diff --git a/fs/cifs/ioctl.c b/fs/cifs/ioctl.c
index 35cf990..7a3b84e 100644
--- a/fs/cifs/ioctl.c
+++ b/fs/cifs/ioctl.c
@@ -34,73 +34,36 @@
 #include "cifs_ioctl.h"
 #include <linux/btrfs.h>
 
-static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
-			unsigned long srcfd, u64 off, u64 len, u64 destoff,
-			bool dup_extents)
+static int cifs_file_clone_range(unsigned int xid, struct file *src_file,
+			  struct file *dst_file)
 {
-	int rc;
-	struct cifsFileInfo *smb_file_target = dst_file->private_data;
+	struct inode *src_inode = file_inode(src_file);
 	struct inode *target_inode = file_inode(dst_file);
-	struct cifs_tcon *target_tcon;
-	struct fd src_file;
 	struct cifsFileInfo *smb_file_src;
-	struct inode *src_inode;
+	struct cifsFileInfo *smb_file_target;
 	struct cifs_tcon *src_tcon;
+	struct cifs_tcon *target_tcon;
+	int rc;
 
 	cifs_dbg(FYI, "ioctl clone range\n");
-	/* the destination must be opened for writing */
-	if (!(dst_file->f_mode & FMODE_WRITE)) {
-		cifs_dbg(FYI, "file target not open for write\n");
-		return -EINVAL;
-	}
 
-	/* check if target volume is readonly and take reference */
-	rc = mnt_want_write_file(dst_file);
-	if (rc) {
-		cifs_dbg(FYI, "mnt_want_write failed with rc %d\n", rc);
-		return rc;
-	}
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		rc = -EBADF;
-		goto out_drop_write;
-	}
-
-	if (src_file.file->f_op->unlocked_ioctl != cifs_ioctl) {
-		rc = -EBADF;
-		cifs_dbg(VFS, "src file seems to be from a different filesystem type\n");
-		goto out_fput;
-	}
-
-	if ((!src_file.file->private_data) || (!dst_file->private_data)) {
+	if (!src_file->private_data || !dst_file->private_data) {
 		rc = -EBADF;
 		cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
-		goto out_fput;
+		goto out;
 	}
 
 	rc = -EXDEV;
 	smb_file_target = dst_file->private_data;
-	smb_file_src = src_file.file->private_data;
+	smb_file_src = src_file->private_data;
 	src_tcon = tlink_tcon(smb_file_src->tlink);
 	target_tcon = tlink_tcon(smb_file_target->tlink);
 
-	/* check source and target on same server (or volume if dup_extents) */
-	if (dup_extents && (src_tcon != target_tcon)) {
-		cifs_dbg(VFS, "source and target of copy not on same share\n");
-		goto out_fput;
-	}
-
-	if (!dup_extents && (src_tcon->ses != target_tcon->ses)) {
+	if (src_tcon->ses != target_tcon->ses) {
 		cifs_dbg(VFS, "source and target of copy not on same server\n");
-		goto out_fput;
+		goto out;
 	}
 
-	src_inode = file_inode(src_file.file);
-	rc = -EINVAL;
-	if (S_ISDIR(src_inode->i_mode))
-		goto out_fput;
-
 	/*
 	 * Note: cifs case is easier than btrfs since server responsible for
 	 * checks for proper open modes and file type and if it wants
@@ -108,34 +71,66 @@ static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
 	 */
 	lock_two_nondirectories(target_inode, src_inode);
 
-	/* determine range to clone */
-	rc = -EINVAL;
-	if (off + len > src_inode->i_size || off + len < off)
-		goto out_unlock;
-	if (len == 0)
-		len = src_inode->i_size - off;
-
 	cifs_dbg(FYI, "about to flush pages\n");
 	/* should we flush first and last page first */
-	truncate_inode_pages_range(&target_inode->i_data, destoff,
-				   PAGE_CACHE_ALIGN(destoff + len)-1);
+	truncate_inode_pages(&target_inode->i_data, 0);
 
-	if (dup_extents && target_tcon->ses->server->ops->duplicate_extents)
-		rc = target_tcon->ses->server->ops->duplicate_extents(xid,
-			smb_file_src, smb_file_target, off, len, destoff);
-	else if (!dup_extents && target_tcon->ses->server->ops->clone_range)
+	if (target_tcon->ses->server->ops->clone_range)
 		rc = target_tcon->ses->server->ops->clone_range(xid,
-			smb_file_src, smb_file_target, off, len, destoff);
+			smb_file_src, smb_file_target, 0, src_inode->i_size, 0);
 	else
 		rc = -EOPNOTSUPP;
 
 	/* force revalidate of size and timestamps of target file now
 	   that target is updated on the server */
 	CIFS_I(target_inode)->time = 0;
-out_unlock:
 	/* although unlocking in the reverse order from locking is not
 	   strictly necessary here it is a little cleaner to be consistent */
 	unlock_two_nondirectories(src_inode, target_inode);
+out:
+	return rc;
+}
+
+static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
+			unsigned long srcfd)
+{
+	int rc;
+	struct fd src_file;
+	struct inode *src_inode;
+
+	cifs_dbg(FYI, "ioctl clone range\n");
+	/* the destination must be opened for writing */
+	if (!(dst_file->f_mode & FMODE_WRITE)) {
+		cifs_dbg(FYI, "file target not open for write\n");
+		return -EINVAL;
+	}
+
+	/* check if target volume is readonly and take reference */
+	rc = mnt_want_write_file(dst_file);
+	if (rc) {
+		cifs_dbg(FYI, "mnt_want_write failed with rc %d\n", rc);
+		return rc;
+	}
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		rc = -EBADF;
+		goto out_drop_write;
+	}
+
+	if (src_file.file->f_op->unlocked_ioctl != cifs_ioctl) {
+		rc = -EBADF;
+		cifs_dbg(VFS, "src file seems to be from a different filesystem type\n");
+		goto out_fput;
+	}
+
+	src_inode = file_inode(src_file.file);
+	rc = -EINVAL;
+	if (S_ISDIR(src_inode->i_mode))
+		goto out_fput;
+
+	rc = cifs_file_clone_range(xid, src_file.file, dst_file);
+
 out_fput:
 	fdput(src_file);
 out_drop_write:
@@ -256,10 +251,7 @@ long cifs_ioctl(struct file *filep, unsigned int command, unsigned long arg)
 			}
 			break;
 		case CIFS_IOC_COPYCHUNK_FILE:
-			rc = cifs_ioctl_clone(xid, filep, arg, 0, 0, 0, false);
-			break;
-		case BTRFS_IOC_CLONE:
-			rc = cifs_ioctl_clone(xid, filep, arg, 0, 0, 0, true);
+			rc = cifs_ioctl_clone(xid, filep, arg);
 			break;
 		case CIFS_IOC_SET_INTEGRITY:
 			if (pSMBFile == NULL)
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5d01d26..84c6e79 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -215,6 +215,29 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg)
 	return error;
 }
 
+static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd,
+			     u64 off, u64 olen, u64 destoff)
+{
+	struct fd src_file = fdget(srcfd);
+	int ret;
+
+	if (!src_file.file)
+		return -EBADF;
+	ret = vfs_clone_file_range(src_file.file, off, dst_file, destoff, olen);
+	fdput(src_file);
+	return ret;
+}
+
+static long ioctl_file_clone_range(struct file *file, void __user *argp)
+{
+	struct file_clone_range args;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+	return ioctl_file_clone(file, args.src_fd, args.src_offset,
+				args.src_length, args.dest_offset);
+}
+
 #ifdef CONFIG_BLOCK
 
 static inline sector_t logical_to_blk(struct inode *inode, loff_t offset)
@@ -600,6 +623,12 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FIGETBSZ:
 		return put_user(inode->i_sb->s_blocksize, argp);
 
+	case FICLONE:
+		return ioctl_file_clone(filp, arg, 0, 0, 0);
+
+	case FICLONERANGE:
+		return ioctl_file_clone_range(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index db9b5fe..26f9a23 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -195,65 +195,27 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
 	return nfs42_proc_allocate(filep, offset, len);
 }
 
-static noinline long
-nfs42_ioctl_clone(struct file *dst_file, unsigned long srcfd,
-		  u64 src_off, u64 dst_off, u64 count)
+static int nfs42_clone_file_range(struct file *src_file, loff_t src_off,
+		struct file *dst_file, loff_t dst_off, u64 count)
 {
 	struct inode *dst_inode = file_inode(dst_file);
 	struct nfs_server *server = NFS_SERVER(dst_inode);
-	struct fd src_file;
-	struct inode *src_inode;
+	struct inode *src_inode = file_inode(src_file);
 	unsigned int bs = server->clone_blksize;
 	bool same_inode = false;
 	int ret;
 
-	/* dst file must be opened for writing */
-	if (!(dst_file->f_mode & FMODE_WRITE))
-		return -EINVAL;
-
-	ret = mnt_want_write_file(dst_file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	src_inode = file_inode(src_file.file);
-
-	if (src_inode == dst_inode)
-		same_inode = true;
-
-	/* src file must be opened for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
-
-	/* src and dst must be regular files */
-	ret = -EISDIR;
-	if (!S_ISREG(src_inode->i_mode) || !S_ISREG(dst_inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != dst_file->f_path.mnt ||
-	    src_inode->i_sb != dst_inode->i_sb)
-		goto out_fput;
-
 	/* check alignment w.r.t. clone_blksize */
 	ret = -EINVAL;
 	if (bs) {
 		if (!IS_ALIGNED(src_off, bs) || !IS_ALIGNED(dst_off, bs))
-			goto out_fput;
+			goto out;
 		if (!IS_ALIGNED(count, bs) && i_size_read(src_inode) != (src_off + count))
-			goto out_fput;
+			goto out;
 	}
 
-	/* verify if ranges are overlapped within the same file */
-	if (same_inode) {
-		if (dst_off + count > src_off && dst_off < src_off + count)
-			goto out_fput;
-	}
+	if (src_inode == dst_inode)
+		same_inode = true;
 
 	/* XXX: do we lock at all? what if server needs CB_RECALL_LAYOUT? */
 	if (same_inode) {
@@ -275,7 +237,7 @@ nfs42_ioctl_clone(struct file *dst_file, unsigned long srcfd,
 	if (ret)
 		goto out_unlock;
 
-	ret = nfs42_proc_clone(src_file.file, dst_file, src_off, dst_off, count);
+	ret = nfs42_proc_clone(src_file, dst_file, src_off, dst_off, count);
 
 	/* truncate inode page cache of the dst range so that future reads can fetch
 	 * new data from server */
@@ -292,37 +254,9 @@ out_unlock:
 		mutex_unlock(&dst_inode->i_mutex);
 		mutex_unlock(&src_inode->i_mutex);
 	}
-out_fput:
-	fdput(src_file);
-out_drop_write:
-	mnt_drop_write_file(dst_file);
+out:
 	return ret;
 }
-
-static long nfs42_ioctl_clone_range(struct file *dst_file, void __user *argp)
-{
-	struct btrfs_ioctl_clone_range_args args;
-
-	if (copy_from_user(&args, argp, sizeof(args)))
-		return -EFAULT;
-
-	return nfs42_ioctl_clone(dst_file, args.src_fd, args.src_offset,
-				 args.dest_offset, args.src_length);
-}
-
-long nfs4_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
-{
-	void __user *argp = (void __user *)arg;
-
-	switch (cmd) {
-	case BTRFS_IOC_CLONE:
-		return nfs42_ioctl_clone(file, arg, 0, 0, 0);
-	case BTRFS_IOC_CLONE_RANGE:
-		return nfs42_ioctl_clone_range(file, argp);
-	}
-
-	return -ENOTTY;
-}
 #endif /* CONFIG_NFS_V4_2 */
 
 const struct file_operations nfs4_file_operations = {
@@ -342,8 +276,7 @@ const struct file_operations nfs4_file_operations = {
 #ifdef CONFIG_NFS_V4_2
 	.llseek		= nfs4_file_llseek,
 	.fallocate	= nfs42_fallocate,
-	.unlocked_ioctl = nfs4_ioctl,
-	.compat_ioctl	= nfs4_ioctl,
+	.clone_file_range = nfs42_clone_file_range,
 #else
 	.llseek		= nfs_file_llseek,
 #endif
diff --git a/fs/read_write.c b/fs/read_write.c
index e4c09dc..6395c733 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1451,3 +1451,75 @@ out1:
 out2:
 	return ret;
 }
+
+static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
+{
+	struct inode *inode = file_inode(file);
+
+	if (unlikely(pos < 0))
+		return -EINVAL;
+
+	 if (unlikely((loff_t) (pos + len) < 0))
+		return -EINVAL;
+
+	if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
+		loff_t end = len ? pos + len - 1 : OFFSET_MAX;
+		int retval;
+
+		retval = locks_mandatory_area(file, pos, end,
+				write ? F_WRLCK : F_RDLCK);
+		if (retval < 0)
+			return retval;
+	}
+
+	return security_file_permission(file, write ? MAY_WRITE : MAY_READ);
+}
+
+int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
+		struct file *file_out, loff_t pos_out, u64 len)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	int ret;
+
+	if (inode_in->i_sb != inode_out->i_sb ||
+	    file_in->f_path.mnt != file_out->f_path.mnt)
+		return -EXDEV;
+
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		return -EISDIR;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		return -EOPNOTSUPP;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_in->f_op->clone_file_range)
+		return -EBADF;
+
+	ret = clone_verify_area(file_in, pos_in, len, false);
+	if (ret)
+		return ret;
+
+	ret = clone_verify_area(file_out, pos_out, len, true);
+	if (ret)
+		return ret;
+
+	if (pos_in + len > i_size_read(inode_in))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_in->f_op->clone_file_range(file_in, pos_in,
+			file_out, pos_out, len);
+	if (!ret) {
+		fsnotify_access(file_in);
+		fsnotify_modify(file_out);
+	}
+
+	mnt_drop_write_file(file_out);
+	return ret;
+}
+EXPORT_SYMBOL(vfs_clone_file_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73874cd..26b7607 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1629,7 +1629,10 @@ struct file_operations {
 #ifndef CONFIG_MMU
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
-	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
+	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
+			loff_t, size_t, unsigned int);
+	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
+			u64);
 };
 
 struct inode_operations {
@@ -1683,6 +1686,8 @@ extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
+		struct file *file_out, loff_t pos_out, u64 len);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 6decbac..ad60359 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -39,6 +39,13 @@
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
 
+struct file_clone_range {
+	__s64 src_fd;
+	__u64 src_offset;
+	__u64 src_length;
+	__u64 dest_offset;
+};
+
 struct fstrim_range {
 	__u64 start;
 	__u64 len;
@@ -168,6 +175,8 @@ struct blkzeroout2 {
 #define FIFREEZE	_IOWR('X', 119, int)	/* Freeze */
 #define FITHAW		_IOWR('X', 120, int)	/* Thaw */
 #define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
+#define FICLONE		_IOW(0x94, 9, int)
+#define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 7/9] vfs: wire up compat ioctl for CLONE/CLONE_RANGE
  2015-12-19  8:55 ` Darrick J. Wong
  (?)
@ 2015-12-19  8:55   ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, linux-api, xfs

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/compat_ioctl.c |    4 ++++
 fs/read_write.c   |    2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)


diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index dcf2653..70d4b10 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1580,6 +1580,10 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
 		goto out_fput;
 #endif
 
+	case FICLONE:
+	case FICLONERANGE:
+		goto do_ioctl;
+
 	case FIBMAP:
 	case FIGETBSZ:
 	case FIONREAD:
diff --git a/fs/read_write.c b/fs/read_write.c
index 6395c733..0713e28 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1489,7 +1489,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
 		return -EISDIR;
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
-		return -EOPNOTSUPP;
+		return -EINVAL;
 
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 7/9] vfs: wire up compat ioctl for CLONE/CLONE_RANGE
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, linux-api, xfs

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/compat_ioctl.c |    4 ++++
 fs/read_write.c   |    2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)


diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index dcf2653..70d4b10 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1580,6 +1580,10 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
 		goto out_fput;
 #endif
 
+	case FICLONE:
+	case FICLONERANGE:
+		goto do_ioctl;
+
 	case FIBMAP:
 	case FIGETBSZ:
 	case FIONREAD:
diff --git a/fs/read_write.c b/fs/read_write.c
index 6395c733..0713e28 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1489,7 +1489,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
 		return -EISDIR;
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
-		return -EOPNOTSUPP;
+		return -EINVAL;
 
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 7/9] vfs: wire up compat ioctl for CLONE/CLONE_RANGE
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david-FqsqvQoI3Ljby3iVrkZq2A, darrick.wong-QHcLZuEGTsvQT0dZR+AlfA
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ

Signed-off-by: Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 fs/compat_ioctl.c |    4 ++++
 fs/read_write.c   |    2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)


diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index dcf2653..70d4b10 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1580,6 +1580,10 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
 		goto out_fput;
 #endif
 
+	case FICLONE:
+	case FICLONERANGE:
+		goto do_ioctl;
+
 	case FIBMAP:
 	case FIGETBSZ:
 	case FIONREAD:
diff --git a/fs/read_write.c b/fs/read_write.c
index 6395c733..0713e28 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1489,7 +1489,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
 		return -EISDIR;
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
-		return -EOPNOTSUPP;
+		return -EINVAL;
 
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
  2015-12-19  8:55 ` Darrick J. Wong
  (?)
@ 2015-12-19  8:55   ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, linux-api, xfs

Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
more systematic (FIDEDUPERANGE).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/compat_ioctl.c       |    1 
 fs/ioctl.c              |   38 ++++++++++++++++++
 fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h      |    4 ++
 include/uapi/linux/fs.h |   30 ++++++++++++++
 5 files changed, 173 insertions(+)


diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 70d4b10..eab31e7 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
 
 	case FICLONE:
 	case FICLONERANGE:
+	case FIDEDUPERANGE:
 		goto do_ioctl;
 
 	case FIBMAP:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 84c6e79..fcdd33b 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
 	return thaw_super(sb);
 }
 
+static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
+{
+	struct file_dedupe_range __user *argp = arg;
+	struct file_dedupe_range *same = NULL;
+	int ret;
+	unsigned long size;
+	u16 count;
+
+	if (get_user(count, &argp->dest_count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	size = offsetof(struct file_dedupe_range __user, info[count]);
+
+	same = memdup_user(argp, size);
+	if (IS_ERR(same)) {
+		ret = PTR_ERR(same);
+		same = NULL;
+		goto out;
+	}
+
+	ret = vfs_dedupe_file_range(file, same);
+	if (ret)
+		goto out;
+
+	ret = copy_to_user(argp, same, size);
+	if (ret)
+		ret = -EFAULT;
+
+out:
+	kfree(same);
+	return ret;
+}
+
 /*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
@@ -629,6 +664,9 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FICLONERANGE:
 		return ioctl_file_clone_range(filp, argp);
 
+	case FIDEDUPERANGE:
+		return ioctl_file_dedupe_range(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/fs/read_write.c b/fs/read_write.c
index 0713e28..aaaad52 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1523,3 +1523,103 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	return ret;
 }
 EXPORT_SYMBOL(vfs_clone_file_range);
+
+int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
+{
+	struct file_dedupe_range_info *info;
+	struct inode *src = file_inode(file);
+	u64 off;
+	u64 len;
+	int i;
+	int ret;
+	bool is_admin = capable(CAP_SYS_ADMIN);
+	u16 count = same->dest_count;
+	struct file *dst_file;
+	loff_t dst_off;
+	ssize_t deduped;
+
+	if (!(file->f_mode & FMODE_READ))
+		return -EINVAL;
+
+	if (same->reserved1 || same->reserved2)
+		return -EINVAL;
+
+	off = same->src_offset;
+	len = same->src_length;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode))
+		goto out;
+
+	ret = -EINVAL;
+	if (!S_ISREG(src->i_mode))
+		goto out;
+
+	ret = clone_verify_area(file, off, len, false);
+	if (ret < 0)
+		goto out;
+	ret = 0;
+
+	/* pre-format output fields to sane values */
+	for (i = 0; i < count; i++) {
+		same->info[i].bytes_deduped = 0ULL;
+		same->info[i].status = FILE_DEDUPE_RANGE_SAME;
+	}
+
+	for (i = 0, info = same->info; i < count; i++, info++) {
+		struct inode *dst;
+		struct fd dst_fd = fdget(info->dest_fd);
+
+		dst_file = dst_fd.file;
+		if (!dst_file) {
+			info->status = -EBADF;
+			goto next_loop;
+		}
+		dst = file_inode(dst_file);
+
+		ret = mnt_want_write_file(dst_file);
+		if (ret) {
+			info->status = ret;
+			goto next_loop;
+		}
+
+		dst_off = info->dest_offset;
+		ret = clone_verify_area(dst_file, dst_off, len, true);
+		if (ret < 0) {
+			info->status = ret;
+			goto next_file;
+		}
+		ret = 0;
+
+		if (info->reserved) {
+			info->status = -EINVAL;
+		} else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+			info->status = -EINVAL;
+		} else if (file->f_path.mnt != dst_file->f_path.mnt) {
+			info->status = -EXDEV;
+		} else if (S_ISDIR(dst->i_mode)) {
+			info->status = -EISDIR;
+		} else if (dst_file->f_op->dedupe_file_range == NULL) {
+			info->status = -EINVAL;
+		} else {
+			deduped = dst_file->f_op->dedupe_file_range(file, off,
+							len, dst_file,
+							info->dest_offset);
+			if (deduped == -EBADE)
+				info->status = FILE_DEDUPE_RANGE_DIFFERS;
+			else if (deduped < 0)
+				info->status = deduped;
+			else
+				info->bytes_deduped += deduped;
+		}
+
+next_file:
+		mnt_drop_write_file(dst_file);
+next_loop:
+		fdput(dst_fd);
+	}
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL(vfs_dedupe_file_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 26b7607..bcfa23d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1633,6 +1633,8 @@ struct file_operations {
 			loff_t, size_t, unsigned int);
 	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
 			u64);
+	ssize_t (*dedupe_file_range)(struct file *, u64, u64, struct file *,
+			u64);
 };
 
 struct inode_operations {
@@ -1688,6 +1690,8 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len);
+extern int vfs_dedupe_file_range(struct file *file,
+				 struct file_dedupe_range *same);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index ad60359..801986e 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -52,6 +52,35 @@ struct fstrim_range {
 	__u64 minlen;
 };
 
+/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
+#define FILE_DEDUPE_RANGE_SAME		0
+#define FILE_DEDUPE_RANGE_DIFFERS	1
+
+/* from struct btrfs_ioctl_file_extent_same_info */
+struct file_dedupe_range_info {
+	__s64 dest_fd;		/* in - destination file */
+	__u64 dest_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file. */
+	/* status of this dedupe operation:
+	 * < 0 for error
+	 * == FILE_DEDUPE_RANGE_SAME if dedupe succeeds
+	 * == FILE_DEDUPE_RANGE_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;		/* must be zero */
+};
+
+/* from struct btrfs_ioctl_file_extent_same_args */
+struct file_dedupe_range {
+	__u64 src_offset;	/* in - start of extent in source */
+	__u64 src_length;	/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;	/* must be zero */
+	__u32 reserved2;	/* must be zero */
+	struct file_dedupe_range_info info[0];
+};
+
 /* And dynamically-tunable limits and defaults: */
 struct files_stat_struct {
 	unsigned long nr_files;		/* read only */
@@ -177,6 +206,7 @@ struct blkzeroout2 {
 #define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
 #define FICLONE		_IOW(0x94, 9, int)
 #define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
+#define FIDEDUPERANGE	_IOWR(0x94, 54, struct file_dedupe_range)
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, linux-api, xfs

Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
more systematic (FIDEDUPERANGE).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/compat_ioctl.c       |    1 
 fs/ioctl.c              |   38 ++++++++++++++++++
 fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h      |    4 ++
 include/uapi/linux/fs.h |   30 ++++++++++++++
 5 files changed, 173 insertions(+)


diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 70d4b10..eab31e7 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
 
 	case FICLONE:
 	case FICLONERANGE:
+	case FIDEDUPERANGE:
 		goto do_ioctl;
 
 	case FIBMAP:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 84c6e79..fcdd33b 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
 	return thaw_super(sb);
 }
 
+static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
+{
+	struct file_dedupe_range __user *argp = arg;
+	struct file_dedupe_range *same = NULL;
+	int ret;
+	unsigned long size;
+	u16 count;
+
+	if (get_user(count, &argp->dest_count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	size = offsetof(struct file_dedupe_range __user, info[count]);
+
+	same = memdup_user(argp, size);
+	if (IS_ERR(same)) {
+		ret = PTR_ERR(same);
+		same = NULL;
+		goto out;
+	}
+
+	ret = vfs_dedupe_file_range(file, same);
+	if (ret)
+		goto out;
+
+	ret = copy_to_user(argp, same, size);
+	if (ret)
+		ret = -EFAULT;
+
+out:
+	kfree(same);
+	return ret;
+}
+
 /*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
@@ -629,6 +664,9 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FICLONERANGE:
 		return ioctl_file_clone_range(filp, argp);
 
+	case FIDEDUPERANGE:
+		return ioctl_file_dedupe_range(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/fs/read_write.c b/fs/read_write.c
index 0713e28..aaaad52 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1523,3 +1523,103 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	return ret;
 }
 EXPORT_SYMBOL(vfs_clone_file_range);
+
+int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
+{
+	struct file_dedupe_range_info *info;
+	struct inode *src = file_inode(file);
+	u64 off;
+	u64 len;
+	int i;
+	int ret;
+	bool is_admin = capable(CAP_SYS_ADMIN);
+	u16 count = same->dest_count;
+	struct file *dst_file;
+	loff_t dst_off;
+	ssize_t deduped;
+
+	if (!(file->f_mode & FMODE_READ))
+		return -EINVAL;
+
+	if (same->reserved1 || same->reserved2)
+		return -EINVAL;
+
+	off = same->src_offset;
+	len = same->src_length;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode))
+		goto out;
+
+	ret = -EINVAL;
+	if (!S_ISREG(src->i_mode))
+		goto out;
+
+	ret = clone_verify_area(file, off, len, false);
+	if (ret < 0)
+		goto out;
+	ret = 0;
+
+	/* pre-format output fields to sane values */
+	for (i = 0; i < count; i++) {
+		same->info[i].bytes_deduped = 0ULL;
+		same->info[i].status = FILE_DEDUPE_RANGE_SAME;
+	}
+
+	for (i = 0, info = same->info; i < count; i++, info++) {
+		struct inode *dst;
+		struct fd dst_fd = fdget(info->dest_fd);
+
+		dst_file = dst_fd.file;
+		if (!dst_file) {
+			info->status = -EBADF;
+			goto next_loop;
+		}
+		dst = file_inode(dst_file);
+
+		ret = mnt_want_write_file(dst_file);
+		if (ret) {
+			info->status = ret;
+			goto next_loop;
+		}
+
+		dst_off = info->dest_offset;
+		ret = clone_verify_area(dst_file, dst_off, len, true);
+		if (ret < 0) {
+			info->status = ret;
+			goto next_file;
+		}
+		ret = 0;
+
+		if (info->reserved) {
+			info->status = -EINVAL;
+		} else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+			info->status = -EINVAL;
+		} else if (file->f_path.mnt != dst_file->f_path.mnt) {
+			info->status = -EXDEV;
+		} else if (S_ISDIR(dst->i_mode)) {
+			info->status = -EISDIR;
+		} else if (dst_file->f_op->dedupe_file_range == NULL) {
+			info->status = -EINVAL;
+		} else {
+			deduped = dst_file->f_op->dedupe_file_range(file, off,
+							len, dst_file,
+							info->dest_offset);
+			if (deduped == -EBADE)
+				info->status = FILE_DEDUPE_RANGE_DIFFERS;
+			else if (deduped < 0)
+				info->status = deduped;
+			else
+				info->bytes_deduped += deduped;
+		}
+
+next_file:
+		mnt_drop_write_file(dst_file);
+next_loop:
+		fdput(dst_fd);
+	}
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL(vfs_dedupe_file_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 26b7607..bcfa23d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1633,6 +1633,8 @@ struct file_operations {
 			loff_t, size_t, unsigned int);
 	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
 			u64);
+	ssize_t (*dedupe_file_range)(struct file *, u64, u64, struct file *,
+			u64);
 };
 
 struct inode_operations {
@@ -1688,6 +1690,8 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len);
+extern int vfs_dedupe_file_range(struct file *file,
+				 struct file_dedupe_range *same);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index ad60359..801986e 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -52,6 +52,35 @@ struct fstrim_range {
 	__u64 minlen;
 };
 
+/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
+#define FILE_DEDUPE_RANGE_SAME		0
+#define FILE_DEDUPE_RANGE_DIFFERS	1
+
+/* from struct btrfs_ioctl_file_extent_same_info */
+struct file_dedupe_range_info {
+	__s64 dest_fd;		/* in - destination file */
+	__u64 dest_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file. */
+	/* status of this dedupe operation:
+	 * < 0 for error
+	 * == FILE_DEDUPE_RANGE_SAME if dedupe succeeds
+	 * == FILE_DEDUPE_RANGE_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;		/* must be zero */
+};
+
+/* from struct btrfs_ioctl_file_extent_same_args */
+struct file_dedupe_range {
+	__u64 src_offset;	/* in - start of extent in source */
+	__u64 src_length;	/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;	/* must be zero */
+	__u32 reserved2;	/* must be zero */
+	struct file_dedupe_range_info info[0];
+};
+
 /* And dynamically-tunable limits and defaults: */
 struct files_stat_struct {
 	unsigned long nr_files;		/* read only */
@@ -177,6 +206,7 @@ struct blkzeroout2 {
 #define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
 #define FICLONE		_IOW(0x94, 9, int)
 #define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
+#define FIDEDUPERANGE	_IOWR(0x94, 54, struct file_dedupe_range)
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2015-12-19  8:55   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:55 UTC (permalink / raw)
  To: david-FqsqvQoI3Ljby3iVrkZq2A, darrick.wong-QHcLZuEGTsvQT0dZR+AlfA
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ

Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
more systematic (FIDEDUPERANGE).

Signed-off-by: Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 fs/compat_ioctl.c       |    1 
 fs/ioctl.c              |   38 ++++++++++++++++++
 fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h      |    4 ++
 include/uapi/linux/fs.h |   30 ++++++++++++++
 5 files changed, 173 insertions(+)


diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 70d4b10..eab31e7 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
 
 	case FICLONE:
 	case FICLONERANGE:
+	case FIDEDUPERANGE:
 		goto do_ioctl;
 
 	case FIBMAP:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 84c6e79..fcdd33b 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
 	return thaw_super(sb);
 }
 
+static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
+{
+	struct file_dedupe_range __user *argp = arg;
+	struct file_dedupe_range *same = NULL;
+	int ret;
+	unsigned long size;
+	u16 count;
+
+	if (get_user(count, &argp->dest_count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	size = offsetof(struct file_dedupe_range __user, info[count]);
+
+	same = memdup_user(argp, size);
+	if (IS_ERR(same)) {
+		ret = PTR_ERR(same);
+		same = NULL;
+		goto out;
+	}
+
+	ret = vfs_dedupe_file_range(file, same);
+	if (ret)
+		goto out;
+
+	ret = copy_to_user(argp, same, size);
+	if (ret)
+		ret = -EFAULT;
+
+out:
+	kfree(same);
+	return ret;
+}
+
 /*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
@@ -629,6 +664,9 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FICLONERANGE:
 		return ioctl_file_clone_range(filp, argp);
 
+	case FIDEDUPERANGE:
+		return ioctl_file_dedupe_range(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/fs/read_write.c b/fs/read_write.c
index 0713e28..aaaad52 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1523,3 +1523,103 @@ int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	return ret;
 }
 EXPORT_SYMBOL(vfs_clone_file_range);
+
+int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
+{
+	struct file_dedupe_range_info *info;
+	struct inode *src = file_inode(file);
+	u64 off;
+	u64 len;
+	int i;
+	int ret;
+	bool is_admin = capable(CAP_SYS_ADMIN);
+	u16 count = same->dest_count;
+	struct file *dst_file;
+	loff_t dst_off;
+	ssize_t deduped;
+
+	if (!(file->f_mode & FMODE_READ))
+		return -EINVAL;
+
+	if (same->reserved1 || same->reserved2)
+		return -EINVAL;
+
+	off = same->src_offset;
+	len = same->src_length;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode))
+		goto out;
+
+	ret = -EINVAL;
+	if (!S_ISREG(src->i_mode))
+		goto out;
+
+	ret = clone_verify_area(file, off, len, false);
+	if (ret < 0)
+		goto out;
+	ret = 0;
+
+	/* pre-format output fields to sane values */
+	for (i = 0; i < count; i++) {
+		same->info[i].bytes_deduped = 0ULL;
+		same->info[i].status = FILE_DEDUPE_RANGE_SAME;
+	}
+
+	for (i = 0, info = same->info; i < count; i++, info++) {
+		struct inode *dst;
+		struct fd dst_fd = fdget(info->dest_fd);
+
+		dst_file = dst_fd.file;
+		if (!dst_file) {
+			info->status = -EBADF;
+			goto next_loop;
+		}
+		dst = file_inode(dst_file);
+
+		ret = mnt_want_write_file(dst_file);
+		if (ret) {
+			info->status = ret;
+			goto next_loop;
+		}
+
+		dst_off = info->dest_offset;
+		ret = clone_verify_area(dst_file, dst_off, len, true);
+		if (ret < 0) {
+			info->status = ret;
+			goto next_file;
+		}
+		ret = 0;
+
+		if (info->reserved) {
+			info->status = -EINVAL;
+		} else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+			info->status = -EINVAL;
+		} else if (file->f_path.mnt != dst_file->f_path.mnt) {
+			info->status = -EXDEV;
+		} else if (S_ISDIR(dst->i_mode)) {
+			info->status = -EISDIR;
+		} else if (dst_file->f_op->dedupe_file_range == NULL) {
+			info->status = -EINVAL;
+		} else {
+			deduped = dst_file->f_op->dedupe_file_range(file, off,
+							len, dst_file,
+							info->dest_offset);
+			if (deduped == -EBADE)
+				info->status = FILE_DEDUPE_RANGE_DIFFERS;
+			else if (deduped < 0)
+				info->status = deduped;
+			else
+				info->bytes_deduped += deduped;
+		}
+
+next_file:
+		mnt_drop_write_file(dst_file);
+next_loop:
+		fdput(dst_fd);
+	}
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL(vfs_dedupe_file_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 26b7607..bcfa23d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1633,6 +1633,8 @@ struct file_operations {
 			loff_t, size_t, unsigned int);
 	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
 			u64);
+	ssize_t (*dedupe_file_range)(struct file *, u64, u64, struct file *,
+			u64);
 };
 
 struct inode_operations {
@@ -1688,6 +1690,8 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len);
+extern int vfs_dedupe_file_range(struct file *file,
+				 struct file_dedupe_range *same);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index ad60359..801986e 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -52,6 +52,35 @@ struct fstrim_range {
 	__u64 minlen;
 };
 
+/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
+#define FILE_DEDUPE_RANGE_SAME		0
+#define FILE_DEDUPE_RANGE_DIFFERS	1
+
+/* from struct btrfs_ioctl_file_extent_same_info */
+struct file_dedupe_range_info {
+	__s64 dest_fd;		/* in - destination file */
+	__u64 dest_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file. */
+	/* status of this dedupe operation:
+	 * < 0 for error
+	 * == FILE_DEDUPE_RANGE_SAME if dedupe succeeds
+	 * == FILE_DEDUPE_RANGE_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;		/* must be zero */
+};
+
+/* from struct btrfs_ioctl_file_extent_same_args */
+struct file_dedupe_range {
+	__u64 src_offset;	/* in - start of extent in source */
+	__u64 src_length;	/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;	/* must be zero */
+	__u32 reserved2;	/* must be zero */
+	struct file_dedupe_range_info info[0];
+};
+
 /* And dynamically-tunable limits and defaults: */
 struct files_stat_struct {
 	unsigned long nr_files;		/* read only */
@@ -177,6 +206,7 @@ struct blkzeroout2 {
 #define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
 #define FICLONE		_IOW(0x94, 9, int)
 #define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
+#define FIDEDUPERANGE	_IOWR(0x94, 54, struct file_dedupe_range)
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 9/9] btrfs: use new dedupe data function pointer
  2015-12-19  8:55 ` Darrick J. Wong
@ 2015-12-19  8:56   ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:56 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, linux-api, xfs

Now that the VFS encapsulates the dedupe ioctl, wire up btrfs to it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/btrfs/ctree.h |    2 +
 fs/btrfs/file.c  |    1 
 fs/btrfs/ioctl.c |  110 ++++++------------------------------------------------
 3 files changed, 16 insertions(+), 97 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dd4733f..b7e4e34 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4024,6 +4024,8 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 				struct btrfs_ioctl_space_info *space);
 void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
 			       struct btrfs_ioctl_balance_args *bargs);
+ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
+			   struct file *dst_file, u64 dst_loff);
 
 /* file.c */
 int btrfs_auto_defrag_init(void);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 232e300..d012e0a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2926,6 +2926,7 @@ const struct file_operations btrfs_file_operations = {
 #endif
 	.copy_file_range = btrfs_copy_file_range,
 	.clone_file_range = btrfs_clone_file_range,
+	.dedupe_file_range = btrfs_dedupe_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 85b1cae..e219973 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2962,7 +2962,7 @@ static int btrfs_cmp_data(struct inode *src, u64 loff, struct inode *dst,
 		flush_dcache_page(dst_page);
 
 		if (memcmp(addr, dst_addr, cmp_len))
-			ret = BTRFS_SAME_DATA_DIFFERS;
+			ret = -EBADE;
 
 		kunmap_atomic(addr);
 		kunmap_atomic(dst_addr);
@@ -3098,53 +3098,16 @@ out_unlock:
 
 #define BTRFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
 
-static long btrfs_ioctl_file_extent_same(struct file *file,
-			struct btrfs_ioctl_same_args __user *argp)
+ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
+				struct file *dst_file, u64 dst_loff)
 {
-	struct btrfs_ioctl_same_args *same = NULL;
-	struct btrfs_ioctl_same_extent_info *info;
-	struct inode *src = file_inode(file);
-	u64 off;
-	u64 len;
-	int i;
-	int ret;
-	unsigned long size;
+	struct inode *src = file_inode(src_file);
+	struct inode *dst = file_inode(dst_file);
 	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
-	bool is_admin = capable(CAP_SYS_ADMIN);
-	u16 count;
-
-	if (!(file->f_mode & FMODE_READ))
-		return -EINVAL;
+	ssize_t res;
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	if (get_user(count, &argp->dest_count)) {
-		ret = -EFAULT;
-		goto out;
-	}
-
-	size = offsetof(struct btrfs_ioctl_same_args __user, info[count]);
-
-	same = memdup_user(argp, size);
-
-	if (IS_ERR(same)) {
-		ret = PTR_ERR(same);
-		same = NULL;
-		goto out;
-	}
-
-	off = same->logical_offset;
-	len = same->length;
-
-	/*
-	 * Limit the total length we will dedupe for each operation.
-	 * This is intended to bound the total time spent in this
-	 * ioctl to something sane.
-	 */
-	if (len > BTRFS_MAX_DEDUPE_LEN)
-		len = BTRFS_MAX_DEDUPE_LEN;
+	if (olen > BTRFS_MAX_DEDUPE_LEN)
+		olen = BTRFS_MAX_DEDUPE_LEN;
 
 	if (WARN_ON_ONCE(bs < PAGE_CACHE_SIZE)) {
 		/*
@@ -3152,58 +3115,13 @@ static long btrfs_ioctl_file_extent_same(struct file *file,
 		 * result, btrfs_cmp_data() won't correctly handle
 		 * this situation without an update.
 		 */
-		ret = -EINVAL;
-		goto out;
-	}
-
-	ret = -EISDIR;
-	if (S_ISDIR(src->i_mode))
-		goto out;
-
-	ret = -EACCES;
-	if (!S_ISREG(src->i_mode))
-		goto out;
-
-	/* pre-format output fields to sane values */
-	for (i = 0; i < count; i++) {
-		same->info[i].bytes_deduped = 0ULL;
-		same->info[i].status = 0;
-	}
-
-	for (i = 0, info = same->info; i < count; i++, info++) {
-		struct inode *dst;
-		struct fd dst_file = fdget(info->fd);
-		if (!dst_file.file) {
-			info->status = -EBADF;
-			continue;
-		}
-		dst = file_inode(dst_file.file);
-
-		if (!(is_admin || (dst_file.file->f_mode & FMODE_WRITE))) {
-			info->status = -EINVAL;
-		} else if (file->f_path.mnt != dst_file.file->f_path.mnt) {
-			info->status = -EXDEV;
-		} else if (S_ISDIR(dst->i_mode)) {
-			info->status = -EISDIR;
-		} else if (!S_ISREG(dst->i_mode)) {
-			info->status = -EACCES;
-		} else {
-			info->status = btrfs_extent_same(src, off, len, dst,
-							info->logical_offset);
-			if (info->status == 0)
-				info->bytes_deduped += len;
-		}
-		fdput(dst_file);
+		return -EINVAL;
 	}
 
-	ret = copy_to_user(argp, same, size);
-	if (ret)
-		ret = -EFAULT;
-
-out:
-	mnt_drop_write_file(file);
-	kfree(same);
-	return ret;
+	res = btrfs_extent_same(src, loff, olen, dst, dst_loff);
+	if (res)
+		return res;
+	return olen;
 }
 
 static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
@@ -5536,8 +5454,6 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_fslabel(file, argp);
 	case BTRFS_IOC_SET_FSLABEL:
 		return btrfs_ioctl_set_fslabel(file, argp);
-	case BTRFS_IOC_FILE_EXTENT_SAME:
-		return btrfs_ioctl_file_extent_same(file, argp);
 	case BTRFS_IOC_GET_SUPPORTED_FEATURES:
 		return btrfs_ioctl_get_supported_features(file, argp);
 	case BTRFS_IOC_GET_FEATURES:


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 9/9] btrfs: use new dedupe data function pointer
@ 2015-12-19  8:56   ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2015-12-19  8:56 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, linux-api, xfs

Now that the VFS encapsulates the dedupe ioctl, wire up btrfs to it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/btrfs/ctree.h |    2 +
 fs/btrfs/file.c  |    1 
 fs/btrfs/ioctl.c |  110 ++++++------------------------------------------------
 3 files changed, 16 insertions(+), 97 deletions(-)


diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dd4733f..b7e4e34 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4024,6 +4024,8 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 				struct btrfs_ioctl_space_info *space);
 void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
 			       struct btrfs_ioctl_balance_args *bargs);
+ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
+			   struct file *dst_file, u64 dst_loff);
 
 /* file.c */
 int btrfs_auto_defrag_init(void);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 232e300..d012e0a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2926,6 +2926,7 @@ const struct file_operations btrfs_file_operations = {
 #endif
 	.copy_file_range = btrfs_copy_file_range,
 	.clone_file_range = btrfs_clone_file_range,
+	.dedupe_file_range = btrfs_dedupe_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 85b1cae..e219973 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2962,7 +2962,7 @@ static int btrfs_cmp_data(struct inode *src, u64 loff, struct inode *dst,
 		flush_dcache_page(dst_page);
 
 		if (memcmp(addr, dst_addr, cmp_len))
-			ret = BTRFS_SAME_DATA_DIFFERS;
+			ret = -EBADE;
 
 		kunmap_atomic(addr);
 		kunmap_atomic(dst_addr);
@@ -3098,53 +3098,16 @@ out_unlock:
 
 #define BTRFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
 
-static long btrfs_ioctl_file_extent_same(struct file *file,
-			struct btrfs_ioctl_same_args __user *argp)
+ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
+				struct file *dst_file, u64 dst_loff)
 {
-	struct btrfs_ioctl_same_args *same = NULL;
-	struct btrfs_ioctl_same_extent_info *info;
-	struct inode *src = file_inode(file);
-	u64 off;
-	u64 len;
-	int i;
-	int ret;
-	unsigned long size;
+	struct inode *src = file_inode(src_file);
+	struct inode *dst = file_inode(dst_file);
 	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
-	bool is_admin = capable(CAP_SYS_ADMIN);
-	u16 count;
-
-	if (!(file->f_mode & FMODE_READ))
-		return -EINVAL;
+	ssize_t res;
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	if (get_user(count, &argp->dest_count)) {
-		ret = -EFAULT;
-		goto out;
-	}
-
-	size = offsetof(struct btrfs_ioctl_same_args __user, info[count]);
-
-	same = memdup_user(argp, size);
-
-	if (IS_ERR(same)) {
-		ret = PTR_ERR(same);
-		same = NULL;
-		goto out;
-	}
-
-	off = same->logical_offset;
-	len = same->length;
-
-	/*
-	 * Limit the total length we will dedupe for each operation.
-	 * This is intended to bound the total time spent in this
-	 * ioctl to something sane.
-	 */
-	if (len > BTRFS_MAX_DEDUPE_LEN)
-		len = BTRFS_MAX_DEDUPE_LEN;
+	if (olen > BTRFS_MAX_DEDUPE_LEN)
+		olen = BTRFS_MAX_DEDUPE_LEN;
 
 	if (WARN_ON_ONCE(bs < PAGE_CACHE_SIZE)) {
 		/*
@@ -3152,58 +3115,13 @@ static long btrfs_ioctl_file_extent_same(struct file *file,
 		 * result, btrfs_cmp_data() won't correctly handle
 		 * this situation without an update.
 		 */
-		ret = -EINVAL;
-		goto out;
-	}
-
-	ret = -EISDIR;
-	if (S_ISDIR(src->i_mode))
-		goto out;
-
-	ret = -EACCES;
-	if (!S_ISREG(src->i_mode))
-		goto out;
-
-	/* pre-format output fields to sane values */
-	for (i = 0; i < count; i++) {
-		same->info[i].bytes_deduped = 0ULL;
-		same->info[i].status = 0;
-	}
-
-	for (i = 0, info = same->info; i < count; i++, info++) {
-		struct inode *dst;
-		struct fd dst_file = fdget(info->fd);
-		if (!dst_file.file) {
-			info->status = -EBADF;
-			continue;
-		}
-		dst = file_inode(dst_file.file);
-
-		if (!(is_admin || (dst_file.file->f_mode & FMODE_WRITE))) {
-			info->status = -EINVAL;
-		} else if (file->f_path.mnt != dst_file.file->f_path.mnt) {
-			info->status = -EXDEV;
-		} else if (S_ISDIR(dst->i_mode)) {
-			info->status = -EISDIR;
-		} else if (!S_ISREG(dst->i_mode)) {
-			info->status = -EACCES;
-		} else {
-			info->status = btrfs_extent_same(src, off, len, dst,
-							info->logical_offset);
-			if (info->status == 0)
-				info->bytes_deduped += len;
-		}
-		fdput(dst_file);
+		return -EINVAL;
 	}
 
-	ret = copy_to_user(argp, same, size);
-	if (ret)
-		ret = -EFAULT;
-
-out:
-	mnt_drop_write_file(file);
-	kfree(same);
-	return ret;
+	res = btrfs_extent_same(src, loff, olen, dst, dst_loff);
+	if (res)
+		return res;
+	return olen;
 }
 
 static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
@@ -5536,8 +5454,6 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_fslabel(file, argp);
 	case BTRFS_IOC_SET_FSLABEL:
 		return btrfs_ioctl_set_fslabel(file, argp);
-	case BTRFS_IOC_FILE_EXTENT_SAME:
-		return btrfs_ioctl_file_extent_same(file, argp);
 	case BTRFS_IOC_GET_SUPPORTED_FEATURES:
 		return btrfs_ioctl_get_supported_features(file, argp);
 	case BTRFS_IOC_GET_FEATURES:

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS
  2015-12-19  8:55 ` Darrick J. Wong
  (?)
@ 2015-12-20 15:30   ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2015-12-20 15:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, linux-api, xfs

FYI, it might make sense to base this off Al's for-next tree
which has Anna's and my patches already.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS
@ 2015-12-20 15:30   ` Christoph Hellwig
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2015-12-20 15:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-api, xfs

FYI, it might make sense to base this off Al's for-next tree
which has Anna's and my patches already.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS
@ 2015-12-20 15:30   ` Christoph Hellwig
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2015-12-20 15:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david-FqsqvQoI3Ljby3iVrkZq2A,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ

FYI, it might make sense to base this off Al's for-next tree
which has Anna's and my patches already.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
  2015-12-19  8:55   ` Darrick J. Wong
@ 2016-01-12  6:07     ` Eric Biggers
  -1 siblings, 0 replies; 50+ messages in thread
From: Eric Biggers @ 2016-01-12  6:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, linux-api, xfs

Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
(note: I realize the patch is mostly just moving the code that already existed
in btrfs, but in the VFS it deserves a more thorough review):

At high level, I am confused about what is meant by the "source" and
"destination" files.  I understand that with more than two files, you
effectively have to choose one file to treat specially and dedupe with all the
other files (an NxN comparison isn't realistic).  But with just two files, a
deduplication operation should be completely symmetric, should it not?  The end
result should be that the data is deduplicated, regardless of the order in which
I gave the file descriptors.  So why is there some non-symmetric behavior?
There are several examples but one is that the VFS is checking !S_ISREG() on the
"source" file descriptor but not on the "destination" file descriptor.  Another
is that different permissions are required on the source versus on the
destination.  If there are good reasons for the nonsymmetry then this needs to
be clearly explained in the man page; otherwise it may not be clear what to use
as the "source" and what to use as the "destination".

It seems odd to be adding "copy" as a system call but then have "dedupe" and
"clone" as ioctls rather than system calls... it seems that they should all be
one or the other (at least, if we put aside the fact that the ioctls already
exist in btrfs).

The range checking in clone_verify_area() appears incomplete.  Someone could
provide len=UINT64_MAX and all the checks would still pass even though 'pos+len'
would overflow.

Should the ioctl be interruptible?  Right now it always goes through *all* the
'struct file_dedupe_range_info's you passed in --- potentially up to 65535 of
them.

Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
whole.

What permissions do you need on the destination file descriptors?  The man page
implies they must be open for writing and not appending.  The implementation
differs: it requires FMODE_WRITE only for non-admin users, and it doesn't check
for O_APPEND at all.  The man page also says you get EPERM if "dest_fd is
immutable" and ETXTBSY if "one of the files is a swap file", which I don't see
actually happening in the implementation; it seems those error codes perhaps
exist at all for this ioctl but rather be left to open(..., O_WRONLY).

If the filesystem doesn't support deduplication, or I pass in a strange file
descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The man
page isn't clear.

Under what circumstances will 'bytes_deduped' differ from the count that was
passed in?  If short counts are allowed, what will be the 'status' be in that
case: FILE_DEDUP_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?  Can
data be deduped even if only a prefix of the data region matches?

The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.

The man page isn't clear about whether the ioctl stops early if an error occurs
or always processes all the 'struct file_dedupe_range_info's you pass in.  And
if it were, hypothetically, to stop early, how is the user meant to know on
which file it stopped?

The man page says "logical_offset" but in the struct it is called "dest_offset".

There are some variables named "same" which don't really make sense now that the
ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2016-01-12  6:07     ` Eric Biggers
  0 siblings, 0 replies; 50+ messages in thread
From: Eric Biggers @ 2016-01-12  6:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-api, xfs

Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
(note: I realize the patch is mostly just moving the code that already existed
in btrfs, but in the VFS it deserves a more thorough review):

At high level, I am confused about what is meant by the "source" and
"destination" files.  I understand that with more than two files, you
effectively have to choose one file to treat specially and dedupe with all the
other files (an NxN comparison isn't realistic).  But with just two files, a
deduplication operation should be completely symmetric, should it not?  The end
result should be that the data is deduplicated, regardless of the order in which
I gave the file descriptors.  So why is there some non-symmetric behavior?
There are several examples but one is that the VFS is checking !S_ISREG() on the
"source" file descriptor but not on the "destination" file descriptor.  Another
is that different permissions are required on the source versus on the
destination.  If there are good reasons for the nonsymmetry then this needs to
be clearly explained in the man page; otherwise it may not be clear what to use
as the "source" and what to use as the "destination".

It seems odd to be adding "copy" as a system call but then have "dedupe" and
"clone" as ioctls rather than system calls... it seems that they should all be
one or the other (at least, if we put aside the fact that the ioctls already
exist in btrfs).

The range checking in clone_verify_area() appears incomplete.  Someone could
provide len=UINT64_MAX and all the checks would still pass even though 'pos+len'
would overflow.

Should the ioctl be interruptible?  Right now it always goes through *all* the
'struct file_dedupe_range_info's you passed in --- potentially up to 65535 of
them.

Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
whole.

What permissions do you need on the destination file descriptors?  The man page
implies they must be open for writing and not appending.  The implementation
differs: it requires FMODE_WRITE only for non-admin users, and it doesn't check
for O_APPEND at all.  The man page also says you get EPERM if "dest_fd is
immutable" and ETXTBSY if "one of the files is a swap file", which I don't see
actually happening in the implementation; it seems those error codes perhaps
exist at all for this ioctl but rather be left to open(..., O_WRONLY).

If the filesystem doesn't support deduplication, or I pass in a strange file
descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The man
page isn't clear.

Under what circumstances will 'bytes_deduped' differ from the count that was
passed in?  If short counts are allowed, what will be the 'status' be in that
case: FILE_DEDUP_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?  Can
data be deduped even if only a prefix of the data region matches?

The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.

The man page isn't clear about whether the ioctl stops early if an error occurs
or always processes all the 'struct file_dedupe_range_info's you pass in.  And
if it were, hypothetically, to stop early, how is the user meant to know on
which file it stopped?

The man page says "logical_offset" but in the struct it is called "dest_offset".

There are some variables named "same" which don't really make sense now that the
ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.

Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
  2016-01-12  6:07     ` Eric Biggers
@ 2016-01-12  9:14       ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2016-01-12  9:14 UTC (permalink / raw)
  To: Eric Biggers; +Cc: david, linux-fsdevel, linux-api, xfs, linux-btrfs, hch

[adding btrfs to the cc since we're talking about a whole new dedupe interface]

On Tue, Jan 12, 2016 at 12:07:14AM -0600, Eric Biggers wrote:
> Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
> (note: I realize the patch is mostly just moving the code that already existed
> in btrfs, but in the VFS it deserves a more thorough review):

Wheee. :)

Yes, let's discuss the concerns about the btrfs extent same ioctl.

I believe Christoph dislikes about the odd return mechanism (i.e. status and
bytes_deduped) and doubts that the vectorization is really necessary.  There's
not a lot of documentation to go on aside from "Do whatever the BTRFS ioctl
does".  I suspect that will leave my explanations lackng, since I neither
designed the btrfs interface nor know all that much about the decisions made to
arrive at what we have now.

(I agree with both of hch's complaints.)

Really, the best argument for keeping this ioctl is to avoid breaking
duperemove.  Even then, given that current duperemove checks for btrfs before
trying to use BTRFS_IOC_EXTENT_SAME, we could very well design a new dedupe
ioctl for the VFS, hook the new dedupers (XFS) into the new VFS ioctl
leaving the old btrfs ioctl intact, and train duperemove to try the new
ioctl and fall back on the btrfs one if the VFS ioctl isn't supported.

Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
that resembles the clone_range interface:

int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);

struct file_dedupe_range {
    __s64 src_fd;
    __u64 src_offset;
    __u64 length;
    __u64 dest_offset;
    __u64 flags;
};

"See if the byte range src_offset:length in src_fd matches all of
dest_offset:length in dest_fd; if so, share src_fd's physical storage with
dest_fd.  Both fds must be files; if they are the same file the ranges cannot
overlap; src_fd must be readable; dest_fd must be writable or append-only.
Offsets and lengths probably need to be block-aligned, but that is filesystem
dependent."

The error conditions would be superset of the ones we know about today.  I'd
return EOVERFLOW or something if length is longer than the FS wants to deal
with.

Now all the vectorization problems go away, and since it's a new VFS interface
we can define everything from the start.

Christoph, if this new interface solves your complaints I think I'd like to get
started on the code/docs soon.

> At high level, I am confused about what is meant by the "source" and
> "destination" files.  I understand that with more than two files, you
> effectively have to choose one file to treat specially and dedupe with all
> the other files (an NxN comparison isn't realistic).  But with just two
> files, a deduplication operation should be completely symmetric, should it
> not?  The end

Not sure what you mean by 'symmetric', but in any case the convention seems
to be that src_fd's storage is shared with dest_fd if there's a match.

> result should be that the data is deduplicated, regardless of the order in
> which I gave the file descriptors.  So why is there some non-symmetric
> behavior?  There are several examples but one is that the VFS is checking
> !S_ISREG() on the "source" file descriptor but not on the "destination" file
> descriptor.

The dedupe_range function pointer should only be supplied for regular files.

> Another is that different permissions are required on the source versus on
> the destination.  If there are good reasons for the nonsymmetry then this
> needs to be clearly explained in the man page; otherwise it may not be clear
> what to use as the "source" and what to use as the "destination".
> 
> It seems odd to be adding "copy" as a system call but then have "dedupe" and
> "clone" as ioctls rather than system calls... it seems that they should all
> be one or the other (at least, if we put aside the fact that the ioctls
> already exist in btrfs).

We can't put the clone ioctl aside; coreutils has already started using it.

I'm not sure if clone_range or extent_same are all that popular, though.
AFAIK duperemove is the only program using extent_same, and I don't know
of anything using clone_range.

(Well, xfs_io does...)

> The range checking in clone_verify_area() appears incomplete.  Someone could
> provide len=UINT64_MAX and all the checks would still pass even though
> 'pos+len' would overflow.

Yeah...

> Should the ioctl be interruptible?  Right now it always goes through *all*
> the 'struct file_dedupe_range_info's you passed in --- potentially up to
> 65535 of them.

There probably ought to be explicit signal checks, or we could just get rid
of the vectorization entirely. :)

> Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
> deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
> whole.

Right, because bytes_deduped is a part of file_dedup_range_info, not
file_dedupe_range.

(Note the bytes_deduped = 0 earlier in the function.)

> What permissions do you need on the destination file descriptors?  The man
> page implies they must be open for writing and not appending.  The
> implementation differs: it requires FMODE_WRITE only for non-admin users, and
> it doesn't check for O_APPEND at all.

I think the result of an earlier discussion was that src_fd must be readable,
and dest_fd must be writable or appendable.

> The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
> "one of the files is a swap file", which I don't see actually happening in
> the implementation; it seems those error codes perhaps exist at all for this
> ioctl but rather be left to open(..., O_WRONLY).

Those could be hoisted to the VFS (from the XFS implementation), I think.

> If the filesystem doesn't support deduplication, or I pass in a strange file
> descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The
> man page isn't clear.

Should be EOPNOTSUPP if dest_fd isn't a regular file; EISDIR if either are
directories; and EINVAL for any other kind of non-file fd.  I suspect the
clone* manpages don't make this too clear either.

> Under what circumstances will 'bytes_deduped' differ from the count that was
> passed in?

btrfs/xfs will only compare the first 16MB.  Not documented anywhere. :(

> If short counts are allowed, what will be the 'status' be in that case:
> FILE_DEDUP_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?

One of those two.

> Can data be deduped even if only a prefix of the data region matches?

No.

> The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
> 0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.

Oops, good catch. :(

> The man page isn't clear about whether the ioctl stops early if an error
> occurs or always processes all the 'struct file_dedupe_range_info's you pass
> in.  And if it were, hypothetically, to stop early, how is the user meant to
> know on which file it stopped?

I don't know if this should be the official behavior, but it stopped at
whichever file_dedupe_range_info has both status and bytes_deduped set to zero.

> The man page says "logical_offset" but in the struct it is called
> "dest_offset".

Oops.

> There are some variables named "same" which don't really make sense now that
> the ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.

Perhaps not.

I'll later take a look at how many of these issues apply to clone/clone_range.

--D

> 
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2016-01-12  9:14       ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2016-01-12  9:14 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-api, xfs, hch, linux-fsdevel, linux-btrfs

[adding btrfs to the cc since we're talking about a whole new dedupe interface]

On Tue, Jan 12, 2016 at 12:07:14AM -0600, Eric Biggers wrote:
> Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
> (note: I realize the patch is mostly just moving the code that already existed
> in btrfs, but in the VFS it deserves a more thorough review):

Wheee. :)

Yes, let's discuss the concerns about the btrfs extent same ioctl.

I believe Christoph dislikes about the odd return mechanism (i.e. status and
bytes_deduped) and doubts that the vectorization is really necessary.  There's
not a lot of documentation to go on aside from "Do whatever the BTRFS ioctl
does".  I suspect that will leave my explanations lackng, since I neither
designed the btrfs interface nor know all that much about the decisions made to
arrive at what we have now.

(I agree with both of hch's complaints.)

Really, the best argument for keeping this ioctl is to avoid breaking
duperemove.  Even then, given that current duperemove checks for btrfs before
trying to use BTRFS_IOC_EXTENT_SAME, we could very well design a new dedupe
ioctl for the VFS, hook the new dedupers (XFS) into the new VFS ioctl
leaving the old btrfs ioctl intact, and train duperemove to try the new
ioctl and fall back on the btrfs one if the VFS ioctl isn't supported.

Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
that resembles the clone_range interface:

int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);

struct file_dedupe_range {
    __s64 src_fd;
    __u64 src_offset;
    __u64 length;
    __u64 dest_offset;
    __u64 flags;
};

"See if the byte range src_offset:length in src_fd matches all of
dest_offset:length in dest_fd; if so, share src_fd's physical storage with
dest_fd.  Both fds must be files; if they are the same file the ranges cannot
overlap; src_fd must be readable; dest_fd must be writable or append-only.
Offsets and lengths probably need to be block-aligned, but that is filesystem
dependent."

The error conditions would be superset of the ones we know about today.  I'd
return EOVERFLOW or something if length is longer than the FS wants to deal
with.

Now all the vectorization problems go away, and since it's a new VFS interface
we can define everything from the start.

Christoph, if this new interface solves your complaints I think I'd like to get
started on the code/docs soon.

> At high level, I am confused about what is meant by the "source" and
> "destination" files.  I understand that with more than two files, you
> effectively have to choose one file to treat specially and dedupe with all
> the other files (an NxN comparison isn't realistic).  But with just two
> files, a deduplication operation should be completely symmetric, should it
> not?  The end

Not sure what you mean by 'symmetric', but in any case the convention seems
to be that src_fd's storage is shared with dest_fd if there's a match.

> result should be that the data is deduplicated, regardless of the order in
> which I gave the file descriptors.  So why is there some non-symmetric
> behavior?  There are several examples but one is that the VFS is checking
> !S_ISREG() on the "source" file descriptor but not on the "destination" file
> descriptor.

The dedupe_range function pointer should only be supplied for regular files.

> Another is that different permissions are required on the source versus on
> the destination.  If there are good reasons for the nonsymmetry then this
> needs to be clearly explained in the man page; otherwise it may not be clear
> what to use as the "source" and what to use as the "destination".
> 
> It seems odd to be adding "copy" as a system call but then have "dedupe" and
> "clone" as ioctls rather than system calls... it seems that they should all
> be one or the other (at least, if we put aside the fact that the ioctls
> already exist in btrfs).

We can't put the clone ioctl aside; coreutils has already started using it.

I'm not sure if clone_range or extent_same are all that popular, though.
AFAIK duperemove is the only program using extent_same, and I don't know
of anything using clone_range.

(Well, xfs_io does...)

> The range checking in clone_verify_area() appears incomplete.  Someone could
> provide len=UINT64_MAX and all the checks would still pass even though
> 'pos+len' would overflow.

Yeah...

> Should the ioctl be interruptible?  Right now it always goes through *all*
> the 'struct file_dedupe_range_info's you passed in --- potentially up to
> 65535 of them.

There probably ought to be explicit signal checks, or we could just get rid
of the vectorization entirely. :)

> Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
> deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
> whole.

Right, because bytes_deduped is a part of file_dedup_range_info, not
file_dedupe_range.

(Note the bytes_deduped = 0 earlier in the function.)

> What permissions do you need on the destination file descriptors?  The man
> page implies they must be open for writing and not appending.  The
> implementation differs: it requires FMODE_WRITE only for non-admin users, and
> it doesn't check for O_APPEND at all.

I think the result of an earlier discussion was that src_fd must be readable,
and dest_fd must be writable or appendable.

> The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
> "one of the files is a swap file", which I don't see actually happening in
> the implementation; it seems those error codes perhaps exist at all for this
> ioctl but rather be left to open(..., O_WRONLY).

Those could be hoisted to the VFS (from the XFS implementation), I think.

> If the filesystem doesn't support deduplication, or I pass in a strange file
> descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The
> man page isn't clear.

Should be EOPNOTSUPP if dest_fd isn't a regular file; EISDIR if either are
directories; and EINVAL for any other kind of non-file fd.  I suspect the
clone* manpages don't make this too clear either.

> Under what circumstances will 'bytes_deduped' differ from the count that was
> passed in?

btrfs/xfs will only compare the first 16MB.  Not documented anywhere. :(

> If short counts are allowed, what will be the 'status' be in that case:
> FILE_DEDUP_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?

One of those two.

> Can data be deduped even if only a prefix of the data region matches?

No.

> The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
> 0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.

Oops, good catch. :(

> The man page isn't clear about whether the ioctl stops early if an error
> occurs or always processes all the 'struct file_dedupe_range_info's you pass
> in.  And if it were, hypothetically, to stop early, how is the user meant to
> know on which file it stopped?

I don't know if this should be the official behavior, but it stopped at
whichever file_dedupe_range_info has both status and bytes_deduped set to zero.

> The man page says "logical_offset" but in the struct it is called
> "dest_offset".

Oops.

> There are some variables named "same" which don't really make sense now that
> the ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.

Perhaps not.

I'll later take a look at how many of these issues apply to clone/clone_range.

--D

> 
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
  2016-01-12  9:14       ` Darrick J. Wong
  (?)
@ 2016-01-13  2:36         ` Eric Biggers
  -1 siblings, 0 replies; 50+ messages in thread
From: Eric Biggers @ 2016-01-13  2:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, linux-api, xfs, linux-btrfs, hch

On Tue, Jan 12, 2016 at 01:14:32AM -0800, Darrick J. Wong wrote:
> 
> Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> that resembles the clone_range interface:
> 
> int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);

Getting rid of the vectorization certainly avoids a lot of API complexity.  I
don't know precisely why it was made vectorized in the first place, but to me it
seems it shouldn't be vectorized unless there's a strong reason for it to be.

> Not sure what you mean by 'symmetric', but in any case the convention seems
> to be that src_fd's storage is shared with dest_fd if there's a match.
> ...
> I think the result of an earlier discussion was that src_fd must be readable,
> and dest_fd must be writable or appendable.

Deduplication is a strange operation since you're not "reading" or "writing" in
the traditional sense.  All file data stays exactly the same from a user
perspective.

So rather than arbitrarily say that you're "reading" from one file and "writing"
to the other, one possibility is to have an API where you submit two files and
the implementation decides how to best deduplicate them.  That could involve
sharing existing extents in file 1 with file 2; or sharing existing extents in
file 2 with file 1; or creating a new extent and referencing it from each file.
It would be up to the implementation to choose the most efficient approach.

But I take it that the existing btrfs ioctl doesn't work that way, and it always
will share the existing extents in file 1 with file 2.

That's probably a perfectly fine way to do it, but I was wondering whether it
really makes sense to hard-code that detail in the API, since an API could be
defined more generally.

And from a practical perpective, people using the ioctl to deduplicate two files
need to know which one to submit to the API as the "source" and which as the
"destination", if it matters.

Related to this are the permissions you need on each file descriptor.  Since
deduplication isn't "reading" or "writing" in the traditional sense it could,
theoretically, be argued that no permissions at all should be required: neither
FMODE_READ nor FMODE_WRITE.

Most likely are concerns that would make that a bad idea, though.  Perhaps
someone could create mischief by causing fragmentation in system files.

If this was previously discussed, can you point me to that discussion?

The proposed man page is also very brief on permissions, only mentioning them
indirectly in the error codes section.

> > The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
> > "one of the files is a swap file", which I don't see actually happening in
> > the implementation; it seems those error codes perhaps exist at all for this
> > ioctl but rather be left to open(..., O_WRONLY).
> 
> Those could be hoisted to the VFS (from the XFS implementation), I think.

If the destination file has to already be open for writing then it can't be
immutable.  So I don't think that error code is needed.  Checking for an active
swapfile, on the other hand, may be valid, although I don't see such a check
anywhere in the version of the code I'm looking at.

> I'll later take a look at how many of these issues apply to clone/clone_range.

clone and clone_range seem to avoid the two biggest issues I saw: they aren't
vectorized, and there's a natural distinction between the "source" and the
"destination" of the operation.  They definitely need careful consideration as
well, though.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2016-01-13  2:36         ` Eric Biggers
  0 siblings, 0 replies; 50+ messages in thread
From: Eric Biggers @ 2016-01-13  2:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-api, xfs, hch, linux-fsdevel, linux-btrfs

On Tue, Jan 12, 2016 at 01:14:32AM -0800, Darrick J. Wong wrote:
> 
> Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> that resembles the clone_range interface:
> 
> int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);

Getting rid of the vectorization certainly avoids a lot of API complexity.  I
don't know precisely why it was made vectorized in the first place, but to me it
seems it shouldn't be vectorized unless there's a strong reason for it to be.

> Not sure what you mean by 'symmetric', but in any case the convention seems
> to be that src_fd's storage is shared with dest_fd if there's a match.
> ...
> I think the result of an earlier discussion was that src_fd must be readable,
> and dest_fd must be writable or appendable.

Deduplication is a strange operation since you're not "reading" or "writing" in
the traditional sense.  All file data stays exactly the same from a user
perspective.

So rather than arbitrarily say that you're "reading" from one file and "writing"
to the other, one possibility is to have an API where you submit two files and
the implementation decides how to best deduplicate them.  That could involve
sharing existing extents in file 1 with file 2; or sharing existing extents in
file 2 with file 1; or creating a new extent and referencing it from each file.
It would be up to the implementation to choose the most efficient approach.

But I take it that the existing btrfs ioctl doesn't work that way, and it always
will share the existing extents in file 1 with file 2.

That's probably a perfectly fine way to do it, but I was wondering whether it
really makes sense to hard-code that detail in the API, since an API could be
defined more generally.

And from a practical perpective, people using the ioctl to deduplicate two files
need to know which one to submit to the API as the "source" and which as the
"destination", if it matters.

Related to this are the permissions you need on each file descriptor.  Since
deduplication isn't "reading" or "writing" in the traditional sense it could,
theoretically, be argued that no permissions at all should be required: neither
FMODE_READ nor FMODE_WRITE.

Most likely are concerns that would make that a bad idea, though.  Perhaps
someone could create mischief by causing fragmentation in system files.

If this was previously discussed, can you point me to that discussion?

The proposed man page is also very brief on permissions, only mentioning them
indirectly in the error codes section.

> > The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
> > "one of the files is a swap file", which I don't see actually happening in
> > the implementation; it seems those error codes perhaps exist at all for this
> > ioctl but rather be left to open(..., O_WRONLY).
> 
> Those could be hoisted to the VFS (from the XFS implementation), I think.

If the destination file has to already be open for writing then it can't be
immutable.  So I don't think that error code is needed.  Checking for an active
swapfile, on the other hand, may be valid, although I don't see such a check
anywhere in the version of the code I'm looking at.

> I'll later take a look at how many of these issues apply to clone/clone_range.

clone and clone_range seem to avoid the two biggest issues I saw: they aren't
vectorized, and there's a natural distinction between the "source" and the
"destination" of the operation.  They definitely need careful consideration as
well, though.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2016-01-13  2:36         ` Eric Biggers
  0 siblings, 0 replies; 50+ messages in thread
From: Eric Biggers @ 2016-01-13  2:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david-FqsqvQoI3Ljby3iVrkZq2A,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA, hch-wEGCiKHe2LqWVfeAwA7xHQ

On Tue, Jan 12, 2016 at 01:14:32AM -0800, Darrick J. Wong wrote:
> 
> Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> that resembles the clone_range interface:
> 
> int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);

Getting rid of the vectorization certainly avoids a lot of API complexity.  I
don't know precisely why it was made vectorized in the first place, but to me it
seems it shouldn't be vectorized unless there's a strong reason for it to be.

> Not sure what you mean by 'symmetric', but in any case the convention seems
> to be that src_fd's storage is shared with dest_fd if there's a match.
> ...
> I think the result of an earlier discussion was that src_fd must be readable,
> and dest_fd must be writable or appendable.

Deduplication is a strange operation since you're not "reading" or "writing" in
the traditional sense.  All file data stays exactly the same from a user
perspective.

So rather than arbitrarily say that you're "reading" from one file and "writing"
to the other, one possibility is to have an API where you submit two files and
the implementation decides how to best deduplicate them.  That could involve
sharing existing extents in file 1 with file 2; or sharing existing extents in
file 2 with file 1; or creating a new extent and referencing it from each file.
It would be up to the implementation to choose the most efficient approach.

But I take it that the existing btrfs ioctl doesn't work that way, and it always
will share the existing extents in file 1 with file 2.

That's probably a perfectly fine way to do it, but I was wondering whether it
really makes sense to hard-code that detail in the API, since an API could be
defined more generally.

And from a practical perpective, people using the ioctl to deduplicate two files
need to know which one to submit to the API as the "source" and which as the
"destination", if it matters.

Related to this are the permissions you need on each file descriptor.  Since
deduplication isn't "reading" or "writing" in the traditional sense it could,
theoretically, be argued that no permissions at all should be required: neither
FMODE_READ nor FMODE_WRITE.

Most likely are concerns that would make that a bad idea, though.  Perhaps
someone could create mischief by causing fragmentation in system files.

If this was previously discussed, can you point me to that discussion?

The proposed man page is also very brief on permissions, only mentioning them
indirectly in the error codes section.

> > The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
> > "one of the files is a swap file", which I don't see actually happening in
> > the implementation; it seems those error codes perhaps exist at all for this
> > ioctl but rather be left to open(..., O_WRONLY).
> 
> Those could be hoisted to the VFS (from the XFS implementation), I think.

If the destination file has to already be open for writing then it can't be
immutable.  So I don't think that error code is needed.  Checking for an active
swapfile, on the other hand, may be valid, although I don't see such a check
anywhere in the version of the code I'm looking at.

> I'll later take a look at how many of these issues apply to clone/clone_range.

clone and clone_range seem to avoid the two biggest issues I saw: they aren't
vectorized, and there's a natural distinction between the "source" and the
"destination" of the operation.  They definitely need careful consideration as
well, though.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
  2016-01-13  2:36         ` Eric Biggers
  (?)
@ 2016-01-23  0:54           ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2016-01-23  0:54 UTC (permalink / raw)
  To: Eric Biggers; +Cc: david, linux-fsdevel, linux-api, xfs, linux-btrfs, hch

On Tue, Jan 12, 2016 at 08:36:58PM -0600, Eric Biggers wrote:
> On Tue, Jan 12, 2016 at 01:14:32AM -0800, Darrick J. Wong wrote:
> > 
> > Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> > that resembles the clone_range interface:
> > 
> > int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);
> 
> Getting rid of the vectorization certainly avoids a lot of API complexity.  I
> don't know precisely why it was made vectorized in the first place, but to me it
> seems it shouldn't be vectorized unless there's a strong reason for it to be.

I suppose the idea was that you could ask the kernel to dedupe a bunch of files
"atomically" and that locks would be held (at least on the source file) during
the entire operation...?

Ick.  Given that you can have 65536 vectors of 16MB apiece, you could be
reading up to 1TB of data into the pagecache.  That's a lot for a single
ioctl, though I guess we unlock all the inodes between each vector.

(Welll... we don't check for pending signals.  Urk.)

> > Not sure what you mean by 'symmetric', but in any case the convention seems
> > to be that src_fd's storage is shared with dest_fd if there's a match.
> > ...
> > I think the result of an earlier discussion was that src_fd must be readable,
> > and dest_fd must be writable or appendable.
> 
> Deduplication is a strange operation since you're not "reading" or "writing" in
> the traditional sense.  All file data stays exactly the same from a user
> perspective.
> 
> So rather than arbitrarily say that you're "reading" from one file and "writing"
> to the other, one possibility is to have an API where you submit two files and
> the implementation decides how to best deduplicate them.  That could involve
> sharing existing extents in file 1 with file 2; or sharing existing extents in
> file 2 with file 1; or creating a new extent and referencing it from each file.
> It would be up to the implementation to choose the most efficient approach.

That's a very good point!  We needn't limit the API to whatever the first
implementer does; all of the above are valid interpretations of what dedupe
could involve.  I might even argue that this is the use-case for the vectorized
dedupe call -- analyze all these proposed deduplications and figure out the
best plan to make it happen.

(Ofc the current implementation doesn't allow this, but we can change the
code when the singularity happens, etc.)

> But I take it that the existing btrfs ioctl doesn't work that way, and it always
> will share the existing extents in file 1 with file 2.
> 
> That's probably a perfectly fine way to do it, but I was wondering whether it
> really makes sense to hard-code that detail in the API, since an API could be
> defined more generally.
> 
> And from a practical perpective, people using the ioctl to deduplicate two files
> need to know which one to submit to the API as the "source" and which as the
> "destination", if it matters.
> 
> Related to this are the permissions you need on each file descriptor.  Since
> deduplication isn't "reading" or "writing" in the traditional sense it could,
> theoretically, be argued that no permissions at all should be required: neither
> FMODE_READ nor FMODE_WRITE.

I don't like the idea of being able to make major changes to a file that
I've opened with O_RDONLY. :)

I suppose if you're going to allow the FS to figure out its own strategy
and all that, then by the above reasoning one ought to have all the files
opened as writable.

> Most likely are concerns that would make that a bad idea, though.  Perhaps
> someone could create mischief by causing fragmentation in system files.
> 
> If this was previously discussed, can you point me to that discussion?

I don't know if there was a previous discussion, alas.

> The proposed man page is also very brief on permissions, only mentioning them
> indirectly in the error codes section.

There's a lot of unfortunate hand-waving in that manpage -- I wasn't involved
in designing the initial ioctl, so for the most part I'm simply sniffing my
way through based on observed behaviors (of duperemove) and guessing based on
the source code as I write the XFS version. :)

> > > The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
> > > "one of the files is a swap file", which I don't see actually happening in
> > > the implementation; it seems those error codes perhaps exist at all for this
> > > ioctl but rather be left to open(..., O_WRONLY).
> > 
> > Those could be hoisted to the VFS (from the XFS implementation), I think.
> 
> If the destination file has to already be open for writing then it can't be
> immutable.  So I don't think that error code is needed.  Checking for an active
> swapfile, on the other hand, may be valid, although I don't see such a check
> anywhere in the version of the code I'm looking at.

It's at the top of xfs_file_share_range() in the more recent RFCs of XFS reflink.

> > I'll later take a look at how many of these issues apply to clone/clone_range.
> 
> clone and clone_range seem to avoid the two biggest issues I saw: they aren't
> vectorized, and there's a natural distinction between the "source" and the
> "destination" of the operation.  They definitely need careful consideration as
> well, though.

Indeed.  I appreciate your review of the manpages/patches!

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2016-01-23  0:54           ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2016-01-23  0:54 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-api, xfs, hch, linux-fsdevel, linux-btrfs

On Tue, Jan 12, 2016 at 08:36:58PM -0600, Eric Biggers wrote:
> On Tue, Jan 12, 2016 at 01:14:32AM -0800, Darrick J. Wong wrote:
> > 
> > Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> > that resembles the clone_range interface:
> > 
> > int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);
> 
> Getting rid of the vectorization certainly avoids a lot of API complexity.  I
> don't know precisely why it was made vectorized in the first place, but to me it
> seems it shouldn't be vectorized unless there's a strong reason for it to be.

I suppose the idea was that you could ask the kernel to dedupe a bunch of files
"atomically" and that locks would be held (at least on the source file) during
the entire operation...?

Ick.  Given that you can have 65536 vectors of 16MB apiece, you could be
reading up to 1TB of data into the pagecache.  That's a lot for a single
ioctl, though I guess we unlock all the inodes between each vector.

(Welll... we don't check for pending signals.  Urk.)

> > Not sure what you mean by 'symmetric', but in any case the convention seems
> > to be that src_fd's storage is shared with dest_fd if there's a match.
> > ...
> > I think the result of an earlier discussion was that src_fd must be readable,
> > and dest_fd must be writable or appendable.
> 
> Deduplication is a strange operation since you're not "reading" or "writing" in
> the traditional sense.  All file data stays exactly the same from a user
> perspective.
> 
> So rather than arbitrarily say that you're "reading" from one file and "writing"
> to the other, one possibility is to have an API where you submit two files and
> the implementation decides how to best deduplicate them.  That could involve
> sharing existing extents in file 1 with file 2; or sharing existing extents in
> file 2 with file 1; or creating a new extent and referencing it from each file.
> It would be up to the implementation to choose the most efficient approach.

That's a very good point!  We needn't limit the API to whatever the first
implementer does; all of the above are valid interpretations of what dedupe
could involve.  I might even argue that this is the use-case for the vectorized
dedupe call -- analyze all these proposed deduplications and figure out the
best plan to make it happen.

(Ofc the current implementation doesn't allow this, but we can change the
code when the singularity happens, etc.)

> But I take it that the existing btrfs ioctl doesn't work that way, and it always
> will share the existing extents in file 1 with file 2.
> 
> That's probably a perfectly fine way to do it, but I was wondering whether it
> really makes sense to hard-code that detail in the API, since an API could be
> defined more generally.
> 
> And from a practical perpective, people using the ioctl to deduplicate two files
> need to know which one to submit to the API as the "source" and which as the
> "destination", if it matters.
> 
> Related to this are the permissions you need on each file descriptor.  Since
> deduplication isn't "reading" or "writing" in the traditional sense it could,
> theoretically, be argued that no permissions at all should be required: neither
> FMODE_READ nor FMODE_WRITE.

I don't like the idea of being able to make major changes to a file that
I've opened with O_RDONLY. :)

I suppose if you're going to allow the FS to figure out its own strategy
and all that, then by the above reasoning one ought to have all the files
opened as writable.

> Most likely are concerns that would make that a bad idea, though.  Perhaps
> someone could create mischief by causing fragmentation in system files.
> 
> If this was previously discussed, can you point me to that discussion?

I don't know if there was a previous discussion, alas.

> The proposed man page is also very brief on permissions, only mentioning them
> indirectly in the error codes section.

There's a lot of unfortunate hand-waving in that manpage -- I wasn't involved
in designing the initial ioctl, so for the most part I'm simply sniffing my
way through based on observed behaviors (of duperemove) and guessing based on
the source code as I write the XFS version. :)

> > > The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
> > > "one of the files is a swap file", which I don't see actually happening in
> > > the implementation; it seems those error codes perhaps exist at all for this
> > > ioctl but rather be left to open(..., O_WRONLY).
> > 
> > Those could be hoisted to the VFS (from the XFS implementation), I think.
> 
> If the destination file has to already be open for writing then it can't be
> immutable.  So I don't think that error code is needed.  Checking for an active
> swapfile, on the other hand, may be valid, although I don't see such a check
> anywhere in the version of the code I'm looking at.

It's at the top of xfs_file_share_range() in the more recent RFCs of XFS reflink.

> > I'll later take a look at how many of these issues apply to clone/clone_range.
> 
> clone and clone_range seem to avoid the two biggest issues I saw: they aren't
> vectorized, and there's a natural distinction between the "source" and the
> "destination" of the operation.  They definitely need careful consideration as
> well, though.

Indeed.  I appreciate your review of the manpages/patches!

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2016-01-23  0:54           ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2016-01-23  0:54 UTC (permalink / raw)
  To: Eric Biggers
  Cc: david-FqsqvQoI3Ljby3iVrkZq2A,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA, hch-wEGCiKHe2LqWVfeAwA7xHQ

On Tue, Jan 12, 2016 at 08:36:58PM -0600, Eric Biggers wrote:
> On Tue, Jan 12, 2016 at 01:14:32AM -0800, Darrick J. Wong wrote:
> > 
> > Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> > that resembles the clone_range interface:
> > 
> > int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);
> 
> Getting rid of the vectorization certainly avoids a lot of API complexity.  I
> don't know precisely why it was made vectorized in the first place, but to me it
> seems it shouldn't be vectorized unless there's a strong reason for it to be.

I suppose the idea was that you could ask the kernel to dedupe a bunch of files
"atomically" and that locks would be held (at least on the source file) during
the entire operation...?

Ick.  Given that you can have 65536 vectors of 16MB apiece, you could be
reading up to 1TB of data into the pagecache.  That's a lot for a single
ioctl, though I guess we unlock all the inodes between each vector.

(Welll... we don't check for pending signals.  Urk.)

> > Not sure what you mean by 'symmetric', but in any case the convention seems
> > to be that src_fd's storage is shared with dest_fd if there's a match.
> > ...
> > I think the result of an earlier discussion was that src_fd must be readable,
> > and dest_fd must be writable or appendable.
> 
> Deduplication is a strange operation since you're not "reading" or "writing" in
> the traditional sense.  All file data stays exactly the same from a user
> perspective.
> 
> So rather than arbitrarily say that you're "reading" from one file and "writing"
> to the other, one possibility is to have an API where you submit two files and
> the implementation decides how to best deduplicate them.  That could involve
> sharing existing extents in file 1 with file 2; or sharing existing extents in
> file 2 with file 1; or creating a new extent and referencing it from each file.
> It would be up to the implementation to choose the most efficient approach.

That's a very good point!  We needn't limit the API to whatever the first
implementer does; all of the above are valid interpretations of what dedupe
could involve.  I might even argue that this is the use-case for the vectorized
dedupe call -- analyze all these proposed deduplications and figure out the
best plan to make it happen.

(Ofc the current implementation doesn't allow this, but we can change the
code when the singularity happens, etc.)

> But I take it that the existing btrfs ioctl doesn't work that way, and it always
> will share the existing extents in file 1 with file 2.
> 
> That's probably a perfectly fine way to do it, but I was wondering whether it
> really makes sense to hard-code that detail in the API, since an API could be
> defined more generally.
> 
> And from a practical perpective, people using the ioctl to deduplicate two files
> need to know which one to submit to the API as the "source" and which as the
> "destination", if it matters.
> 
> Related to this are the permissions you need on each file descriptor.  Since
> deduplication isn't "reading" or "writing" in the traditional sense it could,
> theoretically, be argued that no permissions at all should be required: neither
> FMODE_READ nor FMODE_WRITE.

I don't like the idea of being able to make major changes to a file that
I've opened with O_RDONLY. :)

I suppose if you're going to allow the FS to figure out its own strategy
and all that, then by the above reasoning one ought to have all the files
opened as writable.

> Most likely are concerns that would make that a bad idea, though.  Perhaps
> someone could create mischief by causing fragmentation in system files.
> 
> If this was previously discussed, can you point me to that discussion?

I don't know if there was a previous discussion, alas.

> The proposed man page is also very brief on permissions, only mentioning them
> indirectly in the error codes section.

There's a lot of unfortunate hand-waving in that manpage -- I wasn't involved
in designing the initial ioctl, so for the most part I'm simply sniffing my
way through based on observed behaviors (of duperemove) and guessing based on
the source code as I write the XFS version. :)

> > > The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
> > > "one of the files is a swap file", which I don't see actually happening in
> > > the implementation; it seems those error codes perhaps exist at all for this
> > > ioctl but rather be left to open(..., O_WRONLY).
> > 
> > Those could be hoisted to the VFS (from the XFS implementation), I think.
> 
> If the destination file has to already be open for writing then it can't be
> immutable.  So I don't think that error code is needed.  Checking for an active
> swapfile, on the other hand, may be valid, although I don't see such a check
> anywhere in the version of the code I'm looking at.

It's at the top of xfs_file_share_range() in the more recent RFCs of XFS reflink.

> > I'll later take a look at how many of these issues apply to clone/clone_range.
> 
> clone and clone_range seem to avoid the two biggest issues I saw: they aren't
> vectorized, and there's a natural distinction between the "source" and the
> "destination" of the operation.  They definitely need careful consideration as
> well, though.

Indeed.  I appreciate your review of the manpages/patches!

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
  2015-12-19  8:55   ` Darrick J. Wong
@ 2016-07-27 21:51     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 50+ messages in thread
From: Kirill A. Shutemov @ 2016-07-27 21:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, linux-api, xfs, Vlastimil Babka

On Sat, Dec 19, 2015 at 12:55:59AM -0800, Darrick J. Wong wrote:
> Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
> more systematic (FIDEDUPERANGE).
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/compat_ioctl.c       |    1 
>  fs/ioctl.c              |   38 ++++++++++++++++++
>  fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h      |    4 ++
>  include/uapi/linux/fs.h |   30 ++++++++++++++
>  5 files changed, 173 insertions(+)
> 
> 
> diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> index 70d4b10..eab31e7 100644
> --- a/fs/compat_ioctl.c
> +++ b/fs/compat_ioctl.c
> @@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
>  
>  	case FICLONE:
>  	case FICLONERANGE:
> +	case FIDEDUPERANGE:
>  		goto do_ioctl;
>  
>  	case FIBMAP:
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 84c6e79..fcdd33b 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
>  	return thaw_super(sb);
>  }
>  
> +static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
> +{
> +	struct file_dedupe_range __user *argp = arg;
> +	struct file_dedupe_range *same = NULL;
> +	int ret;
> +	unsigned long size;
> +	u16 count;
> +
> +	if (get_user(count, &argp->dest_count)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	size = offsetof(struct file_dedupe_range __user, info[count]);

Vlastimil triggered this during fuzzing:

http://paste.opensuse.org/view/raw/99203426

High order allocation without __GFP_NOWARN + fallback. That's not good.

Basically, we don't have any sanity check of 'dest_count' here. This u16
comes directly from userspace. And we call memdup_user() based on it.

Here's a program which makes kernel allocate order-9 page:

https://gist.github.com/kiryl/2b344b51da1fd2725be420a996b10d22

Should we put some reasonable upper limit for the 'dest_count'?
What is typical 'dest_count'?

> +
> +	same = memdup_user(argp, size);
> +	if (IS_ERR(same)) {
> +		ret = PTR_ERR(same);
> +		same = NULL;
> +		goto out;
> +	}
> +
> +	ret = vfs_dedupe_file_range(file, same);
> +	if (ret)
> +		goto out;
> +
> +	ret = copy_to_user(argp, same, size);
> +	if (ret)
> +		ret = -EFAULT;
> +
> +out:
> +	kfree(same);
> +	return ret;
> +}
> +

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2016-07-27 21:51     ` Kirill A. Shutemov
  0 siblings, 0 replies; 50+ messages in thread
From: Kirill A. Shutemov @ 2016-07-27 21:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-api, Vlastimil Babka, xfs

On Sat, Dec 19, 2015 at 12:55:59AM -0800, Darrick J. Wong wrote:
> Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
> more systematic (FIDEDUPERANGE).
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/compat_ioctl.c       |    1 
>  fs/ioctl.c              |   38 ++++++++++++++++++
>  fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h      |    4 ++
>  include/uapi/linux/fs.h |   30 ++++++++++++++
>  5 files changed, 173 insertions(+)
> 
> 
> diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> index 70d4b10..eab31e7 100644
> --- a/fs/compat_ioctl.c
> +++ b/fs/compat_ioctl.c
> @@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
>  
>  	case FICLONE:
>  	case FICLONERANGE:
> +	case FIDEDUPERANGE:
>  		goto do_ioctl;
>  
>  	case FIBMAP:
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 84c6e79..fcdd33b 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
>  	return thaw_super(sb);
>  }
>  
> +static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
> +{
> +	struct file_dedupe_range __user *argp = arg;
> +	struct file_dedupe_range *same = NULL;
> +	int ret;
> +	unsigned long size;
> +	u16 count;
> +
> +	if (get_user(count, &argp->dest_count)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	size = offsetof(struct file_dedupe_range __user, info[count]);

Vlastimil triggered this during fuzzing:

http://paste.opensuse.org/view/raw/99203426

High order allocation without __GFP_NOWARN + fallback. That's not good.

Basically, we don't have any sanity check of 'dest_count' here. This u16
comes directly from userspace. And we call memdup_user() based on it.

Here's a program which makes kernel allocate order-9 page:

https://gist.github.com/kiryl/2b344b51da1fd2725be420a996b10d22

Should we put some reasonable upper limit for the 'dest_count'?
What is typical 'dest_count'?

> +
> +	same = memdup_user(argp, size);
> +	if (IS_ERR(same)) {
> +		ret = PTR_ERR(same);
> +		same = NULL;
> +		goto out;
> +	}
> +
> +	ret = vfs_dedupe_file_range(file, same);
> +	if (ret)
> +		goto out;
> +
> +	ret = copy_to_user(argp, same, size);
> +	if (ret)
> +		ret = -EFAULT;
> +
> +out:
> +	kfree(same);
> +	return ret;
> +}
> +

-- 
 Kirill A. Shutemov

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
  2016-07-27 21:51     ` Kirill A. Shutemov
  (?)
@ 2016-07-28 18:07       ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2016-07-28 18:07 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: david, linux-fsdevel, linux-api, xfs, Vlastimil Babka

On Thu, Jul 28, 2016 at 12:51:30AM +0300, Kirill A. Shutemov wrote:
> On Sat, Dec 19, 2015 at 12:55:59AM -0800, Darrick J. Wong wrote:
> > Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
> > more systematic (FIDEDUPERANGE).
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/compat_ioctl.c       |    1 
> >  fs/ioctl.c              |   38 ++++++++++++++++++
> >  fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/fs.h      |    4 ++
> >  include/uapi/linux/fs.h |   30 ++++++++++++++
> >  5 files changed, 173 insertions(+)
> > 
> > 
> > diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> > index 70d4b10..eab31e7 100644
> > --- a/fs/compat_ioctl.c
> > +++ b/fs/compat_ioctl.c
> > @@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
> >  
> >  	case FICLONE:
> >  	case FICLONERANGE:
> > +	case FIDEDUPERANGE:
> >  		goto do_ioctl;
> >  
> >  	case FIBMAP:
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 84c6e79..fcdd33b 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
> >  	return thaw_super(sb);
> >  }
> >  
> > +static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
> > +{
> > +	struct file_dedupe_range __user *argp = arg;
> > +	struct file_dedupe_range *same = NULL;
> > +	int ret;
> > +	unsigned long size;
> > +	u16 count;
> > +
> > +	if (get_user(count, &argp->dest_count)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	size = offsetof(struct file_dedupe_range __user, info[count]);

(I still hate this interface.)

> Vlastimil triggered this during fuzzing:
> 
> http://paste.opensuse.org/view/raw/99203426
> 
> High order allocation without __GFP_NOWARN + fallback. That's not good.
> 
> Basically, we don't have any sanity check of 'dest_count' here. This u16
> comes directly from userspace. And we call memdup_user() based on it.
> 
> Here's a program which makes kernel allocate order-9 page:
> 
> https://gist.github.com/kiryl/2b344b51da1fd2725be420a996b10d22
> 
> Should we put some reasonable upper limit for the 'dest_count'?
> What is typical 'dest_count'?

There are two userland programs I know of that call this ioctl.  The
first is xfs_io, which always sets dest_count = 1.

The other is duperemove, which seems capable of setting dest_count to
however many fragments it finds, up to a max of 120.  Capping size to
x86's 4k page size yields 127 entries.  On bigger machines with 64k
pages, that increases to 2047.  I think that's enough for anybody.

(Honestly, 127 dedupe candidates * max 16M extent length is already
2GB of IO for a single call.)

--D

> 
> > +
> > +	same = memdup_user(argp, size);
> > +	if (IS_ERR(same)) {
> > +		ret = PTR_ERR(same);
> > +		same = NULL;
> > +		goto out;
> > +	}
> > +
> > +	ret = vfs_dedupe_file_range(file, same);
> > +	if (ret)
> > +		goto out;
> > +
> > +	ret = copy_to_user(argp, same, size);
> > +	if (ret)
> > +		ret = -EFAULT;
> > +
> > +out:
> > +	kfree(same);
> > +	return ret;
> > +}
> > +
> 
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2016-07-28 18:07       ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2016-07-28 18:07 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-api, Vlastimil Babka, xfs

On Thu, Jul 28, 2016 at 12:51:30AM +0300, Kirill A. Shutemov wrote:
> On Sat, Dec 19, 2015 at 12:55:59AM -0800, Darrick J. Wong wrote:
> > Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
> > more systematic (FIDEDUPERANGE).
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/compat_ioctl.c       |    1 
> >  fs/ioctl.c              |   38 ++++++++++++++++++
> >  fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/fs.h      |    4 ++
> >  include/uapi/linux/fs.h |   30 ++++++++++++++
> >  5 files changed, 173 insertions(+)
> > 
> > 
> > diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> > index 70d4b10..eab31e7 100644
> > --- a/fs/compat_ioctl.c
> > +++ b/fs/compat_ioctl.c
> > @@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
> >  
> >  	case FICLONE:
> >  	case FICLONERANGE:
> > +	case FIDEDUPERANGE:
> >  		goto do_ioctl;
> >  
> >  	case FIBMAP:
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 84c6e79..fcdd33b 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
> >  	return thaw_super(sb);
> >  }
> >  
> > +static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
> > +{
> > +	struct file_dedupe_range __user *argp = arg;
> > +	struct file_dedupe_range *same = NULL;
> > +	int ret;
> > +	unsigned long size;
> > +	u16 count;
> > +
> > +	if (get_user(count, &argp->dest_count)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	size = offsetof(struct file_dedupe_range __user, info[count]);

(I still hate this interface.)

> Vlastimil triggered this during fuzzing:
> 
> http://paste.opensuse.org/view/raw/99203426
> 
> High order allocation without __GFP_NOWARN + fallback. That's not good.
> 
> Basically, we don't have any sanity check of 'dest_count' here. This u16
> comes directly from userspace. And we call memdup_user() based on it.
> 
> Here's a program which makes kernel allocate order-9 page:
> 
> https://gist.github.com/kiryl/2b344b51da1fd2725be420a996b10d22
> 
> Should we put some reasonable upper limit for the 'dest_count'?
> What is typical 'dest_count'?

There are two userland programs I know of that call this ioctl.  The
first is xfs_io, which always sets dest_count = 1.

The other is duperemove, which seems capable of setting dest_count to
however many fragments it finds, up to a max of 120.  Capping size to
x86's 4k page size yields 127 entries.  On bigger machines with 64k
pages, that increases to 2047.  I think that's enough for anybody.

(Honestly, 127 dedupe candidates * max 16M extent length is already
2GB of IO for a single call.)

--D

> 
> > +
> > +	same = memdup_user(argp, size);
> > +	if (IS_ERR(same)) {
> > +		ret = PTR_ERR(same);
> > +		same = NULL;
> > +		goto out;
> > +	}
> > +
> > +	ret = vfs_dedupe_file_range(file, same);
> > +	if (ret)
> > +		goto out;
> > +
> > +	ret = copy_to_user(argp, same, size);
> > +	if (ret)
> > +		ret = -EFAULT;
> > +
> > +out:
> > +	kfree(same);
> > +	return ret;
> > +}
> > +
> 
> -- 
>  Kirill A. Shutemov

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2016-07-28 18:07       ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2016-07-28 18:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: david-FqsqvQoI3Ljby3iVrkZq2A,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ,
	Vlastimil Babka

On Thu, Jul 28, 2016 at 12:51:30AM +0300, Kirill A. Shutemov wrote:
> On Sat, Dec 19, 2015 at 12:55:59AM -0800, Darrick J. Wong wrote:
> > Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
> > more systematic (FIDEDUPERANGE).
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> > ---
> >  fs/compat_ioctl.c       |    1 
> >  fs/ioctl.c              |   38 ++++++++++++++++++
> >  fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/fs.h      |    4 ++
> >  include/uapi/linux/fs.h |   30 ++++++++++++++
> >  5 files changed, 173 insertions(+)
> > 
> > 
> > diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> > index 70d4b10..eab31e7 100644
> > --- a/fs/compat_ioctl.c
> > +++ b/fs/compat_ioctl.c
> > @@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
> >  
> >  	case FICLONE:
> >  	case FICLONERANGE:
> > +	case FIDEDUPERANGE:
> >  		goto do_ioctl;
> >  
> >  	case FIBMAP:
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 84c6e79..fcdd33b 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
> >  	return thaw_super(sb);
> >  }
> >  
> > +static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
> > +{
> > +	struct file_dedupe_range __user *argp = arg;
> > +	struct file_dedupe_range *same = NULL;
> > +	int ret;
> > +	unsigned long size;
> > +	u16 count;
> > +
> > +	if (get_user(count, &argp->dest_count)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	size = offsetof(struct file_dedupe_range __user, info[count]);

(I still hate this interface.)

> Vlastimil triggered this during fuzzing:
> 
> http://paste.opensuse.org/view/raw/99203426
> 
> High order allocation without __GFP_NOWARN + fallback. That's not good.
> 
> Basically, we don't have any sanity check of 'dest_count' here. This u16
> comes directly from userspace. And we call memdup_user() based on it.
> 
> Here's a program which makes kernel allocate order-9 page:
> 
> https://gist.github.com/kiryl/2b344b51da1fd2725be420a996b10d22
> 
> Should we put some reasonable upper limit for the 'dest_count'?
> What is typical 'dest_count'?

There are two userland programs I know of that call this ioctl.  The
first is xfs_io, which always sets dest_count = 1.

The other is duperemove, which seems capable of setting dest_count to
however many fragments it finds, up to a max of 120.  Capping size to
x86's 4k page size yields 127 entries.  On bigger machines with 64k
pages, that increases to 2047.  I think that's enough for anybody.

(Honestly, 127 dedupe candidates * max 16M extent length is already
2GB of IO for a single call.)

--D

> 
> > +
> > +	same = memdup_user(argp, size);
> > +	if (IS_ERR(same)) {
> > +		ret = PTR_ERR(same);
> > +		same = NULL;
> > +		goto out;
> > +	}
> > +
> > +	ret = vfs_dedupe_file_range(file, same);
> > +	if (ret)
> > +		goto out;
> > +
> > +	ret = copy_to_user(argp, same, size);
> > +	if (ret)
> > +		ret = -EFAULT;
> > +
> > +out:
> > +	kfree(same);
> > +	return ret;
> > +}
> > +
> 
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
  2016-07-27 21:51     ` Kirill A. Shutemov
@ 2016-07-28 19:25       ` Darrick J. Wong
  -1 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2016-07-28 19:25 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: david, linux-fsdevel, linux-api, xfs, Vlastimil Babka

On Thu, Jul 28, 2016 at 12:51:30AM +0300, Kirill A. Shutemov wrote:
> On Sat, Dec 19, 2015 at 12:55:59AM -0800, Darrick J. Wong wrote:
> > Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
> > more systematic (FIDEDUPERANGE).
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/compat_ioctl.c       |    1 
> >  fs/ioctl.c              |   38 ++++++++++++++++++
> >  fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/fs.h      |    4 ++
> >  include/uapi/linux/fs.h |   30 ++++++++++++++
> >  5 files changed, 173 insertions(+)
> > 
> > 
> > diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> > index 70d4b10..eab31e7 100644
> > --- a/fs/compat_ioctl.c
> > +++ b/fs/compat_ioctl.c
> > @@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
> >  
> >  	case FICLONE:
> >  	case FICLONERANGE:
> > +	case FIDEDUPERANGE:
> >  		goto do_ioctl;
> >  
> >  	case FIBMAP:
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 84c6e79..fcdd33b 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
> >  	return thaw_super(sb);
> >  }
> >  
> > +static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
> > +{
> > +	struct file_dedupe_range __user *argp = arg;
> > +	struct file_dedupe_range *same = NULL;
> > +	int ret;
> > +	unsigned long size;
> > +	u16 count;
> > +
> > +	if (get_user(count, &argp->dest_count)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	size = offsetof(struct file_dedupe_range __user, info[count]);
> 
> Vlastimil triggered this during fuzzing:
> 
> http://paste.opensuse.org/view/raw/99203426
> 
> High order allocation without __GFP_NOWARN + fallback. That's not good.
> 
> Basically, we don't have any sanity check of 'dest_count' here. This u16
> comes directly from userspace. And we call memdup_user() based on it.
> 
> Here's a program which makes kernel allocate order-9 page:
> 
> https://gist.github.com/kiryl/2b344b51da1fd2725be420a996b10d22

I forgot to say, please wrap this up as an xfstest so we can check for
future regressions.  After the patch, the ioctl should return ENOMEM
to signal that the caller asked for more dedupe_count than we want to
allocate memory for.

--D

> Should we put some reasonable upper limit for the 'dest_count'?
> What is typical 'dest_count'?
> 
> > +
> > +	same = memdup_user(argp, size);
> > +	if (IS_ERR(same)) {
> > +		ret = PTR_ERR(same);
> > +		same = NULL;
> > +		goto out;
> > +	}
> > +
> > +	ret = vfs_dedupe_file_range(file, same);
> > +	if (ret)
> > +		goto out;
> > +
> > +	ret = copy_to_user(argp, same, size);
> > +	if (ret)
> > +		ret = -EFAULT;
> > +
> > +out:
> > +	kfree(same);
> > +	return ret;
> > +}
> > +
> 
> -- 
>  Kirill A. Shutemov
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2016-07-28 19:25       ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2016-07-28 19:25 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: linux-fsdevel, linux-api, Vlastimil Babka, xfs

On Thu, Jul 28, 2016 at 12:51:30AM +0300, Kirill A. Shutemov wrote:
> On Sat, Dec 19, 2015 at 12:55:59AM -0800, Darrick J. Wong wrote:
> > Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
> > more systematic (FIDEDUPERANGE).
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/compat_ioctl.c       |    1 
> >  fs/ioctl.c              |   38 ++++++++++++++++++
> >  fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/fs.h      |    4 ++
> >  include/uapi/linux/fs.h |   30 ++++++++++++++
> >  5 files changed, 173 insertions(+)
> > 
> > 
> > diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> > index 70d4b10..eab31e7 100644
> > --- a/fs/compat_ioctl.c
> > +++ b/fs/compat_ioctl.c
> > @@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
> >  
> >  	case FICLONE:
> >  	case FICLONERANGE:
> > +	case FIDEDUPERANGE:
> >  		goto do_ioctl;
> >  
> >  	case FIBMAP:
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 84c6e79..fcdd33b 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
> >  	return thaw_super(sb);
> >  }
> >  
> > +static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
> > +{
> > +	struct file_dedupe_range __user *argp = arg;
> > +	struct file_dedupe_range *same = NULL;
> > +	int ret;
> > +	unsigned long size;
> > +	u16 count;
> > +
> > +	if (get_user(count, &argp->dest_count)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	size = offsetof(struct file_dedupe_range __user, info[count]);
> 
> Vlastimil triggered this during fuzzing:
> 
> http://paste.opensuse.org/view/raw/99203426
> 
> High order allocation without __GFP_NOWARN + fallback. That's not good.
> 
> Basically, we don't have any sanity check of 'dest_count' here. This u16
> comes directly from userspace. And we call memdup_user() based on it.
> 
> Here's a program which makes kernel allocate order-9 page:
> 
> https://gist.github.com/kiryl/2b344b51da1fd2725be420a996b10d22

I forgot to say, please wrap this up as an xfstest so we can check for
future regressions.  After the patch, the ioctl should return ENOMEM
to signal that the caller asked for more dedupe_count than we want to
allocate memory for.

--D

> Should we put some reasonable upper limit for the 'dest_count'?
> What is typical 'dest_count'?
> 
> > +
> > +	same = memdup_user(argp, size);
> > +	if (IS_ERR(same)) {
> > +		ret = PTR_ERR(same);
> > +		same = NULL;
> > +		goto out;
> > +	}
> > +
> > +	ret = vfs_dedupe_file_range(file, same);
> > +	if (ret)
> > +		goto out;
> > +
> > +	ret = copy_to_user(argp, same, size);
> > +	if (ret)
> > +		ret = -EFAULT;
> > +
> > +out:
> > +	kfree(same);
> > +	return ret;
> > +}
> > +
> 
> -- 
>  Kirill A. Shutemov
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
  2016-01-12  9:14       ` Darrick J. Wong
@ 2016-08-07 17:47         ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 50+ messages in thread
From: Michael Kerrisk (man-pages) @ 2016-08-07 17:47 UTC (permalink / raw)
  To: Darrick J. Wong, Eric Biggers
  Cc: mtk.manpages, david, linux-fsdevel, linux-api, xfs, linux-btrfs, hch

Hi Darrick,

On 01/12/2016 08:14 PM, Darrick J. Wong wrote:
> [adding btrfs to the cc since we're talking about a whole new dedupe interface]

In the discussion below, many points of possible improvement were notedfor
the man page.... Would you be willing to put together a patch please?

Thanks,

Michael

  
> On Tue, Jan 12, 2016 at 12:07:14AM -0600, Eric Biggers wrote:
>> Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
>> (note: I realize the patch is mostly just moving the code that already existed
>> in btrfs, but in the VFS it deserves a more thorough review):
>
> Wheee. :)
>
> Yes, let's discuss the concerns about the btrfs extent same ioctl.
>
> I believe Christoph dislikes about the odd return mechanism (i.e. status and
> bytes_deduped) and doubts that the vectorization is really necessary.  There's
> not a lot of documentation to go on aside from "Do whatever the BTRFS ioctl
> does".  I suspect that will leave my explanations lackng, since I neither
> designed the btrfs interface nor know all that much about the decisions made to
> arrive at what we have now.
>
> (I agree with both of hch's complaints.)
>
> Really, the best argument for keeping this ioctl is to avoid breaking
> duperemove.  Even then, given that current duperemove checks for btrfs before
> trying to use BTRFS_IOC_EXTENT_SAME, we could very well design a new dedupe
> ioctl for the VFS, hook the new dedupers (XFS) into the new VFS ioctl
> leaving the old btrfs ioctl intact, and train duperemove to try the new
> ioctl and fall back on the btrfs one if the VFS ioctl isn't supported.
>
> Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> that resembles the clone_range interface:
>
> int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);
>
> struct file_dedupe_range {
>     __s64 src_fd;
>     __u64 src_offset;
>     __u64 length;
>     __u64 dest_offset;
>     __u64 flags;
> };
>
> "See if the byte range src_offset:length in src_fd matches all of
> dest_offset:length in dest_fd; if so, share src_fd's physical storage with
> dest_fd.  Both fds must be files; if they are the same file the ranges cannot
> overlap; src_fd must be readable; dest_fd must be writable or append-only.
> Offsets and lengths probably need to be block-aligned, but that is filesystem
> dependent."
>
> The error conditions would be superset of the ones we know about today.  I'd
> return EOVERFLOW or something if length is longer than the FS wants to deal
> with.
>
> Now all the vectorization problems go away, and since it's a new VFS interface
> we can define everything from the start.
>
> Christoph, if this new interface solves your complaints I think I'd like to get
> started on the code/docs soon.
>
>> At high level, I am confused about what is meant by the "source" and
>> "destination" files.  I understand that with more than two files, you
>> effectively have to choose one file to treat specially and dedupe with all
>> the other files (an NxN comparison isn't realistic).  But with just two
>> files, a deduplication operation should be completely symmetric, should it
>> not?  The end
>
> Not sure what you mean by 'symmetric', but in any case the convention seems
> to be that src_fd's storage is shared with dest_fd if there's a match.
>
>> result should be that the data is deduplicated, regardless of the order in
>> which I gave the file descriptors.  So why is there some non-symmetric
>> behavior?  There are several examples but one is that the VFS is checking
>> !S_ISREG() on the "source" file descriptor but not on the "destination" file
>> descriptor.
>
> The dedupe_range function pointer should only be supplied for regular files.
>
>> Another is that different permissions are required on the source versus on
>> the destination.  If there are good reasons for the nonsymmetry then this
>> needs to be clearly explained in the man page; otherwise it may not be clear
>> what to use as the "source" and what to use as the "destination".
>>
>> It seems odd to be adding "copy" as a system call but then have "dedupe" and
>> "clone" as ioctls rather than system calls... it seems that they should all
>> be one or the other (at least, if we put aside the fact that the ioctls
>> already exist in btrfs).
>
> We can't put the clone ioctl aside; coreutils has already started using it.
>
> I'm not sure if clone_range or extent_same are all that popular, though.
> AFAIK duperemove is the only program using extent_same, and I don't know
> of anything using clone_range.
>
> (Well, xfs_io does...)
>
>> The range checking in clone_verify_area() appears incomplete.  Someone could
>> provide len=UINT64_MAX and all the checks would still pass even though
>> 'pos+len' would overflow.
>
> Yeah...
>
>> Should the ioctl be interruptible?  Right now it always goes through *all*
>> the 'struct file_dedupe_range_info's you passed in --- potentially up to
>> 65535 of them.
>
> There probably ought to be explicit signal checks, or we could just get rid
> of the vectorization entirely. :)
>
>> Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
>> deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
>> whole.
>
> Right, because bytes_deduped is a part of file_dedup_range_info, not
> file_dedupe_range.
>
> (Note the bytes_deduped = 0 earlier in the function.)
>
>> What permissions do you need on the destination file descriptors?  The man
>> page implies they must be open for writing and not appending.  The
>> implementation differs: it requires FMODE_WRITE only for non-admin users, and
>> it doesn't check for O_APPEND at all.
>
> I think the result of an earlier discussion was that src_fd must be readable,
> and dest_fd must be writable or appendable.
>
>> The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
>> "one of the files is a swap file", which I don't see actually happening in
>> the implementation; it seems those error codes perhaps exist at all for this
>> ioctl but rather be left to open(..., O_WRONLY).
>
> Those could be hoisted to the VFS (from the XFS implementation), I think.
>
>> If the filesystem doesn't support deduplication, or I pass in a strange file
>> descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The
>> man page isn't clear.
>
> Should be EOPNOTSUPP if dest_fd isn't a regular file; EISDIR if either are
> directories; and EINVAL for any other kind of non-file fd.  I suspect the
> clone* manpages don't make this too clear either.
>
>> Under what circumstances will 'bytes_deduped' differ from the count that was
>> passed in?
>
> btrfs/xfs will only compare the first 16MB.  Not documented anywhere. :(
>
>> If short counts are allowed, what will be the 'status' be in that case:
>> FILE_DEDUP_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?
>
> One of those two.
>
>> Can data be deduped even if only a prefix of the data region matches?
>
> No.
>
>> The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
>> 0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.
>
> Oops, good catch. :(
>
>> The man page isn't clear about whether the ioctl stops early if an error
>> occurs or always processes all the 'struct file_dedupe_range_info's you pass
>> in.  And if it were, hypothetically, to stop early, how is the user meant to
>> know on which file it stopped?
>
> I don't know if this should be the official behavior, but it stopped at
> whichever file_dedupe_range_info has both status and bytes_deduped set to zero.
>
>> The man page says "logical_offset" but in the struct it is called
>> "dest_offset".
>
> Oops.
>
>> There are some variables named "same" which don't really make sense now that
>> the ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.
>
> Perhaps not.
>
> I'll later take a look at how many of these issues apply to clone/clone_range.
>
> --D
>
>>
>> Eric
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs
@ 2016-08-07 17:47         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 50+ messages in thread
From: Michael Kerrisk (man-pages) @ 2016-08-07 17:47 UTC (permalink / raw)
  To: Darrick J. Wong, Eric Biggers
  Cc: linux-api, xfs, hch, mtk.manpages, linux-fsdevel, linux-btrfs

Hi Darrick,

On 01/12/2016 08:14 PM, Darrick J. Wong wrote:
> [adding btrfs to the cc since we're talking about a whole new dedupe interface]

In the discussion below, many points of possible improvement were notedfor
the man page.... Would you be willing to put together a patch please?

Thanks,

Michael

  
> On Tue, Jan 12, 2016 at 12:07:14AM -0600, Eric Biggers wrote:
>> Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
>> (note: I realize the patch is mostly just moving the code that already existed
>> in btrfs, but in the VFS it deserves a more thorough review):
>
> Wheee. :)
>
> Yes, let's discuss the concerns about the btrfs extent same ioctl.
>
> I believe Christoph dislikes about the odd return mechanism (i.e. status and
> bytes_deduped) and doubts that the vectorization is really necessary.  There's
> not a lot of documentation to go on aside from "Do whatever the BTRFS ioctl
> does".  I suspect that will leave my explanations lackng, since I neither
> designed the btrfs interface nor know all that much about the decisions made to
> arrive at what we have now.
>
> (I agree with both of hch's complaints.)
>
> Really, the best argument for keeping this ioctl is to avoid breaking
> duperemove.  Even then, given that current duperemove checks for btrfs before
> trying to use BTRFS_IOC_EXTENT_SAME, we could very well design a new dedupe
> ioctl for the VFS, hook the new dedupers (XFS) into the new VFS ioctl
> leaving the old btrfs ioctl intact, and train duperemove to try the new
> ioctl and fall back on the btrfs one if the VFS ioctl isn't supported.
>
> Frankly, I also wouldn't mind changing the VFS dedupe ioctl that to something
> that resembles the clone_range interface:
>
> int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);
>
> struct file_dedupe_range {
>     __s64 src_fd;
>     __u64 src_offset;
>     __u64 length;
>     __u64 dest_offset;
>     __u64 flags;
> };
>
> "See if the byte range src_offset:length in src_fd matches all of
> dest_offset:length in dest_fd; if so, share src_fd's physical storage with
> dest_fd.  Both fds must be files; if they are the same file the ranges cannot
> overlap; src_fd must be readable; dest_fd must be writable or append-only.
> Offsets and lengths probably need to be block-aligned, but that is filesystem
> dependent."
>
> The error conditions would be superset of the ones we know about today.  I'd
> return EOVERFLOW or something if length is longer than the FS wants to deal
> with.
>
> Now all the vectorization problems go away, and since it's a new VFS interface
> we can define everything from the start.
>
> Christoph, if this new interface solves your complaints I think I'd like to get
> started on the code/docs soon.
>
>> At high level, I am confused about what is meant by the "source" and
>> "destination" files.  I understand that with more than two files, you
>> effectively have to choose one file to treat specially and dedupe with all
>> the other files (an NxN comparison isn't realistic).  But with just two
>> files, a deduplication operation should be completely symmetric, should it
>> not?  The end
>
> Not sure what you mean by 'symmetric', but in any case the convention seems
> to be that src_fd's storage is shared with dest_fd if there's a match.
>
>> result should be that the data is deduplicated, regardless of the order in
>> which I gave the file descriptors.  So why is there some non-symmetric
>> behavior?  There are several examples but one is that the VFS is checking
>> !S_ISREG() on the "source" file descriptor but not on the "destination" file
>> descriptor.
>
> The dedupe_range function pointer should only be supplied for regular files.
>
>> Another is that different permissions are required on the source versus on
>> the destination.  If there are good reasons for the nonsymmetry then this
>> needs to be clearly explained in the man page; otherwise it may not be clear
>> what to use as the "source" and what to use as the "destination".
>>
>> It seems odd to be adding "copy" as a system call but then have "dedupe" and
>> "clone" as ioctls rather than system calls... it seems that they should all
>> be one or the other (at least, if we put aside the fact that the ioctls
>> already exist in btrfs).
>
> We can't put the clone ioctl aside; coreutils has already started using it.
>
> I'm not sure if clone_range or extent_same are all that popular, though.
> AFAIK duperemove is the only program using extent_same, and I don't know
> of anything using clone_range.
>
> (Well, xfs_io does...)
>
>> The range checking in clone_verify_area() appears incomplete.  Someone could
>> provide len=UINT64_MAX and all the checks would still pass even though
>> 'pos+len' would overflow.
>
> Yeah...
>
>> Should the ioctl be interruptible?  Right now it always goes through *all*
>> the 'struct file_dedupe_range_info's you passed in --- potentially up to
>> 65535 of them.
>
> There probably ought to be explicit signal checks, or we could just get rid
> of the vectorization entirely. :)
>
>> Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
>> deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
>> whole.
>
> Right, because bytes_deduped is a part of file_dedup_range_info, not
> file_dedupe_range.
>
> (Note the bytes_deduped = 0 earlier in the function.)
>
>> What permissions do you need on the destination file descriptors?  The man
>> page implies they must be open for writing and not appending.  The
>> implementation differs: it requires FMODE_WRITE only for non-admin users, and
>> it doesn't check for O_APPEND at all.
>
> I think the result of an earlier discussion was that src_fd must be readable,
> and dest_fd must be writable or appendable.
>
>> The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
>> "one of the files is a swap file", which I don't see actually happening in
>> the implementation; it seems those error codes perhaps exist at all for this
>> ioctl but rather be left to open(..., O_WRONLY).
>
> Those could be hoisted to the VFS (from the XFS implementation), I think.
>
>> If the filesystem doesn't support deduplication, or I pass in a strange file
>> descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The
>> man page isn't clear.
>
> Should be EOPNOTSUPP if dest_fd isn't a regular file; EISDIR if either are
> directories; and EINVAL for any other kind of non-file fd.  I suspect the
> clone* manpages don't make this too clear either.
>
>> Under what circumstances will 'bytes_deduped' differ from the count that was
>> passed in?
>
> btrfs/xfs will only compare the first 16MB.  Not documented anywhere. :(
>
>> If short counts are allowed, what will be the 'status' be in that case:
>> FILE_DEDUP_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?
>
> One of those two.
>
>> Can data be deduped even if only a prefix of the data region matches?
>
> No.
>
>> The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
>> 0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.
>
> Oops, good catch. :(
>
>> The man page isn't clear about whether the ioctl stops early if an error
>> occurs or always processes all the 'struct file_dedupe_range_info's you pass
>> in.  And if it were, hypothetically, to stop early, how is the user meant to
>> know on which file it stopped?
>
> I don't know if this should be the official behavior, but it stopped at
> whichever file_dedupe_range_info has both status and bytes_deduped set to zero.
>
>> The man page says "logical_offset" but in the struct it is called
>> "dest_offset".
>
> Oops.
>
>> There are some variables named "same" which don't really make sense now that
>> the ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.
>
> Perhaps not.
>
> I'll later take a look at how many of these issues apply to clone/clone_range.
>
> --D
>
>>
>> Eric
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2016-08-07 17:47 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-19  8:55 [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS Darrick J. Wong
2015-12-19  8:55 ` Darrick J. Wong
2015-12-19  8:55 ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 1/9] vfs: add copy_file_range syscall and vfs helper Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 2/9] x86: add sys_copy_file_range to syscall tables Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 3/9] btrfs: add .copy_file_range file operation Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 4/9] vfs: Add vfs_copy_file_range() support for pagecache copies Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 5/9] locks: new locks_mandatory_area calling convention Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 6/9] vfs: pull btrfs clone API to vfs layer Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 7/9] vfs: wire up compat ioctl for CLONE/CLONE_RANGE Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55 ` [PATCH 8/9] vfs: hoist the btrfs deduplication ioctl to the vfs Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2015-12-19  8:55   ` Darrick J. Wong
2016-01-12  6:07   ` Eric Biggers
2016-01-12  6:07     ` Eric Biggers
2016-01-12  9:14     ` Darrick J. Wong
2016-01-12  9:14       ` Darrick J. Wong
2016-01-13  2:36       ` Eric Biggers
2016-01-13  2:36         ` Eric Biggers
2016-01-13  2:36         ` Eric Biggers
2016-01-23  0:54         ` Darrick J. Wong
2016-01-23  0:54           ` Darrick J. Wong
2016-01-23  0:54           ` Darrick J. Wong
2016-08-07 17:47       ` Michael Kerrisk (man-pages)
2016-08-07 17:47         ` Michael Kerrisk (man-pages)
2016-07-27 21:51   ` Kirill A. Shutemov
2016-07-27 21:51     ` Kirill A. Shutemov
2016-07-28 18:07     ` Darrick J. Wong
2016-07-28 18:07       ` Darrick J. Wong
2016-07-28 18:07       ` Darrick J. Wong
2016-07-28 19:25     ` Darrick J. Wong
2016-07-28 19:25       ` Darrick J. Wong
2015-12-19  8:56 ` [PATCH 9/9] btrfs: use new dedupe data function pointer Darrick J. Wong
2015-12-19  8:56   ` Darrick J. Wong
2015-12-20 15:30 ` [RFCv4 0/9] vfs: hoist reflink/dedupe ioctls to the VFS Christoph Hellwig
2015-12-20 15:30   ` Christoph Hellwig
2015-12-20 15:30   ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.