* [PATCH v9 0/4] VFS: In-kernel copy system call
@ 2015-11-10 21:53 ` Anna Schumaker
From: Anna Schumaker @ 2015-11-10 21:53 UTC
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

Copy system calls came up during Plumbers a while ago, mostly because several
filesystems (including NFS and XFS) are currently working on copy acceleration
implementations.  We haven't heard from Zach Brown in a while, so I volunteered
to push his patches upstream; that way individual filesystems won't need to keep
writing their own ioctls.

Changes in v9:
- Update syscall number for sys_mlock2()
- Fix calls to rw_verify_area()

Thanks,
Anna


Anna Schumaker (1):
  vfs: Add vfs_copy_file_range() support for pagecache copies

Zach Brown (3):
  vfs: add copy_file_range syscall and vfs helper
  x86: add sys_copy_file_range to syscall tables
  btrfs: add .copy_file_range file operation

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/btrfs/ctree.h                       |   3 +
 fs/btrfs/file.c                        |   1 +
 fs/btrfs/ioctl.c                       |  91 ++++++++++++++----------
 fs/read_write.c                        | 125 +++++++++++++++++++++++++++++++++
 include/linux/fs.h                     |   3 +
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 kernel/sys_ni.c                        |   1 +
 10 files changed, 193 insertions(+), 40 deletions(-)

-- 
2.6.2


* [PATCH v9 1/4] vfs: add copy_file_range syscall and vfs helper
@ 2015-11-10 21:53   ` Anna Schumaker
From: Anna Schumaker @ 2015-11-10 21:53 UTC
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

From: Zach Brown <zab@redhat.com>

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the near term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.
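
Because the call starts from descriptors that are already open, each
offset argument may be NULL.  Reading the syscall body in this patch, a
NULL pointer means the file's own position is used and then advanced,
while a non-NULL pointer is read, used, and written back, leaving the file
position untouched.  The sketch below models only that bookkeeping in
plain userspace C; `struct pos_file`, `resolve_pos()` and `advance_pos()`
are hypothetical names, not kernel API.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for an open file: only its position matters here. */
struct pos_file {
	off_t f_pos;
};

/* Starting offset: the caller's value if one was passed, else f_pos. */
off_t resolve_pos(const struct pos_file *f, const off_t *off)
{
	return off ? *off : f->f_pos;
}

/*
 * After 'copied' bytes were copied starting at 'pos', advance whichever
 * offset was used: the caller's variable or the file position, never both.
 */
void advance_pos(struct pos_file *f, off_t *off, off_t pos, ssize_t copied)
{
	if (copied <= 0)
		return;		/* error or nothing copied: leave offsets alone */
	if (off)
		*off = pos + copied;
	else
		f->f_pos = pos + copied;
}
```

Under this model, passing NULL for off_out and a pointer for off_in reads
from an explicit source offset while writing at the destination's current
file position.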

Currently the high-level vfs entry point limits copy offloading to files
on the same super block; ranges within the same file are allowed.  This can be
relaxed if we get implementations which can copy between file systems
safely.
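
The order of the up-front checks in vfs_copy_file_range() decides which
error a caller sees when several conditions fail at once.  A minimal
model, assuming a simplified `struct copy_check_file` (hypothetical) whose
fields stand in for FMODE_READ, FMODE_WRITE, O_APPEND, the presence of
f_op->copy_file_range, and the inode's super block; the rw_verify_area()
range checks are left out and `model_copy_file_range()` is illustrative
only.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified view of what vfs_copy_file_range() inspects. */
struct copy_check_file {
	bool readable;		/* f_mode & FMODE_READ */
	bool writable;		/* f_mode & FMODE_WRITE */
	bool append;		/* f_flags & O_APPEND */
	bool has_copy_op;	/* f_op->copy_file_range != NULL */
	int sb_id;		/* stands in for file_inode()->i_sb */
};

ssize_t model_copy_file_range(const struct copy_check_file *in,
			      const struct copy_check_file *out,
			      size_t len, unsigned int flags)
{
	if (flags != 0)
		return -EINVAL;
	/* rw_verify_area() range checks would run here; omitted */
	if (!in->readable || !out->writable || out->append ||
	    !out->has_copy_op)
		return -EBADF;
	if (in->sb_id != out->sb_id)
		return -EXDEV;
	if (len == 0)
		return 0;
	/* pretend the filesystem method copied the whole range */
	return (ssize_t)len;
}
```

In this model the method is looked up on the destination file only, so
the source file's operations never affect the result.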

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification,
                 Change flags parameter from int to unsigned int,
                 Add function to include/linux/syscalls.h,
                 Check copy len after file open mode,
                 Don't forbid ranges inside the same file,
                 Use rw_verify_area() to verify ranges,
                 Use file_out rather than file_in,
                 Add COPY_FR_REFLINK flag]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c                   | 120 ++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h                |   3 +
 include/linux/syscalls.h          |   3 +
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c                   |   1 +
 5 files changed, 130 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..97c15ca 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/pagemap.h>
 #include <linux/splice.h>
 #include <linux/compat.h>
+#include <linux/mount.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -1327,3 +1328,122 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 	return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows returning partial success.  When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			    struct file *file_out, loff_t pos_out,
+			    size_t len, unsigned int flags)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	ssize_t ret;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0)
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+	if (ret < 0)
+		return ret;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_out->f_op->copy_file_range)
+		return -EBADF;
+
+	/* this could be relaxed once a method supports cross-fs copies */
+	if (inode_in->i_sb != inode_out->i_sb)
+		return -EXDEV;
+
+	if (len == 0)
+		return 0;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+					      len, flags);
+	if (ret > 0) {
+		fsnotify_access(file_in);
+		add_rchar(current, ret);
+		fsnotify_modify(file_out);
+		add_wchar(current, ret);
+	}
+	inc_syscr(current);
+	inc_syscw(current);
+
+	mnt_drop_write_file(file_out);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+		int, fd_out, loff_t __user *, off_out,
+		size_t, len, unsigned int, flags)
+{
+	loff_t pos_in;
+	loff_t pos_out;
+	struct fd f_in;
+	struct fd f_out;
+	ssize_t ret = -EBADF;
+
+	f_in = fdget(fd_in);
+	if (!f_in.file)
+		goto out2;
+
+	f_out = fdget(fd_out);
+	if (!f_out.file)
+		goto out1;
+
+	ret = -EFAULT;
+	if (off_in) {
+		if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_in = f_in.file->f_pos;
+	}
+
+	if (off_out) {
+		if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_out = f_out.file->f_pos;
+	}
+
+	ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+				  flags);
+	if (ret > 0) {
+		pos_in += ret;
+		pos_out += ret;
+
+		if (off_in) {
+			if (copy_to_user(off_in, &pos_in, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_in.file->f_pos = pos_in;
+		}
+
+		if (off_out) {
+			if (copy_to_user(off_out, &pos_out, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_out.file->f_pos = pos_out;
+		}
+	}
+
+out:
+	fdput(f_in);
+out1:
+	fdput(f_out);
+out2:
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9a1cb8c..117b055 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1629,6 +1629,7 @@ struct file_operations {
 #ifndef CONFIG_MMU
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
+	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
 };
 
 struct inode_operations {
@@ -1682,6 +1683,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
+extern ssize_t vfs_copy_file_range(struct file *, loff_t, struct file *,
+				   loff_t, size_t, unsigned int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a156b82..3116e3c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -886,6 +886,9 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *envp, int flags);
 
 asmlinkage long sys_membarrier(int cmd, int flags);
+asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
+				    int fd_out, loff_t __user *off_out,
+				    size_t len, unsigned int flags);
 
 asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
 
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1324b02..2622b33 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -715,9 +715,11 @@ __SYSCALL(__NR_userfaultfd, sys_userfaultfd)
 __SYSCALL(__NR_membarrier, sys_membarrier)
 #define __NR_mlock2 284
 __SYSCALL(__NR_mlock2, sys_mlock2)
+#define __NR_copy_file_range 285
+__SYSCALL(__NR_copy_file_range, sys_copy_file_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 285
+#define __NR_syscalls 286
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0623787..2c5e3a8 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,6 +174,7 @@ cond_syscall(sys_setfsuid);
 cond_syscall(sys_setfsgid);
 cond_syscall(sys_capget);
 cond_syscall(sys_capset);
+cond_syscall(sys_copy_file_range);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);
-- 
2.6.2


* [PATCH v9 2/4] x86: add sys_copy_file_range to syscall tables
@ 2015-11-10 21:53   ` Anna Schumaker
From: Anna Schumaker @ 2015-11-10 21:53 UTC
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

From: Zach Brown <zab@redhat.com>

Add sys_copy_file_range to the x86 syscall tables.
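
The two table entries give copy_file_range() the number 377 on i386 and
326 on x86-64, while patch 1/4 assigns 285 in the asm-generic table; the
64-bit entry is marked `common`, meaning it is shared with the x32 ABI.
A C library without a wrapper would reach the call through syscall(2)
with one of these numbers.  The lookup below only records the numbers
from this v9 posting; `enum abi` and `copy_file_range_nr()` are
hypothetical helpers.

```c
/* Hypothetical labels for the three tables touched by this series. */
enum abi { ABI_I386, ABI_X86_64, ABI_ASM_GENERIC };

/* Syscall numbers as assigned in this v9 posting; -1 for unknown ABIs. */
int copy_file_range_nr(enum abi abi)
{
	switch (abi) {
	case ABI_I386:
		return 377;	/* arch/x86/entry/syscalls/syscall_32.tbl */
	case ABI_X86_64:
		return 326;	/* arch/x86/entry/syscalls/syscall_64.tbl */
	case ABI_ASM_GENERIC:
		return 285;	/* include/uapi/asm-generic/unistd.h */
	}
	return -1;
}
```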

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f17705e..cb713df 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -383,3 +383,4 @@
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
 376	i386	mlock2			sys_mlock2
+377	i386	copy_file_range		sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 314a90b..dc1040a 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -332,6 +332,7 @@
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
 325	common	mlock2			sys_mlock2
+326	common	copy_file_range		sys_copy_file_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.6.2


* [PATCH v9 3/4] btrfs: add .copy_file_range file operation
@ 2015-11-10 21:53   ` Anna Schumaker
From: Anna Schumaker @ 2015-11-10 21:53 UTC
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

From: Zach Brown <zab@redhat.com>

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.
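
After this patch the ioctl path keeps only the descriptor and open-mode
checks, btrfs_clone_files() keeps the btrfs-specific ones, and
btrfs_copy_file_range() turns the helper's 0-on-success into the byte
count the VFS expects.  The toy model below mirrors that split under
stated assumptions: `struct toy_file`, `toy_clone_files()` and
`toy_copy_file_range()` are hypothetical, and the read-only root (-EROFS)
check and the actual extent cloning are omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical file/inode stand-in carrying only what the checks need. */
struct toy_file {
	int sb_id;		/* inode->i_sb */
	int mnt_id;		/* file->f_path.mnt */
	int is_dir;		/* S_ISDIR(inode->i_mode) */
	int nodatasum;		/* BTRFS_INODE_NODATASUM */
};

/* Shared core: the btrfs-specific checks kept in btrfs_clone_files(). */
int toy_clone_files(const struct toy_file *dst, const struct toy_file *src)
{
	if (src->mnt_id != dst->mnt_id || src->sb_id != dst->sb_id)
		return -EXDEV;
	if (src->nodatasum != dst->nodatasum)
		return -EINVAL;	/* don't make dst partly checksummed */
	if (src->is_dir || dst->is_dir)
		return -EISDIR;
	return 0;		/* extent cloning would happen here */
}

/* .copy_file_range adapter: 0 from the core becomes "len bytes copied". */
ssize_t toy_copy_file_range(const struct toy_file *in,
			    const struct toy_file *out, size_t len)
{
	int ret = toy_clone_files(out, in);	/* destination first */

	return ret == 0 ? (ssize_t)len : ret;
}
```

One consequence visible in the patch: the btrfs method either clones the
whole range and reports len, or fails, so it never produces the partial
counts that the VFS comment allows.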

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Make flags an unsigned int,
                 Check for COPY_FR_REFLINK]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/ctree.h |  3 ++
 fs/btrfs/file.c  |  1 +
 fs/btrfs/ioctl.c | 91 ++++++++++++++++++++++++++++++++------------------------
 3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8c58191..dd7d888 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4051,6 +4051,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 		      loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 6bd5ce9..5e914b3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2912,6 +2912,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index da94138..0f92735 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3779,17 +3779,16 @@ out:
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+					u64 off, u64 olen, u64 destoff)
 {
 	struct inode *inode = file_inode(file);
+	struct inode *src = file_inode(file_src);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct fd src_file;
-	struct inode *src;
 	int ret;
 	u64 len = olen;
 	u64 bs = root->fs_info->sb->s_blocksize;
-	int same_inode = 0;
+	int same_inode = src == inode;
 
 	/*
 	 * TODO:
@@ -3802,49 +3801,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	 *   be either compressed or non-compressed.
 	 */
 
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != file->f_path.mnt)
-		goto out_fput;
-
-	src = file_inode(src_file.file);
-
-	ret = -EINVAL;
-	if (src == inode)
-		same_inode = 1;
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
+	if (file_src->f_path.mnt != file->f_path.mnt ||
+	    src->i_sb != inode->i_sb)
+		return -EXDEV;
 
 	/* don't make the dst file partly checksummed */
 	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
 	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto out_fput;
+		return -EINVAL;
 
-	ret = -EISDIR;
 	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src->i_sb != inode->i_sb)
-		goto out_fput;
+		return -EISDIR;
 
 	if (!same_inode) {
 		btrfs_double_inode_lock(src, inode);
@@ -3921,6 +3891,49 @@ out_unlock:
 		btrfs_double_inode_unlock(src, inode);
 	else
 		mutex_unlock(&src->i_mutex);
+	return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags)
+{
+	ssize_t ret;
+
+	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+	if (ret == 0)
+		ret = len;
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+				       u64 off, u64 olen, u64 destoff)
+{
+	struct fd src_file;
+	int ret;
+
+	/* the destination must be opened for writing */
+	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		ret = -EBADF;
+		goto out_drop_write;
+	}
+
+	/* the src must be open for reading */
+	if (!(src_file.file->f_mode & FMODE_READ)) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
+
 out_fput:
 	fdput(src_file);
 out_drop_write:
-- 
2.6.2
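The refactor in patch 3/4 splits the old ioctl in two. The fd handling (the FMODE_WRITE/O_APPEND test, mnt_want_write_file(), fdget()) stays in btrfs_ioctl_clone(). The shared checks move into btrfs_clone_files(), which returns 0 on success, and btrfs_copy_file_range() turns that 0 into a byte count. Below is a rough user-space model of just that control flow. The model_* types and functions are hypothetical stand-ins, and after these checks the real function goes on to lock the inodes and clone extents.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, heavily simplified stand-ins for kernel objects. */
struct model_inode {
	const void *sb;   /* owning superblock */
	bool nodatasum;   /* models BTRFS_INODE_NODATASUM */
	bool is_dir;
};

struct model_file {
	const void *mnt;  /* models f_path.mnt */
	struct model_inode *inode;
};

/* Same order as the early checks in btrfs_clone_files(dst, src, ...). */
int model_clone_checks(const struct model_file *dst,
		       const struct model_file *src, bool root_readonly)
{
	assert(dst && src);
	if (root_readonly)
		return -EROFS;
	if (src->mnt != dst->mnt || src->inode->sb != dst->inode->sb)
		return -EXDEV;
	if (src->inode->nodatasum != dst->inode->nodatasum)
		return -EINVAL;  /* don't make dst partly checksummed */
	if (src->inode->is_dir || dst->inode->is_dir)
		return -EISDIR;
	return 0;  /* the real code now locks the inodes and clones extents */
}

/* Like btrfs_copy_file_range(): a 0 from the clone path becomes len. */
ssize_t model_copy_file_range(const struct model_file *in,
			      const struct model_file *out, size_t len)
{
	ssize_t ret = model_clone_checks(out, in, false);

	if (ret == 0)
		ret = (ssize_t)len;
	return ret;
}
```

As posted, btrfs_copy_file_range() therefore reports either the full len or a negative error. It never returns a short count, and it does not look at flags.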


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v9 4/4] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-11-10 21:53 ` Anna Schumaker
@ 2015-11-10 21:53   ` Anna Schumaker
  -1 siblings, 0 replies; 19+ messages in thread
From: Anna Schumaker @ 2015-11-10 21:53 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

This allows us to have an in-kernel copy mechanism that avoids frequent
switches between kernel and user space.  This is especially useful because
it lets NFSD support server-side copies.

The default (flags=0) is to attempt copy acceleration first, and to fall
back to a pagecache copy (via do_splice_direct()) if the filesystem has no
->copy_file_range() method or that method returns -EOPNOTSUPP.
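The fallback in the read_write.c hunk can be modelled in user space roughly as follows. This is a hypothetical sketch, not kernel code. model_vfs_copy() stands in for vfs_copy_file_range(), the accel function pointer stands in for ->copy_file_range, and model_splice_copy() stands in for do_splice_direct(). MODEL_MAX_RW_COUNT is an arbitrarily small stand-in for MAX_RW_COUNT so that the clamping is visible.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Arbitrary small cap standing in for the kernel's MAX_RW_COUNT. */
#define MODEL_MAX_RW_COUNT ((size_t)8)

/* Optional per-filesystem method, like ->copy_file_range. */
typedef ssize_t (*model_copy_fn)(const char *src, char *dst, size_t len);

/* Stand-in for do_splice_direct(): a plain byte copy. */
static ssize_t model_splice_copy(const char *src, char *dst, size_t len)
{
	memcpy(dst, src, len);
	return (ssize_t)len;
}

/* Same decision logic as the patched vfs_copy_file_range(). */
ssize_t model_vfs_copy(model_copy_fn accel, const char *src, char *dst,
		       size_t len)
{
	ssize_t ret = -EOPNOTSUPP;

	assert(src && dst);
	if (accel)
		ret = accel(src, dst, len);
	if (ret == -EOPNOTSUPP)   /* only this error triggers the fallback */
		ret = model_splice_copy(src, dst,
				len > MODEL_MAX_RW_COUNT ? MODEL_MAX_RW_COUNT : len);
	return ret;
}

/* Example methods: one declines the request, one fails outright. */
ssize_t model_decline(const char *src, char *dst, size_t len)
{
	(void)src; (void)dst; (void)len;
	return -EOPNOTSUPP;
}

ssize_t model_io_error(const char *src, char *dst, size_t len)
{
	(void)src; (void)dst; (void)len;
	return -EIO;
}
```

Only -EOPNOTSUPP, which includes the "no method" case, triggers the pagecache path. Any other error from the filesystem is returned unchanged, and the fallback may copy fewer bytes than were asked for.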

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Padraig Brady <P@draigBrady.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
v9:
- Don't remove call to rw_verify_area()
---
 fs/read_write.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 97c15ca..6f73af1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1354,8 +1354,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
-	    (file_out->f_flags & O_APPEND) ||
-	    !file_out->f_op->copy_file_range)
+	    (file_out->f_flags & O_APPEND))
 		return -EBADF;
 
 	/* this could be relaxed once a method supports cross-fs copies */
@@ -1369,8 +1368,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					      len, flags);
+	ret = -EOPNOTSUPP;
+	if (file_out->f_op->copy_file_range)
+		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+						      pos_out, len, flags);
+	if (ret == -EOPNOTSUPP)
+		ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
+				len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
+
 	if (ret > 0) {
 		fsnotify_access(file_in);
 		add_rchar(current, ret);
-- 
2.6.2
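A single call may copy fewer bytes than requested, because the pagecache path is clamped to MAX_RW_COUNT. User-space callers should therefore loop. The following is a minimal sketch, assuming glibc 2.27 or later, which declares copy_file_range() in <unistd.h> when _GNU_SOURCE is defined. copy_all(), write_all() and cfr_unusable() are hypothetical helpers. The errno values that trigger the read()/write() fallback are a judgement call, not something this series specifies.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* errno values after which we stop trying copy_file_range() (assumed list). */
static bool cfr_unusable(int err)
{
	return err == ENOSYS || err == EOPNOTSUPP || err == EXDEV ||
	       err == EINVAL || err == EPERM;
}

/* Write all n bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t n)
{
	while (n > 0) {
		ssize_t w = write(fd, buf, n);

		if (w < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += w;
		n -= (size_t)w;
	}
	return 0;
}

/*
 * Copy up to len bytes from fd_in to fd_out, starting at (and advancing)
 * each file's current offset; stops early at EOF on fd_in.
 * Returns 0 on success, or -1 with errno set.
 */
int copy_all(int fd_in, int fd_out, size_t len)
{
	bool try_cfr = true;
	char buf[65536];

	assert(fd_in >= 0 && fd_out >= 0);
	while (len > 0) {
		ssize_t n;

		if (try_cfr) {
			/* NULL offsets: use and update the file positions. */
			n = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
			if (n < 0 && cfr_unusable(errno)) {
				try_cfr = false;  /* use read()/write() from here on */
				continue;
			}
		} else {
			n = read(fd_in, buf, len < sizeof(buf) ? len : sizeof(buf));
			if (n > 0 && write_all(fd_out, buf, (size_t)n) < 0)
				return -1;
		}
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			break;  /* EOF on fd_in */
		len -= (size_t)n;
	}
	return 0;
}
```

To use it, open the source with O_RDONLY and the destination with O_WRONLY | O_CREAT, then call copy_all(in, out, size). Because the offsets are passed as NULL, a partial copy_file_range() followed by the fallback simply continues from the advanced file positions.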


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v9 0/4] VFS: In-kernel copy system call
  2015-11-10 21:53 ` Anna Schumaker
                   ` (5 preceding siblings ...)
  (?)
@ 2015-11-11  3:38 ` Al Viro
  2015-11-11 14:00     ` Anna Schumaker
  -1 siblings, 1 reply; 19+ messages in thread
From: Al Viro @ 2015-11-11  3:38 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, clm,
	darrick.wong, mtk.manpages, andros, hch

On Tue, Nov 10, 2015 at 04:53:29PM -0500, Anna Schumaker wrote:
> Copy system calls came up during Plumbers a while ago, mostly because several
> filesystems (including NFS and XFS) are currently working on copy acceleration
> implementations.  We haven't heard from Zach Brown in a while, so I volunteered
> to push his patches upstream so individual filesystems don't need to keep
> writing their own ioctls.

OK, taken for the next cycle.  FWIW, I'm going to toss COMPAT_SYSCALL_DEFINE
counterpart in as well - right now it's doing only native, AFAICS.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v9 0/4] VFS: In-kernel copy system call
@ 2015-11-11 14:00     ` Anna Schumaker
  0 siblings, 0 replies; 19+ messages in thread
From: Anna Schumaker @ 2015-11-11 14:00 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, clm,
	darrick.wong, mtk.manpages, andros, hch

On 11/10/2015 10:38 PM, Al Viro wrote:
> On Tue, Nov 10, 2015 at 04:53:29PM -0500, Anna Schumaker wrote:
>> Copy system calls came up during Plumbers a while ago, mostly because several
>> filesystems (including NFS and XFS) are currently working on copy acceleration
>> implementations.  We haven't heard from Zach Brown in a while, so I volunteered
>> to push his patches upstream so individual filesystems don't need to keep
>> writing their own ioctls.
> 
> OK, taken for the next cycle.  FWIW, I'm going to toss COMPAT_SYSCALL_DEFINE
> counterpart in as well - right now it's doing only native, AFAICS.
> 

Thanks, Al!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v9 0/4] VFS: In-kernel copy system call
@ 2015-11-11 14:53   ` Eric Biggers
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Biggers @ 2015-11-11 14:53 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

On Tue, Nov 10, 2015 at 04:53:30PM -0500, Anna Schumaker wrote:
>	out:
>		fdput(f_in);
>	out1:
>		fdput(f_out);

The fdput()s are in the wrong order.  fdget(f_in) is first at the beginning, so
fdput(f_in) needs to be last at the end.
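Below is a hypothetical user-space model of the unwinding Eric describes. It assumes that a failed fdget(fd_out) jumps to out1 and a failed fdget(fd_in) jumps to out2, as the label names suggest. In this model, releases happen in reverse order of acquisition, so f_in is still dropped when the second fdget() fails. With the order as quoted, that path would run fdput() on the empty f_out and never release f_in. model_fdget(), model_fdput() and the release log are illustrative stand-ins, not kernel APIs. Making the model's fdput() of an empty fd a no-op mirrors fdput() after a failed fdget().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct fd: an fd number and whether we hold it. */
struct model_fd {
	int fd;
	bool held;
};

/* Records the order in which references are dropped. */
static int put_log[4];
static int put_count;

static struct model_fd model_fdget(int fd, bool succeeds)
{
	struct model_fd f = { fd, succeeds };
	return f;
}

static void model_fdput(struct model_fd f)
{
	if (!f.held)       /* like fdput() after a failed fdget(): no-op */
		return;
	assert(put_count < 4);
	put_log[put_count++] = f.fd;
}

/* Unwinds in reverse order of acquisition, as Eric suggests. */
int model_copy_syscall(bool in_ok, bool out_ok)
{
	int ret = -EBADF;
	struct model_fd f_in, f_out;

	f_in = model_fdget(3, in_ok);
	if (!f_in.held)
		goto out2;
	f_out = model_fdget(4, out_ok);
	if (!f_out.held)
		goto out1;

	ret = 0;           /* ... the copy itself would happen here ... */

	/* the kernel's "out:" label would sit here; this model needs none */
	model_fdput(f_out);
out1:
	model_fdput(f_in);
out2:
	return ret;
}
```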

>       /* this could be relaxed once a method supports cross-fs copies */
>       if (inode_in->i_sb != inode_out->i_sb)
>               return -EXDEV;

This allows the same superblock but different mounts --- is that intentional?
The commit message says otherwise: it says the vfs entry point requires the same
superblock and mount.


Was there a decision made on FMODE_PREAD and FMODE_PWRITE?  To me it seems
logical that if the user explicitly specifies an offset, then the
corresponding mode should be checked.  That would check whether the file is
seekable or not, I believe.  Note that e.g. sys_sendfile() does the same thing.
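The following user-space illustration of Eric's point is not code from the patch set. pread(2) with an explicit offset on a pipe fails with ESPIPE, because pipes are not opened with FMODE_PREAD and the kernel's pread path checks that flag before reading. pread_on_pipe_errno() is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Performs a positioned read (offset 0) on the read end of a fresh pipe.
 * Returns the resulting errno, or 0 if the read unexpectedly succeeded.
 */
int pread_on_pipe_errno(void)
{
	int fds[2];
	char c;
	int err = 0;

	if (pipe(fds) < 0)
		return errno;
	/* Put a byte in the pipe so a plain read would not block. */
	if (write(fds[1], "x", 1) != 1)
		err = errno;
	else if (pread(fds[0], &c, 1, 0) < 0)
		err = errno;
	close(fds[0]);
	close(fds[1]);
	return err;
}
```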

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v9 0/4] VFS: In-kernel copy system call
  2015-11-11 14:53   ` Eric Biggers
  (?)
@ 2015-11-12 12:39   ` Austin S Hemmelgarn
  -1 siblings, 0 replies; 19+ messages in thread
From: Austin S Hemmelgarn @ 2015-11-12 12:39 UTC (permalink / raw)
  To: Eric Biggers, Anna Schumaker
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

On 2015-11-11 09:53, Eric Biggers wrote:
> On Tue, Nov 10, 2015 at 04:53:30PM -0500, Anna Schumaker wrote:
>>        /* this could be relaxed once a method supports cross-fs copies */
>>        if (inode_in->i_sb != inode_out->i_sb)
>>                return -EXDEV;
>
> This allows the same superblock but different mounts --- is that intentional?
> The commit message says otherwise: it says the vfs entry point requires the same
> superblock and mount.
This could be important for BTRFS (because of subvolumes, you can have 
multiple mounts that have the same SB, and as of right now, it can be 
hard to tell the difference between a nested subvolume and a separate 
mount point).  We do support cross-subvolume reflinks, but I'm not sure 
if that works between subvolumes mounted at different places (if it 
doesn't, I may have to look into getting that changed).
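To make the distinction concrete, here is a hypothetical model of the two cross-device checks in this series. Two btrfs subvolumes mounted at different places are represented as two mounts that share one superblock. Under that assumption, such a copy passes the VFS superblock test, but the btrfs_clone_files() check from patch 3/4 still returns -EXDEV. With patch 4/4 as posted, vfs_copy_file_range() only falls back to the pagecache on -EOPNOTSUPP, so that -EXDEV would reach the caller. struct xdev_file and both check functions are illustrative names only.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in: a file identified by its mount and superblock. */
struct xdev_file {
	int mnt_id;
	int sb_id;
};

/* The vfs_copy_file_range() check quoted above: superblocks must match. */
int vfs_style_xdev_check(const struct xdev_file *in, const struct xdev_file *out)
{
	assert(in && out);
	return in->sb_id != out->sb_id ? -EXDEV : 0;
}

/* The btrfs_clone_files() check from patch 3/4: mount and superblock. */
int btrfs_style_xdev_check(const struct xdev_file *in, const struct xdev_file *out)
{
	assert(in && out);
	if (in->mnt_id != out->mnt_id || in->sb_id != out->sb_id)
		return -EXDEV;
	return 0;
}
```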


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2015-11-12 12:39 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-10 21:53 [PATCH v9 0/4] VFS: In-kernel copy system call Anna Schumaker
2015-11-10 21:53 ` Anna Schumaker
2015-11-10 21:53 ` Anna Schumaker
2015-11-10 21:53 ` [PATCH v9 1/4] vfs: add copy_file_range syscall and vfs helper Anna Schumaker
2015-11-10 21:53   ` Anna Schumaker
2015-11-10 21:53 ` [PATCH v9 2/4] x86: add sys_copy_file_range to syscall tables Anna Schumaker
2015-11-10 21:53   ` Anna Schumaker
2015-11-10 21:53   ` Anna Schumaker
2015-11-10 21:53 ` [PATCH v9 3/4] btrfs: add .copy_file_range file operation Anna Schumaker
2015-11-10 21:53   ` Anna Schumaker
2015-11-10 21:53 ` [PATCH v9 4/4] vfs: Add vfs_copy_file_range() support for pagecache copies Anna Schumaker
2015-11-10 21:53   ` Anna Schumaker
2015-11-11  3:38 ` [PATCH v9 0/4] VFS: In-kernel copy system call Al Viro
2015-11-11 14:00   ` Anna Schumaker
2015-11-11 14:00     ` Anna Schumaker
2015-11-11 14:00     ` Anna Schumaker
2015-11-11 14:53 ` Eric Biggers
2015-11-11 14:53   ` Eric Biggers
2015-11-12 12:39   ` Austin S Hemmelgarn
