All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/9] VFS: In-kernel copy system call
@ 2015-09-30 17:26 ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

Copy system calls came up during Plumbers a while ago, mostly because several
filesystems (including NFS and XFS) are currently working on copy acceleration
implementations.  We haven't heard from Zach Brown in a while, so I volunteered
to push his patches upstream so individual filesystems don't need to keep
writing their own ioctls.

This posting fixes a few issues that popped up after I submitted v4 yesterday.

Changes in v5:
- Bump syscall number (again)
- Add sys_copy_file_range() to include/linux/syscalls.h
- Change flags parameter on btrfs to an unsigned int


Anna Schumaker (6):
  vfs: Copy should check len after file open mode
  vfs: Copy shouldn't forbid ranges inside the same file
  vfs: Copy should use file_out rather than file_in
  vfs: Remove copy_file_range mountpoint checks
  vfs: Add vfs_copy_file_range() support for pagecache copies
  btrfs: btrfs_copy_file_range() only supports reflinks

Zach Brown (3):
  vfs: add copy_file_range syscall and vfs helper
  x86: add sys_copy_file_range to syscall tables
  btrfs: add .copy_file_range file operation

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/btrfs/ctree.h                       |   3 +
 fs/btrfs/file.c                        |   1 +
 fs/btrfs/ioctl.c                       |  95 +++++++++++++---------
 fs/read_write.c                        | 141 +++++++++++++++++++++++++++++++++
 include/linux/copy.h                   |   6 ++
 include/linux/fs.h                     |   3 +
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/Kbuild              |   1 +
 include/uapi/linux/copy.h              |   8 ++
 kernel/sys_ni.c                        |   1 +
 13 files changed, 228 insertions(+), 40 deletions(-)
 create mode 100644 include/linux/copy.h
 create mode 100644 include/uapi/linux/copy.h

-- 
2.6.0


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v5 0/9] VFS: In-kernel copy system call
@ 2015-09-30 17:26 ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

Copy system calls came up during Plumbers a while ago, mostly because several
filesystems (including NFS and XFS) are currently working on copy acceleration
implementations.  We haven't heard from Zach Brown in a while, so I volunteered
to push his patches upstream so individual filesystems don't need to keep
writing their own ioctls.

This posting fixes a few issues that popped up after I submitted v4 yesterday.

Changes in v5:
- Bump syscall number (again)
- Add sys_copy_file_range() to include/linux/syscalls.h
- Change flags parameter on btrfs to an unsigned int


Anna Schumaker (6):
  vfs: Copy should check len after file open mode
  vfs: Copy shouldn't forbid ranges inside the same file
  vfs: Copy should use file_out rather than file_in
  vfs: Remove copy_file_range mountpoint checks
  vfs: Add vfs_copy_file_range() support for pagecache copies
  btrfs: btrfs_copy_file_range() only supports reflinks

Zach Brown (3):
  vfs: add copy_file_range syscall and vfs helper
  x86: add sys_copy_file_range to syscall tables
  btrfs: add .copy_file_range file operation

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/btrfs/ctree.h                       |   3 +
 fs/btrfs/file.c                        |   1 +
 fs/btrfs/ioctl.c                       |  95 +++++++++++++---------
 fs/read_write.c                        | 141 +++++++++++++++++++++++++++++++++
 include/linux/copy.h                   |   6 ++
 include/linux/fs.h                     |   3 +
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/Kbuild              |   1 +
 include/uapi/linux/copy.h              |   8 ++
 kernel/sys_ni.c                        |   1 +
 13 files changed, 228 insertions(+), 40 deletions(-)
 create mode 100644 include/linux/copy.h
 create mode 100644 include/uapi/linux/copy.h

-- 
2.6.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v5 0/9] VFS: In-kernel copy system call
@ 2015-09-30 17:26 ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

Copy system calls came up during Plumbers a while ago, mostly because several
filesystems (including NFS and XFS) are currently working on copy acceleration
implementations.  We haven't heard from Zach Brown in a while, so I volunteered
to push his patches upstream so individual filesystems don't need to keep
writing their own ioctls.

This posting fixes a few issues that popped up after I submitted v4 yesterday.

Changes in v5:
- Bump syscall number (again)
- Add sys_copy_file_range() to include/linux/syscalls.h
- Change flags parameter on btrfs to an unsigned int


Anna Schumaker (6):
  vfs: Copy should check len after file open mode
  vfs: Copy shouldn't forbid ranges inside the same file
  vfs: Copy should use file_out rather than file_in
  vfs: Remove copy_file_range mountpoint checks
  vfs: Add vfs_copy_file_range() support for pagecache copies
  btrfs: btrfs_copy_file_range() only supports reflinks

Zach Brown (3):
  vfs: add copy_file_range syscall and vfs helper
  x86: add sys_copy_file_range to syscall tables
  btrfs: add .copy_file_range file operation

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/btrfs/ctree.h                       |   3 +
 fs/btrfs/file.c                        |   1 +
 fs/btrfs/ioctl.c                       |  95 +++++++++++++---------
 fs/read_write.c                        | 141 +++++++++++++++++++++++++++++++++
 include/linux/copy.h                   |   6 ++
 include/linux/fs.h                     |   3 +
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/Kbuild              |   1 +
 include/uapi/linux/copy.h              |   8 ++
 kernel/sys_ni.c                        |   1 +
 13 files changed, 228 insertions(+), 40 deletions(-)
 create mode 100644 include/linux/copy.h
 create mode 100644 include/uapi/linux/copy.h

-- 
2.6.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v5 1/9] vfs: add copy_file_range syscall and vfs helper
  2015-09-30 17:26 ` Anna Schumaker
@ 2015-09-30 17:26   ` Anna Schumaker
  -1 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

From: Zach Brown <zab@redhat.com>

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file).  This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification]
[Anna Schumaker: Change flags parameter from int to unsigned int]
[Anna Schumaker: Add function to include/linux/syscalls.h]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
---
v5:
- Bump syscall number again
- Add to include/linux/syscalls.h
---
 fs/read_write.c                   | 129 ++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h                |   3 +
 include/linux/syscalls.h          |   3 +
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c                   |   1 +
 5 files changed, 139 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..dd10750 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/pagemap.h>
 #include <linux/splice.h>
 #include <linux/compat.h>
+#include <linux/mount.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -1327,3 +1328,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 	return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success.  When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			    struct file *file_out, loff_t pos_out,
+			    size_t len, unsigned int flags)
+{
+	struct inode *inode_in;
+	struct inode *inode_out;
+	ssize_t ret;
+
+	if (flags)
+		return -EINVAL;
+
+	if (len == 0)
+		return 0;
+
+	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0)
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+	if (ret < 0)
+		return ret;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_in->f_op || !file_in->f_op->copy_file_range)
+		return -EBADF;
+
+	inode_in = file_inode(file_in);
+	inode_out = file_inode(file_out);
+
+	/* make sure offsets don't wrap and the input is inside i_size */
+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
+	    pos_in + len > i_size_read(inode_in))
+		return -EINVAL;
+
+	/* this could be relaxed once a method supports cross-fs copies */
+	if (inode_in->i_sb != inode_out->i_sb ||
+	    file_in->f_path.mnt != file_out->f_path.mnt)
+		return -EXDEV;
+
+	/* forbid ranges in the same file */
+	if (inode_in == inode_out)
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+					     len, flags);
+	if (ret > 0) {
+		fsnotify_access(file_in);
+		add_rchar(current, ret);
+		fsnotify_modify(file_out);
+		add_wchar(current, ret);
+	}
+	inc_syscr(current);
+	inc_syscw(current);
+
+	mnt_drop_write_file(file_out);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+		int, fd_out, loff_t __user *, off_out,
+		size_t, len, unsigned int, flags)
+{
+	loff_t pos_in;
+	loff_t pos_out;
+	struct fd f_in;
+	struct fd f_out;
+	ssize_t ret;
+
+	f_in = fdget(fd_in);
+	f_out = fdget(fd_out);
+	if (!f_in.file || !f_out.file) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	ret = -EFAULT;
+	if (off_in) {
+		if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_in = f_in.file->f_pos;
+	}
+
+	if (off_out) {
+		if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_out = f_out.file->f_pos;
+	}
+
+	ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+				  flags);
+	if (ret > 0) {
+		pos_in += ret;
+		pos_out += ret;
+
+		if (off_in) {
+			if (copy_to_user(off_in, &pos_in, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_in.file->f_pos = pos_in;
+		}
+
+		if (off_out) {
+			if (copy_to_user(off_out, &pos_out, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_out.file->f_pos = pos_out;
+		}
+	}
+out:
+	fdput(f_in);
+	fdput(f_out);
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..6220307 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1642,6 +1642,7 @@ struct file_operations {
 #ifndef CONFIG_MMU
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
+	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
 };
 
 struct inode_operations {
@@ -1695,6 +1696,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
+extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
+				   loff_t, size_t, unsigned int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a460e2e..290205f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -886,5 +886,8 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *envp, int flags);
 
 asmlinkage long sys_membarrier(int cmd, int flags);
+asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
+				    int fd_out, loff_t __user *off_out,
+				    size_t len, unsigned int flags);
 
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index ee12400..2d79155 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -713,9 +713,11 @@ __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
 __SYSCALL(__NR_userfaultfd, sys_userfaultfd)
 #define __NR_membarrier 283
 __SYSCALL(__NR_membarrier, sys_membarrier)
+#define __NR_copy_file_range 284
+__SYSCALL(__NR_copy_file_range, sys_copy_file_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 284
+#define __NR_syscalls 285
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a02decf..83c5c82 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,6 +174,7 @@ cond_syscall(sys_setfsuid);
 cond_syscall(sys_setfsgid);
 cond_syscall(sys_capget);
 cond_syscall(sys_capset);
+cond_syscall(sys_copy_file_range);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 1/9] vfs: add copy_file_range syscall and vfs helper
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

From: Zach Brown <zab@redhat.com>

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file).  This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification]
[Anna Schumaker: Change flags parameter from int to unsigned int]
[Anna Schumaker: Add function to include/linux/syscalls.h]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
---
v5:
- Bump syscall number again
- Add to include/linux/syscalls.h
---
 fs/read_write.c                   | 129 ++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h                |   3 +
 include/linux/syscalls.h          |   3 +
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c                   |   1 +
 5 files changed, 139 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..dd10750 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/pagemap.h>
 #include <linux/splice.h>
 #include <linux/compat.h>
+#include <linux/mount.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -1327,3 +1328,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 	return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success.  When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			    struct file *file_out, loff_t pos_out,
+			    size_t len, unsigned int flags)
+{
+	struct inode *inode_in;
+	struct inode *inode_out;
+	ssize_t ret;
+
+	if (flags)
+		return -EINVAL;
+
+	if (len == 0)
+		return 0;
+
+	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0)
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+	if (ret < 0)
+		return ret;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_in->f_op || !file_in->f_op->copy_file_range)
+		return -EBADF;
+
+	inode_in = file_inode(file_in);
+	inode_out = file_inode(file_out);
+
+	/* make sure offsets don't wrap and the input is inside i_size */
+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
+	    pos_in + len > i_size_read(inode_in))
+		return -EINVAL;
+
+	/* this could be relaxed once a method supports cross-fs copies */
+	if (inode_in->i_sb != inode_out->i_sb ||
+	    file_in->f_path.mnt != file_out->f_path.mnt)
+		return -EXDEV;
+
+	/* forbid ranges in the same file */
+	if (inode_in == inode_out)
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+					     len, flags);
+	if (ret > 0) {
+		fsnotify_access(file_in);
+		add_rchar(current, ret);
+		fsnotify_modify(file_out);
+		add_wchar(current, ret);
+	}
+	inc_syscr(current);
+	inc_syscw(current);
+
+	mnt_drop_write_file(file_out);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+		int, fd_out, loff_t __user *, off_out,
+		size_t, len, unsigned int, flags)
+{
+	loff_t pos_in;
+	loff_t pos_out;
+	struct fd f_in;
+	struct fd f_out;
+	ssize_t ret;
+
+	f_in = fdget(fd_in);
+	f_out = fdget(fd_out);
+	if (!f_in.file || !f_out.file) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	ret = -EFAULT;
+	if (off_in) {
+		if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_in = f_in.file->f_pos;
+	}
+
+	if (off_out) {
+		if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_out = f_out.file->f_pos;
+	}
+
+	ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+				  flags);
+	if (ret > 0) {
+		pos_in += ret;
+		pos_out += ret;
+
+		if (off_in) {
+			if (copy_to_user(off_in, &pos_in, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_in.file->f_pos = pos_in;
+		}
+
+		if (off_out) {
+			if (copy_to_user(off_out, &pos_out, sizeof(loff_t)))
+				ret = -EFAULT;
+		} else {
+			f_out.file->f_pos = pos_out;
+		}
+	}
+out:
+	fdput(f_in);
+	fdput(f_out);
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..6220307 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1642,6 +1642,7 @@ struct file_operations {
 #ifndef CONFIG_MMU
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
+	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
 };
 
 struct inode_operations {
@@ -1695,6 +1696,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
+extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
+				   loff_t, size_t, unsigned int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a460e2e..290205f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -886,5 +886,8 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *envp, int flags);
 
 asmlinkage long sys_membarrier(int cmd, int flags);
+asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
+				    int fd_out, loff_t __user *off_out,
+				    size_t len, unsigned int flags);
 
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index ee12400..2d79155 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -713,9 +713,11 @@ __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
 __SYSCALL(__NR_userfaultfd, sys_userfaultfd)
 #define __NR_membarrier 283
 __SYSCALL(__NR_membarrier, sys_membarrier)
+#define __NR_copy_file_range 284
+__SYSCALL(__NR_copy_file_range, sys_copy_file_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 284
+#define __NR_syscalls 285
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a02decf..83c5c82 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,6 +174,7 @@ cond_syscall(sys_setfsuid);
 cond_syscall(sys_setfsgid);
 cond_syscall(sys_capget);
 cond_syscall(sys_capset);
+cond_syscall(sys_copy_file_range);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 2/9] x86: add sys_copy_file_range to syscall tables
  2015-09-30 17:26 ` Anna Schumaker
@ 2015-09-30 17:26   ` Anna Schumaker
  -1 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

From: Zach Brown <zab@redhat.com>

Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 7663c45..0531270 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -382,3 +382,4 @@
 373	i386	shutdown		sys_shutdown
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
+376	i386	copy_file_range		sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 278842f..03a9396 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
+325	common	copy_file_range		sys_copy_file_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 2/9] x86: add sys_copy_file_range to syscall tables
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

From: Zach Brown <zab@redhat.com>

Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 7663c45..0531270 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -382,3 +382,4 @@
 373	i386	shutdown		sys_shutdown
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
+376	i386	copy_file_range		sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 278842f..03a9396 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
+325	common	copy_file_range		sys_copy_file_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 3/9] btrfs: add .copy_file_range file operation
  2015-09-30 17:26 ` Anna Schumaker
@ 2015-09-30 17:26   ` Anna Schumaker
  -1 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

From: Zach Brown <zab@redhat.com>

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Make flags an unsigned int]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
v5:
- Make flags variable an unsigned int
---
 fs/btrfs/ctree.h |  3 ++
 fs/btrfs/file.c  |  1 +
 fs/btrfs/ioctl.c | 91 ++++++++++++++++++++++++++++++++------------------------
 3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..0046567 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3996,6 +3996,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 		      loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..b05449c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2816,6 +2816,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0adf542..d3697e8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3727,17 +3727,16 @@ out:
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+					u64 off, u64 olen, u64 destoff)
 {
 	struct inode *inode = file_inode(file);
+	struct inode *src = file_inode(file_src);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct fd src_file;
-	struct inode *src;
 	int ret;
 	u64 len = olen;
 	u64 bs = root->fs_info->sb->s_blocksize;
-	int same_inode = 0;
+	int same_inode = src == inode;
 
 	/*
 	 * TODO:
@@ -3750,49 +3749,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	 *   be either compressed or non-compressed.
 	 */
 
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != file->f_path.mnt)
-		goto out_fput;
-
-	src = file_inode(src_file.file);
-
-	ret = -EINVAL;
-	if (src == inode)
-		same_inode = 1;
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
+	if (file_src->f_path.mnt != file->f_path.mnt ||
+	    src->i_sb != inode->i_sb)
+		return -EXDEV;
 
 	/* don't make the dst file partly checksummed */
 	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
 	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto out_fput;
+		return -EINVAL;
 
-	ret = -EISDIR;
 	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src->i_sb != inode->i_sb)
-		goto out_fput;
+		return -EISDIR;
 
 	if (!same_inode) {
 		btrfs_double_inode_lock(src, inode);
@@ -3869,6 +3839,49 @@ out_unlock:
 		btrfs_double_inode_unlock(src, inode);
 	else
 		mutex_unlock(&src->i_mutex);
+	return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags)
+{
+	ssize_t ret;
+
+	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+	if (ret == 0)
+		ret = len;
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+				       u64 off, u64 olen, u64 destoff)
+{
+	struct fd src_file;
+	int ret;
+
+	/* the destination must be opened for writing */
+	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		ret = -EBADF;
+		goto out_drop_write;
+	}
+
+	/* the src must be open for reading */
+	if (!(src_file.file->f_mode & FMODE_READ)) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
+
 out_fput:
 	fdput(src_file);
 out_drop_write:
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 3/9] btrfs: add .copy_file_range file operation
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

From: Zach Brown <zab@redhat.com>

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Make flags an unsigned int]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
v5:
- Make flags variable an unsigned int
---
 fs/btrfs/ctree.h |  3 ++
 fs/btrfs/file.c  |  1 +
 fs/btrfs/ioctl.c | 91 ++++++++++++++++++++++++++++++++------------------------
 3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..0046567 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3996,6 +3996,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 		      loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..b05449c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2816,6 +2816,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0adf542..d3697e8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3727,17 +3727,16 @@ out:
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+					u64 off, u64 olen, u64 destoff)
 {
 	struct inode *inode = file_inode(file);
+	struct inode *src = file_inode(file_src);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct fd src_file;
-	struct inode *src;
 	int ret;
 	u64 len = olen;
 	u64 bs = root->fs_info->sb->s_blocksize;
-	int same_inode = 0;
+	int same_inode = src == inode;
 
 	/*
 	 * TODO:
@@ -3750,49 +3749,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	 *   be either compressed or non-compressed.
 	 */
 
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != file->f_path.mnt)
-		goto out_fput;
-
-	src = file_inode(src_file.file);
-
-	ret = -EINVAL;
-	if (src == inode)
-		same_inode = 1;
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
+	if (file_src->f_path.mnt != file->f_path.mnt ||
+	    src->i_sb != inode->i_sb)
+		return -EXDEV;
 
 	/* don't make the dst file partly checksummed */
 	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
 	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto out_fput;
+		return -EINVAL;
 
-	ret = -EISDIR;
 	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src->i_sb != inode->i_sb)
-		goto out_fput;
+		return -EISDIR;
 
 	if (!same_inode) {
 		btrfs_double_inode_lock(src, inode);
@@ -3869,6 +3839,49 @@ out_unlock:
 		btrfs_double_inode_unlock(src, inode);
 	else
 		mutex_unlock(&src->i_mutex);
+	return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags)
+{
+	ssize_t ret;
+
+	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+	if (ret == 0)
+		ret = len;
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+				       u64 off, u64 olen, u64 destoff)
+{
+	struct fd src_file;
+	int ret;
+
+	/* the destination must be opened for writing */
+	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		ret = -EBADF;
+		goto out_drop_write;
+	}
+
+	/* the src must be open for reading */
+	if (!(src_file.file->f_mode & FMODE_READ)) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
+
 out_fput:
 	fdput(src_file);
 out_drop_write:
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 4/9] vfs: Copy should check len after file open mode
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

I don't think it makes sense to report that a copy succeeded if the
files aren't open properly.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/read_write.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index dd10750..f3d6c48 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1345,9 +1345,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (flags)
 		return -EINVAL;
 
-	if (len == 0)
-		return 0;
-
 	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
 	ret = rw_verify_area(READ, file_in, &pos_in, len);
 	if (ret >= 0)
@@ -1378,6 +1375,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (inode_in == inode_out)
 		return -EINVAL;
 
+	if (len == 0)
+		return 0;
+
 	ret = mnt_want_write_file(file_out);
 	if (ret)
 		return ret;
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 4/9] vfs: Copy should check len after file open mode
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

I don't think it makes sense to report that a copy succeeded if the
files aren't open properly.

Signed-off-by: Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
Reviewed-by: David Sterba <dsterba-IBi9RG/b67k@public.gmane.org>
---
 fs/read_write.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index dd10750..f3d6c48 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1345,9 +1345,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (flags)
 		return -EINVAL;
 
-	if (len == 0)
-		return 0;
-
 	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
 	ret = rw_verify_area(READ, file_in, &pos_in, len);
 	if (ret >= 0)
@@ -1378,6 +1375,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (inode_in == inode_out)
 		return -EINVAL;
 
+	if (len == 0)
+		return 0;
+
 	ret = mnt_want_write_file(file_out);
 	if (ret)
 		return ret;
-- 
2.6.0

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 4/9] vfs: Copy should check len after file open mode
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

I don't think it makes sense to report that a copy succeeded if the
files aren't open properly.

Signed-off-by: Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
Reviewed-by: David Sterba <dsterba-IBi9RG/b67k@public.gmane.org>
---
 fs/read_write.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index dd10750..f3d6c48 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1345,9 +1345,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (flags)
 		return -EINVAL;
 
-	if (len == 0)
-		return 0;
-
 	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
 	ret = rw_verify_area(READ, file_in, &pos_in, len);
 	if (ret >= 0)
@@ -1378,6 +1375,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (inode_in == inode_out)
 		return -EINVAL;
 
+	if (len == 0)
+		return 0;
+
 	ret = mnt_want_write_file(file_out);
 	if (ret)
 		return ret;
-- 
2.6.0

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file
  2015-09-30 17:26 ` Anna Schumaker
@ 2015-09-30 17:26   ` Anna Schumaker
  -1 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

This is perfectly valid for BTRFS and XFS, so let's leave this up to
filesystems to check.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index f3d6c48..8e7cb33 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1371,10 +1371,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	    file_in->f_path.mnt != file_out->f_path.mnt)
 		return -EXDEV;
 
-	/* forbid ranges in the same file */
-	if (inode_in == inode_out)
-		return -EINVAL;
-
 	if (len == 0)
 		return 0;
 
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

This is perfectly valid for BTRFS and XFS, so let's leave this up to
filesystems to check.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/read_write.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index f3d6c48..8e7cb33 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1371,10 +1371,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	    file_in->f_path.mnt != file_out->f_path.mnt)
 		return -EXDEV;
 
-	/* forbid ranges in the same file */
-	if (inode_in == inode_out)
-		return -EINVAL;
-
 	if (len == 0)
 		return 0;
 
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 6/9] vfs: Copy should use file_out rather than file_in
  2015-09-30 17:26 ` Anna Schumaker
@ 2015-09-30 17:26   ` Anna Schumaker
  -1 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

The way to think about this is that the destination filesystem reads the
data from the source file and processes it accordingly.  This is
especially important to avoid an infinate loop when doing a "server to
server" copy on NFS.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
---
 fs/read_write.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 8e7cb33..6f74f1f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1355,7 +1355,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
 	    (file_out->f_flags & O_APPEND) ||
-	    !file_in->f_op || !file_in->f_op->copy_file_range)
+	    !file_out->f_op || !file_out->f_op->copy_file_range)
 		return -EBADF;
 
 	inode_in = file_inode(file_in);
@@ -1378,8 +1378,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					     len, flags);
+	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+					      len, flags);
 	if (ret > 0) {
 		fsnotify_access(file_in);
 		add_rchar(current, ret);
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 6/9] vfs: Copy should use file_out rather than file_in
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

The way to think about this is that the destination filesystem reads the
data from the source file and processes it accordingly.  This is
especially important to avoid an infinate loop when doing a "server to
server" copy on NFS.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
---
 fs/read_write.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 8e7cb33..6f74f1f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1355,7 +1355,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
 	    (file_out->f_flags & O_APPEND) ||
-	    !file_in->f_op || !file_in->f_op->copy_file_range)
+	    !file_out->f_op || !file_out->f_op->copy_file_range)
 		return -EBADF;
 
 	inode_in = file_inode(file_in);
@@ -1378,8 +1378,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					     len, flags);
+	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+					      len, flags);
 	if (ret > 0) {
 		fsnotify_access(file_in);
 		add_rchar(current, ret);
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 7/9] vfs: Remove copy_file_range mountpoint checks
  2015-09-30 17:26 ` Anna Schumaker
@ 2015-09-30 17:26   ` Anna Schumaker
  -1 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

I still want to do an in-kernel copy even if the files are on different
mountpoints, and NFS has a "server to server" copy that expects two
files on different mountpoints.  Let's have individual filesystems
implement this check instead.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/read_write.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 6f74f1f..ee9fa37 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1366,11 +1366,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	    pos_in + len > i_size_read(inode_in))
 		return -EINVAL;
 
-	/* this could be relaxed once a method supports cross-fs copies */
-	if (inode_in->i_sb != inode_out->i_sb ||
-	    file_in->f_path.mnt != file_out->f_path.mnt)
-		return -EXDEV;
-
 	if (len == 0)
 		return 0;
 
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 7/9] vfs: Remove copy_file_range mountpoint checks
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

I still want to do an in-kernel copy even if the files are on different
mountpoints, and NFS has a "server to server" copy that expects two
files on different mountpoints.  Let's have individual filesystems
implement this check instead.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/read_write.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 6f74f1f..ee9fa37 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1366,11 +1366,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	    pos_in + len > i_size_read(inode_in))
 		return -EINVAL;
 
-	/* this could be relaxed once a method supports cross-fs copies */
-	if (inode_in->i_sb != inode_out->i_sb ||
-	    file_in->f_path.mnt != file_out->f_path.mnt)
-		return -EXDEV;
-
 	if (len == 0)
 		return 0;
 
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

This allows us to have an in-kernel copy mechanism that avoids frequent
switches between kernel and user space.  This is especially useful so
NFSD can support server-side copies.

I make pagecache copies configurable by adding three new (exclusive)
flags:
- COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
- COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
- COPY_FR_DEDUP creates a reflink, but only if the contents of both
  ranges are identical.

The default (flags=0) means to first attempt a reflink, but use the pagecache
if that fails.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Padraig Brady <P@draigBrady.com>
---
 fs/read_write.c           | 61 +++++++++++++++++++++++++++++++----------------
 include/linux/copy.h      |  6 +++++
 include/uapi/linux/Kbuild |  1 +
 include/uapi/linux/copy.h |  8 +++++++
 4 files changed, 56 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/copy.h
 create mode 100644 include/uapi/linux/copy.h

diff --git a/fs/read_write.c b/fs/read_write.c
index ee9fa37..4fb9b8e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -7,6 +7,7 @@
 #include <linux/slab.h> 
 #include <linux/stat.h>
 #include <linux/fcntl.h>
+#include <linux/copy.h>
 #include <linux/file.h>
 #include <linux/uio.h>
 #include <linux/fsnotify.h>
@@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 }
 #endif
 
+static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in,
+				       struct file *file_out, loff_t pos_out,
+				       size_t len)
+{
+	ssize_t ret;
+
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0) {
+		len = ret;
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+		if (ret >= 0)
+			len = ret;
+	}
+	if (ret < 0)
+		return ret;
+
+	file_start_write(file_out);
+	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
+	file_end_write(file_out);
+
+	return ret;
+}
+
 /*
  * copy_file_range() differs from regular file read and write in that it
  * specifically allows return partial success.  When it does so is up to
@@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			    struct file *file_out, loff_t pos_out,
 			    size_t len, unsigned int flags)
 {
-	struct inode *inode_in;
-	struct inode *inode_out;
 	ssize_t ret;
 
-	if (flags)
+	/* Flags should only be used exclusively. */
+	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
+		return -EINVAL;
+	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
+		return -EINVAL;
+	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
 		return -EINVAL;
 
-	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
-	ret = rw_verify_area(READ, file_in, &pos_in, len);
-	if (ret >= 0)
-		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
-	if (ret < 0)
-		return ret;
+	/* Default behavior is to try both. */
+	if (flags == 0)
+		flags = COPY_FR_COPY | COPY_FR_REFLINK;
 
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
 	    (file_out->f_flags & O_APPEND) ||
-	    !file_out->f_op || !file_out->f_op->copy_file_range)
+	    !file_out->f_op)
 		return -EBADF;
 
-	inode_in = file_inode(file_in);
-	inode_out = file_inode(file_out);
-
-	/* make sure offsets don't wrap and the input is inside i_size */
-	if (pos_in + len < pos_in || pos_out + len < pos_out ||
-	    pos_in + len > i_size_read(inode_in))
-		return -EINVAL;
-
 	if (len == 0)
 		return 0;
 
@@ -1373,8 +1389,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					      len, flags);
+	ret = -EOPNOTSUPP;
+	if (file_out->f_op->copy_file_range && (file_in->f_op == file_out->f_op))
+		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+						      pos_out, len, flags);
+	if ((ret < 0) && (flags & COPY_FR_COPY))
+		ret = vfs_copy_file_pagecache(file_in, pos_in, file_out,
+					      pos_out, len);
 	if (ret > 0) {
 		fsnotify_access(file_in);
 		add_rchar(current, ret);
diff --git a/include/linux/copy.h b/include/linux/copy.h
new file mode 100644
index 0000000..fd54543
--- /dev/null
+++ b/include/linux/copy.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_COPY_H
+#define _LINUX_COPY_H
+
+#include <uapi/linux/copy.h>
+
+#endif /* _LINUX_COPY_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index f7b2db4..faafd67 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -90,6 +90,7 @@ header-y += coda_psdev.h
 header-y += coff.h
 header-y += connector.h
 header-y += const.h
+header-y += copy.h
 header-y += cramfs_fs.h
 header-y += cuda.h
 header-y += cyclades.h
diff --git a/include/uapi/linux/copy.h b/include/uapi/linux/copy.h
new file mode 100644
index 0000000..b807dcd
--- /dev/null
+++ b/include/uapi/linux/copy.h
@@ -0,0 +1,8 @@
+#ifndef _UAPI_LINUX_COPY_H
+#define _UAPI_LINUX_COPY_H
+
+#define COPY_FR_COPY		(1 << 0)  /* Only do a pagecache copy.  */
+#define COPY_FR_REFLINK		(1 << 1)  /* Only make a reflink.       */
+#define COPY_FR_DEDUP		(1 << 2)  /* Deduplicate file data.     */
+
+#endif /* _UAPI_LINUX_COPY_H */
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

This allows us to have an in-kernel copy mechanism that avoids frequent
switches between kernel and user space.  This is especially useful so
NFSD can support server-side copies.

I make pagecache copies configurable by adding three new (exclusive)
flags:
- COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
- COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
- COPY_FR_DEDUP creates a reflink, but only if the contents of both
  ranges are identical.

The default (flags=0) means to first attempt a reflink, but use the pagecache
if that fails.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Padraig Brady <P@draigBrady.com>
---
 fs/read_write.c           | 61 +++++++++++++++++++++++++++++++----------------
 include/linux/copy.h      |  6 +++++
 include/uapi/linux/Kbuild |  1 +
 include/uapi/linux/copy.h |  8 +++++++
 4 files changed, 56 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/copy.h
 create mode 100644 include/uapi/linux/copy.h

diff --git a/fs/read_write.c b/fs/read_write.c
index ee9fa37..4fb9b8e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -7,6 +7,7 @@
 #include <linux/slab.h> 
 #include <linux/stat.h>
 #include <linux/fcntl.h>
+#include <linux/copy.h>
 #include <linux/file.h>
 #include <linux/uio.h>
 #include <linux/fsnotify.h>
@@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 }
 #endif
 
+static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in,
+				       struct file *file_out, loff_t pos_out,
+				       size_t len)
+{
+	ssize_t ret;
+
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0) {
+		len = ret;
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+		if (ret >= 0)
+			len = ret;
+	}
+	if (ret < 0)
+		return ret;
+
+	file_start_write(file_out);
+	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
+	file_end_write(file_out);
+
+	return ret;
+}
+
 /*
  * copy_file_range() differs from regular file read and write in that it
  * specifically allows return partial success.  When it does so is up to
@@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			    struct file *file_out, loff_t pos_out,
 			    size_t len, unsigned int flags)
 {
-	struct inode *inode_in;
-	struct inode *inode_out;
 	ssize_t ret;
 
-	if (flags)
+	/* Flags should only be used exclusively. */
+	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
+		return -EINVAL;
+	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
+		return -EINVAL;
+	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
 		return -EINVAL;
 
-	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
-	ret = rw_verify_area(READ, file_in, &pos_in, len);
-	if (ret >= 0)
-		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
-	if (ret < 0)
-		return ret;
+	/* Default behavior is to try both. */
+	if (flags == 0)
+		flags = COPY_FR_COPY | COPY_FR_REFLINK;
 
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
 	    (file_out->f_flags & O_APPEND) ||
-	    !file_out->f_op || !file_out->f_op->copy_file_range)
+	    !file_out->f_op)
 		return -EBADF;
 
-	inode_in = file_inode(file_in);
-	inode_out = file_inode(file_out);
-
-	/* make sure offsets don't wrap and the input is inside i_size */
-	if (pos_in + len < pos_in || pos_out + len < pos_out ||
-	    pos_in + len > i_size_read(inode_in))
-		return -EINVAL;
-
 	if (len == 0)
 		return 0;
 
@@ -1373,8 +1389,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					      len, flags);
+	ret = -EOPNOTSUPP;
+	if (file_out->f_op->copy_file_range && (file_in->f_op == file_out->f_op))
+		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+						      pos_out, len, flags);
+	if ((ret < 0) && (flags & COPY_FR_COPY))
+		ret = vfs_copy_file_pagecache(file_in, pos_in, file_out,
+					      pos_out, len);
 	if (ret > 0) {
 		fsnotify_access(file_in);
 		add_rchar(current, ret);
diff --git a/include/linux/copy.h b/include/linux/copy.h
new file mode 100644
index 0000000..fd54543
--- /dev/null
+++ b/include/linux/copy.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_COPY_H
+#define _LINUX_COPY_H
+
+#include <uapi/linux/copy.h>
+
+#endif /* _LINUX_COPY_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index f7b2db4..faafd67 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -90,6 +90,7 @@ header-y += coda_psdev.h
 header-y += coff.h
 header-y += connector.h
 header-y += const.h
+header-y += copy.h
 header-y += cramfs_fs.h
 header-y += cuda.h
 header-y += cyclades.h
diff --git a/include/uapi/linux/copy.h b/include/uapi/linux/copy.h
new file mode 100644
index 0000000..b807dcd
--- /dev/null
+++ b/include/uapi/linux/copy.h
@@ -0,0 +1,8 @@
+#ifndef _UAPI_LINUX_COPY_H
+#define _UAPI_LINUX_COPY_H
+
+#define COPY_FR_COPY		(1 << 0)  /* Only do a pagecache copy.  */
+#define COPY_FR_REFLINK		(1 << 1)  /* Only make a reflink.       */
+#define COPY_FR_DEDUP		(1 << 2)  /* Deduplicate file data.     */
+
+#endif /* _UAPI_LINUX_COPY_H */
-- 
2.6.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

This allows us to have an in-kernel copy mechanism that avoids frequent
switches between kernel and user space.  This is especially useful so
NFSD can support server-side copies.

I make pagecache copies configurable by adding three new (exclusive)
flags:
- COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
- COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
- COPY_FR_DEDUP creates a reflink, but only if the contents of both
  ranges are identical.

The default (flags=0) means to first attempt a reflink, but use the pagecache
if that fails.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Padraig Brady <P@draigBrady.com>
---
 fs/read_write.c           | 61 +++++++++++++++++++++++++++++++----------------
 include/linux/copy.h      |  6 +++++
 include/uapi/linux/Kbuild |  1 +
 include/uapi/linux/copy.h |  8 +++++++
 4 files changed, 56 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/copy.h
 create mode 100644 include/uapi/linux/copy.h

diff --git a/fs/read_write.c b/fs/read_write.c
index ee9fa37..4fb9b8e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -7,6 +7,7 @@
 #include <linux/slab.h> 
 #include <linux/stat.h>
 #include <linux/fcntl.h>
+#include <linux/copy.h>
 #include <linux/file.h>
 #include <linux/uio.h>
 #include <linux/fsnotify.h>
@@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 }
 #endif
 
+static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in,
+				       struct file *file_out, loff_t pos_out,
+				       size_t len)
+{
+	ssize_t ret;
+
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0) {
+		len = ret;
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+		if (ret >= 0)
+			len = ret;
+	}
+	if (ret < 0)
+		return ret;
+
+	file_start_write(file_out);
+	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
+	file_end_write(file_out);
+
+	return ret;
+}
+
 /*
  * copy_file_range() differs from regular file read and write in that it
  * specifically allows return partial success.  When it does so is up to
@@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			    struct file *file_out, loff_t pos_out,
 			    size_t len, unsigned int flags)
 {
-	struct inode *inode_in;
-	struct inode *inode_out;
 	ssize_t ret;
 
-	if (flags)
+	/* Flags should only be used exclusively. */
+	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
+		return -EINVAL;
+	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
+		return -EINVAL;
+	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
 		return -EINVAL;
 
-	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
-	ret = rw_verify_area(READ, file_in, &pos_in, len);
-	if (ret >= 0)
-		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
-	if (ret < 0)
-		return ret;
+	/* Default behavior is to try both. */
+	if (flags == 0)
+		flags = COPY_FR_COPY | COPY_FR_REFLINK;
 
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
 	    (file_out->f_flags & O_APPEND) ||
-	    !file_out->f_op || !file_out->f_op->copy_file_range)
+	    !file_out->f_op)
 		return -EBADF;
 
-	inode_in = file_inode(file_in);
-	inode_out = file_inode(file_out);
-
-	/* make sure offsets don't wrap and the input is inside i_size */
-	if (pos_in + len < pos_in || pos_out + len < pos_out ||
-	    pos_in + len > i_size_read(inode_in))
-		return -EINVAL;
-
 	if (len == 0)
 		return 0;
 
@@ -1373,8 +1389,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					      len, flags);
+	ret = -EOPNOTSUPP;
+	if (file_out->f_op->copy_file_range && (file_in->f_op == file_out->f_op))
+		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+						      pos_out, len, flags);
+	if ((ret < 0) && (flags & COPY_FR_COPY))
+		ret = vfs_copy_file_pagecache(file_in, pos_in, file_out,
+					      pos_out, len);
 	if (ret > 0) {
 		fsnotify_access(file_in);
 		add_rchar(current, ret);
diff --git a/include/linux/copy.h b/include/linux/copy.h
new file mode 100644
index 0000000..fd54543
--- /dev/null
+++ b/include/linux/copy.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_COPY_H
+#define _LINUX_COPY_H
+
+#include <uapi/linux/copy.h>
+
+#endif /* _LINUX_COPY_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index f7b2db4..faafd67 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -90,6 +90,7 @@ header-y += coda_psdev.h
 header-y += coff.h
 header-y += connector.h
 header-y += const.h
+header-y += copy.h
 header-y += cramfs_fs.h
 header-y += cuda.h
 header-y += cyclades.h
diff --git a/include/uapi/linux/copy.h b/include/uapi/linux/copy.h
new file mode 100644
index 0000000..b807dcd
--- /dev/null
+++ b/include/uapi/linux/copy.h
@@ -0,0 +1,8 @@
+#ifndef _UAPI_LINUX_COPY_H
+#define _UAPI_LINUX_COPY_H
+
+#define COPY_FR_COPY		(1 << 0)  /* Only do a pagecache copy.  */
+#define COPY_FR_REFLINK		(1 << 1)  /* Only make a reflink.       */
+#define COPY_FR_DEDUP		(1 << 2)  /* Deduplicate file data.     */
+
+#endif /* _UAPI_LINUX_COPY_H */
-- 
2.6.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
  2015-09-30 17:26 ` Anna Schumaker
@ 2015-09-30 17:26   ` Anna Schumaker
  -1 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

Reject copies that don't have the COPY_FR_REFLINK flag set.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ioctl.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d3697e8..c1f115d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -44,6 +44,7 @@
 #include <linux/uuid.h>
 #include <linux/btrfs.h>
 #include <linux/uaccess.h>
+#include <linux/copy.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3848,6 +3849,9 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 {
 	ssize_t ret;
 
+	if (!(flags & COPY_FR_REFLINK))
+		return -EOPNOTSUPP;
+
 	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
 	if (ret == 0)
 		ret = len;
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

Reject copies that don't have the COPY_FR_REFLINK flag set.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ioctl.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d3697e8..c1f115d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -44,6 +44,7 @@
 #include <linux/uuid.h>
 #include <linux/btrfs.h>
 #include <linux/uaccess.h>
+#include <linux/copy.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3848,6 +3849,9 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 {
 	ssize_t ret;
 
+	if (!(flags & COPY_FR_REFLINK))
+		return -EOPNOTSUPP;
+
 	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
 	if (ret == 0)
 		ret = len;
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 10/9] copy_file_range.2: New page documenting copy_file_range()
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

copy_file_range() is a new system call for copying ranges of data
completely in the kernel.  This gives filesystems an opportunity to
implement some kind of "copy acceleration", such as reflinks or
server-side-copy (in the case of NFS).

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 man2/copy_file_range.2 | 224 +++++++++++++++++++++++++++++++++++++++++++++++++
 man2/splice.2          |   1 +
 2 files changed, 225 insertions(+)
 create mode 100644 man2/copy_file_range.2

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
new file mode 100644
index 0000000..23e3875
--- /dev/null
+++ b/man2/copy_file_range.2
@@ -0,0 +1,224 @@
+.\"This manpage is Copyright (C) 2015 Anna Schumaker <Anna.Schumaker@Netapp.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of
+.\" this manual under the conditions for verbatim copying, provided that
+.\" the entire resulting derived work is distributed under the terms of
+.\" a permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume.
+.\" no responsibility for errors or omissions, or for damages resulting.
+.\" from the use of the information contained herein.  The author(s) may.
+.\" not have taken the same level of care in the production of this.
+.\" manual, which is licensed free of charge, as they might when working.
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH COPY 2 2015-09-29 "Linux" "Linux Programmer's Manual"
+.SH NAME
+copy_file_range \- Copy a range of data from one file to another
+.SH SYNOPSIS
+.nf
+.B #include <linux/copy.h>
+.B #include <sys/syscall.h>
+.B #include <unistd.h>
+
+.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out ",
+.BI "                        loff_t *" off_out ", size_t " len \
+", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR copy_file_range ()
+system call performs an in-kernel copy between two file descriptors
+without the additional cost of transferring data from the kernel to userspace
+and then back into the kernel.
+It copies up to
+.I len
+bytes of data from file descriptor
+.I fd_in
+to file descriptor
+.IR fd_out ,
+overwriting any data that exists within the requested range of the target file.
+
+The following semantics apply for
+.IR off_in ,
+and similar statements apply to
+.IR off_out :
+.IP * 3
+If
+.I off_in
+is NULL, then bytes are read from
+.I fd_in
+starting from the current file offset, and the offset is
+adjusted by the number of bytes copied.
+.IP *
+If
+.I off_in
+is not NULL, then
+.I off_in
+must point to a buffer that specifies the starting
+offset where bytes from
+.I fd_in
+will be read.  The current file offset of
+.I fd_in
+is not changed, but
+.I off_in
+is adjusted appropriately.
+.PP
+
+The
+.I flags
+argument can have one of the following flags set:
+.TP 1.9i
+.B COPY_FR_COPY
+Copy all the file data in the requested range.
+Some filesystems might be able to accelerate this copy
+to avoid unnecessary data transfers.
+.TP
+.B COPY_FR_REFLINK
+Create a lightweight "reflink", where data is not copied until
+one of the files is modified.
+.TP
+.B COPY_FR_DEDUP
+Create a reflink, but only if the contents of
+both files' byte ranges are identical.
+If ranges do not match,
+.B EILSEQ
+will be returned.
+.PP
+The default behavior
+.RI ( flags
+== 0) is to try creating a reflink,
+and if reflinking fails
+.BR copy_file_range ()
+will fall back to performing a full data copy.
+.SH RETURN VALUE
+Upon successful completion,
+.BR copy_file_range ()
+will return the number of bytes copied between files.
+This could be less than the length originally requested.
+
+On error,
+.BR copy_file_range ()
+returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+One or more file descriptors are not valid; or
+.I fd_in
+is not open for reading; or
+.I fd_out
+is not open for writing.
+.TP
+.B EILSEQ
+The contents of both files' byte ranges did not match.
+.TP
+.B EINVAL
+Requested range extends beyond the end of the source file; or the
+.I flags
+argument is set to an invalid value.
+.TP
+.B EIO
+A low level I/O error occurred while copying.
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOSPC
+There is not enough space on the target filesystem to complete the copy.
+.TP
+.B EOPNOTSUPP
+.B COPY_REFLINK
+or
+.B COPY_DEDUP
+was specified in
+.IR flags ,
+but the target filesystem does not support the given operation.
+.TP
+.B EXDEV
+Target filesystem doesn't support cross-filesystem copies.
+.SH VERSIONS
+The
+.BR copy_file_range ()
+system call first appeared in Linux 4.4.
+.SH CONFORMING TO
+The
+.BR copy_file_range ()
+system call is a nonstandard Linux extension.
+.SH EXAMPLE
+.nf
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <linux/copy.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+loff_t copy_file_range(int fd_in, loff_t *off_in, int fd_out,
+                       loff_t *off_out, size_t len, unsigned int flags)
+{
+    return syscall(__NR_copy_file_range, fd_in, off_in, fd_out,
+                   off_out, len, flags);
+}
+
+int main(int argc, char **argv)
+{
+    int fd_in, fd_out;
+    struct stat stat;
+    loff_t len, ret;
+    char buf[2];
+
+    if (argc != 3) {
+        fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
+        exit(EXIT_FAILURE);
+    }
+
+    fd_in = open(argv[1], O_RDONLY);
+    if (fd_in == \-1) {
+        perror("open (argv[1])");
+        exit(EXIT_FAILURE);
+    }
+
+    if (fstat(fd_in, &stat) == \-1) {
+        perror("fstat");
+        exit(EXIT_FAILURE);
+    }
+    len = stat.st_size;
+
+    fd_out = open(argv[2], O_CREAT|O_WRONLY|O_TRUNC, 0644);
+    if (fd_out == \-1) {
+        perror("open (argv[2])");
+        exit(EXIT_FAILURE);
+    }
+
+    do {
+        ret = copy_file_range(fd_in, NULL, fd_out, NULL, 
+                              len, COPY_FR_COPY);
+        if (ret == \-1) {
+            perror("copy_file_range");
+            exit(EXIT_FAILURE);
+        }
+
+        len \-= ret;
+    } while (len > 0);
+
+    close(fd_in);
+    close(fd_out);
+    exit(EXIT_SUCCESS);
+}
+.fi
+.SH SEE ALSO
+.BR splice (2)
diff --git a/man2/splice.2 b/man2/splice.2
index b9b4f42..5c162e0 100644
--- a/man2/splice.2
+++ b/man2/splice.2
@@ -238,6 +238,7 @@ only pointers are copied, not the pages of the buffer.
 See
 .BR tee (2).
 .SH SEE ALSO
+.BR copy_file_range (2),
 .BR sendfile (2),
 .BR tee (2),
 .BR vmsplice (2)
-- 
2.5.3


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 10/9] copy_file_range.2: New page documenting copy_file_range()
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

copy_file_range() is a new system call for copying ranges of data
completely in the kernel.  This gives filesystems an opportunity to
implement some kind of "copy acceleration", such as reflinks or
server-side-copy (in the case of NFS).

Signed-off-by: Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 man2/copy_file_range.2 | 224 +++++++++++++++++++++++++++++++++++++++++++++++++
 man2/splice.2          |   1 +
 2 files changed, 225 insertions(+)
 create mode 100644 man2/copy_file_range.2

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
new file mode 100644
index 0000000..23e3875
--- /dev/null
+++ b/man2/copy_file_range.2
@@ -0,0 +1,224 @@
+.\"This manpage is Copyright (C) 2015 Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of
+.\" this manual under the conditions for verbatim copying, provided that
+.\" the entire resulting derived work is distributed under the terms of
+.\" a permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume.
+.\" no responsibility for errors or omissions, or for damages resulting.
+.\" from the use of the information contained herein.  The author(s) may.
+.\" not have taken the same level of care in the production of this.
+.\" manual, which is licensed free of charge, as they might when working.
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH COPY 2 2015-09-29 "Linux" "Linux Programmer's Manual"
+.SH NAME
+copy_file_range \- Copy a range of data from one file to another
+.SH SYNOPSIS
+.nf
+.B #include <linux/copy.h>
+.B #include <sys/syscall.h>
+.B #include <unistd.h>
+
+.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out ",
+.BI "                        loff_t *" off_out ", size_t " len \
+", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR copy_file_range ()
+system call performs an in-kernel copy between two file descriptors
+without the additional cost of transferring data from the kernel to userspace
+and then back into the kernel.
+It copies up to
+.I len
+bytes of data from file descriptor
+.I fd_in
+to file descriptor
+.IR fd_out ,
+overwriting any data that exists within the requested range of the target file.
+
+The following semantics apply for
+.IR off_in ,
+and similar statements apply to
+.IR off_out :
+.IP * 3
+If
+.I off_in
+is NULL, then bytes are read from
+.I fd_in
+starting from the current file offset, and the offset is
+adjusted by the number of bytes copied.
+.IP *
+If
+.I off_in
+is not NULL, then
+.I off_in
+must point to a buffer that specifies the starting
+offset where bytes from
+.I fd_in
+will be read.  The current file offset of
+.I fd_in
+is not changed, but
+.I off_in
+is adjusted appropriately.
+.PP
+
+The
+.I flags
+argument can have one of the following flags set:
+.TP 1.9i
+.B COPY_FR_COPY
+Copy all the file data in the requested range.
+Some filesystems might be able to accelerate this copy
+to avoid unnecessary data transfers.
+.TP
+.B COPY_FR_REFLINK
+Create a lightweight "reflink", where data is not copied until
+one of the files is modified.
+.TP
+.B COPY_FR_DEDUP
+Create a reflink, but only if the contents of
+both files' byte ranges are identical.
+If ranges do not match,
+.B EILSEQ
+will be returned.
+.PP
+The default behavior
+.RI ( flags
+== 0) is to try creating a reflink,
+and if reflinking fails
+.BR copy_file_range ()
+will fall back to performing a full data copy.
+.SH RETURN VALUE
+Upon successful completion,
+.BR copy_file_range ()
+will return the number of bytes copied between files.
+This could be less than the length originally requested.
+
+On error,
+.BR copy_file_range ()
+returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+One or more file descriptors are not valid; or
+.I fd_in
+is not open for reading; or
+.I fd_out
+is not open for writing.
+.TP
+.B EILSEQ
+The contents of both files' byte ranges did not match.
+.TP
+.B EINVAL
+Requested range extends beyond the end of the source file; or the
+.I flags
+argument is set to an invalid value.
+.TP
+.B EIO
+A low level I/O error occurred while copying.
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOSPC
+There is not enough space on the target filesystem to complete the copy.
+.TP
+.B EOPNOTSUPP
+.B COPY_REFLINK
+or
+.B COPY_DEDUP
+was specified in
+.IR flags ,
+but the target filesystem does not support the given operation.
+.TP
+.B EXDEV
+Target filesystem doesn't support cross-filesystem copies.
+.SH VERSIONS
+The
+.BR copy_file_range ()
+system call first appeared in Linux 4.4.
+.SH CONFORMING TO
+The
+.BR copy_file_range ()
+system call is a nonstandard Linux extension.
+.SH EXAMPLE
+.nf
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <linux/copy.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+loff_t copy_file_range(int fd_in, loff_t *off_in, int fd_out,
+                       loff_t *off_out, size_t len, unsigned int flags)
+{
+    return syscall(__NR_copy_file_range, fd_in, off_in, fd_out,
+                   off_out, len, flags);
+}
+
+int main(int argc, char **argv)
+{
+    int fd_in, fd_out;
+    struct stat stat;
+    loff_t len, ret;
+    char buf[2];
+
+    if (argc != 3) {
+        fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
+        exit(EXIT_FAILURE);
+    }
+
+    fd_in = open(argv[1], O_RDONLY);
+    if (fd_in == \-1) {
+        perror("open (argv[1])");
+        exit(EXIT_FAILURE);
+    }
+
+    if (fstat(fd_in, &stat) == \-1) {
+        perror("fstat");
+        exit(EXIT_FAILURE);
+    }
+    len = stat.st_size;
+
+    fd_out = open(argv[2], O_CREAT|O_WRONLY|O_TRUNC, 0644);
+    if (fd_out == \-1) {
+        perror("open (argv[2])");
+        exit(EXIT_FAILURE);
+    }
+
+    do {
+        ret = copy_file_range(fd_in, NULL, fd_out, NULL, 
+                              len, COPY_FR_COPY);
+        if (ret == \-1) {
+            perror("copy_file_range");
+            exit(EXIT_FAILURE);
+        }
+
+        len \-= ret;
+    } while (len > 0);
+
+    close(fd_in);
+    close(fd_out);
+    exit(EXIT_SUCCESS);
+}
+.fi
+.SH SEE ALSO
+.BR splice (2)
diff --git a/man2/splice.2 b/man2/splice.2
index b9b4f42..5c162e0 100644
--- a/man2/splice.2
+++ b/man2/splice.2
@@ -238,6 +238,7 @@ only pointers are copied, not the pages of the buffer.
 See
 .BR tee (2).
 .SH SEE ALSO
+.BR copy_file_range (2),
 .BR sendfile (2),
 .BR tee (2),
 .BR vmsplice (2)
-- 
2.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v5 10/9] copy_file_range.2: New page documenting copy_file_range()
@ 2015-09-30 17:26   ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-09-30 17:26 UTC (permalink / raw)
  To: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

copy_file_range() is a new system call for copying ranges of data
completely in the kernel.  This gives filesystems an opportunity to
implement some kind of "copy acceleration", such as reflinks or
server-side-copy (in the case of NFS).

Signed-off-by: Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 man2/copy_file_range.2 | 224 +++++++++++++++++++++++++++++++++++++++++++++++++
 man2/splice.2          |   1 +
 2 files changed, 225 insertions(+)
 create mode 100644 man2/copy_file_range.2

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
new file mode 100644
index 0000000..23e3875
--- /dev/null
+++ b/man2/copy_file_range.2
@@ -0,0 +1,224 @@
+.\"This manpage is Copyright (C) 2015 Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of
+.\" this manual under the conditions for verbatim copying, provided that
+.\" the entire resulting derived work is distributed under the terms of
+.\" a permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume.
+.\" no responsibility for errors or omissions, or for damages resulting.
+.\" from the use of the information contained herein.  The author(s) may.
+.\" not have taken the same level of care in the production of this.
+.\" manual, which is licensed free of charge, as they might when working.
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH COPY 2 2015-09-29 "Linux" "Linux Programmer's Manual"
+.SH NAME
+copy_file_range \- Copy a range of data from one file to another
+.SH SYNOPSIS
+.nf
+.B #include <linux/copy.h>
+.B #include <sys/syscall.h>
+.B #include <unistd.h>
+
+.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out ",
+.BI "                        loff_t *" off_out ", size_t " len \
+", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR copy_file_range ()
+system call performs an in-kernel copy between two file descriptors
+without the additional cost of transferring data from the kernel to userspace
+and then back into the kernel.
+It copies up to
+.I len
+bytes of data from file descriptor
+.I fd_in
+to file descriptor
+.IR fd_out ,
+overwriting any data that exists within the requested range of the target file.
+
+The following semantics apply for
+.IR off_in ,
+and similar statements apply to
+.IR off_out :
+.IP * 3
+If
+.I off_in
+is NULL, then bytes are read from
+.I fd_in
+starting from the current file offset, and the offset is
+adjusted by the number of bytes copied.
+.IP *
+If
+.I off_in
+is not NULL, then
+.I off_in
+must point to a buffer that specifies the starting
+offset where bytes from
+.I fd_in
+will be read.  The current file offset of
+.I fd_in
+is not changed, but
+.I off_in
+is adjusted appropriately.
+.PP
+
+The
+.I flags
+argument can have one of the following flags set:
+.TP 1.9i
+.B COPY_FR_COPY
+Copy all the file data in the requested range.
+Some filesystems might be able to accelerate this copy
+to avoid unnecessary data transfers.
+.TP
+.B COPY_FR_REFLINK
+Create a lightweight "reflink", where data is not copied until
+one of the files is modified.
+.TP
+.B COPY_FR_DEDUP
+Create a reflink, but only if the contents of
+both files' byte ranges are identical.
+If ranges do not match,
+.B EILSEQ
+will be returned.
+.PP
+The default behavior
+.RI ( flags
+== 0) is to try creating a reflink,
+and if reflinking fails
+.BR copy_file_range ()
+will fall back to performing a full data copy.
+.SH RETURN VALUE
+Upon successful completion,
+.BR copy_file_range ()
+will return the number of bytes copied between files.
+This could be less than the length originally requested.
+
+On error,
+.BR copy_file_range ()
+returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+One or more file descriptors are not valid; or
+.I fd_in
+is not open for reading; or
+.I fd_out
+is not open for writing.
+.TP
+.B EILSEQ
+The contents of both files' byte ranges did not match.
+.TP
+.B EINVAL
+Requested range extends beyond the end of the source file; or the
+.I flags
+argument is set to an invalid value.
+.TP
+.B EIO
+A low level I/O error occurred while copying.
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOSPC
+There is not enough space on the target filesystem to complete the copy.
+.TP
+.B EOPNOTSUPP
+.B COPY_REFLINK
+or
+.B COPY_DEDUP
+was specified in
+.IR flags ,
+but the target filesystem does not support the given operation.
+.TP
+.B EXDEV
+Target filesystem doesn't support cross-filesystem copies.
+.SH VERSIONS
+The
+.BR copy_file_range ()
+system call first appeared in Linux 4.4.
+.SH CONFORMING TO
+The
+.BR copy_file_range ()
+system call is a nonstandard Linux extension.
+.SH EXAMPLE
+.nf
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <linux/copy.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+loff_t copy_file_range(int fd_in, loff_t *off_in, int fd_out,
+                       loff_t *off_out, size_t len, unsigned int flags)
+{
+    return syscall(__NR_copy_file_range, fd_in, off_in, fd_out,
+                   off_out, len, flags);
+}
+
+int main(int argc, char **argv)
+{
+    int fd_in, fd_out;
+    struct stat stat;
+    loff_t len, ret;
+    char buf[2];
+
+    if (argc != 3) {
+        fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
+        exit(EXIT_FAILURE);
+    }
+
+    fd_in = open(argv[1], O_RDONLY);
+    if (fd_in == \-1) {
+        perror("open (argv[1])");
+        exit(EXIT_FAILURE);
+    }
+
+    if (fstat(fd_in, &stat) == \-1) {
+        perror("fstat");
+        exit(EXIT_FAILURE);
+    }
+    len = stat.st_size;
+
+    fd_out = open(argv[2], O_CREAT|O_WRONLY|O_TRUNC, 0644);
+    if (fd_out == \-1) {
+        perror("open (argv[2])");
+        exit(EXIT_FAILURE);
+    }
+
+    do {
+        ret = copy_file_range(fd_in, NULL, fd_out, NULL, 
+                              len, COPY_FR_COPY);
+        if (ret == \-1) {
+            perror("copy_file_range");
+            exit(EXIT_FAILURE);
+        }
+
+        len \-= ret;
+    } while (len > 0);
+
+    close(fd_in);
+    close(fd_out);
+    exit(EXIT_SUCCESS);
+}
+.fi
+.SH SEE ALSO
+.BR splice (2)
diff --git a/man2/splice.2 b/man2/splice.2
index b9b4f42..5c162e0 100644
--- a/man2/splice.2
+++ b/man2/splice.2
@@ -238,6 +238,7 @@ only pointers are copied, not the pages of the buffer.
 See
 .BR tee (2).
 .SH SEE ALSO
+.BR copy_file_range (2),
 .BR sendfile (2),
 .BR tee (2),
 .BR vmsplice (2)
-- 
2.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-09-30 17:26   ` Anna Schumaker
  (?)
  (?)
@ 2015-10-08  1:40   ` Neil Brown
  2015-10-09 11:15       ` Pádraig Brady
  2015-10-13 19:45       ` Anna Schumaker
  -1 siblings, 2 replies; 129+ messages in thread
From: Neil Brown @ 2015-10-08  1:40 UTC (permalink / raw)
  To: Anna Schumaker, linux-nfs, linux-btrfs, linux-fsdevel, linux-api,
	zab, viro, clm, darrick.wong, mtk.manpages, andros, hch

[-- Attachment #1: Type: text/plain, Size: 848 bytes --]

Anna Schumaker <Anna.Schumaker@netapp.com> writes:

> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>  			    struct file *file_out, loff_t pos_out,
>  			    size_t len, unsigned int flags)
>  {
> -	struct inode *inode_in;
> -	struct inode *inode_out;
>  	ssize_t ret;
>  
> -	if (flags)
> +	/* Flags should only be used exclusively. */
> +	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
> +		return -EINVAL;
> +	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
> +		return -EINVAL;
> +	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
>  		return -EINVAL;
>  

Do you also need:

   if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP))
   	return -EINVAL;

so that future user-space can test if the kernel supports new flags?

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-08  1:40   ` Neil Brown
@ 2015-10-09 11:15       ` Pádraig Brady
  2015-10-13 19:45       ` Anna Schumaker
  1 sibling, 0 replies; 129+ messages in thread
From: Pádraig Brady @ 2015-10-09 11:15 UTC (permalink / raw)
  To: Neil Brown, Anna Schumaker, linux-nfs, linux-btrfs,
	linux-fsdevel, linux-api, zab, viro, clm, darrick.wong,
	mtk.manpages, andros, hch

On 08/10/15 02:40, Neil Brown wrote:
> Anna Schumaker <Anna.Schumaker@netapp.com> writes:
> 
>> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>>  			    struct file *file_out, loff_t pos_out,
>>  			    size_t len, unsigned int flags)
>>  {
>> -	struct inode *inode_in;
>> -	struct inode *inode_out;
>>  	ssize_t ret;
>>  
>> -	if (flags)
>> +	/* Flags should only be used exclusively. */
>> +	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
>>  		return -EINVAL;
>>  
> 
> Do you also need:
> 
>    if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP))
>    	return -EINVAL;
> 
> so that future user-space can test if the kernel supports new flags?

Seems like a good idea, yes.

Also that got me thinking about COPY_FR_SPARSE.
What's the current behavior when copying a sparse range?
Is the hole propagated by default (good), or is it expanded?

Note cp(1) has --sparse={never,auto,always}. Auto is the default,
so it would be good I think if that was the default mode for copy_file_range().
With other sparse modes, we'd have to avoid copy_file_range() unless
there was control possible with COPY_FR_SPARSE_{AUTO,NONE,ALWAYS}.
Note currently cp --sparse=always will detect runs of zeros and also
avoid speculative preallocation by using fallocate (fd, FALLOC_FL_PUNCH_HOLE, ...)

thanks,
Pádraig.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-09 11:15       ` Pádraig Brady
  0 siblings, 0 replies; 129+ messages in thread
From: Pádraig Brady @ 2015-10-09 11:15 UTC (permalink / raw)
  To: Neil Brown, Anna Schumaker, linux-nfs, linux-btrfs,
	linux-fsdevel, linux-api, zab, viro, clm, darrick.wong,
	mtk.manpages, andros, hch

On 08/10/15 02:40, Neil Brown wrote:
> Anna Schumaker <Anna.Schumaker@netapp.com> writes:
> 
>> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>>  			    struct file *file_out, loff_t pos_out,
>>  			    size_t len, unsigned int flags)
>>  {
>> -	struct inode *inode_in;
>> -	struct inode *inode_out;
>>  	ssize_t ret;
>>  
>> -	if (flags)
>> +	/* Flags should only be used exclusively. */
>> +	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
>>  		return -EINVAL;
>>  
> 
> Do you also need:
> 
>    if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP))
>    	return -EINVAL;
> 
> so that future user-space can test if the kernel supports new flags?

Seems like a good idea, yes.

Also that got me thinking about COPY_FR_SPARSE.
What's the current behavior when copying a sparse range?
Is the hole propagated by default (good), or is it expanded?

Note cp(1) has --sparse={never,auto,always}. Auto is the default,
so it would be good I think if that was the default mode for copy_file_range().
With other sparse modes, we'd have to avoid copy_file_range() unless
there was control possible with COPY_FR_SPARSE_{AUTO,NONE,ALWAYS}.
Note currently cp --sparse=always will detect runs of zeros and also
avoid speculative preallocation by using fallocate (fd, FALLOC_FL_PUNCH_HOLE, ...)

thanks,
Pádraig.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-11 14:22     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-11 14:22 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
> This allows us to have an in-kernel copy mechanism that avoids frequent
> switches between kernel and user space.  This is especially useful so
> NFSD can support server-side copies.
> 
> I make pagecache copies configurable by adding three new (exclusive)
> flags:
> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
> - COPY_FR_DEDUP creates a reflink, but only if the contents of both
>   ranges are identical.

All but FR_COPY really should be a separate system call.  Clones (an
dedup as a special case of clones) are really a separate beast from file
copies.

If I want to clone a file I either want it clone fully or fail, not copy
a certain amount.  That means that a) we need to return an error not
short "write", and b) locking impementations are important - we need to
prevent other applications from racing with our clone even if it is
large, while to get these semantics for the possible short returning
file copy will require a proper userland locking protocol. Last but not
least file copies need to be interruptible while clones should be not.
All this is already important for local file systems and even more
important for NFS exporting.

So I'd suggest to drop this patch and just let your syscall handle
actualy copies with all their horrors.  We can go with Peng's patches
to generalize the btrfs ioctls for clones for now which is what everyone
already uses anyway, and then add a separate sys_file_clone later.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-11 14:22     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-11 14:22 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
> This allows us to have an in-kernel copy mechanism that avoids frequent
> switches between kernel and user space.  This is especially useful so
> NFSD can support server-side copies.
> 
> I make pagecache copies configurable by adding three new (exclusive)
> flags:
> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
> - COPY_FR_DEDUP creates a reflink, but only if the contents of both
>   ranges are identical.

All but FR_COPY really should be a separate system call.  Clones (an
dedup as a special case of clones) are really a separate beast from file
copies.

If I want to clone a file I either want it clone fully or fail, not copy
a certain amount.  That means that a) we need to return an error not
short "write", and b) locking impementations are important - we need to
prevent other applications from racing with our clone even if it is
large, while to get these semantics for the possible short returning
file copy will require a proper userland locking protocol. Last but not
least file copies need to be interruptible while clones should be not.
All this is already important for local file systems and even more
important for NFS exporting.

So I'd suggest to drop this patch and just let your syscall handle
actualy copies with all their horrors.  We can go with Peng's patches
to generalize the btrfs ioctls for clones for now which is what everyone
already uses anyway, and then add a separate sys_file_clone later.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 4/9] vfs: Copy should check len after file open mode
@ 2015-10-11 14:22     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-11 14:22 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

Should be folded into patch 1.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 4/9] vfs: Copy should check len after file open mode
@ 2015-10-11 14:22     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-11 14:22 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

Should be folded into patch 1.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file
  2015-09-30 17:26   ` Anna Schumaker
  (?)
@ 2015-10-11 14:22   ` Christoph Hellwig
  2015-10-14 17:37       ` Anna Schumaker
  -1 siblings, 1 reply; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-11 14:22 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

Needs to be folded.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 7/9] vfs: Remove copy_file_range mountpoint checks
  2015-09-30 17:26   ` Anna Schumaker
  (?)
@ 2015-10-11 14:23   ` Christoph Hellwig
  2015-10-14 17:41       ` Anna Schumaker
  -1 siblings, 1 reply; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-11 14:23 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

On Wed, Sep 30, 2015 at 01:26:51PM -0400, Anna Schumaker wrote:
> I still want to do an in-kernel copy even if the files are on different
> mountpoints, and NFS has a "server to server" copy that expects two
> files on different mountpoints.  Let's have individual filesystems
> implement this check instead.

NAK.  I thing this is a bad idea in general and will only be convinced
by a properly audited actual implementation.  And even then with a flag
where the file system specificly needs to opt into this behavior instead
of getting it by default.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 6/9] vfs: Copy should use file_out rather than file_in
  2015-09-30 17:26   ` Anna Schumaker
  (?)
@ 2015-10-11 14:24   ` Christoph Hellwig
  -1 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-11 14:24 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros, hch

On Wed, Sep 30, 2015 at 01:26:50PM -0400, Anna Schumaker wrote:
> The way to think about this is that the destination filesystem reads the
> data from the source file and processes it accordingly.  This is
> especially important to avoid an infinate loop when doing a "server to
> server" copy on NFS.

And doesn't really matter without those.  Either way this looks good
enough and should be folded.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-11 14:29     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-11 14:29 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros

On Wed, Sep 30, 2015 at 01:26:53PM -0400, Anna Schumaker wrote:
> Reject copies that don't have the COPY_FR_REFLINK flag set.

I think a reflink actually is a perfectly valid copy, and I don't buy
the duplicate arguments in earlier threads.  We really need to think
more in terms of how this impacts a user and now how it's implemented
internally.  How does a user notice it's a reflink?  They don't as
implemented in btrfs and co.  Now on filesystem that don't always do
copy on write but might support reflinks (ocfs2, XFS in the future)
this becomes a bit more interesting - the difference he is that we
get an implicit fallocate when doing a real copy.  But if that's
something we have actual requests for that's how we should specify
it rather than in terms of arcane implementation details.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-11 14:29     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-11 14:29 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Wed, Sep 30, 2015 at 01:26:53PM -0400, Anna Schumaker wrote:
> Reject copies that don't have the COPY_FR_REFLINK flag set.

I think a reflink actually is a perfectly valid copy, and I don't buy
the duplicate arguments in earlier threads.  We really need to think
more in terms of how this impacts a user and now how it's implemented
internally.  How does a user notice it's a reflink?  They don't as
implemented in btrfs and co.  Now on filesystem that don't always do
copy on write but might support reflinks (ocfs2, XFS in the future)
this becomes a bit more interesting - the difference he is that we
get an implicit fallocate when doing a real copy.  But if that's
something we have actual requests for that's how we should specify
it rather than in terms of arcane implementation details.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-12 10:23       ` Pádraig Brady
  0 siblings, 0 replies; 129+ messages in thread
From: Pádraig Brady @ 2015-10-12 10:23 UTC (permalink / raw)
  To: Christoph Hellwig, Anna Schumaker
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros

On 11/10/15 15:29, Christoph Hellwig wrote:
> On Wed, Sep 30, 2015 at 01:26:53PM -0400, Anna Schumaker wrote:
>> Reject copies that don't have the COPY_FR_REFLINK flag set.
> 
> I think a reflink actually is a perfectly valid copy, and I don't buy
> the duplicate arguments in earlier threads.  We really need to think
> more in terms of how this impacts a user and now how it's implemented
> internally.  How does a user notice it's a reflink?  They don't as
> implemented in btrfs and co. 

You're right that if the user doesn't notice, then there is no
point exposing this. However I think the user does notice as
there is a difference in the end state of the copy.  I.E. generally
if there is a different end state it would require an option,
while if only a different copying mechanism it would not.
I think the different end state of a reflink warrants an option for 3 reasons:

 - The user might want separate bits for resiliency. Now this is
   a weak argument due to possible deduplication in lower layers,
   but still valid is some setups.

 - The user might want to avoid CoW at a later time critical stage.

 - The user might want to avoid ENOSPC at a later critical stage.

> Now on filesystem that don't always do
> copy on write but might support reflinks (ocfs2, XFS in the future)
> this becomes a bit more interesting - the difference he is that we
> get an implicit fallocate when doing a real copy.  But if that's
> something we have actual requests for that's how we should specify
> it rather than in terms of arcane implementation details.

thanks,
Pádraig.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-12 10:23       ` Pádraig Brady
  0 siblings, 0 replies; 129+ messages in thread
From: Pádraig Brady @ 2015-10-12 10:23 UTC (permalink / raw)
  To: Christoph Hellwig, Anna Schumaker
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On 11/10/15 15:29, Christoph Hellwig wrote:
> On Wed, Sep 30, 2015 at 01:26:53PM -0400, Anna Schumaker wrote:
>> Reject copies that don't have the COPY_FR_REFLINK flag set.
> 
> I think a reflink actually is a perfectly valid copy, and I don't buy
> the duplicate arguments in earlier threads.  We really need to think
> more in terms of how this impacts a user and now how it's implemented
> internally.  How does a user notice it's a reflink?  They don't as
> implemented in btrfs and co. 

You're right that if the user doesn't notice, then there is no
point exposing this. However I think the user does notice as
there is a difference in the end state of the copy.  I.E. generally
if there is a different end state it would require an option,
while if only a different copying mechanism it would not.
I think the different end state of a reflink warrants an option for 3 reasons:

 - The user might want separate bits for resiliency. Now this is
   a weak argument due to possible deduplication in lower layers,
   but still valid is some setups.

 - The user might want to avoid CoW at a later time critical stage.

 - The user might want to avoid ENOSPC at a later critical stage.

> Now on filesystem that don't always do
> copy on write but might support reflinks (ocfs2, XFS in the future)
> this becomes a bit more interesting - the difference he is that we
> get an implicit fallocate when doing a real copy.  But if that's
> something we have actual requests for that's how we should specify
> it rather than in terms of arcane implementation details.

thanks,
Pádraig.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
  2015-10-12 10:23       ` Pádraig Brady
  (?)
@ 2015-10-12 14:34       ` Christoph Hellwig
  2015-10-12 23:41           ` Darrick J. Wong
  -1 siblings, 1 reply; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-12 14:34 UTC (permalink / raw)
  To: P??draig Brady
  Cc: Christoph Hellwig, Anna Schumaker, linux-nfs, linux-btrfs,
	linux-fsdevel, linux-api, zab, viro, clm, darrick.wong,
	mtk.manpages, andros

On Mon, Oct 12, 2015 at 11:23:05AM +0100, P??draig Brady wrote:
> You're right that if the user doesn't notice, then there is no
> point exposing this. However I think the user does notice as
> there is a difference in the end state of the copy.  I.E. generally
> if there is a different end state it would require an option,
> while if only a different copying mechanism it would not.
> I think the different end state of a reflink warrants an option for 3 reasons:
> 
>  - The user might want separate bits for resiliency. Now this is
>    a weak argument due to possible deduplication in lower layers,
>    but still valid is some setups.

This one is completely bogus.  For one because literally every lower
layer can and increasinly will dedup or share in some form.  If we
prentend we could do this we actively mislead the user.

>  - The user might want to avoid CoW at a later time critical stage.
> 
>  - The user might want to avoid ENOSPC at a later critical stage.

These two are the same and would be the argument for the "falloc" flag
I mention before.  But we'd need to sit down and specify the exact
semantics for it.  For example one important question that comes to mind
is if it also applies for extents that are holes in the source range.

I'd much rather get the basic system call in ASAP and then let people
explain their use cases for this and only add it once we've made sure
we have consistent semantics that actually fit the users needs.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-12 23:17       ` Darrick J. Wong
  0 siblings, 0 replies; 129+ messages in thread
From: Darrick J. Wong @ 2015-10-12 23:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anna Schumaker, linux-nfs, linux-btrfs, linux-fsdevel, linux-api,
	zab, viro, clm, mtk.manpages, andros

On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
> > This allows us to have an in-kernel copy mechanism that avoids frequent
> > switches between kernel and user space.  This is especially useful so
> > NFSD can support server-side copies.
> > 
> > I make pagecache copies configurable by adding three new (exclusive)
> > flags:
> > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
> > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
> > - COPY_FR_DEDUP creates a reflink, but only if the contents of both
> >   ranges are identical.
> 
> All but FR_COPY really should be a separate system call.  Clones (an
> dedup as a special case of clones) are really a separate beast from file
> copies.
> 
> If I want to clone a file I either want it clone fully or fail, not copy
> a certain amount.  That means that a) we need to return an error not
> short "write", and b) locking impementations are important - we need to
> prevent other applications from racing with our clone even if it is
> large, while to get these semantics for the possible short returning
> file copy will require a proper userland locking protocol. Last but not
> least file copies need to be interruptible while clones should be not.
> All this is already important for local file systems and even more
> important for NFS exporting.
> 
> So I'd suggest to drop this patch and just let your syscall handle
> actualy copies with all their horrors.  We can go with Peng's patches
> to generalize the btrfs ioctls for clones for now which is what everyone
> already uses anyway, and then add a separate sys_file_clone later.

Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.

What does everyone think about generalizing EXTENT_SAME?  The interface enables
one to ask the kernel to dedupe multiple file ranges in a single call.  That's
more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
that the extra complexity buys us the ability to ... multi-dedupe at the same
time, with locks held on the source file?

I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
hate the interface.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-12 23:17       ` Darrick J. Wong
  0 siblings, 0 replies; 129+ messages in thread
From: Darrick J. Wong @ 2015-10-12 23:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anna Schumaker, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
> > This allows us to have an in-kernel copy mechanism that avoids frequent
> > switches between kernel and user space.  This is especially useful so
> > NFSD can support server-side copies.
> > 
> > I make pagecache copies configurable by adding three new (exclusive)
> > flags:
> > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
> > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
> > - COPY_FR_DEDUP creates a reflink, but only if the contents of both
> >   ranges are identical.
> 
> All but FR_COPY really should be a separate system call.  Clones (an
> dedup as a special case of clones) are really a separate beast from file
> copies.
> 
> If I want to clone a file I either want it clone fully or fail, not copy
> a certain amount.  That means that a) we need to return an error not
> short "write", and b) locking impementations are important - we need to
> prevent other applications from racing with our clone even if it is
> large, while to get these semantics for the possible short returning
> file copy will require a proper userland locking protocol. Last but not
> least file copies need to be interruptible while clones should be not.
> All this is already important for local file systems and even more
> important for NFS exporting.
> 
> So I'd suggest to drop this patch and just let your syscall handle
> actualy copies with all their horrors.  We can go with Peng's patches
> to generalize the btrfs ioctls for clones for now which is what everyone
> already uses anyway, and then add a separate sys_file_clone later.

Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.

What does everyone think about generalizing EXTENT_SAME?  The interface enables
one to ask the kernel to dedupe multiple file ranges in a single call.  That's
more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
that the extra complexity buys us the ability to ... multi-dedupe at the same
time, with locks held on the source file?

I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
hate the interface.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-12 23:41           ` Darrick J. Wong
  0 siblings, 0 replies; 129+ messages in thread
From: Darrick J. Wong @ 2015-10-12 23:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: P??draig Brady, Anna Schumaker, linux-nfs, linux-btrfs,
	linux-fsdevel, linux-api, zab, viro, clm, mtk.manpages, andros

On Mon, Oct 12, 2015 at 07:34:44AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 12, 2015 at 11:23:05AM +0100, P??draig Brady wrote:
> > You're right that if the user doesn't notice, then there is no
> > point exposing this. However I think the user does notice as
> > there is a difference in the end state of the copy.  I.E. generally
> > if there is a different end state it would require an option,
> > while if only a different copying mechanism it would not.
> > I think the different end state of a reflink warrants an option for 3 reasons:
> > 
> >  - The user might want separate bits for resiliency. Now this is
> >    a weak argument due to possible deduplication in lower layers,
> >    but still valid is some setups.
> 
> This one is completely bogus.  For one because literally every lower
> layer can and increasinly will dedup or share in some form.  If we
> prentend we could do this we actively mislead the user.
> 
> >  - The user might want to avoid CoW at a later time critical stage.
> > 
> >  - The user might want to avoid ENOSPC at a later critical stage.
> 
> These two are the same and would be the argument for the "falloc" flag
> I mention before.  But we'd need to sit down and specify the exact
> semantics for it.  For example one important question that comes to mind
> is if it also applies for extents that are holes in the source range.

One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
leaves holes alone.

Obviously we haven't yet figured out what are peoples' preferences in terms of
"fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
that fills the holes while unsharing is going on.

Personally I suspect that the most interest is in filling holes and unsharing,
because they don't want to pay for allocation at a critical stage for anywhere
in the file.  But I could be wrong, so allowing both goals to be expressed via
mode allows flexibility.

--D

> 
> I'd much rather get the basic system call in ASAP and then let people
> explain their use cases for this and only add it once we've made sure
> we have consistent semantics that actually fit the users needs.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-12 23:41           ` Darrick J. Wong
  0 siblings, 0 replies; 129+ messages in thread
From: Darrick J. Wong @ 2015-10-12 23:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: P??draig Brady, Anna Schumaker, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Mon, Oct 12, 2015 at 07:34:44AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 12, 2015 at 11:23:05AM +0100, P??draig Brady wrote:
> > You're right that if the user doesn't notice, then there is no
> > point exposing this. However I think the user does notice as
> > there is a difference in the end state of the copy.  I.E. generally
> > if there is a different end state it would require an option,
> > while if only a different copying mechanism it would not.
> > I think the different end state of a reflink warrants an option for 3 reasons:
> > 
> >  - The user might want separate bits for resiliency. Now this is
> >    a weak argument due to possible deduplication in lower layers,
> >    but still valid is some setups.
> 
> This one is completely bogus.  For one because literally every lower
> layer can and increasinly will dedup or share in some form.  If we
> prentend we could do this we actively mislead the user.
> 
> >  - The user might want to avoid CoW at a later time critical stage.
> > 
> >  - The user might want to avoid ENOSPC at a later critical stage.
> 
> These two are the same and would be the argument for the "falloc" flag
> I mention before.  But we'd need to sit down and specify the exact
> semantics for it.  For example one important question that comes to mind
> is if it also applies for extents that are holes in the source range.

One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
leaves holes alone.

Obviously we haven't yet figured out what are peoples' preferences in terms of
"fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
that fills the holes while unsharing is going on.

Personally I suspect that the most interest is in filling holes and unsharing,
because they don't want to pay for allocation at a critical stage for anywhere
in the file.  But I could be wrong, so allowing both goals to be expressed via
mode allows flexibility.

--D

> 
> I'd much rather get the basic system call in ASAP and then let people
> explain their use cases for this and only add it once we've made sure
> we have consistent semantics that actually fit the users needs.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-12 23:17       ` Darrick J. Wong
  (?)
@ 2015-10-13  3:36       ` Trond Myklebust
  2015-10-13  7:19           ` Darrick J. Wong
  2015-10-13  7:30           ` Christoph Hellwig
  -1 siblings, 2 replies; 129+ messages in thread
From: Trond Myklebust @ 2015-10-13  3:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Anna Schumaker, Linux NFS Mailing List,
	Linux btrfs Developers List, Linux FS-devel Mailing List,
	Linux API Mailing List, Zach Brown, Alexander Viro, Chris Mason,
	Michael Kerrisk-manpages, William Andros Adamson

On Mon, Oct 12, 2015 at 7:17 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
>> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
>> > This allows us to have an in-kernel copy mechanism that avoids frequent
>> > switches between kernel and user space.  This is especially useful so
>> > NFSD can support server-side copies.
>> >
>> > I make pagecache copies configurable by adding three new (exclusive)
>> > flags:
>> > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
>> > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
>> > - COPY_FR_DEDUP creates a reflink, but only if the contents of both
>> >   ranges are identical.
>>
>> All but FR_COPY really should be a separate system call.  Clones (an
>> dedup as a special case of clones) are really a separate beast from file
>> copies.
>>
>> If I want to clone a file I either want it clone fully or fail, not copy
>> a certain amount.  That means that a) we need to return an error not
>> short "write", and b) locking impementations are important - we need to
>> prevent other applications from racing with our clone even if it is
>> large, while to get these semantics for the possible short returning
>> file copy will require a proper userland locking protocol. Last but not
>> least file copies need to be interruptible while clones should be not.
>> All this is already important for local file systems and even more
>> important for NFS exporting.
>>
>> So I'd suggest to drop this patch and just let your syscall handle
>> actualy copies with all their horrors.  We can go with Peng's patches
>> to generalize the btrfs ioctls for clones for now which is what everyone
>> already uses anyway, and then add a separate sys_file_clone later.
>
> Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
>
> What does everyone think about generalizing EXTENT_SAME?  The interface enables
> one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> that the extra complexity buys us the ability to ... multi-dedupe at the same
> time, with locks held on the source file?

How is this supposed to be implemented on something like NFS without
protocol changes?

Trond

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-13  7:19           ` Darrick J. Wong
  0 siblings, 0 replies; 129+ messages in thread
From: Darrick J. Wong @ 2015-10-13  7:19 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Christoph Hellwig, Anna Schumaker, Linux NFS Mailing List,
	Linux btrfs Developers List, Linux FS-devel Mailing List,
	Linux API Mailing List, Zach Brown, Alexander Viro, Chris Mason,
	Michael Kerrisk-manpages, William Andros Adamson

On Mon, Oct 12, 2015 at 11:36:31PM -0400, Trond Myklebust wrote:
> On Mon, Oct 12, 2015 at 7:17 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
> >> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
> >> > This allows us to have an in-kernel copy mechanism that avoids frequent
> >> > switches between kernel and user space.  This is especially useful so
> >> > NFSD can support server-side copies.
> >> >
> >> > I make pagecache copies configurable by adding three new (exclusive)
> >> > flags:
> >> > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
> >> > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
> >> > - COPY_FR_DEDUP creates a reflink, but only if the contents of both
> >> >   ranges are identical.
> >>
> >> All but FR_COPY really should be a separate system call.  Clones (an
> >> dedup as a special case of clones) are really a separate beast from file
> >> copies.
> >>
> >> If I want to clone a file I either want it clone fully or fail, not copy
> >> a certain amount.  That means that a) we need to return an error not
> >> short "write", and b) locking impementations are important - we need to
> >> prevent other applications from racing with our clone even if it is
> >> large, while to get these semantics for the possible short returning
> >> file copy will require a proper userland locking protocol. Last but not
> >> least file copies need to be interruptible while clones should be not.
> >> All this is already important for local file systems and even more
> >> important for NFS exporting.
> >>
> >> So I'd suggest to drop this patch and just let your syscall handle
> >> actualy copies with all their horrors.  We can go with Peng's patches
> >> to generalize the btrfs ioctls for clones for now which is what everyone
> >> already uses anyway, and then add a separate sys_file_clone later.
> >
> > Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> > btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
> >
> > What does everyone think about generalizing EXTENT_SAME?  The interface enables
> > one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> > more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> > that the extra complexity buys us the ability to ... multi-dedupe at the same
> > time, with locks held on the source file?
> 
> How is this supposed to be implemented on something like NFS without
> protocol changes?

Quite frankly, I'm not sure.  Assuming NFS doesn't already have some sort of
deduplication primitive (I could be totally wrong about that) I'd probably just
leave the appropriate ops function pointer set to NULL and return -EOPNOTSUPP
to userspace.  Trying to fake it by comparing contents on the client and
issuing a reflink might be doable with hard locks but if I had to guess I'd say
that's even less palatable than simply bailing out. :)

IOW: I was only considering the filesystems that already support dedupe, which
is basically btrfs and future-XFS.

--D

> 
> Trond
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-13  7:19           ` Darrick J. Wong
  0 siblings, 0 replies; 129+ messages in thread
From: Darrick J. Wong @ 2015-10-13  7:19 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Christoph Hellwig, Anna Schumaker, Linux NFS Mailing List,
	Linux btrfs Developers List, Linux FS-devel Mailing List,
	Linux API Mailing List, Zach Brown, Alexander Viro, Chris Mason,
	Michael Kerrisk-manpages, William Andros Adamson

On Mon, Oct 12, 2015 at 11:36:31PM -0400, Trond Myklebust wrote:
> On Mon, Oct 12, 2015 at 7:17 PM, Darrick J. Wong
> <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
> >> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
> >> > This allows us to have an in-kernel copy mechanism that avoids frequent
> >> > switches between kernel and user space.  This is especially useful so
> >> > NFSD can support server-side copies.
> >> >
> >> > I make pagecache copies configurable by adding three new (exclusive)
> >> > flags:
> >> > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
> >> > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
> >> > - COPY_FR_DEDUP creates a reflink, but only if the contents of both
> >> >   ranges are identical.
> >>
> >> All but FR_COPY really should be a separate system call.  Clones (an
> >> dedup as a special case of clones) are really a separate beast from file
> >> copies.
> >>
> >> If I want to clone a file I either want it clone fully or fail, not copy
> >> a certain amount.  That means that a) we need to return an error not
> >> short "write", and b) locking impementations are important - we need to
> >> prevent other applications from racing with our clone even if it is
> >> large, while to get these semantics for the possible short returning
> >> file copy will require a proper userland locking protocol. Last but not
> >> least file copies need to be interruptible while clones should be not.
> >> All this is already important for local file systems and even more
> >> important for NFS exporting.
> >>
> >> So I'd suggest to drop this patch and just let your syscall handle
> >> actualy copies with all their horrors.  We can go with Peng's patches
> >> to generalize the btrfs ioctls for clones for now which is what everyone
> >> already uses anyway, and then add a separate sys_file_clone later.
> >
> > Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> > btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
> >
> > What does everyone think about generalizing EXTENT_SAME?  The interface enables
> > one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> > more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> > that the extra complexity buys us the ability to ... multi-dedupe at the same
> > time, with locks held on the source file?
> 
> How is this supposed to be implemented on something like NFS without
> protocol changes?

Quite frankly, I'm not sure.  Assuming NFS doesn't already have some sort of
deduplication primitive (I could be totally wrong about that) I'd probably just
leave the appropriate ops function pointer set to NULL and return -EOPNOTSUPP
to userspace.  Trying to fake it by comparing contents on the client and
issuing a reflink might be doable with hard locks but if I had to guess I'd say
that's even less palatable than simply bailing out. :)

IOW: I was only considering the filesystems that already support dedupe, which
is basically btrfs and future-XFS.

--D

> 
> Trond
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-13  7:27         ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-13  7:27 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Anna Schumaker, linux-nfs, linux-btrfs,
	linux-fsdevel, linux-api, zab, viro, clm, mtk.manpages, andros

On Mon, Oct 12, 2015 at 04:17:49PM -0700, Darrick J. Wong wrote:
> Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
> 
> What does everyone think about generalizing EXTENT_SAME?  The interface enables
> one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> that the extra complexity buys us the ability to ... multi-dedupe at the same
> time, with locks held on the source file?
> 
> I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
> hate the interface.

It's not pretty, but if the btrfs folks have a good reason for it I
don't see a reason to diverge.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-13  7:27         ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-13  7:27 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Anna Schumaker,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Mon, Oct 12, 2015 at 04:17:49PM -0700, Darrick J. Wong wrote:
> Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
> 
> What does everyone think about generalizing EXTENT_SAME?  The interface enables
> one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> that the extra complexity buys us the ability to ... multi-dedupe at the same
> time, with locks held on the source file?
> 
> I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
> hate the interface.

It's not pretty, but if the btrfs folks have a good reason for it I
don't see a reason to diverge.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-13  7:29             ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-13  7:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, P??draig Brady, Anna Schumaker, linux-nfs,
	linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	mtk.manpages, andros

On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote:
> One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
> flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
> leaves holes alone.

Yes, I've seen the implementation. 

> Obviously we haven't yet figured out what are peoples' preferences in terms of
> "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
> fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
> that fills the holes while unsharing is going on.
> 
> Personally I suspect that the most interest is in filling holes and unsharing,
> because they don't want to pay for allocation at a critical stage for anywhere
> in the file.  But I could be wrong, so allowing both goals to be expressed via
> mode allows flexibility.

Exactly.  And a normal falloc should do just that - fill holes and
ensure that we don't need to COW already allocated locks.  So I don't
think we need a new fallocate interface for that.  The question is if we
want a copy interface that gives you the same semantics as if you also
called an fallocate on the destination range.  For that case we'd
usually want to avoid doing the clone and instead do a in-kernel or
hardware assisted copy and then fill the holes with unwritten extents.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-13  7:29             ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-13  7:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, P??draig Brady, Anna Schumaker,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote:
> One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
> flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
> leaves holes alone.

Yes, I've seen the implementation. 

> Obviously we haven't yet figured out what are peoples' preferences in terms of
> "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
> fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
> that fills the holes while unsharing is going on.
> 
> Personally I suspect that the most interest is in filling holes and unsharing,
> because they don't want to pay for allocation at a critical stage for anywhere
> in the file.  But I could be wrong, so allowing both goals to be expressed via
> mode allows flexibility.

Exactly.  And a normal falloc should do just that - fill holes and
ensure that we don't need to COW already allocated locks.  So I don't
think we need a new fallocate interface for that.  The question is if we
want a copy interface that gives you the same semantics as if you also
called an fallocate on the destination range.  For that case we'd
usually want to avoid doing the clone and instead do a in-kernel or
hardware assisted copy and then fill the holes with unwritten extents.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-13  7:30           ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-13  7:30 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Darrick J. Wong, Christoph Hellwig, Anna Schumaker,
	Linux NFS Mailing List, Linux btrfs Developers List,
	Linux FS-devel Mailing List, Linux API Mailing List, Zach Brown,
	Alexander Viro, Chris Mason, Michael Kerrisk-manpages,
	William Andros Adamson

On Mon, Oct 12, 2015 at 11:36:31PM -0400, Trond Myklebust wrote:
> How is this supposed to be implemented on something like NFS without
> protocol changes?

Explicit dedup has no chance of working over NFS or other network
protocols without protocol changes.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-13  7:30           ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-13  7:30 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Darrick J. Wong, Christoph Hellwig, Anna Schumaker,
	Linux NFS Mailing List, Linux btrfs Developers List,
	Linux FS-devel Mailing List, Linux API Mailing List, Zach Brown,
	Alexander Viro, Chris Mason, Michael Kerrisk-manpages,
	William Andros Adamson

On Mon, Oct 12, 2015 at 11:36:31PM -0400, Trond Myklebust wrote:
> How is this supposed to be implemented on something like NFS without
> protocol changes?

Explicit dedup has no chance of working over NFS or other network
protocols without protocol changes.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-13 19:45       ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-13 19:45 UTC (permalink / raw)
  To: Neil Brown, linux-nfs, linux-btrfs, linux-fsdevel, linux-api,
	zab, viro, clm, darrick.wong, mtk.manpages, andros, hch

On 10/07/2015 09:40 PM, Neil Brown wrote:
> Anna Schumaker <Anna.Schumaker@netapp.com> writes:
> 
>> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>>  			    struct file *file_out, loff_t pos_out,
>>  			    size_t len, unsigned int flags)
>>  {
>> -	struct inode *inode_in;
>> -	struct inode *inode_out;
>>  	ssize_t ret;
>>  
>> -	if (flags)
>> +	/* Flags should only be used exclusively. */
>> +	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
>>  		return -EINVAL;
>>  
> 
> Do you also need:
> 
>    if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP))
>    	return -EINVAL;
> 
> so that future user-space can test if the kernel supports new flags?

Probably.  I'll add that in!

Thanks,
Anna

> 
> NeilBrown
> 


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-13 19:45       ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-13 19:45 UTC (permalink / raw)
  To: Neil Brown, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

On 10/07/2015 09:40 PM, Neil Brown wrote:
> Anna Schumaker <Anna.Schumaker-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> writes:
> 
>> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>>  			    struct file *file_out, loff_t pos_out,
>>  			    size_t len, unsigned int flags)
>>  {
>> -	struct inode *inode_in;
>> -	struct inode *inode_out;
>>  	ssize_t ret;
>>  
>> -	if (flags)
>> +	/* Flags should only be used exclusively. */
>> +	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
>>  		return -EINVAL;
>>  
> 
> Do you also need:
> 
>    if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP))
>    	return -EINVAL;
> 
> so that future user-space can test if the kernel supports new flags?

Probably.  I'll add that in!

Thanks,
Anna

> 
> NeilBrown
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-13 19:45       ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-13 19:45 UTC (permalink / raw)
  To: Neil Brown, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

On 10/07/2015 09:40 PM, Neil Brown wrote:
> Anna Schumaker <Anna.Schumaker-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> writes:
> 
>> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>>  			    struct file *file_out, loff_t pos_out,
>>  			    size_t len, unsigned int flags)
>>  {
>> -	struct inode *inode_in;
>> -	struct inode *inode_out;
>>  	ssize_t ret;
>>  
>> -	if (flags)
>> +	/* Flags should only be used exclusively. */
>> +	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
>> +		return -EINVAL;
>> +	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
>>  		return -EINVAL;
>>  
> 
> Do you also need:
> 
>    if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP))
>    	return -EINVAL;
> 
> so that future user-space can test if the kernel supports new flags?

Probably.  I'll add that in!

Thanks,
Anna

> 
> NeilBrown
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-13 20:25         ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-13 20:25 UTC (permalink / raw)
  To: Pádraig Brady, Neil Brown, linux-nfs, linux-btrfs,
	linux-fsdevel, linux-api, zab, viro, clm, darrick.wong,
	mtk.manpages, andros, hch

On 10/09/2015 07:15 AM, Pádraig Brady wrote:
> On 08/10/15 02:40, Neil Brown wrote:
>> Anna Schumaker <Anna.Schumaker@netapp.com> writes:
>>
>>> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>>>  			    struct file *file_out, loff_t pos_out,
>>>  			    size_t len, unsigned int flags)
>>>  {
>>> -	struct inode *inode_in;
>>> -	struct inode *inode_out;
>>>  	ssize_t ret;
>>>  
>>> -	if (flags)
>>> +	/* Flags should only be used exclusively. */
>>> +	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
>>> +		return -EINVAL;
>>> +	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
>>> +		return -EINVAL;
>>> +	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
>>>  		return -EINVAL;
>>>  
>>
>> Do you also need:
>>
>>    if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP))
>>    	return -EINVAL;
>>
>> so that future user-space can test if the kernel supports new flags?
> 
> Seems like a good idea, yes.
> 
> Also that got me thinking about COPY_FR_SPARSE.
> What's the current behavior when copying a sparse range?
> Is the hole propagated by default (good), or is it expanded?

I haven't tried it, but I think the hole would be expanded :(.  I'm having splice() handle the pagecache copy part, and (as far as I know) splice() doesn't know anything about sparse files.  I might be able to put in some kind of fallocate() / splice() loop to copy the range in multiple pieces.

I don't want to add COPY_FR_SPARSE_AUTO, because then the kernel will have to determine how best to interpret "auto".  I'm more inclined to add a single COPY_FR_SPARSE flag to enable creating sparse files, and then have the application tell us what to do for any given range.

Anna

> 
> Note cp(1) has --sparse={never,auto,always}. Auto is the default,
> so it would be good I think if that was the default mode for copy_file_range().
> With other sparse modes, we'd have to avoid copy_file_range() unless
> there was control possible with COPY_FR_SPARSE_{AUTO,NONE,ALWAYS}.
> Note currently cp --sparse=always will detect runs of zeros and also
> avoid speculative preallocation by using fallocate (fd, FALLOC_FL_PUNCH_HOLE, ...)
> 
> thanks,
> Pádraig.
> 


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-13 20:25         ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-13 20:25 UTC (permalink / raw)
  To: Pádraig Brady, Neil Brown, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

On 10/09/2015 07:15 AM, Pádraig Brady wrote:
> On 08/10/15 02:40, Neil Brown wrote:
>> Anna Schumaker <Anna.Schumaker-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> writes:
>>
>>> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>>>  			    struct file *file_out, loff_t pos_out,
>>>  			    size_t len, unsigned int flags)
>>>  {
>>> -	struct inode *inode_in;
>>> -	struct inode *inode_out;
>>>  	ssize_t ret;
>>>  
>>> -	if (flags)
>>> +	/* Flags should only be used exclusively. */
>>> +	if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY))
>>> +		return -EINVAL;
>>> +	if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK))
>>> +		return -EINVAL;
>>> +	if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP))
>>>  		return -EINVAL;
>>>  
>>
>> Do you also need:
>>
>>    if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP))
>>    	return -EINVAL;
>>
>> so that future user-space can test if the kernel supports new flags?
> 
> Seems like a good idea, yes.
> 
> Also that got me thinking about COPY_FR_SPARSE.
> What's the current behavior when copying a sparse range?
> Is the hole propagated by default (good), or is it expanded?

I haven't tried it, but I think the hole would be expanded :(.  I'm having splice() handle the pagecache copy part, and (as far as I know) splice() doesn't know anything about sparse files.  I might be able to put in some kind of fallocate() / splice() loop to copy the range in multiple pieces.

I don't want to add COPY_FR_SPARSE_AUTO, because then the kernel will have to determine how best to interpret "auto".  I'm more inclined to add a single COPY_FR_SPARSE flag to enable creating sparse files, and then have the application tell us what to do for any given range.

Anna

> 
> Note cp(1) has --sparse={never,auto,always}. Auto is the default,
> so it would be good I think if that was the default mode for copy_file_range().
> With other sparse modes, we'd have to avoid copy_file_range() unless
> there was control possible with COPY_FR_SPARSE_{AUTO,NONE,ALWAYS}.
> Note currently cp --sparse=always will detect runs of zeros and also
> avoid speculative preallocation by using fallocate (fd, FALLOC_FL_PUNCH_HOLE, ...)
> 
> thanks,
> Pádraig.
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14  7:41           ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-14  7:41 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: P??draig Brady, Neil Brown, linux-nfs, linux-btrfs,
	linux-fsdevel, linux-api, zab, viro, clm, darrick.wong,
	mtk.manpages, andros, hch

On Tue, Oct 13, 2015 at 04:25:29PM -0400, Anna Schumaker wrote:
> I haven't tried it, but I think the hole would be expanded :(.  I'm having splice() handle the pagecache copy part, and (as far as I know) splice() doesn't know anything about sparse files.  I might be able to put in some kind of fallocate() / splice() loop to copy the range in multiple pieces.
> 
> I don't want to add COPY_FR_SPARSE_AUTO, because then the kernel will have to determine how best to interpret "auto".  I'm more inclined to add a single COPY_FR_SPARSE flag to enable creating sparse files, and then have the application tell us what to do for any given range.

Teh right think is to keep sparse ranges spare as much as possible.
This would require the same sort of support as NFS READ_PLUS so I think
it's worthwhile to try it.  If the file system can't support it it won't
be sparse, so we'll get a worse quality of implementation.

But please don't add even more weird flags that just confuse users.

So far I think the only useful flag for copy_file_range is a PREALLOC
or similar flag that says the destination range should have an implicit
poix_fallocate performed on it.  And due to the complexity of
implementation I'm not even sure we need that in the first version.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14  7:41           ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-14  7:41 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: P??draig Brady, Neil Brown, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA, hch-wEGCiKHe2LqWVfeAwA7xHQ

On Tue, Oct 13, 2015 at 04:25:29PM -0400, Anna Schumaker wrote:
> I haven't tried it, but I think the hole would be expanded :(.  I'm having splice() handle the pagecache copy part, and (as far as I know) splice() doesn't know anything about sparse files.  I might be able to put in some kind of fallocate() / splice() loop to copy the range in multiple pieces.
> 
> I don't want to add COPY_FR_SPARSE_AUTO, because then the kernel will have to determine how best to interpret "auto".  I'm more inclined to add a single COPY_FR_SPARSE flag to enable creating sparse files, and then have the application tell us what to do for any given range.

Teh right think is to keep sparse ranges spare as much as possible.
This would require the same sort of support as NFS READ_PLUS so I think
it's worthwhile to try it.  If the file system can't support it it won't
be sparse, so we'll get a worse quality of implementation.

But please don't add even more weird flags that just confuse users.

So far I think the only useful flag for copy_file_range is a PREALLOC
or similar flag that says the destination range should have an implicit
poix_fallocate performed on it.  And due to the complexity of
implementation I'm not even sure we need that in the first version.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file
@ 2015-10-14 17:37       ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-14 17:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros

I would have folded this and patch 4 earlier if I had written patch 1, but I didn't feel comfortable modifying Zach's work too much.  I can make that change if it's not really a problem.

Anna

On 10/11/2015 10:22 AM, Christoph Hellwig wrote:
> Needs to be folded.
> 


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file
@ 2015-10-14 17:37       ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-14 17:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

I would have folded this and patch 4 earlier if I had written patch 1, but I didn't feel comfortable modifying Zach's work too much.  I can make that change if it's not really a problem.

Anna

On 10/11/2015 10:22 AM, Christoph Hellwig wrote:
> Needs to be folded.
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file
@ 2015-10-14 17:37       ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-14 17:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

I would have folded this and patch 4 earlier if I had written patch 1, but I didn't feel comfortable modifying Zach's work too much.  I can make that change if it's not really a problem.

Anna

On 10/11/2015 10:22 AM, Christoph Hellwig wrote:
> Needs to be folded.
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 7/9] vfs: Remove copy_file_range mountpoint checks
  2015-10-11 14:23   ` Christoph Hellwig
@ 2015-10-14 17:41       ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-14 17:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros

On 10/11/2015 10:23 AM, Christoph Hellwig wrote:
> On Wed, Sep 30, 2015 at 01:26:51PM -0400, Anna Schumaker wrote:
>> I still want to do an in-kernel copy even if the files are on different
>> mountpoints, and NFS has a "server to server" copy that expects two
>> files on different mountpoints.  Let's have individual filesystems
>> implement this check instead.
> 
> NAK.  I thing this is a bad idea in general and will only be convinced
> by a properly audited actual implementation.  And even then with a flag
> where the file system specificly needs to opt into this behavior instead
> of getting it by default.
> 

So I should drop this patch even with the pagecache copy?  Andy Adamson will have to add it in later as part of his server-to-server patches.

Anna

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 7/9] vfs: Remove copy_file_range mountpoint checks
@ 2015-10-14 17:41       ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-14 17:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros

On 10/11/2015 10:23 AM, Christoph Hellwig wrote:
> On Wed, Sep 30, 2015 at 01:26:51PM -0400, Anna Schumaker wrote:
>> I still want to do an in-kernel copy even if the files are on different
>> mountpoints, and NFS has a "server to server" copy that expects two
>> files on different mountpoints.  Let's have individual filesystems
>> implement this check instead.
> 
> NAK.  I thing this is a bad idea in general and will only be convinced
> by a properly audited actual implementation.  And even then with a flag
> where the file system specificly needs to opt into this behavior instead
> of getting it by default.
> 

So I should drop this patch even with the pagecache copy?  Andy Adamson will have to add it in later as part of his server-to-server patches.

Anna

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 17:59         ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-14 17:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, linux-nfs, linux-btrfs, linux-fsdevel,
	linux-api, zab, viro, clm, mtk.manpages, andros

On 10/12/2015 07:17 PM, Darrick J. Wong wrote:
> On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
>> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
>>> This allows us to have an in-kernel copy mechanism that avoids frequent
>>> switches between kernel and user space.  This is especially useful so
>>> NFSD can support server-side copies.
>>>
>>> I make pagecache copies configurable by adding three new (exclusive)
>>> flags:
>>> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
>>> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
>>> - COPY_FR_DEDUP creates a reflink, but only if the contents of both
>>>   ranges are identical.
>>
>> All but FR_COPY really should be a separate system call.  Clones (an
>> dedup as a special case of clones) are really a separate beast from file
>> copies.
>>
>> If I want to clone a file I either want it clone fully or fail, not copy
>> a certain amount.  That means that a) we need to return an error not
>> short "write", and b) locking impementations are important - we need to
>> prevent other applications from racing with our clone even if it is
>> large, while to get these semantics for the possible short returning
>> file copy will require a proper userland locking protocol. Last but not
>> least file copies need to be interruptible while clones should be not.
>> All this is already important for local file systems and even more
>> important for NFS exporting.
>>
>> So I'd suggest to drop this patch and just let your syscall handle
>> actualy copies with all their horrors.  We can go with Peng's patches
>> to generalize the btrfs ioctls for clones for now which is what everyone
>> already uses anyway, and then add a separate sys_file_clone later.

So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct?  I can make those changes, but only if everybody is in agreement that it's the best way forward.

The only reason I haven't done anything to make this system call interruptible is because I haven't been able to find any documentation or examples for making system calls interruptible.  How do I do this?

Anna

> 
> Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
> 
> What does everyone think about generalizing EXTENT_SAME?  The interface enables
> one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> that the extra complexity buys us the ability to ... multi-dedupe at the same
> time, with locks held on the source file?
> 
> I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
> hate the interface.
> 
> --D
> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 17:59         ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-14 17:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On 10/12/2015 07:17 PM, Darrick J. Wong wrote:
> On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
>> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
>>> This allows us to have an in-kernel copy mechanism that avoids frequent
>>> switches between kernel and user space.  This is especially useful so
>>> NFSD can support server-side copies.
>>>
>>> I make pagecache copies configurable by adding three new (exclusive)
>>> flags:
>>> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
>>> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
>>> - COPY_FR_DEDUP creates a reflink, but only if the contents of both
>>>   ranges are identical.
>>
>> All but FR_COPY really should be a separate system call.  Clones (an
>> dedup as a special case of clones) are really a separate beast from file
>> copies.
>>
>> If I want to clone a file I either want it clone fully or fail, not copy
>> a certain amount.  That means that a) we need to return an error not
>> short "write", and b) locking impementations are important - we need to
>> prevent other applications from racing with our clone even if it is
>> large, while to get these semantics for the possible short returning
>> file copy will require a proper userland locking protocol. Last but not
>> least file copies need to be interruptible while clones should be not.
>> All this is already important for local file systems and even more
>> important for NFS exporting.
>>
>> So I'd suggest to drop this patch and just let your syscall handle
>> actualy copies with all their horrors.  We can go with Peng's patches
>> to generalize the btrfs ioctls for clones for now which is what everyone
>> already uses anyway, and then add a separate sys_file_clone later.

So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct?  I can make those changes, but only if everybody is in agreement that it's the best way forward.

The only reason I haven't done anything to make this system call interruptible is because I haven't been able to find any documentation or examples for making system calls interruptible.  How do I do this?

Anna

> 
> Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
> 
> What does everyone think about generalizing EXTENT_SAME?  The interface enables
> one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> that the extra complexity buys us the ability to ... multi-dedupe at the same
> time, with locks held on the source file?
> 
> I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
> hate the interface.
> 
> --D
> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 17:59         ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-14 17:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On 10/12/2015 07:17 PM, Darrick J. Wong wrote:
> On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
>> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
>>> This allows us to have an in-kernel copy mechanism that avoids frequent
>>> switches between kernel and user space.  This is especially useful so
>>> NFSD can support server-side copies.
>>>
>>> I make pagecache copies configurable by adding three new (exclusive)
>>> flags:
>>> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
>>> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
>>> - COPY_FR_DEDUP creates a reflink, but only if the contents of both
>>>   ranges are identical.
>>
>> All but FR_COPY really should be a separate system call.  Clones (an
>> dedup as a special case of clones) are really a separate beast from file
>> copies.
>>
>> If I want to clone a file I either want it clone fully or fail, not copy
>> a certain amount.  That means that a) we need to return an error not
>> short "write", and b) locking impementations are important - we need to
>> prevent other applications from racing with our clone even if it is
>> large, while to get these semantics for the possible short returning
>> file copy will require a proper userland locking protocol. Last but not
>> least file copies need to be interruptible while clones should be not.
>> All this is already important for local file systems and even more
>> important for NFS exporting.
>>
>> So I'd suggest to drop this patch and just let your syscall handle
>> actualy copies with all their horrors.  We can go with Peng's patches
>> to generalize the btrfs ioctls for clones for now which is what everyone
>> already uses anyway, and then add a separate sys_file_clone later.

So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct?  I can make those changes, but only if everybody is in agreement that it's the best way forward.

The only reason I haven't done anything to make this system call interruptible is because I haven't been able to find any documentation or examples for making system calls interruptible.  How do I do this?

Anna

> 
> Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
> 
> What does everyone think about generalizing EXTENT_SAME?  The interface enables
> one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> that the extra complexity buys us the ability to ... multi-dedupe at the same
> time, with locks held on the source file?
> 
> I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
> hate the interface.
> 
> --D
> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-14 17:59         ` Anna Schumaker
  (?)
  (?)
@ 2015-10-14 18:08         ` Andy Lutomirski
  2015-10-14 18:27           ` Christoph Hellwig
  -1 siblings, 1 reply; 129+ messages in thread
From: Andy Lutomirski @ 2015-10-14 18:08 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: Christoph Hellwig, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On Wed, Oct 14, 2015 at 10:59 AM, Anna Schumaker
<Anna.Schumaker@netapp.com> wrote:
> On 10/12/2015 07:17 PM, Darrick J. Wong wrote:
>> On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
>>> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
>>>> This allows us to have an in-kernel copy mechanism that avoids frequent
>>>> switches between kernel and user space.  This is especially useful so
>>>> NFSD can support server-side copies.
>>>>
>>>> I make pagecache copies configurable by adding three new (exclusive)
>>>> flags:
>>>> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
>>>> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
>>>> - COPY_FR_DEDUP creates a reflink, but only if the contents of both
>>>>   ranges are identical.
>>>
>>> All but FR_COPY really should be a separate system call.  Clones (an
>>> dedup as a special case of clones) are really a separate beast from file
>>> copies.
>>>
>>> If I want to clone a file I either want it clone fully or fail, not copy
>>> a certain amount.  That means that a) we need to return an error not
>>> short "write", and b) locking impementations are important - we need to
>>> prevent other applications from racing with our clone even if it is
>>> large, while to get these semantics for the possible short returning
>>> file copy will require a proper userland locking protocol. Last but not
>>> least file copies need to be interruptible while clones should be not.
>>> All this is already important for local file systems and even more
>>> important for NFS exporting.
>>>
>>> So I'd suggest to drop this patch and just let your syscall handle
>>> actualy copies with all their horrors.  We can go with Peng's patches
>>> to generalize the btrfs ioctls for clones for now which is what everyone
>>> already uses anyway, and then add a separate sys_file_clone later.
>
> So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct?  I can make those changes, but only if everybody is in agreement that it's the best way forward.

I personally rather like the reflink option.  That thing is quite useful.

>
> The only reason I haven't done anything to make this system call interruptible is because I haven't been able to find any documentation or examples for making system calls interruptible.  How do I do this?
>

For just interruptability, avoid waiting in non-interruptable ways and
return -EINTR if one of your wait calls returns -EINTR.

For restartability, it's more complicated.  There are special values
you can return that give the signal code hints as to what to do.

--Andy

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 18:11           ` Darrick J. Wong
  0 siblings, 0 replies; 129+ messages in thread
From: Darrick J. Wong @ 2015-10-14 18:11 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: Christoph Hellwig, linux-nfs, linux-btrfs, linux-fsdevel,
	linux-api, zab, viro, clm, mtk.manpages, andros

On Wed, Oct 14, 2015 at 01:59:40PM -0400, Anna Schumaker wrote:
> On 10/12/2015 07:17 PM, Darrick J. Wong wrote:
> > On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
> >> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
> >>> This allows us to have an in-kernel copy mechanism that avoids frequent
> >>> switches between kernel and user space.  This is especially useful so
> >>> NFSD can support server-side copies.
> >>>
> >>> I make pagecache copies configurable by adding three new (exclusive)
> >>> flags:
> >>> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
> >>> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
> >>> - COPY_FR_DEDUP creates a reflink, but only if the contents of both
> >>>   ranges are identical.
> >>
> >> All but FR_COPY really should be a separate system call.  Clones (an
> >> dedup as a special case of clones) are really a separate beast from file
> >> copies.
> >>
> >> If I want to clone a file I either want it clone fully or fail, not copy
> >> a certain amount.  That means that a) we need to return an error not
> >> short "write", and b) locking impementations are important - we need to
> >> prevent other applications from racing with our clone even if it is
> >> large, while to get these semantics for the possible short returning
> >> file copy will require a proper userland locking protocol. Last but not
> >> least file copies need to be interruptible while clones should be not.
> >> All this is already important for local file systems and even more
> >> important for NFS exporting.
> >>
> >> So I'd suggest to drop this patch and just let your syscall handle
> >> actualy copies with all their horrors.  We can go with Peng's patches
> >> to generalize the btrfs ioctls for clones for now which is what everyone
> >> already uses anyway, and then add a separate sys_file_clone later.
> 
> So what I'm hearing is that I should drop the reflink and dedup flags and
> change this system call only perform a full copy (with preserving of
> sparseness), correct?  I can make those changes, but only if everybody is in
> agreement that it's the best way forward.

Sounds fine to me; I'll work on promoting EXTENT_SAME to the VFS.

> The only reason I haven't done anything to make this system call
> interruptible is because I haven't been able to find any documentation or
> examples for making system calls interruptible.  How do I do this?

I thought it was mostly a matter of sprinkling in "if (signal_pending(...))
return -ERESTARTSYS" type things whenever it's convenient to check.  The splice
code already seems to have this, though I'm no expert on what the splice code
actually does. :)

--D
> 
> Anna
> 
> > 
> > Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> > btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
> > 
> > What does everyone think about generalizing EXTENT_SAME?  The interface enables
> > one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> > more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> > that the extra complexity buys us the ability to ... multi-dedupe at the same
> > time, with locks held on the source file?
> > 
> > I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
> > hate the interface.
> > 
> > --D
> > 
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 18:11           ` Darrick J. Wong
  0 siblings, 0 replies; 129+ messages in thread
From: Darrick J. Wong @ 2015-10-14 18:11 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: Christoph Hellwig, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Wed, Oct 14, 2015 at 01:59:40PM -0400, Anna Schumaker wrote:
> On 10/12/2015 07:17 PM, Darrick J. Wong wrote:
> > On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
> >> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
> >>> This allows us to have an in-kernel copy mechanism that avoids frequent
> >>> switches between kernel and user space.  This is especially useful so
> >>> NFSD can support server-side copies.
> >>>
> >>> I make pagecache copies configurable by adding three new (exclusive)
> >>> flags:
> >>> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
> >>> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
> >>> - COPY_FR_DEDUP creates a reflink, but only if the contents of both
> >>>   ranges are identical.
> >>
> >> All but FR_COPY really should be a separate system call.  Clones (an
> >> dedup as a special case of clones) are really a separate beast from file
> >> copies.
> >>
> >> If I want to clone a file I either want it clone fully or fail, not copy
> >> a certain amount.  That means that a) we need to return an error not
> >> short "write", and b) locking impementations are important - we need to
> >> prevent other applications from racing with our clone even if it is
> >> large, while to get these semantics for the possible short returning
> >> file copy will require a proper userland locking protocol. Last but not
> >> least file copies need to be interruptible while clones should be not.
> >> All this is already important for local file systems and even more
> >> important for NFS exporting.
> >>
> >> So I'd suggest to drop this patch and just let your syscall handle
> >> actualy copies with all their horrors.  We can go with Peng's patches
> >> to generalize the btrfs ioctls for clones for now which is what everyone
> >> already uses anyway, and then add a separate sys_file_clone later.
> 
> So what I'm hearing is that I should drop the reflink and dedup flags and
> change this system call only perform a full copy (with preserving of
> sparseness), correct?  I can make those changes, but only if everybody is in
> agreement that it's the best way forward.

Sounds fine to me; I'll work on promoting EXTENT_SAME to the VFS.

> The only reason I haven't done anything to make this system call
> interruptible is because I haven't been able to find any documentation or
> examples for making system calls interruptible.  How do I do this?

I thought it was mostly a matter of sprinkling in "if (signal_pending(...))
return -ERESTARTSYS" type things whenever it's convenient to check.  The splice
code already seems to have this, though I'm no expert on what the splice code
actually does. :)

--D
> 
> Anna
> 
> > 
> > Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> > btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
> > 
> > What does everyone think about generalizing EXTENT_SAME?  The interface enables
> > one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> > more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> > that the extra complexity buys us the ability to ... multi-dedupe at the same
> > time, with locks held on the source file?
> > 
> > I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
> > hate the interface.
> > 
> > --D
> > 
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file
  2015-10-14 17:37       ` Anna Schumaker
  (?)
  (?)
@ 2015-10-14 18:25       ` Christoph Hellwig
  2015-10-14 18:27           ` Anna Schumaker
  -1 siblings, 1 reply; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-14 18:25 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: Christoph Hellwig, linux-nfs, linux-btrfs, linux-fsdevel,
	linux-api, zab, viro, clm, darrick.wong, mtk.manpages, andros

On Wed, Oct 14, 2015 at 01:37:13PM -0400, Anna Schumaker wrote:
> I would have folded this and patch 4 earlier if I had written patch 1,
> but I didn't feel comfortable modifying Zach's work too much.  I can
> make that change if it's not really a problem.

Folding the changes is perfectly fine, just make it clear you changed
it, e.g.

Signed-off-by: Original Author <original@auhor.info>
[anna: fixed foo & bar, rewrote changelog]
Signed-off-by: Anna Schumaker <Anna.Schumaker@netapp.com>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 7/9] vfs: Remove copy_file_range mountpoint checks
@ 2015-10-14 18:25         ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-14 18:25 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: Christoph Hellwig, linux-nfs, linux-btrfs, linux-fsdevel,
	linux-api, zab, viro, clm, darrick.wong, mtk.manpages, andros

On Wed, Oct 14, 2015 at 01:41:23PM -0400, Anna Schumaker wrote:
> > NAK.  I thing this is a bad idea in general and will only be convinced
> > by a properly audited actual implementation.  And even then with a flag
> > where the file system specificly needs to opt into this behavior instead
> > of getting it by default.
> > 
> 
> So I should drop this patch even with the pagecache copy?  Andy
> Adamson will have to add it in later as part of his server-to-server patches.

Yes.  Let him do the proof it works alright then.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 7/9] vfs: Remove copy_file_range mountpoint checks
@ 2015-10-14 18:25         ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-14 18:25 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: Christoph Hellwig, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Wed, Oct 14, 2015 at 01:41:23PM -0400, Anna Schumaker wrote:
> > NAK.  I thing this is a bad idea in general and will only be convinced
> > by a properly audited actual implementation.  And even then with a flag
> > where the file system specificly needs to opt into this behavior instead
> > of getting it by default.
> > 
> 
> So I should drop this patch even with the pagecache copy?  Andy
> Adamson will have to add it in later as part of his server-to-server patches.

Yes.  Let him do the proof it works alright then.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-14 18:11           ` Darrick J. Wong
  (?)
@ 2015-10-14 18:26           ` Andy Lutomirski
  -1 siblings, 0 replies; 129+ messages in thread
From: Andy Lutomirski @ 2015-10-14 18:26 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Anna Schumaker, Christoph Hellwig, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On Wed, Oct 14, 2015 at 11:11 AM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Wed, Oct 14, 2015 at 01:59:40PM -0400, Anna Schumaker wrote:
>> On 10/12/2015 07:17 PM, Darrick J. Wong wrote:
>> > On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
>> >> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
>> >>> This allows us to have an in-kernel copy mechanism that avoids frequent
>> >>> switches between kernel and user space.  This is especially useful so
>> >>> NFSD can support server-side copies.
>> >>>
>> >>> I make pagecache copies configurable by adding three new (exclusive)
>> >>> flags:
>> >>> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
>> >>> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
>> >>> - COPY_FR_DEDUP creates a reflink, but only if the contents of both
>> >>>   ranges are identical.
>> >>
>> >> All but FR_COPY really should be a separate system call.  Clones (an
>> >> dedup as a special case of clones) are really a separate beast from file
>> >> copies.
>> >>
>> >> If I want to clone a file I either want it clone fully or fail, not copy
>> >> a certain amount.  That means that a) we need to return an error not
>> >> short "write", and b) locking impementations are important - we need to
>> >> prevent other applications from racing with our clone even if it is
>> >> large, while to get these semantics for the possible short returning
>> >> file copy will require a proper userland locking protocol. Last but not
>> >> least file copies need to be interruptible while clones should be not.
>> >> All this is already important for local file systems and even more
>> >> important for NFS exporting.
>> >>
>> >> So I'd suggest to drop this patch and just let your syscall handle
>> >> actualy copies with all their horrors.  We can go with Peng's patches
>> >> to generalize the btrfs ioctls for clones for now which is what everyone
>> >> already uses anyway, and then add a separate sys_file_clone later.
>>
>> So what I'm hearing is that I should drop the reflink and dedup flags and
>> change this system call only perform a full copy (with preserving of
>> sparseness), correct?  I can make those changes, but only if everybody is in
>> agreement that it's the best way forward.
>
> Sounds fine to me; I'll work on promoting EXTENT_SAME to the VFS.
>
>> The only reason I haven't done anything to make this system call
>> interruptible is because I haven't been able to find any documentation or
>> examples for making system calls interruptible.  How do I do this?
>
> I thought it was mostly a matter of sprinkling in "if (signal_pending(...))
> return -ERESTARTSYS" type things whenever it's convenient to check.  The splice
> code already seems to have this, though I'm no expert on what the splice code
> actually does. :)
>

Oh, right.  That's for making loops that don't otherwise block
interruptible.  If you're doing wait_xyz, then you want to use the
interruptable variable of that.

Anyway, I just checked on x86.  The relevant error codes are (I think):

-EINTR: returns -EINTR to userspace with no special handling.

-ERESTARTNOINTR: end the syscall, call a signal handler if
appropriate, then retry the syscall with the same arguments (i.e. the
syscall needs to make sure that trying again is an acceptable thing to
do by, e.g., updating offsets that are used).

-ERESTARTSYS: same as -ERESTARTNOINTR *unless* there's an unblocked
signal handler that has SA_RESTART clear, which which case the caller
gets -EINTR.

-ERESTARTNOHAND: end the syscall and retry with the same arguments if
no signal handler would be called; otherwise call the signal handler
and return -EINTR to the caller.

-ERESTART_RESTARTBLOCK: return -EINTR if a signal is delivered and
otherwise use the restart_block mechanism.  (Don't use that -- it's
evil.)

So -ERESTARTSYS is probably the most sensible thing to use under
normal circumstances.

--Andy

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-14 18:08         ` Andy Lutomirski
@ 2015-10-14 18:27           ` Christoph Hellwig
  2015-10-14 18:38               ` Andy Lutomirski
  2015-10-14 19:08               ` Austin S Hemmelgarn
  0 siblings, 2 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-14 18:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Anna Schumaker, Christoph Hellwig, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On Wed, Oct 14, 2015 at 11:08:40AM -0700, Andy Lutomirski wrote:
> > So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct?  I can make those changes, but only if everybody is in agreement that it's the best way forward.
> 
> I personally rather like the reflink option.  That thing is quite useful.

reflink is very useful, probably more useful than the copy actually. But it
is different from a copy.  It should be a separate interface.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file
@ 2015-10-14 18:27           ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-14 18:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nfs, linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	darrick.wong, mtk.manpages, andros

On 10/14/2015 02:25 PM, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 01:37:13PM -0400, Anna Schumaker wrote:
>> I would have folded this and patch 4 earlier if I had written patch 1,
>> but I didn't feel comfortable modifying Zach's work too much.  I can
>> make that change if it's not really a problem.
> 
> Folding the changes is perfectly fine, just make it clear you changed
> it, e.g.
> 
> Signed-off-by: Original Author <original@auhor.info>
> [anna: fixed foo & bar, rewrote changelog]
> Signed-off-by: Anna Schumaker <Anna.Schumaker@netapp.com>
> 

Okay, I'll do that.  Thanks!

Anna

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file
@ 2015-10-14 18:27           ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-14 18:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On 10/14/2015 02:25 PM, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 01:37:13PM -0400, Anna Schumaker wrote:
>> I would have folded this and patch 4 earlier if I had written patch 1,
>> but I didn't feel comfortable modifying Zach's work too much.  I can
>> make that change if it's not really a problem.
> 
> Folding the changes is perfectly fine, just make it clear you changed
> it, e.g.
> 
> Signed-off-by: Original Author <original-/77Cqtjy9Dqh6J55Ss3d3w@public.gmane.org>
> [anna: fixed foo & bar, rewrote changelog]
> Signed-off-by: Anna Schumaker <Anna.Schumaker-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> 

Okay, I'll do that.  Thanks!

Anna

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file
@ 2015-10-14 18:27           ` Anna Schumaker
  0 siblings, 0 replies; 129+ messages in thread
From: Anna Schumaker @ 2015-10-14 18:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	darrick.wong-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On 10/14/2015 02:25 PM, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 01:37:13PM -0400, Anna Schumaker wrote:
>> I would have folded this and patch 4 earlier if I had written patch 1,
>> but I didn't feel comfortable modifying Zach's work too much.  I can
>> make that change if it's not really a problem.
> 
> Folding the changes is perfectly fine, just make it clear you changed
> it, e.g.
> 
> Signed-off-by: Original Author <original-/77Cqtjy9Dqh6J55Ss3d3w@public.gmane.org>
> [anna: fixed foo & bar, rewrote changelog]
> Signed-off-by: Anna Schumaker <Anna.Schumaker-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> 

Okay, I'll do that.  Thanks!

Anna

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 18:38               ` Andy Lutomirski
  0 siblings, 0 replies; 129+ messages in thread
From: Andy Lutomirski @ 2015-10-14 18:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On Wed, Oct 14, 2015 at 11:27 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Oct 14, 2015 at 11:08:40AM -0700, Andy Lutomirski wrote:
>> > So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct?  I can make those changes, but only if everybody is in agreement that it's the best way forward.
>>
>> I personally rather like the reflink option.  That thing is quite useful.
>
> reflink is very useful, probably more useful than the copy actually. But it
> is different from a copy.  It should be a separate interface.

One might argue that reflink is like copy + immediate dedupe.  Also, I
can imagine there being network protocols over which you can't really
tell the difference between reflink and server-to-server copy.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 18:38               ` Andy Lutomirski
  0 siblings, 0 replies; 129+ messages in thread
From: Andy Lutomirski @ 2015-10-14 18:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anna Schumaker, Darrick J. Wong,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros-HgOvQuBEEgTQT0dZR+AlfA

On Wed, Oct 14, 2015 at 11:27 AM, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> On Wed, Oct 14, 2015 at 11:08:40AM -0700, Andy Lutomirski wrote:
>> > So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct?  I can make those changes, but only if everybody is in agreement that it's the best way forward.
>>
>> I personally rather like the reflink option.  That thing is quite useful.
>
> reflink is very useful, probably more useful than the copy actually. But it
> is different from a copy.  It should be a separate interface.

One might argue that reflink is like copy + immediate dedupe.  Also, I
can imagine there being network protocols over which you can't really
tell the difference between reflink and server-to-server copy.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-14 18:46               ` Darrick J. Wong
  0 siblings, 0 replies; 129+ messages in thread
From: Darrick J. Wong @ 2015-10-14 18:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: P??draig Brady, Anna Schumaker, linux-nfs, linux-btrfs,
	linux-fsdevel, linux-api, zab, viro, clm, mtk.manpages, andros

On Tue, Oct 13, 2015 at 12:29:59AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote:
> > One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
> > flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
> > leaves holes alone.
> 
> Yes, I've seen the implementation. 
> 
> > Obviously we haven't yet figured out what are peoples' preferences in terms of
> > "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
> > fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
> > that fills the holes while unsharing is going on.
> > 
> > Personally I suspect that the most interest is in filling holes and unsharing,
> > because they don't want to pay for allocation at a critical stage for anywhere
> > in the file.  But I could be wrong, so allowing both goals to be expressed via
> > mode allows flexibility.
> 
> Exactly.  And a normal falloc should do just that - fill holes and
> ensure that we don't need to COW already allocated locks.  So I don't
> think we need a new fallocate interface for that.

The documentation for fallocate ought to be updated to include that as part of
guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
shared blocks will be unshared.

Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
COW it probably doesn't make sense to unshare blocks anyway... but maybe I
also don't want to dive into btrfs f-allocation behavior at this time. :)

Ok, so I'll rework the XFS funshare code into something that hangs off the
regular fallocate call, and get rid of the explicit 'funshare' bits.

> The question is if we
> want a copy interface that gives you the same semantics as if you also
> called an fallocate on the destination range.  For that case we'd
> usually want to avoid doing the clone and instead do a in-kernel or
> hardware assisted copy and then fill the holes with unwritten extents.

Probably; I can easily imagine people wanting to fill the holes and also
not wanting them filled.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-14 18:46               ` Darrick J. Wong
  0 siblings, 0 replies; 129+ messages in thread
From: Darrick J. Wong @ 2015-10-14 18:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: P??draig Brady, Anna Schumaker, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Tue, Oct 13, 2015 at 12:29:59AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote:
> > One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
> > flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
> > leaves holes alone.
> 
> Yes, I've seen the implementation. 
> 
> > Obviously we haven't yet figured out what are peoples' preferences in terms of
> > "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
> > fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
> > that fills the holes while unsharing is going on.
> > 
> > Personally I suspect that the most interest is in filling holes and unsharing,
> > because they don't want to pay for allocation at a critical stage for anywhere
> > in the file.  But I could be wrong, so allowing both goals to be expressed via
> > mode allows flexibility.
> 
> Exactly.  And a normal falloc should do just that - fill holes and
> ensure that we don't need to COW already allocated locks.  So I don't
> think we need a new fallocate interface for that.

The documentation for fallocate ought to be updated to include that as part of
guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
shared blocks will be unshared.

Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
COW it probably doesn't make sense to unshare blocks anyway... but maybe I
also don't want to dive into btrfs f-allocation behavior at this time. :)

Ok, so I'll rework the XFS funshare code into something that hangs off the
regular fallocate call, and get rid of the explicit 'funshare' bits.

> The question is if we
> want a copy interface that gives you the same semantics as if you also
> called an fallocate on the destination range.  For that case we'd
> usually want to avoid doing the clone and instead do a in-kernel or
> hardware assisted copy and then fill the holes with unwritten extents.

Probably; I can easily imagine people wanting to fill the holes and also
not wanting them filled.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 18:49                 ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-14 18:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On Wed, Oct 14, 2015 at 11:38:13AM -0700, Andy Lutomirski wrote:
> One might argue that reflink is like copy + immediate dedupe.

Not, it's not.  It's all that and more, because it is an operation that
is atomic vs other writes to the file and it's an operation that either
clones the whole range or nothing.  That's a very important difference.

> Also, I
> can imagine there being network protocols over which you can't really
> tell the difference between reflink and server-to-server copy.

For NFS we specificly have a CLONE and a COPY operations so that smart
servers can support the proper clone, and dumb servers still get copy
offload.  Other protocols might only be able to support COPY if they
don't have a CLONE primitive.  Note that a clone also always is a valid
copy, just with much simpler an at the same time more useful semantics.
Take a look at the NFSv4.2 sections for CLONE vs COPY if you're
interested.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 18:49                 ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-14 18:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Anna Schumaker, Darrick J. Wong,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros-HgOvQuBEEgTQT0dZR+AlfA

On Wed, Oct 14, 2015 at 11:38:13AM -0700, Andy Lutomirski wrote:
> One might argue that reflink is like copy + immediate dedupe.

Not, it's not.  It's all that and more, because it is an operation that
is atomic vs other writes to the file and it's an operation that either
clones the whole range or nothing.  That's a very important difference.

> Also, I
> can imagine there being network protocols over which you can't really
> tell the difference between reflink and server-to-server copy.

For NFS we specificly have a CLONE and a COPY operations so that smart
servers can support the proper clone, and dumb servers still get copy
offload.  Other protocols might only be able to support COPY if they
don't have a CLONE primitive.  Note that a clone also always is a valid
copy, just with much simpler an at the same time more useful semantics.
Take a look at the NFSv4.2 sections for CLONE vs COPY if you're
interested.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-14 18:49                 ` Christoph Hellwig
  (?)
@ 2015-10-14 18:53                 ` Andy Lutomirski
  2015-10-14 19:14                     ` Austin S Hemmelgarn
  2015-10-15  5:56                     ` Christoph Hellwig
  -1 siblings, 2 replies; 129+ messages in thread
From: Andy Lutomirski @ 2015-10-14 18:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On Wed, Oct 14, 2015 at 11:49 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Oct 14, 2015 at 11:38:13AM -0700, Andy Lutomirski wrote:
>> One might argue that reflink is like copy + immediate dedupe.
>
> Not, it's not.  It's all that and more, because it is an operation that
> is atomic vs other writes to the file and it's an operation that either
> clones the whole range or nothing.  That's a very important difference.

Fair enough.

Would copy_file_range without the reflink option removed still be
permitted to link blocks on supported filesystems (btrfs and maybe
XFS)?

--Andy

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 19:08               ` Austin S Hemmelgarn
  0 siblings, 0 replies; 129+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-14 19:08 UTC (permalink / raw)
  To: Christoph Hellwig, Andy Lutomirski
  Cc: Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

[-- Attachment #1: Type: text/plain, Size: 2399 bytes --]

On 2015-10-14 14:27, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 11:08:40AM -0700, Andy Lutomirski wrote:
>>> So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct?  I can make those changes, but only if everybody is in agreement that it's the best way forward.
>>
>> I personally rather like the reflink option.  That thing is quite useful.
>
> reflink is very useful, probably more useful than the copy actually. But it
> is different from a copy.  It should be a separate interface.
Whether or not reflink is different from a copy is entirely a matter of 
who is looking at it.  For someone looking directly at the block device, 
or trying to manipulate the block layout of the filesystem it is 
definitely not a copy.  For a database app that needs ACID transaction 
semantics, it is definitely not a copy (although for that usage, it's 
arguably significantly better than a copy).  From the point of view of a 
generic userspace app that didn't perform the copy operation however, 
and for anyone looking at it after the fact without paying attention to 
the block layout, a reflink _is_ for all intents and purposes 
functionally equivalent to a copy of the reflinked data (assuming of 
course that the filesystem implements it properly, and that the hardware 
behaves right).

I would not in fact be surprised if at least some SCSI devices that 
implement the XCOPY command do so internally using a reflink (I have not 
personally read the standard, but even if it 'requires' a compliant 
device to actually create a separate copy of the data, there will still 
be some vendors who ignore this), and it is well known that some SSD's 
do in-band data deduplication effectively reducing a traditional copy to 
a reflink at the firmware level.

I agree that we shouldn't try to make a reflink by default (less than 
intelligent programmers won't read the docs completely, and will make 
various stupid assumptions about how this is 'supposed' to work, making 
the defaults less ambiguous is a good thing), but it makes sense (at 
least, it does to me) to have the ability to say 'make this block of 
data appear at this location as well, I don't care how you do it as long 
as they are functionally independent for userspace applications'.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 19:08               ` Austin S Hemmelgarn
  0 siblings, 0 replies; 129+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-14 19:08 UTC (permalink / raw)
  To: Christoph Hellwig, Andy Lutomirski
  Cc: Anna Schumaker, Darrick J. Wong,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros-HgOvQuBEEgTQT0dZR+AlfA

[-- Attachment #1: Type: text/plain, Size: 2399 bytes --]

On 2015-10-14 14:27, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 11:08:40AM -0700, Andy Lutomirski wrote:
>>> So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct?  I can make those changes, but only if everybody is in agreement that it's the best way forward.
>>
>> I personally rather like the reflink option.  That thing is quite useful.
>
> reflink is very useful, probably more useful than the copy actually. But it
> is different from a copy.  It should be a separate interface.
Whether or not reflink is different from a copy is entirely a matter of 
who is looking at it.  For someone looking directly at the block device, 
or trying to manipulate the block layout of the filesystem it is 
definitely not a copy.  For a database app that needs ACID transaction 
semantics, it is definitely not a copy (although for that usage, it's 
arguably significantly better than a copy).  From the point of view of a 
generic userspace app that didn't perform the copy operation however, 
and for anyone looking at it after the fact without paying attention to 
the block layout, a reflink _is_ for all intents and purposes 
functionally equivalent to a copy of the reflinked data (assuming of 
course that the filesystem implements it properly, and that the hardware 
behaves right).

I would not in fact be surprised if at least some SCSI devices that 
implement the XCOPY command do so internally using a reflink (I have not 
personally read the standard, but even if it 'requires' a compliant 
device to actually create a separate copy of the data, there will still 
be some vendors who ignore this), and it is well known that some SSD's 
do in-band data deduplication effectively reducing a traditional copy to 
a reflink at the firmware level.

I agree that we shouldn't try to make a reflink by default (less than 
intelligent programmers won't read the docs completely, and will make 
various stupid assumptions about how this is 'supposed' to work, making 
the defaults less ambiguous is a good thing), but it makes sense (at 
least, it does to me) to have the ability to say 'make this block of 
data appear at this location as well, I don't care how you do it as long 
as they are functionally independent for userspace applications'.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 19:14                     ` Austin S Hemmelgarn
  0 siblings, 0 replies; 129+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-14 19:14 UTC (permalink / raw)
  To: Andy Lutomirski, Christoph Hellwig
  Cc: Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

[-- Attachment #1: Type: text/plain, Size: 835 bytes --]

On 2015-10-14 14:53, Andy Lutomirski wrote:
> On Wed, Oct 14, 2015 at 11:49 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Wed, Oct 14, 2015 at 11:38:13AM -0700, Andy Lutomirski wrote:
>>> One might argue that reflink is like copy + immediate dedupe.
>>
>> Not, it's not.  It's all that and more, because it is an operation that
>> is atomic vs other writes to the file and it's an operation that either
>> clones the whole range or nothing.  That's a very important difference.
>
> Fair enough.
>
> Would copy_file_range without the reflink option removed still be
> permitted to link blocks on supported filesystems (btrfs and maybe
> XFS)?
I would argue that it should have such functionality, but not do so by 
default (maybe add some option to tell it to ask the FS to accelerate 
the copy operation?).


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 19:14                     ` Austin S Hemmelgarn
  0 siblings, 0 replies; 129+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-14 19:14 UTC (permalink / raw)
  To: Andy Lutomirski, Christoph Hellwig
  Cc: Anna Schumaker, Darrick J. Wong,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros-HgOvQuBEEgTQT0dZR+AlfA

[-- Attachment #1: Type: text/plain, Size: 861 bytes --]

On 2015-10-14 14:53, Andy Lutomirski wrote:
> On Wed, Oct 14, 2015 at 11:49 AM, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
>> On Wed, Oct 14, 2015 at 11:38:13AM -0700, Andy Lutomirski wrote:
>>> One might argue that reflink is like copy + immediate dedupe.
>>
>> Not, it's not.  It's all that and more, because it is an operation that
>> is atomic vs other writes to the file and it's an operation that either
>> clones the whole range or nothing.  That's a very important difference.
>
> Fair enough.
>
> Would copy_file_range without the reflink option removed still be
> permitted to link blocks on supported filesystems (btrfs and maybe
> XFS)?
I would argue that it should have such functionality, but not do so by 
default (maybe add some option to tell it to ask the FS to accelerate 
the copy operation?).


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-14 19:14                     ` Austin S Hemmelgarn
@ 2015-10-14 19:39                       ` Pádraig Brady
  -1 siblings, 0 replies; 129+ messages in thread
From: Pádraig Brady @ 2015-10-14 19:39 UTC (permalink / raw)
  To: Austin S Hemmelgarn, Andy Lutomirski, Christoph Hellwig
  Cc: Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On 14/10/15 20:14, Austin S Hemmelgarn wrote:
> On 2015-10-14 14:53, Andy Lutomirski wrote:
>> On Wed, Oct 14, 2015 at 11:49 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>> On Wed, Oct 14, 2015 at 11:38:13AM -0700, Andy Lutomirski wrote:
>>>> One might argue that reflink is like copy + immediate dedupe.
>>>
>>> Not, it's not.  It's all that and more, because it is an operation that
>>> is atomic vs other writes to the file and it's an operation that either
>>> clones the whole range or nothing.  That's a very important difference.
>>
>> Fair enough.
>>
>> Would copy_file_range without the reflink option removed still be
>> permitted to link blocks on supported filesystems (btrfs and maybe
>> XFS)?
> I would argue that it should have such functionality, but not do so by 
> default (maybe add some option to tell it to ask the FS to accelerate 
> the copy operation?).

Heh, so back to the REFLINK flag :)
TBH given the overlap between "copy" and "reflink",
I quite like the REFLINK flag as a general interface to reflink.

thanks,
Pádraig

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-14 19:39                       ` Pádraig Brady
  0 siblings, 0 replies; 129+ messages in thread
From: Pádraig Brady @ 2015-10-14 19:39 UTC (permalink / raw)
  To: Austin S Hemmelgarn, Andy Lutomirski, Christoph Hellwig
  Cc: Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On 14/10/15 20:14, Austin S Hemmelgarn wrote:
> On 2015-10-14 14:53, Andy Lutomirski wrote:
>> On Wed, Oct 14, 2015 at 11:49 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>> On Wed, Oct 14, 2015 at 11:38:13AM -0700, Andy Lutomirski wrote:
>>>> One might argue that reflink is like copy + immediate dedupe.
>>>
>>> Not, it's not.  It's all that and more, because it is an operation that
>>> is atomic vs other writes to the file and it's an operation that either
>>> clones the whole range or nothing.  That's a very important difference.
>>
>> Fair enough.
>>
>> Would copy_file_range without the reflink option removed still be
>> permitted to link blocks on supported filesystems (btrfs and maybe
>> XFS)?
> I would argue that it should have such functionality, but not do so by 
> default (maybe add some option to tell it to ask the FS to accelerate 
> the copy operation?).

Heh, so back to the REFLINK flag :)
TBH given the overlap between "copy" and "reflink",
I quite like the REFLINK flag as a general interface to reflink.

thanks,
Pádraig
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-15  5:56                     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-15  5:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On Wed, Oct 14, 2015 at 11:53:45AM -0700, Andy Lutomirski wrote:
> Would copy_file_range without the reflink option removed still be
> permitted to link blocks on supported filesystems (btrfs and maybe
> XFS)?

Absolutely.  Unless the COPY_FALLOCATE or whatever we call it option is
specified of course.  But I'd really love to get basic copy
infrastructure in for 4.4 and then define these options later.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-15  5:56                     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-15  5:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Anna Schumaker, Darrick J. Wong,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros-HgOvQuBEEgTQT0dZR+AlfA

On Wed, Oct 14, 2015 at 11:53:45AM -0700, Andy Lutomirski wrote:
> Would copy_file_range without the reflink option removed still be
> permitted to link blocks on supported filesystems (btrfs and maybe
> XFS)?

Absolutely.  Unless the COPY_FALLOCATE or whatever we call it option is
specified of course.  But I'd really love to get basic copy
infrastructure in for 4.4 and then define these options later.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
  2015-10-14 18:46               ` Darrick J. Wong
  (?)
@ 2015-10-15  6:00               ` Christoph Hellwig
  2015-10-16 11:49                   ` Chris Mason
  -1 siblings, 1 reply; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-15  6:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, P??draig Brady, Anna Schumaker, linux-nfs,
	linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	mtk.manpages, andros

On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote:
> The documentation for fallocate ought to be updated to include that as part of
> guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
> shared blocks will be unshared.
> 
> Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
> COW it probably doesn't make sense to unshare blocks anyway... but maybe I
> also don't want to dive into btrfs f-allocation behavior at this time. :)
> 
> Ok, so I'll rework the XFS funshare code into something that hangs off the
> regular fallocate call, and get rid of the explicit 'funshare' bits.

Yes, that would be my preference.  I'd also like to understand what
exactly btrfs does in fallocate.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-15  6:36                 ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-15  6:36 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Andy Lutomirski, Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On Wed, Oct 14, 2015 at 03:08:46PM -0400, Austin S Hemmelgarn wrote:
> Whether or not reflink is different from a copy is entirely a matter of who
> is looking at it.

So what?  I've been trying to explain why clone semantics matter, and
I've not seen a counter argument for that.  I've also explained a couple
times that a valid clone always is a valid copy, and I've only heard
some slight disagreement, and so far none as long as we take the
COPY_FALLOCATE option into account.

Note that all of this also applies to storage devices - any smart array
will do a clone-like operation underneath an XCOPY, but so far SCSI
doesn't provide full clone _semantics_ even if you can emulate a lot of
it using a lot of complexity around ROD tokens.

Similar at the SCSI level you can perform a fallocate-like operation
using the anchor bit in the UNMAP or WRITE SAME commands.

> I agree that we shouldn't try to make a reflink by default (less than
> intelligent programmers won't read the docs completely, and will make
> various stupid assumptions about how this is 'supposed' to work, making the
> defaults less ambiguous is a good thing), but it makes sense (at least, it
> does to me) to have the ability to say 'make this block of data appear at
> this location as well, I don't care how you do it as long as they are
> functionally independent for userspace applications'.

Yes, we absolutely should use reflink as a default implementation for
copy where available.

But we also need a clone or reflink interface that only gives us well
specified reflink semantics, and not the much weaker copy semantics.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-15  6:36                 ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-15  6:36 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Andy Lutomirski, Anna Schumaker, Darrick J. Wong,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros-HgOvQuBEEgTQT0dZR+AlfA

On Wed, Oct 14, 2015 at 03:08:46PM -0400, Austin S Hemmelgarn wrote:
> Whether or not reflink is different from a copy is entirely a matter of who
> is looking at it.

So what?  I've been trying to explain why clone semantics matter, and
I've not seen a counter argument for that.  I've also explained a couple
times that a valid clone always is a valid copy, and I've only heard
some slight disagreement, and so far none as long as we take the
COPY_FALLOCATE option into account.

Note that all of this also applies to storage devices - any smart array
will do a clone-like operation underneath an XCOPY, but so far SCSI
doesn't provide full clone _semantics_ even if you can emulate a lot of
it using a lot of complexity around ROD tokens.

Similar at the SCSI level you can perform a fallocate-like operation
using the anchor bit in the UNMAP or WRITE SAME commands.

> I agree that we shouldn't try to make a reflink by default (less than
> intelligent programmers won't read the docs completely, and will make
> various stupid assumptions about how this is 'supposed' to work, making the
> defaults less ambiguous is a good thing), but it makes sense (at least, it
> does to me) to have the ability to say 'make this block of data appear at
> this location as well, I don't care how you do it as long as they are
> functionally independent for userspace applications'.

Yes, we absolutely should use reflink as a default implementation for
copy where available.

But we also need a clone or reflink interface that only gives us well
specified reflink semantics, and not the much weaker copy semantics.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-15  8:35                 ` Dave Chinner
  0 siblings, 0 replies; 129+ messages in thread
From: Dave Chinner @ 2015-10-15  8:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, P??draig Brady, Anna Schumaker, linux-nfs,
	linux-btrfs, linux-fsdevel, linux-api, zab, viro, clm,
	mtk.manpages, andros

On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 13, 2015 at 12:29:59AM -0700, Christoph Hellwig wrote:
> > On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote:
> > > One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
> > > flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
> > > leaves holes alone.
> > 
> > Yes, I've seen the implementation. 
> > 
> > > Obviously we haven't yet figured out what are peoples' preferences in terms of
> > > "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
> > > fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
> > > that fills the holes while unsharing is going on.
> > > 
> > > Personally I suspect that the most interest is in filling holes and unsharing,
> > > because they don't want to pay for allocation at a critical stage for anywhere
> > > in the file.  But I could be wrong, so allowing both goals to be expressed via
> > > mode allows flexibility.
> > 
> > Exactly.  And a normal falloc should do just that - fill holes and
> > ensure that we don't need to COW already allocated locks.  So I don't
> > think we need a new fallocate interface for that.
> 
> The documentation for fallocate ought to be updated to include that as part of
> guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
> shared blocks will be unshared.
> 
> Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
> COW it probably doesn't make sense to unshare blocks anyway... but maybe I
> also don't want to dive into btrfs f-allocation behavior at this time. :)
> 
> Ok, so I'll rework the XFS funshare code into something that hangs off the
> regular fallocate call, and get rid of the explicit 'funshare' bits.

Makes sense given we have the FALLOC_FL_ZERO_RANGE operation which
returns a zero to and preallocates all the holes in the range. I
would expect this operation on shared blocks to unshare blocks,
too...

> > The question is if we
> > want a copy interface that gives you the same semantics as if you also
> > called an fallocate on the destination range.  For that case we'd
> > usually want to avoid doing the clone and instead do a in-kernel or
> > hardware assisted copy and then fill the holes with unwritten extents.

If hole filling was required, then I'd do the operation the other
way around - prealloc the entire range, then do hardware assisted
copy of each separate data range in the source file with unwritten
extent conversion on offload completion...

> Probably; I can easily imagine people wanting to fill the holes and also
> not wanting them filled.

*nod*.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-15  8:35                 ` Dave Chinner
  0 siblings, 0 replies; 129+ messages in thread
From: Dave Chinner @ 2015-10-15  8:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, P??draig Brady, Anna Schumaker,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, clm-b10kYP2dOMg,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 13, 2015 at 12:29:59AM -0700, Christoph Hellwig wrote:
> > On Mon, Oct 12, 2015 at 04:41:06PM -0700, Darrick J. Wong wrote:
> > > One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
> > > flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
> > > leaves holes alone.
> > 
> > Yes, I've seen the implementation. 
> > 
> > > Obviously we haven't yet figured out what are peoples' preferences in terms of
> > > "fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
> > > fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
> > > that fills the holes while unsharing is going on.
> > > 
> > > Personally I suspect that the most interest is in filling holes and unsharing,
> > > because they don't want to pay for allocation at a critical stage for anywhere
> > > in the file.  But I could be wrong, so allowing both goals to be expressed via
> > > mode allows flexibility.
> > 
> > Exactly.  And a normal falloc should do just that - fill holes and
> > ensure that we don't need to COW already allocated locks.  So I don't
> > think we need a new fallocate interface for that.
> 
> The documentation for fallocate ought to be updated to include that as part of
> guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
> shared blocks will be unshared.
> 
> Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
> COW it probably doesn't make sense to unshare blocks anyway... but maybe I
> also don't want to dive into btrfs f-allocation behavior at this time. :)
> 
> Ok, so I'll rework the XFS funshare code into something that hangs off the
> regular fallocate call, and get rid of the explicit 'funshare' bits.

Makes sense given we have the FALLOC_FL_ZERO_RANGE operation which
returns a zero to and preallocates all the holes in the range. I
would expect this operation on shared blocks to unshare blocks,
too...

> > The question is if we
> > want a copy interface that gives you the same semantics as if you also
> > called an fallocate on the destination range.  For that case we'd
> > usually want to avoid doing the clone and instead do a in-kernel or
> > hardware assisted copy and then fill the holes with unwritten extents.

If hole filling was required, then I'd do the operation the other
way around - prealloc the entire range, then do hardware assisted
copy of each separate data range in the source file with unwritten
extent conversion on offload completion...

> Probably; I can easily imagine people wanting to fill the holes and also
> not wanting them filled.

*nod*.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-15 12:24                   ` Austin S Hemmelgarn
  0 siblings, 0 replies; 129+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-15 12:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Lutomirski, Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

[-- Attachment #1: Type: text/plain, Size: 2234 bytes --]

On 2015-10-15 02:36, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 03:08:46PM -0400, Austin S Hemmelgarn wrote:
>> Whether or not reflink is different from a copy is entirely a matter of who
>> is looking at it.
>
> So what?  I've been trying to explain why clone semantics matter, and
> I've not seen a counter argument for that.  I've also explained a couple
> times that a valid clone always is a valid copy, and I've only heard
> some slight disagreement, and so far none as long as we take the
> COPY_FALLOCATE option into account.
>
> Note that all of this also applies to storage devices - any smart array
> will do a clone-like operation underneath an XCOPY, but so far SCSI
> doesn't provide full clone _semantics_ even if you can emulate a lot of
> it using a lot of complexity around ROD tokens.
>
> Similar at the SCSI level you can perform a fallocate-like operation
> using the anchor bit in the UNMAP or WRITE SAME commands.
>
>> I agree that we shouldn't try to make a reflink by default (less than
>> intelligent programmers won't read the docs completely, and will make
>> various stupid assumptions about how this is 'supposed' to work, making the
>> defaults less ambiguous is a good thing), but it makes sense (at least, it
>> does to me) to have the ability to say 'make this block of data appear at
>> this location as well, I don't care how you do it as long as they are
>> functionally independent for userspace applications'.
>
> Yes, we absolutely should use reflink as a default implementation for
> copy where available.
>
> But we also need a clone or reflink interface that only gives us well
> specified reflink semantics, and not the much weaker copy semantics.
>
Ah, I was completely misunderstanding your meaning, sorry about any 
confusion that I may have caused as a result of this.

My only point with saying we shouldn't reflink by default is that there 
are many (unintelligent) people who will assume that since the syscall 
has copy in it's name, that's what it will do; and, while I don't think 
we should cater to such individuals, it does make sense to have a 
syscall that says in it's name that it copies data actually do so by 
default.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-15 12:24                   ` Austin S Hemmelgarn
  0 siblings, 0 replies; 129+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-15 12:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Lutomirski, Anna Schumaker, Darrick J. Wong,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros-HgOvQuBEEgTQT0dZR+AlfA

[-- Attachment #1: Type: text/plain, Size: 2234 bytes --]

On 2015-10-15 02:36, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 03:08:46PM -0400, Austin S Hemmelgarn wrote:
>> Whether or not reflink is different from a copy is entirely a matter of who
>> is looking at it.
>
> So what?  I've been trying to explain why clone semantics matter, and
> I've not seen a counter argument for that.  I've also explained a couple
> times that a valid clone always is a valid copy, and I've only heard
> some slight disagreement, and so far none as long as we take the
> COPY_FALLOCATE option into account.
>
> Note that all of this also applies to storage devices - any smart array
> will do a clone-like operation underneath an XCOPY, but so far SCSI
> doesn't provide full clone _semantics_ even if you can emulate a lot of
> it using a lot of complexity around ROD tokens.
>
> Similar at the SCSI level you can perform a fallocate-like operation
> using the anchor bit in the UNMAP or WRITE SAME commands.
>
>> I agree that we shouldn't try to make a reflink by default (less than
>> intelligent programmers won't read the docs completely, and will make
>> various stupid assumptions about how this is 'supposed' to work, making the
>> defaults less ambiguous is a good thing), but it makes sense (at least, it
>> does to me) to have the ability to say 'make this block of data appear at
>> this location as well, I don't care how you do it as long as they are
>> functionally independent for userspace applications'.
>
> Yes, we absolutely should use reflink as a default implementation for
> copy where available.
>
> But we also need a clone or reflink interface that only gives us well
> specified reflink semantics, and not the much weaker copy semantics.
>
Ah, I was completely misunderstanding your meaning, sorry about any 
confusion that I may have caused as a result of this.

My only point with saying we shouldn't reflink by default is that there 
are many (unintelligent) people who will assume that since the syscall 
has copy in it's name, that's what it will do; and, while I don't think 
we should cater to such individuals, it does make sense to have a 
syscall that says in it's name that it copies data actually do so by 
default.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-16  5:38                     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-16  5:38 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Christoph Hellwig, Andy Lutomirski, Anna Schumaker,
	Darrick J. Wong, linux-nfs, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros

On Thu, Oct 15, 2015 at 08:24:51AM -0400, Austin S Hemmelgarn wrote:
> My only point with saying we shouldn't reflink by default is that there are
> many (unintelligent) people who will assume that since the syscall has copy
> in it's name, that's what it will do; and, while I don't think we should
> cater to such individuals, it does make sense to have a syscall that says in
> it's name that it copies data actually do so by default.

As far as the user is concerned a reflink is a copy.  A very efficient
copy.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-16  5:38                     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-16  5:38 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Christoph Hellwig, Andy Lutomirski, Anna Schumaker,
	Darrick J. Wong, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Thu, Oct 15, 2015 at 08:24:51AM -0400, Austin S Hemmelgarn wrote:
> My only point with saying we shouldn't reflink by default is that there are
> many (unintelligent) people who will assume that since the syscall has copy
> in it's name, that's what it will do; and, while I don't think we should
> cater to such individuals, it does make sense to have a syscall that says in
> it's name that it copies data actually do so by default.

As far as the user is concerned a reflink is a copy.  A very efficient
copy.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-16  5:38                     ` Christoph Hellwig
  (?)
@ 2015-10-16 11:46                     ` Austin S Hemmelgarn
  2015-10-16 12:02                         ` Pádraig Brady
  2015-10-16 12:21                         ` Christoph Hellwig
  -1 siblings, 2 replies; 129+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-16 11:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Lutomirski, Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

[-- Attachment #1: Type: text/plain, Size: 1012 bytes --]

On 2015-10-16 01:38, Christoph Hellwig wrote:
> On Thu, Oct 15, 2015 at 08:24:51AM -0400, Austin S Hemmelgarn wrote:
>> My only point with saying we shouldn't reflink by default is that there are
>> many (unintelligent) people who will assume that since the syscall has copy
>> in it's name, that's what it will do; and, while I don't think we should
>> cater to such individuals, it does make sense to have a syscall that says in
>> it's name that it copies data actually do so by default.
>
> As far as the user is concerned a reflink is a copy.  A very efficient
> copy.
I should have been specific, what I meant was that some people will 
assume that it actually creates a physical, on-disk byte-for-byte copy 
of the data.  There are many people out there (and sadly I have to deal 
with some at work) who are absolutely terrified of the concept of data 
deduplication, and will likely refuse to use this syscall for _anything_ 
if it reflinks by default on filesystems that support it.



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-16 11:49                   ` Chris Mason
  0 siblings, 0 replies; 129+ messages in thread
From: Chris Mason @ 2015-10-16 11:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, P??draig Brady, Anna Schumaker, linux-nfs,
	linux-btrfs, linux-fsdevel, linux-api, zab, viro, mtk.manpages,
	andros

On Wed, Oct 14, 2015 at 11:00:45PM -0700, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote:
> > The documentation for fallocate ought to be updated to include that as part of
> > guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
> > shared blocks will be unshared.
> > 
> > Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
> > COW it probably doesn't make sense to unshare blocks anyway... but maybe I
> > also don't want to dive into btrfs f-allocation behavior at this time. :)
> > 
> > Ok, so I'll rework the XFS funshare code into something that hangs off the
> > regular fallocate call, and get rid of the explicit 'funshare' bits.
> 
> Yes, that would be my preference.  I'd also like to understand what
> exactly btrfs does in fallocate.

For which part?  The answer changes based on how many references there
are to a given fallocated region.

-chris

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-16 11:49                   ` Chris Mason
  0 siblings, 0 replies; 129+ messages in thread
From: Chris Mason @ 2015-10-16 11:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, P??draig Brady, Anna Schumaker,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Wed, Oct 14, 2015 at 11:00:45PM -0700, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote:
> > The documentation for fallocate ought to be updated to include that as part of
> > guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
> > shared blocks will be unshared.
> > 
> > Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
> > COW it probably doesn't make sense to unshare blocks anyway... but maybe I
> > also don't want to dive into btrfs f-allocation behavior at this time. :)
> > 
> > Ok, so I'll rework the XFS funshare code into something that hangs off the
> > regular fallocate call, and get rid of the explicit 'funshare' bits.
> 
> Yes, that would be my preference.  I'd also like to understand what
> exactly btrfs does in fallocate.

For which part?  The answer changes based on how many references there
are to a given fallocated region.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-16 11:49                   ` Chris Mason
  0 siblings, 0 replies; 129+ messages in thread
From: Chris Mason @ 2015-10-16 11:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, P??draig Brady, Anna Schumaker,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Wed, Oct 14, 2015 at 11:00:45PM -0700, Christoph Hellwig wrote:
> On Wed, Oct 14, 2015 at 11:46:08AM -0700, Darrick J. Wong wrote:
> > The documentation for fallocate ought to be updated to include that as part of
> > guaranteeing that subsequent writes to the range won't fail due to ENOSPC,
> > shared blocks will be unshared.
> > 
> > Incidentally, btrfs leaves shared blocks alone.  OTOH, given that it's totally
> > COW it probably doesn't make sense to unshare blocks anyway... but maybe I
> > also don't want to dive into btrfs f-allocation behavior at this time. :)
> > 
> > Ok, so I'll rework the XFS funshare code into something that hangs off the
> > regular fallocate call, and get rid of the explicit 'funshare' bits.
> 
> Yes, that would be my preference.  I'd also like to understand what
> exactly btrfs does in fallocate.

For which part?  The answer changes based on how many references there
are to a given fallocated region.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-16 11:46                     ` Austin S Hemmelgarn
@ 2015-10-16 12:02                         ` Pádraig Brady
  2015-10-16 12:21                         ` Christoph Hellwig
  1 sibling, 0 replies; 129+ messages in thread
From: Pádraig Brady @ 2015-10-16 12:02 UTC (permalink / raw)
  To: Austin S Hemmelgarn, Christoph Hellwig
  Cc: Andy Lutomirski, Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On 16/10/15 12:46, Austin S Hemmelgarn wrote:
> On 2015-10-16 01:38, Christoph Hellwig wrote:
>> On Thu, Oct 15, 2015 at 08:24:51AM -0400, Austin S Hemmelgarn wrote:
>>> My only point with saying we shouldn't reflink by default is that there are
>>> many (unintelligent) people who will assume that since the syscall has copy
>>> in it's name, that's what it will do; and, while I don't think we should
>>> cater to such individuals, it does make sense to have a syscall that says in
>>> it's name that it copies data actually do so by default.
>>
>> As far as the user is concerned a reflink is a copy.  A very efficient
>> copy.
> I should have been specific, what I meant was that some people will 
> assume that it actually creates a physical, on-disk byte-for-byte copy 
> of the data.  There are many people out there (and sadly I have to deal 
> with some at work) who are absolutely terrified of the concept of data 
> deduplication, and will likely refuse to use this syscall for _anything_ 
> if it reflinks by default on filesystems that support it.

Right. reflinking is transparent to the user, though its consequences are not.
Consequences being the possible extra latency or ENOSPC on CoW.
Therefore reflinking should be an explicit action/flag IMHO.

cheers,
Pádraig.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-16 12:02                         ` Pádraig Brady
  0 siblings, 0 replies; 129+ messages in thread
From: Pádraig Brady @ 2015-10-16 12:02 UTC (permalink / raw)
  To: Austin S Hemmelgarn, Christoph Hellwig
  Cc: Andy Lutomirski, Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On 16/10/15 12:46, Austin S Hemmelgarn wrote:
> On 2015-10-16 01:38, Christoph Hellwig wrote:
>> On Thu, Oct 15, 2015 at 08:24:51AM -0400, Austin S Hemmelgarn wrote:
>>> My only point with saying we shouldn't reflink by default is that there are
>>> many (unintelligent) people who will assume that since the syscall has copy
>>> in it's name, that's what it will do; and, while I don't think we should
>>> cater to such individuals, it does make sense to have a syscall that says in
>>> it's name that it copies data actually do so by default.
>>
>> As far as the user is concerned a reflink is a copy.  A very efficient
>> copy.
> I should have been specific, what I meant was that some people will 
> assume that it actually creates a physical, on-disk byte-for-byte copy 
> of the data.  There are many people out there (and sadly I have to deal 
> with some at work) who are absolutely terrified of the concept of data 
> deduplication, and will likely refuse to use this syscall for _anything_ 
> if it reflinks by default on filesystems that support it.

Right. reflinking is transparent to the user, though its consequences are not.
Consequences being the possible extra latency or ENOSPC on CoW.
Therefore reflinking should be an explicit action/flag IMHO.

cheers,
Pádraig.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-16 12:21                         ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-16 12:21 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Christoph Hellwig, Andy Lutomirski, Anna Schumaker,
	Darrick J. Wong, linux-nfs, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros

On Fri, Oct 16, 2015 at 07:46:41AM -0400, Austin S Hemmelgarn wrote:
> I should have been specific, what I meant was that some people will assume
> that it actually creates a physical, on-disk byte-for-byte copy of the data.
> There are many people out there (and sadly I have to deal with some at work)
> who are absolutely terrified of the concept of data deduplication, and will
> likely refuse to use this syscall for _anything_ if it reflinks by default
> on filesystems that support it.

If they use a file system that supports COW or dedup they are toast
already.  It's not the system call that does the 'dedup', it's the file
system or storage device, that's where they need to set their
preferences.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-16 12:21                         ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-16 12:21 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Christoph Hellwig, Andy Lutomirski, Anna Schumaker,
	Darrick J. Wong, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Fri, Oct 16, 2015 at 07:46:41AM -0400, Austin S Hemmelgarn wrote:
> I should have been specific, what I meant was that some people will assume
> that it actually creates a physical, on-disk byte-for-byte copy of the data.
> There are many people out there (and sadly I have to deal with some at work)
> who are absolutely terrified of the concept of data deduplication, and will
> likely refuse to use this syscall for _anything_ if it reflinks by default
> on filesystems that support it.

If they use a file system that supports COW or dedup they are toast
already.  It's not the system call that does the 'dedup', it's the file
system or storage device, that's where they need to set their
preferences.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-16 12:24                           ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-16 12:24 UTC (permalink / raw)
  To: P??draig Brady
  Cc: Austin S Hemmelgarn, Christoph Hellwig, Andy Lutomirski,
	Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

On Fri, Oct 16, 2015 at 01:02:23PM +0100, P??draig Brady wrote:
> Right. reflinking is transparent to the user, though its consequences are not.
> Consequences being the possible extra latency or ENOSPC on CoW.

You can get all these consequences without doing the file system reflink
by using a COW file system, any dedup scheme or thinly provisioned or
COW storage devices.

> Therefore reflinking should be an explicit action/flag IMHO.

This still does not make any sense, as it only prevents one of many
ways a file could do COW operations underneath.  If you don't want
ENOSPC use fallocate, or the proposed COPY_FALLOC flag.  If you want
care about latency you need to carefull benchmark your setup but in
general falloc / COPY_FALLOC might be a good starting point.  But for
99% of the copies a reflink is exactly the right thing to do.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-16 12:24                           ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-16 12:24 UTC (permalink / raw)
  To: P??draig Brady
  Cc: Austin S Hemmelgarn, Christoph Hellwig, Andy Lutomirski,
	Anna Schumaker, Darrick J. Wong,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros-HgOvQuBEEgTQT0dZR+AlfA

On Fri, Oct 16, 2015 at 01:02:23PM +0100, P??draig Brady wrote:
> Right. reflinking is transparent to the user, though its consequences are not.
> Consequences being the possible extra latency or ENOSPC on CoW.

You can get all these consequences without doing the file system reflink
by using a COW file system, any dedup scheme or thinly provisioned or
COW storage devices.

> Therefore reflinking should be an explicit action/flag IMHO.

This still does not make any sense, as it only prevents one of many
ways a file could do COW operations underneath.  If you don't want
ENOSPC use fallocate, or the proposed COPY_FALLOC flag.  If you want
care about latency you need to carefull benchmark your setup but in
general falloc / COPY_FALLOC might be a good starting point.  But for
99% of the copies a reflink is exactly the right thing to do.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-16 12:25                     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-16 12:25 UTC (permalink / raw)
  To: Chris Mason, Christoph Hellwig, Darrick J. Wong, P??draig Brady,
	Anna Schumaker, linux-nfs, linux-btrfs, linux-fsdevel, linux-api,
	zab, viro, mtk.manpages, andros

On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote:
> > Yes, that would be my preference.  I'd also like to understand what
> > exactly btrfs does in fallocate.
> 
> For which part?  The answer changes based on how many references there
> are to a given fallocated region.

Both cases.  With btrfs allocating new block on every write how do you
avoid that ENOSPC?  Is there a unassigned block preallocation that's
made persistent in some way?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-16 12:25                     ` Christoph Hellwig
  0 siblings, 0 replies; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-16 12:25 UTC (permalink / raw)
  To: Chris Mason, Christoph Hellwig, Darrick J. Wong, P??draig Brady,
	Anna Schumaker, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote:
> > Yes, that would be my preference.  I'd also like to understand what
> > exactly btrfs does in fallocate.
> 
> For which part?  The answer changes based on how many references there
> are to a given fallocated region.

Both cases.  With btrfs allocating new block on every write how do you
avoid that ENOSPC?  Is there a unassigned block preallocation that's
made persistent in some way?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-16 12:46                             ` Austin S Hemmelgarn
  0 siblings, 0 replies; 129+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-16 12:46 UTC (permalink / raw)
  To: Christoph Hellwig, P??draig Brady
  Cc: Andy Lutomirski, Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

[-- Attachment #1: Type: text/plain, Size: 1434 bytes --]

On 2015-10-16 08:24, Christoph Hellwig wrote:
> On Fri, Oct 16, 2015 at 01:02:23PM +0100, P??draig Brady wrote:
>> Right. reflinking is transparent to the user, though its consequences are not.
>> Consequences being the possible extra latency or ENOSPC on CoW.
>
> You can get all these consequences without doing the file system reflink
> by using a COW file system, any dedup scheme or thinly provisioned or
> COW storage devices.
>
>> Therefore reflinking should be an explicit action/flag IMHO.
>
> This still does not make any sense, as it only prevents one of many
> ways a file could do COW operations underneath.  If you don't want
> ENOSPC use fallocate, or the proposed COPY_FALLOC flag.  If you want
> care about latency you need to carefull benchmark your setup but in
> general falloc / COPY_FALLOC might be a good starting point.  But for
> 99% of the copies a reflink is exactly the right thing to do.
There is at least one reason other than avoiding ENOSPC and minimizing 
latency that people may want to avoid reflinking things: They actually 
_want_ multiple physically independent copies of the same file on the 
disk.  Usually people do go about this wrong (some people I know don't 
understand that having multiple copies of a file on the same filesystem 
provides no greater safety than one copy), but that doesn't mean that 
this isn't a perfectly valid use case for copying a file.



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
@ 2015-10-16 12:46                             ` Austin S Hemmelgarn
  0 siblings, 0 replies; 129+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-16 12:46 UTC (permalink / raw)
  To: Christoph Hellwig, P??draig Brady
  Cc: Andy Lutomirski, Anna Schumaker, Darrick J. Wong,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros-HgOvQuBEEgTQT0dZR+AlfA

[-- Attachment #1: Type: text/plain, Size: 1434 bytes --]

On 2015-10-16 08:24, Christoph Hellwig wrote:
> On Fri, Oct 16, 2015 at 01:02:23PM +0100, P??draig Brady wrote:
>> Right. reflinking is transparent to the user, though its consequences are not.
>> Consequences being the possible extra latency or ENOSPC on CoW.
>
> You can get all these consequences without doing the file system reflink
> by using a COW file system, any dedup scheme or thinly provisioned or
> COW storage devices.
>
>> Therefore reflinking should be an explicit action/flag IMHO.
>
> This still does not make any sense, as it only prevents one of many
> ways a file could do COW operations underneath.  If you don't want
> ENOSPC use fallocate, or the proposed COPY_FALLOC flag.  If you want
> care about latency you need to carefull benchmark your setup but in
> general falloc / COPY_FALLOC might be a good starting point.  But for
> 99% of the copies a reflink is exactly the right thing to do.
There is at least one reason other than avoiding ENOSPC and minimizing 
latency that people may want to avoid reflinking things: They actually 
_want_ multiple physically independent copies of the same file on the 
disk.  Usually people do go about this wrong (some people I know don't 
understand that having multiple copies of a file on the same filesystem 
provides no greater safety than one copy), but that doesn't mean that 
this isn't a perfectly valid use case for copying a file.



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-16 12:21                         ` Christoph Hellwig
  (?)
@ 2015-10-16 12:50                         ` Austin S Hemmelgarn
  2015-10-16 13:12                           ` Christoph Hellwig
  -1 siblings, 1 reply; 129+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-16 12:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Lutomirski, Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

[-- Attachment #1: Type: text/plain, Size: 2061 bytes --]

On 2015-10-16 08:21, Christoph Hellwig wrote:
> On Fri, Oct 16, 2015 at 07:46:41AM -0400, Austin S Hemmelgarn wrote:
>> I should have been specific, what I meant was that some people will assume
>> that it actually creates a physical, on-disk byte-for-byte copy of the data.
>> There are many people out there (and sadly I have to deal with some at work)
>> who are absolutely terrified of the concept of data deduplication, and will
>> likely refuse to use this syscall for _anything_ if it reflinks by default
>> on filesystems that support it.
>
> If they use a file system that supports COW or dedup they are toast
> already.  It's not the system call that does the 'dedup', it's the file
> system or storage device, that's where they need to set their
> preferences.
BTRFS is COW and supports deduplication, it does _absolutely zero_ 
reflinking and/or deduplication unless you explicitly tell it to do so. 
  Likewise, ZFS is COW and supports deduplication, it also does 
_absolutely zero_ reflinking and/or deduplication unless you tell it to 
(note that in-band deduplication is off by default on ZFS).  AFAIK, XFS 
will not automatically reflink instead of copying either (and if it does 
decide to do it automatically, that will just be something else to add 
to the list of why I will never use it on any of my systems). OCFS2 
supports reflinks (although not many people know this, and I think it 
implements them slightly differently from BTRFS/ZFS/XFS) and yet again, 
does _absolutely zero_ reflinking unless you tell it to.  Based on this, 
it is in no way the filesystem that does the deduplication and 
reflinking, it only handles the implementation and provides the option 
to the user to do it.

Certain parts of userspace do try to reflink things instead of copying 
(for example, coreutils recently started doing so in mv and has had the 
option to do so with cp for a while now), but a properly designed 
general purpose filesystem does not and should not do this without the 
user telling it to do so.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-16 12:50                         ` Austin S Hemmelgarn
@ 2015-10-16 13:12                           ` Christoph Hellwig
  2015-10-16 14:11                             ` Austin S Hemmelgarn
  0 siblings, 1 reply; 129+ messages in thread
From: Christoph Hellwig @ 2015-10-16 13:12 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Christoph Hellwig, Andy Lutomirski, Anna Schumaker,
	Darrick J. Wong, linux-nfs, Linux btrfs Developers List,
	Linux FS Devel, Linux API, Zach Brown, Al Viro, Chris Mason,
	Michael Kerrisk-manpages, andros

On Fri, Oct 16, 2015 at 08:50:41AM -0400, Austin S Hemmelgarn wrote:
> Certain parts of userspace do try to reflink things instead of copying (for
> example, coreutils recently started doing so in mv and has had the option to
> do so with cp for a while now), but a properly designed general purpose
> filesystem does not and should not do this without the user telling it to do
> so.

But they do.  Get out of your narrow local Linux file system view.
Every all flash array or hyperconverge hypervisor will dedeup the hell
out of your data, heck some SSDs even do it on the device.  Your NFS or
CIFS server already does or soon will do dedup and reflinks behind the
scenes, that's the whole point of adding these features to the protocol.

And except for the odd fear or COW or dedup, and the ENOSPC issue for
which we have a flag with a very well defined meaning I've still not
heard any good arguments against it.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-16 13:19                       ` Chris Mason
  0 siblings, 0 replies; 129+ messages in thread
From: Chris Mason @ 2015-10-16 13:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, P??draig Brady, Anna Schumaker, linux-nfs,
	linux-btrfs, linux-fsdevel, linux-api, zab, viro, mtk.manpages,
	andros

On Fri, Oct 16, 2015 at 05:25:44AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote:
> > > Yes, that would be my preference.  I'd also like to understand what
> > > exactly btrfs does in fallocate.
> > 
> > For which part?  The answer changes based on how many references there
> > are to a given fallocated region.
> 
> Both cases.  With btrfs allocating new block on every write how do you
> avoid that ENOSPC?  Is there a unassigned block preallocation that's
> made persistent in some way?

So:

fallocate 1g -> foo

reflink foo foo2

We've now implicitly doubled the size of the fallocate, but at reflink
time btrfs doesn't account for the doubling.  It's actually much
better in this case to just use a hole because neither foo or foo2 can
use the preallocated space until the 1g is fully unshared.

When we're doing writes, it'll check the preallocated extents for extra
refs and force COW if any exist.  So writes into a preallocated region
can enospc.

-chris

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-16 13:19                       ` Chris Mason
  0 siblings, 0 replies; 129+ messages in thread
From: Chris Mason @ 2015-10-16 13:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, P??draig Brady, Anna Schumaker,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Fri, Oct 16, 2015 at 05:25:44AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote:
> > > Yes, that would be my preference.  I'd also like to understand what
> > > exactly btrfs does in fallocate.
> > 
> > For which part?  The answer changes based on how many references there
> > are to a given fallocated region.
> 
> Both cases.  With btrfs allocating new block on every write how do you
> avoid that ENOSPC?  Is there a unassigned block preallocation that's
> made persistent in some way?

So:

fallocate 1g -> foo

reflink foo foo2

We've now implicitly doubled the size of the fallocate, but at reflink
time btrfs doesn't account for the doubling.  It's actually much
better in this case to just use a hole because neither foo or foo2 can
use the preallocated space until the 1g is fully unshared.

When we're doing writes, it'll check the preallocated extents for extra
refs and force COW if any exist.  So writes into a preallocated region
can enospc.

-chris

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-16 13:19                       ` Chris Mason
  0 siblings, 0 replies; 129+ messages in thread
From: Chris Mason @ 2015-10-16 13:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, P??draig Brady, Anna Schumaker,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Fri, Oct 16, 2015 at 05:25:44AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote:
> > > Yes, that would be my preference.  I'd also like to understand what
> > > exactly btrfs does in fallocate.
> > 
> > For which part?  The answer changes based on how many references there
> > are to a given fallocated region.
> 
> Both cases.  With btrfs allocating new block on every write how do you
> avoid that ENOSPC?  Is there a unassigned block preallocation that's
> made persistent in some way?

So:

fallocate 1g -> foo

reflink foo foo2

We've now implicitly doubled the size of the fallocate, but at reflink
time btrfs doesn't account for the doubling.  It's actually much
better in this case to just use a hole because neither foo or foo2 can
use the preallocated space until the 1g is fully unshared.

When we're doing writes, it'll check the preallocated extents for extra
refs and force COW if any exist.  So writes into a preallocated region
can enospc.

-chris

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-16 13:12                           ` Christoph Hellwig
@ 2015-10-16 14:11                             ` Austin S Hemmelgarn
  0 siblings, 0 replies; 129+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-16 14:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Lutomirski, Anna Schumaker, Darrick J. Wong, linux-nfs,
	Linux btrfs Developers List, Linux FS Devel, Linux API,
	Zach Brown, Al Viro, Chris Mason, Michael Kerrisk-manpages,
	andros

[-- Attachment #1: Type: text/plain, Size: 3446 bytes --]

On 2015-10-16 09:12, Christoph Hellwig wrote:
> On Fri, Oct 16, 2015 at 08:50:41AM -0400, Austin S Hemmelgarn wrote:
>> Certain parts of userspace do try to reflink things instead of copying (for
>> example, coreutils recently started doing so in mv and has had the option to
>> do so with cp for a while now), but a properly designed general purpose
>> filesystem does not and should not do this without the user telling it to do
>> so.
>
> But they do.  Get out of your narrow local Linux file system view.
> Every all flash array or hyperconverge hypervisor will dedeup the hell
> out of your data, heck some SSDs even do it on the device.  Your NFS or
> CIFS server already does or soon will do dedup and reflinks behind the
> scenes, that's the whole point of adding these features to the protocol.
Unless things have significantly changed on Windows and OS X, NTFS and 
HFS+ do not do automatic data deduplication (I'm not sure whether either 
even supports reflinks, although NTFS is at least partly COW), and I 
know for certain that FAT, UDF, Minix, BeFS, and Venti do not do so. 
NFS and CIFS/SMB both have support in the protocol, but unless either 
the client asks for it specifically, or the server is manually 
configured to do it automatically (although current versions of Windows 
server might do it by default, but if they do it is not documented 
anywhere I've seen), they don't do it.  9P has no provisions for 
reflinks/deduplication.  AFS/Coda/Ceph/Lustre/GFS2 might do 
deduplication, but I'm pretty certain that they do not do so by default, 
and even then they really don't fit the 'general purpose' bit in my 
statement above.  So, overall, my statement still holds for any widely 
used filesystem technology that is actually 'general purpose'.

Furthermore, if you actually read my statement, you will notice that I 
only said that _filesystems_ should not do it without being told to do 
so, and (intentionally) said absolutely nothing about any kind of 
storage devices or virtualization.  Ideally, SSD's really shouldn't do 
it either unless they have a 100% guarantee that the entire block going 
bad will not render the data unrecoverable (most do in fact use ECC 
internally, but they typically only handle two or three bad bits out of 
a full byte).  And as far as hypervisors go, a good storage hypervisor 
should be providing some guarantee of reliability, which means either it 
is already storing multiple copies of _everything_ or using some form of 
erasure coding so that it can recover from issues with the underlying 
storage devices without causing issues for higher levels, thus meaning 
that deduplication in that context is safe for all intents and purposes.
> And except for the odd fear or COW or dedup, and the ENOSPC issue for
> which we have a flag with a very well defined meaning I've still not
> heard any good arguments against it.
Most people who I know who demonstrate this fear are just fine with COW, 
it's the deduplication that they're terrified of, and TBH that's largely 
because they've only ever seen it used in unsafe ways.  My main argument 
(which I admittedly have not really stated properly at all during this 
discussion) is that almost everyone is likely to jump on this, which 
_will_ change long established semantics in many things that switch to 
this, and there will almost certainly be serious backlash from that.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
  2015-10-16 13:19                       ` Chris Mason
  (?)
  (?)
@ 2015-10-16 21:44                       ` Dave Chinner
  2015-10-17 13:44                           ` Chris Mason
  -1 siblings, 1 reply; 129+ messages in thread
From: Dave Chinner @ 2015-10-16 21:44 UTC (permalink / raw)
  To: Chris Mason, Christoph Hellwig, Darrick J. Wong, P??draig Brady,
	Anna Schumaker, linux-nfs, linux-btrfs, linux-fsdevel, linux-api,
	zab, viro, mtk.manpages, andros

On Fri, Oct 16, 2015 at 09:19:50AM -0400, Chris Mason wrote:
> On Fri, Oct 16, 2015 at 05:25:44AM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 16, 2015 at 07:49:19AM -0400, Chris Mason wrote:
> > > > Yes, that would be my preference.  I'd also like to understand what
> > > > exactly btrfs does in fallocate.
> > > 
> > > For which part?  The answer changes based on how many references there
> > > are to a given fallocated region.
> > 
> > Both cases.  With btrfs allocating new block on every write how do you
> > avoid that ENOSPC?  Is there a unassigned block preallocation that's
> > made persistent in some way?
> 
> So:
> 
> fallocate 1g -> foo
> 
> reflink foo foo2
> 
> We've now implicitly doubled the size of the fallocate, but at reflink

No, I don't think it implies that at all. the posix_fallocate()
"future writes will succeed" guarantee only applies to foo, not to
/copies/ such as foo2. At it's core, reflink is just an optimised
file copy mechanism - the resultant copy should have the same
behaviour as a file copied by read/write. Copies done by physically
copying data do not duplicate fallocate() regions or guarantees from
the source file to the destination file.

> time btrfs doesn't account for the doubling.  It's actually much
> better in this case to just use a hole because neither foo or foo2 can
> use the preallocated space until the 1g is fully unshared.

Right - this implies unwritten extents should not be shared by
reflink, instead either skipped (i.e. leave as a hole in foo2 as you
suggest) or duplicated so that the next write to the region of foo2
will also succeed. I'd suggest that COPY_FALLOC (or whatever it'll
get called) implies the latter behaviour, the default behaviour
being the former...

> When we're doing writes, it'll check the preallocated extents for extra
> refs and force COW if any exist.  So writes into a preallocated region
> can enospc.

This really seems like an btrfs interpretation/implementation
issue, not a problem for reflink in general.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
  2015-10-16 21:44                       ` Dave Chinner
  2015-10-17 13:44                           ` Chris Mason
@ 2015-10-17 13:44                           ` Chris Mason
  0 siblings, 0 replies; 129+ messages in thread
From: Chris Mason @ 2015-10-17 13:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, P??draig Brady,
	Anna Schumaker, linux-nfs, linux-btrfs, linux-fsdevel, linux-api,
	zab, viro, mtk.manpages, andros

On Sat, Oct 17, 2015 at 08:44:35AM +1100, Dave Chinner wrote:
> 
> > When we're doing writes, it'll check the preallocated extents for extra
> > refs and force COW if any exist.  So writes into a preallocated region
> > can enospc.
> 
> This really seems like an btrfs interpretation/implementation
> issue, not a problem for reflink in general.
> 

Right, now matter how we do it there are tradeoffs, and this one seemed
the least surprising to me.  I don't think it's a big problem at all.

Automatically replacing preallocated extents with holes during clone
seems like a better compromise though (at least for btrfs anyway).

-chris

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-17 13:44                           ` Chris Mason
  0 siblings, 0 replies; 129+ messages in thread
From: Chris Mason @ 2015-10-17 13:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, P??draig Brady,
	Anna Schumaker, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Sat, Oct 17, 2015 at 08:44:35AM +1100, Dave Chinner wrote:
> 
> > When we're doing writes, it'll check the preallocated extents for extra
> > refs and force COW if any exist.  So writes into a preallocated region
> > can enospc.
> 
> This really seems like an btrfs interpretation/implementation
> issue, not a problem for reflink in general.
> 

Right, now matter how we do it there are tradeoffs, and this one seemed
the least surprising to me.  I don't think it's a big problem at all.

Automatically replacing preallocated extents with holes during clone
seems like a better compromise though (at least for btrfs anyway).

-chris

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
@ 2015-10-17 13:44                           ` Chris Mason
  0 siblings, 0 replies; 129+ messages in thread
From: Chris Mason @ 2015-10-17 13:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, P??draig Brady,
	Anna Schumaker, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, zab-ugsP4Wv/S6ZeoWH0uzbU5w,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	andros-HgOvQuBEEgTQT0dZR+AlfA

On Sat, Oct 17, 2015 at 08:44:35AM +1100, Dave Chinner wrote:
> 
> > When we're doing writes, it'll check the preallocated extents for extra
> > refs and force COW if any exist.  So writes into a preallocated region
> > can enospc.
> 
> This really seems like an btrfs interpretation/implementation
> issue, not a problem for reflink in general.
> 

Right, now matter how we do it there are tradeoffs, and this one seemed
the least surprising to me.  I don't think it's a big problem at all.

Automatically replacing preallocated extents with holes during clone
seems like a better compromise though (at least for btrfs anyway).

-chris

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
  2015-10-13  7:27         ` Christoph Hellwig
  (?)
@ 2015-11-10  6:24         ` Darrick J. Wong
  -1 siblings, 0 replies; 129+ messages in thread
From: Darrick J. Wong @ 2015-11-10  6:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anna Schumaker, linux-nfs, linux-btrfs, linux-fsdevel, linux-api,
	zab, viro, clm, mtk.manpages, andros

On Tue, Oct 13, 2015 at 12:27:37AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 12, 2015 at 04:17:49PM -0700, Darrick J. Wong wrote:
> > Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> > btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
> > 
> > What does everyone think about generalizing EXTENT_SAME?  The interface enables
> > one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> > more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> > that the extra complexity buys us the ability to ... multi-dedupe at the same
> > time, with locks held on the source file?
> > 
> > I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
> > hate the interface.
> 
> It's not pretty, but if the btrfs folks have a good reason for it I
> don't see a reason to diverge.

I started hoisting EXTENT_SAME into the VFS but I don't like the name because
this ioctl implies some sort of action, but "EXTENT SAME" lacks a verb.  Since
we have to introduce a new symbol anyway, I'm going to use FS_DEDUPE_RANGE.

struct file_dedupe_range {
	...
}

#define FI_DEDUPE_RANGE         _IOWR(0x94, 54, struct file_dedupe_range)

(Honestly, I'm not in love with FICLONERANGE either, but FIDEDUPRANGE was just
unpronounceable mess.)

Also, for the btrfs folks: Why does extent_same call mnt_want_write_file on the
fd that we pass into the ioctl?  Shouldn't we be calling it on the fd that's in
the btrfs_ioctl_extent_same_info structure because that'ss the file that gets
its blocks remapped?

--D

^ permalink raw reply	[flat|nested] 129+ messages in thread

end of thread, other threads:[~2015-11-10  6:25 UTC | newest]

Thread overview: 129+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-30 17:26 [PATCH v5 0/9] VFS: In-kernel copy system call Anna Schumaker
2015-09-30 17:26 ` Anna Schumaker
2015-09-30 17:26 ` Anna Schumaker
2015-09-30 17:26 ` [PATCH v5 1/9] vfs: add copy_file_range syscall and vfs helper Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-09-30 17:26 ` [PATCH v5 2/9] x86: add sys_copy_file_range to syscall tables Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-09-30 17:26 ` [PATCH v5 3/9] btrfs: add .copy_file_range file operation Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-09-30 17:26 ` [PATCH v5 4/9] vfs: Copy should check len after file open mode Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-10-11 14:22   ` Christoph Hellwig
2015-10-11 14:22     ` Christoph Hellwig
2015-09-30 17:26 ` [PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-10-11 14:22   ` Christoph Hellwig
2015-10-14 17:37     ` Anna Schumaker
2015-10-14 17:37       ` Anna Schumaker
2015-10-14 17:37       ` Anna Schumaker
2015-10-14 18:25       ` Christoph Hellwig
2015-10-14 18:27         ` Anna Schumaker
2015-10-14 18:27           ` Anna Schumaker
2015-10-14 18:27           ` Anna Schumaker
2015-09-30 17:26 ` [PATCH v5 6/9] vfs: Copy should use file_out rather than file_in Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-10-11 14:24   ` Christoph Hellwig
2015-09-30 17:26 ` [PATCH v5 7/9] vfs: Remove copy_file_range mountpoint checks Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-10-11 14:23   ` Christoph Hellwig
2015-10-14 17:41     ` Anna Schumaker
2015-10-14 17:41       ` Anna Schumaker
2015-10-14 18:25       ` Christoph Hellwig
2015-10-14 18:25         ` Christoph Hellwig
2015-09-30 17:26 ` [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-10-08  1:40   ` Neil Brown
2015-10-09 11:15     ` Pádraig Brady
2015-10-09 11:15       ` Pádraig Brady
2015-10-13 20:25       ` Anna Schumaker
2015-10-13 20:25         ` Anna Schumaker
2015-10-14  7:41         ` Christoph Hellwig
2015-10-14  7:41           ` Christoph Hellwig
2015-10-13 19:45     ` Anna Schumaker
2015-10-13 19:45       ` Anna Schumaker
2015-10-13 19:45       ` Anna Schumaker
2015-10-11 14:22   ` Christoph Hellwig
2015-10-11 14:22     ` Christoph Hellwig
2015-10-12 23:17     ` Darrick J. Wong
2015-10-12 23:17       ` Darrick J. Wong
2015-10-13  3:36       ` Trond Myklebust
2015-10-13  7:19         ` Darrick J. Wong
2015-10-13  7:19           ` Darrick J. Wong
2015-10-13  7:30         ` Christoph Hellwig
2015-10-13  7:30           ` Christoph Hellwig
2015-10-13  7:27       ` Christoph Hellwig
2015-10-13  7:27         ` Christoph Hellwig
2015-11-10  6:24         ` Darrick J. Wong
2015-10-14 17:59       ` Anna Schumaker
2015-10-14 17:59         ` Anna Schumaker
2015-10-14 17:59         ` Anna Schumaker
2015-10-14 18:08         ` Andy Lutomirski
2015-10-14 18:27           ` Christoph Hellwig
2015-10-14 18:38             ` Andy Lutomirski
2015-10-14 18:38               ` Andy Lutomirski
2015-10-14 18:49               ` Christoph Hellwig
2015-10-14 18:49                 ` Christoph Hellwig
2015-10-14 18:53                 ` Andy Lutomirski
2015-10-14 19:14                   ` Austin S Hemmelgarn
2015-10-14 19:14                     ` Austin S Hemmelgarn
2015-10-14 19:39                     ` Pádraig Brady
2015-10-14 19:39                       ` Pádraig Brady
2015-10-15  5:56                   ` Christoph Hellwig
2015-10-15  5:56                     ` Christoph Hellwig
2015-10-14 19:08             ` Austin S Hemmelgarn
2015-10-14 19:08               ` Austin S Hemmelgarn
2015-10-15  6:36               ` Christoph Hellwig
2015-10-15  6:36                 ` Christoph Hellwig
2015-10-15 12:24                 ` Austin S Hemmelgarn
2015-10-15 12:24                   ` Austin S Hemmelgarn
2015-10-16  5:38                   ` Christoph Hellwig
2015-10-16  5:38                     ` Christoph Hellwig
2015-10-16 11:46                     ` Austin S Hemmelgarn
2015-10-16 12:02                       ` Pádraig Brady
2015-10-16 12:02                         ` Pádraig Brady
2015-10-16 12:24                         ` Christoph Hellwig
2015-10-16 12:24                           ` Christoph Hellwig
2015-10-16 12:46                           ` Austin S Hemmelgarn
2015-10-16 12:46                             ` Austin S Hemmelgarn
2015-10-16 12:21                       ` Christoph Hellwig
2015-10-16 12:21                         ` Christoph Hellwig
2015-10-16 12:50                         ` Austin S Hemmelgarn
2015-10-16 13:12                           ` Christoph Hellwig
2015-10-16 14:11                             ` Austin S Hemmelgarn
2015-10-14 18:11         ` Darrick J. Wong
2015-10-14 18:11           ` Darrick J. Wong
2015-10-14 18:26           ` Andy Lutomirski
2015-09-30 17:26 ` [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-10-11 14:29   ` Christoph Hellwig
2015-10-11 14:29     ` Christoph Hellwig
2015-10-12 10:23     ` Pádraig Brady
2015-10-12 10:23       ` Pádraig Brady
2015-10-12 14:34       ` Christoph Hellwig
2015-10-12 23:41         ` Darrick J. Wong
2015-10-12 23:41           ` Darrick J. Wong
2015-10-13  7:29           ` Christoph Hellwig
2015-10-13  7:29             ` Christoph Hellwig
2015-10-14 18:46             ` Darrick J. Wong
2015-10-14 18:46               ` Darrick J. Wong
2015-10-15  6:00               ` Christoph Hellwig
2015-10-16 11:49                 ` Chris Mason
2015-10-16 11:49                   ` Chris Mason
2015-10-16 11:49                   ` Chris Mason
2015-10-16 12:25                   ` Christoph Hellwig
2015-10-16 12:25                     ` Christoph Hellwig
2015-10-16 13:19                     ` Chris Mason
2015-10-16 13:19                       ` Chris Mason
2015-10-16 13:19                       ` Chris Mason
2015-10-16 21:44                       ` Dave Chinner
2015-10-17 13:44                         ` Chris Mason
2015-10-17 13:44                           ` Chris Mason
2015-10-17 13:44                           ` Chris Mason
2015-10-15  8:35               ` Dave Chinner
2015-10-15  8:35                 ` Dave Chinner
2015-09-30 17:26 ` [PATCH v5 10/9] copy_file_range.2: New page documenting copy_file_range() Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker
2015-09-30 17:26   ` Anna Schumaker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.