Linux-NFS Archive on lore.kernel.org
* [PATCH 0/11] fs: fixes for major copy_file_range() issues
@ 2018-12-03  8:34 Dave Chinner
  2018-12-03  8:34 ` [PATCH 01/11] vfs: copy_file_range source range over EOF should fail Dave Chinner
                   ` (11 more replies)
  0 siblings, 12 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

Hi folks,

As most of you already know, we really suck at introducing new
functionality. The recent problems we found with clone/dedupe file
range interfaces also plague the copy_file_range() API and
implementation. Not only doesn't it do exactly what the man page
says, the man page doesn't document everything the syscall does
either.

There's a few problems:
	- can overwrite setuid files
	- can read from and overwrite active swap files
	- can overwrite immutable files
	- doesn't update timestamps
	- doesn't obey resource limits
	- doesn't catch overlapping copy ranges to the same file
	- doesn't consistently implement fallback strategies
	- doesn't error out when the source range extends past EOF like
	  the man page says it should
	- isn't consistent with clone file range behaviour
	- inconsistent behaviour between filesystems
	- inconsistent fallback implementations

And so on. There's so much wrong, and I haven't even got to the
problems that the generic fallback code (i.e. do_splice_direct())
has. That's for another day.

So, what this series attempts to do is clean up the code, implement
all the missing checks, provide an infrastructure layout that allows
for consistent behaviour across filesystems and allows filesystems
to control fallback mechanisms and cross-device copies.

I'll repeat that so it's clear: the series also enables cross-device
copies once all the problems are sorted out.

To that end, the current fallback code is moved to
generic_copy_file_range(), and that is called only if the filesystem
does not provide a ->copy_file_range implementation. If the
filesystem provides such a method, it must implement the page cache
copy fallback itself by calling generic_copy_file_range() when
appropriate. I did this because different filesystems have different
copy-offload capabilities and so need to fall back in different
situations. It's easier to have them call generic_copy_file_range()
to do that copy when necessary than it is to have them try to
communicate back up to vfs_copy_file_range() that it should run a
fallback copy.

To make all the implementations perform the same validity checks, 
I've created a generic_copy_file_checks() which is similar to the
checks we do for clone/dedupe. It's not quite the same, but the core
is very similar. This strips setuid, updates timestamps, checks and
enforces filesystem and resource limits, bounds checks the copy
ranges, etc.

This needs to be run before we call ->remap_file_range() so that we
end up with consistent behaviour across copy_file_range() calls.
e.g. we want an XFS filesystem with reflink=1 (i.e. supports
->remap_file_range()) to behave the same as an XFS filesystem with
reflink=0. Hence we need to check all the parameters up front so we
don't end up with calls to ->remap_file_range() resulting in
different behaviour.

It also means that ->copy_file_range implementations only need to
bounds check the input against filesystem internal constraints,
not everything. This makes the filesystem implementations simpler,
and means they can call the fallback generic_copy_file_range()
implementation without having to care about further bounds checking.

I have not changed the fallback behaviour of the CIFS, Ceph or NFS
client implementations. They still reject copy_file_range() to the
same file with EINVAL, even though it is supported by the fallback
and filesystems that implement ->remap_file_range(). I'll leave it
for the maintainers to decide if they want to implement the manual
data copy fallback or not. My personal opinion is that they should
implement the fallback wherever they can, but userspace has to be
prepared for copy_file_range() to fail and so implementing the
fallback is an optional feature.
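
A minimal userspace sketch of what "prepared for copy_file_range() to
fail" can look like. It is an illustration only, not part of this
series: the helper names (copy_by_read_write, should_fall_back,
copy_range_with_fallback) are hypothetical, the set of errno values
that trigger the fallback is an assumption a real application may want
to tune, and the raw syscall(2) interface is used so the sketch does
not depend on a libc wrapper being present.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain read()/write() loop. Returns bytes copied, or -1 with errno set. */
static ssize_t copy_by_read_write(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t nr = read(fd_in, buf, want);

		if (nr < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (nr == 0)
			break;	/* source EOF: report a short copy */

		for (ssize_t off = 0; off < nr; ) {
			ssize_t nw = write(fd_out, buf + off, (size_t)(nr - off));

			if (nw < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += nw;
		}
		done += (size_t)nr;
	}
	return (ssize_t)done;
}

/* Assumption: these errors mean "cannot offload here", not "I/O failed". */
static int should_fall_back(int err)
{
	return err == ENOSYS || err == EXDEV ||
	       err == EINVAL || err == EOPNOTSUPP;
}

/*
 * Copy up to len bytes between the current file offsets of fd_in and
 * fd_out. Tries copy_file_range() first and switches to a manual copy
 * when the kernel refuses the request.
 */
static ssize_t copy_range_with_fallback(int fd_in, int fd_out, size_t len)
{
	size_t done = 0;

	while (done < len) {
		long n = syscall(SYS_copy_file_range, fd_in, NULL,
				 fd_out, NULL, len - done, 0u);

		if (n < 0) {
			ssize_t r;

			if (errno == EINTR)
				continue;
			if (!should_fall_back(errno))
				return -1;
			r = copy_by_read_write(fd_in, fd_out, len - done);
			return r < 0 ? -1 : (ssize_t)done + r;
		}
		if (n == 0)
			break;	/* nothing more to copy */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

Passing NULL offsets makes the syscall use and advance the file
offsets, which keeps it consistent with the read()/write() path if the
fallback kicks in part way through a copy.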

In terms of testing, Darrick and I have been beating the hell out of
copy_file_range with fsx on XFS to sort out all the data corruption
problems it has exposed (we're still working on that). Patches have
been posted to enhance fsx and fsstress in fstests to exercise
clone/dedupe/copy_file_range. Thread here:

https://www.spinics.net/lists/fstests/msg10920.html

I've also written a bounds/behaviour exercising test:

https://marc.info/?l=fstests&m=154381938829897&w=2
https://marc.info/?l=fstests&m=154381939029898&w=2
https://marc.info/?l=fstests&m=154381939229899&w=2
https://marc.info/?l=fstests&m=154381939329900&w=2

I don't know whether I've got all the permission tests right in this
patchset. There's absolutely no documentation telling us when we
should use file_permission, inode_permission, etc in the
documentation or the code, so I just added the things that made the
tests do the things I think are the right things to be doing.

To run the tests, you'll also need modifications to xfs_io to allow
it to modify state appropriately. This is something we have
overlooked in the past, and so a lot of xfs_io based behaviour
checking is not actually testing the syscall we thought it was
testing but is instead testing the permission checking of the open()
syscall. Those patches are here:

https://marc.info/?l=linux-xfs&m=154378403323889&w=2
https://marc.info/?l=linux-xfs&m=154378403523890&w=2
https://marc.info/?l=linux-xfs&m=154378403323888&w=2
https://marc.info/?l=linux-xfs&m=154379644526132&w=2

These changes really need to go in before we merge any more
copy_file_range() features - we need to get the basics right and get
test coverage over it before we unleash things like NFS server-side
copies on unsuspecting users with filesystems that have busted
copy_file_range() implementations.

I'll be appending a man page patch to this series that documents all
the errors this syscall can throw, the expected behaviours, etc. The
test and the man page were written together first, and the
implementation changes were done second. So if you don't agree with
the behaviour, discuss what the man page patch should say and define,
then I'll change the test to reflect that and I'll go from there.

-Dave.




* [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
@ 2018-12-03  8:34 ` Dave Chinner
  2018-12-03 12:46   ` Amir Goldstein
  2018-12-03  8:34 ` [PATCH 02/11] vfs: introduce generic_copy_file_range() Dave Chinner
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

From: Dave Chinner <dchinner@redhat.com>

The man page says:

EINVAL Requested range extends beyond the end of the source file

But the current behaviour is that copy_file_range does a short
copy up to the source file EOF. Fix the kernel behaviour to match
the behaviour described in the man page.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/read_write.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index 4dae0399c75a..09d1816cf3cf 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1581,6 +1581,10 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (len == 0)
 		return 0;
 
+	/* If the source range crosses EOF, fail the copy */
+	if (pos_in >= i_size(inode_in) || pos_in + len > i_size(inode_in))
+		return -EINVAL;
+
 	file_start_write(file_out);
 
 	/*
-- 
2.19.1
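
For readers following along, the rule this patch enforces can be
modelled in plain userspace C. This is a sketch under assumptions, not
the kernel code: check_source_range is a hypothetical name, the file
size is passed in as a parameter instead of being read from an inode,
and the comparison is written as a subtraction so the model itself
cannot overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model of the EOF rule: a zero-length copy succeeds,
 * anything that starts at or beyond EOF, or runs past EOF, is -EINVAL.
 */
static int check_source_range(int64_t pos_in, uint64_t len, int64_t size_in)
{
	if (len == 0)
		return 0;	/* mirrors the early "len == 0" return */
	if (pos_in < 0 || pos_in >= size_in)
		return -EINVAL;	/* starts at or beyond EOF */
	if (len > (uint64_t)(size_in - pos_in))
		return -EINVAL;	/* range crosses EOF */
	return 0;
}
```

With a 10 byte source, check_source_range(4, 6, 10) is accepted because
the range ends exactly at EOF, while check_source_range(4, 7, 10) is
rejected where the old behaviour would have been a 6 byte short copy.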



* [PATCH 02/11] vfs: introduce generic_copy_file_range()
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
  2018-12-03  8:34 ` [PATCH 01/11] vfs: copy_file_range source range over EOF should fail Dave Chinner
@ 2018-12-03  8:34 ` Dave Chinner
  2018-12-03 10:03   ` Amir Goldstein
  2018-12-04 15:14   ` Christoph Hellwig
  2018-12-03  8:34 ` [PATCH 03/11] vfs: no fallback for ->copy_file_range Dave Chinner
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

From: Dave Chinner <dchinner@redhat.com>

Right now if vfs_copy_file_range() does not use any offload
mechanism, it falls back to calling do_splice_direct(). This fails
to do basic sanity checks on the files being copied. Before we
start adding this necessary functionality to the fallback path,
separate it out into generic_copy_file_range().

generic_copy_file_range() has the same prototype as
->copy_file_range() so that filesystems can use it in their custom
->copy_file_range() method if they so choose.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/read_write.c    | 35 ++++++++++++++++++++++++++++++++---
 include/linux/fs.h |  3 +++
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 09d1816cf3cf..50114694c98b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1540,6 +1540,36 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 }
 #endif
 
+/**
+ * generic_copy_file_range - copy data between two files
+ * @file_in:	file structure to read from
+ * @pos_in:	file offset to read from
+ * @file_out:	file structure to write data to
+ * @pos_out:	file offset to write data to
+ * @len:	amount of data to copy
+ * @flags:	copy flags
+ *
+ * This is a generic filesystem helper to copy data from one file to another.
+ * It has no constraints on the source or destination file owners - the files
+ * can belong to different superblocks and different filesystem types. Short
+ * copies are allowed.
+ *
+ * This should be called from the @file_out filesystem, as per the
+ * ->copy_file_range() method.
+ *
+ * Returns the number of bytes copied or a negative error indicating the
+ * failure.
+ */
+
+ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
+			    struct file *file_out, loff_t pos_out,
+			    size_t len, unsigned int flags)
+{
+	return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
+			len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
+}
+EXPORT_SYMBOL(generic_copy_file_range);
+
 /*
  * copy_file_range() differs from regular file read and write in that it
  * specifically allows return partial success.  When it does so is up to
@@ -1611,9 +1641,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			goto done;
 	}
 
-	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
-			len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
-
+	ret = generic_copy_file_range(file_in, &pos_in, file_out, &pos_out,
+					len, flags);
 done:
 	if (ret > 0) {
 		fsnotify_access(file_in);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c95c0807471f..a4478764cf63 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1874,6 +1874,9 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				size_t len, unsigned int flags);
 extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					 struct file *file_out, loff_t pos_out,
 					 loff_t *count,
-- 
2.19.1



* [PATCH 03/11] vfs: no fallback for ->copy_file_range
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
  2018-12-03  8:34 ` [PATCH 01/11] vfs: copy_file_range source range over EOF should fail Dave Chinner
  2018-12-03  8:34 ` [PATCH 02/11] vfs: introduce generic_copy_file_range() Dave Chinner
@ 2018-12-03  8:34 ` Dave Chinner
  2018-12-03 10:22   ` Amir Goldstein
                     ` (2 more replies)
  2018-12-03  8:34 ` [PATCH 04/11] vfs: add missing checks to copy_file_range Dave Chinner
                   ` (8 subsequent siblings)
  11 siblings, 3 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

From: Dave Chinner <dchinner@redhat.com>

Now that we have generic_copy_file_range(), remove it as a fallback
case when offloads fail. This puts the responsibility for executing
fallbacks on the filesystems that implement ->copy_file_range and
allows us to add operational validity checks to
generic_copy_file_range().

Rework vfs_copy_file_range() to call a new do_copy_file_range()
helper to execute the copying callout, and move calls to
generic_copy_file_range() into filesystem methods where they
currently return failures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/ceph/file.c      | 17 ++++++++++++++++-
 fs/cifs/cifsfs.c    |  4 ++++
 fs/fuse/file.c      | 17 ++++++++++++++++-
 fs/nfs/nfs4file.c   |  4 ++++
 fs/overlayfs/file.c |  9 ++++++++-
 fs/read_write.c     | 24 +++++++++++++++---------
 6 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 189df668b6a0..cf29f0410dcb 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1885,7 +1885,7 @@ static int is_file_size_ok(struct inode *src_inode, struct inode *dst_inode,
 	return 0;
 }
 
-static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
+static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 				    struct file *dst_file, loff_t dst_off,
 				    size_t len, unsigned int flags)
 {
@@ -2096,6 +2096,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	return ret;
 }
 
+static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
+				    struct file *dst_file, loff_t dst_off,
+				    size_t len, unsigned int flags)
+{
+	ssize_t ret;
+
+	ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
+					len, flags);
+
+	if (ret == -EOPNOTSUPP)
+		ret = generic_copy_file_range(src_file, src_off, dst_file,
+					dst_off, len, flags);
+	return ret;
+}
+
 const struct file_operations ceph_file_fops = {
 	.open = ceph_open,
 	.release = ceph_release,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 865706edb307..5ef4baec6234 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1141,6 +1141,10 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
 	rc = cifs_file_copychunk_range(xid, src_file, off, dst_file, destoff,
 					len, flags);
 	free_xid(xid);
+
+	if (rc == -EOPNOTSUPP)
+		rc = generic_copy_file_range(src_file, off, dst_file,
+					destoff, len, flags);
 	return rc;
 }
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b52f9baaa3e7..b86fb0298739 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3024,7 +3024,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	return err;
 }
 
-static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
+static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
 				    struct file *file_out, loff_t pos_out,
 				    size_t len, unsigned int flags)
 {
@@ -3100,6 +3100,21 @@ static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
 	return err;
 }
 
+static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
+				    struct file *dst_file, loff_t dst_off,
+				    size_t len, unsigned int flags)
+{
+	ssize_t ret;
+
+	ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
+					len, flags);
+
+	if (ret == -EOPNOTSUPP)
+		ret = generic_copy_file_range(src_file, src_off, dst_file,
+					dst_off, len, flags);
+	return ret;
+}
+
 static const struct file_operations fuse_file_operations = {
 	.llseek		= fuse_file_llseek,
 	.read_iter	= fuse_file_read_iter,
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 46d691ba04bc..d7766a6eb0f4 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -141,6 +141,10 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
 	ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
 	if (ret == -EAGAIN)
 		goto retry;
+
+	if (ret == -EOPNOTSUPP)
+		ret = generic_copy_file_range(file_in, pos_in, file_out,
+					pos_out, count, flags);
 	return ret;
 }
 
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 84dd957efa24..68736e5d6a56 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -486,8 +486,15 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
 				   struct file *file_out, loff_t pos_out,
 				   size_t len, unsigned int flags)
 {
-	return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
+	ssize_t ret;
+
+	ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
 			    OVL_COPY);
+
+	if (ret == -EOPNOTSUPP)
+		ret = generic_copy_file_range(file_in, pos_in, file_out,
+					pos_out, len, flags);
+	return ret;
 }
 
 static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
diff --git a/fs/read_write.c b/fs/read_write.c
index 50114694c98b..44339b44accc 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1570,6 +1570,18 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 }
 EXPORT_SYMBOL(generic_copy_file_range);
 
+static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
+			    struct file *file_out, loff_t pos_out,
+			    size_t len, unsigned int flags)
+{
+	if (file_out->f_op->copy_file_range)
+		return file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+						      pos_out, len, flags);
+
+	return generic_copy_file_range(file_in, &pos_in, file_out, &pos_out,
+					len, flags);
+}
+
 /*
  * copy_file_range() differs from regular file read and write in that it
  * specifically allows return partial success.  When it does so is up to
@@ -1634,15 +1646,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 		}
 	}
 
-	if (file_out->f_op->copy_file_range) {
-		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
-						      pos_out, len, flags);
-		if (ret != -EOPNOTSUPP)
-			goto done;
-	}
-
-	ret = generic_copy_file_range(file_in, &pos_in, file_out, &pos_out,
-					len, flags);
+	ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
+				flags);
+	WARN_ON_ONCE(ret == -EOPNOTSUPP);
 done:
 	if (ret > 0) {
 		fsnotify_access(file_in);
-- 
2.19.1
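
The division of responsibility in this patch can be shown with a small
userspace model. Everything here is hypothetical scaffolding (struct
fs_ops, generic_copy, offload_unsupported, fs_copy_with_fallback and
do_copy are stand-ins, not kernel symbols); it is only meant to show
that the dispatcher no longer retries, so a method that lets
-EOPNOTSUPP escape hands that error straight back to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*copy_fn)(size_t len);

/* Stand-in for file_operations: one optional copy method. */
struct fs_ops {
	copy_fn copy_file_range;
};

/* Stand-in for the generic data copy: always copies everything. */
static long generic_copy(size_t len)
{
	return (long)len;
}

/* Stand-in for an offload attempt the server or device cannot do. */
static long offload_unsupported(size_t len)
{
	(void)len;
	return -EOPNOTSUPP;
}

/* A method that owns its fallback, as the filesystems in this patch do. */
static long fs_copy_with_fallback(size_t len)
{
	long ret = offload_unsupported(len);

	if (ret == -EOPNOTSUPP)
		ret = generic_copy(len);
	return ret;
}

/* Model of the dispatcher: call the method if present, else go generic. */
static long do_copy(const struct fs_ops *ops, size_t len)
{
	if (ops->copy_file_range)
		return ops->copy_file_range(len);
	return generic_copy(len);
}
```

In this model a filesystem with no method and one that falls back
itself both return the full length, while one that installs
offload_unsupported directly returns -EOPNOTSUPP, which is the case
the WARN_ON_ONCE() in the patch is there to flag.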



* [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
                   ` (2 preceding siblings ...)
  2018-12-03  8:34 ` [PATCH 03/11] vfs: no fallback for ->copy_file_range Dave Chinner
@ 2018-12-03  8:34 ` Dave Chinner
  2018-12-03 12:42   ` Amir Goldstein
                     ` (4 more replies)
  2018-12-03  8:34 ` [PATCH 05/11] vfs: use inode_permission in copy_file_range() Dave Chinner
                   ` (7 subsequent siblings)
  11 siblings, 5 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

From: Dave Chinner <dchinner@redhat.com>

Like the clone and dedupe interfaces we've recently fixed, the
copy_file_range() implementation is missing basic sanity, limits and
boundary condition tests on the parameters that are passed to it
from userspace. Create a new "generic_copy_file_checks()" function
modelled on the generic_remap_checks() function to provide this
missing functionality.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/read_write.c    | 27 ++++++------------
 include/linux/fs.h |  3 ++
 mm/filemap.c       | 69 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 81 insertions(+), 18 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 44339b44accc..69809345977e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1578,7 +1578,7 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
 		return file_out->f_op->copy_file_range(file_in, pos_in, file_out,
 						      pos_out, len, flags);
 
-	return generic_copy_file_range(file_in, &pos_in, file_out, &pos_out,
+	return generic_copy_file_range(file_in, pos_in, file_out, pos_out,
 					len, flags);
 }
 
@@ -1598,10 +1598,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (flags != 0)
 		return -EINVAL;
 
-	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
-		return -EISDIR;
-	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
-		return -EINVAL;
+	/* this could be relaxed once a method supports cross-fs copies */
+	if (inode_in->i_sb != inode_out->i_sb)
+		return -EXDEV;
+
+	ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
+					flags);
+	if (ret < 0)
+		return ret;
 
 	ret = rw_verify_area(READ, file_in, &pos_in, len);
 	if (unlikely(ret))
@@ -1611,22 +1615,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (unlikely(ret))
 		return ret;
 
-	if (!(file_in->f_mode & FMODE_READ) ||
-	    !(file_out->f_mode & FMODE_WRITE) ||
-	    (file_out->f_flags & O_APPEND))
-		return -EBADF;
-
-	/* this could be relaxed once a method supports cross-fs copies */
-	if (inode_in->i_sb != inode_out->i_sb)
-		return -EXDEV;
-
 	if (len == 0)
 		return 0;
 
-	/* If the source range crosses EOF, fail the copy */
-	if (pos_in >= i_size(inode_in) || pos_in + len > i_size(inode_in))
-		return -EINVAL;
-
 	file_start_write(file_out);
 
 	/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a4478764cf63..0d9d2d93d4df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3022,6 +3022,9 @@ extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
 extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
 				loff_t *count, unsigned int remap_flags);
+extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				size_t *count, unsigned int flags);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..0a170425935b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2975,6 +2975,75 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	return 0;
 }
 
+
+/*
+ * Performs necessary checks before doing a file copy
+ *
+ * Can adjust amount of bytes to copy
+ * Returns appropriate error code that caller should return or
+ * zero in case the copy should be allowed.
+ */
+int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
+			 struct file *file_out, loff_t pos_out,
+			 size_t *req_count, unsigned int flags)
+{
+	struct inode *inode_in = file_inode(file_in);
+	struct inode *inode_out = file_inode(file_out);
+	uint64_t count = *req_count;
+	uint64_t bcount;
+	loff_t size_in, size_out;
+	loff_t bs = inode_out->i_sb->s_blocksize;
+	int ret;
+
+	/* Don't touch certain kinds of inodes */
+	if (IS_IMMUTABLE(inode_out))
+		return -EPERM;
+
+	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
+		return -ETXTBSY;
+
+	/* Don't copy dirs, pipes, sockets... */
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		return -EISDIR;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		return -EINVAL;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND))
+		return -EBADF;
+
+	/* Ensure offsets don't wrap. */
+	if (pos_in + count < pos_in || pos_out + count < pos_out)
+		return -EOVERFLOW;
+
+	size_in = i_size_read(inode_in);
+	size_out = i_size_read(inode_out);
+
+	/* If the source range crosses EOF, fail the copy */
+	if (pos_in >= size_in)
+		return -EINVAL;
+	if (pos_in + count > size_in)
+		return -EINVAL;
+
+	ret = generic_access_check_limits(file_in, pos_in, &count);
+	if (ret)
+		return ret;
+
+	ret = generic_write_check_limits(file_out, pos_out, &count);
+	if (ret)
+		return ret;
+
+	/* Don't allow overlapped copying within the same file. */
+	if (inode_in == inode_out &&
+	    pos_out + count > pos_in &&
+	    pos_out < pos_in + count)
+		return -EINVAL;
+
+	*req_count = count;
+	return 0;
+}
+
 int pagecache_write_begin(struct file *file, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
 				struct page **pagep, void **fsdata)
-- 
2.19.1
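
The offset wrap and same-file overlap tests in
generic_copy_file_checks() are easy to get wrong at the boundaries, so
here is a userspace model of just those two conditions. It is a sketch
under assumptions: check_copy_ranges is a hypothetical name, "same
file" is passed in as a flag instead of comparing inodes, and all
offsets are treated as uint64_t, whereas the kernel mixes loff_t and
uint64_t.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of two checks: unsigned wrap of pos + count, and
 * overlap of [pos_in, pos_in + count) with [pos_out, pos_out + count)
 * when both ranges are in the same file.
 */
static int check_copy_ranges(bool same_file, uint64_t pos_in,
			     uint64_t pos_out, uint64_t count)
{
	/* Ensure offsets don't wrap. */
	if (pos_in + count < pos_in || pos_out + count < pos_out)
		return -EOVERFLOW;

	/* Half-open ranges overlap only if each starts before the other ends. */
	if (same_file &&
	    pos_out + count > pos_in &&
	    pos_out < pos_in + count)
		return -EINVAL;

	return 0;
}
```

Because the ranges are half-open, copying 10 bytes from offset 0 to
offset 10 of the same file is allowed (the ranges only touch), while
copying to offset 5 is rejected.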



* [PATCH 05/11] vfs: use inode_permission in copy_file_range()
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
                   ` (3 preceding siblings ...)
  2018-12-03  8:34 ` [PATCH 04/11] vfs: add missing checks to copy_file_range Dave Chinner
@ 2018-12-03  8:34 ` Dave Chinner
  2018-12-03 12:47   ` Amir Goldstein
                     ` (3 more replies)
  2018-12-03  8:34 ` [PATCH 06/11] vfs: copy_file_range needs to strip setuid bits Dave Chinner
                   ` (6 subsequent siblings)
  11 siblings, 4 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

From: Dave Chinner <dchinner@redhat.com>

Similar to FIDEDUPERANGE, make copy_file_range() check that we have
write permissions to the destination inode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/filemap.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 0a170425935b..876df5275514 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3013,6 +3013,11 @@ int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 	    (file_out->f_flags & O_APPEND))
 		return -EBADF;
 
+	/* make sure we really are allowed to write to the destination inode */
+	ret = inode_permission(inode_out, MAY_WRITE);
+	if (ret < 0)
+		return ret;
+
 	/* Ensure offsets don't wrap. */
 	if (pos_in + count < pos_in || pos_out + count < pos_out)
 		return -EOVERFLOW;
-- 
2.19.1



* [PATCH 06/11] vfs: copy_file_range needs to strip setuid bits
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
                   ` (4 preceding siblings ...)
  2018-12-03  8:34 ` [PATCH 05/11] vfs: use inode_permission in copy_file_range() Dave Chinner
@ 2018-12-03  8:34 ` Dave Chinner
  2018-12-03 12:51   ` Amir Goldstein
  2018-12-04 15:21   ` Christoph Hellwig
  2018-12-03  8:34 ` [PATCH 07/11] vfs: copy_file_range should update file timestamps Dave Chinner
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

From: Dave Chinner <dchinner@redhat.com>

The file we are copying data into needs to have its setuid bit
stripped before we start the data copy so that unprivileged users
can't copy data into executables that are run with root privs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/read_write.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index 69809345977e..3b101183ea19 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1574,6 +1574,16 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
 			    struct file *file_out, loff_t pos_out,
 			    size_t len, unsigned int flags)
 {
+	ssize_t ret;
+
+	/*
+	 * Clear the security bits if the process is not being run by root.
+	 * This keeps people from modifying setuid and setgid binaries.
+	 */
+	ret = file_remove_privs(file_out);
+	if (ret)
+		return ret;
+
 	if (file_out->f_op->copy_file_range)
 		return file_out->f_op->copy_file_range(file_in, pos_in, file_out,
 						      pos_out, len, flags);
-- 
2.19.1



* [PATCH 07/11] vfs: copy_file_range should update file timestamps
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
                   ` (5 preceding siblings ...)
  2018-12-03  8:34 ` [PATCH 06/11] vfs: copy_file_range needs to strip setuid bits Dave Chinner
@ 2018-12-03  8:34 ` Dave Chinner
  2018-12-03 10:47   ` Amir Goldstein
  2018-12-04 15:24   ` Christoph Hellwig
  2018-12-03  8:34 ` [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range Dave Chinner
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

From: Dave Chinner <dchinner@redhat.com>

Timestamps are not updated right now, so programs looking for
timestamp updates for file modifications (like rsync) will not
detect that files have changed. We are also accessing the source
data when doing a copy (but not when cloning) so we need to update
atime on the source file as well.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/read_write.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index 3b101183ea19..3288db1d5f21 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1576,6 +1576,16 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
 {
 	ssize_t ret;
 
+	/* Update source timestamps, because we are accessing file data */
+	file_accessed(file_in);
+
+	/* Update destination timestamps, since we can alter file contents. */
+	if (!(file_out->f_mode & FMODE_NOCMTIME)) {
+		ret = file_update_time(file_out);
+		if (ret)
+			return ret;
+	}
+
 	/*
 	 * Clear the security bits if the process is not being run by root.
 	 * This keeps people from modifying setuid and setgid binaries.
-- 
2.19.1



* [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
                   ` (6 preceding siblings ...)
  2018-12-03  8:34 ` [PATCH 07/11] vfs: copy_file_range should update file timestamps Dave Chinner
@ 2018-12-03  8:34 ` Dave Chinner
  2018-12-03 11:04   ` Amir Goldstein
                     ` (2 more replies)
  2018-12-03  8:34 ` [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down Dave Chinner
                   ` (3 subsequent siblings)
  11 siblings, 3 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

From: Dave Chinner <dchinner@redhat.com>

Before we can enable cross-device copies in copy_file_range(),
we have to ensure that ->remap_file_range() implementations will
correctly reject attempts to do cross filesystem clones. Currently
these checks are done above calls to ->remap_file_range(), but
we need to drive them inwards so that we get EXDEV protection for all
callers of ->remap_file_range().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/read_write.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 3288db1d5f21..174cf92eea1d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1909,6 +1909,19 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	bool same_inode = (inode_in == inode_out);
 	int ret;
 
+	/*
+	 * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
+	 * the same mount. Practically, they only need to be on the same file
+	 * system. We check this here rather than at the ioctl layers because
+	 * this is effectively a limitation of the filesystem implementations,
+	 * not so much the API itself. Further, ->remap_file_range() can be
+	 * called from syscalls that don't have cross device copy restrictions
+	 * (such as copy_file_range()) and so we need to catch them before we
+	 * do any damage.
+	 */
+	if (inode_in->i_sb != inode_out->i_sb)
+		return -EXDEV;
+
 	/* Don't touch certain kinds of inodes */
 	if (IS_IMMUTABLE(inode_out))
 		return -EPERM;
@@ -2013,14 +2026,6 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
 		return -EINVAL;
 
-	/*
-	 * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
-	 * the same mount. Practically, they only need to be on the same file
-	 * system.
-	 */
-	if (inode_in->i_sb != inode_out->i_sb)
-		return -EXDEV;
-
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
 	    (file_out->f_flags & O_APPEND))
-- 
2.19.1
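
The ordering this patch introduces can be modelled in userspace. The sketch
below is only an illustration: struct model_sb, struct model_inode and
model_remap_prep() are hypothetical stand-ins, not kernel definitions. It shows
the same-superblock test running ahead of the other early checks, so that any
caller reaching the prep helper gets -EXDEV first. It only covers
implementations that actually call the prep helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct super_block and struct inode. */
struct model_sb {
	int id;
};

struct model_inode {
	const struct model_sb *i_sb;	/* owning superblock */
	bool immutable;			/* models IS_IMMUTABLE() */
};

/*
 * Models the order of the early checks in the prep helper after this
 * patch: the cross-superblock test comes first, whichever syscall or
 * ioctl the caller arrived through.
 */
static int model_remap_prep(const struct model_inode *in,
			    const struct model_inode *out)
{
	assert(in != NULL && out != NULL);

	if (in->i_sb != out->i_sb)
		return -EXDEV;

	/* Don't touch certain kinds of inodes */
	if (out->immutable)
		return -EPERM;

	return 0;
}
```

With two inodes on different superblocks the model returns -EXDEV even when the
destination is also immutable, which is the precedence the hunk above sets up.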



* [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
                   ` (7 preceding siblings ...)
  2018-12-03  8:34 ` [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range Dave Chinner
@ 2018-12-03  8:34 ` Dave Chinner
  2018-12-03 12:36   ` Amir Goldstein
                     ` (3 more replies)
  2018-12-03  8:34 ` [PATCH 10/11] vfs: allow generic_copy_file_range to copy across devices Dave Chinner
                   ` (2 subsequent siblings)
  11 siblings, 4 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

From: Dave Chinner <dchinner@redhat.com>

We want to enable cross-filesystem copy_file_range functionality
where possible, so push the "same superblock only" checks down to
the individual filesystem callouts so they can make their own
decisions about cross-superblock copy offload.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/ceph/file.c      |  4 +++-
 fs/cifs/cifsfs.c    |  8 +++++++-
 fs/fuse/file.c      |  5 ++++-
 fs/nfs/nfs4file.c   | 16 ++++++++++------
 fs/overlayfs/file.c | 10 +++++++++-
 fs/read_write.c     | 10 ++++------
 6 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index cf29f0410dcb..eb876e19c1dc 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1905,6 +1905,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 
 	if (src_inode == dst_inode)
 		return -EINVAL;
+	if (src_inode->i_sb != dst_inode->i_sb)
+		return -EXDEV;
 	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
 		return -EROFS;
 
@@ -2105,7 +2107,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
 					len, flags);
 
-	if (ret == -EOPNOTSUPP)
+	if (ret == -EOPNOTSUPP || ret == -EXDEV)
 		ret = generic_copy_file_range(src_file, src_off, dst_file,
 					dst_off, len, flags);
 	return ret;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 5ef4baec6234..03e4b9eacbd1 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1072,6 +1072,12 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
 		goto out;
 	}
 
+	if (src_inode->i_sb != target_inode->i_sb) {
+		rc = -EXDEV;
+		goto out;
+	}
+
+
 	if (!src_file->private_data || !dst_file->private_data) {
 		rc = -EBADF;
 		cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
@@ -1142,7 +1148,7 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
 					len, flags);
 	free_xid(xid);
 
-	if (rc == -EOPNOTSUPP)
+	if (rc == -EOPNOTSUPP || rc == -EXDEV)
 		rc = generic_copy_file_range(src_file, off, dst_file,
 					destoff, len, flags);
 	return rc;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b86fb0298739..0758f831a4eb 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3053,6 +3053,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (fc->no_copy_file_range)
 		return -EOPNOTSUPP;
 
+	if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+		return -EXDEV;
+
 	inode_lock(inode_out);
 
 	if (fc->writeback_cache) {
@@ -3109,7 +3112,7 @@ static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
 	ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
 					len, flags);
 
-	if (ret == -EOPNOTSUPP)
+	if (ret == -EOPNOTSUPP || ret == -EXDEV)
 		ret = generic_copy_file_range(src_file, src_off, dst_file,
 					dst_off, len, flags);
 	return ret;
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index d7766a6eb0f4..4783c0c1c49e 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -133,16 +133,20 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
 				    struct file *file_out, loff_t pos_out,
 				    size_t count, unsigned int flags)
 {
-	ssize_t ret;
+	ssize_t ret = -EXDEV;
 
 	if (file_inode(file_in) == file_inode(file_out))
 		return -EINVAL;
-retry:
-	ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
-	if (ret == -EAGAIN)
-		goto retry;
 
-	if (ret == -EOPNOTSUPP)
+	/* only offload copy if superblock is the same */
+	if (file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
+		do {
+			ret = nfs42_proc_copy(file_in, pos_in, file_out,
+					pos_out, count);
+		} while (ret == -EAGAIN);
+	}
+
+	if (ret == -EOPNOTSUPP || ret == -EXDEV)
 		ret = generic_copy_file_range(file_in, pos_in, file_out,
 					pos_out, count, flags);
 	return ret;
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 68736e5d6a56..34fb0398d016 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -443,6 +443,14 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 	const struct cred *old_cred;
 	loff_t ret;
 
+	/*
+	 * Temporary. Cross device copy checks should be left to the copy file
+	 * call on the real inodes, but existing behaviour checks the upper
+	 * files only.
+	 */
+	if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+		return -EXDEV;
+
 	ret = ovl_real_fdget(file_out, &real_out);
 	if (ret)
 		return ret;
@@ -491,7 +499,7 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
 	ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
 			    OVL_COPY);
 
-	if (ret == -EOPNOTSUPP)
+	if (ret == -EOPNOTSUPP || ret == -EXDEV)
 		ret = generic_copy_file_range(file_in, pos_in, file_out,
 					pos_out, len, flags);
 	return ret;
diff --git a/fs/read_write.c b/fs/read_write.c
index 174cf92eea1d..4e0666de0d69 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1565,6 +1565,10 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 			    struct file *file_out, loff_t pos_out,
 			    size_t len, unsigned int flags)
 {
+	/* Temporary, do_splice_direct supports cross-sb copies */
+	if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+		return -EXDEV;
+
 	return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
 			len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
 }
@@ -1611,17 +1615,11 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			    struct file *file_out, loff_t pos_out,
 			    size_t len, unsigned int flags)
 {
-	struct inode *inode_in = file_inode(file_in);
-	struct inode *inode_out = file_inode(file_out);
 	ssize_t ret;
 
 	if (flags != 0)
 		return -EINVAL;
 
-	/* this could be relaxed once a method supports cross-fs copies */
-	if (inode_in->i_sb != inode_out->i_sb)
-		return -EXDEV;
-
 	ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
 					flags);
 	if (ret < 0)
-- 
2.19.1
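
The per-filesystem pattern in this patch (closest to the NFS hunk) can be
sketched as a small userspace model. Everything here is hypothetical: struct
model_file, the offload_fn hook and the model_* functions exist only for
illustration, and the fallback is modelled as it behaves once the temporary
same-superblock check in generic_copy_file_range() has been removed by the
following patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical model of a file: a superblock id plus an in-memory buffer. */
struct model_file {
	int sb_id;
	char *data;
	size_t size;
};

/* Hypothetical offload hook; returns bytes copied or a negative errno. */
typedef ssize_t (*offload_fn)(const struct model_file *in,
			      struct model_file *out, size_t len);

/* Stand-in for generic_copy_file_range(): a byte copy that may be short. */
static ssize_t model_generic_copy(const struct model_file *in,
				  struct model_file *out, size_t len)
{
	if (len > in->size)
		len = in->size;
	if (len > out->size)
		len = out->size;
	memcpy(out->data, in->data, len);
	return (ssize_t)len;
}

/*
 * Shape of a ->copy_file_range method after this patch: only try the
 * offload on the same superblock, retry while it reports -EAGAIN, and
 * fall back when the result is -EOPNOTSUPP or -EXDEV.
 */
static ssize_t model_fs_copy_file_range(const struct model_file *in,
					struct model_file *out, size_t len,
					offload_fn offload)
{
	ssize_t ret = -EXDEV;

	assert(in != NULL && out != NULL && offload != NULL);

	/* only offload copy if superblock is the same */
	if (in->sb_id == out->sb_id) {
		do {
			ret = offload(in, out, len);
		} while (ret == -EAGAIN);
	}

	if (ret == -EOPNOTSUPP || ret == -EXDEV)
		ret = model_generic_copy(in, out, len);
	return ret;
}

/* Example offload that is never supported, forcing the fallback path. */
static ssize_t model_offload_unsupported(const struct model_file *in,
					 struct model_file *out, size_t len)
{
	(void)in;
	(void)out;
	(void)len;
	return -EOPNOTSUPP;
}
```

Initialising ret to -EXDEV means the cross-superblock case and the
"offload not supported" case share one fallback test, which is why the NFS hunk
can drop the retry label in favour of a do/while loop.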



* [PATCH 10/11] vfs: allow generic_copy_file_range to copy across devices
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
                   ` (8 preceding siblings ...)
  2018-12-03  8:34 ` [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down Dave Chinner
@ 2018-12-03  8:34 ` Dave Chinner
  2018-12-03 12:54   ` Amir Goldstein
  2018-12-03  8:34 ` [PATCH 11/11] ovl: allow cross-device copy_file_range calls Dave Chinner
  2018-12-03  8:39 ` [PATCH 12/11] man-pages: copy_file_range updates Dave Chinner
  11 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

From: Dave Chinner <dchinner@redhat.com>

do_splice_direct() can copy across superblocks without problems.
Remove the same superblock restriction on this fallback code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/read_write.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 4e0666de0d69..b0f231b10836 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1565,10 +1565,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 			    struct file *file_out, loff_t pos_out,
 			    size_t len, unsigned int flags)
 {
-	/* Temporary, do_splice_direct supports cross-sb copies */
-	if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
-		return -EXDEV;
-
 	return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
 			len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
 }
-- 
2.19.1



* [PATCH 11/11] ovl: allow cross-device copy_file_range calls
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
                   ` (9 preceding siblings ...)
  2018-12-03  8:34 ` [PATCH 10/11] vfs: allow generic_copy_file_range to copy across devices Dave Chinner
@ 2018-12-03  8:34 ` Dave Chinner
  2018-12-03 12:55   ` Amir Goldstein
  2018-12-03  8:39 ` [PATCH 12/11] man-pages: copy_file_range updates Dave Chinner
  11 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

From: Dave Chinner <dchinner@redhat.com>

Restrictions on cross-device copy_file_range() only affect the
vfs_copy_file_range() call to the lower filesystems. They will
handle the copy appropriately, so OVL will never see an EXDEV error
from them. Hence we can remove the EXDEV checks and error handling
from the ovl_copy_file_range() implementation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/overlayfs/file.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 34fb0398d016..146901d204df 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -443,14 +443,6 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 	const struct cred *old_cred;
 	loff_t ret;
 
-	/*
-	 * Temporary. Cross device copy checks should be left to the copy file
-	 * call on the real inodes, but existing behaviour checks the upper
-	 * files only.
-	 */
-	if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
-		return -EXDEV;
-
 	ret = ovl_real_fdget(file_out, &real_out);
 	if (ret)
 		return ret;
@@ -499,7 +491,8 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
 	ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
 			    OVL_COPY);
 
-	if (ret == -EOPNOTSUPP || ret == -EXDEV)
+	WARN_ON_ONCE(ret == -EXDEV);
+	if (ret == -EOPNOTSUPP)
 		ret = generic_copy_file_range(file_in, pos_in, file_out,
 					pos_out, len, flags);
 	return ret;
-- 
2.19.1



* [PATCH 12/11] man-pages: copy_file_range updates
  2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
                   ` (10 preceding siblings ...)
  2018-12-03  8:34 ` [PATCH 11/11] ovl: allow cross-device copy_file_range calls Dave Chinner
@ 2018-12-03  8:39 ` Dave Chinner
  2018-12-03 13:05   ` Amir Goldstein
  2019-05-21  5:52   ` Amir Goldstein
  11 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03  8:39 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel,
	linux-cifs, linux-api

From: Dave Chinner <dchinner@redhat.com>

Update with all the missing errors the syscall can return, the
behaviour the syscall should have w.r.t. copies within single
files, etc.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 man2/copy_file_range.2 | 94 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 77 insertions(+), 17 deletions(-)

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
index 20374abb21f0..23b00c2f3fea 100644
--- a/man2/copy_file_range.2
+++ b/man2/copy_file_range.2
@@ -42,9 +42,9 @@ without the additional cost of transferring data from the kernel to user space
 and then back into the kernel.
 It copies up to
 .I len
-bytes of data from file descriptor
+bytes of data from the source file descriptor
 .I fd_in
-to file descriptor
+to target file descriptor
 .IR fd_out ,
 overwriting any data that exists within the requested range of the target file.
 .PP
@@ -74,6 +74,11 @@ is not changed, but
 .I off_in
 is adjusted appropriately.
 .PP
+.I fd_in
+and
+.I fd_out
+can refer to the same file. If they refer to the same file, then the source and
+target ranges are not allowed to overlap.
 .PP
 The
 .I flags
@@ -93,34 +98,73 @@ is set to indicate the error.
 .SH ERRORS
 .TP
 .B EBADF
-One or more file descriptors are not valid; or
+One or more file descriptors are not valid.
+.TP
+.B EBADF
 .I fd_in
 is not open for reading; or
 .I fd_out
-is not open for writing; or
-the
+is not open for writing.
+.TP
+.B EBADF
+The
 .B O_APPEND
 flag is set for the open file description referred to by
 .IR fd_out .
 .TP
 .B EFBIG
-An attempt was made to write a file that exceeds the implementation-defined
-maximum file size or the process's file size limit,
-or to write at a position past the maximum allowed offset.
+An attempt was made to write at a position past the maximum file offset the
+kernel supports.
+.TP
+.B EFBIG
+An attempt was made to write a range that exceeds the allowed maximum file size.
+The maximum file size differs between filesystem implementations and can be
+different to the maximum allowed file offset.
+.TP
+.B EFBIG
+An attempt was made to write beyond the process's file size resource
+limit. This may also result in the process receiving a
+.I SIGXFSZ
+signal.
 .TP
 .B EINVAL
-Requested range extends beyond the end of the source file; or the
-.I flags
-argument is not 0.
+.I (off_in + len)
+spans the end of the source file.
 .TP
-.B EIO
-A low-level I/O error occurred while copying.
+.B EINVAL
+.I fd_in
+and
+.I fd_out
+refer to the same file and the source and target ranges overlap.
+.TP
+.B EINVAL
+.I fd_in
+or
+.I fd_out
+is not a regular file.
 .TP
 .B EISDIR
 .I fd_in
 or
 .I fd_out
 refers to a directory.
+.B EINVAL
+The
+.I flags
+argument is not 0.
+.TP
+.B EINVAL
+.I off_in
+or
+.I (off_in + len)
+is beyond the maximum valid file offset.
+.TP
+.B EOVERFLOW
+The requested source or destination range is too large to represent in the
+specified data types.
+.TP
+.B EIO
+A low-level I/O error occurred while copying.
 .TP
 .B ENOMEM
 Out of memory.
@@ -128,16 +172,32 @@ Out of memory.
 .B ENOSPC
 There is not enough space on the target filesystem to complete the copy.
 .TP
-.B EXDEV
-The files referred to by
-.IR file_in " and " file_out
-are not on the same mounted filesystem.
+.B ETXTBSY
+.I fd_in
+or
+.I fd_out
+refers to an active swap file.
+.TP
+.B EPERM
+.I fd_out
+refers to an immutable file.
+.TP
+.B EACCES
+The user does not have write permissions for the destination file.
 .SH VERSIONS
 The
 .BR copy_file_range ()
 system call first appeared in Linux 4.5, but glibc 2.27 provides a user-space
 emulation when it is not available.
 .\" https://sourceware.org/git/?p=glibc.git;a=commit;f=posix/unistd.h;h=bad7a0c81f501fbbcc79af9eaa4b8254441c4a1f
+.PP
+A major rework of the kernel implementation occurred in 4.21. Areas of the API
+that weren't clearly defined were clarified and the API bounds are much more
+strictly checked than on earlier kernels. Applications should target the
+behaviour and requirements of 4.21 kernels.
+.PP
+First support for cross-filesystem copies was introduced in Linux 4.21. Older
+kernels will return -EXDEV when cross-filesystem copies are attempted.
 .SH CONFORMING TO
 The
 .BR copy_file_range ()
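
For applications that have to run on kernels both with and without
cross-filesystem support, the VERSIONS text above suggests a pattern along the
following lines. This is a minimal userspace sketch, not something taken from
the patch set: copy_range_with_fallback(), copy_by_read_write() and write_all()
are hypothetical helper names, and the choice of which errno values trigger the
fallback (EXDEV, ENOSYS, EOPNOTSUPP) is an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define FALLBACK_BUF_SIZE 65536	/* arbitrary bounce-buffer size */

/* Write all n bytes of buf to fd; returns 0, or -1 with errno set. */
static int write_all(int fd, const char *buf, size_t n)
{
	while (n > 0) {
		ssize_t w = write(fd, buf, n);

		if (w < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += w;
		n -= (size_t)w;
	}
	return 0;
}

/* Plain read()/write() copy; returns bytes copied, or -1 with errno set. */
static ssize_t copy_by_read_write(int fd_in, int fd_out, size_t len)
{
	char *buf = malloc(FALLBACK_BUF_SIZE);
	size_t done = 0;

	if (!buf)
		return -1;
	while (done < len) {
		size_t want = len - done;
		ssize_t n;

		if (want > FALLBACK_BUF_SIZE)
			want = FALLBACK_BUF_SIZE;
		n = read(fd_in, buf, want);
		if (n < 0 && errno == EINTR)
			continue;
		if (n < 0 || (n > 0 && write_all(fd_out, buf, (size_t)n) < 0)) {
			int saved = errno;

			free(buf);
			errno = saved;
			return -1;
		}
		if (n == 0)
			break;	/* EOF on the source */
		done += (size_t)n;
	}
	free(buf);
	return (ssize_t)done;
}

/*
 * Copy up to len bytes between the current offsets of fd_in and fd_out.
 * Returns the number of bytes copied, or -1 with errno set.
 */
static ssize_t copy_range_with_fallback(int fd_in, int fd_out, size_t len)
{
	size_t done = 0;

	assert(fd_in >= 0 && fd_out >= 0);
	while (done < len) {
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL,
					    len - done, 0);

		if (n > 0) {
			done += (size_t)n;	/* short copies are allowed */
		} else if (n == 0) {
			break;			/* EOF on the source */
		} else if (errno == EXDEV || errno == ENOSYS ||
			   errno == EOPNOTSUPP) {
			n = copy_by_read_write(fd_in, fd_out, len - done);
			return n < 0 ? -1 : (ssize_t)done + n;
		} else if (errno != EINTR) {
			return -1;
		}
	}
	return (ssize_t)done;
}
```

Because both offsets are passed as NULL, the kernel advances the file offsets
itself, so the read()/write() fallback can resume from wherever a partial
copy_file_range() left off.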


* Re: [PATCH 02/11] vfs: introduce generic_copy_file_range()
  2018-12-03  8:34 ` [PATCH 02/11] vfs: introduce generic_copy_file_range() Dave Chinner
@ 2018-12-03 10:03   ` Amir Goldstein
  2018-12-03 23:00     ` Dave Chinner
  2018-12-04 15:14   ` Christoph Hellwig
  1 sibling, 1 reply; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 10:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs,
	Miklos Szeredi

On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Right now if vfs_copy_file_range() does not use any offload
> mechanism, it falls back to calling do_splice_direct(). This fails
> to do basic sanity checks on the files being copied. Before we
> start adding this necessary functionality to the fallback path,
> separate it out into generic_copy_file_range().
>
> generic_copy_file_range() has the same prototype as
> ->copy_file_range() so that filesystems can use it in their custom
> ->copy_file_range() method if they so choose.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks good.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Question:
2 years ago you suggested that I convert the overlayfs copy up
code that calls do_splice_direct() into a loop of vfs_copy_file_range() calls:
https://marc.info/?l=linux-fsdevel&m=147369468521525&w=2
We ended up with a slightly different solution, but with your recent
changes, I can get back to your original proposal.

Back then, I wondered whether it makes sense to push the killable
loop of shorter do_splice_direct() calls into the vfs helper.
What do you think about adding this to generic_copy_file_range()
now? (I can do that after your changes are merged).

The fact that userspace *can* enter a very long unkillable loop
with current copy_file_range() syscall doesn't mean that we
*should* persist this situation. After all, fixing the brokenness
of the existing interface is what you set out to do.

With that change in place, overlayfs could call only
vfs_copy_file_range() as you suggested and not as a fallback to
do_clone_file_range().

Thanks,
Amir.
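
The killable loop discussed above can be illustrated with a userspace model.
This is a sketch of the idea only: MODEL_CHUNK, stop_fn, model_killable_copy()
and stop_after() are hypothetical names, memcpy() stands in for a bounded
do_splice_direct() call, and the stop predicate stands in for whatever the
kernel would poll between chunks (presumably something like
fatal_signal_pending(current)).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define MODEL_CHUNK 4096	/* hypothetical per-iteration copy limit */

/* Hypothetical predicate polled between chunks; true means "stop now". */
typedef bool (*stop_fn)(void *ctx);

/*
 * Copy len bytes in bounded chunks, polling should_stop() before each one.
 * Returns the bytes copied so far if interrupted part way through, or
 * -EINTR if interrupted before anything was copied.
 */
static ssize_t model_killable_copy(char *dst, const char *src, size_t len,
				   stop_fn should_stop, void *ctx)
{
	size_t done = 0;

	assert(dst != NULL && src != NULL && should_stop != NULL);

	while (done < len) {
		size_t n = len - done;

		if (should_stop(ctx))
			return done ? (ssize_t)done : -EINTR;
		if (n > MODEL_CHUNK)
			n = MODEL_CHUNK;
		memcpy(dst + done, src + done, n);
		done += n;
	}
	return (ssize_t)done;
}

/* Example predicate: allow a fixed number of polls, then ask to stop. */
static bool stop_after(void *ctx)
{
	int *polls_left = ctx;

	return (*polls_left)-- <= 0;
}
```

Since copy_file_range() already permits short copies, returning a partial count
when interrupted would not change the contract that callers have to handle.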

>  fs/read_write.c    | 35 ++++++++++++++++++++++++++++++++---
>  include/linux/fs.h |  3 +++
>  2 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 09d1816cf3cf..50114694c98b 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1540,6 +1540,36 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
>  }
>  #endif
>
> +/**
> + * generic_copy_file_range - copy data between two files
> + * @file_in:   file structure to read from
> + * @pos_in:    file offset to read from
> + * @file_out:  file structure to write data to
> + * @pos_out:   file offset to write data to
> + * @len:       amount of data to copy
> + * @flags:     copy flags
> + *
> + * This is a generic filesystem helper to copy data from one file to another.
> + * It has no constraints on the source or destination file owners - the files
> + * can belong to different superblocks and different filesystem types. Short
> + * copies are allowed.
> + *
> + * This should be called from the @file_out filesystem, as per the
> + * ->copy_file_range() method.
> + *
> + * Returns the number of bytes copied or a negative error indicating the
> + * failure.
> + */
> +
> +ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> +                           struct file *file_out, loff_t pos_out,
> +                           size_t len, unsigned int flags)
> +{
> +       return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> +                       len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
> +}
> +EXPORT_SYMBOL(generic_copy_file_range);
> +
>  /*
>   * copy_file_range() differs from regular file read and write in that it
>   * specifically allows return partial success.  When it does so is up to
> @@ -1611,9 +1641,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>                         goto done;
>         }
>
> -       ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> -                       len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
> -
> +       ret = generic_copy_file_range(file_in, pos_in, file_out, pos_out,
> +                                       len, flags);
>  done:
>         if (ret > 0) {
>                 fsnotify_access(file_in);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c95c0807471f..a4478764cf63 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1874,6 +1874,9 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
>                 unsigned long, loff_t *, rwf_t);
>  extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
>                                    loff_t, size_t, unsigned int);
> +extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> +                               struct file *file_out, loff_t pos_out,
> +                               size_t len, unsigned int flags);
>  extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>                                          struct file *file_out, loff_t pos_out,
>                                          loff_t *count,
> --
> 2.19.1
>


* Re: [PATCH 03/11] vfs: no fallback for ->copy_file_range
  2018-12-03  8:34 ` [PATCH 03/11] vfs: no fallback for ->copy_file_range Dave Chinner
@ 2018-12-03 10:22   ` Amir Goldstein
  2018-12-03 23:02     ` Dave Chinner
  2018-12-03 18:23   ` Anna Schumaker
  2018-12-04 15:16   ` Christoph Hellwig
  2 siblings, 1 reply; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 10:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Now that we have generic_copy_file_range(), remove it as a fallback
> case when offloads fail. This puts the responsibility for executing
> fallbacks on the filesystems that implement ->copy_file_range and
> allows us to add operational validity checks to
> generic_copy_file_range().
>
> Rework vfs_copy_file_range() to call a new do_copy_file_range()
> helper to execute the copying callout, and move calls to
> generic_copy_file_range() into filesystem methods where they
> currently return failures.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

You may add
Reviewed-by: Amir Goldstein <amir73il@gmail.com>

After fixing the overlayfs issue below.
...

> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 84dd957efa24..68736e5d6a56 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -486,8 +486,15 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
>                                    struct file *file_out, loff_t pos_out,
>                                    size_t len, unsigned int flags)
>  {
> -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> +       ssize_t ret;
> +
> +       ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
>                             OVL_COPY);
> +
> +       if (ret == -EOPNOTSUPP)
> +               ret = generic_copy_file_range(file_in, pos_in, file_out,
> +                                       pos_out, len, flags);
> +       return ret;
>  }
>

This is unneeded, because ovl_copyfile(OVL_COPY) is implemented
by calling vfs_copy_file_range() (on the underlying files) and it is
not possible to get EOPNOTSUPP from vfs_copy_file_range().

Thanks,
Amir.


* Re: [PATCH 07/11] vfs: copy_file_range should update file timestamps
  2018-12-03  8:34 ` [PATCH 07/11] vfs: copy_file_range should update file timestamps Dave Chinner
@ 2018-12-03 10:47   ` Amir Goldstein
  2018-12-03 17:33     ` Olga Kornievskaia
  2018-12-03 23:19     ` Dave Chinner
  2018-12-04 15:24   ` Christoph Hellwig
  1 sibling, 2 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 10:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Timestamps are not updated right now, so programs looking for
> timestamp updates for file modifications (like rsync) will not
> detect that files have changed. We are also accessing the source
> data when doing a copy (but not when cloning) so we need to update
> atime on the source file as well.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/read_write.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 3b101183ea19..3288db1d5f21 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1576,6 +1576,16 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
>  {
>         ssize_t ret;
>
> +       /* Update source timestamps, because we are accessing file data */
> +       file_accessed(file_in);
> +
> +       /* Update destination timestamps, since we can alter file contents. */
> +       if (!(file_out->f_mode & FMODE_NOCMTIME)) {
> +               ret = file_update_time(file_out);
> +               if (ret)
> +                       return ret;
> +       }
> +

If there is any consistency about who is responsible for calling file_accessed()
and file_update_time(), it eludes me. grep tells me that they are mostly
handled by filesystem code or generic helpers called by filesystem code
and not in the vfs helpers.

FMODE_NOCMTIME seems like an xfs specific flag (for DMAPI?), which
most generic callers of file_update_time() completely ignore.
This seems like another argument in favor of leaving the responsibility
of the timestamp updates to the filesystem.

Maybe I am missing something?

Thanks,
Amir.


* Re: [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range
  2018-12-03  8:34 ` [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range Dave Chinner
@ 2018-12-03 11:04   ` Amir Goldstein
  2018-12-03 19:11     ` Darrick J. Wong
  2018-12-03 23:34     ` Dave Chinner
  2018-12-03 18:24   ` Darrick J. Wong
  2018-12-04  8:18   ` Olga Kornievskaia
  2 siblings, 2 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 11:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Before we can enable cross-device copies in copy_file_range(),
> we have to ensure that ->remap_file_range() implementations will
> correctly reject attempts to do cross filesystem clones. Currently

But you only fixed the remap_file_range() implementations of xfs and ocfs2...

> these checks are done above calls to ->remap_file_range(), but
> we need to drive them inwards so that we get EXDEV protection for all
> callers of ->remap_file_range().
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/read_write.c | 21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 3288db1d5f21..174cf92eea1d 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1909,6 +1909,19 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>         bool same_inode = (inode_in == inode_out);
>         int ret;
>
> +       /*
> +        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> +        * the same mount. Practically, they only need to be on the same file
> +        * system. We check this here rather than at the ioctl layers because
> +        * this is effectively a limitation of the filesystem implementations,
> +        * not so much the API itself. Further, ->remap_file_range() can be
> +        * called from syscalls that don't have cross device copy restrictions
> +        * (such as copy_file_range()) and so we need to catch them before we
> +        * do any damage.
> +        */
> +       if (inode_in->i_sb != inode_out->i_sb)
> +               return -EXDEV;
> +
>         /* Don't touch certain kinds of inodes */
>         if (IS_IMMUTABLE(inode_out))
>                 return -EPERM;
> @@ -2013,14 +2026,6 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
>         if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
>                 return -EINVAL;
>
> -       /*
> -        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> -        * the same mount. Practically, they only need to be on the same file
> -        * system.
> -        */
> -       if (inode_in->i_sb != inode_out->i_sb)
> -               return -EXDEV;
> -

That leaves {nfs42,cifs,btrfs}_remap_file_range() exposed to being passed
files that are not of their own fs type, let alone on the same sb, when
do_clone_file_range() is called from ovl_copy_up_data().

Thanks,
Amir.


* Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
  2018-12-03  8:34 ` [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down Dave Chinner
@ 2018-12-03 12:36   ` Amir Goldstein
  2018-12-03 17:58   ` Olga Kornievskaia
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 12:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> We want to enable cross-filesystem copy_file_range functionality
> where possible, so push the "same superblock only" checks down to
> the individual filesystem callouts so they can make their own
> decisions about cross-superblock copy offload.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good.
You may add
Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Similar comment about overlayfs as patch 3.

> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 68736e5d6a56..34fb0398d016 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -443,6 +443,14 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
>         const struct cred *old_cred;
>         loff_t ret;
>
> +       /*
> +        * Temporary. Cross device copy checks should be left to the copy file
> +        * call on the real inodes, but existing behaviour checks the upper
> +        * files only.
> +        */
> +       if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> +               return -EXDEV;
> +
>         ret = ovl_real_fdget(file_out, &real_out);
>         if (ret)
>                 return ret;
> @@ -491,7 +499,7 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
>         ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
>                             OVL_COPY);
>
> -       if (ret == -EOPNOTSUPP)
> +       if (ret == -EOPNOTSUPP || ret == -EXDEV)
>                 ret = generic_copy_file_range(file_in, pos_in, file_out,
>                                         pos_out, len, flags);

This fallback is already provided by vfs_copy_file_range().

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-03  8:34 ` [PATCH 04/11] vfs: add missing checks to copy_file_range Dave Chinner
@ 2018-12-03 12:42   ` Amir Goldstein
  2018-12-03 19:04   ` Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 12:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Like the clone and dedupe interfaces we've recently fixed, the
> copy_file_range() implementation is missing basic sanity, limits and
> boundary condition tests on the parameters that are passed to it
> from userspace. Create a new "generic_copy_file_checks()" function
> modelled on the generic_remap_checks() function to provide this
> missing functionality.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks good.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2018-12-03  8:34 ` [PATCH 01/11] vfs: copy_file_range source range over EOF should fail Dave Chinner
@ 2018-12-03 12:46   ` Amir Goldstein
  2018-12-04 15:13     ` Christoph Hellwig
  0 siblings, 1 reply; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 12:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> The man page says:
>
> EINVAL Requested range extends beyond the end of the source file
>
> But the current behaviour is that copy_file_range does a short
> copy up to the source file EOF. Fix the kernel behaviour to match
> the behaviour described in the man page.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/read_write.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 4dae0399c75a..09d1816cf3cf 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1581,6 +1581,10 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>         if (len == 0)
>                 return 0;
>
> +       /* If the source range crosses EOF, fail the copy */
> +       if (pos_in >= i_size(inode_in) || pos_in + len > i_size(inode_in))
> +               return -EINVAL;
> +

Please use i_size_read() here rather than i_size().
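
For illustration, the intent of the check can be modelled in userspace. This is
only a sketch: check_source_range() is a hypothetical name, not a kernel
function, and its file_size parameter stands in for whatever
i_size_read(inode_in) would return. The comparison is written so that
pos_in + len cannot overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Userspace model of the "source range crosses EOF" check.
 * file_size is an assumption standing in for i_size_read(inode_in).
 * Returns 0 if the range is acceptable, -EINVAL otherwise.
 */
int check_source_range(int64_t pos_in, uint64_t len, int64_t file_size)
{
	assert(file_size >= 0);

	if (pos_in < 0)
		return -EINVAL;

	/* The patch returns early for zero-length copies. */
	if (len == 0)
		return 0;

	/* Start offset at or beyond EOF. */
	if (pos_in >= file_size)
		return -EINVAL;

	/* Same as pos_in + len > file_size, without computing the sum. */
	if (len > (uint64_t)(file_size - pos_in))
		return -EINVAL;

	return 0;
}
```

A copy that ends exactly at EOF passes; one byte further fails.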

Otherwise
Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.


> --
> 2.19.1
>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 05/11] vfs: use inode_permission in copy_file_range()
  2018-12-03  8:34 ` [PATCH 05/11] vfs: use inode_permission in copy_file_range() Dave Chinner
@ 2018-12-03 12:47   ` Amir Goldstein
  2018-12-03 18:18   ` Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 12:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Similar to FI_DEDUPERANGE, make copy_file_range() check that we have
> write permissions to the destination inode.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks good.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 06/11] vfs: copy_file_range needs to strip setuid bits
  2018-12-03  8:34 ` [PATCH 06/11] vfs: copy_file_range needs to strip setuid bits Dave Chinner
@ 2018-12-03 12:51   ` Amir Goldstein
  2018-12-04 15:21   ` Christoph Hellwig
  1 sibling, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 12:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> The file we are copying data into needs to have its setuid bit
> stripped before we start the data copy so that unprivileged users
> can't copy data into executables that are run with root privs.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks good.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 10/11] vfs: allow generic_copy_file_range to copy across devices
  2018-12-03  8:34 ` [PATCH 10/11] vfs: allow generic_copy_file_range to copy across devices Dave Chinner
@ 2018-12-03 12:54   ` Amir Goldstein
  0 siblings, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 12:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> do_splice_direct() can copy across superblocks without problems.
> Remove the same superblock restriction on this fallback code.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks good.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 11/11] ovl: allow cross-device copy_file_range calls
  2018-12-03  8:34 ` [PATCH 11/11] ovl: allow cross-device copy_file_range calls Dave Chinner
@ 2018-12-03 12:55   ` Amir Goldstein
  0 siblings, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 12:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Restrictions on cross-device copy_file_range() only affect the
> vfs_copy_file_range() call to the lower filesystems. They will
> handle the copy appropriately, so OVL will never see a EXDEV error
> from them. Hence we can remove the EXDEV checks and error handling
> from the ovl_copy_file_range() implementation.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks good.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 12/11] man-pages: copy_file_range updates
  2018-12-03  8:39 ` [PATCH 12/11] man-pages: copy_file_range updates Dave Chinner
@ 2018-12-03 13:05   ` Amir Goldstein
  2019-05-21  5:52   ` Amir Goldstein
  1 sibling, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-03 13:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs,
	linux-api

On Mon, Dec 3, 2018 at 10:40 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Update with all the missing errors the syscall can return, the
> behaviour the syscall should have w.r.t. copies within single
> files, etc.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  man2/copy_file_range.2 | 94 +++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 77 insertions(+), 17 deletions(-)
>
> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> index 20374abb21f0..23b00c2f3fea 100644
> --- a/man2/copy_file_range.2
> +++ b/man2/copy_file_range.2
> @@ -42,9 +42,9 @@ without the additional cost of transferring data from the kernel to user space
>  and then back into the kernel.
>  It copies up to
>  .I len
> -bytes of data from file descriptor
> +bytes of data from the source file descriptor
>  .I fd_in
> -to file descriptor
> +to target file descriptor
>  .IR fd_out ,
>  overwriting any data that exists within the requested range of the target file.
>  .PP
> @@ -74,6 +74,11 @@ is not changed, but
>  .I off_in
>  is adjusted appropriately.
>  .PP
> +.I fd_in
> +and
> +.I fd_out
> +can refer to the same file. If they refer to the same file, then the source and
> +target ranges are not allowed to overlap.
>  .PP
>  The
>  .I flags
> @@ -93,34 +98,73 @@ is set to indicate the error.
>  .SH ERRORS
>  .TP
>  .B EBADF
> -One or more file descriptors are not valid; or
> +One or more file descriptors are not valid.
> +.TP
> +.B EBADF
>  .I fd_in
>  is not open for reading; or
>  .I fd_out
> -is not open for writing; or
> -the
> +is not open for writing.
> +.TP
> +.B EBADF
> +The
>  .B O_APPEND
>  flag is set for the open file description referred to by
>  .IR fd_out .
>  .TP
>  .B EFBIG
> -An attempt was made to write a file that exceeds the implementation-defined
> -maximum file size or the process's file size limit,
> -or to write at a position past the maximum allowed offset.
> +An attempt was made to write at a position past the maximum file offset the
> +kernel supports.
> +.TP
> +.B EFBIG
> +An attempt was made to write a range that exceeds the allowed maximum file size.
> +The maximum file size differs between filesystem implementations and can be
> +different to the maximum allowed file offset.
> +.TP
> +.B EFBIG
> +An attempt was made to write beyond the process's file size resource
> +limit. This may also result in the process receiving a
> +.I SIGXFSZ
> +signal.
>  .TP
>  .B EINVAL
> -Requested range extends beyond the end of the source file; or the
> -.I flags
> -argument is not 0.
> +.I (off_in + len)
> +spans the end of the source file.
>  .TP
> -.B EIO
> -A low-level I/O error occurred while copying.
> +.B EINVAL
> +.I fd_in
> +and
> +.I fd_out
> +refer to the same file and the source and target ranges overlap.
> +.TP
> +.B EINVAL
> +.I fd_in
> +or
> +.I fd_out
> +is not a regular file.
>  .TP
>  .B EISDIR
>  .I fd_in
>  or
>  .I fd_out
>  refers to a directory.
> +.B EINVAL
> +The
> +.I flags
> +argument is not 0.
> +.TP
> +.B EINVAL
> +.I off_in
> +or
> +.I (off_in + len)
> +is beyond the maximum valid file offset.
> +.TP
> +.B EOVERFLOW
> +The requested source or destination range is too large to represent in the
> +specified data types.
> +.TP
> +.B EIO
> +A low-level I/O error occurred while copying.
>  .TP
>  .B ENOMEM
>  Out of memory.
> @@ -128,16 +172,32 @@ Out of memory.
>  .B ENOSPC
>  There is not enough space on the target filesystem to complete the copy.
>  .TP
> -.B EXDEV
> -The files referred to by
> -.IR file_in " and " file_out
> -are not on the same mounted filesystem.
> +.B ETXTBSY
> +.I fd_in
> +or
> +.I fd_out
> +refers to an active swap file.
> +.TP
> +.B EPERM
> +.I fd_out
> +refers to an immutable file.
> +.TP
> +.B EACCES
> +The user does not have write permissions for the destination file.
>  .SH VERSIONS
>  The
>  .BR copy_file_range ()
>  system call first appeared in Linux 4.5, but glibc 2.27 provides a user-space
>  emulation when it is not available.
>  .\" https://sourceware.org/git/?p=glibc.git;a=commit;f=posix/unistd.h;h=bad7a0c81f501fbbcc79af9eaa4b8254441c4a1f
> +.PP
> +A major rework of the kernel implementation occurred in 4.21. Areas of the API
> +that weren't clearly defined were clarified and the API bounds are much more
> +strictly checked than on earlier kernels. Applications should target the
> +behaviour and requirements of 4.21 kernels.
> +.PP
> +First support for cross-filesystem copies was introduced in Linux 4.21. Older
> +kernels will return -EXDEV when cross-filesystem copies are attempted.

IMO, you should leave the entry for expected error EXDEV in place and prefix
it with "Prior to Linux 4.21..."
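
For applications that have to run on both sides of that change, a minimal
userspace sketch of handling EXDEV might look like the following. It assumes a
C library that provides the copy_file_range() wrapper (glibc 2.27 or later);
copy_range_with_fallback() and copy_by_read_write() are hypothetical helper
names, not part of any library. Both helpers use the current file offsets.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain read/write loop, used when the kernel refuses to do the copy. */
static ssize_t copy_by_read_write(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = read(fd_in, buf, want);

		if (n < 0)
			return -1;
		if (n == 0)
			break;		/* source EOF: short copy */

		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(fd_out, buf + off, (size_t)(n - off));

			if (w < 0)
				return -1;
			off += w;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/*
 * Try copy_file_range() first. If the very first call fails with EXDEV
 * (older kernels, cross-filesystem), ENOSYS or EOPNOTSUPP, fall back to
 * the read/write loop. Returns bytes copied, or -1 with errno set.
 */
ssize_t copy_range_with_fallback(int fd_in, int fd_out, size_t len)
{
	size_t done = 0;

	assert(fd_in >= 0 && fd_out >= 0);

	while (done < len) {
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL,
					    len - done, 0);

		if (n < 0) {
			if (done == 0 && (errno == EXDEV || errno == ENOSYS ||
					  errno == EOPNOTSUPP))
				return copy_by_read_write(fd_in, fd_out, len);
			return -1;
		}
		if (n == 0)
			break;		/* nothing more to copy */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

Treating only those three errno values as "try the slow path" is a design
choice for this sketch; errors such as EINVAL or EBADF are passed back to the
caller unchanged.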

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/11] vfs: copy_file_range should update file timestamps
  2018-12-03 10:47   ` Amir Goldstein
@ 2018-12-03 17:33     ` Olga Kornievskaia
  2018-12-03 18:22       ` Darrick J. Wong
  2018-12-03 23:19     ` Dave Chinner
  1 sibling, 1 reply; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-03 17:33 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: david, linux-fsdevel, linux-xfs, linux-nfs, linux-unionfs,
	ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 5:47 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Timestamps are not updated right now, so programs looking for
> > timestamp updates for file modifications (like rsync) will not
> > detect that files have changed. We are also accessing the source
> > data when doing a copy (but not when cloning) so we need to update
> > atime on the source file as well.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/read_write.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 3b101183ea19..3288db1d5f21 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1576,6 +1576,16 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> >  {
> >         ssize_t ret;
> >
> > +       /* Update source timestamps, because we are accessing file data */
> > +       file_accessed(file_in);
> > +
> > +       /* Update destination timestamps, since we can alter file contents. */
> > +       if (!(file_out->f_mode & FMODE_NOCMTIME)) {
> > +               ret = file_update_time(file_out);
> > +               if (ret)
> > +                       return ret;
> > +       }
> > +
>
> If there is a consistency about who is responsible of calling file_accessed()
> and file_update_time() it eludes me. grep tells me that they are mostly
> handled by filesystem code or generic helpers called by filesystem code
> and not in the vfs helpers.
>
> FMODE_NOCMTIME seems like an xfs specific flag (for DMAPI?), which
> most generic callers of file_update_time() completely ignore.
> This seems like another argument in favor of leaving the responsibility
> of the timestamp updates to the filesystem.
>
> Maybe I am missing something?
>

I had a similar question before about who is responsible for doing the
checks. I agree that attributes should be updated for the case when no
filesystem support exists for copy_file_range(), but this code does it
for all cases. I also wonder whether it's appropriate to update the
attributes before the copy is actually done.

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
  2018-12-03  8:34 ` [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down Dave Chinner
  2018-12-03 12:36   ` Amir Goldstein
@ 2018-12-03 17:58   ` Olga Kornievskaia
  2018-12-03 18:53   ` Anna Schumaker
  2018-12-04 15:43   ` Christoph Hellwig
  3 siblings, 0 replies; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-03 17:58 UTC (permalink / raw)
  To: david
  Cc: linux-fsdevel, linux-xfs, linux-nfs, linux-unionfs, ceph-devel,
	linux-cifs

On Mon, Dec 3, 2018 at 3:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> We want to enable cross-filesystem copy_file_range functionality
> where possible, so push the "same superblock only" checks down to
> the individual filesystem callouts so they can make their own
> decisions about cross-superblock copy offload.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Overall VFS/NFS bits look good to me. I'm re-basing my client and
server patch series on top of this and will test it out.

Thank you.

> ---
>  fs/ceph/file.c      |  4 +++-
>  fs/cifs/cifsfs.c    |  8 +++++++-
>  fs/fuse/file.c      |  5 ++++-
>  fs/nfs/nfs4file.c   | 16 ++++++++++------
>  fs/overlayfs/file.c | 10 +++++++++-
>  fs/read_write.c     | 10 ++++------
>  6 files changed, 37 insertions(+), 16 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index cf29f0410dcb..eb876e19c1dc 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1905,6 +1905,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>
>         if (src_inode == dst_inode)
>                 return -EINVAL;
> +       if (src_inode->i_sb != dst_inode->i_sb)
> +               return -EXDEV;
>         if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>                 return -EROFS;
>
> @@ -2105,7 +2107,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
>         ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
>                                         len, flags);
>
> -       if (ret == -EOPNOTSUPP)
> +       if (ret == -EOPNOTSUPP || ret == -EXDEV)
>                 ret = generic_copy_file_range(src_file, src_off, dst_file,
>                                         dst_off, len, flags);
>         return ret;
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index 5ef4baec6234..03e4b9eacbd1 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1072,6 +1072,12 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
>                 goto out;
>         }
>
> +       if (src_inode->i_sb != target_inode->i_sb) {
> +               rc = -EXDEV;
> +               goto out;
> +       }
> +
> +
>         if (!src_file->private_data || !dst_file->private_data) {
>                 rc = -EBADF;
>                 cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
> @@ -1142,7 +1148,7 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
>                                         len, flags);
>         free_xid(xid);
>
> -       if (rc == -EOPNOTSUPP)
> +       if (rc == -EOPNOTSUPP || rc == -EXDEV)
>                 rc = generic_copy_file_range(src_file, off, dst_file,
>                                         destoff, len, flags);
>         return rc;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index b86fb0298739..0758f831a4eb 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3053,6 +3053,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
>         if (fc->no_copy_file_range)
>                 return -EOPNOTSUPP;
>
> +       if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> +               return -EXDEV;
> +
>         inode_lock(inode_out);
>
>         if (fc->writeback_cache) {
> @@ -3109,7 +3112,7 @@ static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
>         ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
>                                         len, flags);
>
> -       if (ret == -EOPNOTSUPP)
> +       if (ret == -EOPNOTSUPP || ret == -EXDEV)
>                 ret = generic_copy_file_range(src_file, src_off, dst_file,
>                                         dst_off, len, flags);
>         return ret;
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index d7766a6eb0f4..4783c0c1c49e 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -133,16 +133,20 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
>                                     struct file *file_out, loff_t pos_out,
>                                     size_t count, unsigned int flags)
>  {
> -       ssize_t ret;
> +       ssize_t ret = -EXDEV;
>
>         if (file_inode(file_in) == file_inode(file_out))
>                 return -EINVAL;
> -retry:
> -       ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
> -       if (ret == -EAGAIN)
> -               goto retry;
>
> -       if (ret == -EOPNOTSUPP)
> +       /* only offload copy if superblock is the same */
> +       if (file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
> +               do {
> +                       ret = nfs42_proc_copy(file_in, pos_in, file_out,
> +                                       pos_out, count);
> +               } while (ret == -EAGAIN);
> +       }
> +
> +       if (ret == -EOPNOTSUPP || ret == -EXDEV)
>                 ret = generic_copy_file_range(file_in, pos_in, file_out,
>                                         pos_out, count, flags);
>         return ret;
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 68736e5d6a56..34fb0398d016 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -443,6 +443,14 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
>         const struct cred *old_cred;
>         loff_t ret;
>
> +       /*
> +        * Temporary. Cross device copy checks should be left to the copy file
> +        * call on the real inodes, but existing behaviour checks the upper
> +        * files only.
> +        */
> +       if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> +               return -EXDEV;
> +
>         ret = ovl_real_fdget(file_out, &real_out);
>         if (ret)
>                 return ret;
> @@ -491,7 +499,7 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
>         ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
>                             OVL_COPY);
>
> -       if (ret == -EOPNOTSUPP)
> +       if (ret == -EOPNOTSUPP || ret == -EXDEV)
>                 ret = generic_copy_file_range(file_in, pos_in, file_out,
>                                         pos_out, len, flags);
>         return ret;
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 174cf92eea1d..4e0666de0d69 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1565,6 +1565,10 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
>                             struct file *file_out, loff_t pos_out,
>                             size_t len, unsigned int flags)
>  {
> +       /* Temporary, do_splice_direct supports cross-sb copies */
> +       if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> +               return -EXDEV;
> +
>         return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
>                         len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
>  }
> @@ -1611,17 +1615,11 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>                             struct file *file_out, loff_t pos_out,
>                             size_t len, unsigned int flags)
>  {
> -       struct inode *inode_in = file_inode(file_in);
> -       struct inode *inode_out = file_inode(file_out);
>         ssize_t ret;
>
>         if (flags != 0)
>                 return -EINVAL;
>
> -       /* this could be relaxed once a method supports cross-fs copies */
> -       if (inode_in->i_sb != inode_out->i_sb)
> -               return -EXDEV;
> -
>         ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
>                                         flags);
>         if (ret < 0)
> --
> 2.19.1
>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 05/11] vfs: use inode_permission in copy_file_range()
  2018-12-03  8:34 ` [PATCH 05/11] vfs: use inode_permission in copy_file_range() Dave Chinner
  2018-12-03 12:47   ` Amir Goldstein
@ 2018-12-03 18:18   ` Darrick J. Wong
  2018-12-03 23:55     ` Dave Chinner
  2018-12-03 18:53   ` Eric Biggers
  2018-12-04 15:19   ` Christoph Hellwig
  3 siblings, 1 reply; 83+ messages in thread
From: Darrick J. Wong @ 2018-12-03 18:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 07:34:10PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Similar to FI_DEDUPERANGE, make copy_file_range() check that we have

TLDR: No, it's not similar to FIDEDUPERANGE -- the use of
inode_permission() in allow_file_dedupe() is to enable callers to dedupe
into a file for which the caller has write permissions but opened the
file O_RDONLY.

[Please keep reading...]

> write permissions to the destination inode.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mm/filemap.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0a170425935b..876df5275514 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3013,6 +3013,11 @@ int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>  	    (file_out->f_flags & O_APPEND))
>  		return -EBADF;
>  
> +	/* make sure we really are allowed to write to the destination inode */
> +	ret = inode_permission(inode_out, MAY_WRITE);

What's the difference between security_file_permission and
inode_permission, and when do we call them for a regular
open-write-close sequence?  Hmmm, let me take a look:

It looks like we call inode_permission at open() time to make sure that
the file permissions permit writes and the file isn't immutable.
security_file_permission gets called at write() time to recheck with the
security policy, but once a process has been granted a writable file
descriptor, it retains that privilege until it closes the fd.  In other
words, we check at open time, not at operation time.

I think.  Nothing is ever that simple, so let's check behavior:

So let's try opening a file for write, removing write permissions, then
writing to the file:

$ rm -rf xyz
$ touch xyz
$ ls -lad xyz
-rw-rw-r-- 1 djwong djwong 0 Dec  3 09:28 xyz
$ xfs_io xyz
xfs_io> pwrite -S 0x58 0 4k
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0000 sec (130.208 MiB/sec and 33333.3333 ops/sec)
xfs_io> 
[1]+  Stopped                 xfs_io xyz
$ chmod a-w xyz
$ sudo chown root:root xyz
$ ls -lad xyz
-r--r--r-- 1 root root 4096 Dec  3 09:28 xyz
$ fg
xfs_io xyz

xfs_io> pwrite -S 0x58 0 8k
wrote 8192/8192 bytes at offset 0
8 KiB, 2 ops; 0.0000 sec (558.036 MiB/sec and 142857.1429 ops/sec)
xfs_io>
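
The chmod half of that experiment can also be reproduced without xfs_io. A
minimal C sketch follows, assuming a local filesystem and a writable temporary
directory; write_after_chmod() is a hypothetical helper name. It does not
cover the chown or chattr +i steps, which need root.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a temporary file from a mkstemp() template, drop every write
 * permission bit while the descriptor is still open, then try to write.
 * Returns 1 if the write succeeded, 0 if it failed, -1 on setup error.
 */
int write_after_chmod(char *path_template)
{
	int fd;
	ssize_t n;

	assert(path_template != NULL);

	fd = mkstemp(path_template);	/* opened read-write, mode 0600 */
	if (fd < 0)
		return -1;

	if (fchmod(fd, 0444) < 0) {	/* same effect as chmod a-w */
		close(fd);
		unlink(path_template);
		return -1;
	}

	n = write(fd, "X", 1);		/* permission was checked at open() */

	close(fd);
	unlink(path_template);
	return n == 1 ? 1 : 0;
}
```

Calling it with a template such as "/tmp/cfr_permXXXXXX" held in a writable
char array exercises the same open-then-revoke sequence as the transcript.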

Yep, we can write to an already-open file even if we change its
ownership and take away write permission.  How about the immutable flag?

$ touch b
$ xfs_io b
xfs_io> pwrite -S 0x58 0 4k
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0000 sec (75.120 MiB/sec and 19230.7692 ops/sec)
xfs_io> sync
xfs_io> 
[1]+  Stopped                 xfs_io b
$ sudo chown root:root b
$ sudo chattr +i b
$ ls -lad b ; lsattr b
-rw-rw-r-- 1 root root 4096 Dec  3 09:51 b
----i-------------- b
$ fg
xfs_io b

xfs_io> pwrite -S 0x58 0 6k
wrote 6144/6144 bytes at offset 0
6 KiB, 2 ops; 0.0000 sec (102.796 MiB/sec and 35087.7193 ops/sec)
xfs_io> sync
xfs_io> 

Similarly, it looks like we can write to an already-open file even if we
change its ownership and mark the file immutable.  Both of these are a
little unexpected; I would have thought at least that +i would have
prevented the write.

How about reflink?

$ rm -rf a b
$ xfs_io -f -c 'pwrite -S 0x58 0 64k' a
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (1.744 GiB/sec and 457142.8571 ops/sec)
$ xfs_io -f -c 'pwrite -S 0x58 0 64k' b
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (1.969 GiB/sec and 516129.0323 ops/sec)
$ xfs_io b
xfs_io> 
[1]+  Stopped                 xfs_io b
$ chmod a-w b
$ sudo chown root:root b
$ sudo chattr +i b
$ fg
xfs_io b

xfs_io> reflink a 0 0 4k
XFS_IOC_CLONE_RANGE: Operation not permitted
xfs_io> 
[1]+  Stopped                 xfs_io b
$ sudo chattr -i b
$ fg
xfs_io b

xfs_io> reflink a 0 0 4k
linked 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0004 sec (7.988 MiB/sec and 2044.9898 ops/sec)
xfs_io> 

We cannot reflink into a file that becomes immutable after we open it
for write, but we can reflink into a file that loses its write
permissions after we open it.  What about dedupe?

$ rm -rf a b
$ xfs_io -f -c 'pwrite -S 0x58 0 64k' a
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (1.795 GiB/sec and 470588.2353 ops/sec)
$ xfs_io -f -c 'pwrite -S 0x58 0 64k' b
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0001 sec (512.295 MiB/sec and 131147.5410 ops/sec)
$ chmod a-w b
$ sudo chown root:root b
$ sudo chattr +i b
$ fg
xfs_io b

xfs_io> dedupe a 0 0 4k
XFS_IOC_FILE_EXTENT_SAME: Operation not permitted
xfs_io> 
[1]+  Stopped                 xfs_io b
$ sudo chattr -i b
$ fg
xfs_io b

xfs_io> dedupe a 0 0 4k
deduped 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0160 sec (249.688 KiB/sec and 62.4220 ops/sec)
xfs_io>

We also cannot dedupe into a file that becomes immutable after we open
it for write, but we can dedupe into a file that loses its write
permissions after we open it.

Summarized:

op:		after +immutable?	after chmod a-w?
write		yes			yes
clonerange	no			yes
dedupe		no			yes
newcopyrange	no			no

My reaction: I don't think that writes should be allowed after an
administrator marks a file immutable (but that's a separate issue) but I
do think we should be consistent in allowing copying into a file that
has lost its write permissions after we opened the file for write, like
we do for write() and the remap ioct....

*OH*

Now I remember what the FIDEDUPERANGE inode_permission call is for!
It's because dedupe tools want to be able to open a file readonly and
have dedupe remap another file's identical blocks into the readonly
file, provided that the process would have been able to open the file
for writing had it asked.

[Hugging hug hug huggy hugging hug of hug interface!!! :P]

--D

> +	if (ret < 0)
> +		return ret;
> +
>  	/* Ensure offsets don't wrap. */
>  	if (pos_in + count < pos_in || pos_out + count < pos_out)
>  		return -EOVERFLOW;
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/11] vfs: copy_file_range should update file timestamps
  2018-12-03 17:33     ` Olga Kornievskaia
@ 2018-12-03 18:22       ` Darrick J. Wong
  0 siblings, 0 replies; 83+ messages in thread
From: Darrick J. Wong @ 2018-12-03 18:22 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Amir Goldstein, david, linux-fsdevel, linux-xfs, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 12:33:50PM -0500, Olga Kornievskaia wrote:
> On Mon, Dec 3, 2018 at 5:47 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > Timestamps are not updated right now, so programs looking for
> > > timestamp updates for file modifications (like rsync) will not
> > > detect that files have changed. We are also accessing the source
> > > data when doing a copy (but not when cloning) so we need to update
> > > atime on the source file as well.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/read_write.c | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > index 3b101183ea19..3288db1d5f21 100644
> > > --- a/fs/read_write.c
> > > +++ b/fs/read_write.c
> > > @@ -1576,6 +1576,16 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> > >  {
> > >         ssize_t ret;
> > >
> > > +       /* Update source timestamps, because we are accessing file data */
> > > +       file_accessed(file_in);
> > > +
> > > +       /* Update destination timestamps, since we can alter file contents. */
> > > +       if (!(file_out->f_mode & FMODE_NOCMTIME)) {
> > > +               ret = file_update_time(file_out);
> > > +               if (ret)
> > > +                       return ret;
> > > +       }
> > > +
> >
> > If there is a consistency about who is responsible of calling file_accessed()
> > and file_update_time() it eludes me. grep tells me that they are mostly
> > handled by filesystem code or generic helpers called by filesystem code
> > and not in the vfs helpers.
> >
> > FMODE_NOCMTIME seems like an xfs specific flag (for DMAPI?), which
> > most generic callers of file_update_time() completely ignore.
> > This seems like another argument in favor of leaving the responsibility
> > of the timestamp updates to the filesystem.
> >
> > Maybe I am missing something?
> >
> 
> I had similar question before about who is responsible for doing the
> checks. I agree that attributes should be updated for the case when no
> filesystem support exist for copy_file_range() but this code does it
> for all the cases. I also wonder if it's appropriate to update the
> attributes before the copy is actually done?

The other functions that change file contents (write, clonerange) update
mtime and remove suid before initiating the operation.  For mtime I
think we should maintain consistent behavior, and for suid removal we
definitely need to revoke that before we change the file contents.

--D
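The ordering argued for above (revoke setuid and bump mtime first, touch
the data second) can be sketched in userspace C. This is only a model:
struct model_inode, MODEL_SETUID, model_prepare_write() and model_copy()
are hypothetical names made up for illustration, not kernel interfaces.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an inode; not a kernel structure. */
struct model_inode {
	unsigned int mode;	/* permission bits, including setuid (04000) */
	long mtime;		/* last modification time */
	char data[32];		/* file contents */
};

#define MODEL_SETUID 04000u

/* Prepare first: revoke setuid and bump mtime before any data changes. */
static void model_prepare_write(struct model_inode *inode, long now)
{
	inode->mode &= ~MODEL_SETUID;
	inode->mtime = now;
}

/* Copy up to len bytes; the preparation step always runs first. */
static size_t model_copy(struct model_inode *dst, const char *src,
			 size_t len, long now)
{
	model_prepare_write(dst, now);
	if (len > sizeof(dst->data))
		len = sizeof(dst->data);
	memcpy(dst->data, src, len);
	return len;
}
```

Because the preparation runs before memcpy(), there is no window in this
model where new contents sit in a file that still carries the setuid bit.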

> > Thanks,
> > Amir.


* Re: [PATCH 03/11] vfs: no fallback for ->copy_file_range
  2018-12-03  8:34 ` [PATCH 03/11] vfs: no fallback for ->copy_file_range Dave Chinner
  2018-12-03 10:22   ` Amir Goldstein
@ 2018-12-03 18:23   ` Anna Schumaker
  2018-12-04 15:16   ` Christoph Hellwig
  2 siblings, 0 replies; 83+ messages in thread
From: Anna Schumaker @ 2018-12-03 18:23 UTC (permalink / raw)
  To: Dave Chinner, linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

Hi Dave,

On Mon, 2018-12-03 at 19:34 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have generic_copy_file_range(), remove it as a fallback
> case when offloads fail. This puts the responsibility for executing
> fallbacks on the filesystems that implement ->copy_file_range and
> allows us to add operational validity checks to
> generic_copy_file_range().
> 
> Rework vfs_copy_file_range() to call a new do_copy_file_range()
> helper to exceute the copying callout, and move calls to
> generic_file_copy_range() into filesystem methods where they
> currently return failures.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/ceph/file.c      | 17 ++++++++++++++++-
>  fs/cifs/cifsfs.c    |  4 ++++
>  fs/fuse/file.c      | 17 ++++++++++++++++-
>  fs/nfs/nfs4file.c   |  4 ++++

The NFS bits look okay to me:
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

>  fs/overlayfs/file.c |  9 ++++++++-
>  fs/read_write.c     | 24 +++++++++++++++---------
>  6 files changed, 63 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 189df668b6a0..cf29f0410dcb 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1885,7 +1885,7 @@ static int is_file_size_ok(struct inode *src_inode,
> struct inode *dst_inode,
>  	return 0;
>  }
>  
> -static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> +static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>  				    struct file *dst_file, loff_t dst_off,
>  				    size_t len, unsigned int flags)
>  {
> @@ -2096,6 +2096,21 @@ static ssize_t ceph_copy_file_range(struct file
> *src_file, loff_t src_off,
>  	return ret;
>  }
>  
> +static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> +				    struct file *dst_file, loff_t dst_off,
> +				    size_t len, unsigned int flags)
> +{
> +	ssize_t ret;
> +
> +	ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
> +					len, flags);
> +
> +	if (ret == -EOPNOTSUPP)
> +		ret = generic_copy_file_range(src_file, src_off, dst_file,
> +					dst_off, len, flags);
> +	return ret;
> +}
> +
>  const struct file_operations ceph_file_fops = {
>  	.open = ceph_open,
>  	.release = ceph_release,
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index 865706edb307..5ef4baec6234 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1141,6 +1141,10 @@ static ssize_t cifs_copy_file_range(struct file
> *src_file, loff_t off,
>  	rc = cifs_file_copychunk_range(xid, src_file, off, dst_file, destoff,
>  					len, flags);
>  	free_xid(xid);
> +
> +	if (rc == -EOPNOTSUPP)
> +		rc = generic_copy_file_range(src_file, off, dst_file,
> +					destoff, len, flags);
>  	return rc;
>  }
>  
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index b52f9baaa3e7..b86fb0298739 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3024,7 +3024,7 @@ static long fuse_file_fallocate(struct file *file, int
> mode, loff_t offset,
>  	return err;
>  }
>  
> -static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> +static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
>  				    struct file *file_out, loff_t pos_out,
>  				    size_t len, unsigned int flags)
>  {
> @@ -3100,6 +3100,21 @@ static ssize_t fuse_copy_file_range(struct file
> *file_in, loff_t pos_in,
>  	return err;
>  }
>  
> +static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
> +				    struct file *dst_file, loff_t dst_off,
> +				    size_t len, unsigned int flags)
> +{
> +	ssize_t ret;
> +
> +	ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
> +					len, flags);
> +
> +	if (ret == -EOPNOTSUPP)
> +		ret = generic_copy_file_range(src_file, src_off, dst_file,
> +					dst_off, len, flags);
> +	return ret;
> +}
> +
>  static const struct file_operations fuse_file_operations = {
>  	.llseek		= fuse_file_llseek,
>  	.read_iter	= fuse_file_read_iter,
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index 46d691ba04bc..d7766a6eb0f4 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -141,6 +141,10 @@ static ssize_t nfs4_copy_file_range(struct file *file_in,
> loff_t pos_in,
>  	ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
>  	if (ret == -EAGAIN)
>  		goto retry;
> +
> +	if (ret == -EOPNOTSUPP)
> +		ret = generic_copy_file_range(file_in, pos_in, file_out,
> +					pos_out, count, flags);
>  	return ret;
>  }
>  
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 84dd957efa24..68736e5d6a56 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -486,8 +486,15 @@ static ssize_t ovl_copy_file_range(struct file *file_in,
> loff_t pos_in,
>  				   struct file *file_out, loff_t pos_out,
>  				   size_t len, unsigned int flags)
>  {
> -	return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> +	ssize_t ret;
> +
> +	ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
>  			    OVL_COPY);
> +
> +	if (ret == -EOPNOTSUPP)
> +		ret = generic_copy_file_range(file_in, pos_in, file_out,
> +					pos_out, len, flags);
> +	return ret;
>  }
>  
>  static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 50114694c98b..44339b44accc 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1570,6 +1570,18 @@ ssize_t generic_copy_file_range(struct file *file_in,
> loff_t pos_in,
>  }
>  EXPORT_SYMBOL(generic_copy_file_range);
>  
> +static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> +			    struct file *file_out, loff_t pos_out,
> +			    size_t len, unsigned int flags)
> +{
> +	if (file_out->f_op->copy_file_range)
> +		return file_out->f_op->copy_file_range(file_in, pos_in,
> file_out,
> +						      pos_out, len, flags);
> +
> +	return generic_copy_file_range(file_in, &pos_in, file_out, &pos_out,
> +					len, flags);
> +}
> +
>  /*
>   * copy_file_range() differs from regular file read and write in that it
>   * specifically allows return partial success.  When it does so is up to
> @@ -1634,15 +1646,9 @@ ssize_t vfs_copy_file_range(struct file *file_in,
> loff_t pos_in,
>  		}
>  	}
>  
> -	if (file_out->f_op->copy_file_range) {
> -		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
> -						      pos_out, len, flags);
> -		if (ret != -EOPNOTSUPP)
> -			goto done;
> -	}
> -
> -	ret = generic_copy_file_range(file_in, &pos_in, file_out, &pos_out,
> -					len, flags);
> +	ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> +				flags);
> +	WARN_ON_ONCE(ret == -EOPNOTSUPP);
>  done:
>  	if (ret > 0) {
>  		fsnotify_access(file_in);
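The per-filesystem pattern in the quoted patch (try the offload, fall back
to the generic copy only on -EOPNOTSUPP) can be modelled in plain C as
below. All model_* names are hypothetical stand-ins for illustration; the
real generic_copy_file_range() takes file and offset arguments that are
left out here.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical offload hook: returns bytes copied or a negative errno. */
typedef long (*model_copy_fn)(size_t len);

/* Stand-in for the generic fallback; it always copies everything. */
static long model_generic_copy(size_t len)
{
	return (long)len;
}

/* Try the offload first; fall back only when it is unsupported. */
static long model_fs_copy_with_fallback(model_copy_fn offload, size_t len)
{
	long ret = offload(len);

	if (ret == -EOPNOTSUPP)
		ret = model_generic_copy(len);
	return ret;
}

/* Sample offloads covering the three interesting outcomes. */
static long model_offload_unsupported(size_t len)
{
	(void)len;
	return -EOPNOTSUPP;
}

static long model_offload_io_error(size_t len)
{
	(void)len;
	return -EIO;
}

static long model_offload_partial(size_t len)
{
	return (long)(len / 2);
}
```

Real errors and partial successes pass straight through in this sketch;
only "not supported" triggers the fallback, which is what lets the VFS
layer warn if -EOPNOTSUPP ever reaches it.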



* Re: [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range
  2018-12-03  8:34 ` [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range Dave Chinner
  2018-12-03 11:04   ` Amir Goldstein
@ 2018-12-03 18:24   ` Darrick J. Wong
  2018-12-04  8:18   ` Olga Kornievskaia
  2 siblings, 0 replies; 83+ messages in thread
From: Darrick J. Wong @ 2018-12-03 18:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 07:34:13PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> before we can enable cross-device copies into copy_file_range(),
> we have to ensure that ->remap_file_range() implemenations will
> correctly reject attempts to do cross filesystem clones. Currently
> these checks are done above calls to ->remap_file_range(), but
> we need to drive them inwards so that we get EXDEV protection for all
> callers of ->remap_file_range().
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/read_write.c | 21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 3288db1d5f21..174cf92eea1d 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1909,6 +1909,19 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  	bool same_inode = (inode_in == inode_out);
>  	int ret;
>  
> +	/*
> +	 * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> +	 * the same mount. Practically, they only need to be on the same file
> +	 * system. We check this here rather than at the ioctl layers because
> +	 * this is effectively a limitation of the fielsystem implementations,

"filesystem"...

--D

> +	 * not so much the API itself. Further, ->remap_file_range() can be
> +	 * called from syscalls that don't have cross device copy restrictions
> +	 * (such as copy_file_range()) and so we need to catch them before we
> +	 * do any damage.
> +	 */
> +	if (inode_in->i_sb != inode_out->i_sb)
> +		return -EXDEV;
> +
>  	/* Don't touch certain kinds of inodes */
>  	if (IS_IMMUTABLE(inode_out))
>  		return -EPERM;
> @@ -2013,14 +2026,6 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
>  	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
>  		return -EINVAL;
>  
> -	/*
> -	 * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> -	 * the same mount. Practically, they only need to be on the same file
> -	 * system.
> -	 */
> -	if (inode_in->i_sb != inode_out->i_sb)
> -		return -EXDEV;
> -
>  	if (!(file_in->f_mode & FMODE_READ) ||
>  	    !(file_out->f_mode & FMODE_WRITE) ||
>  	    (file_out->f_flags & O_APPEND))
> -- 
> 2.19.1
> 


* Re: [PATCH 05/11] vfs: use inode_permission in copy_file_range()
  2018-12-03  8:34 ` [PATCH 05/11] vfs: use inode_permission in copy_file_range() Dave Chinner
  2018-12-03 12:47   ` Amir Goldstein
  2018-12-03 18:18   ` Darrick J. Wong
@ 2018-12-03 18:53   ` Eric Biggers
  2018-12-04 15:19   ` Christoph Hellwig
  3 siblings, 0 replies; 83+ messages in thread
From: Eric Biggers @ 2018-12-03 18:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 07:34:10PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Similar to FI_DEDUPERANGE, make copy_file_range() check that we have
> write permissions to the destination inode.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mm/filemap.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0a170425935b..876df5275514 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3013,6 +3013,11 @@ int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>  	    (file_out->f_flags & O_APPEND))
>  		return -EBADF;
>  
> +	/* may sure we really are allowed to write to the destination inode */
> +	ret = inode_permission(inode_out, MAY_WRITE);
> +	if (ret < 0)
> +		return ret;
> +
>  	/* Ensure offsets don't wrap. */
>  	if (pos_in + count < pos_in || pos_out + count < pos_out)
>  		return -EOVERFLOW;
> -- 
> 2.19.1
> 

Why?  The file descriptor was already checked for write permission above:

       if (!(file_in->f_mode & FMODE_READ) ||
            !(file_out->f_mode & FMODE_WRITE) ||
            (file_out->f_flags & O_APPEND))
                return -EBADF;

Yes, that doesn't detect removing write permission from the *inode*, but write()
doesn't either.

- Eric
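The write() behaviour referred to above can be seen from userspace with a
small sketch. write_after_chmod() is a hypothetical helper written for this
illustration, and it assumes the path is on an ordinary local Linux
filesystem where the caller may create files.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open path for writing, drop every write bit from the inode, then
 * write through the already-open descriptor.
 * Returns the result of write(), or -1 if setup fails.
 */
static ssize_t write_after_chmod(const char *path)
{
	ssize_t ret;
	int fd;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	if (fchmod(fd, 0444) < 0) {
		close(fd);
		return -1;
	}

	/* Permission was checked at open(); this write is not re-checked. */
	ret = write(fd, "data", 4);

	close(fd);
	unlink(path);
	return ret;
}
```

On such a filesystem the write is expected to succeed even though the
inode no longer grants write permission, which is the asymmetry an
inode_permission() call in copy_file_range() would introduce.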


* Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
  2018-12-03  8:34 ` [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down Dave Chinner
  2018-12-03 12:36   ` Amir Goldstein
  2018-12-03 17:58   ` Olga Kornievskaia
@ 2018-12-03 18:53   ` Anna Schumaker
  2018-12-03 19:27     ` Olga Kornievskaia
  2018-12-03 23:40     ` Dave Chinner
  2018-12-04 15:43   ` Christoph Hellwig
  3 siblings, 2 replies; 83+ messages in thread
From: Anna Schumaker @ 2018-12-03 18:53 UTC (permalink / raw)
  To: Dave Chinner, linux-fsdevel, linux-xfs
  Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs

On Mon, 2018-12-03 at 19:34 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We want to enable cross-filesystem copy_file_range functionality
> where possible, so push the "same superblock only" checks down to
> the individual filesystem callouts so they can make their own
> decisions about cross-superblock copy offload.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/ceph/file.c      |  4 +++-
>  fs/cifs/cifsfs.c    |  8 +++++++-
>  fs/fuse/file.c      |  5 ++++-
>  fs/nfs/nfs4file.c   | 16 ++++++++++------
>  fs/overlayfs/file.c | 10 +++++++++-
>  fs/read_write.c     | 10 ++++------
>  6 files changed, 37 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index cf29f0410dcb..eb876e19c1dc 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1905,6 +1905,8 @@ static ssize_t __ceph_copy_file_range(struct file
> *src_file, loff_t src_off,
>  
>  	if (src_inode == dst_inode)
>  		return -EINVAL;
> +	if (src_inode->i_sb != dst_inode->i_sb)
> +		return -EXDEV;
>  	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>  		return -EROFS;
>  
> @@ -2105,7 +2107,7 @@ static ssize_t ceph_copy_file_range(struct file
> *src_file, loff_t src_off,
>  	ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
>  					len, flags);
>  
> -	if (ret == -EOPNOTSUPP)
> +	if (ret == -EOPNOTSUPP || ret == -EXDEV)
>  		ret = generic_copy_file_range(src_file, src_off, dst_file,
>  					dst_off, len, flags);
>  	return ret;
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index 5ef4baec6234..03e4b9eacbd1 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1072,6 +1072,12 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
>  		goto out;
>  	}
>  
> +	if (src_inode->i_sb != target_inode->i_sb) {
> +		rc = -EXDEV;
> +		goto out;
> +	}
> +
> +
>  	if (!src_file->private_data || !dst_file->private_data) {
>  		rc = -EBADF;
>  		cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
> @@ -1142,7 +1148,7 @@ static ssize_t cifs_copy_file_range(struct file
> *src_file, loff_t off,
>  					len, flags);
>  	free_xid(xid);
>  
> -	if (rc == -EOPNOTSUPP)
> +	if (rc == -EOPNOTSUPP || rc == -EXDEV)
>  		rc = generic_copy_file_range(src_file, off, dst_file,
>  					destoff, len, flags);
>  	return rc;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index b86fb0298739..0758f831a4eb 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3053,6 +3053,9 @@ static ssize_t __fuse_copy_file_range(struct file
> *file_in, loff_t pos_in,
>  	if (fc->no_copy_file_range)
>  		return -EOPNOTSUPP;
>  
> +	if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> +		return -EXDEV;
> +
>  	inode_lock(inode_out);
>  
>  	if (fc->writeback_cache) {
> @@ -3109,7 +3112,7 @@ static ssize_t fuse_copy_file_range(struct file
> *src_file, loff_t src_off,
>  	ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
>  					len, flags);
>  
> -	if (ret == -EOPNOTSUPP)
> +	if (ret == -EOPNOTSUPP || ret == -EXDEV)
>  		ret = generic_copy_file_range(src_file, src_off, dst_file,
>  					dst_off, len, flags);
>  	return ret;
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index d7766a6eb0f4..4783c0c1c49e 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -133,16 +133,20 @@ static ssize_t nfs4_copy_file_range(struct file
> *file_in, loff_t pos_in,
>  				    struct file *file_out, loff_t pos_out,
>  				    size_t count, unsigned int flags)
>  {
> -	ssize_t ret;
> +	ssize_t ret = -EXDEV;
>  
>  	if (file_inode(file_in) == file_inode(file_out))
>  		return -EINVAL;
> -retry:
> -	ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
> -	if (ret == -EAGAIN)
> -		goto retry;
>  
> -	if (ret == -EOPNOTSUPP)
> +	/* only offload copy if superblock is the same */
> +	if (file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
> +		do {
> +			ret = nfs42_proc_copy(file_in, pos_in, file_out,
> +					pos_out, count);
> +		} while (ret == -EAGAIN);

I'm not convinced we can actually return -EAGAIN from nfs42_proc_copy().  The
nfs_get_lock_context() function doesn't return it, and if _nfs42_proc_copy()
returns -EAGAIN it's immediately retried by nfs42_proc_copy() instead of
returning.

Olga, am I missing something here?
Anna
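The concern above can be modelled in a few lines of C. This sketch simply
assumes the behaviour described in the preceding paragraph, namely that the
inner helper's -EAGAIN is retried inside nfs42_proc_copy(); it does not
verify the NFS code. All model_* functions are hypothetical stand-ins.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for _nfs42_proc_copy(): fails with -EAGAIN
 * until *attempts_left reaches zero, then reports 4096 bytes copied.
 */
static long model_inner_copy(int *attempts_left)
{
	if (*attempts_left > 0) {
		(*attempts_left)--;
		return -EAGAIN;
	}
	return 4096;
}

/* Stand-in for nfs42_proc_copy(): retries -EAGAIN internally. */
static long model_proc_copy(int *attempts_left)
{
	long ret;

	do {
		ret = model_inner_copy(attempts_left);
	} while (ret == -EAGAIN);
	return ret;
}

/* Stand-in for the caller's loop; counts how often it iterates. */
static long model_copy_file_range(int *attempts_left, int *outer_loops)
{
	long ret;

	*outer_loops = 0;
	do {
		(*outer_loops)++;
		ret = model_proc_copy(attempts_left);
	} while (ret == -EAGAIN);
	return ret;
}
```

Under that assumption the outer loop body runs exactly once however many
times the inner call has to retry, so the outer retry would be dead code.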

> +	}
> +
> +	if (ret == -EOPNOTSUPP || ret == -EXDEV)
>  		ret = generic_copy_file_range(file_in, pos_in, file_out,
>  					pos_out, count, flags);
>  	return ret;
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 68736e5d6a56..34fb0398d016 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -443,6 +443,14 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t
> pos_in,
>  	const struct cred *old_cred;
>  	loff_t ret;
>  
> +	/*
> +	 * Temporary. Cross device copy checks should be left to the copy file
> +	 * call on the real inodes, but existing behaviour checks the upper
> +	 * files only.
> +	 */
> +	if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> +		return -EXDEV;
> +
>  	ret = ovl_real_fdget(file_out, &real_out);
>  	if (ret)
>  		return ret;
> @@ -491,7 +499,7 @@ static ssize_t ovl_copy_file_range(struct file *file_in,
> loff_t pos_in,
>  	ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
>  			    OVL_COPY);
>  
> -	if (ret == -EOPNOTSUPP)
> +	if (ret == -EOPNOTSUPP || ret == -EXDEV)
>  		ret = generic_copy_file_range(file_in, pos_in, file_out,
>  					pos_out, len, flags);
>  	return ret;
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 174cf92eea1d..4e0666de0d69 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1565,6 +1565,10 @@ ssize_t generic_copy_file_range(struct file *file_in,
> loff_t pos_in,
>  			    struct file *file_out, loff_t pos_out,
>  			    size_t len, unsigned int flags)
>  {
> +	/* Temporary, do_splice_direct supports cross-sb copies */
> +	if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> +		return -EXDEV;
> +
>  	return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
>  			len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
>  }
> @@ -1611,17 +1615,11 @@ ssize_t vfs_copy_file_range(struct file *file_in,
> loff_t pos_in,
>  			    struct file *file_out, loff_t pos_out,
>  			    size_t len, unsigned int flags)
>  {
> -	struct inode *inode_in = file_inode(file_in);
> -	struct inode *inode_out = file_inode(file_out);
>  	ssize_t ret;
>  
>  	if (flags != 0)
>  		return -EINVAL;
>  
> -	/* this could be relaxed once a method supports cross-fs copies */
> -	if (inode_in->i_sb != inode_out->i_sb)
> -		return -EXDEV;
> -
>  	ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
>  					flags);
>  	if (ret < 0)



* Re: [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-03  8:34 ` [PATCH 04/11] vfs: add missing checks to copy_file_range Dave Chinner
  2018-12-03 12:42   ` Amir Goldstein
@ 2018-12-03 19:04   ` Darrick J. Wong
  2018-12-03 21:33   ` Olga Kornievskaia
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 83+ messages in thread
From: Darrick J. Wong @ 2018-12-03 19:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 07:34:09PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Like the clone and dedupe interfaces we've recently fixed, the
> copy_file_range() implementation is missing basic sanity, limits and
> boundary condition tests on the parameters that are passed to it
> from userspace. Create a new "generic_copy_file_checks()" function
> modelled on the generic_remap_checks() function to provide this
> missing functionality.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/read_write.c    | 27 ++++++------------
>  include/linux/fs.h |  3 ++
>  mm/filemap.c       | 69 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 81 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 44339b44accc..69809345977e 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1578,7 +1578,7 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
>  		return file_out->f_op->copy_file_range(file_in, pos_in, file_out,
>  						      pos_out, len, flags);
>  
> -	return generic_copy_file_range(file_in, &pos_in, file_out, &pos_out,
> +	return generic_copy_file_range(file_in, pos_in, file_out, pos_out,
>  					len, flags);
>  }
>  
> @@ -1598,10 +1598,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>  	if (flags != 0)
>  		return -EINVAL;
>  
> -	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> -		return -EISDIR;
> -	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> -		return -EINVAL;
> +	/* this could be relaxed once a method supports cross-fs copies */
> +	if (inode_in->i_sb != inode_out->i_sb)
> +		return -EXDEV;
> +
> +	ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> +					flags);
> +	if (ret < 0)
> +		return ret;
>  
>  	ret = rw_verify_area(READ, file_in, &pos_in, len);
>  	if (unlikely(ret))
> @@ -1611,22 +1615,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>  	if (unlikely(ret))
>  		return ret;
>  
> -	if (!(file_in->f_mode & FMODE_READ) ||
> -	    !(file_out->f_mode & FMODE_WRITE) ||
> -	    (file_out->f_flags & O_APPEND))
> -		return -EBADF;
> -
> -	/* this could be relaxed once a method supports cross-fs copies */
> -	if (inode_in->i_sb != inode_out->i_sb)
> -		return -EXDEV;
> -
>  	if (len == 0)
>  		return 0;
>  
> -	/* If the source range crosses EOF, fail the copy */
> -	if (pos_in >= i_size(inode_in) || pos_in + len > i_size(inode_in))
> -		return -EINVAL;
> -
>  	file_start_write(file_out);
>  
>  	/*
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a4478764cf63..0d9d2d93d4df 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3022,6 +3022,9 @@ extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
>  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
>  				struct file *file_out, loff_t pos_out,
>  				loff_t *count, unsigned int remap_flags);
> +extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> +				struct file *file_out, loff_t pos_out,
> +				size_t *count, unsigned int flags);
>  extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
>  extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
>  extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 81adec8ee02c..0a170425935b 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2975,6 +2975,75 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
>  	return 0;
>  }
>  
> +
> +/*
> + * Performs necessary checks before doing a file copy
> + *
> + * Can adjust amount of bytes to copy
> + * Returns appropriate error code that caller should return or
> + * zero in case the copy should be allowed.
> + */
> +int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> +			 struct file *file_out, loff_t pos_out,
> +			 size_t *req_count, unsigned int flags)
> +{
> +	struct inode *inode_in = file_inode(file_in);
> +	struct inode *inode_out = file_inode(file_out);
> +	uint64_t count = *req_count;
> +	uint64_t bcount;
> +	loff_t size_in, size_out;
> +	loff_t bs = inode_out->i_sb->s_blocksize;
> +	int ret;
> +
> +	/* Don't touch certain kinds of inodes */
> +	if (IS_IMMUTABLE(inode_out))
> +		return -EPERM;
> +
> +	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
> +		return -ETXTBSY;
> +
> +	/* Don't copy dirs, pipes, sockets... */
> +	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> +		return -EISDIR;
> +	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> +		return -EINVAL;
> +
> +	if (!(file_in->f_mode & FMODE_READ) ||
> +	    !(file_out->f_mode & FMODE_WRITE) ||
> +	    (file_out->f_flags & O_APPEND))
> +		return -EBADF;

These five checks are the same as the ones that are split between
do_clone_file_range and generic_remap_file_range_prep.  Perhaps they
should be factored into a single function that can be called from the
do_clone_file_range function as well as do_copy_file_range?

(I suspect vfs_dedupe_file_range_one() should call it too, but the
dedupe code is so grotty and weird...)

--D
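A rough userspace model of the factoring suggested above: one helper that
holds the five checks, which both the clone and the copy paths could call.
struct model_file and model_common_checks() are hypothetical names for
illustration; the real checks operate on struct file and struct inode.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flattened view of the file state the checks look at. */
struct model_file {
	bool immutable;
	bool swapfile;
	bool is_dir;
	bool is_reg;
	bool can_read;		/* models FMODE_READ */
	bool can_write;		/* models FMODE_WRITE */
	bool append;		/* models O_APPEND */
};

/* The five checks, in the same order as the quoted patch. */
static int model_common_checks(const struct model_file *in,
			       const struct model_file *out)
{
	if (out->immutable)
		return -EPERM;
	if (in->swapfile || out->swapfile)
		return -ETXTBSY;
	if (in->is_dir || out->is_dir)
		return -EISDIR;
	if (!in->is_reg || !out->is_reg)
		return -EINVAL;
	if (!in->can_read || !out->can_write || out->append)
		return -EBADF;
	return 0;
}
```

With a single helper, clone, copy and possibly dedupe could not drift
apart in which inodes they refuse or in the order the errors are reported.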

> +
> +	/* Ensure offsets don't wrap. */
> +	if (pos_in + count < pos_in || pos_out + count < pos_out)
> +		return -EOVERFLOW;
> +
> +	size_in = i_size_read(inode_in);
> +	size_out = i_size_read(inode_out);
> +
> +	/* If the source range crosses EOF, fail the copy */
> +	if (pos_in >= size_in)
> +		return -EINVAL;
> +	if (pos_in + count > size_in)
> +		return -EINVAL;
> +
> +	ret = generic_access_check_limits(file_in, pos_in, &count);
> +	if (ret)
> +		return ret;
> +
> +	ret = generic_write_check_limits(file_out, pos_out, &count);
> +	if (ret)
> +		return ret;
> +
> +	/* Don't allow overlapped copying within the same file. */
> +	if (inode_in == inode_out &&
> +	    pos_out + count > pos_in &&
> +	    pos_out < pos_in + count)
> +		return -EINVAL;
> +
> +	*req_count = count;
> +	return 0;
> +}
> +
>  int pagecache_write_begin(struct file *file, struct address_space *mapping,
>  				loff_t pos, unsigned len, unsigned flags,
>  				struct page **pagep, void **fsdata)
> -- 
> 2.19.1
> 
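The wrap and overlap tests in the quoted generic_copy_file_checks() can be
tried out in isolation with the sketch below. range_wraps() and
ranges_overlap() are hypothetical helper names; the sketch assumes
non-negative offsets and rephrases the wrap test so that it does not rely
on signed overflow, which is undefined in standard C.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if pos + count would exceed INT64_MAX, i.e. the offset wraps. */
static bool range_wraps(int64_t pos, uint64_t count)
{
	return count > (uint64_t)INT64_MAX - (uint64_t)pos;
}

/*
 * True if [pos_in, pos_in + count) and [pos_out, pos_out + count)
 * intersect. Callers are assumed to have rejected wrapping ranges.
 */
static bool ranges_overlap(int64_t pos_in, int64_t pos_out, uint64_t count)
{
	uint64_t in = (uint64_t)pos_in;
	uint64_t out = (uint64_t)pos_out;

	return out + count > in && out < in + count;
}
```

Ranges that merely touch, such as a copy from offset 0 to offset 100 with
a count of 100, are not treated as overlapping, and a zero-length copy
never overlaps anything.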


* Re: [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range
  2018-12-03 11:04   ` Amir Goldstein
@ 2018-12-03 19:11     ` Darrick J. Wong
  2018-12-03 23:37       ` Dave Chinner
  2018-12-03 23:34     ` Dave Chinner
  1 sibling, 1 reply; 83+ messages in thread
From: Darrick J. Wong @ 2018-12-03 19:11 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 01:04:07PM +0200, Amir Goldstein wrote:
> On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > before we can enable cross-device copies into copy_file_range(),
> > we have to ensure that ->remap_file_range() implemenations will
> > correctly reject attempts to do cross filesystem clones. Currently
> 
> But you only fixed remap_file_range() implemenations of xfs and ocfs2...
> 
> > these checks are done above calls to ->remap_file_range(), but
> > we need to drive them inwards so that we get EXDEV protection for all
> > callers of ->remap_file_range().
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/read_write.c | 21 +++++++++++++--------
> >  1 file changed, 13 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 3288db1d5f21..174cf92eea1d 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1909,6 +1909,19 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> >         bool same_inode = (inode_in == inode_out);
> >         int ret;
> >
> > +       /*
> > +        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> > +        * the same mount. Practically, they only need to be on the same file
> > +        * system. We check this here rather than at the ioctl layers because
> > +        * this is effectively a limitation of the fielsystem implementations,
> > +        * not so much the API itself. Further, ->remap_file_range() can be
> > +        * called from syscalls that don't have cross device copy restrictions
> > +        * (such as copy_file_range()) and so we need to catch them before we
> > +        * do any damage.
> > +        */
> > +       if (inode_in->i_sb != inode_out->i_sb)
> > +               return -EXDEV;
> > +
> >         /* Don't touch certain kinds of inodes */
> >         if (IS_IMMUTABLE(inode_out))
> >                 return -EPERM;
> > @@ -2013,14 +2026,6 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
> >         if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> >                 return -EINVAL;
> >
> > -       /*
> > -        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> > -        * the same mount. Practically, they only need to be on the same file
> > -        * system.
> > -        */
> > -       if (inode_in->i_sb != inode_out->i_sb)
> > -               return -EXDEV;
> > -
> 

I think this is sort of backwards -- the checks should stay in
do_clone_file_range, and vfs_copy_file_range should be calling that
instead of directly calling ->remap_file_range():

vfs_copy_file_range()
{
	file_start_write(...);
	ret = do_clone_file_range(...);
	if (ret <= 0)
		ret = do_copy_file_range(...);
	file_end_write(...);
	return ret;
}

> That leaves {nfs42,cifs,btrfs}_remap_file_range() exposed to passing
> files not of their own fs type let alone same sb when do_clone_file_range()
> is called from ovl_copy_up_data().

...and then I think this problem goes away.

--D
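The clone-first, copy-second control flow proposed above can be modelled
in userspace as follows. Everything here is a hypothetical stand-in
(model_op, struct model_trace, model_vfs_copy()); the trace counters only
exist to show that the start/end write bracket stays balanced on every
path.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical operation: returns bytes handled or a negative errno. */
typedef long (*model_op)(size_t len);

/* Counts calls that model file_start_write() and file_end_write(). */
struct model_trace {
	int started;
	int ended;
};

/* Try the clone; if it did not handle any bytes, fall back to a copy. */
static long model_vfs_copy(model_op clone, model_op copy, size_t len,
			   struct model_trace *t)
{
	long ret;

	t->started++;
	ret = clone(len);
	if (ret <= 0)
		ret = copy(len);
	t->ended++;
	return ret;
}

/* Sample operations for exercising both paths. */
static long model_clone_ok(size_t len)
{
	return (long)len;
}

static long model_clone_unsupported(size_t len)
{
	(void)len;
	return -EOPNOTSUPP;
}

static long model_copy_ok(size_t len)
{
	return (long)(len / 2);
}
```

As in the pseudocode, any clone result that is not a positive byte count
falls through to the copy, including errors other than -EOPNOTSUPP;
whether that is the right policy is a separate question.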

> Thanks,
> Amir.


* Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
  2018-12-03 18:53   ` Anna Schumaker
@ 2018-12-03 19:27     ` Olga Kornievskaia
  2018-12-03 23:40     ` Dave Chinner
  1 sibling, 0 replies; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-03 19:27 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: david, linux-fsdevel, linux-xfs, linux-nfs, linux-unionfs,
	ceph-devel, linux-cifs

On Mon, Dec 3, 2018 at 1:53 PM Anna Schumaker <schumaker.anna@gmail.com> wrote:
>
> On Mon, 2018-12-03 at 19:34 +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > We want to enable cross-filesystem copy_file_range functionality
> > where possible, so push the "same superblock only" checks down to
> > the individual filesystem callouts so they can make their own
> > decisions about cross-superblock copy offload.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/ceph/file.c      |  4 +++-
> >  fs/cifs/cifsfs.c    |  8 +++++++-
> >  fs/fuse/file.c      |  5 ++++-
> >  fs/nfs/nfs4file.c   | 16 ++++++++++------
> >  fs/overlayfs/file.c | 10 +++++++++-
> >  fs/read_write.c     | 10 ++++------
> >  6 files changed, 37 insertions(+), 16 deletions(-)
> >
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index cf29f0410dcb..eb876e19c1dc 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -1905,6 +1905,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >
> >       if (src_inode == dst_inode)
> >               return -EINVAL;
> > +     if (src_inode->i_sb != dst_inode->i_sb)
> > +             return -EXDEV;
> >       if (ceph_snap(dst_inode) != CEPH_NOSNAP)
> >               return -EROFS;
> >
> > @@ -2105,7 +2107,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >       ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
> >                                       len, flags);
> >
> > -     if (ret == -EOPNOTSUPP)
> > +     if (ret == -EOPNOTSUPP || ret == -EXDEV)
> >               ret = generic_copy_file_range(src_file, src_off, dst_file,
> >                                       dst_off, len, flags);
> >       return ret;
> > diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> > index 5ef4baec6234..03e4b9eacbd1 100644
> > --- a/fs/cifs/cifsfs.c
> > +++ b/fs/cifs/cifsfs.c
> > @@ -1072,6 +1072,12 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
> >               goto out;
> >       }
> >
> > +     if (src_inode->i_sb != target_inode->i_sb) {
> > +             rc = -EXDEV;
> > +             goto out;
> > +     }
> > +
> > +
> >       if (!src_file->private_data || !dst_file->private_data) {
> >               rc = -EBADF;
> >               cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
> > @@ -1142,7 +1148,7 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
> >                                       len, flags);
> >       free_xid(xid);
> >
> > -     if (rc == -EOPNOTSUPP)
> > +     if (rc == -EOPNOTSUPP || rc == -EXDEV)
> >               rc = generic_copy_file_range(src_file, off, dst_file,
> >                                       destoff, len, flags);
> >       return rc;
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index b86fb0298739..0758f831a4eb 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -3053,6 +3053,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> >       if (fc->no_copy_file_range)
> >               return -EOPNOTSUPP;
> >
> > +     if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > +             return -EXDEV;
> > +
> >       inode_lock(inode_out);
> >
> >       if (fc->writeback_cache) {
> > @@ -3109,7 +3112,7 @@ static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
> >       ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
> >                                       len, flags);
> >
> > -     if (ret == -EOPNOTSUPP)
> > +     if (ret == -EOPNOTSUPP || ret == -EXDEV)
> >               ret = generic_copy_file_range(src_file, src_off, dst_file,
> >                                       dst_off, len, flags);
> >       return ret;
> > diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> > index d7766a6eb0f4..4783c0c1c49e 100644
> > --- a/fs/nfs/nfs4file.c
> > +++ b/fs/nfs/nfs4file.c
> > @@ -133,16 +133,20 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> >                                   struct file *file_out, loff_t pos_out,
> >                                   size_t count, unsigned int flags)
> >  {
> > -     ssize_t ret;
> > +     ssize_t ret = -EXDEV;
> >
> >       if (file_inode(file_in) == file_inode(file_out))
> >               return -EINVAL;
> > -retry:
> > -     ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
> > -     if (ret == -EAGAIN)
> > -             goto retry;
> >
> > -     if (ret == -EOPNOTSUPP)
> > +     /* only offload copy if superblock is the same */
> > +     if (file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
> > +             do {
> > +                     ret = nfs42_proc_copy(file_in, pos_in, file_out,
> > +                                     pos_out, count);
> > +             } while (ret == -EAGAIN);
>
> I'm not convinced we can actually return -EAGAIN from nfs42_proc_copy().  The
> nfs_get_lock_context() function doesn't return it, and if _nfs42_proc_copy()
> returns -EAGAIN it's immediately retried by nfs42_proc_copy() instead of
> returning.
>
> Olga, am I missing something here?

I'll update it in the client patches that are coming out.

> Anna
>
> > +     }
> > +
> > +     if (ret == -EOPNOTSUPP || ret == -EXDEV)
> >               ret = generic_copy_file_range(file_in, pos_in, file_out,
> >                                       pos_out, count, flags);
> >       return ret;
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index 68736e5d6a56..34fb0398d016 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -443,6 +443,14 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
> >       const struct cred *old_cred;
> >       loff_t ret;
> >
> > +     /*
> > +      * Temporary. Cross device copy checks should be left to the copy file
> > +      * call on the real inodes, but existing behaviour checks the upper
> > +      * files only.
> > +      */
> > +     if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > +             return -EXDEV;
> > +
> >       ret = ovl_real_fdget(file_out, &real_out);
> >       if (ret)
> >               return ret;
> > @@ -491,7 +499,7 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
> >       ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> >                           OVL_COPY);
> >
> > -     if (ret == -EOPNOTSUPP)
> > +     if (ret == -EOPNOTSUPP || ret == -EXDEV)
> >               ret = generic_copy_file_range(file_in, pos_in, file_out,
> >                                       pos_out, len, flags);
> >       return ret;
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 174cf92eea1d..4e0666de0d69 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1565,6 +1565,10 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> >                           struct file *file_out, loff_t pos_out,
> >                           size_t len, unsigned int flags)
> >  {
> > +     /* Temporary, do_splice_direct supports cross-sb copies */
> > +     if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > +             return -EXDEV;
> > +
> >       return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> >                       len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
> >  }
> > @@ -1611,17 +1615,11 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> >                           struct file *file_out, loff_t pos_out,
> >                           size_t len, unsigned int flags)
> >  {
> > -     struct inode *inode_in = file_inode(file_in);
> > -     struct inode *inode_out = file_inode(file_out);
> >       ssize_t ret;
> >
> >       if (flags != 0)
> >               return -EINVAL;
> >
> > -     /* this could be relaxed once a method supports cross-fs copies */
> > -     if (inode_in->i_sb != inode_out->i_sb)
> > -             return -EXDEV;
> > -
> >       ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> >                                       flags);
> >       if (ret < 0)
>
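
The -EAGAIN question raised above can be modelled in userspace. This is
only a sketch: inner_copy() and outer_copy() are hypothetical stand-ins
for _nfs42_proc_copy() and nfs42_proc_copy(), not the NFS client code.
It shows that once the wrapper retries internally, its caller can never
see -EAGAIN, so a second retry loop around the wrapper would be dead code.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for _nfs42_proc_copy(): reports -EAGAIN until
 * the countdown hits zero, then reports the number of bytes copied.
 */
static int inner_copy(int *eagain_left, int len)
{
	assert(len >= 0);
	if (*eagain_left > 0) {
		(*eagain_left)--;
		return -EAGAIN;
	}
	return len;
}

/*
 * Hypothetical stand-in for nfs42_proc_copy(): retries internally, so
 * -EAGAIN never escapes to whoever calls this function.
 */
static int outer_copy(int *eagain_left, int len)
{
	int ret;

	do {
		ret = inner_copy(eagain_left, len);
	} while (ret == -EAGAIN);
	return ret;
}
```

Calling outer_copy() with a countdown of three consumes all three
-EAGAIN results before it returns the byte count.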

* Re: [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-03  8:34 ` [PATCH 04/11] vfs: add missing checks to copy_file_range Dave Chinner
  2018-12-03 12:42   ` Amir Goldstein
  2018-12-03 19:04   ` Darrick J. Wong
@ 2018-12-03 21:33   ` Olga Kornievskaia
  2018-12-03 23:04     ` Dave Chinner
  2018-12-04 15:18   ` Christoph Hellwig
  2018-12-12 11:31   ` Luis Henriques
  4 siblings, 1 reply; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-03 21:33 UTC (permalink / raw)
  To: david
  Cc: linux-fsdevel, linux-xfs, linux-nfs, linux-unionfs, ceph-devel,
	linux-cifs

On Mon, Dec 3, 2018 at 3:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Like the clone and dedupe interfaces we've recently fixed, the
> copy_file_range() implementation is missing basic sanity, limits and
> boundary condition tests on the parameters that are passed to it
> from userspace. Create a new "generic_copy_file_checks()" function
> modelled on the generic_remap_checks() function to provide this
> missing functionality.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/read_write.c    | 27 ++++++------------
>  include/linux/fs.h |  3 ++
>  mm/filemap.c       | 69 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 81 insertions(+), 18 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 44339b44accc..69809345977e 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1578,7 +1578,7 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
>                 return file_out->f_op->copy_file_range(file_in, pos_in, file_out,
>                                                       pos_out, len, flags);
>
> -       return generic_copy_file_range(file_in, &pos_in, file_out, &pos_out,
> +       return generic_copy_file_range(file_in, pos_in, file_out, pos_out,
>                                         len, flags);
>  }
>
> @@ -1598,10 +1598,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>         if (flags != 0)
>                 return -EINVAL;
>
> -       if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> -               return -EISDIR;
> -       if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> -               return -EINVAL;
> +       /* this could be relaxed once a method supports cross-fs copies */
> +       if (inode_in->i_sb != inode_out->i_sb)
> +               return -EXDEV;
> +
> +       ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> +                                       flags);
> +       if (ret < 0)
> +               return ret;
>
>         ret = rw_verify_area(READ, file_in, &pos_in, len);
>         if (unlikely(ret))
> @@ -1611,22 +1615,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>         if (unlikely(ret))
>                 return ret;
>
> -       if (!(file_in->f_mode & FMODE_READ) ||
> -           !(file_out->f_mode & FMODE_WRITE) ||
> -           (file_out->f_flags & O_APPEND))
> -               return -EBADF;
> -
> -       /* this could be relaxed once a method supports cross-fs copies */
> -       if (inode_in->i_sb != inode_out->i_sb)
> -               return -EXDEV;
> -
>         if (len == 0)
>                 return 0;
>
> -       /* If the source range crosses EOF, fail the copy */
> -       if (pos_in >= i_size(inode_in) || pos_in + len > i_size(inode_in))
> -               return -EINVAL;
> -
>         file_start_write(file_out);
>
>         /*
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a4478764cf63..0d9d2d93d4df 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3022,6 +3022,9 @@ extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
>  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
>                                 struct file *file_out, loff_t pos_out,
>                                 loff_t *count, unsigned int remap_flags);
> +extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> +                               struct file *file_out, loff_t pos_out,
> +                               size_t *count, unsigned int flags);
>  extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
>  extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
>  extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 81adec8ee02c..0a170425935b 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2975,6 +2975,75 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
>         return 0;
>  }
>
> +
> +/*
> + * Performs necessary checks before doing a file copy
> + *
> + * Can adjust amount of bytes to copy
> + * Returns appropriate error code that caller should return or
> + * zero in case the copy should be allowed.
> + */
> +int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> +                        struct file *file_out, loff_t pos_out,
> +                        size_t *req_count, unsigned int flags)
> +{
> +       struct inode *inode_in = file_inode(file_in);
> +       struct inode *inode_out = file_inode(file_out);
> +       uint64_t count = *req_count;
> +       uint64_t bcount;
> +       loff_t size_in, size_out;
> +       loff_t bs = inode_out->i_sb->s_blocksize;
> +       int ret;

I got compile warnings:

mm/filemap.c: In function ‘generic_copy_file_checks’:
mm/filemap.c:2995:9: warning: unused variable ‘bs’ [-Wunused-variable]
  loff_t bs = inode_out->i_sb->s_blocksize;
         ^
mm/filemap.c:2993:11: warning: unused variable ‘bcount’ [-Wunused-variable]
  uint64_t bcount;

> +
> +       /* Don't touch certain kinds of inodes */
> +       if (IS_IMMUTABLE(inode_out))
> +               return -EPERM;
> +
> +       if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
> +               return -ETXTBSY;
> +
> +       /* Don't copy dirs, pipes, sockets... */
> +       if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> +               return -EISDIR;
> +       if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> +               return -EINVAL;
> +
> +       if (!(file_in->f_mode & FMODE_READ) ||
> +           !(file_out->f_mode & FMODE_WRITE) ||
> +           (file_out->f_flags & O_APPEND))
> +               return -EBADF;
> +
> +       /* Ensure offsets don't wrap. */
> +       if (pos_in + count < pos_in || pos_out + count < pos_out)
> +               return -EOVERFLOW;
> +
> +       size_in = i_size_read(inode_in);
> +       size_out = i_size_read(inode_out);
> +
> +       /* If the source range crosses EOF, fail the copy */
> +       if (pos_in >= size_in)
> +               return -EINVAL;
> +       if (pos_in + count > size_in)
> +               return -EINVAL;
> +
> +       ret = generic_access_check_limits(file_in, pos_in, &count);
> +       if (ret)
> +               return ret;
> +
> +       ret = generic_write_check_limits(file_out, pos_out, &count);
> +       if (ret)
> +               return ret;
> +
> +       /* Don't allow overlapped copying within the same file. */
> +       if (inode_in == inode_out &&
> +           pos_out + count > pos_in &&
> +           pos_out < pos_in + count)
> +               return -EINVAL;
> +
> +       *req_count = count;
> +       return 0;
> +}
> +
>  int pagecache_write_begin(struct file *file, struct address_space *mapping,
>                                 loff_t pos, unsigned len, unsigned flags,
>                                 struct page **pagep, void **fsdata)
> --
> 2.19.1
>
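
The range checks in the quoted generic_copy_file_checks() hunk can be
exercised in userspace. This is a minimal sketch under stated
assumptions: copy_range_check() and its same_file parameter are
hypothetical names, offsets are plain int64_t, the wrap test is written
to avoid signed overflow rather than copied from the patch, and the
immutable, swapfile, file-mode and rlimit checks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Returns 0 if the copy range is acceptable, or a negative errno. */
static int copy_range_check(int64_t pos_in, int64_t pos_out, uint64_t count,
			    int64_t size_in, int same_file)
{
	assert(size_in >= 0);

	if (pos_in < 0 || pos_out < 0)
		return -EINVAL;

	/* Ensure pos + count cannot wrap past the largest offset. */
	if (count > (uint64_t)(INT64_MAX - pos_in) ||
	    count > (uint64_t)(INT64_MAX - pos_out))
		return -EOVERFLOW;

	/* If the source range starts at or crosses EOF, fail the copy. */
	if (pos_in >= size_in)
		return -EINVAL;
	if (pos_in + (int64_t)count > size_in)
		return -EINVAL;

	/* Don't allow overlapped copying within the same file. */
	if (same_file &&
	    pos_out + (int64_t)count > pos_in &&
	    pos_out < pos_in + (int64_t)count)
		return -EINVAL;

	return 0;
}
```

Adjacent ranges in the same file (for example 0..10 copied to 10..20)
pass the overlap test, while 0..10 copied to 5..15 does not.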

* Re: [PATCH 02/11] vfs: introduce generic_copy_file_range()
  2018-12-03 10:03   ` Amir Goldstein
@ 2018-12-03 23:00     ` Dave Chinner
  0 siblings, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03 23:00 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs,
	Miklos Szeredi

On Mon, Dec 03, 2018 at 12:03:41PM +0200, Amir Goldstein wrote:
> On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Right now if vfs_copy_file_range() does not use any offload
> > mechanism, it falls back to calling do_splice_direct(). This fails
> > to do basic sanity checks on the files being copied. Before we
> > start adding this necessary functionality to the fallback path,
> > separate it out into generic_copy_file_range().
> >
> > generic_copy_file_range() has the same prototype as
> > ->copy_file_range() so that filesystems can use it in their custom
> > ->copy_file_range() method if they so choose.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Looks good.
> 
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> 
> Question:
> 2 years ago you suggested that I convert the overlayfs copy up
> code that does a do_splice_direct() with a loop of vfs_copy_file_range():
> https://marc.info/?l=linux-fsdevel&m=147369468521525&w=2
> We ended up with a slightly different solution, but with your recent
> changes, I can get back to your original proposal.
> 
> Back then, I wondered whether it makes sense to push the killable
> loop of shorter do_splice_direct() calls into the vfs helper.
> What do you think about adding this to generic_copy_file_range()
> now? (I can do that after your changes are merged).

No. Adding another loop on top of all the loops already in
do_splice_direct() is just crazy. The code is hard enough to follow
to begin with. If we are going to make do_splice_direct() killable,
then it needs to be done in the splice_direct_to_actor() loop that
already splits large splice ranges up into smaller chunks.
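
As a rough userspace sketch of that shape: one loop that already splits
a large range into chunks, with the interruption check placed inside
that loop rather than in a second loop wrapped around it. Everything
here is hypothetical (chunked_copy(), CHUNK, the stop_requested
callback); it models the control flow only and is not the
splice_direct_to_actor() code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4096	/* hypothetical per-iteration copy size */

/*
 * Copy len bytes in CHUNK-sized pieces, polling stop_requested() before
 * each piece. Returns the bytes copied so far, or -EINTR if the copy
 * was stopped before any data moved.
 */
static long chunked_copy(char *dst, const char *src, size_t len,
			 int (*stop_requested)(void *), void *arg)
{
	size_t done = 0;

	assert(dst != NULL && src != NULL);

	while (done < len) {
		size_t n = len - done < CHUNK ? len - done : CHUNK;

		if (stop_requested && stop_requested(arg))
			break;
		memcpy(dst + done, src + done, n);
		done += n;
	}
	return done ? (long)done : (len ? -EINTR : 0);
}

/* Example callback: allow a fixed number of chunks, then ask to stop. */
static int stop_after(void *arg)
{
	int *polls_left = arg;

	return (*polls_left)-- <= 0;
}
```

With a budget of two polls and a three-chunk request, the copy stops
after two chunks and reports a short count, which the caller can treat
like any other short copy.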

As it is, addressing the flaws of do_splice_direct() is not
something I'm about to do in this patchset. It has many issues, and
it's yet another piece of work we need to undertake to make
copy_file_range() somewhat user friendly.

> The fact that userspace *can* enter a very long unkillable loop
> with current copy_file_range() syscall doesn't mean that we
> *should* persist this situation. After all, fixing the brokenness
> of the existing interface is what you set out to do.

That's not an API issue - that's an implementation problem.

Quite frankly, making copy offload implementations killable is going
to be "fun" for filesystems that offload the copy to remote servers, so
whatever we do for the fallback isn't going to prevent
copy_file_range() from being unkillable.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 03/11] vfs: no fallback for ->copy_file_range
  2018-12-03 10:22   ` Amir Goldstein
@ 2018-12-03 23:02     ` Dave Chinner
  2018-12-06  4:16       ` Amir Goldstein
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2018-12-03 23:02 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 12:22:21PM +0200, Amir Goldstein wrote:
> On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Now that we have generic_copy_file_range(), remove it as a fallback
> > case when offloads fail. This puts the responsibility for executing
> > fallbacks on the filesystems that implement ->copy_file_range and
> > allows us to add operational validity checks to
> > generic_copy_file_range().
> >
> > Rework vfs_copy_file_range() to call a new do_copy_file_range()
> > helper to execute the copying callout, and move calls to
> > generic_copy_file_range() into filesystem methods where they
> > currently return failures.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> You may add
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> 
> After fixing the overlayfs issue below.
> ...
> 
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index 84dd957efa24..68736e5d6a56 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -486,8 +486,15 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
> >                                    struct file *file_out, loff_t pos_out,
> >                                    size_t len, unsigned int flags)
> >  {
> > -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> > +       ssize_t ret;
> > +
> > +       ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> >                             OVL_COPY);
> > +
> > +       if (ret == -EOPNOTSUPP)
> > +               ret = generic_copy_file_range(file_in, pos_in, file_out,
> > +                                       pos_out, len, flags);
> > +       return ret;
> >  }
> >
> 
> This is unneeded, because ovl_copyfile(OVL_COPY) is implemented
> by calling vfs_copy_file_range() (on the underlying files) and it is
> not possible to get EOPNOTSUPP from vfs_copy_file_range().

Except that it is possible, e.g. if the underlying filesystem tries
a copy offload, gets a "not supported" failure from the remote
server and then doesn't implement a fallback.
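
A small userspace model of that case, under stated assumptions:
offload_copy(), generic_copy() and fs_copy_file_range() are hypothetical
stand-ins, and the server_supports_copy flag fakes a remote server
rejecting the offload. It shows how a method that forwards the offload
result unchanged surfaces -EOPNOTSUPP to its caller, and how a
per-filesystem fallback hides it.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical offload: the "server" either copies or refuses. */
static long offload_copy(int server_supports_copy, long len)
{
	assert(len >= 0);
	return server_supports_copy ? len : -EOPNOTSUPP;
}

/* Hypothetical generic fallback that always manages to copy. */
static long generic_copy(long len)
{
	return len;
}

/*
 * Hypothetical ->copy_file_range() method. Without a fallback the
 * server's refusal is passed straight back to the caller.
 */
static long fs_copy_file_range(int server_supports_copy, int has_fallback,
			       long len)
{
	long ret = offload_copy(server_supports_copy, len);

	if (ret == -EOPNOTSUPP && has_fallback)
		ret = generic_copy(len);
	return ret;
}
```

A stacked caller such as overlayfs sits on top of whichever variant the
lower filesystem implements, so it has to cope with the no-fallback
result as well.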

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-03 21:33   ` Olga Kornievskaia
@ 2018-12-03 23:04     ` Dave Chinner
  0 siblings, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03 23:04 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: linux-fsdevel, linux-xfs, linux-nfs, linux-unionfs, ceph-devel,
	linux-cifs

On Mon, Dec 03, 2018 at 04:33:30PM -0500, Olga Kornievskaia wrote:
> On Mon, Dec 3, 2018 at 3:34 AM Dave Chinner <david@fromorbit.com> wrote:
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3022,6 +3022,9 @@ extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> >                                 struct file *file_out, loff_t pos_out,
> >                                 loff_t *count, unsigned int remap_flags);
> > +extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > +                               struct file *file_out, loff_t pos_out,
> > +                               size_t *count, unsigned int flags);
> >  extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
> >  extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
> >  extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 81adec8ee02c..0a170425935b 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2975,6 +2975,75 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
> >         return 0;
> >  }
> >
> > +
> > +/*
> > + * Performs necessary checks before doing a file copy
> > + *
> > + * Can adjust amount of bytes to copy
> > + * Returns appropriate error code that caller should return or
> > + * zero in case the copy should be allowed.
> > + */
> > +int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > +                        struct file *file_out, loff_t pos_out,
> > +                        size_t *req_count, unsigned int flags)
> > +{
> > +       struct inode *inode_in = file_inode(file_in);
> > +       struct inode *inode_out = file_inode(file_out);
> > +       uint64_t count = *req_count;
> > +       uint64_t bcount;
> > +       loff_t size_in, size_out;
> > +       loff_t bs = inode_out->i_sb->s_blocksize;
> > +       int ret;
> 
> I got compile warnings:
> 
> mm/filemap.c: In function ‘generic_copy_file_checks’:
> mm/filemap.c:2995:9: warning: unused variable ‘bs’ [-Wunused-variable]
>   loff_t bs = inode_out->i_sb->s_blocksize;
>          ^
> mm/filemap.c:2993:11: warning: unused variable ‘bcount’ [-Wunused-variable]
>   uint64_t bcount;

Strange. Yes, they certainly are there when I compile my stack up to
this point, but when I compile the whole series they aren't there.

I'll fix it up.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 07/11] vfs: copy_file_range should update file timestamps
  2018-12-03 10:47   ` Amir Goldstein
  2018-12-03 17:33     ` Olga Kornievskaia
@ 2018-12-03 23:19     ` Dave Chinner
  1 sibling, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03 23:19 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 12:47:39PM +0200, Amir Goldstein wrote:
> On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Timestamps are not updated right now, so programs looking for
> > timestamp updates for file modifications (like rsync) will not
> > detect that files have changed. We are also accessing the source
> > data when doing a copy (but not when cloning) so we need to update
> > atime on the source file as well.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/read_write.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 3b101183ea19..3288db1d5f21 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1576,6 +1576,16 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> >  {
> >         ssize_t ret;
> >
> > +       /* Update source timestamps, because we are accessing file data */
> > +       file_accessed(file_in);
> > +
> > +       /* Update destination timestamps, since we can alter file contents. */
> > +       if (!(file_out->f_mode & FMODE_NOCMTIME)) {
> > +               ret = file_update_time(file_out);
> > +               if (ret)
> > +                       return ret;
> > +       }
> > +
> 
> If there is any consistency about who is responsible for calling
> file_accessed() and file_update_time(), it eludes me. grep tells me that
> they are mostly handled by filesystem code or generic helpers called by
> filesystem code and not in the vfs helpers.

This isn't the "vfs helper" - this is the code that executes a data
copy. We have to do these timestamp updates regardless of the copy
mechanism used so it makes no real sense to force every
implementation to do it, and then also have to ensure the generic
fallback does it as well. Do it once for everyone, then nobody else
needs to care about it.

> FMODE_NOCMTIME seems like an xfs specific flag (for DMAPI?), which

It's a generic VFS flag that originally only XFS used. We check it
in places where data IO to XFS files might be done. Given that we
have VFS functions doing writes on behalf of XFS filesystems (such as
remap_file_range() and copy_file_range()), the timestamp updates need
to check this flag.

> most generic callers of file_update_time() completely ignore.

Because most cases don't get called from a context that can have
FMODE_NOCMTIME set. If more filesystems start to use FMODE_NOCMTIME
then it will have to be more widely checked.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range
  2018-12-03 11:04   ` Amir Goldstein
  2018-12-03 19:11     ` Darrick J. Wong
@ 2018-12-03 23:34     ` Dave Chinner
  1 sibling, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03 23:34 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 01:04:07PM +0200, Amir Goldstein wrote:
> On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Before we can enable cross-device copies into copy_file_range(),
> > we have to ensure that ->remap_file_range() implementations will
> > correctly reject attempts to do cross filesystem clones. Currently
> 
> But you only fixed remap_file_range() implementations of xfs and ocfs2...
> 
> > these checks are done above calls to ->remap_file_range(), but
> > we need to drive them inwards so that we get EXDEV protection for all
> > callers of ->remap_file_range().
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/read_write.c | 21 +++++++++++++--------
> >  1 file changed, 13 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 3288db1d5f21..174cf92eea1d 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1909,6 +1909,19 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> >         bool same_inode = (inode_in == inode_out);
> >         int ret;
> >
> > +       /*
> > +        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> > +        * the same mount. Practically, they only need to be on the same file
> > +        * system. We check this here rather than at the ioctl layers because
> > +        * this is effectively a limitation of the filesystem implementations,
> > +        * not so much the API itself. Further, ->remap_file_range() can be
> > +        * called from syscalls that don't have cross device copy restrictions
> > +        * (such as copy_file_range()) and so we need to catch them before we
> > +        * do any damage.
> > +        */
> > +       if (inode_in->i_sb != inode_out->i_sb)
> > +               return -EXDEV;
> > +
> >         /* Don't touch certain kinds of inodes */
> >         if (IS_IMMUTABLE(inode_out))
> >                 return -EPERM;
> > @@ -2013,14 +2026,6 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
> >         if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> >                 return -EINVAL;
> >
> > -       /*
> > -        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> > -        * the same mount. Practically, they only need to be on the same file
> > -        * system.
> > -        */
> > -       if (inode_in->i_sb != inode_out->i_sb)
> > -               return -EXDEV;
> > -
> 
> That leaves {nfs42,cifs,btrfs}_remap_file_range() exposed to passing
> files not of their own fs type let alone same sb when do_clone_file_range()
> is called from ovl_copy_up_data().

For some reason I thought everyone called
generic_remap_file_range_prep() so they behaved the same way. My
mistake.

Really, though, I'm of the opinion that those filesystems should be
changed to call the generic checks rather than open code their own
incomplete/incompatible set of checks. This is exactly what I'm
trying to avoid with copy_file_range() - checks are done in one
place, all filesystems have the same checks done - so that future
modification and maintenance is so much easier.

We need to do the same thing to the remap_file_range()
implementations.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range
  2018-12-03 19:11     ` Darrick J. Wong
@ 2018-12-03 23:37       ` Dave Chinner
  2018-12-03 23:58         ` Darrick J. Wong
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2018-12-03 23:37 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Amir Goldstein, linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 11:11:30AM -0800, Darrick J. Wong wrote:
> On Mon, Dec 03, 2018 at 01:04:07PM +0200, Amir Goldstein wrote:
> > On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > before we can enable cross-device copies into copy_file_range(),
> > > we have to ensure that ->remap_file_range() implementations will
> > > correctly reject attempts to do cross filesystem clones. Currently
> > 
> > But you only fixed remap_file_range() implementations of xfs and ocfs2...
> > 
> > > these checks are done above calls to ->remap_file_range(), but
> > > we need to drive them inwards so that we get EXDEV protection for all
> > > callers of ->remap_file_range().
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/read_write.c | 21 +++++++++++++--------
> > >  1 file changed, 13 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > index 3288db1d5f21..174cf92eea1d 100644
> > > --- a/fs/read_write.c
> > > +++ b/fs/read_write.c
> > > @@ -1909,6 +1909,19 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> > >         bool same_inode = (inode_in == inode_out);
> > >         int ret;
> > >
> > > +       /*
> > > +        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> > > +        * the same mount. Practically, they only need to be on the same file
> > > +        * system. We check this here rather than at the ioctl layers because
> > > +        * this is effectively a limitation of the filesystem implementations,
> > > +        * not so much the API itself. Further, ->remap_file_range() can be
> > > +        * called from syscalls that don't have cross device copy restrictions
> > > +        * (such as copy_file_range()) and so we need to catch them before we
> > > +        * do any damage.
> > > +        */
> > > +       if (inode_in->i_sb != inode_out->i_sb)
> > > +               return -EXDEV;
> > > +
> > >         /* Don't touch certain kinds of inodes */
> > >         if (IS_IMMUTABLE(inode_out))
> > >                 return -EPERM;
> > > @@ -2013,14 +2026,6 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
> > >         if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> > >                 return -EINVAL;
> > >
> > > -       /*
> > > -        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> > > -        * the same mount. Practically, they only need to be on the same file
> > > -        * system.
> > > -        */
> > > -       if (inode_in->i_sb != inode_out->i_sb)
> > > -               return -EXDEV;
> > > -
> > 
> 
> I think this is sort of backwards -- the checks should stay in
> do_clone_file_range, and vfs_copy_file_range should be calling that
> instead of directly calling ->remap_file_range():
> 
> vfs_copy_file_range()
> {
> 	file_start_write(...);
> 	ret = do_clone_file_range(...);
> 	if (ret > 0)
> 		return ret;
> 	ret = do_copy_file_range(...);
> 	file_end_write(...);
> 	return ret;
> }

I'm already confused by the way we weave in and out of "vfs_/do_*"
functions, and this just makes it worse.

Just what the hell is supposed to be in a "vfs_" prefixed function,
and why the hell is it considered a "vfs" level function if we then
export its internal functions for individual filesystems to use?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
  2018-12-03 18:53   ` Anna Schumaker
  2018-12-03 19:27     ` Olga Kornievskaia
@ 2018-12-03 23:40     ` Dave Chinner
  1 sibling, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03 23:40 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 01:53:35PM -0500, Anna Schumaker wrote:
> On Mon, 2018-12-03 at 19:34 +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > We want to enable cross-filesystem copy_file_range functionality
> > where possible, so push the "same superblock only" checks down to
> > the individual filesystem callouts so they can make their own
> > decisions about cross-superblock copy offload.
....
> > diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> > index d7766a6eb0f4..4783c0c1c49e 100644
> > --- a/fs/nfs/nfs4file.c
> > +++ b/fs/nfs/nfs4file.c
> > @@ -133,16 +133,20 @@ static ssize_t nfs4_copy_file_range(struct file
> > *file_in, loff_t pos_in,
> >  				    struct file *file_out, loff_t pos_out,
> >  				    size_t count, unsigned int flags)
> >  {
> > -	ssize_t ret;
> > +	ssize_t ret = -EXDEV;
> >  
> >  	if (file_inode(file_in) == file_inode(file_out))
> >  		return -EINVAL;
> > -retry:
> > -	ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
> > -	if (ret == -EAGAIN)
> > -		goto retry;
> >  
> > -	if (ret == -EOPNOTSUPP)
> > +	/* only offload copy if superblock is the same */
> > +	if (file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
> > +		do {
> > +			ret = nfs42_proc_copy(file_in, pos_in, file_out,
> > +					pos_out, count);
> > +		} while (ret == -EAGAIN);
> 
> I'm not convinced we can actually return -EAGAIN from nfs42_proc_copy().  The
> nfs_get_lock_context() function doesn't return it, and if _nfs42_proc_copy()
> returns -EAGAIN it's immediately retried by nfs42_proc_copy() instead of
> returning.

Not really my concern, nor something that should be fixed in this
patchset. i.e. the function does the same thing before and after
this patch, so whether EAGAIN can occur or not is irrelevant to
this patchset....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 05/11] vfs: use inode_permission in copy_file_range()
  2018-12-03 18:18   ` Darrick J. Wong
@ 2018-12-03 23:55     ` Dave Chinner
  2018-12-05 17:28       ` bfields
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2018-12-03 23:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 10:18:03AM -0800, Darrick J. Wong wrote:
> On Mon, Dec 03, 2018 at 07:34:10PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Similar to FIDEDUPERANGE, make copy_file_range() check that we have
> 
> TLDR: No, it's not similar to FIDEDUPERANGE -- the use of
> inode_permission() in allow_file_dedupe() is to enable callers to dedupe
> into a file for which the caller has write permissions but opened the
> file O_RDONLY.

What a grotty, nasty hack.

> [Please keep reading...]
> 
> > write permissions to the destination inode.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  mm/filemap.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 0a170425935b..876df5275514 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3013,6 +3013,11 @@ int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> >  	    (file_out->f_flags & O_APPEND))
> >  		return -EBADF;
> >  
> > +	/* make sure we really are allowed to write to the destination inode */
> > +	ret = inode_permission(inode_out, MAY_WRITE);
> 
> What's the difference between security_file_permission and
> inode_permission, and when do we call them for a regular
> open-write-close sequence?  Hmmm, let me take a look:
.....
> We also cannot dedupe into a file that becomes immutable after we open
> it for write, but we can dedupe into a file that loses its write
> permissions after we open it.

It's more nuanced than that - dedupe will proceed after write
permissions have been removed only if you are root or own the file,
otherwise it will fail.

Updated summary:

> op:		after +immutable?	after chmod a-w?
> write		yes			yes
> clonerange	no			yes
> dedupe	no			maybe
> newcopyrange	no			no
>
> My reaction: I don't think that writes should be allowed after an
> administrator marks a file immutable (but that's a separate issue) but I
> do think we should be consistent in allowing copying into a file that
> has lost its write permissions after we opened the file for write, like
> we do for write() and the remap ioct....

If we want to allow copying to files we don't actually have
permission to write to anymore, then I'll remove this from the test,
the man page and the code. But, quite frankly, I don't trust remote
server side copies to follow the same permission models as the
client side OS, so I think we have to treat copy_file_range
differently to a normal write syscall....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range
  2018-12-03 23:37       ` Dave Chinner
@ 2018-12-03 23:58         ` Darrick J. Wong
  2018-12-04  9:17           ` Amir Goldstein
  0 siblings, 1 reply; 83+ messages in thread
From: Darrick J. Wong @ 2018-12-03 23:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Tue, Dec 04, 2018 at 10:37:14AM +1100, Dave Chinner wrote:
> On Mon, Dec 03, 2018 at 11:11:30AM -0800, Darrick J. Wong wrote:
> > On Mon, Dec 03, 2018 at 01:04:07PM +0200, Amir Goldstein wrote:
> > > On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > before we can enable cross-device copies into copy_file_range(),
> > > > we have to ensure that ->remap_file_range() implementations will
> > > > correctly reject attempts to do cross filesystem clones. Currently
> > > 
> > > But you only fixed remap_file_range() implementations of xfs and ocfs2...
> > > 
> > > > these checks are done above calls to ->remap_file_range(), but
> > > > we need to drive them inwards so that we get EXDEV protection for all
> > > > callers of ->remap_file_range().
> > > >
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/read_write.c | 21 +++++++++++++--------
> > > >  1 file changed, 13 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > index 3288db1d5f21..174cf92eea1d 100644
> > > > --- a/fs/read_write.c
> > > > +++ b/fs/read_write.c
> > > > @@ -1909,6 +1909,19 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> > > >         bool same_inode = (inode_in == inode_out);
> > > >         int ret;
> > > >
> > > > +       /*
> > > > +        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> > > > +        * the same mount. Practically, they only need to be on the same file
> > > > +        * system. We check this here rather than at the ioctl layers because
> > > > +        * this is effectively a limitation of the filesystem implementations,
> > > > +        * not so much the API itself. Further, ->remap_file_range() can be
> > > > +        * called from syscalls that don't have cross device copy restrictions
> > > > +        * (such as copy_file_range()) and so we need to catch them before we
> > > > +        * do any damage.
> > > > +        */
> > > > +       if (inode_in->i_sb != inode_out->i_sb)
> > > > +               return -EXDEV;
> > > > +
> > > >         /* Don't touch certain kinds of inodes */
> > > >         if (IS_IMMUTABLE(inode_out))
> > > >                 return -EPERM;
> > > > @@ -2013,14 +2026,6 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
> > > >         if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> > > >                 return -EINVAL;
> > > >
> > > > -       /*
> > > > -        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> > > > -        * the same mount. Practically, they only need to be on the same file
> > > > -        * system.
> > > > -        */
> > > > -       if (inode_in->i_sb != inode_out->i_sb)
> > > > -               return -EXDEV;
> > > > -
> > > 
> > 
> > I think this is sort of backwards -- the checks should stay in
> > do_clone_file_range, and vfs_copy_file_range should be calling that
> > instead of directly calling ->remap_file_range():
> > 
> > vfs_copy_file_range()
> > {
> > 	file_start_write(...);
> > 	ret = do_clone_file_range(...);
> > 	if (ret > 0)
> > 		return ret;
> > 	ret = do_copy_file_range(...);
> > 	file_end_write(...);
> > 	return ret;
> > }
> 
> I'm already confused by the way we weave in and out of "vfs_/do_*"
> functions, and this just makes it worse.
> 
> Just what the hell is supposed to be in a "vfs_" prefixed function,
> and why the hell is it considered a "vfs" level function if we then
> export its internal functions for individual filesystems to use?

I /think/ vfs_ functions are file_start_write()/file_end_write()
wrappers around a similarly named function that lacks the freeze
protection??

(AFAICT Amir made that split so that overlayfs could use these
functions, though I do not know if everything vfs_ was made that way
/specifically/ for overlayfs or if that's the way things have been and
ovlfs simply takes advantage of it...)

Guhhh, none of this is documented......

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range
  2018-12-03  8:34 ` [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range Dave Chinner
  2018-12-03 11:04   ` Amir Goldstein
  2018-12-03 18:24   ` Darrick J. Wong
@ 2018-12-04  8:18   ` Olga Kornievskaia
  2 siblings, 0 replies; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-04  8:18 UTC (permalink / raw)
  To: david
  Cc: linux-fsdevel, linux-xfs, linux-nfs, linux-unionfs, ceph-devel,
	linux-cifs

On Mon, Dec 3, 2018 at 3:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> before we can enable cross-device copies into copy_file_range(),
> we have to ensure that ->remap_file_range() implementations will
> correctly reject attempts to do cross filesystem clones. Currently
> these checks are done above calls to ->remap_file_range(), but
> we need to drive them inwards so that we get EXDEV protection for all
> callers of ->remap_file_range().

If there is no check before calling ->remap_file_range(), then NFS
barfs. Perhaps it needs an internal check that both file handles
belong to NFS, but this was not needed before. Otherwise there needs
to be a check in the VFS.

>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/read_write.c | 21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 3288db1d5f21..174cf92eea1d 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1909,6 +1909,19 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>         bool same_inode = (inode_in == inode_out);
>         int ret;
>
> +       /*
> +        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> +        * the same mount. Practically, they only need to be on the same file
> +        * system. We check this here rather than at the ioctl layers because
> +        * this is effectively a limitation of the filesystem implementations,
> +        * not so much the API itself. Further, ->remap_file_range() can be
> +        * called from syscalls that don't have cross device copy restrictions
> +        * (such as copy_file_range()) and so we need to catch them before we
> +        * do any damage.
> +        */
> +       if (inode_in->i_sb != inode_out->i_sb)
> +               return -EXDEV;
> +
>         /* Don't touch certain kinds of inodes */
>         if (IS_IMMUTABLE(inode_out))
>                 return -EPERM;
> @@ -2013,14 +2026,6 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
>         if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
>                 return -EINVAL;
>
> -       /*
> -        * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> -        * the same mount. Practically, they only need to be on the same file
> -        * system.
> -        */
> -       if (inode_in->i_sb != inode_out->i_sb)
> -               return -EXDEV;
> -
>         if (!(file_in->f_mode & FMODE_READ) ||
>             !(file_out->f_mode & FMODE_WRITE) ||
>             (file_out->f_flags & O_APPEND))
> --
> 2.19.1
>


* Re: [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range
  2018-12-03 23:58         ` Darrick J. Wong
@ 2018-12-04  9:17           ` Amir Goldstein
  0 siblings, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-04  9:17 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

> > > I think this is sort of backwards -- the checks should stay in
> > > do_clone_file_range, and vfs_copy_file_range should be calling that
> > > instead of directly calling ->remap_file_range():
> > >
> > > vfs_copy_file_range()
> > > {
> > >     file_start_write(...);
> > >     ret = do_clone_file_range(...);
> > >     if (ret > 0)
> > >             return ret;
> > >     ret = do_copy_file_range(...);
> > >     file_end_write(...);
> > >     return ret;
> > > }
> >
> > I'm already confused by the way we weave in and out of "vfs_/do_*"
> > functions, and this just makes it worse.
> >
> > Just what the hell is supposed to be in a "vfs_" prefixed function,
> > and why the hell is it considered a "vfs" level function if we then
> > export its internal functions for individual filesystems to use?
>
> I /think/ vfs_ functions are file_start_write()/file_end_write()
> wrappers around a similarly named function that lacks the freeze
> protection??

That is definitely not an official definition of vfs_ vs. do_, but I found
this rule to be a common practice, which is why I swapped
{do,vfs}_clone_file_range(). But around vfs you can find many examples
where do_ helpers wrap vfs_ helpers.

>
> (AFAICT Amir made that split so that overlayfs could use these
> functions, though I do not know if everything vfs_ was made that way
> /specifically/ for overlayfs or if that's the way things have been and
> ovlfs simply takes advantage of it...)
>
> Guhhh, none of this is documented......
>

It looks like in the git epoch, things were pretty straightforward.
vfs_XXX was the interface called after sys_XXX converted
userspace arguments (e.g. char *name, int fd) to vfs objects
(e.g. struct path,dentry,inode,file). Sometimes vfs_ helpers called
do_ helpers for several reasons. See for example the epoch version
of fs/namei.c and fs/read_write.c.
Even then there were exceptions. For example do_sendfile()
doesn't even have a vfs_ interface, although it is clear what
that prospective interface would look like.
To that end, do_splice_direct() acts as the standard do_ helper
to that non-existing vfs_ interface.

From there on, I guess things kinda grew organically.
fs/namei.c syscalls grew do_XXXat() helpers between syscalls
and vfs_XXX interface.

Overlayfs uses vfs_ interface 99% of the time, so from that perspective
it is regarded as an interface with vfs objects as arguments that does
NOT skip security_ checks and does NOT bypass freeze protection.
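
As a rough illustration of that convention (not an official definition,
and not kernel code), the split can be sketched in userspace C. All
*_sketch names are hypothetical, and the writers counter is only an
assumed stand-in for what file_start_write()/file_end_write() track:

```c
#include <assert.h>

/* Hypothetical stand-in for superblock freeze accounting. */
struct sb_sketch {
	int writers;	/* callers currently holding freeze protection */
};

static void start_write_sketch(struct sb_sketch *sb)
{
	sb->writers++;
}

static void end_write_sketch(struct sb_sketch *sb)
{
	assert(sb->writers > 0);
	sb->writers--;
}

/* do_ helper: expects the caller to already hold freeze protection. */
static long do_clone_sketch(struct sb_sketch *sb, long len)
{
	assert(sb->writers > 0);
	return len;	/* pretend the whole range was cloned */
}

/* vfs_ wrapper: takes protection, calls the helper, always releases. */
static long vfs_clone_sketch(struct sb_sketch *sb, long len)
{
	long ret;

	start_write_sketch(sb);
	ret = do_clone_sketch(sb, len);
	end_write_sketch(sb);
	return ret;
}
```

A caller that manages freeze protection itself, as overlayfs does during
copy up, would call the do_ helper directly between its own
start/end pair.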

Overlayfs calling do_clone_file_range() and do_splice_direct() are
the only exception to this rule.
If we wanted to replace those calls in ovl_copy_up_data() with
a single call to do_copy_file_range(), then said helper should NOT
be taking freeze protection and should do the fallback between
filesystem copy_file_range and generic_copy_file_range.

Cheers,
Amir.


* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2018-12-03 12:46   ` Amir Goldstein
@ 2018-12-04 15:13     ` Christoph Hellwig
  2018-12-04 21:29       ` Dave Chinner
  0 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:13 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The man page says:
> >
> > EINVAL Requested range extends beyond the end of the source file
> >
> > But the current behaviour is that copy_file_range does a short
> > copy up to the source file EOF. Fix the kernel behaviour to match
> > the behaviour described in the man page.

I think the behavior implemented is a lot more useful than the one
documented..

> > +       /* If the source range crosses EOF, fail the copy */
> > +       if (pos_in >= i_size(inode_in) || pos_in + len > i_size(inode_in))
> > +               return -EINVAL;
> > +
> 
> i_size_read()...
> 
> Otherwise
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Looks like this doesn't even compile?


* Re: [PATCH 02/11] vfs: introduce generic_copy_file_range()
  2018-12-03  8:34 ` [PATCH 02/11] vfs: introduce generic_copy_file_range() Dave Chinner
  2018-12-03 10:03   ` Amir Goldstein
@ 2018-12-04 15:14   ` Christoph Hellwig
  1 sibling, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 03/11] vfs: no fallback for ->copy_file_range
  2018-12-03  8:34 ` [PATCH 03/11] vfs: no fallback for ->copy_file_range Dave Chinner
  2018-12-03 10:22   ` Amir Goldstein
  2018-12-03 18:23   ` Anna Schumaker
@ 2018-12-04 15:16   ` Christoph Hellwig
  2 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-03  8:34 ` [PATCH 04/11] vfs: add missing checks to copy_file_range Dave Chinner
                     ` (2 preceding siblings ...)
  2018-12-03 21:33   ` Olga Kornievskaia
@ 2018-12-04 15:18   ` Christoph Hellwig
  2018-12-12 11:31   ` Luis Henriques
  4 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

Ok, this fixes the earlier i_size() compile failure..


* Re: [PATCH 05/11] vfs: use inode_permission in copy_file_range()
  2018-12-03  8:34 ` [PATCH 05/11] vfs: use inode_permission in copy_file_range() Dave Chinner
                     ` (2 preceding siblings ...)
  2018-12-03 18:53   ` Eric Biggers
@ 2018-12-04 15:19   ` Christoph Hellwig
  3 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

As Darrick already pointed out, this looks wrong - for "normal" file
operations, which copy definitely should be, we should only allow
write access on a writable fd.


* Re: [PATCH 06/11] vfs: copy_file_range needs to strip setuid bits
  2018-12-03  8:34 ` [PATCH 06/11] vfs: copy_file_range needs to strip setuid bits Dave Chinner
  2018-12-03 12:51   ` Amir Goldstein
@ 2018-12-04 15:21   ` Christoph Hellwig
  1 sibling, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

file_remove_privs needs to be called with i_rwsem held, which I don't
think we do here.  It also really should be called under the same
i_rwsem critical section that does modify the file content, so we'll
have to move the call into the methods.


* Re: [PATCH 07/11] vfs: copy_file_range should update file timestamps
  2018-12-03  8:34 ` [PATCH 07/11] vfs: copy_file_range should update file timestamps Dave Chinner
  2018-12-03 10:47   ` Amir Goldstein
@ 2018-12-04 15:24   ` Christoph Hellwig
  1 sibling, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

On Mon, Dec 03, 2018 at 07:34:12PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Timestamps are not updated right now, so programs looking for
> timestamp updates for file modifications (like rsync) will not
> detect that files have changed. We are also accessing the source
> data when doing a copy (but not when cloning) so we need to update
> atime on the source file as well.

This needs to be done inside the method, as a few file systems
do odd things about timestamps (yes, even now that XFS doesn't do that
anymore :)).


* Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
  2018-12-03  8:34 ` [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down Dave Chinner
                     ` (2 preceding siblings ...)
  2018-12-03 18:53   ` Anna Schumaker
@ 2018-12-04 15:43   ` Christoph Hellwig
  2018-12-04 22:18     ` Dave Chinner
  3 siblings, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

Well, this isn't bugfixes anymore, but adding new features..


* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2018-12-04 15:13     ` Christoph Hellwig
@ 2018-12-04 21:29       ` Dave Chinner
  2018-12-04 21:47         ` Olga Kornievskaia
  2018-12-05 14:12         ` Christoph Hellwig
  0 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-04 21:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Amir Goldstein, linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > The man page says:
> > >
> > > EINVAL Requested range extends beyond the end of the source file
> > >
> > > But the current behaviour is that copy_file_range does a short
> > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > the behaviour described in the man page.
> 
> I think the behavior implemented is a lot more useful than the one
> documented..

The current behaviour is really nasty. Because copy_file_range() can
return short copies, the caller has to implement a loop to ensure
the range they want gets copied.  When the source range you are
trying to copy overlaps source EOF, consider this loop:

	while (len > 0) {
		ret = copy_file_range(... len ...)
		...
		off_in += ret;
		off_out += ret;
		len -= ret;
	}

Currently the fallback code copies up to the end of the source file
on the first copy and then fails the second copy with EINVAL because
the source range is now completely beyond EOF.

So, from an application perspective, did the copy succeed or did it
fail?

Existing tools that exercise copy_file_range (like xfs_io) consider
this a failure, because the second copy_file_range() call returns
EINVAL and not some "there is no more to copy" marker like read()
returning 0 bytes when attempting to read beyond EOF.

IOWs, we cannot tell the difference between a real error and a short
copy because the input range spans EOF and it was silently
shortened. That's the API problem we need to fix here - the existing
behaviour is really crappy for applications. Erroring out
immediately is one solution, and it's what the man page says should
happen so that is what I implemented.

Realistically, though, I think an attempt to read beyond EOF for the
copy should result in behaviour like read() (i.e. return 0 bytes),
not EINVAL. The existing behaviour needs to change, though.
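
To make the ambiguity concrete, here is a small userspace simulation
(not kernel code; sim_src, sim_copy() and copy_all() are hypothetical
names invented for illustration). sim_copy() stands in for
copy_file_range() and either fails with EINVAL once the position reaches
EOF, or returns 0 the way read() does:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical source file model. */
struct sim_src {
	off_t size;		/* source file size, i.e. EOF */
	int eof_is_error;	/* 1: EINVAL at or past EOF, 0: return 0 */
};

/* Stand-in for copy_file_range(): silently shortens at EOF. */
static ssize_t sim_copy(const struct sim_src *src, off_t pos, size_t len)
{
	if (pos >= src->size)
		return src->eof_is_error ? -EINVAL : 0;
	if (len > (size_t)(src->size - pos))
		len = (size_t)(src->size - pos);
	return (ssize_t)len;
}

/* The caller's loop: returns bytes copied or a negative errno. */
static ssize_t copy_all(const struct sim_src *src, off_t pos, size_t len)
{
	ssize_t total = 0;

	while (len > 0) {
		ssize_t ret = sim_copy(src, pos, len);

		if (ret < 0)
			return ret;	/* real error or just EOF? unknown */
		if (ret == 0)
			break;		/* unambiguous: nothing left to copy */
		assert((size_t)ret <= len);
		pos += ret;
		len -= (size_t)ret;
		total += ret;
	}
	return total;
}
```

With eof_is_error set, asking for 150 bytes from a 100-byte source
copies 100 bytes and then reports -EINVAL, so the caller cannot tell a
short copy from a failure. With the read()-like variant the same request
simply returns 100.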

> > i_size_read()...
> > 
> > Otherwise
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> 
> Looks like this doesn't even compile?

It's fixed in a later patch that consolidates the checks into a
generic check function, but I'm not sure why my "compile every
patch" script didn't catch this.

Cheers,

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2018-12-04 21:29       ` Dave Chinner
@ 2018-12-04 21:47         ` Olga Kornievskaia
  2018-12-04 22:31           ` Dave Chinner
  2018-12-05 14:12         ` Christoph Hellwig
  1 sibling, 1 reply; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-04 21:47 UTC (permalink / raw)
  To: david
  Cc: Christoph Hellwig, Amir Goldstein, linux-fsdevel, linux-xfs,
	linux-nfs, linux-unionfs, ceph-devel, linux-cifs

On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > The man page says:
> > > >
> > > > EINVAL Requested range extends beyond the end of the source file
> > > >
> > > > But the current behaviour is that copy_file_range does a short
> > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > the behaviour described in the man page.
> >
> > I think the behavior implemented is a lot more useful than the one
> > documented..
>
> The current behaviour is really nasty. Because copy_file_range() can
> return short copies, the caller has to implement a loop to ensure
> the range they want gets copied.  When the source range you are
> trying to copy overlaps source EOF, consider this loop:
>
>         while (len > 0) {
>                 ret = copy_file_range(... len ...)
>                 ...
>                 off_in += ret;
>                 off_out += ret;
>                 len -= ret;
>         }
>
> Currently the fallback code copies up to the end of the source file
> on the first copy and then fails the second copy with EINVAL because
> the source range is now completely beyond EOF.
>
> So, from an application perspective, did the copy succeed or did it
> fail?
>
> Existing tools that exercise copy_file_range (like xfs_io) consider
> this a failure, because the second copy_file_range() call returns
> EINVAL and not some "there is no more to copy" marker like read()
> returning 0 bytes when attempting to read beyond EOF.
>
> IOWs, we cannot tell the difference between a real error and a short
> copy because the input range spans EOF and it was silently
> shortened. That's the API problem we need to fix here - the existing
> behaviour is really crappy for applications. Erroring out
> immediately is one solution, and it's what the man page says should
> happen so that is what I implemented.
>
> Realistically, though, I think an attempt to read beyond EOF for the
> copy should result in behaviour like read() (i.e. return 0 bytes),
> not EINVAL. The existing behaviour needs to change, though.

There are two checks to consider:
1. pos_in >= EOF should return EINVAL.
2. What could perhaps be relaxed is the pos_in+len >= EOF case, which
should return a short copy.

Having check #1 enforced allows us to differentiate between a real
error and a short copy.
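
To make the two checks concrete, here is a small userspace model of
them. It is only an illustration: clamp_copy_len() is a made-up name,
and this is not the kernel code, just the arithmetic of the proposal.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model of the two checks (not kernel code).
 * Returns the number of bytes that may be copied, or -EINVAL.
 */
int64_t clamp_copy_len(int64_t pos_in, int64_t len, int64_t size)
{
	assert(size >= 0);

	if (pos_in < 0 || len < 0)
		return -EINVAL;

	/* check 1: a start offset at or beyond EOF is an error */
	if (pos_in >= size)
		return -EINVAL;

	/* check 2: a range that crosses EOF is shortened, not rejected */
	if (len > size - pos_in)
		len = size - pos_in;

	return len;
}
```

With a 100 byte source, a request for 20 bytes at offset 90 is
shortened to 10, while any request at offset 100 or later fails.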

>
> > > i_size_read()...
> > >
> > > Otherwise
> > > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> >
> > Looks like this doesn't even compile?
>
> It's fixed in a later patch that consolidates the checks into a
> generic check function, but I'm not sure why my "compile every
> patch" script didn't catch this.
>
> Cheers,
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
  2018-12-04 15:43   ` Christoph Hellwig
@ 2018-12-04 22:18     ` Dave Chinner
  2018-12-04 23:33       ` Olga Kornievskaia
  2018-12-05 14:09       ` Christoph Hellwig
  0 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-04 22:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

On Tue, Dec 04, 2018 at 07:43:47AM -0800, Christoph Hellwig wrote:
> Well, this isn't bugfixes anymore, but adding new features..

I made that perfectly clear in the cover description. I called it
out twice, one of them explicitly stating that this series made these
infrastructure changes because we have pending functionality that
depends on cross-device copies being supported in a sane manner.

I'll drop it if you want, but then I'll just have to come back after
all the NFS code is merged and do yet more cleanup work.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2018-12-04 21:47         ` Olga Kornievskaia
@ 2018-12-04 22:31           ` Dave Chinner
  2018-12-05 16:51             ` bfields
  2019-05-20  9:10             ` Amir Goldstein
  0 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-04 22:31 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Christoph Hellwig, Amir Goldstein, linux-fsdevel, linux-xfs,
	linux-nfs, linux-unionfs, ceph-devel, linux-cifs

On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > >
> > > > > The man page says:
> > > > >
> > > > > EINVAL Requested range extends beyond the end of the source file
> > > > >
> > > > > But the current behaviour is that copy_file_range does a short
> > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > the behaviour described in the man page.
> > >
> > > I think the behavior implemented is a lot more useful than the one
> > > documented..
> >
> > The current behaviour is really nasty. Because copy_file_range() can
> > return short copies, the caller has to implement a loop to ensure
> > > the range they want gets copied.  When the source range you are
> > trying to copy overlaps source EOF, this loop:
> >
> >         while (len > 0) {
> >                 ret = copy_file_range(... len ...)
> >                 ...
> >                 off_in += ret;
> >                 off_out += ret;
> >                 len -= ret;
> >         }
> >
> > Currently the fallback code copies up to the end of the source file
> > on the first copy and then fails the second copy with EINVAL because
> > the source range is now completely beyond EOF.
> >
> > So, from an application perspective, did the copy succeed or did it
> > fail?
> >
> > Existing tools that exercise copy_file_range (like xfs_io) consider
> > this a failure, because the second copy_file_range() call returns
> > EINVAL and not some "there is no more to copy" marker like read()
> > returning 0 bytes when attempting to read beyond EOF.
> >
> > IOWs, we cannot tell the difference between a real error and a short
> > copy because the input range spans EOF and it was silently
> > shortened. That's the API problem we need to fix here - the existing
> > behaviour is really crappy for applications. Erroring out
> > > immediately is one solution, and it's what the man page says should
> > happen so that is what I implemented.
> >
> > Realistically, though, I think an attempt to read beyond EOF for the
> > copy should result in behaviour like read() (i.e. return 0 bytes),
> > not EINVAL. The existing behaviour needs to change, though.
> 
> There are two checks to consider:
> 1. pos_in >= EOF should return EINVAL.
> 2. What could perhaps be relaxed is the pos_in+len >= EOF case, which
> should return a short copy.
> 
> Having check #1 enforced allows us to differentiate between a real
> error and a short copy.

That's what the code does right now and *exactly what I'm trying to
fix* because EINVAL is ambiguous and not an indicator that we've
reached the end of the source file. EINVAL can indicate several
different errors, so it really has to be treated as a "copy failed"
error by applications.

Have a look at read/pread() - they return 0 in this case to indicate
a short read, and the value of zero is explicitly defined as meaning
"read position is beyond EOF".  Applications know straight away that
there is no more data to be read and there was no error, so can
terminate on a successful short read.

We need to allow applications to terminate copy loops on a
successful short copy. IOWs, applications need to either:

	- get an immediate error saying the range is invalid rather
	  than doing a short copy (as per the man page); or
	- have an explicit marker to say "no more data to be copied"

Applications need the "no more data to copy" case to be explicit and
unambiguous so they can make sane decisions about whether a short
copy was successful because the file was shorter than expected or
whether a short copy was a result of a real error being encountered.
The current behaviour is largely unusable for applications because
they have to guess at the reason for EINVAL part way through a
copy....
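
For illustration, this is roughly what a caller's loop looks like if
copy_file_range() returns 0 at source EOF, the way read() does. It is a
sketch under that assumption, not a description of the behaviour
discussed above; copy_range() and copy_head() are names made up for
this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy up to len bytes using the descriptors' own file offsets.
 * Returns the bytes copied (less than len only if the source hit EOF),
 * or -1 with errno set on a real error.
 */
ssize_t copy_range(int fd_in, int fd_out, size_t len)
{
	size_t done = 0;

	assert(fd_in >= 0 && fd_out >= 0);

	while (done < len) {
		ssize_t ret = copy_file_range(fd_in, NULL, fd_out, NULL,
					      len - done, 0);
		if (ret < 0)
			return -1;	/* real error, errno says why */
		if (ret == 0)
			break;		/* source EOF: successful short copy */
		done += (size_t)ret;
	}
	return (ssize_t)done;
}

/* Hypothetical wrapper: copy the first len bytes of src into dst. */
ssize_t copy_head(const char *src, const char *dst, size_t len)
{
	ssize_t ret;
	int in, out;

	in = open(src, O_RDONLY);
	if (in < 0)
		return -1;
	out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}
	ret = copy_range(in, out, len);
	close(in);
	close(out);
	return ret;
}
```

The point is that the loop needs no guessing: a negative return is
always an error, and a zero return is always "no more data".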

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
  2018-12-04 22:18     ` Dave Chinner
@ 2018-12-04 23:33       ` Olga Kornievskaia
  2018-12-05 14:09       ` Christoph Hellwig
  1 sibling, 0 replies; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-04 23:33 UTC (permalink / raw)
  To: david
  Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

On Tue, Dec 4, 2018 at 5:18 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 04, 2018 at 07:43:47AM -0800, Christoph Hellwig wrote:
> > Well, this isn't bugfixes anymore, but adding new features..
>
> I made that perfectly clear in the cover description. I called it
> out twice, one of them explicitly stating that this series made these
> infrastructure changes because we have pending functionality that
> depends on cross-device copies being supported in a sane manner.
>
> I'll drop it if you want, but then I'll just have to come back after
> all the NFS code is merged and do yet more cleanup work.

This doesn't need to be fixed in this patch series. I think Anna was
pointing it out to me as something to take a look at.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
  2018-12-04 22:18     ` Dave Chinner
  2018-12-04 23:33       ` Olga Kornievskaia
@ 2018-12-05 14:09       ` Christoph Hellwig
  2018-12-05 17:01         ` Olga Kornievskaia
  1 sibling, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-05 14:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, olga.kornievskaia,
	linux-nfs, linux-unionfs, ceph-devel, linux-cifs

On Wed, Dec 05, 2018 at 09:18:47AM +1100, Dave Chinner wrote:
> I'll drop it if you want, but then I'll just have to come back after
> all the NFS code is merged and do yet more cleanup work.

IFF we want these NFS "features" we'll have to get it right before
merging the code.  But even with that I'd rather fix the glaring
issues you are fixing in your first patches as a priority before
adding more features.  In other words:  don't worry about NFS, let's
get the existing code right before worrying about the next round
of potential issues.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2018-12-04 21:29       ` Dave Chinner
  2018-12-04 21:47         ` Olga Kornievskaia
@ 2018-12-05 14:12         ` Christoph Hellwig
  2018-12-05 21:08           ` Dave Chinner
  1 sibling, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-05 14:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Amir Goldstein, linux-fsdevel, linux-xfs,
	Olga Kornievskaia, Linux NFS Mailing List, overlayfs, ceph-devel,
	linux-cifs

> Realistically, though, I think an attempt to read beyond EOF for the
> copy should result in behaviour like read() (i.e. return 0 bytes),
> not EINVAL. The existing behaviour needs to change, though.

I agree with this statement.  So we don't we implement these semantics?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2018-12-04 22:31           ` Dave Chinner
@ 2018-12-05 16:51             ` bfields
  2019-05-20  9:10             ` Amir Goldstein
  1 sibling, 0 replies; 83+ messages in thread
From: bfields @ 2018-12-05 16:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Olga Kornievskaia, Christoph Hellwig, Amir Goldstein,
	linux-fsdevel, linux-xfs, linux-nfs, linux-unionfs, ceph-devel,
	linux-cifs

On Wed, Dec 05, 2018 at 09:31:02AM +1100, Dave Chinner wrote:
> That's what the code does right now and *exactly what I'm trying to
> fix* because EINVAL is ambiguous and not an indicator that we've
> reached the end of the source file. EINVAL can indicate several
> different errors, so it really has to be treated as a "copy failed"
> error by applications.
> 
> Have a look at read/pread() - they return 0 in this case to indicate
> a short read, and the value of zero is explicitly defined as meaning
> "read position is beyond EOF".  Applications know straight away that
> there is no more data to be read and there was no error, so can
> terminate on a successful short read.
> 
> We need to allow applications to terminate copy loops on a
> successful short copy.

I'm a little confused by your definition of "short copy" and "short
read".  Are you using that to mean a copy/read that returns zero?  I
usually see it used to mean any successful call that returned less than
the requested amount.  I'd expect a zero return to terminate a copy
loop, but not any positive return.
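
To make the distinction concrete with plain read(): a positive return
smaller than the request is a short read and the loop keeps going;
only a return of zero ends it. A minimal sketch, with read_full()
being a name made up for this example:

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes into buf.
 * Returns the bytes read (less than len only at EOF), or -1 on error.
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
	size_t done = 0;

	assert(fd >= 0);

	while (done < len) {
		ssize_t ret = read(fd, (char *)buf + done, len - done);

		if (ret < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -1;		/* real error */
		}
		if (ret == 0)
			break;			/* EOF terminates the loop */
		done += (size_t)ret;		/* short read: keep going */
	}
	return (ssize_t)done;
}
```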

--b.


> IOWs, applications need to either:
> 
> 	- get an immediate error saying the range is invalid rather
> 	  than doing a short copy (as per the man page); or
> 	- have an explicit marker to say "no more data to be copied"
> 
> Applications need the "no more data to copy" case to be explicit and
> unambiguous so they can make sane decisions about whether a short
> copy was successful because the file was shorter than expected or
> whether a short copy was a result of a real error being encountered.
> The current behaviour is largely unusable for applications because
> they have to guess at the reason for EINVAL part way through a
> copy....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
  2018-12-05 14:09       ` Christoph Hellwig
@ 2018-12-05 17:01         ` Olga Kornievskaia
  0 siblings, 0 replies; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-05 17:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: david, linux-fsdevel, linux-xfs, linux-nfs, linux-unionfs,
	ceph-devel, linux-cifs

On Wed, Dec 5, 2018 at 9:09 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Dec 05, 2018 at 09:18:47AM +1100, Dave Chinner wrote:
> > I'll drop it if you want, but then I'll just have to come back after
> > all the NFS code is merged and do yet more cleanup work.
>
> IFF we want these NFS "features" we'll have to get it right before
> merging the code.  But even with that I'd rather fix the glaring
> issues you are fixing in your first patches as a priority before
> adding more features.  In other words:  don't worry about NFS, let's
> get the existing code right before worrying about the next round
> of potential issues.

Dave,

Do you mind, in v2, removing the 'retry, ret=EAGAIN' piece and leaving
the call to nfs42_copy_file_range() (with the superblock check)?
If not, I could provide the patch.

This is a piece of code that got in as part of the async copy
patches and it was meant for the upcoming server-to-server series.
This code will go right back in with the next series. But since this
dead piece of code is glaringly wrong currently, by all means let's
fix it.

Thank you.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 05/11] vfs: use inode_permission in copy_file_range()
  2018-12-03 23:55     ` Dave Chinner
@ 2018-12-05 17:28       ` bfields
  0 siblings, 0 replies; 83+ messages in thread
From: bfields @ 2018-12-05 17:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, olga.kornievskaia,
	linux-nfs, linux-unionfs, ceph-devel, linux-cifs

On Tue, Dec 04, 2018 at 10:55:17AM +1100, Dave Chinner wrote:
> On Mon, Dec 03, 2018 at 10:18:03AM -0800, Darrick J. Wong wrote:
> > On Mon, Dec 03, 2018 at 07:34:10PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Similar to FI_DEDUPERANGE, make copy_file_range() check that we have
> > 
> > TLDR: No, it's not similar to FIDEDUPERANGE -- the use of
> > inode_permission() in allow_file_dedupe() is to enable callers to dedupe
> > into a file for which the caller has write permissions but opened the
> > file O_RDONLY.
> 
> What a grotty, nasty hack.
> 
> > [Please keep reading...]
> > 
> > > write permissions to the destination inode.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  mm/filemap.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 0a170425935b..876df5275514 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -3013,6 +3013,11 @@ int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > >  	    (file_out->f_flags & O_APPEND))
> > >  		return -EBADF;
> > >  
> > > +	/* may sure we really are allowed to write to the destination inode */
> > > +	ret = inode_permission(inode_out, MAY_WRITE);
> > 
> > What's the difference between security_file_permission and
> > inode_permission, and when do we call them for a regular
> > open-write-close sequence?  Hmmm, let me take a look:
> .....
> > We also cannot dedupe into a file that becomes immutable after we open
> > it for write, but we can dedupe into a file that loses its write
> > permissions after we open it.
> 
> It's more nuanced than that - dedupe will proceed after write
> permissions have been removed only if you are root or own the file,
> otherwise it will fail.
> 
> Updated summary:
> 
> > op:		after +immutable?	after chmod a-w?
> > write		yes			yes
> > clonerange	no			yes
> > dedupe	no			maybe
> > newcopyrange	no			no
> >
> > My reaction: I don't think that writes should be allowed after an
> > administrator marks a file immutable (but that's a separate issue) but I
> > do think we should be consistent in allowing copying into a file that
> > has lost its write permissions after we opened the file for write, like
> > we do for write() and the remap ioct....
> 
> If we want to allow copying to files we don't actually have
> permission to write to anymore, then I'll remove this from the test,
> the man page and the code. But, quite frankly, I don't trust remote
> server side copies to follow the same permission models as the
> client side OS, so I think we have to treat copy_file_range
> differently to a normal write syscall....

The NFS COPY command takes references to the protocol's equivalent of
open files, and I'd expect permission checks should depend on the open
mode, not the current file permissions.

But server behavior may vary.  I'm not sure that's a good guide for what
to do locally.

In general I'm more comfortable the closer copy is to read & write.
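
For comparison, this is what plain write() does on a local filesystem:
the permission check happens at open time, so a descriptor opened for
writing keeps working after the write bits are removed. A minimal
sketch; write_after_chmod() is a made-up name and the path is whatever
the caller supplies.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open path for writing, drop all write permission bits, then write.
 * Returns the result of the write() call (bytes written, or -1).
 */
ssize_t write_after_chmod(const char *path)
{
	ssize_t ret;
	int fd;

	assert(path);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	if (fchmod(fd, 0400) < 0) {	/* same effect as chmod a-w */
		close(fd);
		return -1;
	}

	ret = write(fd, "x", 1);	/* checked against the open mode */
	close(fd);
	return ret;
}
```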

--b.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2018-12-05 14:12         ` Christoph Hellwig
@ 2018-12-05 21:08           ` Dave Chinner
  2018-12-05 21:30             ` Christoph Hellwig
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2018-12-05 21:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Amir Goldstein, linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs

On Wed, Dec 05, 2018 at 06:12:52AM -0800, Christoph Hellwig wrote:
> > Realistically, though, I think an attempt to read beyond EOF for the
> > copy should result in behaviour like read() (i.e. return 0 bytes),
> > not EINVAL. The existing behaviour needs to change, though.
> 
> I agree with this statement.  So we don't we implement these semantics?

No, we don't.

I will rework the patch series to make attempts to copy beyond the
end of the source file return 0 to indicate that there is no more
data to copy rather than return an error.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2018-12-05 21:08           ` Dave Chinner
@ 2018-12-05 21:30             ` Christoph Hellwig
  0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-05 21:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Amir Goldstein, linux-fsdevel, linux-xfs,
	Olga Kornievskaia, Linux NFS Mailing List, overlayfs, ceph-devel,
	linux-cifs

On Thu, Dec 06, 2018 at 08:08:24AM +1100, Dave Chinner wrote:
> On Wed, Dec 05, 2018 at 06:12:52AM -0800, Christoph Hellwig wrote:
> > > Realistically, though, I think an attempt to read beyond EOF for the
> > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > not EINVAL. The existing behaviour needs to change, though.
> > 
> > I agree with this statement.  So we don't we implement these semantics?
> 
> No, we don't.

Sorry - I was rushing that sentence out.  It should have been:

So why don't we implement these semantics?

> 
> I will rework the patch series to make attempts to copy beyond the
> end of the source file return 0 to indicate that there is no more
> data to copy rather than return an error.

Great, thanks!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/11] vfs: no fallback for ->copy_file_range
  2018-12-03 23:02     ` Dave Chinner
@ 2018-12-06  4:16       ` Amir Goldstein
  2018-12-06 21:30         ` Dave Chinner
  0 siblings, 1 reply; 83+ messages in thread
From: Amir Goldstein @ 2018-12-06  4:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs,
	Miklos Szeredi

On Tue, Dec 4, 2018 at 1:02 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Dec 03, 2018 at 12:22:21PM +0200, Amir Goldstein wrote:
> > On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > Now that we have generic_copy_file_range(), remove it as a fallback
> > > case when offloads fail. This puts the responsibility for executing
> > > fallbacks on the filesystems that implement ->copy_file_range and
> > > allows us to add operational validity checks to
> > > generic_copy_file_range().
> > >
> > > Rework vfs_copy_file_range() to call a new do_copy_file_range()
> > > > helper to execute the copying callout, and move calls to
> > > > generic_copy_file_range() into filesystem methods where they
> > > currently return failures.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> >
> > You may add
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> >
> > After fixing the overlayfs issue below.
> > ...
> >
> > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > index 84dd957efa24..68736e5d6a56 100644
> > > --- a/fs/overlayfs/file.c
> > > +++ b/fs/overlayfs/file.c
> > > @@ -486,8 +486,15 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
> > >                                    struct file *file_out, loff_t pos_out,
> > >                                    size_t len, unsigned int flags)
> > >  {
> > > -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> > > +       ssize_t ret;
> > > +
> > > +       ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> > >                             OVL_COPY);
> > > +
> > > +       if (ret == -EOPNOTSUPP)
> > > +               ret = generic_copy_file_range(file_in, pos_in, file_out,
> > > +                                       pos_out, len, flags);
> > > +       return ret;
> > >  }
> > >
> >
> > This is unneeded, because ovl_copyfile(OVL_COPY) is implemented
> > by calling vfs_copy_file_range() (on the underlying files) and it is
> > not possible
> > to get EOPNOTSUPP from vfs_copy_file_range().
>
> Except that it is possible. e.g. If the underlying filesystem tries
> a copy offload, gets a "not supported" failure from the remote
> server and then doesn't implement a fallback.
>

I'm of the opinion that ovl_copy_file_range() and do_copy_file_range()
are alike. If you choose not to fall back in the latter to
generic_copy_file_range() for a misbehaving filesystem and WARN_ON
this case, there is no reason for overlayfs to cover up for the
misbehaving underlying filesystem.

If you want to cover up for a misbehaving filesystem, please do it
in do_copy_file_range() and drop the WARN_ON_ONCE().
Come to think of it, I understand your reasoning for pushing
generic_copy_file_range() down to filesystems so they can fall back to
it in several error conditions.
I do not follow the reasoning of NOT falling back to
generic_copy_file_range() in vfs if EOPNOTSUPP is returned from the
filesystem. IOW, if we want to cover up for a misbehaving filesystem,
this would have been more robust code:

+static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
+                           struct file *file_out, loff_t pos_out,
+                           size_t len, unsigned int flags)
+{
+       ssize_t ret;
+
+       if (file_out->f_op->copy_file_range) {
+               ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+                                                     pos_out, len, flags);
+               if (!WARN_ON_ONCE(ret == -EOPNOTSUPP))
+                       return ret;
+       }
+       return generic_copy_file_range(file_in, pos_in, file_out, pos_out,
+                                       len, flags);
+}
+

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/11] vfs: no fallback for ->copy_file_range
  2018-12-06  4:16       ` Amir Goldstein
@ 2018-12-06 21:30         ` Dave Chinner
  2018-12-07  5:38           ` Amir Goldstein
  0 siblings, 1 reply; 83+ messages in thread
From: Dave Chinner @ 2018-12-06 21:30 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs,
	Miklos Szeredi

On Thu, Dec 06, 2018 at 06:16:46AM +0200, Amir Goldstein wrote:
> On Tue, Dec 4, 2018 at 1:02 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Dec 03, 2018 at 12:22:21PM +0200, Amir Goldstein wrote:
> > > On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > Now that we have generic_copy_file_range(), remove it as a fallback
> > > > case when offloads fail. This puts the responsibility for executing
> > > > fallbacks on the filesystems that implement ->copy_file_range and
> > > > allows us to add operational validity checks to
> > > > generic_copy_file_range().
> > > >
> > > > Rework vfs_copy_file_range() to call a new do_copy_file_range()
> > > > > helper to execute the copying callout, and move calls to
> > > > > generic_copy_file_range() into filesystem methods where they
> > > > currently return failures.
> > > >
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > >
> > > You may add
> > > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > After fixing the overlayfs issue below.
> > > ...
> > >
> > > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > > index 84dd957efa24..68736e5d6a56 100644
> > > > --- a/fs/overlayfs/file.c
> > > > +++ b/fs/overlayfs/file.c
> > > > @@ -486,8 +486,15 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
> > > >                                    struct file *file_out, loff_t pos_out,
> > > >                                    size_t len, unsigned int flags)
> > > >  {
> > > > -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> > > > +       ssize_t ret;
> > > > +
> > > > +       ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> > > >                             OVL_COPY);
> > > > +
> > > > +       if (ret == -EOPNOTSUPP)
> > > > +               ret = generic_copy_file_range(file_in, pos_in, file_out,
> > > > +                                       pos_out, len, flags);
> > > > +       return ret;
> > > >  }
> > > >
> > >
> > > This is unneeded, because ovl_copyfile(OVL_COPY) is implemented
> > > by calling vfs_copy_file_range() (on the underlying files) and it is
> > > not possible
> > > to get EOPNOTSUPP from vfs_copy_file_range().
> >
> > Except that it is possible. e.g. If the underlying filesystem tries
> > a copy offload, gets a "not supported" failure from the remote
> > server and then doesn't implement a fallback.
> >
> 
> I'm of the opinion that ovl_copy_file_range() and do_copy_file_range()
> are alike. If you choose not to fall back in the latter to
> generic_copy_file_range() for a misbehaving filesystem and WARN_ON
> this case, there is no reason for overlayfs to cover up for the
> misbehaving underlying filesystem.
> 
> If you want to cover up for a misbehaving filesystem, please do it
> in do_copy_file_range() and drop the WARN_ON_ONCE().
> Come to think of it, I understand your reasoning for pushing
> generic_copy_file_range() down to filesystems so they can fall back to
> it in several error conditions.
> I do not follow the reasoning of NOT falling back to
> generic_copy_file_range() in vfs if EOPNOTSUPP is returned from the
> filesystem. IOW, if we want to cover up for a misbehaving filesystem,
> this would have been more robust code:

Since when have we defined a filesystem returning -EOPNOTSUPP as a
"misbehaving filesystem"? Userspace has to handle errors in
copy_file_range() with its own fallback copy code (i.e. it cannot
rely on the kernel actually supporting copy_file_range at all).
Hence it's perfectly fine for a filesystem implementation to encode
"offload or fail entirely" semantics if they want.

Yes, I've been shouted at by developers quite recently who
*demanded* that copy_file_range (and other offloads like
fallocate(ZERO_RANGE)) *fail* if they cannot "offload" the operation
to make it "fast". The application developers want to use different
algorithms if the kernel offload isn't any faster than userspace
doing the dumb thing and physically pushing bytes around itself.

I've pushed back on this as much as I can, but it doesn't change the
fact that for many situations doing do_splice_direct() is exactly
the wrong thing to do (e.g. because copy_file_range() on a TB+ scale
file couldn't be offloaded by the filesystem because the server said
EOPNOTSUPP)

IOWs, there are some filesystems or situations where it makes sense to
have fail-fast semantics and leave the decision of what to do next
in the hands of the userspace application that has the context
necessary to determine what the best action to take is.  And to do
that, we need to give control of the fallback to the filesystems.

Flexibility is what is needed here, not a dumb, hard coded "the VFS
always knows what's right for you" policy that triggers when nobody
really wants it to.
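
As a sketch of what that looks like from userspace: try the offload,
and if the kernel or filesystem refuses it, the application decides
what to do next, here by falling back to a plain read()/write() loop.
The helper names and the set of errno values treated as "no offload
available" are assumptions made for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Dumb fallback: push the bytes through a userspace buffer. */
ssize_t copy_by_hand(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;

	assert(fd_in >= 0 && fd_out >= 0);

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = read(fd_in, buf, want);
		ssize_t off = 0;

		if (n < 0)
			return -1;
		if (n == 0)
			break;			/* source EOF */
		while (off < n) {
			ssize_t w = write(fd_out, buf + off, (size_t)(n - off));

			if (w < 0)
				return -1;
			off += w;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Try the offload first; the application owns the fallback policy. */
ssize_t copy_try_offload(int fd_in, int fd_out, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t ret = copy_file_range(fd_in, NULL, fd_out, NULL,
					      len - done, 0);
		if (ret > 0) {
			done += (size_t)ret;
			continue;
		}
		if (ret == 0)
			break;			/* nothing more to copy */
		/* Assumed set of "offload not available" errors. */
		if (errno == EOPNOTSUPP || errno == ENOSYS || errno == EXDEV) {
			ssize_t rest = copy_by_hand(fd_in, fd_out, len - done);

			return rest < 0 ? -1 : (ssize_t)done + rest;
		}
		return -1;			/* real error */
	}
	return (ssize_t)done;
}

/* Path-based wrapper so the sketch can be exercised on its own. */
ssize_t copy_path(const char *src, const char *dst, size_t len)
{
	ssize_t ret;
	int in, out;

	in = open(src, O_RDONLY);
	if (in < 0)
		return -1;
	out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}
	ret = copy_try_offload(in, out, len);
	close(in);
	close(out);
	return ret;
}
```

An application that wants fail-fast behaviour would simply return the
error instead of calling copy_by_hand() and pick a different
algorithm.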

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/11] vfs: no fallback for ->copy_file_range
  2018-12-06 21:30         ` Dave Chinner
@ 2018-12-07  5:38           ` Amir Goldstein
  0 siblings, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-07  5:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs,
	Miklos Szeredi

On Thu, Dec 6, 2018 at 11:31 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Dec 06, 2018 at 06:16:46AM +0200, Amir Goldstein wrote:
> > On Tue, Dec 4, 2018 at 1:02 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, Dec 03, 2018 at 12:22:21PM +0200, Amir Goldstein wrote:
> > > > On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > >
> > > > > Now that we have generic_copy_file_range(), remove it as a fallback
> > > > > case when offloads fail. This puts the responsibility for executing
> > > > > fallbacks on the filesystems that implement ->copy_file_range and
> > > > > allows us to add operational validity checks to
> > > > > generic_copy_file_range().
> > > > >
> > > > > Rework vfs_copy_file_range() to call a new do_copy_file_range()
> > > > > > helper to execute the copying callout, and move calls to
> > > > > > generic_copy_file_range() into filesystem methods where they
> > > > > currently return failures.
> > > > >
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > You may add
> > > > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> > > >
> > > > After fixing the overlayfs issue below.
> > > > ...
> > > >
> > > > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > > > index 84dd957efa24..68736e5d6a56 100644
> > > > > --- a/fs/overlayfs/file.c
> > > > > +++ b/fs/overlayfs/file.c
> > > > > @@ -486,8 +486,15 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > >                                    struct file *file_out, loff_t pos_out,
> > > > >                                    size_t len, unsigned int flags)
> > > > >  {
> > > > > -       return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> > > > > +       ssize_t ret;
> > > > > +
> > > > > +       ret =  ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
> > > > >                             OVL_COPY);
> > > > > +
> > > > > +       if (ret == -EOPNOTSUPP)
> > > > > +               ret = generic_copy_file_range(file_in, pos_in, file_out,
> > > > > +                                       pos_out, len, flags);
> > > > > +       return ret;
> > > > >  }
> > > > >
> > > >
> > > > This is unneeded, because ovl_copyfile(OVL_COPY) is implemented
> > > > by calling vfs_copy_file_range() (on the underlying files) and it is
> > > > not possible
> > > > to get EOPNOTSUPP from vfs_copy_file_range().
> > >
> > > Except that it is possible. e.g. If the underlying filesystem tries
> > > a copy offload, gets a "not supported" failure from the remote
> > > server and then doesn't implement a fallback.
> > >
> >
> > I'm of the opinion that ovl_copy_file_range() and do_copy_file_range()
> > are alike. If you choose not to fall back in the latter to
> > generic_copy_file_range() for misbehaving filesystem and WARN_ON
> > this case, there is no reason for overlayfs to cover up for the
> > misbehaving underlying filesystem.
> >
> > If you want to cover up for misbehaving filesystem, please do it
> > in do_copy_file_range() and drop the WARN_ON_ONCE().
> > Come to think about it, I understand your reasoning for pushing
> > generic_copy_file_range() down to filesystems so they can fallback to
> > it in several error conditions.
> > I do not follow the reasoning of NOT falling back to
> > generic_copy_file_range() in vfs if EOPNOTSUPP is returned from
> > filesystem. IOW, if we want to cover up for misbehaving filesystem,
> > this would have been a more robust code:
>
> Since when have we defined a filesystem returning -EOPNOTSUPP as a
> "misbehaving filesystem"?

Since you wrote:

WARN_ON_ONCE(ret == -EOPNOTSUPP);

If a filesystem is allowed to return EOPNOTSUPP from ->copy_file_range()
then what is this warning about?

> Userspace has to handle errors in
> copy_file_range() with its own fallback copy code (i.e. it cannot
> rely on the kernel actually supporting copy_file_range at all).
> Hence it's perfectly fine for a filesystem implementation to encode
> "offload or fail entirely" semantics if they want.
>
> Yes, I've been shouted at by developers quite recently who
> *demanded* that copy_file_range (and other offloads like
> fallocate(ZERO_RANGE)) *fail* if they cannot "offload" the operation
> to make it "fast". The application developers want to use different
> algorithms if the kernel offload isn't any faster than userspace
> doing the dumb thing and physically pushing bytes around itself.
>
> I've pushed back on this as much as I can, but it doesn't change the
> fact that for many situations doing do_splice_direct() is exactly
> the wrong thing to do (e.g. because copy_file_range() on a TB+ scale
> file couldn't be offloaded by the filesystem because the server said
> EOPNOTSUPP)
>
> IOWs, for some filesystems or situations it makes sense to
> have fail-fast semantics and leave the decision of what to do next
> in the hands of the userspace application that has the context
> necessary to determine what the best action to take is.  And to do
> that, we need to give control of the fallback to the filesystems.
>
> Flexibility is what is needed here, not a dumb, hard coded "the VFS
> always knows what's right for you" policy that triggers when nobody
> really wants it to.
>

You misunderstood me.
Please remove the fallback to generic_copy_file_range() in
ovl_copy_file_range() as I requested in initial review for the exact
same reasons that you list above.

The overlayfs implementation of ovl_copy_file_range() is just
handing over the call to the underlying vfs_copy_file_range().
If the latter is expected to return EOPNOTSUPP, so is the
overlayfs implementation.
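
To make the userspace side of this concrete, here is a minimal sketch of
an application that treats "offload declined" as a cue to copy by hand.
It is only an illustration under stated assumptions: the helper names
(offload_declined, copy_by_hand, copy_with_fallback) are made up, the set
of errno values treated as "declined" is my own choice, and the raw
syscall() is used only so the sketch does not depend on the glibc wrapper.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: these errors mean "no offload here", not "the copy failed". */
static bool offload_declined(int err)
{
	return err == EOPNOTSUPP || err == EXDEV || err == ENOSYS;
}

/* Plain pread()/pwrite() copy; returns bytes copied, or -1 with errno set. */
static ssize_t copy_by_hand(int fd_in, off_t off_in, int fd_out,
			    off_t off_out, size_t len)
{
	char buf[65536];
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = pread(fd_in, buf, want, off_in + (off_t)done);

		if (n < 0)
			return -1;
		if (n == 0)
			break;		/* source EOF: short copy, not an error */
		/* pwrite() may be short as well, so drain the buffer. */
		for (ssize_t w = 0; w < n; ) {
			ssize_t m = pwrite(fd_out, buf + w, (size_t)(n - w),
					   off_out + (off_t)done + w);

			if (m <= 0)
				return -1;
			w += m;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Copy up to len bytes from offset 0 of fd_in to offset 0 of fd_out. */
static ssize_t copy_with_fallback(int fd_in, int fd_out, size_t len)
{
	long long off_in = 0, off_out = 0;	/* loff_t as the kernel sees it */
	size_t done = 0;

	while (done < len) {
		long n = syscall(SYS_copy_file_range, fd_in, &off_in,
				 fd_out, &off_out, len - done, 0U);

		if (n < 0 && offload_declined(errno)) {
			/* Continue by hand from wherever the offload stopped. */
			ssize_t rest = copy_by_hand(fd_in, (off_t)off_in, fd_out,
						    (off_t)off_out, len - done);

			return rest < 0 ? -1 : (ssize_t)done + rest;
		}
		if (n < 0)
			return -1;	/* any other error is reported as-is */
		if (n == 0)
			break;		/* nothing left in the source */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

An application that prefers fail-fast semantics would simply skip the
copy_by_hand() branch and surface the error to its caller.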

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-03  8:34 ` [PATCH 04/11] vfs: add missing checks to copy_file_range Dave Chinner
                     ` (3 preceding siblings ...)
  2018-12-04 15:18   ` Christoph Hellwig
@ 2018-12-12 11:31   ` Luis Henriques
  2018-12-12 16:42     ` Darrick J. Wong
  2018-12-12 18:55     ` Olga Kornievskaia
  4 siblings, 2 replies; 83+ messages in thread
From: Luis Henriques @ 2018-12-12 11:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, olga.kornievskaia,
	linux-nfs, linux-unionfs, ceph-devel, linux-cifs

Dave Chinner <david@fromorbit.com> writes:

<snip>

> +int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> +			 struct file *file_out, loff_t pos_out,
> +			 size_t *req_count, unsigned int flags)
> +{

<snip>

> +	/* Don't allow overlapped copying within the same file. */
> +	if (inode_in == inode_out &&
> +	    pos_out + count > pos_in &&
> +	    pos_out < pos_in + count)
> +		return -EINVAL;

I was wondering if, with the above check, it would make sense to also
have an extra patch changing some filesystems (ceph, nfs and cifs) to
simply return -EOPNOTSUPP (instead of -EINVAL) when inode_in ==
inode_out.  Something like the diff below (not tested!).

This caught my attention when I was running the latest generic xfstests
on ceph and realised that I had some new failures due to the recently
added copy_file_range support in fsx by Darrick.  The failures were
caused by the usage of the same fd both as source and destination.
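
For reference, the overlap test quoted above can be restated as a small
userspace function. This is only a sketch: ranges_overlap is a
hypothetical name, the offsets are plain int64_t rather than loff_t, and
overflow of pos + count is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * The half-open ranges [pos_in, pos_in + count) and
 * [pos_out, pos_out + count) overlap iff each one starts before the
 * other one ends. Assumes pos + count does not overflow.
 */
static bool ranges_overlap(int64_t pos_in, int64_t pos_out, int64_t count)
{
	assert(pos_in >= 0 && pos_out >= 0 && count >= 0);
	return pos_out + count > pos_in && pos_out < pos_in + count;
}
```

With this predicate, a same-file copy from offset 0 to offset 10 of
length 10 does not overlap and would pass the generic check, whereas the
per-filesystem inode_in == inode_out tests reject it regardless of the
offsets.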

Cheers,
-- 
Luis


diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 189df668b6a0..c22ac60ec0ba 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1904,7 +1904,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	bool do_final_copy = false;
 
 	if (src_inode == dst_inode)
-		return -EINVAL;
+		return -EOPNOTSUPP;
 	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
 		return -EROFS;
 
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 865706edb307..d4f63eae531e 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1068,7 +1068,7 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
 	cifs_dbg(FYI, "copychunk range\n");
 
 	if (src_inode == target_inode) {
-		rc = -EINVAL;
+		rc = -EOPNOTSUPP;
 		goto out;
 	}
 
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 46d691ba04bc..910a2abade92 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -136,7 +136,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
 	ssize_t ret;
 
 	if (file_inode(file_in) == file_inode(file_out))
-		return -EINVAL;
+		return -EOPNOTSUPP;
 retry:
 	ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
 	if (ret == -EAGAIN)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-12 11:31   ` Luis Henriques
@ 2018-12-12 16:42     ` Darrick J. Wong
  2018-12-12 18:55     ` Olga Kornievskaia
  1 sibling, 0 replies; 83+ messages in thread
From: Darrick J. Wong @ 2018-12-12 16:42 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Dave Chinner, linux-fsdevel, linux-xfs, olga.kornievskaia,
	linux-nfs, linux-unionfs, ceph-devel, linux-cifs

On Wed, Dec 12, 2018 at 11:31:23AM +0000, Luis Henriques wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> <snip>
> 
> > +int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > +			 struct file *file_out, loff_t pos_out,
> > +			 size_t *req_count, unsigned int flags)
> > +{
> 
> <snip>
> 
> > +	/* Don't allow overlapped copying within the same file. */
> > +	if (inode_in == inode_out &&
> > +	    pos_out + count > pos_in &&
> > +	    pos_out < pos_in + count)
> > +		return -EINVAL;
> 
> I was wondering if, with the above check, it would make sense to also
> have an extra patch changing some filesystems (ceph, nfs and cifs) to
> simply return -EOPNOTSUPP (instead of -EINVAL) when inode_in ==
> inode_out.  Something like the diff below (not tested!).
> 
> This caught my attention when I was running the latest generic xfstests
> on ceph and realised that I had some new failures due to the recently
> added copy_file_range support in fsx by Darrick.  The failures were
> caused by the usage of the same fd both as source and destination.

Looks reasonable to /me/, since EOPNOTSUPP currently triggers the
splice fallback.

--D

> Cheers,
> -- 
> Luis
> 
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 189df668b6a0..c22ac60ec0ba 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1904,7 +1904,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
>  	bool do_final_copy = false;
>  
>  	if (src_inode == dst_inode)
> -		return -EINVAL;
> +		return -EOPNOTSUPP;
>  	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>  		return -EROFS;
>  
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index 865706edb307..d4f63eae531e 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1068,7 +1068,7 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
>  	cifs_dbg(FYI, "copychunk range\n");
>  
>  	if (src_inode == target_inode) {
> -		rc = -EINVAL;
> +		rc = -EOPNOTSUPP;
>  		goto out;
>  	}
>  
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index 46d691ba04bc..910a2abade92 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -136,7 +136,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
>  	ssize_t ret;
>  
>  	if (file_inode(file_in) == file_inode(file_out))
> -		return -EINVAL;
> +		return -EOPNOTSUPP;
>  retry:
>  	ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
>  	if (ret == -EAGAIN)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-12 11:31   ` Luis Henriques
  2018-12-12 16:42     ` Darrick J. Wong
@ 2018-12-12 18:55     ` Olga Kornievskaia
  2018-12-12 19:42       ` Matthew Wilcox
  1 sibling, 1 reply; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-12 18:55 UTC (permalink / raw)
  To: lhenriques
  Cc: david, Darrick J. Wong, linux-fsdevel, linux-xfs, linux-nfs,
	linux-unionfs, ceph-devel, linux-cifs

On Wed, Dec 12, 2018 at 6:31 AM Luis Henriques <lhenriques@suse.com> wrote:
>
> Dave Chinner <david@fromorbit.com> writes:
>
> <snip>
>
> > +int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > +                      struct file *file_out, loff_t pos_out,
> > +                      size_t *req_count, unsigned int flags)
> > +{
>
> <snip>
>
> > +     /* Don't allow overlapped copying within the same file. */
> > +     if (inode_in == inode_out &&
> > +         pos_out + count > pos_in &&
> > +         pos_out < pos_in + count)
> > +             return -EINVAL;
>
> I was wondering if, with the above check, it would make sense to also
> have an extra patch changing some filesystems (ceph, nfs and cifs) to
> simply return -EOPNOTSUPP (instead of -EINVAL) when inode_in ==
> inode_out.  Something like the diff below (not tested!).
>
> This caught my attention when I was running the latest generic xfstests
> on ceph and realised that I had some new failures due to the recently
> added copy_file_range support in fsx by Darrick.  The failures were
> caused by the usage of the same fd both as source and destination.
>
> Cheers,
> --
> Luis
>
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 189df668b6a0..c22ac60ec0ba 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1904,7 +1904,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
>         bool do_final_copy = false;
>
>         if (src_inode == dst_inode)
> -               return -EINVAL;
> +               return -EOPNOTSUPP;
>         if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>                 return -EROFS;
>
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index 865706edb307..d4f63eae531e 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1068,7 +1068,7 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
>         cifs_dbg(FYI, "copychunk range\n");
>
>         if (src_inode == target_inode) {
> -               rc = -EINVAL;
> +               rc = -EOPNOTSUPP;
>                 goto out;
>         }
>
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index 46d691ba04bc..910a2abade92 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -136,7 +136,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
>         ssize_t ret;
>
>         if (file_inode(file_in) == file_inode(file_out))
> -               return -EINVAL;
> +               return -EOPNOTSUPP;

Please don't change the NFS bits. This is against the NFS
specifications. RFC 7862 15.2.3

(snippet)
SAVED_FH and CURRENT_FH must be different files.  If SAVED_FH and
   CURRENT_FH refer to the same file, the operation MUST fail with
   NFS4ERR_INVAL.

>  retry:
>         ret = nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
>         if (ret == -EAGAIN)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-12 18:55     ` Olga Kornievskaia
@ 2018-12-12 19:42       ` Matthew Wilcox
  2018-12-12 20:22         ` Olga Kornievskaia
  0 siblings, 1 reply; 83+ messages in thread
From: Matthew Wilcox @ 2018-12-12 19:42 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: lhenriques, david, Darrick J. Wong, linux-fsdevel, linux-xfs,
	linux-nfs, linux-unionfs, ceph-devel, linux-cifs

On Wed, Dec 12, 2018 at 01:55:28PM -0500, Olga Kornievskaia wrote:
> On Wed, Dec 12, 2018 at 6:31 AM Luis Henriques <lhenriques@suse.com> wrote:
> > I was wondering if, with the above check, it would make sense to also
> > have an extra patch changing some filesystems (ceph, nfs and cifs) to
> > simply return -EOPNOTSUPP (instead of -EINVAL) when inode_in ==
> > inode_out.  Something like the diff below (not tested!).

> > +++ b/fs/nfs/nfs4file.c
> > @@ -136,7 +136,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> >         ssize_t ret;
> >
> >         if (file_inode(file_in) == file_inode(file_out))
> > -               return -EINVAL;
> > +               return -EOPNOTSUPP;
> 
> Please don't change the NFS bits. This is against the NFS
> specifications. RFC 7862 15.2.3
> 
> (snippet)
> SAVED_FH and CURRENT_FH must be different files.  If SAVED_FH and
>    CURRENT_FH refer to the same file, the operation MUST fail with
>    NFS4ERR_INVAL.

I don't see how that applies.  That refers to a requirement _in the
protocol_ that determines what the server MUST do if the client sends
it two FHs which refer to the same file.

What we're talking about here is how a Linux filesystem behaves when
receiving a copy_file_range() referring to the same file.  As long as
the Linux filesystem doesn't react by sending out one of these invalid
protocol messages, I don't see the problem.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-12 19:42       ` Matthew Wilcox
@ 2018-12-12 20:22         ` Olga Kornievskaia
  2018-12-13 10:29           ` Luis Henriques
  0 siblings, 1 reply; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-12 20:22 UTC (permalink / raw)
  To: willy
  Cc: lhenriques, david, Darrick J. Wong, linux-fsdevel, linux-xfs,
	linux-nfs, linux-unionfs, ceph-devel, linux-cifs

On Wed, Dec 12, 2018 at 2:43 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Dec 12, 2018 at 01:55:28PM -0500, Olga Kornievskaia wrote:
> > On Wed, Dec 12, 2018 at 6:31 AM Luis Henriques <lhenriques@suse.com> wrote:
> > > I was wondering if, with the above check, it would make sense to also
> > > have an extra patch changing some filesystems (ceph, nfs and cifs) to
> > > simply return -EOPNOTSUPP (instead of -EINVAL) when inode_in ==
> > > inode_out.  Something like the diff below (not tested!).
>
> > > +++ b/fs/nfs/nfs4file.c
> > > @@ -136,7 +136,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> > >         ssize_t ret;
> > >
> > >         if (file_inode(file_in) == file_inode(file_out))
> > > -               return -EINVAL;
> > > +               return -EOPNOTSUPP;
> >
> > Please don't change the NFS bits. This is against the NFS
> > specifications. RFC 7862 15.2.3
> >
> > (snippet)
> > SAVED_FH and CURRENT_FH must be different files.  If SAVED_FH and
> >    CURRENT_FH refer to the same file, the operation MUST fail with
> >    NFS4ERR_INVAL.
>
> I don't see how that applies.  That refers to a requirement _in the
> protocol_ that determines what the server MUST do if the client sends
> it two FHs which refer to the same file.
>
> What we're talking about here is how a Linux filesystem behaves when
> receiving a copy_file_range() referring to the same file.  As long as
> the Linux filesystem doesn't react by sending out one of these invalid
> protocol messages, I don't see the problem.

Ok, then this should be changed to call generic_copy_file_range()
rather than returning EOPNOTSUPP, since there is no longer a fallback
in the vfs that calls generic_copy_file_range(); that is now the
responsibility of each file system.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/11] vfs: add missing checks to copy_file_range
  2018-12-12 20:22         ` Olga Kornievskaia
@ 2018-12-13 10:29           ` Luis Henriques
  0 siblings, 0 replies; 83+ messages in thread
From: Luis Henriques @ 2018-12-13 10:29 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: willy, david, Darrick J. Wong, linux-fsdevel, linux-xfs,
	linux-nfs, linux-unionfs, ceph-devel, linux-cifs

Olga Kornievskaia <olga.kornievskaia@gmail.com> writes:

> On Wed, Dec 12, 2018 at 2:43 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Wed, Dec 12, 2018 at 01:55:28PM -0500, Olga Kornievskaia wrote:
>> > On Wed, Dec 12, 2018 at 6:31 AM Luis Henriques <lhenriques@suse.com> wrote:
>> > > I was wondering if, with the above check, it would make sense to also
>> > > have an extra patch changing some filesystems (ceph, nfs and cifs) to
>> > > simply return -EOPNOTSUPP (instead of -EINVAL) when inode_in ==
>> > > inode_out.  Something like the diff below (not tested!).
>>
>> > > +++ b/fs/nfs/nfs4file.c
>> > > @@ -136,7 +136,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
>> > >         ssize_t ret;
>> > >
>> > >         if (file_inode(file_in) == file_inode(file_out))
>> > > -               return -EINVAL;
>> > > +               return -EOPNOTSUPP;
>> >
>> > Please don't change the NFS bits. This is against the NFS
>> > specifications. RFC 7862 15.2.3
>> >
>> > (snippet)
>> > SAVED_FH and CURRENT_FH must be different files.  If SAVED_FH and
>> >    CURRENT_FH refer to the same file, the operation MUST fail with
>> >    NFS4ERR_INVAL.
>>
>> I don't see how that applies.  That refers to a requirement _in the
>> protocol_ that determines what the server MUST do if the client sends
>> it two FHs which refer to the same file.
>>
>> What we're talking about here is how a Linux filesystem behaves when
>> receiving a copy_file_range() referring to the same file.  As long as
>> the Linux filesystem doesn't react by sending out one of these invalid
>> protocol messages, I don't see the problem.
>
> Ok then this should be changed to call generic_copy_file_range() not
> returning the EOPNOTSUPP since there is no longer fallback in vfs to
> call the generic_copy_file_range() and in turn responsibility of each
> file system.

Ah, I didn't look closely enough and didn't realise the nfs code was
doing something slightly different from the other 2 FSs.  In that case
simply deleting that check seems to be enough to fall back to the vfs
generic_copy_file_range.
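
As a rough userspace model of what "fall back" means here (not kernel
code; every name below is hypothetical), the filesystem method wraps its
offload and only hands the request to the generic copy when the offload
answers -EOPNOTSUPP, in the same shape as the overlayfs hunk earlier in
the thread:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request: only the fields this model needs. */
struct copy_req {
	bool same_inode;	/* source and destination are the same file */
	bool server_offload;	/* remote side can do the copy for us */
	size_t len;
};

/* Stand-in for the generic copy: in this model it always copies everything. */
static long model_generic_copy(const struct copy_req *req)
{
	return (long)req->len;
}

/* Stand-in for an offload that refuses same-inode copies. */
static long model_fs_offload(const struct copy_req *req)
{
	if (req->same_inode || !req->server_offload)
		return -EOPNOTSUPP;
	return (long)req->len;
}

/* Filesystem method: fall back only when the offload says -EOPNOTSUPP. */
static long model_fs_copy_file_range(const struct copy_req *req)
{
	long ret = model_fs_offload(req);

	if (ret == -EOPNOTSUPP)
		ret = model_generic_copy(req);
	return ret;
}
```

In this model a same-inode request is refused by the offload but still
completes through the generic path, which is the behaviour the fsx tests
expect.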

Anyway, please find below an updated patch (with proper changelog).

Cheers,
-- 
Luis

From f66a07e22dc93827bdafc1666d4980edc986bce4 Mon Sep 17 00:00:00 2001
From: Luis Henriques <lhenriques@suse.com>
Date: Thu, 13 Dec 2018 10:19:54 +0000
Subject: [PATCH] vfs: fallback to generic_copy_file_range if copying within
 the same file

If the source and destination inodes are the same, simply fall back to the
VFS generic_copy_file_range, as we've already checked for overlapping areas
in generic_copy_file_checks.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
---
 fs/ceph/file.c    | 2 +-
 fs/cifs/cifsfs.c  | 2 +-
 fs/nfs/nfs4file.c | 3 ---
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index eb876e19c1dc..ff48dc52c30e 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1904,7 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	bool do_final_copy = false;
 
 	if (src_inode == dst_inode)
-		return -EINVAL;
+		return -EOPNOTSUPP;
 	if (src_inode->i_sb != dst_inode->i_sb)
 		return -EXDEV;
 	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 03e4b9eacbd1..3c66454c59b6 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1068,7 +1068,7 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
 	cifs_dbg(FYI, "copychunk range\n");
 
 	if (src_inode == target_inode) {
-		rc = -EINVAL;
+		rc = -EOPNOTSUPP;
 		goto out;
 	}
 
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 4783c0c1c49e..dc7f344849e9 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -135,9 +135,6 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
 {
 	ssize_t ret = -EXDEV;
 
-	if (file_inode(file_in) == file_inode(file_out))
-		return -EINVAL;
-
 	/* only offload copy if superblock is the same */
 	if (file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
 		do {

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2018-12-04 22:31           ` Dave Chinner
  2018-12-05 16:51             ` bfields
@ 2019-05-20  9:10             ` Amir Goldstein
  2019-05-20 13:12               ` Olga Kornievskaia
  1 sibling, 1 reply; 83+ messages in thread
From: Amir Goldstein @ 2019-05-20  9:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Olga Kornievskaia, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-nfs, overlayfs, ceph-devel, CIFS

On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > >
> > > > > > The man page says:
> > > > > >
> > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > >
> > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > the behaviour described in the man page.
> > > >
> > > > I think the behavior implemented is a lot more useful than the one
> > > > documented..
> > >
> > > The current behaviour is really nasty. Because copy_file_range() can
> > > return short copies, the caller has to implement a loop to ensure
> > > > the range they want gets copied.  When the source range you are
> > > trying to copy overlaps source EOF, this loop:
> > >
> > >         while (len > 0) {
> > >                 ret = copy_file_range(... len ...)
> > >                 ...
> > >                 off_in += ret;
> > >                 off_out += ret;
> > >                 len -= ret;
> > >         }
> > >
> > > Currently the fallback code copies up to the end of the source file
> > > on the first copy and then fails the second copy with EINVAL because
> > > the source range is now completely beyond EOF.
> > >
> > > So, from an application perspective, did the copy succeed or did it
> > > fail?
> > >
> > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > this a failure, because the second copy_file_range() call returns
> > > EINVAL and not some "there is no more to copy" marker like read()
> > > returning 0 bytes when attempting to read beyond EOF.
> > >
> > > IOWs, we cannot tell the difference between a real error and a short
> > > copy because the input range spans EOF and it was silently
> > > shortened. That's the API problem we need to fix here - the existing
> > > behaviour is really crappy for applications. Erroring out
> > > > immediately is one solution, and it's what the man page says should
> > > happen so that is what I implemented.
> > >
> > > Realistically, though, I think an attempt to read beyond EOF for the
> > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > not EINVAL. The existing behaviour needs to change, though.
> >
> > There are two checks to consider
> > 1. pos_in >= EOF should return EINVAL
> > > 2. however, what perhaps should be relaxed is that pos_in+len >= EOF
> > should return a short copy.
> >
> > > Having check#1 enforced allows us to differentiate between a real
> > error and a short copy.
>
> That's what the code does right now and *exactly what I'm trying to
> > fix* because EINVAL is ambiguous and not an indicator that we've
> reached the end of the source file. EINVAL can indicate several
> different errors, so it really has to be treated as a "copy failed"
> error by applications.
>
> Have a look at read/pread() - they return 0 in this case to indicate
> a short read, and the value of zero is explicitly defined as meaning
> "read position is beyond EOF".  Applications know straight away that
> there is no more data to be read and there was no error, so can
> terminate on a successful short read.
>
> We need to allow applications to terminate copy loops on a
> successful short copy. IOWs, applications need to either:
>
>         - get an immediate error saying the range is invalid rather
>           than doing a short copy (as per the man page); or
>         - have an explicit marker to say "no more data to be copied"
>
> Applications need the "no more data to copy" case to be explicit and
> unambiguous so they can make sane decisions about whether a short
> copy was successful because the file was shorter than expected or
> whether a short copy was a result of a real error being encountered.
> The current behaviour is largely unusable for applications because
> they have to guess at the reason for EINVAL part way through a
> copy....
>

Dave,

I went ahead and implemented the desired behavior.
However, while testing I observed that the desired behavior is already
the existing behavior. For example, when trying to copy 10 bytes from a
2-byte file, the xfs_io copy loop ends as expected:
copy_file_range(4, [0], 3, [0], 10, 0)  = 2
copy_file_range(4, [2], 3, [2], 8, 0)   = 0

This was tested on ext4 and xfs with reflink on a recent kernel as well as on
v4.20-rc1 (era of the original patch set).

Where and how did you observe the EINVAL behavior described above
(besides the man page, that is)? There are even xfstests (which you modified)
that verify the return-0-past-EOF behavior.

For now, I am just dropping this patch from the patch series.
Let me know if I am missing something.
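
To spell out why the return value at EOF matters to the caller's loop,
here is a small model of that loop. It is only a sketch: the step
function is a hypothetical stand-in for a single copy_file_range() call,
and the two file models below are illustrations, not kernel behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for one copy_file_range() call: returns bytes
 * copied, 0 at or past source EOF, or a negative errno.
 */
typedef long (*copy_step_fn)(long off_in, size_t len, void *ctx);

/* Copy up to len bytes from off_in; stop cleanly when a step returns 0. */
static long copy_loop(copy_step_fn step, void *ctx, long off_in, size_t len)
{
	long total = 0;

	while (len > 0) {
		long ret = step(off_in, len, ctx);

		if (ret < 0)
			return ret;	/* real error: propagate */
		if (ret == 0)
			break;		/* EOF marker, like read() */
		off_in += ret;
		total += ret;
		len -= (size_t)ret;
	}
	return total;
}

/* Model of a source file of *(long *)ctx bytes that returns 0 past EOF. */
static long step_zero_at_eof(long off_in, size_t len, void *ctx)
{
	long size = *(long *)ctx;
	long left = off_in < size ? size - off_in : 0;

	return (long)len < left ? (long)len : left;
}

/* Same file model, but failing with -EINVAL once off_in reaches EOF. */
static long step_einval_at_eof(long off_in, size_t len, void *ctx)
{
	long size = *(long *)ctx;

	if (off_in >= size)
		return -EINVAL;
	return step_zero_at_eof(off_in, len, ctx);
}
```

With a 2-byte source and a 10-byte request, the first model ends the loop
after copying 2 bytes, matching the strace output above, while the second
model makes the same loop report -EINVAL even though 2 bytes were copied.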

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2019-05-20  9:10             ` Amir Goldstein
@ 2019-05-20 13:12               ` Olga Kornievskaia
  2019-05-20 13:36                 ` Amir Goldstein
  0 siblings, 1 reply; 83+ messages in thread
From: Olga Kornievskaia @ 2019-05-20 13:12 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-nfs, overlayfs, ceph-devel, CIFS

On Mon, May 20, 2019 at 5:10 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > >
> > > > > > > The man page says:
> > > > > > >
> > > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > > >
> > > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > > the behaviour described in the man page.
> > > > >
> > > > > I think the behavior implemented is a lot more useful than the one
> > > > > documented..
> > > >
> > > > The current behaviour is really nasty. Because copy_file_range() can
> > > > return short copies, the caller has to implement a loop to ensure
> > > > > the range they want gets copied.  When the source range you are
> > > > trying to copy overlaps source EOF, this loop:
> > > >
> > > >         while (len > 0) {
> > > >                 ret = copy_file_range(... len ...)
> > > >                 ...
> > > >                 off_in += ret;
> > > >                 off_out += ret;
> > > >                 len -= ret;
> > > >         }
> > > >
> > > > Currently the fallback code copies up to the end of the source file
> > > > on the first copy and then fails the second copy with EINVAL because
> > > > the source range is now completely beyond EOF.
> > > >
> > > > So, from an application perspective, did the copy succeed or did it
> > > > fail?
> > > >
> > > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > > this a failure, because the second copy_file_range() call returns
> > > > EINVAL and not some "there is no more to copy" marker like read()
> > > > returning 0 bytes when attempting to read beyond EOF.
> > > >
> > > > IOWs, we cannot tell the difference between a real error and a short
> > > > copy because the input range spans EOF and it was silently
> > > > shortened. That's the API problem we need to fix here - the existing
> > > > behaviour is really crappy for applications. Erroring out
> > > > > immediately is one solution, and it's what the man page says should
> > > > happen so that is what I implemented.
> > > >
> > > > Realistically, though, I think an attempt to read beyond EOF for the
> > > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > > not EINVAL. The existing behaviour needs to change, though.
> > >
> > > There are two checks to consider
> > > 1. pos_in >= EOF should return EINVAL
> > > > 2. however, what perhaps should be relaxed is that pos_in+len >= EOF
> > > should return a short copy.
> > >
> > > > Having check#1 enforced allows us to differentiate between a real
> > > error and a short copy.
> >
> > That's what the code does right now and *exactly what I'm trying to
> > > fix* because EINVAL is ambiguous and not an indicator that we've
> > reached the end of the source file. EINVAL can indicate several
> > different errors, so it really has to be treated as a "copy failed"
> > error by applications.
> >
> > Have a look at read/pread() - they return 0 in this case to indicate
> > a short read, and the value of zero is explicitly defined as meaning
> > "read position is beyond EOF".  Applications know straight away that
> > there is no more data to be read and there was no error, so can
> > terminate on a successful short read.
> >
> > We need to allow applications to terminate copy loops on a
> > successful short copy. IOWs, applications need to either:
> >
> >         - get an immediate error saying the range is invalid rather
> >           than doing a short copy (as per the man page); or
> >         - have an explicit marker to say "no more data to be copied"
> >
> > Applications need the "no more data to copy" case to be explicit and
> > unambiguous so they can make sane decisions about whether a short
> > copy was successful because the file was shorter than expected or
> > whether a short copy was a result of a real error being encountered.
> > The current behaviour is largely unusable for applications because
> > they have to guess at the reason for EINVAL part way through a
> > copy....
> >
>
> Dave,
>
> I went ahead and implemented the desired behavior.
> However, while testing I observed that the desired behavior is already
> the existing behavior. For example, trying to copy 10 bytes from a 2-byte file,
> xfs_io copy loop ends as expected:
> copy_file_range(4, [0], 3, [0], 10, 0)  = 2
> copy_file_range(4, [2], 3, [2], 8, 0)   = 0
>
> This was tested on ext4 and xfs with reflink on recent kernel as well as on
> v4.20-rc1 (era of original patch set).
>
> Where and how did you observe the EINVAL behavior described above?
> (besides man page that is). There are even xfstests (which you modified)
> that verify the return 0 for past EOF behavior.
>
> For now, I am just dropping this patch from the patch series.
> Let me know if I am missing something.

The patch was fixing an inconsistency: the man page specified behaviour
(i.e., it must fail with EINVAL if offsets are out of range) that was
never enforced by the code. The fix could then be to fix the documented
semantics (man page) of the system call instead.

Copy file range is not only read and write but rather
lseek+read+write, and if somebody specifies an incorrect offset to the
lseek, the system call should fail. Thus I still think that copy file
range should fail with EINVAL when the specified source offset is beyond
the end of the file.

If copy file range returns 0 bytes, does that mean it's a stopping
condition? Not according to the current semantics.

* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2019-05-20 13:12               ` Olga Kornievskaia
@ 2019-05-20 13:36                 ` Amir Goldstein
  2019-05-20 13:58                   ` Olga Kornievskaia
  0 siblings, 1 reply; 83+ messages in thread
From: Amir Goldstein @ 2019-05-20 13:36 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-nfs, overlayfs, ceph-devel, CIFS

On Mon, May 20, 2019 at 4:12 PM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> On Mon, May 20, 2019 at 5:10 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > > > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > > >
> > > > > > > > The man page says:
> > > > > > > >
> > > > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > > > >
> > > > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > > > the behaviour described in the man page.
> > > > > >
> > > > > > I think the behavior implemented is a lot more useful than the one
> > > > > > documented..
> > > > >
> > > > > The current behaviour is really nasty. Because copy_file_range() can
> > > > > return short copies, the caller has to implement a loop to ensure
> > > > > > the range they want gets copied.  When the source range you are
> > > > > trying to copy overlaps source EOF, this loop:
> > > > >
> > > > >         while (len > 0) {
> > > > >                 ret = copy_file_range(... len ...)
> > > > >                 ...
> > > > >                 off_in += ret;
> > > > >                 off_out += ret;
> > > > >                 len -= ret;
> > > > >         }
> > > > >
> > > > > Currently the fallback code copies up to the end of the source file
> > > > > on the first copy and then fails the second copy with EINVAL because
> > > > > the source range is now completely beyond EOF.
> > > > >
> > > > > So, from an application perspective, did the copy succeed or did it
> > > > > fail?
> > > > >
> > > > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > > > this a failure, because the second copy_file_range() call returns
> > > > > EINVAL and not some "there is no more to copy" marker like read()
> > > > > returning 0 bytes when attempting to read beyond EOF.
> > > > >
> > > > > IOWs, we cannot tell the difference between a real error and a short
> > > > > copy because the input range spans EOF and it was silently
> > > > > shortened. That's the API problem we need to fix here - the existing
> > > > > behaviour is really crappy for applications. Erroring out
> > > > > > immediately is one solution, and it's what the man page says should
> > > > > happen so that is what I implemented.
> > > > >
> > > > > Realistically, though, I think an attempt to read beyond EOF for the
> > > > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > > > not EINVAL. The existing behaviour needs to change, though.
> > > >
> > > > There are two checks to consider
> > > > 1. pos_in >= EOF should return EINVAL
> > > > > 2. however, what should perhaps be relaxed is the pos_in+len >= EOF
> > > > > check, which should return a short copy.
> > > > >
> > > > > Having check#1 enforced allows us to differentiate between a real
> > > > error and a short copy.
> > >
> > > That's what the code does right now and *exactly what I'm trying to
> > > fix* because EINVAL is ambiguous and not an indicator that we've
> > > reached the end of the source file. EINVAL can indicate several
> > > different errors, so it really has to be treated as a "copy failed"
> > > error by applications.
> > >
> > > Have a look at read/pread() - they return 0 in this case to indicate
> > > a short read, and the value of zero is explicitly defined as meaning
> > > "read position is beyond EOF".  Applications know straight away that
> > > there is no more data to be read and there was no error, so can
> > > terminate on a successful short read.
> > >
> > > We need to allow applications to terminate copy loops on a
> > > successful short copy. IOWs, applications need to either:
> > >
> > >         - get an immediate error saying the range is invalid rather
> > >           than doing a short copy (as per the man page); or
> > >         - have an explicit marker to say "no more data to be copied"
> > >
> > > Applications need the "no more data to copy" case to be explicit and
> > > unambiguous so they can make sane decisions about whether a short
> > > copy was successful because the file was shorter than expected or
> > > whether a short copy was a result of a real error being encountered.
> > > The current behaviour is largely unusable for applications because
> > > they have to guess at the reason for EINVAL part way through a
> > > copy....
> > >
> >
> > Dave,
> >
> > I went ahead and implemented the desired behavior.
> > However, while testing I observed that the desired behavior is already
> > the existing behavior. For example, trying to copy 10 bytes from a 2-byte file,
> > xfs_io copy loop ends as expected:
> > copy_file_range(4, [0], 3, [0], 10, 0)  = 2
> > copy_file_range(4, [2], 3, [2], 8, 0)   = 0
> >
> > This was tested on ext4 and xfs with reflink on recent kernel as well as on
> > v4.20-rc1 (era of original patch set).
> >
> > Where and how did you observe the EINVAL behavior described above?
> > (besides man page that is). There are even xfstests (which you modified)
> > that verify the return 0 for past EOF behavior.
> >
> > For now, I am just dropping this patch from the patch series.
> > Let me know if I am missing something.
>
> The patch was fixing an inconsistency: the man page specified behaviour
> (i.e., it must fail with EINVAL if offsets are out of range) that was
> never enforced by the code. The fix could then be to fix the documented
> semantics (man page) of the system call instead.
>
> Copy file range is not only read and write but rather
> lseek+read+write and if somebody specifies an incorrect offset to the

Nope. It is like either read+write or pread+pwrite.

> lseek the system call should fail. Thus I still think that copy file
> range should enforce that specifying a source offset beyond the end of
> the file should fail with EINVAL.

You appear to be outnumbered by reviewers who think copy_file_range(2)
should behave like pread(2) and return 0 when off_in >= size_in.

>
> If copy file range returns 0 bytes, does that mean it's a stopping
> condition? Not according to the current semantics.

Yes. Same as read(2)/pread(2).
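
For illustration only, a minimal sketch of the pread(2) behaviour being
used as the model here: a request that spans EOF is shortened, and a
request that starts at or past EOF returns 0 rather than an error.
pread_probe() is a hypothetical helper written for this example (it is
not part of the patch series), and the temp file path is an assumption.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temporary file holding `size` zero bytes
 * and return what pread() reports for a `count`-byte read at offset `off`.
 * Returns -1 if the temporary file could not be set up.
 */
ssize_t pread_probe(size_t size, off_t off, size_t count)
{
	char path[] = "/tmp/pread_probe_XXXXXX";
	char buf[64] = { 0 };
	ssize_t ret;
	int fd;

	if (size > sizeof(buf) || count > sizeof(buf))
		return -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* file disappears once fd is closed */

	if (write(fd, buf, size) != (ssize_t)size) {
		close(fd);
		return -1;
	}

	/* Short read if the range spans EOF, 0 if off is at or past EOF. */
	ret = pread(fd, buf, count, off);
	close(fd);
	return ret;
}
```

With a 2-byte file, pread_probe(2, 0, 10) reads 2 bytes and
pread_probe(2, 2, 8) reads 0, which mirrors the two copy_file_range()
calls in the strace output quoted above.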

Thanks,
Amir.

* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2019-05-20 13:36                 ` Amir Goldstein
@ 2019-05-20 13:58                   ` Olga Kornievskaia
  2019-05-20 14:02                     ` Amir Goldstein
  0 siblings, 1 reply; 83+ messages in thread
From: Olga Kornievskaia @ 2019-05-20 13:58 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-nfs, overlayfs, ceph-devel, CIFS

On Mon, May 20, 2019 at 9:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, May 20, 2019 at 4:12 PM Olga Kornievskaia
> <olga.kornievskaia@gmail.com> wrote:
> >
> > On Mon, May 20, 2019 at 5:10 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > > > > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > >
> > > > > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > > > >
> > > > > > > > > The man page says:
> > > > > > > > >
> > > > > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > > > > >
> > > > > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > > > > the behaviour described in the man page.
> > > > > > >
> > > > > > > I think the behavior implemented is a lot more useful than the one
> > > > > > > documented..
> > > > > >
> > > > > > The current behaviour is really nasty. Because copy_file_range() can
> > > > > > return short copies, the caller has to implement a loop to ensure
> > > > > > > the range they want gets copied.  When the source range you are
> > > > > > trying to copy overlaps source EOF, this loop:
> > > > > >
> > > > > >         while (len > 0) {
> > > > > >                 ret = copy_file_range(... len ...)
> > > > > >                 ...
> > > > > >                 off_in += ret;
> > > > > >                 off_out += ret;
> > > > > >                 len -= ret;
> > > > > >         }
> > > > > >
> > > > > > Currently the fallback code copies up to the end of the source file
> > > > > > on the first copy and then fails the second copy with EINVAL because
> > > > > > the source range is now completely beyond EOF.
> > > > > >
> > > > > > So, from an application perspective, did the copy succeed or did it
> > > > > > fail?
> > > > > >
> > > > > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > > > > this a failure, because the second copy_file_range() call returns
> > > > > > EINVAL and not some "there is no more to copy" marker like read()
> > > > > > returning 0 bytes when attempting to read beyond EOF.
> > > > > >
> > > > > > IOWs, we cannot tell the difference between a real error and a short
> > > > > > copy because the input range spans EOF and it was silently
> > > > > > shortened. That's the API problem we need to fix here - the existing
> > > > > > behaviour is really crappy for applications. Erroring out
> > > > > > > immediately is one solution, and it's what the man page says should
> > > > > > happen so that is what I implemented.
> > > > > >
> > > > > > Realistically, though, I think an attempt to read beyond EOF for the
> > > > > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > > > > not EINVAL. The existing behaviour needs to change, though.
> > > > >
> > > > > There are two checks to consider
> > > > > 1. pos_in >= EOF should return EINVAL
> > > > > > 2. however, what should perhaps be relaxed is the pos_in+len >= EOF
> > > > > > check, which should return a short copy.
> > > > > >
> > > > > > Having check#1 enforced allows us to differentiate between a real
> > > > > error and a short copy.
> > > >
> > > > That's what the code does right now and *exactly what I'm trying to
> > > > fix* because EINVAL is ambiguous and not an indicator that we've
> > > > reached the end of the source file. EINVAL can indicate several
> > > > different errors, so it really has to be treated as a "copy failed"
> > > > error by applications.
> > > >
> > > > Have a look at read/pread() - they return 0 in this case to indicate
> > > > a short read, and the value of zero is explicitly defined as meaning
> > > > "read position is beyond EOF".  Applications know straight away that
> > > > there is no more data to be read and there was no error, so can
> > > > terminate on a successful short read.
> > > >
> > > > We need to allow applications to terminate copy loops on a
> > > > successful short copy. IOWs, applications need to either:
> > > >
> > > >         - get an immediate error saying the range is invalid rather
> > > >           than doing a short copy (as per the man page); or
> > > >         - have an explicit marker to say "no more data to be copied"
> > > >
> > > > Applications need the "no more data to copy" case to be explicit and
> > > > unambiguous so they can make sane decisions about whether a short
> > > > copy was successful because the file was shorter than expected or
> > > > whether a short copy was a result of a real error being encountered.
> > > > The current behaviour is largely unusable for applications because
> > > > they have to guess at the reason for EINVAL part way through a
> > > > copy....
> > > >
> > >
> > > Dave,
> > >
> > > I went ahead and implemented the desired behavior.
> > > However, while testing I observed that the desired behavior is already
> > > the existing behavior. For example, trying to copy 10 bytes from a 2-byte file,
> > > xfs_io copy loop ends as expected:
> > > copy_file_range(4, [0], 3, [0], 10, 0)  = 2
> > > copy_file_range(4, [2], 3, [2], 8, 0)   = 0
> > >
> > > This was tested on ext4 and xfs with reflink on recent kernel as well as on
> > > v4.20-rc1 (era of original patch set).
> > >
> > > Where and how did you observe the EINVAL behavior described above?
> > > (besides man page that is). There are even xfstests (which you modified)
> > > that verify the return 0 for past EOF behavior.
> > >
> > > For now, I am just dropping this patch from the patch series.
> > > Let me know if I am missing something.
> >
> > The patch was fixing an inconsistency: the man page specified behaviour
> > (i.e., it must fail with EINVAL if offsets are out of range) that was
> > never enforced by the code. The fix could then be to fix the documented
> > semantics (man page) of the system call instead.
> >
> > Copy file range is not only read and write but rather
> > lseek+read+write and if somebody specifies an incorrect offset to the
>
> Nope. It is like either read+write or pread+pwrite.
>
> > lseek the system call should fail. Thus I still think that copy file
> > range should enforce that specifying a source offset beyond the end of
> > the file should fail with EINVAL.
>
> You appear to be outnumbered by reviewers who think copy_file_range(2)
> should behave like pread(2) and return 0 when off_in >= size_in.
>
> >
> > If copy file range returns 0 bytes, does that mean it's a stopping
> > condition? Not according to the current semantics.
>
> Yes. Same as read(2)/pread(2).

If that's the case, then it's great. Perhaps it's the fact that the
copy_file_range man page doesn't talk about it that makes it
confusing.

>
> Thanks,
> Amir.

* Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
  2019-05-20 13:58                   ` Olga Kornievskaia
@ 2019-05-20 14:02                     ` Amir Goldstein
  0 siblings, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2019-05-20 14:02 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-nfs, overlayfs, ceph-devel, CIFS

On Mon, May 20, 2019 at 4:58 PM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> On Mon, May 20, 2019 at 9:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, May 20, 2019 at 4:12 PM Olga Kornievskaia
> > <olga.kornievskaia@gmail.com> wrote:
> > >
> > > On Mon, May 20, 2019 at 5:10 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > > > > > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > > >
> > > > > > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > > > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > > > > >
> > > > > > > > > > The man page says:
> > > > > > > > > >
> > > > > > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > > > > > >
> > > > > > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > > > > > the behaviour described in the man page.
> > > > > > > >
> > > > > > > > I think the behavior implemented is a lot more useful than the one
> > > > > > > > documented..
> > > > > > >
> > > > > > > The current behaviour is really nasty. Because copy_file_range() can
> > > > > > > return short copies, the caller has to implement a loop to ensure
> > > > > > > > the range they want gets copied.  When the source range you are
> > > > > > > trying to copy overlaps source EOF, this loop:
> > > > > > >
> > > > > > >         while (len > 0) {
> > > > > > >                 ret = copy_file_range(... len ...)
> > > > > > >                 ...
> > > > > > >                 off_in += ret;
> > > > > > >                 off_out += ret;
> > > > > > >                 len -= ret;
> > > > > > >         }
> > > > > > >
> > > > > > > Currently the fallback code copies up to the end of the source file
> > > > > > > on the first copy and then fails the second copy with EINVAL because
> > > > > > > the source range is now completely beyond EOF.
> > > > > > >
> > > > > > > So, from an application perspective, did the copy succeed or did it
> > > > > > > fail?
> > > > > > >
> > > > > > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > > > > > this a failure, because the second copy_file_range() call returns
> > > > > > > EINVAL and not some "there is no more to copy" marker like read()
> > > > > > > returning 0 bytes when attempting to read beyond EOF.
> > > > > > >
> > > > > > > IOWs, we cannot tell the difference between a real error and a short
> > > > > > > copy because the input range spans EOF and it was silently
> > > > > > > shortened. That's the API problem we need to fix here - the existing
> > > > > > > behaviour is really crappy for applications. Erroring out
> > > > > > > > immediately is one solution, and it's what the man page says should
> > > > > > > happen so that is what I implemented.
> > > > > > >
> > > > > > > Realistically, though, I think an attempt to read beyond EOF for the
> > > > > > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > > > > > not EINVAL. The existing behaviour needs to change, though.
> > > > > >
> > > > > > There are two checks to consider
> > > > > > 1. pos_in >= EOF should return EINVAL
> > > > > > > 2. however, what should perhaps be relaxed is the pos_in+len >= EOF
> > > > > > > check, which should return a short copy.
> > > > > > >
> > > > > > > Having check#1 enforced allows us to differentiate between a real
> > > > > > error and a short copy.
> > > > >
> > > > > That's what the code does right now and *exactly what I'm trying to
> > > > > fix* because EINVAL is ambiguous and not an indicator that we've
> > > > > reached the end of the source file. EINVAL can indicate several
> > > > > different errors, so it really has to be treated as a "copy failed"
> > > > > error by applications.
> > > > >
> > > > > Have a look at read/pread() - they return 0 in this case to indicate
> > > > > a short read, and the value of zero is explicitly defined as meaning
> > > > > "read position is beyond EOF".  Applications know straight away that
> > > > > there is no more data to be read and there was no error, so can
> > > > > terminate on a successful short read.
> > > > >
> > > > > We need to allow applications to terminate copy loops on a
> > > > > successful short copy. IOWs, applications need to either:
> > > > >
> > > > >         - get an immediate error saying the range is invalid rather
> > > > >           than doing a short copy (as per the man page); or
> > > > >         - have an explicit marker to say "no more data to be copied"
> > > > >
> > > > > Applications need the "no more data to copy" case to be explicit and
> > > > > unambiguous so they can make sane decisions about whether a short
> > > > > copy was successful because the file was shorter than expected or
> > > > > whether a short copy was a result of a real error being encountered.
> > > > > The current behaviour is largely unusable for applications because
> > > > > they have to guess at the reason for EINVAL part way through a
> > > > > copy....
> > > > >
> > > >
> > > > Dave,
> > > >
> > > > I went ahead and implemented the desired behavior.
> > > > However, while testing I observed that the desired behavior is already
> > > > the existing behavior. For example, trying to copy 10 bytes from a 2-byte file,
> > > > xfs_io copy loop ends as expected:
> > > > copy_file_range(4, [0], 3, [0], 10, 0)  = 2
> > > > copy_file_range(4, [2], 3, [2], 8, 0)   = 0
> > > >
> > > > This was tested on ext4 and xfs with reflink on recent kernel as well as on
> > > > v4.20-rc1 (era of original patch set).
> > > >
> > > > Where and how did you observe the EINVAL behavior described above?
> > > > (besides man page that is). There are even xfstests (which you modified)
> > > > that verify the return 0 for past EOF behavior.
> > > >
> > > > For now, I am just dropping this patch from the patch series.
> > > > Let me know if I am missing something.
> > >
> > > The patch was fixing an inconsistency: the man page specified behaviour
> > > (i.e., it must fail with EINVAL if offsets are out of range) that was
> > > never enforced by the code. The fix could then be to fix the documented
> > > semantics (man page) of the system call instead.
> > >
> > > Copy file range is not only read and write but rather
> > > lseek+read+write and if somebody specifies an incorrect offset to the
> >
> > Nope. It is like either read+write or pread+pwrite.
> >
> > > lseek the system call should fail. Thus I still think that copy file
> > > range should enforce that specifying a source offset beyond the end of
> > > the file should fail with EINVAL.
> >
> > You appear to be outnumbered by reviewers who think copy_file_range(2)
> > should behave like pread(2) and return 0 when off_in >= size_in.
> >
> > >
> > > If copy file range returns 0 bytes, does that mean it's a stopping
> > > condition? Not according to the current semantics.
> >
> > Yes. Same as read(2)/pread(2).
>
> If that's the case, then it's great. Perhaps it's the fact that the
> copy_file_range man page doesn't talk about it that makes it
> confusing.
>

We agreed that updating the man page is better, see:
https://github.com/amir73il/man-pages/commits/copy_file_range-v2

I'm currently testing reworked patches.
Will post them once they pass the tests.

Thanks,
Amir.

* Re: [PATCH 12/11] man-pages: copy_file_range updates
  2018-12-03  8:39 ` [PATCH 12/11] man-pages: copy_file_range updates Dave Chinner
  2018-12-03 13:05   ` Amir Goldstein
@ 2019-05-21  5:52   ` Amir Goldstein
  1 sibling, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2019-05-21  5:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
	Linux NFS Mailing List, overlayfs, ceph-devel, CIFS, linux-api

On Mon, Dec 3, 2018 at 10:40 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Update with all the missing errors the syscall can return, the
> behaviour the syscall should have w.r.t. copies within single
> files, etc.

Below are the changes I have made to V2 of this man-page update, in
accordance with the agreed change of behavior (i.e. short copy up to EOF).

This is a heads up before posting to verify my interpretation is correct.
I still have more testing to do before posting.

The main thing is adding:
 .BR copy_file_range ()
 will return the number of bytes copied between files.
 This could be less than the length originally requested.
+If the file offset of
+.I fd_in
+is at or past the end of file, no bytes are copied, and
+.BR copy_file_range ()
+returns zero.

But see also other changes below...

>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  man2/copy_file_range.2 | 94 +++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 77 insertions(+), 17 deletions(-)
>
> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> index 20374abb21f0..23b00c2f3fea 100644
> --- a/man2/copy_file_range.2
> +++ b/man2/copy_file_range.2
> @@ -42,9 +42,9 @@ without the additional cost of transferring data from the kernel to user space
>  and then back into the kernel.
>  It copies up to
>  .I len
> -bytes of data from file descriptor
> +bytes of data from the source file descriptor
>  .I fd_in
> -to file descriptor
> +to target file descriptor
>  .IR fd_out ,
>  overwriting any data that exists within the requested range of the target file.
>  .PP
> @@ -74,6 +74,11 @@ is not changed, but
>  .I off_in
>  is adjusted appropriately.
>  .PP
> +.I fd_in
> +and
> +.I fd_out
> +can refer to the same file. If they refer to the same file, then the source and
> +target ranges are not allowed to overlap.
>  .PP
>  The
>  .I flags
> @@ -93,34 +98,73 @@ is set to indicate the error.
>  .SH ERRORS
>  .TP
>  .B EBADF
> -One or more file descriptors are not valid; or
> +One or more file descriptors are not valid.
> +.TP
> +.B EBADF
>  .I fd_in
>  is not open for reading; or
>  .I fd_out
> -is not open for writing; or
> -the
> +is not open for writing.
> +.TP
> +.B EBADF
> +The
>  .B O_APPEND
>  flag is set for the open file description referred to by
>  .IR fd_out .
>  .TP
>  .B EFBIG
> -An attempt was made to write a file that exceeds the implementation-defined
> -maximum file size or the process's file size limit,
> -or to write at a position past the maximum allowed offset.
> +An attempt was made to write at a position past the maximum file offset the
> +kernel supports.

Updated to "...attempt made to read or write..."

> +.TP
> +.B EFBIG
> +An attempt was made to write a range that exceeds the allowed maximum file size.
> +The maximum file size differs between filesystem implementations and can be
> +different to the maximum allowed file offset.
> +.TP
> +.B EFBIG
> +An attempt was made to write beyond the process's file size resource
> +limit. This may also result in the process receiving a
> +.I SIGXFSZ
> +signal.
>  .TP
>  .B EINVAL
> -Requested range extends beyond the end of the source file; or the

Removed this.

> -.I flags
> -argument is not 0.
> +.I (off_in + len)
> +spans the end of the source file.
>  .TP
> -.B EIO
> -A low-level I/O error occurred while copying.
> +.B EINVAL
> +.I fd_in
> +and
> +.I fd_out
> +refer to the same file and the source and target ranges overlap.
> +.TP
> +.B EINVAL
> +.I fd_in
> +or
> +.I fd_out
> +is not a regular file.
>  .TP
>  .B EISDIR
>  .I fd_in
>  or
>  .I fd_out
>  refers to a directory.
> +.B EINVAL
> +The
> +.I flags
> +argument is not 0.
> +.TP
> +.B EINVAL
> +.I off_in
> +or
> +.I (off_in + len)
> +is beyond the maximum valid file offset.

Removed this. Updated the EFBIG entry to cover the input offset as well.

> +.TP
> +.B EOVERFLOW
> +The requested source or destination range is too large to represent in the
> +specified data types.
> +.TP
> +.B EIO
> +A low-level I/O error occurred while copying.
>  .TP
>  .B ENOMEM
>  Out of memory.
> @@ -128,16 +172,32 @@ Out of memory.
>  .B ENOSPC
>  There is not enough space on the target filesystem to complete the copy.
>  .TP
> -.B EXDEV
> -The files referred to by
> -.IR file_in " and " file_out

Kept this one with added "(pre Linux 5.3)"

> -are not on the same mounted filesystem.
> +.B ETXTBSY
> +.I fd_in
> +or
> +.I fd_out
> +refers to an active swap file.
> +.TP
> +.B EPERM
> +.I fd_out
> +refers to an immutable file.
> +.TP
> +.B EACCES
> +The user does not have write permissions for the destination file.
>  .SH VERSIONS
>  The
>  .BR copy_file_range ()
>  system call first appeared in Linux 4.5, but glibc 2.27 provides a user-space
>  emulation when it is not available.
>  .\" https://sourceware.org/git/?p=glibc.git;a=commit;f=posix/unistd.h;h=bad7a0c81f501fbbcc79af9eaa4b8254441c4a1f
> +.PP
> +A major rework of the kernel implementation occurred in 4.21. Areas of the API
> +that weren't clearly defined were clarified and the API bounds are much more
> +strictly checked than on earlier kernels. Applications should target the
> +behaviour and requirements of 4.21 kernels.
> +.PP
> +First support for cross-filesystem copies was introduced in Linux 4.21. Older
> +kernels will return -EXDEV when cross-filesystem copies are attempted.
>  .SH CONFORMING TO
>  The
>  .BR copy_file_range ()

Updated the example loop termination condition to:
         len \-= ret;
-    } while (len > 0);
+    } while (len > 0 && ret > 0);


WIP is available here:
https://github.com/amir73il/man-pages/commits/copy_file_range

Thanks,
Amir.

Thread overview: 83+ messages
2018-12-03  8:34 [PATCH 0/11] fs: fixes for major copy_file_range() issues Dave Chinner
2018-12-03  8:34 ` [PATCH 01/11] vfs: copy_file_range source range over EOF should fail Dave Chinner
2018-12-03 12:46   ` Amir Goldstein
2018-12-04 15:13     ` Christoph Hellwig
2018-12-04 21:29       ` Dave Chinner
2018-12-04 21:47         ` Olga Kornievskaia
2018-12-04 22:31           ` Dave Chinner
2018-12-05 16:51             ` bfields
2019-05-20  9:10             ` Amir Goldstein
2019-05-20 13:12               ` Olga Kornievskaia
2019-05-20 13:36                 ` Amir Goldstein
2019-05-20 13:58                   ` Olga Kornievskaia
2019-05-20 14:02                     ` Amir Goldstein
2018-12-05 14:12         ` Christoph Hellwig
2018-12-05 21:08           ` Dave Chinner
2018-12-05 21:30             ` Christoph Hellwig
2018-12-03  8:34 ` [PATCH 02/11] vfs: introduce generic_copy_file_range() Dave Chinner
2018-12-03 10:03   ` Amir Goldstein
2018-12-03 23:00     ` Dave Chinner
2018-12-04 15:14   ` Christoph Hellwig
2018-12-03  8:34 ` [PATCH 03/11] vfs: no fallback for ->copy_file_range Dave Chinner
2018-12-03 10:22   ` Amir Goldstein
2018-12-03 23:02     ` Dave Chinner
2018-12-06  4:16       ` Amir Goldstein
2018-12-06 21:30         ` Dave Chinner
2018-12-07  5:38           ` Amir Goldstein
2018-12-03 18:23   ` Anna Schumaker
2018-12-04 15:16   ` Christoph Hellwig
2018-12-03  8:34 ` [PATCH 04/11] vfs: add missing checks to copy_file_range Dave Chinner
2018-12-03 12:42   ` Amir Goldstein
2018-12-03 19:04   ` Darrick J. Wong
2018-12-03 21:33   ` Olga Kornievskaia
2018-12-03 23:04     ` Dave Chinner
2018-12-04 15:18   ` Christoph Hellwig
2018-12-12 11:31   ` Luis Henriques
2018-12-12 16:42     ` Darrick J. Wong
2018-12-12 18:55     ` Olga Kornievskaia
2018-12-12 19:42       ` Matthew Wilcox
2018-12-12 20:22         ` Olga Kornievskaia
2018-12-13 10:29           ` Luis Henriques
2018-12-03  8:34 ` [PATCH 05/11] vfs: use inode_permission in copy_file_range() Dave Chinner
2018-12-03 12:47   ` Amir Goldstein
2018-12-03 18:18   ` Darrick J. Wong
2018-12-03 23:55     ` Dave Chinner
2018-12-05 17:28       ` bfields
2018-12-03 18:53   ` Eric Biggers
2018-12-04 15:19   ` Christoph Hellwig
2018-12-03  8:34 ` [PATCH 06/11] vfs: copy_file_range needs to strip setuid bits Dave Chinner
2018-12-03 12:51   ` Amir Goldstein
2018-12-04 15:21   ` Christoph Hellwig
2018-12-03  8:34 ` [PATCH 07/11] vfs: copy_file_range should update file timestamps Dave Chinner
2018-12-03 10:47   ` Amir Goldstein
2018-12-03 17:33     ` Olga Kornievskaia
2018-12-03 18:22       ` Darrick J. Wong
2018-12-03 23:19     ` Dave Chinner
2018-12-04 15:24   ` Christoph Hellwig
2018-12-03  8:34 ` [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range Dave Chinner
2018-12-03 11:04   ` Amir Goldstein
2018-12-03 19:11     ` Darrick J. Wong
2018-12-03 23:37       ` Dave Chinner
2018-12-03 23:58         ` Darrick J. Wong
2018-12-04  9:17           ` Amir Goldstein
2018-12-03 23:34     ` Dave Chinner
2018-12-03 18:24   ` Darrick J. Wong
2018-12-04  8:18   ` Olga Kornievskaia
2018-12-03  8:34 ` [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down Dave Chinner
2018-12-03 12:36   ` Amir Goldstein
2018-12-03 17:58   ` Olga Kornievskaia
2018-12-03 18:53   ` Anna Schumaker
2018-12-03 19:27     ` Olga Kornievskaia
2018-12-03 23:40     ` Dave Chinner
2018-12-04 15:43   ` Christoph Hellwig
2018-12-04 22:18     ` Dave Chinner
2018-12-04 23:33       ` Olga Kornievskaia
2018-12-05 14:09       ` Christoph Hellwig
2018-12-05 17:01         ` Olga Kornievskaia
2018-12-03  8:34 ` [PATCH 10/11] vfs: allow generic_copy_file_range to copy across devices Dave Chinner
2018-12-03 12:54   ` Amir Goldstein
2018-12-03  8:34 ` [PATCH 11/11] ovl: allow cross-device copy_file_range calls Dave Chinner
2018-12-03 12:55   ` Amir Goldstein
2018-12-03  8:39 ` [PATCH 12/11] man-pages: copy_file_range updates Dave Chinner
2018-12-03 13:05   ` Amir Goldstein
2019-05-21  5:52   ` Amir Goldstein
