From: Dave Chinner <>
To: Nicolas Boichat <>
Cc: "Darrick J . Wong" <>,
	Luis Lozano <>,
	Ian Lance Taylor <>,
	Alexander Viro <>,
	Amir Goldstein <>,
	"Darrick J. Wong" <>,
	Dave Chinner <>,,
Subject: Re: [PATCH] fs: generic_copy_file_checks: Do not adjust count based on file size
Date: Wed, 27 Jan 2021 10:38:40 +1100
Message-ID: <20210126233840.GG4626@dread.disaster.area>
In-Reply-To: <20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid>

On Tue, Jan 26, 2021 at 01:50:22PM +0800, Nicolas Boichat wrote:
> copy_file_range (which calls generic_copy_file_checks) uses the
> inode file size to adjust the copy count parameter. This breaks
> with special filesystems like procfs/sysfs, where the file size
> appears to be zero, but content is actually returned when a read
> operation is performed.
> This commit ignores the source file size, and makes copy_file_range
> match the end of file behaviour documented in POSIX's "read",
> where 0 is returned to mark EOF. This would allow "cp" and other
> standard tools to make use of copy_file_range with the exact same
> behaviour as they had in the past.
> Fixes: 96e6e8f4a68d ("vfs: add missing checks to copy_file_range")
> Signed-off-by: Nicolas Boichat <>


As I've explained, this is intentional, and bypassing it is not a
workaround for enabling cfr on filesystems that produce ephemeral,
volatile read-once data using seq-file pipes that masquerade as
regular files with zero size. These files behave like pipes
and only work because the VFS has to support read() and friends from
pipes that don't publish the amount of data they contain to the VFS.

copy_file_range() does not support such behaviour.

copy_file_range() -writes- data, so we have to check that those
writes do not extend past boundaries that the destination inode
imposes on the operation. e.g. maximum offset limits, whether the
ranges overlap in the same file, etc.

Hence we need to know how much data there is present to copy before
we can check if it is safe to perform the -write- of the data we are
going to read. Hence we cannot safely support data sources that
cannot tell us how much data is present before we start the copy.

IOWs, these source file EOF restrictions are required by the write
side of copy_file_range(), not the read side.

> ---
> This can be reproduced with this simple test case:
>  #define _GNU_SOURCE
>  #include <fcntl.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <sys/stat.h>
>  #include <unistd.h>
>  int
>  main(int argc, char **argv)
>  {
>    int fd_in, fd_out;
>    loff_t ret;
>    fd_in = open("/proc/version", O_RDONLY);
>    fd_out = open("version", O_CREAT | O_WRONLY | O_TRUNC, 0644);
>    do {
>      ret = copy_file_range(fd_in, NULL, fd_out, NULL, 1024, 0);
>      printf("%d bytes copied\n", (int)ret);
>    } while (ret > 0);
>    return 0;
>  }
> Without this patch, `version` output file is empty, and no bytes
> are copied:
> 0 bytes copied

$ ls -l /proc/version
-r--r--r-- 1 root root 0 Jan 20 17:25 /proc/version

It's a zero length file.
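
The mismatch between the published size and the readable data can be
observed from userspace with something like the sketch below. The
helper name size_vs_readable() is made up for illustration, and EINTR
handling is left out to keep it short.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: report the size fstat() publishes for a file
 * and the number of bytes a plain read() loop actually returns.
 * Returns 0 on success, -1 on error.
 */
int size_vs_readable(const char *path, off_t *published, off_t *readable)
{
    struct stat st;
    char buf[4096];
    ssize_t n;
    int fd;

    assert(path != NULL && published != NULL && readable != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    *published = st.st_size;

    /* Count what read() hands back until it reports EOF. */
    *readable = 0;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        *readable += n;

    close(fd);
    return n < 0 ? -1 : 0;
}
```

For an ordinary regular file the two values agree. For /proc/version,
going by the ls output above and the test case in the patch, the
published size is 0 while the read() loop still returns data.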

sysfs does this just fine - its regular files have a size of
at least PAGE_SIZE rather than zero, and so copy_file_range works
just fine on them:

$ ls -l /sys/block/nvme0n1/capability
-r--r--r-- 1 root root 4096 Jan 27 08:41 /sys/block/nvme0n1/capability
$ cat /sys/block/nvme0n1/capability
$ xfs_io -f -c "copy_range -s 0 -d 0 -l 4096 /sys/block/nvme0n1/capability" /tmp/foo
$ sudo cat /tmp/foo

And the behaviour is exactly what you'd expect from a read() loop
copying the file:

openat(AT_FDCWD, "/tmp/foo", O_RDWR|O_CREAT, 0600) = 3
openat(AT_FDCWD, "/sys/block/nvme0n1/capability", O_RDONLY) = 4
copy_file_range(4, [0], 3, [0], 4096, 0) = 3
copy_file_range(4, [3], 3, [3], 4093, 0) = 0

See? Inode size of 4096 means there's a maximum of 4kB of data that
can be read from this file.  copy_file_range() now behaves exactly
as read() would, returning a short copy and then 0 bytes to indicate
EOF.

If you want ephemeral data pipes masquerading as regular files to
work with copy_file_range, then the filesystem implementation needs
to provide the VFS with a data size that indicates the maximum
amount of data that the pipe can produce in a continuous read loop.
Otherwise we cannot validate the range of the write we may be asked
to perform...

> Under the hood, Go 1.15 uses `copy_file_range` syscall to optimize the
> copy operation. However, that fails to copy any content when the input
> file is from sysfs/tracefs, with an apparent size of 0 (but there is
> still content when you `cat` it, of course).

Libraries using copy_file_range() must be prepared for it to fail
and fall back to normal copy mechanisms. Of course, with these
special zero length files that contain ephemeral data, userspace can't
actually tell from stat() that they contain data. So
as far as userspace is concerned, copy_file_range() correctly
returned zero bytes copied from a zero byte long file and there's
nothing more to do.
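
For illustration, a library-side fallback might look roughly like the
sketch below. The helper names (write_all, copy_read_write, copy_fd)
and the chunk sizes are assumptions. Confirming EOF with a plain
read() after copy_file_range() returns 0 is a design choice of this
sketch, not something the kernel requires; it is what lets the loop
pick up data from sources that publish a zero size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes and EINTR. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Plain read()/write() loop from the current file offsets. */
ssize_t copy_read_write(int fd_in, int fd_out)
{
    char buf[65536];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(fd_in, buf, sizeof(buf));
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return total;   /* read() reports EOF */
        if (write_all(fd_out, buf, (size_t)n) < 0)
            return -1;
        total += n;
    }
}

/*
 * Hypothetical helper: copy with copy_file_range() while it makes
 * progress, then let a read()/write() loop finish from the current
 * offsets. That loop also covers the cases where copy_file_range()
 * failed outright or returned 0 for a source with a zero size.
 * Returns the number of bytes copied, or -1 on error.
 */
ssize_t copy_fd(int fd_in, int fd_out)
{
    ssize_t total = 0, n, rest;

    assert(fd_in >= 0 && fd_out >= 0);

    while ((n = copy_file_range(fd_in, NULL, fd_out, NULL,
                                (size_t)1 << 20, 0)) > 0)
        total += n;

    /* n == 0 (EOF or zero size) or n < 0 (unsupported, EINVAL, ...). */
    rest = copy_read_write(fd_in, fd_out);
    return rest < 0 ? -1 : total + rest;
}
```

When the source really is at EOF, the extra read() returns 0
immediately, so the cost for ordinary files is one additional syscall.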

This zero length file behaviour is, fundamentally, a kernel
filesystem implementation bug, not a copy_file_range() bug.


Dave Chinner
