linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC v1 01/19] Don't copy beyond the end of the file
       [not found]               ` <924FF7A2-27CD-4848-BD61-748758C2533F@netapp.com>
@ 2017-03-06 19:23                 ` J. Bruce Fields
  2017-03-07 14:18                   ` Olga Kornievskaia
  0 siblings, 1 reply; 16+ messages in thread
From: J. Bruce Fields @ 2017-03-06 19:23 UTC (permalink / raw)
  To: linux-man
  Cc: Christoph Hellwig, Trond.Myklebust, linux-nfs, linux-fsdevel,
	Olga Kornievskaia

On Mon, Mar 06, 2017 at 11:27:23AM -0500, Olga Kornievskaia wrote:
> I don’t see copy_file_range() specifying that 0 means end of the file.

Is there a reason that copy_file_range shouldn't be like read?  (From
read(2): "On  success,  the number of bytes read is returned (zero
indicates end of file)".

I haven't checked, but suspect that's already true of the
implementations we have.

Also, from copy_file_range():

	EINVAL Requested  range  extends beyond the end of the source
		file; or the flags argument is not 0.

Some filesystems do this, some don't; I think the man page should make
it clear that this behavior is not required.

--b.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] Don't copy beyond the end of the file
  2017-03-06 19:23                 ` [RFC v1 01/19] Don't copy beyond the end of the file J. Bruce Fields
@ 2017-03-07 14:18                   ` Olga Kornievskaia
  0 siblings, 0 replies; 16+ messages in thread
From: Olga Kornievskaia @ 2017-03-07 14:18 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: linux-man, Christoph Hellwig, Trond Myklebust, linux-nfs,
	linux-fsdevel, Olga Kornievskaia

On Mon, Mar 6, 2017 at 2:23 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Mon, Mar 06, 2017 at 11:27:23AM -0500, Olga Kornievskaia wrote:
>> I don’t see copy_file_range() specifying that 0 means end of the file.
>
> Is there a reason that copy_file_range shouldn't be like read?  (From
> read(2): "On  success,  the number of bytes read is returned (zero
> indicates end of file)".

The only thing I can think of is the fact that copy_file_range() does
write too. And return 0 from the write is not expected/allowed (except
when input count was 0)?

> I haven't checked, but suspect that's already true of the
> implementations we have.
>
> Also, from copy_file_range():
>
>         EINVAL Requested  range  extends beyond the end of the source
>                 file; or the flags argument is not 0.
>
> Some filesystems do this, some don't; I think the man page should make
> it clear that this behavior is not required.
>
> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
       [not found]             ` <7FDA8E80-3C62-48BB-9E2B-195B4BA340C0@netapp.com>
@ 2017-03-08 19:53               ` J. Bruce Fields
  2017-03-08 20:00                 ` Olga Kornievskaia
  0 siblings, 1 reply; 16+ messages in thread
From: J. Bruce Fields @ 2017-03-08 19:53 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Christoph Hellwig, Trond.Myklebust, linux-nfs, linux-fsdevel

On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:
> 
> > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.org>
> > wrote:
> > 
> > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields wrote:
> >> Since copy isn't atomic that check is never going to be reliable.
> > 
> > That's true for everything that COPY does.  By that logic we should
> > not implement it at all (a logic that I'd fully support)
> 
> If you were to only keep CLONE then you’d lose a huge performance gain
> you get from server-to-server COPY. 

Yes.  Also, I think copy-like copy implementations have reasonable
semantics that are basically the same as read:

	- copy can return successfully with less copied than requested.
	- it's fine for the copied range to start and/or end past end of
	  file, it'll just return a short read.
	- A copy of more than 0 bytes returning 0 means you're at end of
	  file.

The particular problem here is that that doesn't fit how clone works at
all.

It feels like what happened is that copy_file_range() was made mainly
for the clone case, with the idea that copy might be reluctantly
accepted as a second-class implementation.

But the performance gain of copy offload is too big to just ignore, and
in fact it's what copy_file_range does on every filesystem but btrfs and
ocfs2 (and maybe cifs?), so I don't think we can just ignore it.

If we had separate copy_file_range and clone_file_range, I *think* it
could all be made sensible.  Am I missing something?

--b.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-08 19:53               ` [RFC v1 01/19] fs: " J. Bruce Fields
@ 2017-03-08 20:00                 ` Olga Kornievskaia
  2017-03-08 20:18                   ` J. Bruce Fields
  2017-03-08 20:18                   ` Trond Myklebust
  0 siblings, 2 replies; 16+ messages in thread
From: Olga Kornievskaia @ 2017-03-08 20:00 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Christoph Hellwig, Trond.Myklebust, linux-nfs, linux-fsdevel


> On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:
>> 
>>> On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.org>
>>> wrote:
>>> 
>>> On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields wrote:
>>>> Since copy isn't atomic that check is never going to be reliable.
>>> 
>>> That's true for everything that COPY does.  By that logic we should
>>> not implement it at all (a logic that I'd fully support)
>> 
>> If you were to only keep CLONE then you’d lose a huge performance gain
>> you get from server-to-server COPY. 
> 
> Yes.  Also, I think copy-like copy implementations have reasonable
> semantics that are basically the same as read:
> 
> 	- copy can return successfully with less copied than requested.
> 	- it's fine for the copied range to start and/or end past end of
> 	  file, it'll just return a short read.
> 	- A copy of more than 0 bytes returning 0 means you're at end of
> 	  file.
> 
> The particular problem here is that that doesn't fit how clone works at
> all.
> 
> It feels like what happened is that copy_file_range() was made mainly
> for the clone case, with the idea that copy might be reluctantly
> accepted as a second-class implementation.
> 
> But the performance gain of copy offload is too big to just ignore, and
> in fact it's what copy_file_range does on every filesystem but btrfs and
> ocfs2 (and maybe cifs?), so I don't think we can just ignore it.
> 
> If we had separate copy_file_range and clone_file_range, I *think* it
> could all be made sensible.  Am I missing something?
> 

How would the application (cp) know when to call the clone_file_range and when to call copy_file_range?
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-08 20:00                 ` Olga Kornievskaia
@ 2017-03-08 20:18                   ` J. Bruce Fields
  2017-03-08 20:18                   ` Trond Myklebust
  1 sibling, 0 replies; 16+ messages in thread
From: J. Bruce Fields @ 2017-03-08 20:18 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Christoph Hellwig, Trond.Myklebust, linux-nfs, linux-fsdevel

On Wed, Mar 08, 2017 at 03:00:52PM -0500, Olga Kornievskaia wrote:
> 
> > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:
> >> 
> >>> On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.org>
> >>> wrote:
> >>> 
> >>> On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields wrote:
> >>>> Since copy isn't atomic that check is never going to be reliable.
> >>> 
> >>> That's true for everything that COPY does.  By that logic we should
> >>> not implement it at all (a logic that I'd fully support)
> >> 
> >> If you were to only keep CLONE then you’d lose a huge performance gain
> >> you get from server-to-server COPY. 
> > 
> > Yes.  Also, I think copy-like copy implementations have reasonable
> > semantics that are basically the same as read:
> > 
> > 	- copy can return successfully with less copied than requested.
> > 	- it's fine for the copied range to start and/or end past end of
> > 	  file, it'll just return a short read.
> > 	- A copy of more than 0 bytes returning 0 means you're at end of
> > 	  file.
> > 
> > The particular problem here is that that doesn't fit how clone works at
> > all.
> > 
> > It feels like what happened is that copy_file_range() was made mainly
> > for the clone case, with the idea that copy might be reluctantly
> > accepted as a second-class implementation.
> > 
> > But the performance gain of copy offload is too big to just ignore, and
> > in fact it's what copy_file_range does on every filesystem but btrfs and
> > ocfs2 (and maybe cifs?), so I don't think we can just ignore it.
> > 
> > If we had separate copy_file_range and clone_file_range, I *think* it
> > could all be made sensible.  Am I missing something?
> 
> How would the application (cp) know when to call the clone_file_range and when to call copy_file_range?

Try clone and then fall back on copy if that's not available?

Which is the same thing vfs_copy_file_range() is doing now, but it'd
seem less confusing if that logic was in the application.

--b.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-08 20:00                 ` Olga Kornievskaia
  2017-03-08 20:18                   ` J. Bruce Fields
@ 2017-03-08 20:18                   ` Trond Myklebust
  2017-03-08 20:32                     ` bfields
  1 sibling, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2017-03-08 20:18 UTC (permalink / raw)
  To: bfields, kolga; +Cc: hch, Trond Myklebust, linux-nfs, linux-fsdevel

On Wed, 2017-03-08 at 15:00 -0500, Olga Kornievskaia wrote:
> > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.org>
> > wrote:
> > 
> > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:
> > > 
> > > > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.o
> > > > rg>
> > > > wrote:
> > > > 
> > > > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields
> > > > wrote:
> > > > > Since copy isn't atomic that check is never going to be
> > > > > reliable.
> > > > 
> > > > That's true for everything that COPY does.  By that logic we
> > > > should
> > > > not implement it at all (a logic that I'd fully support)
> > > 
> > > If you were to only keep CLONE then you’d lose a huge performance
> > > gain
> > > you get from server-to-server COPY. 
> > 
> > Yes.  Also, I think copy-like copy implementations have reasonable
> > semantics that are basically the same as read:
> > 
> > 	- copy can return successfully with less copied than requested.
> > 	- it's fine for the copied range to start and/or end past end
> > of
> > 	  file, it'll just return a short read.
> > 	- A copy of more than 0 bytes returning 0 means you're at end
> > of
> > 	  file.
> > 
> > The particular problem here is that that doesn't fit how clone
> > works at
> > all.
> > 
> > It feels like what happened is that copy_file_range() was made
> > mainly
> > for the clone case, with the idea that copy might be reluctantly
> > accepted as a second-class implementation.

Historically? No... Christoph added clone as a valid implementation of
copy_file_range() almost a year after Zach and Anna defined the
semantics of vfs_copy_file_range(). git blame is your friend...

> > 
> > But the performance gain of copy offload is too big to just ignore,
> > and
> > in fact it's what copy_file_range does on every filesystem but
> > btrfs and
> > ocfs2 (and maybe cifs?), so I don't think we can just ignore it.
> > 
> > If we had separate copy_file_range and clone_file_range, I *think*
> > it
> > could all be made sensible.  Am I missing something?
> > 
> 
> How would the application (cp) know when to call the clone_file_range
> and when to call copy_file_range?

cp can probably call copy_file_range(), but any application that needs
atomic semantics (i.e. a binary operation success/fail) must call
clone_file_range().

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-08 20:18                   ` Trond Myklebust
@ 2017-03-08 20:32                     ` bfields
  2017-03-08 20:49                       ` Trond Myklebust
  0 siblings, 1 reply; 16+ messages in thread
From: bfields @ 2017-03-08 20:32 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: kolga, hch, linux-nfs, linux-fsdevel

On Wed, Mar 08, 2017 at 08:18:31PM +0000, Trond Myklebust wrote:
> On Wed, 2017-03-08 at 15:00 -0500, Olga Kornievskaia wrote:
> > > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.org>
> > > wrote:
> > > 
> > > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia wrote:
> > > > 
> > > > > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infradead.o
> > > > > rg>
> > > > > wrote:
> > > > > 
> > > > > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields
> > > > > wrote:
> > > > > > Since copy isn't atomic that check is never going to be
> > > > > > reliable.
> > > > > 
> > > > > That's true for everything that COPY does.  By that logic we
> > > > > should
> > > > > not implement it at all (a logic that I'd fully support)
> > > > 
> > > > If you were to only keep CLONE then you’d lose a huge performance
> > > > gain
> > > > you get from server-to-server COPY. 
> > > 
> > > Yes.  Also, I think copy-like copy implementations have reasonable
> > > semantics that are basically the same as read:
> > > 
> > > 	- copy can return successfully with less copied than requested.
> > > 	- it's fine for the copied range to start and/or end past end
> > > of
> > > 	  file, it'll just return a short read.
> > > 	- A copy of more than 0 bytes returning 0 means you're at end
> > > of
> > > 	  file.
> > > 
> > > The particular problem here is that that doesn't fit how clone
> > > works at
> > > all.
> > > 
> > > It feels like what happened is that copy_file_range() was made
> > > mainly
> > > for the clone case, with the idea that copy might be reluctantly
> > > accepted as a second-class implementation.
> 
> Historically? No... Christoph added clone as a valid implementation of
> copy_file_range() almost a year after Zach and Anna defined the
> semantics of vfs_copy_file_range(). git blame is your friend...

Yeah, I know.  It still feels to me like the interface was originally
designed with clone in mind, but that's my vague impression from the man
pages and half-remembered conversations.

Though the lack of a "just copy the whole file regardless of size" case
is weird for clone.  All you can do is stat the file and then hope it
doesn't change before you issue the copy_file_range.  But I'd think it'd
be easy for an atomic clone implementation to handle, say, getting a
snapshot of a log file while it's getting continuously appended to.

> > > But the performance gain of copy offload is too big to just ignore,
> > > and
> > > in fact it's what copy_file_range does on every filesystem but
> > > btrfs and
> > > ocfs2 (and maybe cifs?), so I don't think we can just ignore it.
> > > 
> > > If we had separate copy_file_range and clone_file_range, I *think*
> > > it
> > > could all be made sensible.  Am I missing something?
> > > 
> > 
> > How would the application (cp) know when to call the clone_file_range
> > and when to call copy_file_range?
> 
> cp can probably call copy_file_range(), but any application that needs
> atomic semantics (i.e. a binary operation success/fail) must call
> clone_file_range().

I don't believe there's a clone_file_range().  I see the vfs interface,
but no system call.

And implementing a simple cp is harder than it should be when you don't
know whether it's implemented as copy or clone.  You have to stat for
the file size first, retry if you got it wrong, and also retry if you
get a short read.  The example in the clone_file_range() man page is
incomplete.

--b.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-08 20:32                     ` bfields
@ 2017-03-08 20:49                       ` Trond Myklebust
  2017-03-09 15:29                         ` bfields
  0 siblings, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2017-03-08 20:49 UTC (permalink / raw)
  To: bfields; +Cc: hch, linux-nfs, kolga, linux-fsdevel

On Wed, 2017-03-08 at 15:32 -0500, bfields@fieldses.org wrote:
> On Wed, Mar 08, 2017 at 08:18:31PM +0000, Trond Myklebust wrote:
> > On Wed, 2017-03-08 at 15:00 -0500, Olga Kornievskaia wrote:
> > > > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.o
> > > > rg>
> > > > wrote:
> > > > 
> > > > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia
> > > > wrote:
> > > > > 
> > > > > > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infrade
> > > > > > ad.o
> > > > > > rg>
> > > > > > wrote:
> > > > > > 
> > > > > > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields
> > > > > > wrote:
> > > > > > > Since copy isn't atomic that check is never going to be
> > > > > > > reliable.
> > > > > > 
> > > > > > That's true for everything that COPY does.  By that logic
> > > > > > we
> > > > > > should
> > > > > > not implement it at all (a logic that I'd fully support)
> > > > > 
> > > > > If you were to only keep CLONE then you’d lose a huge
> > > > > performance
> > > > > gain
> > > > > you get from server-to-server COPY. 
> > > > 
> > > > Yes.  Also, I think copy-like copy implementations have
> > > > reasonable
> > > > semantics that are basically the same as read:
> > > > 
> > > > 	- copy can return successfully with less copied than
> > > > requested.
> > > > 	- it's fine for the copied range to start and/or end
> > > > past end
> > > > of
> > > > 	  file, it'll just return a short read.
> > > > 	- A copy of more than 0 bytes returning 0 means you're
> > > > at end
> > > > of
> > > > 	  file.
> > > > 
> > > > The particular problem here is that that doesn't fit how clone
> > > > works at
> > > > all.
> > > > 
> > > > It feels like what happened is that copy_file_range() was made
> > > > mainly
> > > > for the clone case, with the idea that copy might be
> > > > reluctantly
> > > > accepted as a second-class implementation.
> > 
> > Historically? No... Christoph added clone as a valid implementation
> > of
> > copy_file_range() almost a year after Zach and Anna defined the
> > semantics of vfs_copy_file_range(). git blame is your friend...
> 
> Yeah, I know.  It still feels to me like the interface was originally
> designed with clone in mind, but that's my vague impression from the
> man
> pages and half-remembered conversations.
> 
> Though the lack of a "just copy the whole file regardless of size"
> case
> is weird for clone.  All you can do is stat the file and then hope it
> doesn't change before you issue the copy_file_range.  But I'd think
> it'd
> be easy for an atomic clone implementation to handle, say, getting a
> snapshot of a log file while it's getting continuously appended to.

It really isn't that interesting in the continuously appended case
(what difference does it make if you only get data from just a few
moments ago), but I can see it being an issue in the case of random
writes where the file size is being extended.

The thing is that in both those cases, the copy_file_range() semantics
are worse, since they don't even guarantee a time-consistent copy.

> > > > But the performance gain of copy offload is too big to just
> > > > ignore,
> > > > and
> > > > in fact it's what copy_file_range does on every filesystem but
> > > > btrfs and
> > > > ocfs2 (and maybe cifs?), so I don't think we can just ignore
> > > > it.
> > > > 
> > > > If we had separate copy_file_range and clone_file_range, I
> > > > *think*
> > > > it
> > > > could all be made sensible.  Am I missing something?
> > > > 
> > > 
> > > How would the application (cp) know when to call the
> > > clone_file_range
> > > and when to call copy_file_range?
> > 
> > cp can probably call copy_file_range(), but any application that
> > needs
> > atomic semantics (i.e. a binary operation success/fail) must call
> > clone_file_range().
> 
> I don't believe there's a clone_file_range().  I see the vfs
> interface,
> but no system call.

There is a standard FICLONERANGE ioctl() that can be used on all
filesystems that support the vfs interface.

> And implementing a simple cp is harder than it should be when you
> don't
> know whether it's implemented as copy or clone.  You have to stat for
> the file size first, retry if you got it wrong, and also retry if you
> get a short read.  The example in the clone_file_range() man page is
> incomplete.

As I said, you shouldn't be using copy_file_range() either in the case
where the file is being modified.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-08 20:49                       ` Trond Myklebust
@ 2017-03-09 15:29                         ` bfields
  2017-03-09 15:35                           ` hch
  0 siblings, 1 reply; 16+ messages in thread
From: bfields @ 2017-03-09 15:29 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: hch, linux-nfs, kolga, linux-fsdevel

On Wed, Mar 08, 2017 at 08:49:56PM +0000, Trond Myklebust wrote:
> On Wed, 2017-03-08 at 15:32 -0500, bfields@fieldses.org wrote:
> > On Wed, Mar 08, 2017 at 08:18:31PM +0000, Trond Myklebust wrote:
> > > On Wed, 2017-03-08 at 15:00 -0500, Olga Kornievskaia wrote:
> > > > > On Mar 8, 2017, at 2:53 PM, J. Bruce Fields <bfields@fieldses.o
> > > > > rg>
> > > > > wrote:
> > > > > 
> > > > > On Wed, Mar 08, 2017 at 12:32:12PM -0500, Olga Kornievskaia
> > > > > wrote:
> > > > > > 
> > > > > > > On Mar 8, 2017, at 12:25 PM, Christoph Hellwig <hch@infrade
> > > > > > > ad.o
> > > > > > > rg>
> > > > > > > wrote:
> > > > > > > 
> > > > > > > On Wed, Mar 08, 2017 at 12:05:21PM -0500, J. Bruce Fields
> > > > > > > wrote:
> > > > > > > > Since copy isn't atomic that check is never going to be
> > > > > > > > reliable.
> > > > > > > 
> > > > > > > That's true for everything that COPY does.  By that logic
> > > > > > > we
> > > > > > > should
> > > > > > > not implement it at all (a logic that I'd fully support)
> > > > > > 
> > > > > > If you were to only keep CLONE then you’d lose a huge
> > > > > > performance
> > > > > > gain
> > > > > > you get from server-to-server COPY. 
> > > > > 
> > > > > Yes.  Also, I think copy-like copy implementations have
> > > > > reasonable
> > > > > semantics that are basically the same as read:
> > > > > 
> > > > > 	- copy can return successfully with less copied than
> > > > > requested.
> > > > > 	- it's fine for the copied range to start and/or end
> > > > > past end
> > > > > of
> > > > > 	  file, it'll just return a short read.
> > > > > 	- A copy of more than 0 bytes returning 0 means you're
> > > > > at end
> > > > > of
> > > > > 	  file.
> > > > > 
> > > > > The particular problem here is that that doesn't fit how clone
> > > > > works at
> > > > > all.
> > > > > 
> > > > > It feels like what happened is that copy_file_range() was made
> > > > > mainly
> > > > > for the clone case, with the idea that copy might be
> > > > > reluctantly
> > > > > accepted as a second-class implementation.
> > > 
> > > Historically? No... Christoph added clone as a valid implementation
> > > of
> > > copy_file_range() almost a year after Zach and Anna defined the
> > > semantics of vfs_copy_file_range(). git blame is your friend...
> > 
> > Yeah, I know.  It still feels to me like the interface was originally
> > designed with clone in mind, but that's my vague impression from the
> > man
> > pages and half-remembered conversations.
> > 
> > Though the lack of a "just copy the whole file regardless of size"
> > case
> > is weird for clone.  All you can do is stat the file and then hope it
> > doesn't change before you issue the copy_file_range.  But I'd think
> > it'd
> > be easy for an atomic clone implementation to handle, say, getting a
> > snapshot of a log file while it's getting continuously appended to.
> 
> It really isn't that interesting in the continuously appended case
> (what difference does it make if you only get data from just a few
> moments ago), but I can see it being an issue in the case of random
> writes where the file size is being extended.

Bah, yes, apologies for the bad example.

> The thing is that in both those cases, the copy_file_range() semantics
> are worse, since they don't even guarantee a time-consistent copy.
>
> > > > > But the performance gain of copy offload is too big to just
> > > > > ignore,
> > > > > and
> > > > > in fact it's what copy_file_range does on every filesystem but
> > > > > btrfs and
> > > > > ocfs2 (and maybe cifs?), so I don't think we can just ignore
> > > > > it.
> > > > > 
> > > > > If we had separate copy_file_range and clone_file_range, I
> > > > > *think*
> > > > > it
> > > > > could all be made sensible.  Am I missing something?
> > > > > 
> > > > 
> > > > How would the application (cp) know when to call the
> > > > clone_file_range
> > > > and when to call copy_file_range?
> > > 
> > > cp can probably call copy_file_range(), but any application that
> > > needs
> > > atomic semantics (i.e. a binary operation success/fail) must call
> > > clone_file_range().
> > 
> > I don't believe there's a clone_file_range().  I see the vfs
> > interface,
> > but no system call.
> 
> There is a standard FICLONERANGE ioctl() that can be used on all
> filesystems that support the vfs interface.

Oh, thanks, I forgot about that.

So I don't understand why it needed to be added to copy_file_range().
The copy and clone semantics are different enough that I think callers
want to know which they're getting.

> > And implementing a simple cp is harder than it should be when you
> > don't
> > know whether it's implemented as copy or clone.  You have to stat for
> > the file size first, retry if you got it wrong, and also retry if you
> > get a short read.  The example in the clone_file_range() man page is
> > incomplete.
> 
> As I said, you shouldn't be using copy_file_range() either in the case
> where the file is being modified.

Don't we want it to have more or less the same behavior as a read-write
loop?  People are probably running backup programs that depend on just
simple copies, and maybe the results are good enough for their purposes,
or maybe they're actually corrupting parts of their backups and don't
know, but we can't suddenly start aborting their backups with errors and
tell users it's for their own good.  So copy_file_range() callers will
need to handle EINVAL on changing files somehow.

--b.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-09 15:29                         ` bfields
@ 2017-03-09 15:35                           ` hch
  2017-03-09 16:16                             ` bfields
  0 siblings, 1 reply; 16+ messages in thread
From: hch @ 2017-03-09 15:35 UTC (permalink / raw)
  To: bfields; +Cc: Trond Myklebust, hch, linux-nfs, kolga, linux-fsdevel

[please trim your replies instead of this giant unreadable garbage,
 thanks!]

On Thu, Mar 09, 2017 at 10:29:48AM -0500, bfields@fieldses.org wrote:
> So I don't understand why it needed to be added to copy_file_range().
> The copy and clone semantics are different enough that I think callers
> want to know which they're getting.

Because if a file systems implements clone is literally is always better
than doing a copy loop, so using it is an absolute non-brainer.

> Don't we want it to have more or less the same behavior as a read-write
> loop?  People are probably running backup programs that depend on just
> simple copies, and maybe the results are good enough for their purposes,
> or maybe they're actually corrupting parts of their backups and don't
> know, but we can't suddenly start aborting their backups with errors and
> tell users it's for their own good.  So copy_file_range() callers will
> need to handle EINVAL on changing files somehow.

They do, and the system call has been in the tree for almost a year and
a half, so we can't simply change it.  Fortunately we do have a flags
argument that can be used to implement your preferred semantics if you
care deeply enough about them.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-09 15:35                           ` hch
@ 2017-03-09 16:16                             ` bfields
  2017-03-09 16:17                               ` hch
  2017-03-09 17:35                               ` Olga Kornievskaia
  0 siblings, 2 replies; 16+ messages in thread
From: bfields @ 2017-03-09 16:16 UTC (permalink / raw)
  To: hch; +Cc: Trond Myklebust, linux-nfs, kolga, linux-fsdevel

On Thu, Mar 09, 2017 at 07:35:59AM -0800, hch@infradead.org wrote:
> On Thu, Mar 09, 2017 at 10:29:48AM -0500, bfields@fieldses.org wrote:
> > So I don't understand why it needed to be added to copy_file_range().
> > The copy and clone semantics are different enough that I think callers
> > want to know which they're getting.
> 
> Because if a file systems implements clone is literally is always better
> than doing a copy loop, so using it is an absolute non-brainer.

I guess I'm just hung up on the EINVAL vs. short copy behavior.  It
seems more annoying and error-prone to be prepared for both, as opposed
to trying clone and then explicitly falling back to copy if that doesn't
work.  Maybe it's not that big a deal.

> They do, and the system call has been in the tree for almost a year and
> a half, so we can't simply change it.  Fortunately we do have a flags
> argument that can be used to implement your preferred semantics if you
> care deeply enough about them.

Yeah.

There are also some other offset, length combinations that currently
return -EINVAL, I wonder if any of those could be repurposed e.g. for a
"keep copying to end of file" call.

--b.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-09 16:16                             ` bfields
@ 2017-03-09 16:17                               ` hch
  2017-03-09 17:28                                 ` Olga Kornievskaia
  2017-03-09 17:35                               ` Olga Kornievskaia
  1 sibling, 1 reply; 16+ messages in thread
From: hch @ 2017-03-09 16:17 UTC (permalink / raw)
  To: bfields; +Cc: hch, Trond Myklebust, linux-nfs, kolga, linux-fsdevel

On Thu, Mar 09, 2017 at 11:16:01AM -0500, bfields@fieldses.org wrote:
> I guess I'm just hung up on the EINVAL vs. short copy behavior.  It
> seems more annoying and error-prone to be prepared for both, as opposed
> to trying clone and then explicitly falling back to copy if that doesn't
> work.  Maybe it's not that big a deal.

We can do short copies^H^H^H^H^Hclones for clone just as easily,
at least for local filesystems (NFS would require some tweaks due to the
protocol).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-09 16:17                               ` hch
@ 2017-03-09 17:28                                 ` Olga Kornievskaia
  2017-03-09 18:40                                   ` bfields
  2017-03-09 21:55                                   ` hch
  0 siblings, 2 replies; 16+ messages in thread
From: Olga Kornievskaia @ 2017-03-09 17:28 UTC (permalink / raw)
  To: hch; +Cc: bfields, Trond Myklebust, linux-nfs, linux-fsdevel


> On Mar 9, 2017, at 11:17 AM, hch@infradead.org wrote:
> 
> On Thu, Mar 09, 2017 at 11:16:01AM -0500, bfields@fieldses.org wrote:
>> I guess I'm just hung up on the EINVAL vs. short copy behavior.  It
>> seems more annoying and error-prone to be prepared for both, as opposed
>> to trying clone and then explicitly falling back to copy if that doesn't
>> work.  Maybe it's not that big a deal.
> 
> We can do short copies^H^H^H^H^Hclones for clone just as easily,
> at least for local filesystems (NFS would require some tweaks due to the
> protocol).

I’m confused by the wording of “we can do … easily” . Is “can” = in the future? Currently, testing copy_file_range() on a btfs with argument of offset+len beyond the end of the file fail with EINVAL. Is NFS tweaking = revert the “MUST” in the spec for the check? 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-09 16:16                             ` bfields
  2017-03-09 16:17                               ` hch
@ 2017-03-09 17:35                               ` Olga Kornievskaia
  1 sibling, 0 replies; 16+ messages in thread
From: Olga Kornievskaia @ 2017-03-09 17:35 UTC (permalink / raw)
  To: bfields; +Cc: hch, Trond Myklebust, linux-nfs, linux-fsdevel


> On Mar 9, 2017, at 11:16 AM, bfields@fieldses.org wrote:
> 
> On Thu, Mar 09, 2017 at 07:35:59AM -0800, hch@infradead.org wrote:
>> On Thu, Mar 09, 2017 at 10:29:48AM -0500, bfields@fieldses.org wrote:
>>> So I don't understand why it needed to be added to copy_file_range().
>>> The copy and clone semantics are different enough that I think callers
>>> want to know which they're getting.
>> 
>> Because if a file systems implements clone is literally is always better
>> than doing a copy loop, so using it is an absolute non-brainer.
> 
> I guess I'm just hung up on the EINVAL vs. short copy behavior.  It
> seems more annoying and error-prone to be prepared for both, as opposed
> to trying clone and then explicitly falling back to copy if that doesn't
> work.  Maybe it's not that big a deal.

I’m confused at which layer are we calling clone() and if we get EINVAL we call copy()? In VFS? Or are we talking about the case where there are two different system calls clone_file_range() and copy_file_range() that are provided to the “cp” and it calls one and then falls back onto another?

> 
>> They do, and the system call has been in the tree for almost a year and
>> a half, so we can't simply change it.  Fortunately we do have a flags
>> argument that can be used to implement your preferred semantics if you
>> care deeply enough about them.
> 
> Yeah.
> 
> There are also some other offset, length combinations that currently
> return -EINVAL, I wonder if any of those could be repurposed e.g. for a
> "keep copying to end of file" call.
> 
> --b.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-09 17:28                                 ` Olga Kornievskaia
@ 2017-03-09 18:40                                   ` bfields
  2017-03-09 21:55                                   ` hch
  1 sibling, 0 replies; 16+ messages in thread
From: bfields @ 2017-03-09 18:40 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: hch, Trond Myklebust, linux-nfs, linux-fsdevel

On Thu, Mar 09, 2017 at 12:28:03PM -0500, Olga Kornievskaia wrote:
> 
> > On Mar 9, 2017, at 11:17 AM, hch@infradead.org wrote:
> > 
> > On Thu, Mar 09, 2017 at 11:16:01AM -0500, bfields@fieldses.org
> > wrote:
> >> I guess I'm just hung up on the EINVAL vs. short copy behavior.  It
> >> seems more annoying and error-prone to be prepared for both, as
> >> opposed to trying clone and then explicitly falling back to copy if
> >> that doesn't work.  Maybe it's not that big a deal.
> > 
> > We can do short copies^H^H^H^H^Hclones for clone just as easily, at
> > least for local filesystems (NFS would require some tweaks due to
> > the protocol).
> 
> I’m confused by the wording of “we can do … easily” . Is “can” = in
> the future? Currently, testing copy_file_range() on a btfs with
> argument of offset+len beyond the end of the file fail with EINVAL. Is
> NFS tweaking = revert the “MUST” in the spec for the check? 

Checking https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-41
... looks like the CLONE result doesn't even have a length.

Most users are probably cloning the whole file, so I guess it only
matters in the case the file's changing (so that the size returned from
stat might not be correct by the time you do the clone).

Allowing short copies would be one way to handle that--I think it'd work
then to just always request a clone to the maximum length.

Alternatively if the linux interface just had a way to say "clone to end
of file", that'd solve most of the problem and be a nice fit for the
existing NFS protocol (which special-cases length 0 to mean "to end of
file").

Maybe we could special-case the maximum size_t to mean that.  (What is
that, 2^63-1?.)

--b.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v1 01/19] fs: Don't copy beyond the end of the file
  2017-03-09 17:28                                 ` Olga Kornievskaia
  2017-03-09 18:40                                   ` bfields
@ 2017-03-09 21:55                                   ` hch
  1 sibling, 0 replies; 16+ messages in thread
From: hch @ 2017-03-09 21:55 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: hch, bfields, Trond Myklebust, linux-nfs, linux-fsdevel

On Thu, Mar 09, 2017 at 12:28:03PM -0500, Olga Kornievskaia wrote:
> I’m confused by the wording of “we can do … easily” . Is “can” = in the future? Currently, testing copy_file_range() on a btfs with argument of offset+len beyond the end of the file fail with EINVAL. Is NFS tweaking = revert the “MUST” in the spec for the check? 

The current clone and copy documents (which are our specs) require
this behavior, so that is expected.  But if we decided we want a variant
of copy / clone that doesn't do this it would be very easy to implement
in the local file systems.  Only for NFS we'd get a failure back and
would have to retry the clone based on the updated file size.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-03-09 22:52 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20170302160123.30375-1-kolga@netapp.com>
     [not found] ` <20170302160123.30375-2-kolga@netapp.com>
     [not found]   ` <20170302162221.GA6854@infradead.org>
     [not found]     ` <20170303204747.GE13877@fieldses.org>
     [not found]       ` <4B2A2E86-AFC8-49EA-9D53-7A53AD824CF1@netapp.com>
     [not found]         ` <20170303213230.GF13877@fieldses.org>
     [not found]           ` <B3F80DA0-B4F8-4628-88C5-E5C047620F17@netapp.com>
     [not found]             ` <20170304021008.GB21609@fieldses.org>
     [not found]               ` <924FF7A2-27CD-4848-BD61-748758C2533F@netapp.com>
2017-03-06 19:23                 ` [RFC v1 01/19] Don't copy beyond the end of the file J. Bruce Fields
2017-03-07 14:18                   ` Olga Kornievskaia
     [not found]       ` <20170307234051.GA29977@infradead.org>
     [not found]         ` <20170308170521.GA1020@fieldses.org>
     [not found]           ` <20170308172549.GA32011@infradead.org>
     [not found]             ` <7FDA8E80-3C62-48BB-9E2B-195B4BA340C0@netapp.com>
2017-03-08 19:53               ` [RFC v1 01/19] fs: " J. Bruce Fields
2017-03-08 20:00                 ` Olga Kornievskaia
2017-03-08 20:18                   ` J. Bruce Fields
2017-03-08 20:18                   ` Trond Myklebust
2017-03-08 20:32                     ` bfields
2017-03-08 20:49                       ` Trond Myklebust
2017-03-09 15:29                         ` bfields
2017-03-09 15:35                           ` hch
2017-03-09 16:16                             ` bfields
2017-03-09 16:17                               ` hch
2017-03-09 17:28                                 ` Olga Kornievskaia
2017-03-09 18:40                                   ` bfields
2017-03-09 21:55                                   ` hch
2017-03-09 17:35                               ` Olga Kornievskaia

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).