linux-kernel.vger.kernel.org archive mirror
* New copyfile system call - discuss before LSF?
@ 2013-02-21 11:37 Ric Wheeler
  2013-02-21 13:37 ` Hannes Reinecke
  2013-02-21 13:51 ` Myklebust, Trond
  0 siblings, 2 replies; 56+ messages in thread
From: Ric Wheeler @ 2013-02-21 11:37 UTC (permalink / raw)
  To: Linux FS Devel, linux-kernel, Chris L. Mason, Christoph Hellwig,
	Alexander Viro, Martin K. Petersen, Hannes Reinecke


We have debated the need to have a system call to allow for offloading copy 
operations, for example to an NFS server (part of the new NFS 4.2 
specification), SCSI target device (two different SCSI commands do this), local 
file systems (reflink, etc) and I suspect many other possible parts of the stack 
could implement this.

The earliest discussion of such a system call I saw happened back in 2001. I 
know we had another more recent flurry (2-3 years back?) as well that got 
tangled up and died away.

Given the new popularity of this in storage devices and the use case for virt 
guests, any chance to get a proposal floated this year that might be able to 
land upstream in our lifetimes :) ?

Ric




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-21 11:37 New copyfile system call - discuss before LSF? Ric Wheeler
@ 2013-02-21 13:37 ` Hannes Reinecke
  2013-02-21 13:51 ` Myklebust, Trond
  1 sibling, 0 replies; 56+ messages in thread
From: Hannes Reinecke @ 2013-02-21 13:37 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Linux FS Devel, linux-kernel, Chris L. Mason, Christoph Hellwig,
	Alexander Viro, Martin K. Petersen

On 02/21/2013 12:37 PM, Ric Wheeler wrote:
>
> We have debated the need to have a system call to allow for
> offloading copy operations, for example to an NFS server (part to
> the new NFS 4.2 specification), SCSI target device (two different
> SCSI commands do this), local file systems (reflink, etc) and I
> suspect many other possible parts of the stack could implement this.
>
> The earliest discussion of such a system call I saw happened back in
> 2001, I know we had another more recent flurry (2-3 years back?) as
> well that got tangled up and died away.
>
Yeah, I remember. I talked to Mkp about it, who (as usual :-) had a 
patchset stashed away for this.
Or a preliminary attempt, anyway.
However, this was waiting for the DISCARD merging patches to go in, 
which in turn were waiting for the WRITE SAME patches IIRC.

Or something.

Martin?

> Given the new popularity of this in storage devices and the use case
> for virt guests, any chance to get a proposal floated this year that
> might be able to land upstream in our life times :) ?
>
Oh, most definitely.
Now that I finally have an array capable of doing ROD token copy
we should be reevaluating things.

I'll see to getting the sg_xcopy program updated to do ROD copy; then we 
will have some real-world data.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


* Re: New copyfile system call - discuss before LSF?
  2013-02-21 11:37 New copyfile system call - discuss before LSF? Ric Wheeler
  2013-02-21 13:37 ` Hannes Reinecke
@ 2013-02-21 13:51 ` Myklebust, Trond
  2013-02-21 14:57   ` Ric Wheeler
  2013-02-21 18:29   ` Jeremy Allison
  1 sibling, 2 replies; 56+ messages in thread
From: Myklebust, Trond @ 2013-02-21 13:51 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Linux FS Devel, linux-kernel, Chris L. Mason, Christoph Hellwig,
	Alexander Viro, Martin K. Petersen, Hannes Reinecke

On Thu, 2013-02-21 at 12:37 +0100, Ric Wheeler wrote:
> We have debated the need to have a system call to allow for offloading copy 
> operations, for example to an NFS server (part to the new NFS 4.2 
> specification), SCSI target device (two different SCSI commands do this), local 
> file systems (reflink, etc) and I suspect many other possible parts of the stack 
> could implement this.

sendfile64() pretty much already has the right arguments for a
"copyfile", however it would be nice to add a 'flags' parameter: the
NFSv4.2 version would use that to specify whether or not to copy file
metadata.
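
As an illustration of the shape being suggested here, the following is a minimal userspace sketch, not the proposed system call itself. The name copyfile_sketch() and the COPYFILE_SKETCH_METADATA flag are hypothetical; the sketch simply takes sendfile()'s arguments plus 'flags', rejects any non-zero flag, and relies on Linux sendfile(2) accepting a regular file as out_fd (which it does on reasonably recent kernels). On 64-bit Linux, sendfile() is already the sendfile64() interface.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/sendfile.h>
#include <sys/types.h>

/* Hypothetical flag, named here only for illustration. */
#define COPYFILE_SKETCH_METADATA 0x1

/*
 * Hypothetical copyfile_sketch(): sendfile()'s arguments plus 'flags'.
 * Copies up to 'count' bytes from in_fd, starting at *offset, to the
 * current file position of out_fd. *offset is advanced by sendfile().
 * Returns the number of bytes copied, or -1 with errno set.
 */
ssize_t copyfile_sketch(int out_fd, int in_fd, off_t *offset,
                        size_t count, int flags)
{
    size_t done = 0;

    if (flags != 0) {            /* no flag is implemented in this sketch */
        errno = EINVAL;
        return -1;
    }
    while (done < count) {
        ssize_t n = sendfile(out_fd, in_fd, offset, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted before copying: retry */
            return done ? (ssize_t)done : -1;
        }
        if (n == 0)              /* end of the input file */
            break;
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Note that the only destination offset available in this shape is out_fd's own file position, which is the limitation raised later in the thread.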

> The earliest discussion of such a system call I saw happened back in 2001, I 
> know we had another more recent flurry (2-3 years back?) as well that got 
> tangled up and died away.
>
> Given the new popularity of this in storage devices and the use case for virt 
> guests, any chance to get a proposal floated this year that might be able to 
> land upstream in our life times :) ?

I'm planning on soon dusting off the NFS prototype that NetApp wrote 3
years ago and converting at least the client implementation into
something that can go upstream. We do also have a server prototype for
Linux, but the copy offload between 2 different servers is a hack and
would need significant work.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


* Re: New copyfile system call - discuss before LSF?
  2013-02-21 13:51 ` Myklebust, Trond
@ 2013-02-21 14:57   ` Ric Wheeler
  2013-02-21 16:36     ` Andreas Dilger
  2013-02-21 20:00     ` Paolo Bonzini
  2013-02-21 18:29   ` Jeremy Allison
  1 sibling, 2 replies; 56+ messages in thread
From: Ric Wheeler @ 2013-02-21 14:57 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Ric Wheeler, Linux FS Devel, linux-kernel, Chris L. Mason,
	Christoph Hellwig, Alexander Viro, Martin K. Petersen,
	Hannes Reinecke, Joel Becker

On 02/21/2013 02:51 PM, Myklebust, Trond wrote:
> On Thu, 2013-02-21 at 12:37 +0100, Ric Wheeler wrote:
>> We have debated the need to have a system call to allow for offloading copy
>> operations, for example to an NFS server (part to the new NFS 4.2
>> specification), SCSI target device (two different SCSI commands do this), local
>> file systems (reflink, etc) and I suspect many other possible parts of the stack
>> could implement this.
> sendfile64() pretty much already has the right arguments for a
> "copyfile", however it would be nice to add a 'flags' parameter: the
> NFSv4.2 version would use that to specify whether or not to copy file
> metadata.

That would seem to be enough to me and has the advantage that it is a 
relatively obvious extension to something that is at least not totally unknown 
to developers.

Do we need more than that for non-NFS paths I wonder? What does reflink need or 
the SCSI mechanism?

>
>> The earliest discussion of such a system call I saw happened back in 2001, I
>> know we had another more recent flurry (2-3 years back?) as well that got
>> tangled up and died away.
>>
>> Given the new popularity of this in storage devices and the use case for virt
>> guests, any chance to get a proposal floated this year that might be able to
>> land upstream in our life times :) ?
> I'm planning on soon dusting off the NFS prototype that NetApp wrote 3
> years ago and converting at least the client implementation into
> something that can go upstream. We do also have a server prototype for
> Linux, but the copy offload between 2 different servers is a hack and
> would need significant work.
>

That would be really interesting, thanks!




* Re: New copyfile system call - discuss before LSF?
  2013-02-21 14:57   ` Ric Wheeler
@ 2013-02-21 16:36     ` Andreas Dilger
  2013-02-21 20:00     ` Paolo Bonzini
  1 sibling, 0 replies; 56+ messages in thread
From: Andreas Dilger @ 2013-02-21 16:36 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Myklebust, Trond, Linux FS Devel, linux-kernel, Chris L. Mason,
	Christoph Hellwig, Alexander Viro, Martin K. Petersen,
	Hannes Reinecke, Joel Becker

On 2013-02-21, at 7:57 AM, Ric Wheeler wrote:
> On 02/21/2013 02:51 PM, Myklebust, Trond wrote:
>> On Thu, 2013-02-21 at 12:37 +0100, Ric Wheeler wrote:
>>> We have debated the need to have a system call to allow for offloading copy
>>> operations, for example to an NFS server (part to the new NFS 4.2
>>> specification), SCSI target device (two different SCSI commands do this), local
>>> file systems (reflink, etc) and I suspect many other possible parts of the stack
>>> could implement this.
>> sendfile64() pretty much already has the right arguments for a
>> "copyfile", however it would be nice to add a 'flags' parameter: the
>> NFSv4.2 version would use that to specify whether or not to copy file
>> metadata.
> 
> That would seem to be enough to me and has the advantage that it is an relatively obvious extension to something that is at least not totally unknown to developers.
> 
> Do we need more than that for non-NFS paths I wonder? What does reflink need or the SCSI mechanism?

IMHO, the critical part about a copy syscall is avoiding the data
copy to/from userspace.  Copying file attributes opens up a huge
morass of issues related to which attrs/xattrs/ACLs are copied,
yet those don't cost nearly so much as the data copies.

We definitely want the API to be flexible enough to do server-side
copies (e.g. NFS and CIFS), but we also need to allow data copies
for regular files between different local and/or network filesystems
within the VFS.
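
For reference, the data copy to/from userspace mentioned above looks roughly like the loop below when an application copies a file by hand. This is only an illustrative sketch: copy_via_userspace() is a hypothetical name, and the 64 KiB buffer size is an arbitrary assumption. Every byte crosses the kernel/user boundary twice, which is the cost a copy syscall would avoid.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to 'count' bytes from in_fd to out_fd
 * through a userspace bounce buffer, using the descriptors' current
 * file positions. Returns bytes copied, or -1 with errno set.
 */
ssize_t copy_via_userspace(int out_fd, int in_fd, size_t count)
{
    char buf[65536];
    size_t done = 0;

    while (done < count) {
        size_t want = count - done;
        if (want > sizeof buf)
            want = sizeof buf;

        /* Copy 1: kernel page cache -> userspace buffer. */
        ssize_t r = read(in_fd, buf, want);
        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)              /* end of the input file */
            break;

        /* Copy 2: userspace buffer -> kernel, handling short writes. */
        for (ssize_t w = 0; w < r; ) {
            ssize_t n = write(out_fd, buf + w, (size_t)(r - w));
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            w += n;
        }
        done += (size_t)r;
    }
    return (ssize_t)done;
}
```

A loop of this kind works between any two filesystems, so it is also a plausible model for the generic fallback when no offload is available.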

Cheers, Andreas

>>> The earliest discussion of such a system call I saw happened back in 2001, I
>>> know we had another more recent flurry (2-3 years back?) as well that got
>>> tangled up and died away.
>>> 
>>> Given the new popularity of this in storage devices and the use case for virt
>>> guests, any chance to get a proposal floated this year that might be able to
>>> land upstream in our life times :) ?
>> I'm planning on soon dusting off the NFS prototype that NetApp wrote 3
>> years ago and converting at least the client implementation into
>> something that can go upstream. We do also have a server prototype for
>> Linux, but the copy offload between 2 different servers is a hack and
>> would need significant work.
>> 
> 
> That would be really interesting, thanks!
> 
> 



* Re: New copyfile system call - discuss before LSF?
  2013-02-21 13:51 ` Myklebust, Trond
  2013-02-21 14:57   ` Ric Wheeler
@ 2013-02-21 18:29   ` Jeremy Allison
  2013-02-22  0:29     ` Eric Wong
  1 sibling, 1 reply; 56+ messages in thread
From: Jeremy Allison @ 2013-02-21 18:29 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Ric Wheeler, Linux FS Devel, linux-kernel, Chris L. Mason,
	Christoph Hellwig, Alexander Viro, Martin K. Petersen,
	Hannes Reinecke

On Thu, Feb 21, 2013 at 01:51:53PM +0000, Myklebust, Trond wrote:
> On Thu, 2013-02-21 at 12:37 +0100, Ric Wheeler wrote:
> > We have debated the need to have a system call to allow for offloading copy 
> > operations, for example to an NFS server (part to the new NFS 4.2 
> > specification), SCSI target device (two different SCSI commands do this), local 
> > file systems (reflink, etc) and I suspect many other possible parts of the stack 
> > could implement this.
> 
> sendfile64() pretty much already has the right arguments for a
> "copyfile", however it would be nice to add a 'flags' parameter: the
> NFSv4.2 version would use that to specify whether or not to copy file
> metadata.

What would be really nice is if sendfile allowed zero-copy
from network socket to a file descriptor. That would help
a *lot* of my small system OEMs (and no splice() just doesn't
cut it :-).

Jeremy.


* Re: New copyfile system call - discuss before LSF?
  2013-02-21 14:57   ` Ric Wheeler
  2013-02-21 16:36     ` Andreas Dilger
@ 2013-02-21 20:00     ` Paolo Bonzini
  2013-02-21 20:50       ` Myklebust, Trond
  2013-02-21 22:05       ` Ric Wheeler
  1 sibling, 2 replies; 56+ messages in thread
From: Paolo Bonzini @ 2013-02-21 20:00 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Myklebust, Trond, Linux FS Devel, linux-kernel, Chris L. Mason,
	Christoph Hellwig, Alexander Viro, Martin K. Petersen,
	Hannes Reinecke, Joel Becker

On 21/02/2013 15:57, Ric Wheeler wrote:
>>>
>> sendfile64() pretty much already has the right arguments for a
>> "copyfile", however it would be nice to add a 'flags' parameter: the
>> NFSv4.2 version would use that to specify whether or not to copy file
>> metadata.
> 
> That would seem to be enough to me and has the advantage that it is an
> relatively obvious extension to something that is at least not totally
> unknown to developers.
> 
> Do we need more than that for non-NFS paths I wonder? What does reflink
> need or the SCSI mechanism?

For virt we would like to be able to specify arbitrary block ranges.
Copying an entire file helps some copy operations like storage
migration.  However, it is not enough to convert the guest's offloaded
copies to host-side offloaded copies.

Paolo


* Re: New copyfile system call - discuss before LSF?
  2013-02-21 20:00     ` Paolo Bonzini
@ 2013-02-21 20:50       ` Myklebust, Trond
  2013-02-21 22:24         ` Zach Brown
  2013-02-21 22:05       ` Ric Wheeler
  1 sibling, 1 reply; 56+ messages in thread
From: Myklebust, Trond @ 2013-02-21 20:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ric Wheeler, Linux FS Devel, linux-kernel, Chris L. Mason,
	Christoph Hellwig, Alexander Viro, Martin K. Petersen,
	Hannes Reinecke, Joel Becker

On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:
> On 21/02/2013 15:57, Ric Wheeler wrote:
> >>>
> >> sendfile64() pretty much already has the right arguments for a
> >> "copyfile", however it would be nice to add a 'flags' parameter: the
> >> NFSv4.2 version would use that to specify whether or not to copy file
> >> metadata.
> > 
> > That would seem to be enough to me and has the advantage that it is an
> > relatively obvious extension to something that is at least not totally
> > unknown to developers.
> > 
> > Do we need more than that for non-NFS paths I wonder? What does reflink
> > need or the SCSI mechanism?
> 
> For virt we would like to be able to specify arbitrary block ranges.
> Copying an entire file helps some copy operations like storage
> migration.  However, it is not enough to convert the guest's offloaded
> copies to host-side offloaded copies.

So how would a system call based on sendfile64() plus my flag parameter
prevent an underlying implementation from meeting your criterion?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


* Re: New copyfile system call - discuss before LSF?
  2013-02-21 20:00     ` Paolo Bonzini
  2013-02-21 20:50       ` Myklebust, Trond
@ 2013-02-21 22:05       ` Ric Wheeler
  2013-02-21 22:13         ` Myklebust, Trond
  1 sibling, 1 reply; 56+ messages in thread
From: Ric Wheeler @ 2013-02-21 22:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Myklebust, Trond, Linux FS Devel, linux-kernel, Chris L. Mason,
	Christoph Hellwig, Alexander Viro, Martin K. Petersen,
	Hannes Reinecke, Joel Becker

On 02/21/2013 09:00 PM, Paolo Bonzini wrote:
> On 21/02/2013 15:57, Ric Wheeler wrote:
>>> sendfile64() pretty much already has the right arguments for a
>>> "copyfile", however it would be nice to add a 'flags' parameter: the
>>> NFSv4.2 version would use that to specify whether or not to copy file
>>> metadata.
>> That would seem to be enough to me and has the advantage that it is an
>> relatively obvious extension to something that is at least not totally
>> unknown to developers.
>>
>> Do we need more than that for non-NFS paths I wonder? What does reflink
>> need or the SCSI mechanism?
> For virt we would like to be able to specify arbitrary block ranges.
> Copying an entire file helps some copy operations like storage
> migration.  However, it is not enough to convert the guest's offloaded
> copies to host-side offloaded copies.
>
> Paolo

I don't think that the NFS protocol allows arbitrary ranges, but the SCSI 
commands are range-based.

If I remember what the Windows people said at a SNIA event a few years back, 
they have a requirement that the target file be pre-allocated (at least for the 
SCSI based copy). Not clear to me where they iterate over that target file to do 
the block range copies, but I suspect it is in their kernel.

Ric



* Re: New copyfile system call - discuss before LSF?
  2013-02-21 22:05       ` Ric Wheeler
@ 2013-02-21 22:13         ` Myklebust, Trond
  2013-02-22  8:47           ` Ric Wheeler
  0 siblings, 1 reply; 56+ messages in thread
From: Myklebust, Trond @ 2013-02-21 22:13 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Paolo Bonzini, Linux FS Devel, linux-kernel, Chris L. Mason,
	Christoph Hellwig, Alexander Viro, Martin K. Petersen,
	Hannes Reinecke, Joel Becker

On Thu, 2013-02-21 at 23:05 +0100, Ric Wheeler wrote:
> On 02/21/2013 09:00 PM, Paolo Bonzini wrote:
> > On 21/02/2013 15:57, Ric Wheeler wrote:
> >>> sendfile64() pretty much already has the right arguments for a
> >>> "copyfile", however it would be nice to add a 'flags' parameter: the
> >>> NFSv4.2 version would use that to specify whether or not to copy file
> >>> metadata.
> >> That would seem to be enough to me and has the advantage that it is an
> >> relatively obvious extension to something that is at least not totally
> >> unknown to developers.
> >>
> >> Do we need more than that for non-NFS paths I wonder? What does reflink
> >> need or the SCSI mechanism?
> > For virt we would like to be able to specify arbitrary block ranges.
> > Copying an entire file helps some copy operations like storage
> > migration.  However, it is not enough to convert the guest's offloaded
> > copies to host-side offloaded copies.
> >
> > Paolo
> 
> I don't think that the NFS protocol allows arbitrary ranges, but the SCSI 
> commands are ranged based.
> 
> If I remember what the windows people said at a SNIA event a few years back, 
> they have a requirement that the target file be pre-allocated (at least for the 
> SCSI based copy). Not clear to me where they iterate over that target file to do 
> the block range copies, but I suspect it is in their kernel.

The NFSv4.2 copy offload protocol _does_ allow the copying of arbitrary
byte ranges. The main target for that functionality is indeed
virtualisation and thin provisioning of virtual machines.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


* Re: New copyfile system call - discuss before LSF?
  2013-02-21 20:50       ` Myklebust, Trond
@ 2013-02-21 22:24         ` Zach Brown
  2013-02-22  1:29           ` Myklebust, Trond
                             ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Zach Brown @ 2013-02-21 22:24 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Paolo Bonzini, Ric Wheeler, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Thu, Feb 21, 2013 at 08:50:27PM +0000, Myklebust, Trond wrote:
> On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:
> > On 21/02/2013 15:57, Ric Wheeler wrote:
> > >>>
> > >> sendfile64() pretty much already has the right arguments for a
> > >> "copyfile", however it would be nice to add a 'flags' parameter: the
> > >> NFSv4.2 version would use that to specify whether or not to copy file
> > >> metadata.
> > > 
> > > That would seem to be enough to me and has the advantage that it is an
> > > relatively obvious extension to something that is at least not totally
> > > unknown to developers.
> > > 
> > > Do we need more than that for non-NFS paths I wonder? What does reflink
> > > need or the SCSI mechanism?
> > 
> > For virt we would like to be able to specify arbitrary block ranges.
> > Copying an entire file helps some copy operations like storage
> > migration.  However, it is not enough to convert the guest's offloaded
> > copies to host-side offloaded copies.
> 
> So how would a system call based on sendfile64() plus my flag parameter
> prevent an underlying implementation from meeting your criterion?

If I'm guessing correctly, sendfile64()+flags would be annoying because
it's missing an out_fd_offset.  The host will want to offload the
guest's copies by calling sendfile on block ranges of a guest disk image
file that correspond to the mappings of the in and out files in the
guest.

You could make it work with some locking and out_fd seeking to set the
write offset before calling sendfile64()+flags, but ugh.

 ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
                  out_offset, size_t count, int flags);

That seems closer.
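
To make the difference concrete, here is a minimal userspace sketch of the prototype above. It is an emulation under stated assumptions, not an offloaded implementation: psendfile_sketch() is a hypothetical name, 'flags' must be zero, and the copy is done with pread()/pwrite() through a bounce buffer. Because both offsets are explicit, neither descriptor's file position is read or modified, so no locking or seeking is needed around the call.

```c
#define _POSIX_C_SOURCE 200809L  /* pread()/pwrite() under strict -std= modes */
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical psendfile_sketch(): copy up to 'count' bytes from
 * in_fd at in_offset to out_fd at out_offset. File positions of both
 * descriptors are left untouched. Returns bytes copied, or -1.
 */
ssize_t psendfile_sketch(int out_fd, int in_fd, off_t in_offset,
                         off_t out_offset, size_t count, int flags)
{
    char buf[65536];
    size_t done = 0;

    if (flags != 0) {            /* no flag is implemented in this sketch */
        errno = EINVAL;
        return -1;
    }
    while (done < count) {
        size_t want = count - done;
        if (want > sizeof buf)
            want = sizeof buf;

        ssize_t r = pread(in_fd, buf, want, in_offset + (off_t)done);
        if (r < 0) {
            if (errno == EINTR)
                continue;
            return done ? (ssize_t)done : -1;
        }
        if (r == 0)              /* end of the input file */
            break;

        ssize_t w = 0;
        while (w < r) {          /* handle short writes */
            ssize_t n = pwrite(out_fd, buf + w, (size_t)(r - w),
                               out_offset + (off_t)done + w);
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            w += n;
        }
        done += (size_t)r;
    }
    return (ssize_t)done;
}
```

For comparison, the copy_file_range(2) call that Linux later gained (in 4.5) also takes an offset for each descriptor plus a flags argument.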

We might also want to pre-emptively offer iovs instead of offsets,
because that's the very first thing that's going to be requested after
people prototype having to iterate calling sendfile() for each
contiguous copy region. 

- z


* Re: New copyfile system call - discuss before LSF?
  2013-02-21 18:29   ` Jeremy Allison
@ 2013-02-22  0:29     ` Eric Wong
  0 siblings, 0 replies; 56+ messages in thread
From: Eric Wong @ 2013-02-22  0:29 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Myklebust, Trond, Ric Wheeler, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke

Jeremy Allison <jra@samba.org> wrote:
> On Thu, Feb 21, 2013 at 01:51:53PM +0000, Myklebust, Trond wrote:
> > On Thu, 2013-02-21 at 12:37 +0100, Ric Wheeler wrote:
> > > We have debated the need to have a system call to allow for offloading copy 
> > > operations, for example to an NFS server (part to the new NFS 4.2 
> > > specification), SCSI target device (two different SCSI commands do this), local 
> > > file systems (reflink, etc) and I suspect many other possible parts of the stack 
> > > could implement this.
> > 
> > sendfile64() pretty much already has the right arguments for a
> > "copyfile", however it would be nice to add a 'flags' parameter: the
> > NFSv4.2 version would use that to specify whether or not to copy file
> > metadata.
> 
> What would be really nice is if sendfile allowed zero-copy
> from network socket to a file descriptor. That would help
> a *lot* of my small system OEMs (and no splice() just doesn't
> cut it :-).

I've often wished the pipe requirement of splice() could be dropped,
to allow copying between arbitrary FDs.  Perhaps this can be done?


* RE: New copyfile system call - discuss before LSF?
  2013-02-21 22:24         ` Zach Brown
@ 2013-02-22  1:29           ` Myklebust, Trond
  2013-02-23  0:32             ` Eric Wong
  2013-02-22  9:47           ` Paolo Bonzini
  2013-02-25 21:14           ` Andy Lutomirski
  2 siblings, 1 reply; 56+ messages in thread
From: Myklebust, Trond @ 2013-02-22  1:29 UTC (permalink / raw)
  To: Zach Brown
  Cc: Paolo Bonzini, Ric Wheeler, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

> -----Original Message-----
> From: Zach Brown [mailto:zab@redhat.com]
> Sent: Thursday, February 21, 2013 5:25 PM
> To: Myklebust, Trond
> Cc: Paolo Bonzini; Ric Wheeler; Linux FS Devel; linux-kernel@vger.kernel.org;
> Chris L. Mason; Christoph Hellwig; Alexander Viro; Martin K. Petersen;
> Hannes Reinecke; Joel Becker
> Subject: Re: New copyfile system call - discuss before LSF?
> 
> On Thu, Feb 21, 2013 at 08:50:27PM +0000, Myklebust, Trond wrote:
> > On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:
> > > On 21/02/2013 15:57, Ric Wheeler wrote:
> > > >>>
> > > >> sendfile64() pretty much already has the right arguments for a
> > > >> "copyfile", however it would be nice to add a 'flags' parameter:
> > > >> the
> > > >> NFSv4.2 version would use that to specify whether or not to copy
> > > >> file metadata.
> > > >
> > > > That would seem to be enough to me and has the advantage that it
> > > > is an relatively obvious extension to something that is at least
> > > > not totally unknown to developers.
> > > >
> > > > Do we need more than that for non-NFS paths I wonder? What does
> > > > reflink need or the SCSI mechanism?
> > >
> > > For virt we would like to be able to specify arbitrary block ranges.
> > > Copying an entire file helps some copy operations like storage
> > > migration.  However, it is not enough to convert the guest's
> > > offloaded copies to host-side offloaded copies.
> >
> > So how would a system call based on sendfile64() plus my flag
> > parameter prevent an underlying implementation from meeting your
> criterion?
> 
> If I'm guessing correctly, sendfile64()+flags would be annoying because it's
> missing an out_fd_offset.  The host will want to offload the guest's copies by
> calling sendfile on block ranges of a guest disk image file that correspond to
> the mappings of the in and out files in the guest.
> 
> You could make it work with some locking and out_fd seeking to set the
> write offset before calling sendfile64()+flags, but ugh.
> 
>  ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
>                   out_offset, size_t count, int flags);
> 
> That seems closer.

psendfile() ?

I fully agree that sounds reasonable... Just being an ass. :-)

> We might also want to pre-emptively offer iovs instead of offsets, because
> that's the very first thing that's going to be requested after people prototype
> having to iterate calling sendfile() for each contiguous copy region.

vpsendfile() then? I agree that might be a little more future-proof. Particularly given that the underlying protocols tend to be fully asynchronous, and so it makes sense to queue up more than one copy at a time...

Cheers,
  Trond


* Re: New copyfile system call - discuss before LSF?
  2013-02-21 22:13         ` Myklebust, Trond
@ 2013-02-22  8:47           ` Ric Wheeler
  0 siblings, 0 replies; 56+ messages in thread
From: Ric Wheeler @ 2013-02-22  8:47 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Ric Wheeler, Paolo Bonzini, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 02/21/2013 11:13 PM, Myklebust, Trond wrote:
> On Thu, 2013-02-21 at 23:05 +0100, Ric Wheeler wrote:
>> On 02/21/2013 09:00 PM, Paolo Bonzini wrote:
>>> On 21/02/2013 15:57, Ric Wheeler wrote:
>>>>> sendfile64() pretty much already has the right arguments for a
>>>>> "copyfile", however it would be nice to add a 'flags' parameter: the
>>>>> NFSv4.2 version would use that to specify whether or not to copy file
>>>>> metadata.
>>>> That would seem to be enough to me and has the advantage that it is an
>>>> relatively obvious extension to something that is at least not totally
>>>> unknown to developers.
>>>>
>>>> Do we need more than that for non-NFS paths I wonder? What does reflink
>>>> need or the SCSI mechanism?
>>> For virt we would like to be able to specify arbitrary block ranges.
>>> Copying an entire file helps some copy operations like storage
>>> migration.  However, it is not enough to convert the guest's offloaded
>>> copies to host-side offloaded copies.
>>>
>>> Paolo
>> I don't think that the NFS protocol allows arbitrary ranges, but the SCSI
>> commands are ranged based.
>>
>> If I remember what the windows people said at a SNIA event a few years back,
>> they have a requirement that the target file be pre-allocated (at least for the
>> SCSI based copy). Not clear to me where they iterate over that target file to do
>> the block range copies, but I suspect it is in their kernel.
> The NFSv4.2 copy offload protocol _does_ allow the copying of arbitrary
> byte ranges. The main target for that functionality is indeed
> virtualisation and thin provisioning of virtual machines.
>

For background, here is a pointer to Fred Knight's SNIA talk on the SCSI support 
for offload:

https://snia.org/sites/default/files2/SDC2011/presentations/monday/FrederickKnight_Storage_Data_Movement_Offload.pdf

and a talk from Spencer Shepler that gives some detail on the NFS spec, 
including the "server side copy" bits:

https://snia.org/sites/default/files2/SDC2011/presentations/wednesday/SpencerShepler_IETF_NFSv4_Working_Group_v4.pdf

The talks both have references to the actual specs for the gory details.

Ric






* Re: New copyfile system call - discuss before LSF?
  2013-02-21 22:24         ` Zach Brown
  2013-02-22  1:29           ` Myklebust, Trond
@ 2013-02-22  9:47           ` Paolo Bonzini
  2013-02-22  9:52             ` Ric Wheeler
  2013-02-25 21:14           ` Andy Lutomirski
  2 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2013-02-22  9:47 UTC (permalink / raw)
  To: Zach Brown
  Cc: Myklebust, Trond, Ric Wheeler, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 21/02/2013 23:24, Zach Brown wrote:
> You could make it work with some locking and out_fd seeking to set the
> write offset before calling sendfile64()+flags, but ugh.
> 
>  ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
>                   out_offset, size_t count, int flags);
> 
> That seems closer.
> 
> We might also want to pre-emptively offer iovs instead of offsets,
> because that's the very first thing that's going to be requested after
> people prototype having to iterate calling sendfile() for each
> contiguous copy region. 

Indeed, I was about to propose that exactly.  So that would be
psendfilev.  I don't think psendfile is useful, and it can be easily
provided at the libc level.
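
A rough userspace sketch of that layering follows. Everything named here is hypothetical: struct copy_range_sketch, psendfilev_sketch() and psendfile_compat() are illustrative names, the thread does not settle on a layout for the vector elements, and the pread()/pwrite() loop merely stands in for whatever offload the kernel would perform. The point is only that the single-range call reduces to a one-element vector.

```c
#define _POSIX_C_SOURCE 200809L  /* pread()/pwrite() under strict -std= modes */
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical vector element: one contiguous region to copy. */
struct copy_range_sketch {
    off_t  in_offset;
    off_t  out_offset;
    size_t len;
};

/* Copy a single range through a bounce buffer; returns bytes copied or -1. */
static ssize_t copy_one_range(int out_fd, int in_fd,
                              const struct copy_range_sketch *r)
{
    char buf[65536];
    size_t done = 0;

    while (done < r->len) {
        size_t want = r->len - done;
        if (want > sizeof buf)
            want = sizeof buf;
        ssize_t n = pread(in_fd, buf, want, r->in_offset + (off_t)done);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0)
            return -1;
        if (n == 0)                  /* end of the input file */
            break;
        for (ssize_t w = 0; w < n; ) {
            ssize_t m = pwrite(out_fd, buf + w, (size_t)(n - w),
                               r->out_offset + (off_t)done + w);
            if (m < 0 && errno == EINTR)
                continue;
            if (m < 0)
                return -1;
            w += m;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Hypothetical vectored call: copy each range in order, return the total. */
ssize_t psendfilev_sketch(int out_fd, int in_fd,
                          const struct copy_range_sketch *ranges,
                          size_t nr_ranges, int flags)
{
    ssize_t total = 0;

    if (flags != 0) {                /* no flag is implemented in this sketch */
        errno = EINVAL;
        return -1;
    }
    for (size_t i = 0; i < nr_ranges; i++) {
        ssize_t n = copy_one_range(out_fd, in_fd, &ranges[i]);
        if (n < 0)
            return total ? total : -1;
        total += n;
        if ((size_t)n < ranges[i].len)
            break;                   /* short copy: stop at this range */
    }
    return total;
}

/* The single-range variant as a thin, libc-style wrapper. */
ssize_t psendfile_compat(int out_fd, int in_fd, off_t in_offset,
                         off_t out_offset, size_t count, int flags)
{
    struct copy_range_sketch one = { in_offset, out_offset, count };
    return psendfilev_sketch(out_fd, in_fd, &one, 1, flags);
}
```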

Paolo


* Re: New copyfile system call - discuss before LSF?
  2013-02-22  9:47           ` Paolo Bonzini
@ 2013-02-22  9:52             ` Ric Wheeler
  2013-02-22 18:22               ` Zach Brown
  0 siblings, 1 reply; 56+ messages in thread
From: Ric Wheeler @ 2013-02-22  9:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Zach Brown, Myklebust, Trond, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 02/22/2013 10:47 AM, Paolo Bonzini wrote:
> On 21/02/2013 23:24, Zach Brown wrote:
>> You could make it work with some locking and out_fd seeking to set the
>> write offset before calling sendfile64()+flags, but ugh.
>>
>>   ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
>>                    out_offset, size_t count, int flags);
>>
>> That seems closer.
>>
>> We might also want to pre-emptively offer iovs instead of offsets,
>> because that's the very first thing that's going to be requested after
>> people prototype having to iterate calling sendfile() for each
>> contiguous copy region.
> Indeed, I was about to propose that exactly.  So that would be
> psendfilev.  I don't think psendfile is useful, and can be easily
> provided at the libc level.
>
> Paolo

This seems to be suspiciously close to a clear consensus on how to move forward 
after many years of spinning our wheels. Anyone want to promote an actual patch 
before we change our collective minds?

Ric



* Re: New copyfile system call - discuss before LSF?
  2013-02-22  9:52             ` Ric Wheeler
@ 2013-02-22 18:22               ` Zach Brown
  2013-02-22 22:48                 ` Myklebust, Trond
  0 siblings, 1 reply; 56+ messages in thread
From: Zach Brown @ 2013-02-22 18:22 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Paolo Bonzini, Myklebust, Trond, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

> This seems to be suspiciously close to a clear consensus on how to
> move forward after many years of spinning our wheels. Anyone want to
> promote an actual patch before we change our collective minds?

It seems like we'd want to start with the existing (presumably
bitrotten) prototypes that Trond has for nfs and that Martin has for
block->scsi.  Mash the new syscall on top of them and get them working in
current mainline.

I'd be happy to take responsibility for making forward progress if no
one else has the bandwidth.

Trond, Martin, would that make sense?  Are the most recent versions of
the prototypes available somewhere?

- z

^ permalink raw reply	[flat|nested] 56+ messages in thread

* RE: New copyfile system call - discuss before LSF?
  2013-02-22 18:22               ` Zach Brown
@ 2013-02-22 22:48                 ` Myklebust, Trond
  0 siblings, 0 replies; 56+ messages in thread
From: Myklebust, Trond @ 2013-02-22 22:48 UTC (permalink / raw)
  To: Zach Brown, Ric Wheeler
  Cc: Paolo Bonzini, Linux FS Devel, linux-kernel, Chris L. Mason,
	Christoph Hellwig, Alexander Viro, Martin K. Petersen,
	Hannes Reinecke, Joel Becker

> -----Original Message-----
> From: Zach Brown [mailto:zab@redhat.com]
> Sent: Friday, February 22, 2013 1:22 PM
> To: Ric Wheeler
> Cc: Paolo Bonzini; Myklebust, Trond; Linux FS Devel;
> linux-kernel@vger.kernel.org; Chris L. Mason; Christoph Hellwig; Alexander Viro;
> Martin K. Petersen; Hannes Reinecke; Joel Becker
> Subject: Re: New copyfile system call - discuss before LSF?
> 
> > This seems to be suspiciously close to a clear consensus on how to
> > move forward after many years of spinning our wheels. Anyone want to
> > promote an actual patch before we change our collective minds?
> 
> It seems like we'd want to start with the existing (presumably
> bitrotten) prototypes that Trond has for nfs and that Martin has for
> block->scsi.  Mash the new syscall on top of them and get them working in
> current mainline.
> 
> I'd be happy to take responsibility for making forward progress if no one else
> has the bandwidth.
> 
> Trond, Martin, would that make sense?  Are the most recent versions of the
> prototypes available somewhere?

Hi Zach,

The wildly bitrotten NFS copyfile prototype can be found on

    ftp://ftp.netapp.com/frm-ntap/opensource/linux_copyfileat/v2/linux_copyfileat_v2.tgz

Please open with extreme caution and apply the resulting patches to a Linux 2.6.34.2 kernel...

Cheers
   Trond

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-22  1:29           ` Myklebust, Trond
@ 2013-02-23  0:32             ` Eric Wong
  2013-03-30 19:45               ` Pavel Machek
  0 siblings, 1 reply; 56+ messages in thread
From: Eric Wong @ 2013-02-23  0:32 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Zach Brown, Paolo Bonzini, Ric Wheeler, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

"Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:
> > -----Original Message-----
> > From: Zach Brown [mailto:zab@redhat.com]
> > Sent: Thursday, February 21, 2013 5:25 PM
> > To: Myklebust, Trond
> > Cc: Paolo Bonzini; Ric Wheeler; Linux FS Devel; linux-kernel@vger.kernel.org;
> > Chris L. Mason; Christoph Hellwig; Alexander Viro; Martin K. Petersen;
> > Hannes Reinecke; Joel Becker
> > Subject: Re: New copyfile system call - discuss before LSF?
> > 
> > On Thu, Feb 21, 2013 at 08:50:27PM +0000, Myklebust, Trond wrote:
> > > On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:
> > > > Il 21/02/2013 15:57, Ric Wheeler ha scritto:
> > > > >>>
> > > > >> sendfile64() pretty much already has the right arguments for a
> > > > >> "copyfile", however it would be nice to add a 'flags' parameter:
> > > > > >> the NFSv4.2 version would use that to specify whether or not to copy
> > > > >> file metadata.
> > > > >
> > > > > That would seem to be enough to me and has the advantage that it
> > > > > > is a relatively obvious extension to something that is at least
> > > > > not totally unknown to developers.
> > > > >
> > > > > Do we need more than that for non-NFS paths I wonder? What does
> > > > > reflink need or the SCSI mechanism?
> > > >
> > > > For virt we would like to be able to specify arbitrary block ranges.
> > > > Copying an entire file helps some copy operations like storage
> > > > migration.  However, it is not enough to convert the guest's
> > > > offloaded copies to host-side offloaded copies.
> > >
> > > So how would a system call based on sendfile64() plus my flag
> > > parameter prevent an underlying implementation from meeting your
> > > > criterion?
> > 
> > If I'm guessing correctly, sendfile64()+flags would be annoying because it's
> > missing an out_fd_offset.  The host will want to offload the guest's copies by
> > calling sendfile on block ranges of a guest disk image file that correspond to
> > the mappings of the in and out files in the guest.
> > 
> > You could make it work with some locking and out_fd seeking to set the
> > write offset before calling sendfile64()+flags, but ugh.
> > 
> >  ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
> >                   out_offset, size_t count, int flags);
> > 
> > That seems closer.
> 
> psendfile() ?
> 
> I fully agree that sounds reasonable... Just being an ass. :-)

splice() already has offset for both fds and a flags arg:

       ssize_t splice(int fd_in, loff_t *off_in, int fd_out,
                      loff_t *off_out, size_t len, unsigned int flags);

The current downside is it requires one fd to be a pipe, so it's
just not very easy to use from my perspective[1].

> > We might also want to pre-emptively offer iovs instead of offsets, because
> > that's the very first thing that's going to be requested after people prototype
> > having to iterate calling sendfile() for each contiguous copy region.
> 
> vpsendfile() then? I agree that might be a little more future-proof. Particularly given that the underlying protocols tend to be fully asynchronous, and so it makes sense to queue up more than one copy at a time...

splicev() might be nice to have in that case, too.



[1] my splice() annoyances:
    * need to create/manage a pipe
    * copy size limited by pipe size
    * doesn't reduce userspace syscalls (just data copy overhead)
    * easy to misuse and starve with blocking sockets + big buffers
    * not many users, so bugs creep in (v3.7.8 was the first usable
      version of the 3.7 series for TCP sockets)
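
To make the first two annoyances concrete, here is a minimal sketch of a
file-to-file copy built on splice(2) as it exists today. The helper name
splice_copy() is made up for illustration; the splice() and pipe() calls
are the real interfaces. Note the intermediate pipe that has to be created
and torn down, and that each file-to-pipe step moves at most one pipe's
worth of data.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative helper (hypothetical name): copy up to `len` bytes from
 * in_fd at in_off to out_fd at out_off by bouncing through a pipe.
 * Returns the number of bytes copied, or -1 on error.
 */
ssize_t splice_copy(int in_fd, loff_t in_off, int out_fd, loff_t out_off,
                    size_t len)
{
    int p[2];
    size_t total = 0;

    if (pipe(p) < 0)                    /* annoyance: need to manage a pipe */
        return -1;

    while (total < len) {
        /* file -> pipe: limited by the pipe's capacity per call */
        ssize_t n = splice(in_fd, &in_off, p[1], NULL, len - total,
                           SPLICE_F_MOVE);
        if (n < 0)
            goto fail;
        if (n == 0)
            break;                      /* source EOF */

        /* pipe -> file: drain everything that was just queued */
        ssize_t left = n;
        while (left > 0) {
            ssize_t w = splice(p[0], NULL, out_fd, &out_off, (size_t)left,
                               SPLICE_F_MOVE);
            if (w <= 0)
                goto fail;
            left -= w;
        }
        total += (size_t)n;
    }

    close(p[0]);
    close(p[1]);
    return (ssize_t)total;

fail:
    close(p[0]);
    close(p[1]);
    return -1;
}
```

Each chunk still costs two system calls, which is the "doesn't reduce
userspace syscalls" point in the list above.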

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-21 22:24         ` Zach Brown
  2013-02-22  1:29           ` Myklebust, Trond
  2013-02-22  9:47           ` Paolo Bonzini
@ 2013-02-25 21:14           ` Andy Lutomirski
  2013-02-25 21:49             ` Ric Wheeler
  2013-02-26 21:02             ` Jörn Engel
  2 siblings, 2 replies; 56+ messages in thread
From: Andy Lutomirski @ 2013-02-25 21:14 UTC (permalink / raw)
  To: Zach Brown
  Cc: Myklebust, Trond, Paolo Bonzini, Ric Wheeler, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 02/21/2013 02:24 PM, Zach Brown wrote:
> On Thu, Feb 21, 2013 at 08:50:27PM +0000, Myklebust, Trond wrote:
>> On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:
>>> Il 21/02/2013 15:57, Ric Wheeler ha scritto:
>>>>>>
>>>>> sendfile64() pretty much already has the right arguments for a
>>>>> "copyfile", however it would be nice to add a 'flags' parameter: the
>>>>> NFSv4.2 version would use that to specify whether or not to copy file
>>>>> metadata.
>>>>
>>>> That would seem to be enough to me and has the advantage that it is a
>>>> relatively obvious extension to something that is at least not totally
>>>> unknown to developers.
>>>>
>>>> Do we need more than that for non-NFS paths I wonder? What does reflink
>>>> need or the SCSI mechanism?
>>>
>>> For virt we would like to be able to specify arbitrary block ranges.
>>> Copying an entire file helps some copy operations like storage
>>> migration.  However, it is not enough to convert the guest's offloaded
>>> copies to host-side offloaded copies.
>>
>> So how would a system call based on sendfile64() plus my flag parameter
>> prevent an underlying implementation from meeting your criterion?
> 
> If I'm guessing correctly, sendfile64()+flags would be annoying because
> it's missing an out_fd_offset.  The host will want to offload the
> guest's copies by calling sendfile on block ranges of a guest disk image
> file that correspond to the mappings of the in and out files in the
> guest.
> 
> You could make it work with some locking and out_fd seeking to set the
> write offset before calling sendfile64()+flags, but ugh.
> 
>  ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
>                   out_offset, size_t count, int flags);
> 
> That seems closer.
> 
> We might also want to pre-emptively offer iovs instead of offsets,
> because that's the very first thing that's going to be requested after
> people prototype having to iterate calling sendfile() for each
> contiguous copy region. 

I thought the first thing people would ask for is to atomically create a
new file and copy the old file into it (at least on local file systems).
 The idea is that nothing should see an empty destination file, either
by race or by crash.  (This feature would perhaps be described as a
pony, but it should be implementable.)

This would be like a better link(2).

--Andy

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-25 21:14           ` Andy Lutomirski
@ 2013-02-25 21:49             ` Ric Wheeler
  2013-02-25 21:59               ` Myklebust, Trond
  2013-02-26 21:02             ` Jörn Engel
  1 sibling, 1 reply; 56+ messages in thread
From: Ric Wheeler @ 2013-02-25 21:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Zach Brown, Myklebust, Trond, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 02/25/2013 04:14 PM, Andy Lutomirski wrote:
> On 02/21/2013 02:24 PM, Zach Brown wrote:
>> On Thu, Feb 21, 2013 at 08:50:27PM +0000, Myklebust, Trond wrote:
>>> On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:
>>>> Il 21/02/2013 15:57, Ric Wheeler ha scritto:
>>>>>> sendfile64() pretty much already has the right arguments for a
>>>>>> "copyfile", however it would be nice to add a 'flags' parameter: the
>>>>>> NFSv4.2 version would use that to specify whether or not to copy file
>>>>>> metadata.
>>>>> That would seem to be enough to me and has the advantage that it is a
>>>>> relatively obvious extension to something that is at least not totally
>>>>> unknown to developers.
>>>>>
>>>>> Do we need more than that for non-NFS paths I wonder? What does reflink
>>>>> need or the SCSI mechanism?
>>>> For virt we would like to be able to specify arbitrary block ranges.
>>>> Copying an entire file helps some copy operations like storage
>>>> migration.  However, it is not enough to convert the guest's offloaded
>>>> copies to host-side offloaded copies.
>>> So how would a system call based on sendfile64() plus my flag parameter
>>> prevent an underlying implementation from meeting your criterion?
>> If I'm guessing correctly, sendfile64()+flags would be annoying because
>> it's missing an out_fd_offset.  The host will want to offload the
>> guest's copies by calling sendfile on block ranges of a guest disk image
>> file that correspond to the mappings of the in and out files in the
>> guest.
>>
>> You could make it work with some locking and out_fd seeking to set the
>> write offset before calling sendfile64()+flags, but ugh.
>>
>>   ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
>>                    out_offset, size_t count, int flags);
>>
>> That seems closer.
>>
>> We might also want to pre-emptively offer iovs instead of offsets,
>> because that's the very first thing that's going to be requested after
>> people prototype having to iterate calling sendfile() for each
>> contiguous copy region.
> I thought the first thing people would ask for is to atomically create a
> new file and copy the old file into it (at least on local file systems).
>   The idea is that nothing should see an empty destination file, either
> by race or by crash.  (This feature would perhaps be described as a
> pony, but it should be implementable.)
>
> This would be like a better link(2).
>
> --Andy

Why would this need to be atomic? That would seem to be a very difficult 
property to provide across all target types with multi-GB sized files...

Ric



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-25 21:49             ` Ric Wheeler
@ 2013-02-25 21:59               ` Myklebust, Trond
  2013-02-25 22:16                 ` Andy Lutomirski
  0 siblings, 1 reply; 56+ messages in thread
From: Myklebust, Trond @ 2013-02-25 21:59 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Andy Lutomirski, Zach Brown, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Mon, 2013-02-25 at 16:49 -0500, Ric Wheeler wrote:
> On 02/25/2013 04:14 PM, Andy Lutomirski wrote:
> > On 02/21/2013 02:24 PM, Zach Brown wrote:
> >> On Thu, Feb 21, 2013 at 08:50:27PM +0000, Myklebust, Trond wrote:
> >>> On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:
> >>>> Il 21/02/2013 15:57, Ric Wheeler ha scritto:
> >>>>>> sendfile64() pretty much already has the right arguments for a
> >>>>>> "copyfile", however it would be nice to add a 'flags' parameter: the
> >>>>>> NFSv4.2 version would use that to specify whether or not to copy file
> >>>>>> metadata.
> >>>>> That would seem to be enough to me and has the advantage that it is a
> >>>>> relatively obvious extension to something that is at least not totally
> >>>>> unknown to developers.
> >>>>>
> >>>>> Do we need more than that for non-NFS paths I wonder? What does reflink
> >>>>> need or the SCSI mechanism?
> >>>> For virt we would like to be able to specify arbitrary block ranges.
> >>>> Copying an entire file helps some copy operations like storage
> >>>> migration.  However, it is not enough to convert the guest's offloaded
> >>>> copies to host-side offloaded copies.
> >>> So how would a system call based on sendfile64() plus my flag parameter
> >>> prevent an underlying implementation from meeting your criterion?
> >> If I'm guessing correctly, sendfile64()+flags would be annoying because
> >> it's missing an out_fd_offset.  The host will want to offload the
> >> guest's copies by calling sendfile on block ranges of a guest disk image
> >> file that correspond to the mappings of the in and out files in the
> >> guest.
> >>
> >> You could make it work with some locking and out_fd seeking to set the
> >> write offset before calling sendfile64()+flags, but ugh.
> >>
> >>   ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
> >>                    out_offset, size_t count, int flags);
> >>
> >> That seems closer.
> >>
> >> We might also want to pre-emptively offer iovs instead of offsets,
> >> because that's the very first thing that's going to be requested after
> >> people prototype having to iterate calling sendfile() for each
> >> contiguous copy region.
> > I thought the first thing people would ask for is to atomically create a
> > new file and copy the old file into it (at least on local file systems).
> >   The idea is that nothing should see an empty destination file, either
> > by race or by crash.  (This feature would perhaps be described as a
> > pony, but it should be implementable.)
> >
> > This would be like a better link(2).
> >
> > --Andy
> 
> Why would this need to be atomic? That would seem to be a very difficult 
> property to provide across all target types with multi-GB sized files...

Right. It may sound cool, but what's the real-life use case?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-25 21:59               ` Myklebust, Trond
@ 2013-02-25 22:16                 ` Andy Lutomirski
  2013-02-25 23:28                   ` Myklebust, Trond
  0 siblings, 1 reply; 56+ messages in thread
From: Andy Lutomirski @ 2013-02-25 22:16 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Ric Wheeler, Zach Brown, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Mon, Feb 25, 2013 at 1:59 PM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Mon, 2013-02-25 at 16:49 -0500, Ric Wheeler wrote:
>> On 02/25/2013 04:14 PM, Andy Lutomirski wrote:
>> > On 02/21/2013 02:24 PM, Zach Brown wrote:
>> >> On Thu, Feb 21, 2013 at 08:50:27PM +0000, Myklebust, Trond wrote:
>> >>> On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:
>> >>>> Il 21/02/2013 15:57, Ric Wheeler ha scritto:
>> >>>>>> sendfile64() pretty much already has the right arguments for a
>> >>>>>> "copyfile", however it would be nice to add a 'flags' parameter: the
>> >>>>>> NFSv4.2 version would use that to specify whether or not to copy file
>> >>>>>> metadata.
>> >>>>> That would seem to be enough to me and has the advantage that it is a
>> >>>>> relatively obvious extension to something that is at least not totally
>> >>>>> unknown to developers.
>> >>>>>
>> >>>>> Do we need more than that for non-NFS paths I wonder? What does reflink
>> >>>>> need or the SCSI mechanism?
>> >>>> For virt we would like to be able to specify arbitrary block ranges.
>> >>>> Copying an entire file helps some copy operations like storage
>> >>>> migration.  However, it is not enough to convert the guest's offloaded
>> >>>> copies to host-side offloaded copies.
>> >>> So how would a system call based on sendfile64() plus my flag parameter
>> >>> prevent an underlying implementation from meeting your criterion?
>> >> If I'm guessing correctly, sendfile64()+flags would be annoying because
>> >> it's missing an out_fd_offset.  The host will want to offload the
>> >> guest's copies by calling sendfile on block ranges of a guest disk image
>> >> file that correspond to the mappings of the in and out files in the
>> >> guest.
>> >>
>> >> You could make it work with some locking and out_fd seeking to set the
>> >> write offset before calling sendfile64()+flags, but ugh.
>> >>
>> >>   ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
>> >>                    out_offset, size_t count, int flags);
>> >>
>> >> That seems closer.
>> >>
>> >> We might also want to pre-emptively offer iovs instead of offsets,
>> >> because that's the very first thing that's going to be requested after
>> >> people prototype having to iterate calling sendfile() for each
>> >> contiguous copy region.
>> > I thought the first thing people would ask for is to atomically create a
>> > new file and copy the old file into it (at least on local file systems).
>> >   The idea is that nothing should see an empty destination file, either
>> > by race or by crash.  (This feature would perhaps be described as a
>> > pony, but it should be implementable.)
>> >
>> > This would be like a better link(2).
>> >
>> > --Andy
>>
>> Why would this need to be atomic? That would seem to be a very difficult
>> property to provide across all target types with multi-GB sized files...
>
> Right. It may sound cool, but what's the real-life use case?
>

Download file from some source and then verify it.  Now copyfile it
into my repository of known-good files.

Admittedly I could link + unlink or rename it there, but I consider
hard links to be rather evil, especially when cow links are available.


--Andy
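
For reference, the user-space pattern being weighed against an atomic
copyfile looks roughly like the sketch below: copy into a temporary name
in the destination directory, fsync, then rename into place so that no
reader ever sees an empty or partial destination. The function name
copy_then_rename() and its three-path interface are assumptions made for
illustration; only standard POSIX calls are used.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Illustrative helper (hypothetical name): copy `src` to `tmp`, flush it,
 * then rename `tmp` to `dst`. `tmp` must be on the same filesystem as
 * `dst` for the rename to be atomic. Returns 0 on success, -1 on error.
 */
int copy_then_rename(const char *src, const char *tmp, const char *dst)
{
    char buf[65536];

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;

    int out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    for (;;) {
        ssize_t n = read(in, buf, sizeof(buf));
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        if (n == 0)
            break;                      /* end of source file */

        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                goto fail;
            }
            off += w;
        }
    }

    if (fsync(out) < 0)                 /* data must be stable before publish */
        goto fail;
    close(in);
    if (close(out) < 0) {
        unlink(tmp);
        return -1;
    }
    if (rename(tmp, dst) < 0) {         /* atomic switch to the final name */
        unlink(tmp);
        return -1;
    }
    /* For crash durability of the rename itself, the containing directory
     * would also need an fsync; omitted to keep the sketch short. */
    return 0;

fail:
    close(in);
    close(out);
    unlink(tmp);
    return -1;
}
```

This leaves the source file in place, unlike a plain rename of the
downloaded file, at the cost of a full data copy unless the filesystem can
share blocks.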

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-25 22:16                 ` Andy Lutomirski
@ 2013-02-25 23:28                   ` Myklebust, Trond
  2013-02-25 23:35                     ` Andy Lutomirski
  0 siblings, 1 reply; 56+ messages in thread
From: Myklebust, Trond @ 2013-02-25 23:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ric Wheeler, Zach Brown, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Mon, 2013-02-25 at 14:16 -0800, Andy Lutomirski wrote:
> On Mon, Feb 25, 2013 at 1:59 PM, Myklebust, Trond
> <Trond.Myklebust@netapp.com> wrote:
> > On Mon, 2013-02-25 at 16:49 -0500, Ric Wheeler wrote:
> >> On 02/25/2013 04:14 PM, Andy Lutomirski wrote:
> >> > On 02/21/2013 02:24 PM, Zach Brown wrote:
> >> >> On Thu, Feb 21, 2013 at 08:50:27PM +0000, Myklebust, Trond wrote:
> >> >>> On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:
> >> >>>> Il 21/02/2013 15:57, Ric Wheeler ha scritto:
> >> >>>>>> sendfile64() pretty much already has the right arguments for a
> >> >>>>>> "copyfile", however it would be nice to add a 'flags' parameter: the
> >> >>>>>> NFSv4.2 version would use that to specify whether or not to copy file
> >> >>>>>> metadata.
> >> >>>>> That would seem to be enough to me and has the advantage that it is a
> >> >>>>> relatively obvious extension to something that is at least not totally
> >> >>>>> unknown to developers.
> >> >>>>>
> >> >>>>> Do we need more than that for non-NFS paths I wonder? What does reflink
> >> >>>>> need or the SCSI mechanism?
> >> >>>> For virt we would like to be able to specify arbitrary block ranges.
> >> >>>> Copying an entire file helps some copy operations like storage
> >> >>>> migration.  However, it is not enough to convert the guest's offloaded
> >> >>>> copies to host-side offloaded copies.
> >> >>> So how would a system call based on sendfile64() plus my flag parameter
> >> >>> prevent an underlying implementation from meeting your criterion?
> >> >> If I'm guessing correctly, sendfile64()+flags would be annoying because
> >> >> it's missing an out_fd_offset.  The host will want to offload the
> >> >> guest's copies by calling sendfile on block ranges of a guest disk image
> >> >> file that correspond to the mappings of the in and out files in the
> >> >> guest.
> >> >>
> >> >> You could make it work with some locking and out_fd seeking to set the
> >> >> write offset before calling sendfile64()+flags, but ugh.
> >> >>
> >> >>   ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
> >> >>                    out_offset, size_t count, int flags);
> >> >>
> >> >> That seems closer.
> >> >>
> >> >> We might also want to pre-emptively offer iovs instead of offsets,
> >> >> because that's the very first thing that's going to be requested after
> >> >> people prototype having to iterate calling sendfile() for each
> >> >> contiguous copy region.
> >> > I thought the first thing people would ask for is to atomically create a
> >> > new file and copy the old file into it (at least on local file systems).
> >> >   The idea is that nothing should see an empty destination file, either
> >> > by race or by crash.  (This feature would perhaps be described as a
> >> > pony, but it should be implementable.)
> >> >
> >> > This would be like a better link(2).
> >> >
> >> > --Andy
> >>
> >> Why would this need to be atomic? That would seem to be a very difficult
> >> property to provide across all target types with multi-GB sized files...
> >
> > Right. It may sound cool, but what's the real-life use case?
> >
> 
> Download file from some source and then verify it.  Now copyfile it
> into my repository of known-good files.
> 
> Admittedly I could link + unlink or rename it there, but I consider
> hard links to be rather evil, especially when cow links are available.

Rename is the right way to do that as it can't corrupt the data after
you have verified it. copyfile can...

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-25 23:28                   ` Myklebust, Trond
@ 2013-02-25 23:35                     ` Andy Lutomirski
  2013-02-25 23:45                       ` Myklebust, Trond
  0 siblings, 1 reply; 56+ messages in thread
From: Andy Lutomirski @ 2013-02-25 23:35 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Ric Wheeler, Zach Brown, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Mon, Feb 25, 2013 at 3:28 PM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Mon, 2013-02-25 at 14:16 -0800, Andy Lutomirski wrote:
>> On Mon, Feb 25, 2013 at 1:59 PM, Myklebust, Trond
>> <Trond.Myklebust@netapp.com> wrote:
>> > On Mon, 2013-02-25 at 16:49 -0500, Ric Wheeler wrote:
>> >> On 02/25/2013 04:14 PM, Andy Lutomirski wrote:
>> >> > On 02/21/2013 02:24 PM, Zach Brown wrote:
>> >> >> On Thu, Feb 21, 2013 at 08:50:27PM +0000, Myklebust, Trond wrote:
>> >> >>> On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:
>> >> >>>> Il 21/02/2013 15:57, Ric Wheeler ha scritto:
>> >> >>>>>> sendfile64() pretty much already has the right arguments for a
>> >> >>>>>> "copyfile", however it would be nice to add a 'flags' parameter: the
>> >> >>>>>> NFSv4.2 version would use that to specify whether or not to copy file
>> >> >>>>>> metadata.
>> >> >>>>> That would seem to be enough to me and has the advantage that it is a
>> >> >>>>> relatively obvious extension to something that is at least not totally
>> >> >>>>> unknown to developers.
>> >> >>>>>
>> >> >>>>> Do we need more than that for non-NFS paths I wonder? What does reflink
>> >> >>>>> need or the SCSI mechanism?
>> >> >>>> For virt we would like to be able to specify arbitrary block ranges.
>> >> >>>> Copying an entire file helps some copy operations like storage
>> >> >>>> migration.  However, it is not enough to convert the guest's offloaded
>> >> >>>> copies to host-side offloaded copies.
>> >> >>> So how would a system call based on sendfile64() plus my flag parameter
>> >> >>> prevent an underlying implementation from meeting your criterion?
>> >> >> If I'm guessing correctly, sendfile64()+flags would be annoying because
>> >> >> it's missing an out_fd_offset.  The host will want to offload the
>> >> >> guest's copies by calling sendfile on block ranges of a guest disk image
>> >> >> file that correspond to the mappings of the in and out files in the
>> >> >> guest.
>> >> >>
>> >> >> You could make it work with some locking and out_fd seeking to set the
>> >> >> write offset before calling sendfile64()+flags, but ugh.
>> >> >>
>> >> >>   ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
>> >> >>                    out_offset, size_t count, int flags);
>> >> >>
>> >> >> That seems closer.
>> >> >>
>> >> >> We might also want to pre-emptively offer iovs instead of offsets,
>> >> >> because that's the very first thing that's going to be requested after
>> >> >> people prototype having to iterate calling sendfile() for each
>> >> >> contiguous copy region.
>> >> > I thought the first thing people would ask for is to atomically create a
>> >> > new file and copy the old file into it (at least on local file systems).
>> >> >   The idea is that nothing should see an empty destination file, either
>> >> > by race or by crash.  (This feature would perhaps be described as a
>> >> > pony, but it should be implementable.)
>> >> >
>> >> > This would be like a better link(2).
>> >> >
>> >> > --Andy
>> >>
>> >> Why would this need to be atomic? That would seem to be a very difficult
>> >> property to provide across all target types with multi-GB sized files...
>> >
>> > Right. It may sound cool, but what's the real-life use case?
>> >
>>
>> Download file from some source and then verify it.  Now copyfile it
>> into my repository of known-good files.
>>
>> Admittedly I could link + unlink or rename it there, but I consider
>> hard links to be rather evil, especially when cow links are available.
>
> Rename is the right way to do that as it can't corrupt the data after
> you have verified it. copyfile can...

...copyfile doesn't exist.  I think it would be neat if it couldn't
corrupt data.

In any case, this may be a bad idea -- presumably you'd have to fsync
the file you're copying *from* first to avoid a massive performance
hit.

--Andy

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-25 23:35                     ` Andy Lutomirski
@ 2013-02-25 23:45                       ` Myklebust, Trond
  2013-02-26  0:03                         ` Zach Brown
  0 siblings, 1 reply; 56+ messages in thread
From: Myklebust, Trond @ 2013-02-25 23:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ric Wheeler, Zach Brown, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Mon, 2013-02-25 at 15:35 -0800, Andy Lutomirski wrote:
> On Mon, Feb 25, 2013 at 3:28 PM, Myklebust, Trond
> <Trond.Myklebust@netapp.com> wrote:
> > On Mon, 2013-02-25 at 14:16 -0800, Andy Lutomirski wrote:
> >> On Mon, Feb 25, 2013 at 1:59 PM, Myklebust, Trond
> >> <Trond.Myklebust@netapp.com> wrote:
> >> > On Mon, 2013-02-25 at 16:49 -0500, Ric Wheeler wrote:
> >> >> On 02/25/2013 04:14 PM, Andy Lutomirski wrote:
> >> >> > On 02/21/2013 02:24 PM, Zach Brown wrote:
> >> >> >> On Thu, Feb 21, 2013 at 08:50:27PM +0000, Myklebust, Trond wrote:
> >> >> >>> On Thu, 2013-02-21 at 21:00 +0100, Paolo Bonzini wrote:
> >> >> >>>> Il 21/02/2013 15:57, Ric Wheeler ha scritto:
> >> >> >>>>>> sendfile64() pretty much already has the right arguments for a
> >> >> >>>>>> "copyfile", however it would be nice to add a 'flags' parameter: the
> >> >> >>>>>> NFSv4.2 version would use that to specify whether or not to copy file
> >> >> >>>>>> metadata.
> >> >> >>>>> That would seem to be enough to me and has the advantage that it is a
> >> >> >>>>> relatively obvious extension to something that is at least not totally
> >> >> >>>>> unknown to developers.
> >> >> >>>>>
> >> >> >>>>> Do we need more than that for non-NFS paths I wonder? What does reflink
> >> >> >>>>> need or the SCSI mechanism?
> >> >> >>>> For virt we would like to be able to specify arbitrary block ranges.
> >> >> >>>> Copying an entire file helps some copy operations like storage
> >> >> >>>> migration.  However, it is not enough to convert the guest's offloaded
> >> >> >>>> copies to host-side offloaded copies.
> >> >> >>> So how would a system call based on sendfile64() plus my flag parameter
> >> >> >>> prevent an underlying implementation from meeting your criterion?
> >> >> >> If I'm guessing correctly, sendfile64()+flags would be annoying because
> >> >> >> it's missing an out_fd_offset.  The host will want to offload the
> >> >> >> guest's copies by calling sendfile on block ranges of a guest disk image
> >> >> >> file that correspond to the mappings of the in and out files in the
> >> >> >> guest.
> >> >> >>
> >> >> >> You could make it work with some locking and out_fd seeking to set the
> >> >> >> write offset before calling sendfile64()+flags, but ugh.
> >> >> >>
> >> >> >>   ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
> >> >> >>                    out_offset, size_t count, int flags);
> >> >> >>
> >> >> >> That seems closer.
> >> >> >>
> >> >> >> We might also want to pre-emptively offer iovs instead of offsets,
> >> >> >> because that's the very first thing that's going to be requested after
> >> >> >> people prototype having to iterate calling sendfile() for each
> >> >> >> contiguous copy region.
> >> >> > I thought the first thing people would ask for is to atomically create a
> >> >> > new file and copy the old file into it (at least on local file systems).
> >> >> >   The idea is that nothing should see an empty destination file, either
> >> >> > by race or by crash.  (This feature would perhaps be described as a
> >> >> > pony, but it should be implementable.)
> >> >> >
> >> >> > This would be like a better link(2).
> >> >> >
> >> >> > --Andy
> >> >>
> >> >> Why would this need to be atomic? That would seem to be a very difficult
> >> >> property to provide across all target types with multi-GB sized files...
> >> >
> >> > Right. It may sound cool, but what's the real-life use case?
> >> >
> >>
> >> Download file from some source and then verify it.  Now copyfile it
> >> into my repository of known-good files.
> >>
> >> Admittedly I could link + unlink or rename it there, but I consider
> >> hard links to be rather evil, especially when cow links are available.
> >
> > Rename is the right way to do that as it can't corrupt the data after
> > you have verified it. copyfile can...
> 
> ...copyfile doesn't exist.

Wrong! The underlying NFS and SCSI copy offload protocols are fully
defined at this time, and will constrain any implementation that you may
dream up.

>   I think it would be neat if it couldn't
> corrupt data.

It would also be neat if the moon were made of cheese... The underlying
NFS and SCSI protocols do not guarantee perfect copies; the copy may,
for instance, be interrupted due to external circumstances.

> In any case, this may be a bad idea -- presumably you'd have to fsync
> the file you're copying *from* first to avoid a massive performance
> hit.

You have to do that anyway.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-25 23:45                       ` Myklebust, Trond
@ 2013-02-26  0:03                         ` Zach Brown
  2013-03-11  9:31                           ` Joel Becker
  0 siblings, 1 reply; 56+ messages in thread
From: Zach Brown @ 2013-02-26  0:03 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Andy Lutomirski, Ric Wheeler, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

> >   I think it would be neat if it couldn't
> > corrupt data.
> 
> It would also be neat if the moon were made of cheese...

And there we have the lsf2013 t-shirt slogan.  I think we're done here!

- z

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-25 21:14           ` Andy Lutomirski
  2013-02-25 21:49             ` Ric Wheeler
@ 2013-02-26 21:02             ` Jörn Engel
  2013-02-26 22:35               ` Andy Lutomirski
  2013-03-30 19:49               ` Pavel Machek
  1 sibling, 2 replies; 56+ messages in thread
From: Jörn Engel @ 2013-02-26 21:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Zach Brown, Myklebust, Trond, Paolo Bonzini, Ric Wheeler,
	Linux FS Devel, linux-kernel, Chris L. Mason, Christoph Hellwig,
	Alexander Viro, Martin K. Petersen, Hannes Reinecke, Joel Becker

On Mon, 25 February 2013 13:14:52 -0800, Andy Lutomirski wrote:
> 
> I thought the first thing people would ask for is to atomically create a
> new file and copy the old file into it (at least on local file systems).
>  The idea is that nothing should see an empty destination file, either
> by race or by crash.  (This feature would perhaps be described as a
> pony, but it should be implementable.)

Having already wasted many weeks trying to implement your pony, I would
consider it about as possible as winning the lottery three times in a
row.  It clearly is in theory and yet,...

If you take a filesystem like ext[34] you are out of luck.  In those
filesystems it may not even be theoretically possible to get the
cleanup right for pathological cases.  And if you ignore pathological
cases and depend on userspace to do the cleanup for you, you have to
do ABI extensions that I don't want to mention with Al on Cc:.  My
personal notebook ran such a kernel for several years until hardware
improved to a point that I no longer wanted to forward-port the
patches.  It worked but it was far from pretty.

If you have a filesystem where you can simply bump a reference count
to copy the file content, implementation is fairly straightforward.
But having a system call that is effectively limited to btrfs means
pretty much no one will use it - besides the people looking for
potential kernel exploits.

So my vote clearly goes to some variant of sendfile or splice.

Jörn

--
One must not confuse what seems improbable and unnatural to us
with what is absolutely impossible.
-- Carl Friedrich Gauß

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-26 21:02             ` Jörn Engel
@ 2013-02-26 22:35               ` Andy Lutomirski
  2013-03-30 19:49               ` Pavel Machek
  1 sibling, 0 replies; 56+ messages in thread
From: Andy Lutomirski @ 2013-02-26 22:35 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Zach Brown, Myklebust, Trond, Paolo Bonzini, Ric Wheeler,
	Linux FS Devel, linux-kernel, Chris L. Mason, Christoph Hellwig,
	Alexander Viro, Martin K. Petersen, Hannes Reinecke, Joel Becker

On Tue, Feb 26, 2013 at 1:02 PM, Jörn Engel <joern@logfs.org> wrote:
> On Mon, 25 February 2013 13:14:52 -0800, Andy Lutomirski wrote:
>>
>> I thought the first thing people would ask for is to atomically create a
>> new file and copy the old file into it (at least on local file systems).
>>  The idea is that nothing should see an empty destination file, either
>> by race or by crash.  (This feature would perhaps be described as a
>> pony, but it should be implementable.)
>
> Having already wasted many weeks trying to implement your pony, I would
> consider it about as possible as winning the lottery three times in a
> row.  It clearly is in theory and yet,...
>
> If you take a filesystem like ext[34] you are out of luck.  In those
> filesystems it may not even be theoretically possible to get the
> cleanup right for pathological cases.  And if you ignore pathological
> cases and depend on userspace to do the cleanup for you, you have to
> do ABI extensions that I don't want to mention with Al on Cc:.  My
> personal notebook ran such a kernel for several years until hardware
> improved to a point that I no longer wanted to forward-port the
> patches.  It worked but it was far from pretty.
>
> If you have a filesystem where you can simply bump a reference count
> to copy the file content, implementation is fairly straightforward.
> But having a system call that is effectively limited to btrfs means
> pretty much no one will use it - besides the people looking for
> potential kernel exploits.

:)

>
> So my vote clearly goes to some variant of sendfile or splice.

Don't get me wrong -- the vpsendfile (or whatever it's called) idea
sounds extremely useful too.

--Andy
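
For illustration only, here is a minimal sketch of how a positional
sendfile could be emulated today with the lseek() workaround mentioned
earlier in the thread.  The name psendfile_emul() is hypothetical and
not from any posted patch.  It assumes out_fd is a regular file that
was not opened with O_APPEND, and it moves out_fd's file position, so
a shared descriptor would need external locking.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to count bytes from in_fd at in_off to
 * out_fd at out_off.  Returns the number of bytes copied, or -1 if
 * nothing could be copied.  Not atomic and not safe against concurrent
 * users of out_fd, because it relies on out_fd's file position.
 */
static ssize_t psendfile_emul(int out_fd, int in_fd, off_t in_off,
                              off_t out_off, size_t count)
{
    size_t done = 0;

    /* Emulate the missing out_offset by seeking the output first. */
    if (lseek(out_fd, out_off, SEEK_SET) == (off_t)-1)
        return -1;

    while (done < count) {
        /* sendfile() advances in_off itself when given a pointer. */
        ssize_t n = sendfile(out_fd, in_fd, &in_off, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return done ? (ssize_t)done : -1;
        }
        if (n == 0)     /* end of the source file */
            break;
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

A caller would have to loop over each contiguous region itself, which
is the iteration cost that motivated the iovec suggestion.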

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-26  0:03                         ` Zach Brown
@ 2013-03-11  9:31                           ` Joel Becker
  0 siblings, 0 replies; 56+ messages in thread
From: Joel Becker @ 2013-03-11  9:31 UTC (permalink / raw)
  To: Zach Brown
  Cc: Myklebust, Trond, Andy Lutomirski, Ric Wheeler, Paolo Bonzini,
	Linux FS Devel, linux-kernel, Chris L. Mason, Christoph Hellwig,
	Alexander Viro, Martin K. Petersen, Hannes Reinecke

On Mon, Feb 25, 2013 at 04:03:01PM -0800, Zach Brown wrote:
> > >   I think it would be neat if it couldn't
> > > corrupt data.
> > 
> > It would also be neat if the moon were made of cheese...
> 
> And there we have the lsf2013 t-shirt slogan.  I think we're done here!
> 
> - z

Hey Everyone,
	So, of course, this thread happened while I was celebrating my
10-year anniversary on a warm, sunny island.  I won't trade.  But let me
drop my $0.02 in here.
	First, we have our T-shirt slogan.  That overrides every other
concern.
	Second, I agree that moving forward on anything is better than
not.  I haven't delivered the updated fastcopy(2) patch I promised two
years ago, and I have to admit that I can't promise code on any sane
timeframe.
	Back when I was working on this, I thought that link(2) was a
good model for a full-file copy.  Thus I came up with reflink(2).  This
eventually became the fastcopy(2) proposal discussed two years ago.  I
did not think, and I still don't think, that we should conflate the API
for "copy/clone this file in some way" (ala fastcopy(2)) with
"duplicate/link this range of bytes" (ala BTRFS_IOC_CLONE_RANGE).  I
thought that splice(2) or something like it was a better fit for ranges;
this thread has already had the same thought.
	fastcopy(2) had a provision for CoW for atomicity, including
metadata.  This is because ocfs2 reflinks *can* provide atomic clones
with metadata included.  I would like any new proposal to allow for
that.  If it does not, of course, callers can continue to use
OCFS2_IOC_REFLINK, but I'd rather make it part of the generic behavior,
so that generic tools come with it.

Joel

-- 

"You don't make the poor richer by making the rich poorer."
	- Sir Winston Churchill

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-23  0:32             ` Eric Wong
@ 2013-03-30 19:45               ` Pavel Machek
  2013-03-31 21:23                 ` Eric Wong
  0 siblings, 1 reply; 56+ messages in thread
From: Pavel Machek @ 2013-03-30 19:45 UTC (permalink / raw)
  To: Eric Wong
  Cc: Myklebust, Trond, Zach Brown, Paolo Bonzini, Ric Wheeler,
	Linux FS Devel, linux-kernel, Chris L. Mason, Christoph Hellwig,
	Alexander Viro, Martin K. Petersen, Hannes Reinecke, Joel Becker

Hi!

> > > If I'm guessing correctly, sendfile64()+flags would be annoying because it's
> > > missing an out_fd_offset.  The host will want to offload the guest's copies by
> > > calling sendfile on block ranges of a guest disk image file that correspond to
> > > the mappings of the in and out files in the guest.
> > > 
> > > You could make it work with some locking and out_fd seeking to set the
> > > write offset before calling sendfile64()+flags, but ugh.
> > > 
> > >  ssize_t sendfile(int out_fd, int in_fd, off_t in_offset, off_t
> > >                   out_offset, size_t count, int flags);
> > > 
> > > That seems closer.
> > 
> > psendfile() ?
> > 
> > I fully agree that sounds reasonable... Just being an ass. :-)
> 
> splice() already has offset for both fds and a flags arg:
> 
>        ssize_t splice(int fd_in, loff_t *off_in, int fd_out,
>                       loff_t *off_out, size_t len, unsigned int flags);
> 
> The current downside is it requires one fd to be a pipe, so it's
> just not very easy to use from my perspective[1].
...
> [1] my splice() annoyances:
>     * need to create/manage a pipe
>     * copy size limited by pipe size
>     * doesn't reduce userspace syscalls (just data copy overhead)
>     * easy to misuse and starve with blocking sockets + big buffers
>     * not many users, so bugs creep in (v3.7.8 was the first usable
>       version of the 3.7 series for TCP sockets)

Could a library be created to make it less annoying to use, and harder
to misuse?

The splice man page does not mention the pipe size limit...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
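
As a rough idea of what such a library wrapper might look like, the
sketch below hides the intermediate pipe and loops so that the caller
is not limited by the pipe's capacity.  The name splice_copy_range()
is hypothetical.  It assumes both descriptors refer to regular files
and that out_fd was not opened with O_APPEND.  It does not reduce the
number of system calls; it only hides them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to len bytes from in_fd at in_off to
 * out_fd at out_off, using a private pipe as the splice() buffer.
 * Returns the number of bytes copied, or -1 if nothing was copied.
 */
static ssize_t splice_copy_range(int in_fd, loff_t in_off,
                                 int out_fd, loff_t out_off, size_t len)
{
    int p[2];
    size_t done = 0;
    int failed = 0;
    int saved_errno;

    if (pipe(p) < 0)
        return -1;

    while (done < len && !failed) {
        /* Fill the pipe; the kernel caps this at the pipe capacity. */
        ssize_t n = splice(in_fd, &in_off, p[1], NULL, len - done,
                           SPLICE_F_MOVE);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0)
            failed = 1;
        if (n <= 0)     /* error or end of the source file */
            break;

        /* Drain exactly what was queued before reading more. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, &out_off,
                               (size_t)left, SPLICE_F_MOVE);
            if (m < 0 && errno == EINTR)
                continue;
            if (m <= 0) {
                failed = 1;
                break;
            }
            left -= m;
        }
        done += (size_t)(n - left);
    }

    saved_errno = errno;
    close(p[0]);
    close(p[1]);
    errno = saved_errno;
    return (failed && done == 0) ? -1 : (ssize_t)done;
}
```

Because the pipe is empty at the start of every iteration, the first
splice() call never blocks on a full pipe, which avoids one of the
easy ways to misuse the interface.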

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-02-26 21:02             ` Jörn Engel
  2013-02-26 22:35               ` Andy Lutomirski
@ 2013-03-30 19:49               ` Pavel Machek
  2013-03-30 20:08                 ` Andreas Dilger
  2013-03-30 22:40                 ` Andy Lutomirski
  1 sibling, 2 replies; 56+ messages in thread
From: Pavel Machek @ 2013-03-30 19:49 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Andy Lutomirski, Zach Brown, Myklebust, Trond, Paolo Bonzini,
	Ric Wheeler, Linux FS Devel, linux-kernel, Chris L. Mason,
	Christoph Hellwig, Alexander Viro, Martin K. Petersen,
	Hannes Reinecke, Joel Becker

Hi!

> > I thought the first thing people would ask for is to atomically create a
> > new file and copy the old file into it (at least on local file systems).
> >  The idea is that nothing should see an empty destination file, either
> > by race or by crash.  (This feature would perhaps be described as a
> > pony, but it should be implementable.)
> 
> Having already wasted many weeks trying to implement your pony, I would
> consider it about as possible as winning the lottery three times in a
> row.  It clearly is in theory and yet,...

Hmm, really? AFAICT it would be simple to provide open_deleted_file("directory")
syscall. You'd open_deleted_file(), copy source file into it, then
fsync(), then link it into filesystem.

That should have atomicity properties reflected.
									Pavel
								(who has too many (*)
									ponies around)
(*) 1 is sometimes too many when we talk about big mammals.
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-30 19:49               ` Pavel Machek
@ 2013-03-30 20:08                 ` Andreas Dilger
  2013-03-30 21:45                   ` Pavel Machek
  2013-03-31 11:48                   ` Pádraig Brady
  2013-03-30 22:40                 ` Andy Lutomirski
  1 sibling, 2 replies; 56+ messages in thread
From: Andreas Dilger @ 2013-03-30 20:08 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jörn Engel, Andy Lutomirski, Zach Brown, Myklebust, Trond,
	Paolo Bonzini, Ric Wheeler, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
> Hmm, really? AFAICT it would be simple to provide an
> open_deleted_file("directory") syscall. You'd open_deleted_file(),
> copy source file into it, then fsync(), then link it into filesystem.
> 
> That should have atomicity properties reflected.

Actually, the open_deleted_file() syscall is quite useful for many
different things all by itself.  Lots of applications need to create
temporary files that are unlinked at application failure (without a
race if app crashes after creating the file, but before unlinking).
It also avoids exposing temporary files into the namespace if other
applications are accessing the directory.

We've added a library routine that does this for Lustre in a hackish
way (magical filename created in target directory) for being able to
migrate files between data servers, HSM, defragmentation, rsync, etc.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-30 20:08                 ` Andreas Dilger
@ 2013-03-30 21:45                   ` Pavel Machek
  2013-03-30 21:57                     ` Myklebust, Trond
  2013-03-31  5:38                     ` AEDilger Gmail
  2013-03-31 11:48                   ` Pádraig Brady
  1 sibling, 2 replies; 56+ messages in thread
From: Pavel Machek @ 2013-03-30 21:45 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jörn Engel, Andy Lutomirski, Zach Brown, Myklebust, Trond,
	Paolo Bonzini, Ric Wheeler, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:
> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
> > Hmm, really? AFAICT it would be simple to provide an
> > open_deleted_file("directory") syscall. You'd open_deleted_file(),
> > copy source file into it, then fsync(), then link it into filesystem.
> > 
> > That should have atomicity properties reflected.
> 
> Actually, the open_deleted_file() syscall is quite useful for many
> different things all by itself.  Lots of applications need to create
> temporary files that are unlinked at application failure (without a
> race if app crashes after creating the file, but before unlinking).
> It also avoids exposing temporary files into the namespace if other
> applications are accessing the directory.

Hmm. open_deleted_file() will still need to get a directory... so it
will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
be an acceptable interface?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-30 21:45                   ` Pavel Machek
@ 2013-03-30 21:57                     ` Myklebust, Trond
  2013-03-30 23:21                       ` Ric Wheeler
  2013-03-31  7:36                       ` Pavel Machek
  2013-03-31  5:38                     ` AEDilger Gmail
  1 sibling, 2 replies; 56+ messages in thread
From: Myklebust, Trond @ 2013-03-30 21:57 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andreas Dilger, Jörn Engel, Andy Lutomirski, Zach Brown,
	Myklebust, Trond, Paolo Bonzini, Ric Wheeler, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker


On Mar 30, 2013, at 5:45 PM, Pavel Machek <pavel@ucw.cz>
 wrote:

> On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:
>> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
>>> Hmm, really? AFAICT it would be simple to provide an
>>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
>>> copy source file into it, then fsync(), then link it into filesystem.
>>> 
>>> That should have atomicity properties reflected.
>> 
>> Actually, the open_deleted_file() syscall is quite useful for many
>> different things all by itself.  Lots of applications need to create
>> temporary files that are unlinked at application failure (without a
>> race if app crashes after creating the file, but before unlinking).
>> It also avoids exposing temporary files into the namespace if other
>> applications are accessing the directory.
> 
> Hmm. open_deleted_file() will still need to get a directory... so it
> will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> be acceptable interface?
> 									Pavel

...and what's the big plan to make this work on anything other than ext4 and btrfs?

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-30 19:49               ` Pavel Machek
  2013-03-30 20:08                 ` Andreas Dilger
@ 2013-03-30 22:40                 ` Andy Lutomirski
  1 sibling, 0 replies; 56+ messages in thread
From: Andy Lutomirski @ 2013-03-30 22:40 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jörn Engel, Zach Brown, Myklebust, Trond, Paolo Bonzini,
	Ric Wheeler, Linux FS Devel, linux-kernel, Chris L. Mason,
	Christoph Hellwig, Alexander Viro, Martin K. Petersen,
	Hannes Reinecke, Joel Becker

On Sat, Mar 30, 2013 at 12:49 PM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> > I thought the first thing people would ask for is to atomically create a
>> > new file and copy the old file into it (at least on local file systems).
>> >  The idea is that nothing should see an empty destination file, either
>> > by race or by crash.  (This feature would perhaps be described as a
>> > pony, but it should be implementable.)
>>
>> Having already wasted many weeks trying to implement your pony, I would
>> consider it about as possible as winning the lottery three times in a
>> row.  It clearly is in theory and yet,...
>
> Hmm, really? AFAICT it would be simple to provide open_deleted_file("directory")
> syscall. You'd open_deleted_file(), copy source file into it, then
> fsync(), then link it into filesystem.

Isn't linking a deleted file back into the filesystem explicitly
forbidden?  I'm pretty sure that linking from /proc/fd/whatever
doesn't work.  (I've often wanted a flink system call that takes a
file descriptor and links it somewhere.  If it came with an option to
control whether it would overwrite an existing file, even better.)

--Andy
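
For reference, the closest existing approximation of such an flink is
the /proc-based idiom described in the linkat(2) manual page.  The
sketch below wraps it in a hypothetical helper, flink_via_proc().  It
assumes /proc is mounted, and, as that manual page notes, it generally
does not work once the file's link count has dropped to zero, which is
the case being asked about here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: give the file behind fd an additional name,
 * newpath, by linking through its /proc/self/fd entry.  Returns 0 on
 * success, or -1 with errno set.  It fails if newpath already exists,
 * so it never overwrites an existing file.
 */
static int flink_via_proc(int fd, const char *newpath)
{
    char proc_path[64];

    snprintf(proc_path, sizeof proc_path, "/proc/self/fd/%d", fd);

    /* AT_SYMLINK_FOLLOW makes linkat() resolve the /proc symlink. */
    return linkat(AT_FDCWD, proc_path, AT_FDCWD, newpath,
                  AT_SYMLINK_FOLLOW);
}
```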

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-30 21:57                     ` Myklebust, Trond
@ 2013-03-30 23:21                       ` Ric Wheeler
  2013-03-31  2:53                         ` Andreas Dilger
  2013-03-31  7:36                       ` Pavel Machek
  1 sibling, 1 reply; 56+ messages in thread
From: Ric Wheeler @ 2013-03-30 23:21 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Pavel Machek, Andreas Dilger, Jörn Engel, Andy Lutomirski,
	Zach Brown, Paolo Bonzini, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 03/30/2013 05:57 PM, Myklebust, Trond wrote:
> On Mar 30, 2013, at 5:45 PM, Pavel Machek <pavel@ucw.cz>
>   wrote:
>
>> On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:
>>> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
>>>> Hmm, really? AFAICT it would be simple to provide an
>>>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
>>>> copy source file into it, then fsync(), then link it into filesystem.
>>>>
>>>> That should have atomicity properties reflected.
>>> Actually, the open_deleted_file() syscall is quite useful for many
>>> different things all by itself.  Lots of applications need to create
>>> temporary files that are unlinked at application failure (without a
>>> race if app crashes after creating the file, but before unlinking).
>>> It also avoids exposing temporary files into the namespace if other
>>> applications are accessing the directory.
>> Hmm. open_deleted_file() will still need to get a directory... so it
>> will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
>> be acceptable interface?
>> 									Pavel
> ...and what's the big plan to make this work on anything other than ext4 and btrfs?
>
> Cheers,
>    Trond

I know that change can be a good thing, but are we really solving a pressing 
problem given that application developers have dealt with open/rename as the way 
to get "atomic" file creation for several decades now?

Regards,

Ric


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-30 23:21                       ` Ric Wheeler
@ 2013-03-31  2:53                         ` Andreas Dilger
  2013-03-31  3:52                           ` Myklebust, Trond
  0 siblings, 1 reply; 56+ messages in thread
From: Andreas Dilger @ 2013-03-31  2:53 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Myklebust, Trond, Pavel Machek, Jörn Engel, Andy Lutomirski,
	Zach Brown, Paolo Bonzini, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 2013-03-30, at 16:21, Ric Wheeler <rwheeler@redhat.com> wrote:

> On 03/30/2013 05:57 PM, Myklebust, Trond wrote:
>> On Mar 30, 2013, at 5:45 PM, Pavel Machek <pavel@ucw.cz>
>>  wrote:
>> 
>>> On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:
>>>> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
>>>>> Hmm, really? AFAICT it would be simple to provide an
>>>>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
>>>>> copy source file into it, then fsync(), then link it into filesystem.
>>>>> 
>>>>> That should have atomicity properties reflected.
>>>> Actually, the open_deleted_file() syscall is quite useful for many
>>>> different things all by itself.  Lots of applications need to create
>>>> temporary files that are unlinked at application failure (without a
>>>> race if app crashes after creating the file, but before unlinking).
>>>> It also avoids exposing temporary files into the namespace if other
>>>> applications are accessing the directory.
>>> Hmm. open_deleted_file() will still need to get a directory... so it
>>> will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
>>> be acceptable interface?
>>>                                    Pavel
>> ...and what's the big plan to make this work on anything other than ext4 and btrfs?
>> 
>> Cheers,
>>   Trond
> 
> I know that change can be a good thing, but are we really solving a pressing problem given that application developers have dealt with open/rename as the way to get "atomic" file creation for several decades now ?

Using open()+rename() has side effects:
- changes ctime/mtime on parent directory
- leaves temporary file in path during creation
- leaves temporary file in namespace during operations, and after crash

Cheers, Andreas
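
To make the side effects above concrete, here is a minimal sketch of
the open()+rename() pattern under discussion.  The helper name
copy_via_rename() and the ".XXXXXX" temporary suffix are assumptions
for illustration.  The temporary name is visible in the destination
directory from mkstemp() until rename(), and it is left behind if the
process dies in between.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy src to dst so that dst only ever appears
 * with its full contents.  Returns 0 on success, -1 on failure.
 */
static int copy_via_rename(const char *src, const char *dst)
{
    char tmp[4096];
    char buf[65536];
    ssize_t n;
    int in_fd, out_fd;

    if (snprintf(tmp, sizeof tmp, "%s.XXXXXX", dst) >= (int)sizeof tmp) {
        errno = ENAMETOOLONG;
        return -1;
    }
    in_fd = open(src, O_RDONLY);
    if (in_fd < 0)
        return -1;
    /* The temporary file is visible in the namespace from here on. */
    out_fd = mkstemp(tmp);
    if (out_fd < 0) {
        close(in_fd);
        return -1;
    }

    while ((n = read(in_fd, buf, sizeof buf)) != 0) {
        char *p = buf;

        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        while (n > 0) {
            ssize_t w = write(out_fd, p, (size_t)n);
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                goto fail;
            }
            p += w;
            n -= w;
        }
    }
    /* Flush the data before the new name becomes visible. */
    if (fsync(out_fd) < 0)
        goto fail;
    if (rename(tmp, dst) < 0)
        goto fail;
    close(out_fd);
    close(in_fd);
    return 0;

fail:
    unlink(tmp);
    close(out_fd);
    close(in_fd);
    return -1;
}
```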

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-31  2:53                         ` Andreas Dilger
@ 2013-03-31  3:52                           ` Myklebust, Trond
  2013-03-31  4:18                             ` Andy Lutomirski
  0 siblings, 1 reply; 56+ messages in thread
From: Myklebust, Trond @ 2013-03-31  3:52 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Ric Wheeler, Pavel Machek, Jörn Engel, Andy Lutomirski,
	Zach Brown, Paolo Bonzini, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Sat, 2013-03-30 at 19:53 -0700, Andreas Dilger wrote:
> On 2013-03-30, at 16:21, Ric Wheeler <rwheeler@redhat.com> wrote:
> 
> > On 03/30/2013 05:57 PM, Myklebust, Trond wrote:
> >> On Mar 30, 2013, at 5:45 PM, Pavel Machek <pavel@ucw.cz>
> >>  wrote:
> >> 
> >>> On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:
> >>>> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
> >>>>> Hmm, really? AFAICT it would be simple to provide an
> >>>>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
> >>>>> copy source file into it, then fsync(), then link it into filesystem.
> >>>>> 
> >>>>> That should have atomicity properties reflected.
> >>>> Actually, the open_deleted_file() syscall is quite useful for many
> >>>> different things all by itself.  Lots of applications need to create
> >>>> temporary files that are unlinked at application failure (without a
> >>>> race if app crashes after creating the file, but before unlinking).
> >>>> It also avoids exposing temporary files into the namespace if other
> >>>> applications are accessing the directory.
> >>> Hmm. open_deleted_file() will still need to get a directory... so it
> >>> will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> >>> be acceptable interface?
> >>>                                    Pavel
> >> ...and what's the big plan to make this work on anything other than ext4 and btrfs?
> >> 
> >> Cheers,
> >>   Trond
> > 
> > I know that change can be a good thing, but are we really solving a pressing problem given that application developers have dealt with open/rename as the way to get "atomic" file creation for several decades now ?
> 
> Using open()+rename() has side effects:
> - changes ctime/mtime on parent directory
> - leaves temporary file in path during creation
> - leaves temporary file in namespace during operations, and after crash

So what is the actual problem that is being solved? Yes, the above may
be disadvantages, but none of them have proven to be show-stoppers so
far.

So far, I've seen no justification for Andy's atomicity requirement
other than "it would be nice if...". That's not enough IMO...


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-31  3:52                           ` Myklebust, Trond
@ 2013-03-31  4:18                             ` Andy Lutomirski
  2013-03-31  4:36                               ` Myklebust, Trond
  0 siblings, 1 reply; 56+ messages in thread
From: Andy Lutomirski @ 2013-03-31  4:18 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Andreas Dilger, Ric Wheeler, Pavel Machek, Jörn Engel,
	Zach Brown, Paolo Bonzini, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Sat, Mar 30, 2013 at 8:52 PM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Sat, 2013-03-30 at 19:53 -0700, Andreas Dilger wrote:
>> On 2013-03-30, at 16:21, Ric Wheeler <rwheeler@redhat.com> wrote:
>>
>> > On 03/30/2013 05:57 PM, Myklebust, Trond wrote:
>> >> On Mar 30, 2013, at 5:45 PM, Pavel Machek <pavel@ucw.cz>
>> >>  wrote:
>> >>
>> >>> On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:
>> >>>> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
>> >>>>> Hmm, really? AFAICT it would be simple to provide an
>> >>>>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
>> >>>>> copy source file into it, then fsync(), then link it into filesystem.
>> >>>>>
>> >>>>> That should have atomicity properties reflected.
>> >>>> Actually, the open_deleted_file() syscall is quite useful for many
>> >>>> different things all by itself.  Lots of applications need to create
>> >>>> temporary files that are unlinked at application failure (without a
>> >>>> race if app crashes after creating the file, but before unlinking).
>> >>>> It also avoids exposing temporary files into the namespace if other
>> >>>> applications are accessing the directory.
>> >>> Hmm. open_deleted_file() will still need to get a directory... so it
>> >>> will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
>> >>> be acceptable interface?
>> >>>                                    Pavel
>> >> ...and what's the big plan to make this work on anything other than ext4 and btrfs?
>> >>
>> >> Cheers,
>> >>   Trond
>> >
>> > I know that change can be a good thing, but are we really solving a pressing problem given that application developers have dealt with open/rename as the way to get "atomic" file creation for several decades now ?
>>
>> Using open()+rename() has side effects:
>> - changes ctime/mtime on parent directory
>> - leaves temporary file in path during creation
>> - leaves temporary file in namespace during operations, and after crash
>
> So what is the actual problem that is being solved? Yes, the above may
> be disadvantages, but none of them have proven to be show-stoppers so
> far.
>
> So far, I've seen no justification for Andy's atomicity requirement
> other than "it would be nice if...". That's not enough IMO...

ISTM vpsendfile (or whatever it's called) plus a way to create deleted
files plus a way to relink deleted files gives atomic copies.  Perhaps
this is less efficient than would be ideal for OCFS2, though.

--Andy

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-31  4:18                             ` Andy Lutomirski
@ 2013-03-31  4:36                               ` Myklebust, Trond
  2013-03-31  4:45                                 ` Myklebust, Trond
  2013-04-01 15:49                                 ` J. Bruce Fields
  0 siblings, 2 replies; 56+ messages in thread
From: Myklebust, Trond @ 2013-03-31  4:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andreas Dilger, Ric Wheeler, Pavel Machek, Jörn Engel,
	Zach Brown, Paolo Bonzini, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Sat, 2013-03-30 at 21:18 -0700, Andy Lutomirski wrote:
> On Sat, Mar 30, 2013 at 8:52 PM, Myklebust, Trond
> <Trond.Myklebust@netapp.com> wrote:
> > On Sat, 2013-03-30 at 19:53 -0700, Andreas Dilger wrote:
> >> On 2013-03-30, at 16:21, Ric Wheeler <rwheeler@redhat.com> wrote:
> >>
> >> > On 03/30/2013 05:57 PM, Myklebust, Trond wrote:
> >> >> On Mar 30, 2013, at 5:45 PM, Pavel Machek <pavel@ucw.cz>
> >> >>  wrote:
> >> >>
> >> >>> On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:
> >> >>>> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
> >> >>>>> Hmm, really? AFAICT it would be simple to provide an
> >> >>>>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
> >> >>>>> copy source file into it, then fsync(), then link it into filesystem.
> >> >>>>>
> >> >>>>> That should have atomicity properties reflected.
> >> >>>> Actually, the open_deleted_file() syscall is quite useful for many
> >> >>>> different things all by itself.  Lots of applications need to create
> >> >>>> temporary files that are unlinked at application failure (without a
> >> >>>> race if app crashes after creating the file, but before unlinking).
> >> >>>> It also avoids exposing temporary files into the namespace if other
> >> >>>> applications are accessing the directory.
> >> >>> Hmm. open_deleted_file() will still need to get a directory... so it
> >> >>> will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> >> >>> be acceptable interface?
> >> >>>                                    Pavel
> >> >> ...and what's the big plan to make this work on anything other than ext4 and btrfs?
> >> >>
> >> >> Cheers,
> >> >>   Trond
> >> >
> >> > I know that change can be a good thing, but are we really solving a pressing problem given that application developers have dealt with open/rename as the way to get "atomic" file creation for several decades now ?
> >>
> >> Using open()+rename() has side effects:
> >> - changes ctime/mtime on parent directory
> >> - leaves temporary file in path during creation
> >> - leaves temporary file in namespace during operations, and after crash
> >
> > So what is the actual problem that is being solved? Yes, the above may
> > be disadvantages, but none of them have proven to be show-stoppers so
> > far.
> >
> > So far, I've seen no justification for Andy's atomicity requirement
> > other than "it would be nice if...". That's not enough IMO...
> 
> ISTM vpsendfile (or whatever it's called) plus a way to create deleted
> files plus a way to relink deleted files gives atomic copies.  Perhaps
> this is less efficient than would be ideal for OCFS2, though.

What real-life problem does the atomicity requirement solve? None of our
customers have ever asked for it. They don't care...

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-31  4:36                               ` Myklebust, Trond
@ 2013-03-31  4:45                                 ` Myklebust, Trond
  2013-04-01 15:49                                 ` J. Bruce Fields
  1 sibling, 0 replies; 56+ messages in thread
From: Myklebust, Trond @ 2013-03-31  4:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andreas Dilger, Ric Wheeler, Pavel Machek, Jörn Engel,
	Zach Brown, Paolo Bonzini, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Sun, 2013-03-31 at 00:36 -0400, Trond Myklebust wrote:
> On Sat, 2013-03-30 at 21:18 -0700, Andy Lutomirski wrote:
> > On Sat, Mar 30, 2013 at 8:52 PM, Myklebust, Trond
> > <Trond.Myklebust@netapp.com> wrote:
> > > On Sat, 2013-03-30 at 19:53 -0700, Andreas Dilger wrote:
> > >> On 2013-03-30, at 16:21, Ric Wheeler <rwheeler@redhat.com> wrote:
> > >>
> > >> > On 03/30/2013 05:57 PM, Myklebust, Trond wrote:
> > >> >> On Mar 30, 2013, at 5:45 PM, Pavel Machek <pavel@ucw.cz>
> > >> >>  wrote:
> > >> >>
> > >> >>> On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:
> > >> >>>> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
> > >> >>>>> Hmm, really? AFAICT it would be simple to provide an
> > >> >>>>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
> > >> >>>>> copy source file into it, then fsync(), then link it into filesystem.
> > >> >>>>>
> > >> >>>>> That should have atomicity properties reflected.
> > >> >>>> Actually, the open_deleted_file() syscall is quite useful for many
> > >> >>>> different things all by itself.  Lots of applications need to create
> > >> >>>> temporary files that are unlinked at application failure (without a
> > >> >>>> race if app crashes after creating the file, but before unlinking).
> > >> >>>> It also avoids exposing temporary files into the namespace if other
> > >> >>>> applications are accessing the directory.
> > >> >>> Hmm. open_deleted_file() will still need to get a directory... so it
> > >> >>> will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> > >> >>> be acceptable interface?
> > >> >>>                                    Pavel
> > >> >> ...and what's the big plan to make this work on anything other than ext4 and btrfs?
> > >> >>
> > >> >> Cheers,
> > >> >>   Trond
> > >> >
> > >> > I know that change can be a good thing, but are we really solving a pressing problem given that application developers have dealt with open/rename as the way to get "atomic" file creation for several decades now ?
> > >>
> > >> Using open()+rename() has side effects:
> > >> - changes ctime/mtime on parent directory
> > >> - leaves temporary file in path during creation
> > >> - leaves temporary file in namespace during operations, and after crash
> > >
> > > So what is the actual problem that is being solved? Yes, the above may
> > > be disadvantages, but none of them have proven to be show-stoppers so
> > > far.
> > >
> > > So far, I've seen no justification for Andy's atomicity requirement
> > > other than "it would be nice if...". That's not enough IMO...
> > 
> > ISTM vpsendfile (or whatever it's called) plus a way to create deleted
> > files plus a way to relink deleted files gives atomic copies.  Perhaps
> > this is less efficient than would be ideal for OCFS2, though.
> 
> What real-life problem does the atomicity requirement solve? None of our
> customers have ever asked for it. They don't care...
> 
BTW: before you do answer, please note that the current NFSv4.2 solution
_does_ allow you to lock the file before you copy.

IOW: the same atomicity rules apply to offloaded copy as apply to
standard copy: there is no requirement anywhere to apply stronger
semantics. Surprisingly enough, that works for most people...
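
A minimal sketch of the "lock the file before you copy" approach mentioned above, assuming POSIX advisory locks via Python's fcntl.lockf and an ordinary userspace copy. The helper name locked_copy is hypothetical and not from this thread, and an advisory lock only excludes writers that also take locks:

```python
import fcntl
import shutil


def locked_copy(src_path, dst_path):
    """Copy src_path to dst_path while holding a shared advisory lock on the source."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Shared (read) lock: cooperating writers that request an exclusive
        # lock are held off; processes that never lock are not affected.
        fcntl.lockf(src, fcntl.LOCK_SH)
        try:
            shutil.copyfileobj(src, dst)
        finally:
            fcntl.lockf(src, fcntl.LOCK_UN)
```

The copy itself has no stronger semantics than a plain read/write loop; any consistency comes from the lock that the caller and other writers agree to use.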

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-30 21:45                   ` Pavel Machek
  2013-03-30 21:57                     ` Myklebust, Trond
@ 2013-03-31  5:38                     ` AEDilger Gmail
  2013-03-31  8:25                       ` Pavel Machek
  1 sibling, 1 reply; 56+ messages in thread
From: AEDilger Gmail @ 2013-03-31  5:38 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andreas Dilger, Jörn Engel, Andy Lutomirski, Zach Brown,
	Myklebust, Trond, Paolo Bonzini, Ric Wheeler, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 2013-03-30, at 14:45, Pavel Machek <pavel@ucw.cz> wrote:
> On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:
>> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
>>> Hmm, really? AFAICT it would be simple to provide an
>>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
>>> copy source file into it, then fsync(), then link it into filesystem.
>>> 
>>> That should have atomicity properties reflected.
>> 
>> Actually, the open_deleted_file() syscall is quite useful for many
>> different things all by itself.  Lots of applications need to create
>> temporary files that are unlinked at application failure (without a
>> race if app crashes after creating the file, but before unlinking).
>> It also avoids exposing temporary files into the namespace if other
>> applications are accessing the directory.
> 
> Hmm. open_deleted_file() will still need to get a directory... so it
> will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> be acceptable interface?

Yes, that would be reasonable, and/or possibly openat(fd, NULL, AT_FDCWD|AT_UNLINKED)?

Cheers, Andreas

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-30 21:57                     ` Myklebust, Trond
  2013-03-30 23:21                       ` Ric Wheeler
@ 2013-03-31  7:36                       ` Pavel Machek
  2013-03-31 18:27                         ` Myklebust, Trond
  1 sibling, 1 reply; 56+ messages in thread
From: Pavel Machek @ 2013-03-31  7:36 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Andreas Dilger, Jörn Engel, Andy Lutomirski, Zach Brown,
	Paolo Bonzini, Ric Wheeler, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

Hi!

> >>> Hmm, really? AFAICT it would be simple to provide an
> >>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
> >>> copy source file into it, then fsync(), then link it into filesystem.
> >>> 
> >>> That should have atomicity properties reflected.
> >> 
> >> Actually, the open_deleted_file() syscall is quite useful for many
> >> different things all by itself.  Lots of applications need to create
> >> temporary files that are unlinked at application failure (without a
> >> race if app crashes after creating the file, but before unlinking).
> >> It also avoids exposing temporary files into the namespace if other
> >> applications are accessing the directory.
> > 
> > Hmm. open_deleted_file() will still need to get a directory... so it
> > will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> > be acceptable interface?
> 
> ...and what's the big plan to make this work on anything other than ext4 and btrfs?

Deleted but open files are from original unix, so it should work on
anything unixy (minix, ext, ext2, ...).
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-31  5:38                     ` AEDilger Gmail
@ 2013-03-31  8:25                       ` Pavel Machek
  0 siblings, 0 replies; 56+ messages in thread
From: Pavel Machek @ 2013-03-31  8:25 UTC (permalink / raw)
  To: AEDilger Gmail
  Cc: Andreas Dilger, Jörn Engel, Andy Lutomirski, Zach Brown,
	Myklebust, Trond, Paolo Bonzini, Ric Wheeler, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

Hi!
On Sat 2013-03-30 22:38:35, AEDilger Gmail wrote:
> On 2013-03-30, at 14:45, Pavel Machek <pavel@ucw.cz> wrote:
> > On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:
> >> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
> >>> Hmm, really? AFAICT it would be simple to provide an
> >>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
> >>> copy source file into it, then fsync(), then link it into filesystem.
> >>> 
> >>> That should have atomicity properties reflected.
> >> 
> >> Actually, the open_deleted_file() syscall is quite useful for many
> >> different things all by itself.  Lots of applications need to create
> >> temporary files that are unlinked at application failure (without a
> >> race if app crashes after creating the file, but before unlinking).
> >> It also avoids exposing temporary files into the namespace if other
> >> applications are accessing the directory.
> > 
> > Hmm. open_deleted_file() will still need to get a directory... so it
> > will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> > be acceptable interface?
> 
> Yes, that would be reasonable, and/or possibly openat(fd, NULL, AT_FDCWD|AT_UNLINKED)?

openat() is better interface for this, I'd say.

BTW... I don't think this has to be done at the same time as splice()
[or how it ends up being called] changes...

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-30 20:08                 ` Andreas Dilger
  2013-03-30 21:45                   ` Pavel Machek
@ 2013-03-31 11:48                   ` Pádraig Brady
  1 sibling, 0 replies; 56+ messages in thread
From: Pádraig Brady @ 2013-03-31 11:48 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Pavel Machek, Jörn Engel, Andy Lutomirski, Zach Brown,
	Myklebust, Trond, Paolo Bonzini, Ric Wheeler, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 03/30/2013 08:08 PM, Andreas Dilger wrote:
> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
>> Hmm, really? AFAICT it would be simple to provide an
>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
>> copy source file into it, then fsync(), then link it into filesystem.
>>
>> That should have atomicity properties reflected.
> 
> Actually, the open_deleted_file() syscall is quite useful for many
> different things all by itself.  Lots of applications need to create
> temporary files that are unlinked at application failure (without a
> race if app crashes after creating the file, but before unlinking).
> It also avoids exposing temporary files into the namespace if other
> applications are accessing the directory.
> 
> We've added a library routine that does this for Lustre in a hackish
> way (magical filename created in target directory) for being able to
> migrate files between data servers, HSM, defragmentation, rsync, etc.
> 
> Cheers, Andreas

This reminds me of the flink() discussion:
http://marc.info/?l=linux-kernel&m=104965452917349

Also kinda related is the exchangedata() OSX system call to
"atomically exchange data between two files"

thanks,
Pádraig.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-31  7:36                       ` Pavel Machek
@ 2013-03-31 18:27                         ` Myklebust, Trond
  2013-03-31 18:32                           ` openat(..., AT_UNLINKED) was " Pavel Machek
  0 siblings, 1 reply; 56+ messages in thread
From: Myklebust, Trond @ 2013-03-31 18:27 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andreas Dilger, Jörn Engel, Andy Lutomirski, Zach Brown,
	Paolo Bonzini, Ric Wheeler, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Sun, 2013-03-31 at 09:36 +0200, Pavel Machek wrote:
> Hi!
> 
> > >>> Hmm, really? AFAICT it would be simple to provide an
> > >>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
> > >>> copy source file into it, then fsync(), then link it into filesystem.
> > >>> 
> > >>> That should have atomicity properties reflected.
> > >> 
> > >> Actually, the open_deleted_file() syscall is quite useful for many
> > >> different things all by itself.  Lots of applications need to create
> > >> temporary files that are unlinked at application failure (without a
> > >> race if app crashes after creating the file, but before unlinking).
> > >> It also avoids exposing temporary files into the namespace if other
> > >> applications are accessing the directory.
> > > 
> > > Hmm. open_deleted_file() will still need to get a directory... so it
> > > will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> > > be acceptable interface?
> > 
> > ...and what's the big plan to make this work on anything other than ext4 and btrfs?
> 
> Deleted but open files are from original unix, so it should work on
> anything unixy (minix, ext, ext2, ...).
> 								Pavel

minix, ext, ext2... are not under active development and haven't been
for more than a decade.

Take a look at how many actively used filesystems out there that have
some variant of sillyrename(), and explain what you want to do in those
cases.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?
  2013-03-31 18:27                         ` Myklebust, Trond
@ 2013-03-31 18:32                           ` Pavel Machek
  2013-03-31 18:44                             ` Myklebust, Trond
  0 siblings, 1 reply; 56+ messages in thread
From: Pavel Machek @ 2013-03-31 18:32 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Andreas Dilger, Jörn Engel, Andy Lutomirski, Zach Brown,
	Paolo Bonzini, Ric Wheeler, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker


> > > > Hmm. open_deleted_file() will still need to get a directory... so it
> > > > will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> > > > be acceptable interface?
> > > 
> > > ...and what's the big plan to make this work on anything other than ext4 and btrfs?
> > 
> > Deleted but open files are from original unix, so it should work on
> > anything unixy (minix, ext, ext2, ...).
> 
> minix, ext, ext2... are not under active development and haven't been
> for more than a decade.
> 
> Take a look at how many actively used filesystems out there that have
> some variant of sillyrename(), and explain what you want to do in those
> cases.

Well. Yes, there are non-unix filesystems around. You have to deal
with silly files on them, and this will not be different.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?
  2013-03-31 18:32                           ` openat(..., AT_UNLINKED) was " Pavel Machek
@ 2013-03-31 18:44                             ` Myklebust, Trond
  2013-03-31 22:50                               ` Pavel Machek
  0 siblings, 1 reply; 56+ messages in thread
From: Myklebust, Trond @ 2013-03-31 18:44 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andreas Dilger, Jörn Engel, Andy Lutomirski, Zach Brown,
	Paolo Bonzini, Ric Wheeler, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Sun, 2013-03-31 at 20:32 +0200, Pavel Machek wrote:
> > > > > Hmm. open_deleted_file() will still need to get a directory... so it
> > > > > will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> > > > > be acceptable interface?
> > > > 
> > > > ...and what's the big plan to make this work on anything other than ext4 and btrfs?
> > > 
> > > Deleted but open files are from original unix, so it should work on
> > > anything unixy (minix, ext, ext2, ...).
> > 
> > minix, ext, ext2... are not under active development and haven't been
> > for more than a decade.
> > 
> > Take a look at how many actively used filesystems out there that have
> > some variant of sillyrename(), and explain what you want to do in those
> > cases.
> 
> Well. Yes, there are non-unix filesystems around. You have to deal
> with silly files on them, and this will not be different.

So this would be a local POSIX filesystem only solution to a problem
that has yet to be formulated?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: New copyfile system call - discuss before LSF?
  2013-03-30 19:45               ` Pavel Machek
@ 2013-03-31 21:23                 ` Eric Wong
  0 siblings, 0 replies; 56+ messages in thread
From: Eric Wong @ 2013-03-31 21:23 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Myklebust, Trond, Zach Brown, Paolo Bonzini, Ric Wheeler,
	Linux FS Devel, linux-kernel, Chris L. Mason, Christoph Hellwig,
	Alexander Viro, Martin K. Petersen, Hannes Reinecke, Joel Becker

Pavel Machek <pavel@ucw.cz> wrote:
> Eric Wong wrote:
> > [1] my splice() annoyances:
> >     * need to create/manage a pipe
> >     * copy size limited by pipe size
> >     * doesn't reduce userspace syscalls (just data copy overhead)
> >     * easy to misuse and starve with blocking sockets + big buffers
> >     * not many users, so bugs creep in (v3.7.8 was the first usable
> >       version of the 3.7 series for TCP sockets)
> 
> Could library be created to make it less annoying to use, and harder
> to misuse?

Maybe, but getting people to use the library would be hard, too.
And a library would not reduce syscalls in the common case.

We already have current->splice_pipe for sendfile, so maybe splice can
be taught to transparently use that when neither FD is a pipe.

I also think a SPLICE_F_DONTWAIT flag might be necessary.  It would be a
superset of SPLICE_F_NONBLOCK, but also act like MSG_DONTWAIT for the
non-pipe socket.

> splice man page does not mention pipe size limit...

It probably should.  I think I discovered it by using it many years ago
and burned it into my mind.
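
For comparison with the splice() complaints quoted above, here is a minimal sketch of a file-to-file copy using sendfile(2) through Python's os.sendfile, assuming Linux (where the output descriptor may be a regular file). The caller creates no pipe, but the loop still issues one syscall per chunk. The helper name sendfile_copy and the chunk size are illustrative assumptions, not something proposed in this thread:

```python
import os


def sendfile_copy(src_path, dst_path, chunk=1 << 20):
    """Copy src_path to dst_path with sendfile(2); return the bytes copied."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        while remaining > 0:
            # Read from an explicit source offset; the data is written at the
            # destination's current file position, which advances on each call.
            n = os.sendfile(dst.fileno(), src.fileno(), copied,
                            min(chunk, remaining))
            if n == 0:
                break  # source shrank underneath us; stop rather than spin
            copied += n
            remaining -= n
    return copied
```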

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?
  2013-03-31 18:44                             ` Myklebust, Trond
@ 2013-03-31 22:50                               ` Pavel Machek
  2013-03-31 23:14                                 ` Ric Wheeler
  0 siblings, 1 reply; 56+ messages in thread
From: Pavel Machek @ 2013-03-31 22:50 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Andreas Dilger, Jörn Engel, Andy Lutomirski, Zach Brown,
	Paolo Bonzini, Ric Wheeler, Linux FS Devel, linux-kernel,
	Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Sun 2013-03-31 18:44:53, Myklebust, Trond wrote:
> On Sun, 2013-03-31 at 20:32 +0200, Pavel Machek wrote:
> > > > > > Hmm. open_deleted_file() will still need to get a directory... so it
> > > > > > will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> > > > > > be acceptable interface?
> > > > > 
> > > > > ...and what's the big plan to make this work on anything other than ext4 and btrfs?
> > > > 
> > > > Deleted but open files are from original unix, so it should work on
> > > > anything unixy (minix, ext, ext2, ...).
> > > 
> > > minix, ext, ext2... are not under active development and haven't been
> > > for more than a decade.
> > > 
> > > Take a look at how many actively used filesystems out there that have
> > > some variant of sillyrename(), and explain what you want to do in those
> > > cases.
> > 
> > Well. Yes, there are non-unix filesystems around. You have to deal
> > with silly files on them, and this will not be different.
> 
> So this would be a local POSIX filesystem only solution to a problem
> that has yet to be formulated?

Problem is "classical create temp file then delete it" is racy. See the
archives. That is useful & common operation.

Problem is "atomically create file at target location with guaranteed
right content". That's also in the archives. Looks useful if someone
does rsync from your directory.

Non-POSIX filesystems have problems handling deleted files, but that
was always the case. That's one of the reasons they are seldom used
for root filesystems.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?
  2013-03-31 22:50                               ` Pavel Machek
@ 2013-03-31 23:14                                 ` Ric Wheeler
  2013-03-31 23:18                                   ` Pavel Machek
  0 siblings, 1 reply; 56+ messages in thread
From: Ric Wheeler @ 2013-03-31 23:14 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Myklebust, Trond, Andreas Dilger, Jörn Engel,
	Andy Lutomirski, Zach Brown, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 03/31/2013 06:50 PM, Pavel Machek wrote:
> On Sun 2013-03-31 18:44:53, Myklebust, Trond wrote:
>> On Sun, 2013-03-31 at 20:32 +0200, Pavel Machek wrote:
>>>>>>> Hmm. open_deleted_file() will still need to get a directory... so it
>>>>>>> will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
>>>>>>> be acceptable interface?
>>>>>> ...and what's the big plan to make this work on anything other than ext4 and btrfs?
>>>>> Deleted but open files are from original unix, so it should work on
>>>>> anything unixy (minix, ext, ext2, ...).
>>>> minix, ext, ext2... are not under active development and haven't been
>>>> for more than a decade.
>>>>
>>>> Take a look at how many actively used filesystems out there that have
>>>> some variant of sillyrename(), and explain what you want to do in those
>>>> cases.
>>> Well. Yes, there are non-unix filesystems around. You have to deal
>>> with silly files on them, and this will not be different.
>> So this would be a local POSIX filesystem only solution to a problem
>> that has yet to be formulated?
> Problem is "classical create temp file then delete it" is racy. See the
> archives. That is useful & common operation.

Which race are you concerned with exactly?

User wants to test for a file with name "foo.txt"

* create "foo.txt~" (or whatever)
* write contents into "foo.txt~"
* rename "foo.txt~" to "foo.txt"

Until rename is done, the file does not exist and is not complete. You will
potentially have a garbage file to clean up if the program (or system) crashes, 
but that is not racy in a classic sense, right?

This is more of a garbage clean up issue?
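
The create/write/rename sequence listed above can be sketched as follows. This is a minimal illustration, assuming a POSIX filesystem where rename() within one directory is atomic; the helper name write_then_rename is hypothetical and the "~" suffix is taken from the example above:

```python
import os


def write_then_rename(path, data):
    """Write data to path + "~", then rename it over path."""
    tmp = path + "~"
    # O_EXCL makes a leftover "~" file from an earlier crash show up as an
    # error instead of being silently reused.
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # contents reach disk before the rename
        os.rename(tmp, path)      # the final name appears only here
    except BaseException:
        # Best-effort cleanup; a crash before this point leaves tmp behind.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

Until the rename, other processes listing the directory can see the "~" file, which is the visibility and cleanup issue under discussion.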

Regards,

Ric

>
> Problem is "atomically create file at target location with guaranteed
> right content". That's also in the archives. Looks useful if someone
> does rsync from your directory.
>
> Non-POSIX filesystems have problems handling deleted files, but that
> was always the case. That's one of the reasons they are seldom used
> for root filesystems.
>
> 									Pavel


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?
  2013-03-31 23:14                                 ` Ric Wheeler
@ 2013-03-31 23:18                                   ` Pavel Machek
  2013-03-31 23:28                                     ` Ric Wheeler
  0 siblings, 1 reply; 56+ messages in thread
From: Pavel Machek @ 2013-03-31 23:18 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Myklebust, Trond, Andreas Dilger, Jörn Engel,
	Andy Lutomirski, Zach Brown, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

Hi!

> >>>>Take a look at how many actively used filesystems out there that have
> >>>>some variant of sillyrename(), and explain what you want to do in those
> >>>>cases.
> >>>Well. Yes, there are non-unix filesystems around. You have to deal
> >>>with silly files on them, and this will not be different.
> >>So this would be a local POSIX filesystem only solution to a problem
> >>that has yet to be formulated?
> >Problem is "classical create temp file then delete it" is racy. See the
> >archives. That is useful & common operation.
> 
> Which race are you concerned with exactly?
> 
> User wants to test for a file with name "foo.txt"
> 
> * create "foo.txt~" (or whatever)
> * write contents into "foo.txt~"
> * rename "foo.txt~" to "foo.txt"
> 
> Until rename is done, the file does not exist and is not complete.
> You will potentially have a garbage file to clean up if the program
> (or system) crashes, but that is not racy in a classic sense, right?

Well. If people rsync from you, they will start fetching incomplete
foo.txt~. Plus the garbage issue.

> This is more of a garbage clean up issue?

Also. Plus sometimes you want temporary "file" that is
deleted. Terminals use it for history, etc...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?
  2013-03-31 23:18                                   ` Pavel Machek
@ 2013-03-31 23:28                                     ` Ric Wheeler
  2013-03-31 23:41                                       ` Pavel Machek
  0 siblings, 1 reply; 56+ messages in thread
From: Ric Wheeler @ 2013-03-31 23:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Myklebust, Trond, Andreas Dilger, Jörn Engel,
	Andy Lutomirski, Zach Brown, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On 03/31/2013 07:18 PM, Pavel Machek wrote:
> Hi!
>
>>>>>> Take a look at how many actively used filesystems out there that have
>>>>>> some variant of sillyrename(), and explain what you want to do in those
>>>>>> cases.
>>>>> Well. Yes, there are non-unix filesystems around. You have to deal
>>>>> with silly files on them, and this will not be different.
>>>> So this would be a local POSIX filesystem only solution to a problem
>>>> that has yet to be formulated?
>>> Problem is "classical create temp file then delete it" is racy. See the
>>> archives. That is useful & common operation.
>> Which race are you concerned with exactly?
>>
>> User wants to test for a file with name "foo.txt"
>>
>> * create "foo.txt~" (or whatever)
>> * write contents into "foo.txt~"
>> * rename "foo.txt~" to "foo.txt"
>>
>> Until rename is done, the file does not exist and is not complete.
>> You will potentially have a garbage file to clean up if the program
>> (or system) crashes, but that is not racy in a classic sense, right?
> Well. If people rsync from you, they will start fetching incomplete
> foo.txt~. Plus the garbage issue.

That is not racy, just garbage (not trying to be pedantic, just trying to 
understand). I can see that the "~" file is annoying, but we have dealt with it 
for a *long* time :)

Until it has the right name (on either the source or target system for rsync), 
it is not the file you are looking for.
>
>> This is more of a garbage clean up issue?
> Also. Plus sometimes you want temporary "file" that is
> deleted. Terminals use it for history, etc...

There you would have a race, you can create a file and unlink it of course and 
still write to it, but you would have a potential empty file issue?
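
A minimal sketch of the create-then-unlink idiom described above, assuming a local POSIX filesystem that keeps an unlinked file alive while a descriptor is open. The helper name open_unlinked and the default file name are hypothetical. The gap between the two calls is the window in which a crash leaves an empty file behind:

```python
import os


def open_unlinked(directory, name=".scratch"):
    """Create a file in directory, unlink it at once, and return the open fd."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    # Window: a crash between os.open() and os.unlink() leaves an empty
    # file named `path` in the directory.
    os.unlink(path)
    return fd
```

The descriptor stays usable for reads and writes after the name is gone, and the storage is released when it is closed.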

Ric



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: openat(..., AT_UNLINKED) was Re: New copyfile system call - discuss before LSF?
  2013-03-31 23:28                                     ` Ric Wheeler
@ 2013-03-31 23:41                                       ` Pavel Machek
  0 siblings, 0 replies; 56+ messages in thread
From: Pavel Machek @ 2013-03-31 23:41 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Myklebust, Trond, Andreas Dilger, Jörn Engel,
	Andy Lutomirski, Zach Brown, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

Hi!

> >>User wants to test for a file with name "foo.txt"
> >>
> >>* create "foo.txt~" (or whatever)
> >>* write contents into "foo.txt~"
> >>* rename "foo.txt~" to "foo.txt"
> >>
> >>Until rename is done, the file does not exist and is not complete.
> >>You will potentially have a garbage file to clean up if the program
> >>(or system) crashes, but that is not racy in a classic sense, right?
> >Well. If people rsync from you, they will start fetching incomplete
> >foo.txt~. Plus the garbage issue.
> 
> That is not racy, just garbage (not trying to be pedantic, just
> trying to understand). I can see that the "~" file is annoying, but
> we have dealt with it for a *long* time :)

OK, so let's keep it at "~" is annoying :-).

[But... I was wrong. openat(..., AT_UNLINKED) is not enough to solve
this: we do not have flink() and it is not easily possible to link
a deleted file "back to life" from /proc/self/fd:

pavel@amd:/tmp$ > delme
pavel@amd:/tmp$ bash 3< delme &
[2] 32667
[2]+  Stopped                 bash 3< delme
pavel@amd:/tmp$ fg
bash 3< delme
pavel@amd:/tmp$ ls -al delme
-rw-r--r-- 1 pavel pavel 0 Apr  1 01:36 delme
pavel@amd:/tmp$ ls -al /proc/self/fd/3 
lr-x------ 1 pavel pavel 64 Apr  1 01:37 /proc/self/fd/3 -> /tmp/delme
pavel@amd:/tmp$ rm delme
pavel@amd:/tmp$ ls -al /proc/self/fd/3 
lr-x------ 1 pavel pavel 64 Apr  1 01:37 /proc/self/fd/3 -> /tmp/delme (deleted)
pavel@amd:/tmp$ ln /proc/self/fd/3 delme2
ln: creating hard link `delme2' => `/proc/self/fd/3': Invalid cross-device link
]
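
The same experiment can be repeated from a program. The sketch below is an
illustration under stated assumptions, not part of the original mail: the
helper name try_relink_deleted is made up, and follow_symlinks=False is passed
so that the call links the /proc symlink itself rather than its target, as a
plain link(2) would. The error it reports depends on the kernel and on whether
/proc is mounted; the transcript above shows EXDEV.

```python
import errno
import os
import tempfile


def try_relink_deleted(directory):
    """Unlink a freshly created file, then try to link it back by fd.

    Returns None if /proc/self/fd is unavailable, "linked" if the link
    unexpectedly succeeds, otherwise the symbolic errno name.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.unlink(path)
        proc_path = "/proc/self/fd/%d" % fd
        if not os.path.lexists(proc_path):
            return None
        target = path + ".relinked"
        try:
            # Do not follow the /proc symlink; link the symlink itself.
            os.link(proc_path, target, follow_symlinks=False)
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
        os.unlink(target)
        return "linked"
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(try_relink_deleted(d))
```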

> >>This is more of a garbage clean up issue?
> >Also. Plus sometimes you want temporary "file" that is
> >deleted. Terminals use it for history, etc...
> 
> There you would have a race, you can create a file and unlink it of
> course and still write to it, but you would have a potential empty
> file issue?

Yes. openat(..., AT_UNLINKED) solves that -- you'll no longer get
those files. (Not sure they'd always be empty. How do you ensure the rm
hits the disk? fsync() on the parent directory? Sounds expensive.)
									Pavel
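
For reference, the "fsync() on the parent directory" idea looks roughly like
this. It is a hedged sketch: the helper name unlink_durably is hypothetical,
and whether a directory fsync() is sufficient (or even supported) depends on
the filesystem.

```python
import os
import tempfile


def unlink_durably(path):
    """Unlink path, then fsync its parent directory.

    Hypothetical helper: asks the filesystem to commit the directory
    update that removed the name, at the cost of a synchronous flush.
    """
    parent = os.path.dirname(os.path.abspath(path))
    os.unlink(path)
    dir_fd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        victim = os.path.join(d, "delme")
        open(victim, "w").close()
        unlink_durably(victim)
        print(os.path.exists(victim))
```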
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: New copyfile system call - discuss before LSF?
  2013-03-31  4:36                               ` Myklebust, Trond
  2013-03-31  4:45                                 ` Myklebust, Trond
@ 2013-04-01 15:49                                 ` J. Bruce Fields
  1 sibling, 0 replies; 56+ messages in thread
From: J. Bruce Fields @ 2013-04-01 15:49 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Andy Lutomirski, Andreas Dilger, Ric Wheeler, Pavel Machek,
	Jörn Engel, Zach Brown, Paolo Bonzini, Linux FS Devel,
	linux-kernel, Chris L. Mason, Christoph Hellwig, Alexander Viro,
	Martin K. Petersen, Hannes Reinecke, Joel Becker

On Sun, Mar 31, 2013 at 04:36:59AM +0000, Myklebust, Trond wrote:
> On Sat, 2013-03-30 at 21:18 -0700, Andy Lutomirski wrote:
> > On Sat, Mar 30, 2013 at 8:52 PM, Myklebust, Trond
> > <Trond.Myklebust@netapp.com> wrote:
> > > On Sat, 2013-03-30 at 19:53 -0700, Andreas Dilger wrote:
> > >> On 2013-03-30, at 16:21, Ric Wheeler <rwheeler@redhat.com> wrote:
> > >>
> > >> > On 03/30/2013 05:57 PM, Myklebust, Trond wrote:
> > >> >> On Mar 30, 2013, at 5:45 PM, Pavel Machek <pavel@ucw.cz>
> > >> >>  wrote:
> > >> >>
> > >> >>> On Sat 2013-03-30 13:08:39, Andreas Dilger wrote:
> > >> >>>> On 2013-03-30, at 12:49 PM, Pavel Machek wrote:
> > >> >>>>> Hmm, really? AFAICT it would be simple to provide an
> > >> >>>>> open_deleted_file("directory") syscall. You'd open_deleted_file(),
> > >> >>>>> copy source file into it, then fsync(), then link it into filesystem.
> > >> >>>>>
> > >> >>>>> That should have atomicity properties reflected.
> > >> >>>> Actually, the open_deleted_file() syscall is quite useful for many
> > >> >>>> different things all by itself.  Lots of applications need to create
> > >> >>>> temporary files that are unlinked at application failure (without a
> > >> >>>> race if app crashes after creating the file, but before unlinking).
> > >> >>>> It also avoids exposing temporary files into the namespace if other
> > >> >>>> applications are accessing the directory.
> > >> >>> Hmm. open_deleted_file() will still need to get a directory... so it
> > >> >>> will still need a path. Perhaps open("/foo/bar/mnt", O_DELETED) would
> > >> >>> be acceptable interface?
> > >> >>>                                    Pavel
> > >> >> ...and what's the big plan to make this work on anything other than ext4 and btrfs?
> > >> >>
> > >> >> Cheers,
> > >> >>   Trond
> > >> >
> > >> > I know that change can be a good thing, but are we really solving a
> > >> > pressing problem given that application developers have dealt with
> > >> > open/rename as the way to get "atomic" file creation for several
> > >> > decades now?
> > >>
> > >> Using open()+rename() has side effects:
> > >> - changes ctime/mtime on parent directory
> > >> - leaves temporary file in path during creation
> > >> - leaves temporary file in namespace during operations, and after crash
> > >
> > > So what is the actual problem that is being solved? Yes, the above may
> > > be disadvantages, but none of them have proven to be show-stoppers so
> > > far.
> > >
> > > So far, I've seen no justification for Andy's atomicity requirement
> > > other than "it would be nice if...". That's not enough IMO...
> > 
> > ISTM vpsendfile (or whatever it's called) plus a way to create deleted
> > files plus a way to relink deleted files gives atomic copies.  Perhaps
> > this is less efficient than would be ideal for OCFS2, though.
> 
> What real-life problem does the atomicity requirement solve?

I've occasionally wondered whether something like that would help an nfs
server implement atomic v4 open (which can acquire share locks and set
attributes): create an anonymous file, get the locks and set the
attributes, then link it in only once all that's succeeded.

I don't know if that actually works--among other problems, I'm not sure
how you'd implement O_CREAT and O_EXCL.  Probably it would make more
sense just to add a new open system call that does what we want.  (If we
decide we even care that much about perfect atomicity for v4 open
semantics that few clients actually use.)
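
A rough userspace approximation of "set everything up, then link it in", to
make the idea concrete. Everything here is an assumption made for
illustration: the helper name create_then_link is made up, a temporary name
stands in for the anonymous file, flock() stands in for a share lock, and
link() failing when the target exists stands in for O_EXCL. It is not how an
nfs server does or would do this.

```python
import fcntl
import os
import stat
import tempfile


def create_then_link(final_path, mode=0o640):
    """Stage a file, set attributes and a lock, then link it in exclusively.

    Returns an open descriptor holding the lock. Raises FileExistsError if
    final_path already exists; nothing is left behind in that case.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".staged-")
    try:
        os.fchmod(fd, mode)                             # set attributes
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # take the lock
        os.link(tmp_path, final_path)                   # fails if it exists
    except BaseException:
        os.close(fd)
        raise
    finally:
        os.unlink(tmp_path)  # the staging name is always removed
    return fd


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "obj")
        fd = create_then_link(path, 0o600)
        print(oct(stat.S_IMODE(os.stat(path).st_mode)))
        os.close(fd)
```

The staging name is still briefly visible in the directory, so this does not
give the perfect atomicity being asked about; it only shows the ordering.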

--b.


end of thread, other threads:[~2013-04-01 15:49 UTC | newest]

Thread overview: 56+ messages
2013-02-21 11:37 New copyfile system call - discuss before LSF? Ric Wheeler
2013-02-21 13:37 ` Hannes Reinecke
2013-02-21 13:51 ` Myklebust, Trond
2013-02-21 14:57   ` Ric Wheeler
2013-02-21 16:36     ` Andreas Dilger
2013-02-21 20:00     ` Paolo Bonzini
2013-02-21 20:50       ` Myklebust, Trond
2013-02-21 22:24         ` Zach Brown
2013-02-22  1:29           ` Myklebust, Trond
2013-02-23  0:32             ` Eric Wong
2013-03-30 19:45               ` Pavel Machek
2013-03-31 21:23                 ` Eric Wong
2013-02-22  9:47           ` Paolo Bonzini
2013-02-22  9:52             ` Ric Wheeler
2013-02-22 18:22               ` Zach Brown
2013-02-22 22:48                 ` Myklebust, Trond
2013-02-25 21:14           ` Andy Lutomirski
2013-02-25 21:49             ` Ric Wheeler
2013-02-25 21:59               ` Myklebust, Trond
2013-02-25 22:16                 ` Andy Lutomirski
2013-02-25 23:28                   ` Myklebust, Trond
2013-02-25 23:35                     ` Andy Lutomirski
2013-02-25 23:45                       ` Myklebust, Trond
2013-02-26  0:03                         ` Zach Brown
2013-03-11  9:31                           ` Joel Becker
2013-02-26 21:02             ` Jörn Engel
2013-02-26 22:35               ` Andy Lutomirski
2013-03-30 19:49               ` Pavel Machek
2013-03-30 20:08                 ` Andreas Dilger
2013-03-30 21:45                   ` Pavel Machek
2013-03-30 21:57                     ` Myklebust, Trond
2013-03-30 23:21                       ` Ric Wheeler
2013-03-31  2:53                         ` Andreas Dilger
2013-03-31  3:52                           ` Myklebust, Trond
2013-03-31  4:18                             ` Andy Lutomirski
2013-03-31  4:36                               ` Myklebust, Trond
2013-03-31  4:45                                 ` Myklebust, Trond
2013-04-01 15:49                                 ` J. Bruce Fields
2013-03-31  7:36                       ` Pavel Machek
2013-03-31 18:27                         ` Myklebust, Trond
2013-03-31 18:32                           ` openat(..., AT_UNLINKED) was " Pavel Machek
2013-03-31 18:44                             ` Myklebust, Trond
2013-03-31 22:50                               ` Pavel Machek
2013-03-31 23:14                                 ` Ric Wheeler
2013-03-31 23:18                                   ` Pavel Machek
2013-03-31 23:28                                     ` Ric Wheeler
2013-03-31 23:41                                       ` Pavel Machek
2013-03-31  5:38                     ` AEDilger Gmail
2013-03-31  8:25                       ` Pavel Machek
2013-03-31 11:48                   ` Pádraig Brady
2013-03-30 22:40                 ` Andy Lutomirski
2013-02-21 22:05       ` Ric Wheeler
2013-02-21 22:13         ` Myklebust, Trond
2013-02-22  8:47           ` Ric Wheeler
2013-02-21 18:29   ` Jeremy Allison
2013-02-22  0:29     ` Eric Wong
