* Reflink (cow) copy of busy files
@ 2018-02-24 18:20 Gionatan Danti
  2018-02-24 22:07 ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Gionatan Danti @ 2018-02-24 18:20 UTC
To: linux-xfs; +Cc: g.danti

Hi all,
I have a question on how CoW/reflink works when used on busy files,
such as VM image files, databases, etc.

In short: can a reflink copy be used to create a crash-consistent
snapshot of, say, a busy VM disk file? Or should the db/vm/whatever
be quiesced before taking the copy (i.e. similarly to how LVM calls
fsfreeze during a snapshot)?

Thanks.

--
Danti Gionatan
Technical Support
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: Reflink (cow) copy of busy files
From: Dave Chinner @ 2018-02-24 22:07 UTC
To: Gionatan Danti; +Cc: linux-xfs

On Sat, Feb 24, 2018 at 07:20:48PM +0100, Gionatan Danti wrote:
> Hi all,
> I have a question on how CoW/reflink works when used on busy files,
> such as VM image files, databases, etc.

Define "busy file", please.

> In short: can a reflink copy be used to create a crash-consistent
> snapshot of, say, a busy VM disk file?

If the file is being actively written, then the clone will not be
consistent.

> Or should the db/vm/whatever be quiesced before taking the copy
> (i.e. similarly to how LVM calls fsfreeze during a snapshot)?

Yes, it's just like any other snapshot process - you have to quiesce
everything that is writing to the file before cloning it. i.e. the
data in the file needs to be in a stable, consistent, unchanging
state if you want the clone to contain consistent data...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Reflink (cow) copy of busy files
From: Gionatan Danti @ 2018-02-24 22:57 UTC
To: Dave Chinner; +Cc: linux-xfs, g.danti

On 24-02-2018 23:07, Dave Chinner wrote:
> Define "busy file", please.

Think about a running virtual machine. Maybe an XFS-based virtual
image (e.g. a CentOS 7 guest).

> If the file is being actively written, then the clone will not be
> consistent.
>
> Yes, it's just like any other snapshot process - you have to quiesce
> everything that is writing to the file before cloning it. i.e. the
> data in the file needs to be in a stable, consistent, unchanging
> state if you want the clone to contain consistent data...

What *level* of consistency are we talking about? I understand that
application-level consistency requires a quiesced filesystem and,
possibly, an application-level agent. But is a quiesced filesystem a
requirement for a *crash-consistent* (i.e. pull the plug) snapshot?

In other words: would a cp --reflink=always <vmdisk> <snapshot> of a
running virtual machine produce a usable, crash-consistent snapshot,
or does it risk ending up with binary garbage?

Thanks.
* Re: Reflink (cow) copy of busy files
From: Dave Chinner @ 2018-02-25 2:47 UTC
To: Gionatan Danti; +Cc: linux-xfs

On Sat, Feb 24, 2018 at 11:57:32PM +0100, Gionatan Danti wrote:
> On 24-02-2018 23:07, Dave Chinner wrote:
> > Define "busy file", please.
>
> Think about a running virtual machine. Maybe an XFS-based virtual
> image (e.g. a CentOS 7 guest).
>
> > If the file is being actively written, then the clone will not be
> > consistent.
> >
> > Yes, it's just like any other snapshot process - you have to quiesce
> > everything that is writing to the file before cloning it. i.e. the
> > data in the file needs to be in a stable, consistent, unchanging
> > state if you want the clone to contain consistent data...
>
> What *level* of consistency are we talking about? I understand that
> application-level consistency requires a quiesced filesystem and,
> possibly, an application-level agent. But is a quiesced filesystem a
> requirement for a *crash-consistent* (i.e. pull the plug) snapshot?

Yes, you have to freeze the filesystem to get a crash-consistent
snapshot of the filesystem.

> In other words: would a cp --reflink=always <vmdisk> <snapshot> of a
> running virtual machine produce a usable, crash-consistent snapshot,
> or does it risk ending up with binary garbage?

You will end up with garbage.

Cheers,

Dave.
* Re: Reflink (cow) copy of busy files
From: Gionatan Danti @ 2018-02-25 11:40 UTC
To: Dave Chinner; +Cc: linux-xfs, g.danti

On 25-02-2018 03:47, Dave Chinner wrote:
>
> Yes, you have to freeze the filesystem to get a crash-consistent
> snapshot of the filesystem.
>
> You will end up with garbage.

Ok. Bonus question: am I right in thinking this is due to the CoW copy
not being atomic (i.e. the various extents being in different states
until the copy is finished)?

Thanks.
* Re: Reflink (cow) copy of busy files
From: Dave Chinner @ 2018-02-25 21:13 UTC
To: Gionatan Danti; +Cc: linux-xfs

On Sun, Feb 25, 2018 at 12:40:47PM +0100, Gionatan Danti wrote:
> On 25-02-2018 03:47, Dave Chinner wrote:
> >
> > Yes, you have to freeze the filesystem to get a crash-consistent
> > snapshot of the filesystem.
> >
> > You will end up with garbage.
>
> Ok. Bonus question: am I right in thinking this is due to the CoW copy
> not being atomic (i.e. the various extents being in different states
> until the copy is finished)?

This isn't a copy on write issue. This is an issue of the state of
the file and the I/O stack above it at the time the data extents are
shared. There is I/O in flight, and so there's no guarantee that what
is in the extents being shared is consistent. Freezing the
filesystem stops I/O in flight, so the extents can be shared while
the filesystem knows it has consistent state on stable storage.

Cheers,

Dave.
* Re: Reflink (cow) copy of busy files
From: Gionatan Danti @ 2018-02-25 21:58 UTC
To: Dave Chinner; +Cc: linux-xfs, g.danti

On 25-02-2018 22:13, Dave Chinner wrote:
> This isn't a copy on write issue. This is an issue of the state of
> the file and the I/O stack above it at the time the data extents are
> shared. There is I/O in flight, and so there's no guarantee that what
> is in the extents being shared is consistent. Freezing the
> filesystem stops I/O in flight, so the extents can be shared while
> the filesystem knows it has consistent state on stable storage.

Uhm, that sounds like the very definition (and caveats) of a
"crash-consistent" snapshot...

Suppose an XFS filesystem used for hosting VM disk images, with
running VMs. I naively execute a cp --reflink=always copy, stop the
original VM and start the copied one.

For an atomic snapshot, I would expect data loss comparable to a
"power pull" case:
- async writes are lost. After all, they were in the pagecache and
  never hit the backing file;
- unacknowledged sync writes are lost. Again, they never
  successfully hit the disk;
- acknowledged sync writes (i.e. the ones which returned) are properly
  written to the backing file.

If the above is correct, when starting the new (copied) VM, the
guest filesystem will behave as if power was lost: its journal will be
replayed and brought to a consistent state. Applications can/will be
affected depending on what they were doing at the time of the reflinked
copy, but important ones (i.e. the ones correctly using fsync), such as
databases, will gracefully recover by replaying their logs.

This should be similar to how an LVM snapshot works when no filesystem
is (directly) layered on top of the volume (i.e. the volume is assigned
to a VM).

Still, you warned me that a CoW copy of a running VM will produce
garbage, so I am surely misunderstanding something. I would greatly
appreciate any clarification.

Thanks.
* Re: Reflink (cow) copy of busy files
From: Dave Chinner @ 2018-02-26 0:25 UTC
To: Gionatan Danti; +Cc: linux-xfs

On Sun, Feb 25, 2018 at 10:58:16PM +0100, Gionatan Danti wrote:
> On 25-02-2018 22:13, Dave Chinner wrote:
> > This isn't a copy on write issue. This is an issue of the state of
> > the file and the I/O stack above it at the time the data extents are
> > shared. There is I/O in flight, and so there's no guarantee that what
> > is in the extents being shared is consistent. Freezing the
> > filesystem stops I/O in flight, so the extents can be shared while
> > the filesystem knows it has consistent state on stable storage.
>
> Uhm, that sounds like the very definition (and caveats) of a
> "crash-consistent" snapshot...
>
> Suppose an XFS filesystem used for hosting VM disk images, with
> running VMs. I naively execute a cp --reflink=always copy, stop the
> original VM and start the copied one.
>
> For an atomic snapshot, I would expect data loss comparable to a
> "power pull" case:
> - async writes are lost. After all, they were in the pagecache and
>   never hit the backing file;
> - unacknowledged sync writes are lost. Again, they never
>   successfully hit the disk;
> - acknowledged sync writes (i.e. the ones which returned) are properly
>   written to the backing file.

Acknowledged sync writes are not guaranteed to be stable. They may
still be sitting in volatile caches below the backing file, and so
until there is a cache flush pushed down through all layers of the
storage stack (e.g. fsync on the backing file) those acknowledged
sync writes are not stable. That's one of the things quiescing the
filesystem guarantees, but running reflink to clone the file does
not.

IOWs, "properly written" is easy to say but very hard to guarantee.
We cannot make such assumptions about random user configs, nor can we
base recommendations on such assumptions. If you choose not to
quiesce the filesystems before snapshotting them, then it's your
responsibility to guarantee your storage stack will work correctly.

> If the above is correct, when starting the new (copied) VM, the
> guest filesystem will behave as if power was lost: its journal will be
> replayed and brought to a consistent state. Applications can/will be
> affected depending on what they were doing at the time of the reflinked
> copy, but important ones (i.e. the ones correctly using fsync), such as
> databases, will gracefully recover by replaying their logs.
>
> This should be similar to how an LVM snapshot works when no filesystem
> is (directly) layered on top of the volume (i.e. the volume is assigned
> to a VM).

You still have to quiesce the filesystem when it's on top of an LVM
snapshot volume.

Cheers,

Dave.
* Re: Reflink (cow) copy of busy files
From: Gionatan Danti @ 2018-02-26 7:19 UTC
To: Dave Chinner; +Cc: linux-xfs, g.danti

Full disclaimer: maybe my point of view is influenced by thinking in
the context of Qemu/KVM + software RAID (where much work was done to
ensure proper barrier passing) or BBU/NV hardware RAID.

On 26-02-2018 01:25, Dave Chinner wrote:
> Acknowledged sync writes are not guaranteed to be stable. They may
> still be sitting in volatile caches below the backing file, and so
> until there is a cache flush pushed down through all layers of the
> storage stack (e.g. fsync on the backing file) those acknowledged
> sync writes are not stable. That's one of the things quiescing the
> filesystem guarantees, but running reflink to clone the file does
> not.

Sure, but fsync/write barriers that are not passed down will thwart
even "normal" (i.e. not CoW/snapshotted/reflinked) sync writes, and
will inevitably cause problems (i.e. a power loss becomes a big
problem). How is it different for a reflinked copy?

> IOWs, "properly written" is easy to say but very hard to guarantee.
> We cannot make such assumptions about random user configs, nor can we
> base recommendations on such assumptions. If you choose not to
> quiesce the filesystems before snapshotting them, then it's your
> responsibility to guarantee your storage stack will work correctly.

Absolutely, and I *really* appreciate your advice.

> You still have to quiesce the filesystem when it's on top of an LVM
> snapshot volume.

When the LVM volume is passed to a guest VM, the host cannot quiesce
the filesystem. Host/guest communication can be achieved by means of a
guest agent and a private control channel, but this has its own
problems. I have thoroughly tested live, LVM-backed snapshotted VMs and
every time I run them, the guest filesystem replays its log without
problems. I always double-check that the entire I/O stack (from the
guest down to the physical disks) honors write barriers, though.

Back to the original question: if a reflinked copy is an *atomic*
operation on all the data extents comprising a file, and in the
context of properly passed barriers/fsync, I would think that an
unquiesced snapshot will work for the (reduced) consistency model of a
crash-consistent snapshot.

If the reflink copy is not atomic (i.e. the different extents are CoWed
at different times, making it only a "faster copy" rather than a
snapshot), this will *not* work and I will end up with binary garbage
(i.e. writes can be reordered from the snapshot's point of view).

I think it all reduces to a single question: putting aside quiescing
problems, is a reflinked copy a true *atomic* snapshot, or is it
"only" a faster copy?

Thanks.
* Re: Reflink (cow) copy of busy files
From: Amir Goldstein @ 2018-02-26 7:58 UTC
To: Gionatan Danti; +Cc: Dave Chinner, linux-xfs

On Mon, Feb 26, 2018 at 9:19 AM, Gionatan Danti <g.danti@assyoma.it> wrote:
> Full disclaimer: maybe my point of view is influenced by thinking in
> the context of Qemu/KVM + software RAID (where much work was done to
> ensure proper barrier passing) or BBU/NV hardware RAID.
>
> On 26-02-2018 01:25, Dave Chinner wrote:
>> Acknowledged sync writes are not guaranteed to be stable. They may
>> still be sitting in volatile caches below the backing file, and so
>> until there is a cache flush pushed down through all layers of the
>> storage stack (e.g. fsync on the backing file) those acknowledged
>> sync writes are not stable. That's one of the things quiescing the
>> filesystem guarantees, but running reflink to clone the file does
>> not.
>
> Sure, but fsync/write barriers that are not passed down will thwart
> even "normal" (i.e. not CoW/snapshotted/reflinked) sync writes, and
> will inevitably cause problems (i.e. a power loss becomes a big
> problem). How is it different for a reflinked copy?
>
>> IOWs, "properly written" is easy to say but very hard to guarantee.
>> We cannot make such assumptions about random user configs, nor can we
>> base recommendations on such assumptions. If you choose not to
>> quiesce the filesystems before snapshotting them, then it's your
>> responsibility to guarantee your storage stack will work correctly.
>
> Absolutely, and I *really* appreciate your advice.
>
>> You still have to quiesce the filesystem when it's on top of an LVM
>> snapshot volume.
>
> When the LVM volume is passed to a guest VM, the host cannot quiesce
> the filesystem. Host/guest communication can be achieved by means of a
> guest agent and a private control channel, but this has its own
> problems. I have thoroughly tested live, LVM-backed snapshotted VMs and
> every time I run them, the guest filesystem replays its log without
> problems. I always double-check that the entire I/O stack (from the
> guest down to the physical disks) honors write barriers, though.
>
> Back to the original question: if a reflinked copy is an *atomic*
> operation on all the data extents comprising a file, and in the
> context of properly passed barriers/fsync, I would think that an
> unquiesced snapshot will work for the (reduced) consistency model of a
> crash-consistent snapshot.
>
> If the reflink copy is not atomic (i.e. the different extents are CoWed
> at different times, making it only a "faster copy" rather than a
> snapshot), this will *not* work and I will end up with binary garbage
> (i.e. writes can be reordered from the snapshot's point of view).
>
> I think it all reduces to a single question: putting aside quiescing
> problems, is a reflinked copy a true *atomic* snapshot, or is it
> "only" a faster copy?
>

Gionatan,

First of all, the answer to your question is "just" a faster copy.
Reflinking a file is much faster than copying it, but it is not O(1).
I believe cp --reflink can result in cloning only part of the file if
the system crashes mid-operation, so in any case the operation is not
*atomic* in that sense.

But your questions about quiescing the filesystem and your question
about the *atomic* nature of the clone operation are two very
different questions.

What you seem to *think* xfs reflink does, it does not actually do.
xfs reflink does NOT reflink the file's in-memory data.
xfs reflink "only" reflinks the file's on-disk data.
Right now, if you write a large file without fsync and clone it, you
might well get a clone of an unallocated or partly fallocated file
with zero or stale data.

Going forward, I think there is an intention to "clone" the file's
in-memory data as well by sharing the READONLY cache pages between
cloned files, but I don't think dirty pages are going to be shared
between clones anyway, so you are back to square one - you need to get
the data on disk before cloning the file.

Cheers,
Amir.
* Re: Reflink (cow) copy of busy files
From: Gionatan Danti @ 2018-02-26 8:26 UTC
To: Amir Goldstein; +Cc: Dave Chinner, linux-xfs, g.danti

Hi Amir,

On 26-02-2018 08:58, Amir Goldstein wrote:
>
> Gionatan,
>
> First of all, the answer to your question is "just" a faster copy.
> Reflinking a file is much faster than copying it, but it is not O(1).
> I believe cp --reflink can result in cloning only part of the file if
> the system crashes mid-operation, so in any case the operation is not
> *atomic* in that sense.
>
> But your questions about quiescing the filesystem and your question
> about the *atomic* nature of the clone operation are two very
> different questions.

Can this result in out-of-order writes from the cloned file's point of
view? I mean:
- take a 10-extent file;
- a vm/db/whatever is writing to the file;
- a cp --reflink is executed;
- extents are cloned one by one, with extents 1-4 already cloned and 5
  in progress;
- the vm/db writes to extent 1 - this write will *not* be present in
  the cloned file;
- the application writes to extent 6, which will be cloned shortly;
- the cloned file ends up with the later write to extent 6 but not the
  earlier one to extent 1;
- bad things happen!

If the above is true, then cp --reflink can't be used even for
relaxed-consistency backups/clones.

> What you seem to *think* xfs reflink does, it does not actually do.
> xfs reflink does NOT reflink the file's in-memory data.
> xfs reflink "only" reflinks the file's on-disk data.
> Right now, if you write a large file without fsync and clone it, you
> might well get a clone of an unallocated or partly fallocated file
> with zero or stale data.

Oh, I absolutely do not expect reflink/clone to work on in-memory
data. I *surely* expect dirty, uncommitted data to be lost: this is the
very reason I wrote about crash-consistent backups.

In short: is cloning/reflink the same as "pulling the plug" for the
cloned file? I mean:
- a successful clone (so, not an interrupted/crashed one) is akin to an
  atomic process for the cloned file;
- async writes/dirty data are lost;
- fsynced writes are preserved;
- writes are not reordered/committed out of order.

Maybe the entire discussion is skewed by the fact that, in some cases,
I am willing to relax my consistency model to include a
crash-consistent backup option. The fact is, in the virtualization
world there are many backup utilities/applications which *use* this
model, and I wondered if a cp --reflink would give similar results
without the hassle.

Maybe the entire crash-vs-application consistency debate is out of
place on a filesystem mailing list, where you (rightfully!!!) strive
for perfect/maximum data consistency (and I *really* appreciate that).
However, given the recent reflink work on XFS, I wonder if I can put
this to "good use" when it is considered stable.

> Going forward, I think there is an intention to "clone" the file's
> in-memory data as well by sharing the READONLY cache pages between
> cloned files, but I don't think dirty pages are going to be shared
> between clones anyway, so you are back to square one - you need to get
> the data on disk before cloning the file.

Great - I think this would do wonders for cache efficiency...

>
> Cheers,
> Amir.

Thanks.

PS: sorry if I rephrase the question in different terms. English is
not my primary language, please bear with me :p
* Re: Reflink (cow) copy of busy files
From: Darrick J. Wong @ 2018-02-26 17:26 UTC
To: Gionatan Danti; +Cc: Amir Goldstein, Dave Chinner, linux-xfs

On Mon, Feb 26, 2018 at 09:26:14AM +0100, Gionatan Danti wrote:
> Hi Amir,
>
> On 26-02-2018 08:58, Amir Goldstein wrote:
> >
> > Gionatan,
> >
> > First of all, the answer to your question is "just" a faster copy.
> > Reflinking a file is much faster than copying it, but it is not O(1).
> > I believe cp --reflink can result in cloning only part of the file if
> > the system crashes mid-operation, so in any case the operation is not
> > *atomic* in that sense.
> >
> > But your questions about quiescing the filesystem and your question
> > about the *atomic* nature of the clone operation are two very
> > different questions.
>
> Can this result in out-of-order writes from the cloned file's point of
> view? I mean:
> - take a 10-extent file;
> - a vm/db/whatever is writing to the file;
> - a cp --reflink is executed;
> - extents are cloned one by one, with extents 1-4 already cloned and 5
>   in progress;
> - the vm/db writes to extent 1 - this write will *not* be present in
>   the cloned file;
> - the application writes to extent 6, which will be cloned shortly;
> - the cloned file ends up with the later write to extent 6 but not the
>   earlier one to extent 1;
> - bad things happen!
>
> If the above is true, then cp --reflink can't be used even for
> relaxed-consistency backups/clones.
>
> > What you seem to *think* xfs reflink does, it does not actually do.
> > xfs reflink does NOT reflink the file's in-memory data.
> > xfs reflink "only" reflinks the file's on-disk data.
> > Right now, if you write a large file without fsync and clone it, you
> > might well get a clone of an unallocated or partly fallocated file
> > with zero or stale data.
>
> Oh, I absolutely do not expect reflink/clone to work on in-memory
> data. I *surely* expect dirty, uncommitted data to be lost: this is the
> very reason I wrote about crash-consistent backups.
>
> In short: is cloning/reflink the same as "pulling the plug" for the
> cloned file? I mean:
> - a successful clone (so, not an interrupted/crashed one) is akin to an
>   atomic process for the cloned file;
> - async writes/dirty data are lost;
> - fsynced writes are preserved;
> - writes are not reordered/committed out of order.
>
> Maybe the entire discussion is skewed by the fact that, in some cases,
> I am willing to relax my consistency model to include a
> crash-consistent backup option. The fact is, in the virtualization
> world there are many backup utilities/applications which *use* this
> model, and I wondered if a cp --reflink would give similar results
> without the hassle.
>
> Maybe the entire crash-vs-application consistency debate is out of
> place on a filesystem mailing list, where you (rightfully!!!) strive
> for perfect/maximum data consistency (and I *really* appreciate that).
> However, given the recent reflink work on XFS, I wonder if I can put
> this to "good use" when it is considered stable.

The way reflink is supposed to work wrt consistency is:

1. lock out all new io/fallocate activity on both inodes (iolock/mmaplock)
2. wait for all directio to complete
3. fsync both files (write all the dirty pagecache to disk)
4. lock both inodes (ilock)
5. clone each extent atomically
6. unlock ilock
7. unlock iolock/mmaplock

So at least in theory the cloned file will match whatever the host saw
on disk and page cache at the time the reflink call was initiated.
I say 'in theory' because there could be bugs.

Whatever dirty state is in the guest VM stays in that VM, which means
that if you only cp --reflink on the host, the clone you get will
reflect the virtual disk state as if you'd kill -9'd the VM, cloned the
VM disk, and restarted the VM. Upon restart the log recovers whatever
metadata made it out of the VM.

However, if you tell the guest to freeze the fs before cloning (as Dave
suggested earlier) the guest will flush all its state to the upper level
(the host) and the host will push all that out to disk before cloning.
The snapshot you create should be cleaner because you're effectively
prepaying the recovery costs by flushing everything before taking the
snapshot.

Also note that if the host goes down before returning from the syscall,
the log will continue on with whichever extent was being cloned at the
time in order to preserve metadata integrity, but the destination file
will reflect a partial copy.

--D

> > Going forward, I think there is an intention to "clone" the file's
> > in-memory data as well by sharing the READONLY cache pages between
> > cloned files, but I don't think dirty pages are going to be shared
> > between clones anyway, so you are back to square one - you need to get
> > the data on disk before cloning the file.
>
> Great - I think this would do wonders for cache efficiency...
>
> >
> > Cheers,
> > Amir.
>
> Thanks.
>
> PS: sorry if I rephrase the question in different terms. English is
> not my primary language, please bear with me :p
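
[Editorial sketch] From user space, the locking, flushing and extent
sharing that Darrick lists in the message above all sit behind a single
clone request. The following is a minimal Python sketch of issuing that
request, assuming Linux and the FICLONE ioctl from <linux/fs.h>. The
helper name clone_or_copy and the fallback to a plain byte copy are
assumptions made for illustration; nobody in the thread proposed them.

```python
import fcntl
import shutil

# FICLONE from <linux/fs.h>, i.e. _IOW(0x94, 9, int).
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path):
    """Try to share src's extents with dst; fall back to a byte copy.

    Hypothetical helper. Returns "reflink" if the filesystem shared the
    extents, or "copy" if a regular copy had to be made instead.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # One call: the filesystem handles locking, flushing of the
            # host page cache and extent sharing internally.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError:
            # Clone not possible here (no reflink support, files on
            # different filesystems, non-Linux kernel, ...).
            pass
    shutil.copyfile(src_path, dst_path)
    return "copy"
```

Note that this only covers what the host has seen. Data still dirty
inside a running guest is not part of the clone, which is the point
Dave and Darrick make about quiescing first.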
* Re: Reflink (cow) copy of busy files
From: Gionatan Danti @ 2018-02-26 21:23 UTC
To: Darrick J. Wong; +Cc: Amir Goldstein, Dave Chinner, linux-xfs, g.danti

On 26-02-2018 18:26, Darrick J. Wong wrote:
> The way reflink is supposed to work wrt consistency is:
>
> 1. lock out all new io/fallocate activity on both inodes (iolock/mmaplock)
> 2. wait for all directio to complete
> 3. fsync both files (write all the dirty pagecache to disk)
> 4. lock both inodes (ilock)
> 5. clone each extent atomically
> 6. unlock ilock
> 7. unlock iolock/mmaplock
>
> So at least in theory the cloned file will match whatever the host saw
> on disk and page cache at the time the reflink call was initiated.
> I say 'in theory' because there could be bugs.

Great! CoW will be a great addition to XFS once it is considered
stable.

> Whatever dirty state is in the guest VM stays in that VM, which means
> that if you only cp --reflink on the host, the clone you get will
> reflect the virtual disk state as if you'd kill -9'd the VM, cloned the
> VM disk, and restarted the VM. Upon restart the log recovers whatever
> metadata made it out of the VM.

Sure, that is what I mean by "crash-consistent".

> However, if you tell the guest to freeze the fs before cloning (as Dave
> suggested earlier) the guest will flush all its state to the upper level
> (the host) and the host will push all that out to disk before cloning.
> The snapshot you create should be cleaner because you're effectively
> prepaying the recovery costs by flushing everything before taking the
> snapshot.

True, and this is "application-level consistency" (which requires a
guest agent and possibly even an application-specific agent).

> Also note that if the host goes down before returning from the syscall,
> the log will continue on with whichever extent was being cloned at the
> time in order to preserve metadata integrity, but the destination file
> will reflect a partial copy.

Thanks for pointing that out, and for your extremely clear explanation!
* Re: Reflink (cow) copy of busy files
From: Darrick J. Wong @ 2018-02-26 21:31 UTC
To: Gionatan Danti; +Cc: Amir Goldstein, Dave Chinner, linux-xfs

On Mon, Feb 26, 2018 at 10:23:45PM +0100, Gionatan Danti wrote:
> On 26-02-2018 18:26 Darrick J. Wong wrote:
> > The way reflink is supposed to work wrt consistency is:
> >
> > 1. lock out all new io/fallocate activity on both inodes (iolock/mmaplock)
> > 2. wait for all directio to complete
> > 3. fsync both files (write all the dirty pagecache to disk)
> > 4. lock both inodes (ilock)
> > 5. clone each extent atomically
> > 6. unlock ilock
> > 7. unlock iolock/mmaplock
> >
> > So at least in theory the cloned file will match whatever the host saw
> > on disk and page cache at the time the reflink call was initiated.
> > I say 'in theory' because there could be bugs.
>
> Great! CoW will be a great addition for XFS when it is considered
> stable.
>
> > Whatever dirty state is in the guest VM stays in that VM, which means
> > that if you only cp --reflink on the host, the clone you get will
> > reflect the virtual disk state as if you'd kill -9'd the VM, cloned the
> > VM disk, and restarted the VM. Upon restart the log recovers whatever
> > metadata made it out of the VM.
>
> Sure, that is what I mean by "crash-consistent".
>
> > However, if you tell the guest to freeze the fs before cloning (as Dave
> > suggested earlier) the guest will flush all its state to the upper level
> > (the host) and the host will push all that out to disk before cloning.
> > The snapshot you create should be cleaner because you're effectively
> > prepaying the recovery costs by flushing everything before taking the
> > snapshot.
>
> True, and this is "application-level consistency" (which requires a guest
> agent and possibly even an application-specific agent)

I believe qemu-ga takes care of guest fs freeze inside the guest, and
you can invoke it from the host via 'virsh domfsfreeze' or the --quiesce
argument to snapshot-create... but you ought to confirm that for yourself.

--D

> > Also note that if the host goes down before returning from the syscall,
> > the log will continue on with whichever extent was being cloned at the
> > time in order to preserve metadata integrity, but the destination file
> > will reflect a partial copy.
>
> Thanks for pointing that out, and for your extremely clear explanation!
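The freeze, clone, thaw sequence mentioned above can be sketched as a small host-side helper. This is only an illustration under stated assumptions: the function name quiesced_clone, the run parameter and the domain/path arguments are hypothetical, and it assumes a libvirt guest with qemu-ga installed so that 'virsh domfsfreeze' and 'virsh domfsthaw' work.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def _run(cmd: Sequence[str]) -> None:
    # Default runner: execute the command and raise if it fails.
    subprocess.run(list(cmd), check=True)


def quiesced_clone(domain: str, src: str, dst: str, run: Runner = _run) -> None:
    """Freeze the guest filesystems, reflink-copy the image, then thaw.

    Hypothetical helper: assumes a libvirt domain with a working guest agent.
    """
    run(["virsh", "domfsfreeze", domain])
    try:
        # --reflink=always makes cp fail instead of silently doing a full copy.
        run(["cp", "--reflink=always", src, dst])
    finally:
        # Always thaw, even if the clone failed, so the guest is not left frozen.
        run(["virsh", "domfsthaw", domain])
```

The runner is injectable so the ordering can be checked without a real guest. The thaw sits in a finally block because a guest left frozen is worse than a failed backup.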
* Re: Reflink (cow) copy of busy files
From: Gionatan Danti @ 2018-02-26 21:39 UTC
To: Darrick J. Wong; +Cc: Amir Goldstein, Dave Chinner, linux-xfs, g.danti

On 26-02-2018 22:31 Darrick J. Wong wrote:
> I believe qemu-ga takes care of guest fs freeze inside the guest,
> and you can invoke it from the host via 'virsh domfsfreeze' or the
> --quiesce argument to snapshot-create... but you ought to confirm that
> for yourself.

Sure, and I can confirm it works properly in Linux VMs (Windows VMs are another matter, if things have not changed). However, it also poses a security risk: a malicious agent can block your *entire* backup process (even for other virtual machines). So, in some environments, an agentless backup solution is preferred.

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: Reflink (cow) copy of busy files
From: Dave Chinner @ 2018-02-27 0:33 UTC
To: Darrick J. Wong; +Cc: Gionatan Danti, Amir Goldstein, linux-xfs

On Mon, Feb 26, 2018 at 09:26:01AM -0800, Darrick J. Wong wrote:
> On Mon, Feb 26, 2018 at 09:26:14AM +0100, Gionatan Danti wrote:
> > Hi Amir,
> >
> > On 26-02-2018 08:58 Amir Goldstein wrote:
> > >
> > > Gionatan,
> > >
> > > First of all, the answer to your question is "just" faster copy.
> > > reflinking a file is much faster than copy, but it is not O(1).
> > > I believe cp --reflink can result in cloning part of the file if the
> > > system crashes mid operation, so in any case, the operation is not
> > > *atomic* in that sense.
> > >
> > > But your questions about quiescing the filesystem and your question
> > > about the *atomic* nature of the clone operation are two very different
> > > questions.
> >
> > can this result in out-of-order writes from the cloned file's point of view?
> > I mean:
> > - take a 10-extent file;
> > - a vm/db/whatever is writing to the file;
> > - a cp --reflink is executed;
> > - extents are cloned one-by-one, with extents 1-4 already cloned, 5 is in
> >   progress;
> > - the vm/db writes to extent n.1 - this write will *not* be present on the
> >   cloned file;
> > - application writes to extent n.6 which will be cloned shortly;
> > - the cloned file ends with the later write to extent n.6 but not the
> >   previous on extent n.1;
> > - bad things happen!
> >
> > If the above is true, then cp --reflink can't be used even for
> > relaxed-consistency backup/clones.
> >
> > > What you seem to *think* xfs reflink does, it does not actually do.
> > > xfs reflink does NOT reflink the file in-memory data.
> > > xfs reflink "only" reflinks the file on-disk data.
> > > Right now, if you write a large file without fsync and clone it, you
> > > might as well get a clone of unallocated or partly fallocated file with
> > > zero or stale data.
> >
> > Oh, I absolutely do not expect reflink/clone to work on in-memory data.
> > I *surely* expect dirty, not committed data to be lost: this is the very
> > reason I wrote about crash-consistent backup.
> >
> > In short: is cloning/reflink the same as "pulling the plug" for the cloned
> > file? I mean:
> > - a successful clone (so, a non-interrupted/crashed one) is akin to an
> >   atomic process for the cloned file;
> > - async writes/dirty data are lost;
> > - fsynced writes are preserved;
> > - writes are not reordered/committed out of order.
> >
> > Maybe the entire discussion is skewed by the fact that, in some cases, I am
> > willing to relax my consistency model to include a crash-consistent backup
> > option. Fact is, in the virtualization world there are many backup
> > utilities/applications which *use* this model, and I wondered if a cp
> > --reflink would give similar results without the hassle.
> >
> > Maybe the entire crash-vs-application consistency is out of place in a
> > filesystem mailing list, where you (rightfully!!!) strive for
> > perfect/maximum data consistency (and I *really* appreciate that). However,
> > given the recent reflinking work on XFS, I wonder if I can put this to
> > "good use" when it is considered stable.
>
> The way reflink is supposed to work wrt consistency is:
>
> 1. lock out all new io/fallocate activity on both inodes (iolock/mmaplock)
> 2. wait for all directio to complete
> 3. fsync both files (write all the dirty pagecache to disk)

My point is that vfs_clone_file_range is not running fsync(2)
operations.

It's a fdatawrite_and_wait() call, which submits dirty data to disk
and waits for it, but does *not flush volatile storage caches*.
IOWs, it's not a data integrity operation.

Hence while the reflink now has "data on disk" and can clone the
extents, neither the data nor the extents being cloned are stable
and won't be until an fsync operation is performed on either the
reflink source or destination file....

> 4. lock both inodes (ilock)
> 5. clone each extent atomically
> 6. unlock ilock
> 7. unlock iolock/mmaplock
>
> So at least in theory the cloned file will match whatever the host saw
> on disk and page cache at the time the reflink call was initiated.
> I say 'in theory' because there could be bugs.

Still no cache flushes. Hence even after the clone has run,
you can still lose the data (and extents!) from the host file....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Reflink (cow) copy of busy files
From: Darrick J. Wong @ 2018-02-27 0:58 UTC
To: Dave Chinner; +Cc: Gionatan Danti, Amir Goldstein, linux-xfs

On Tue, Feb 27, 2018 at 11:33:48AM +1100, Dave Chinner wrote:
> On Mon, Feb 26, 2018 at 09:26:01AM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 26, 2018 at 09:26:14AM +0100, Gionatan Danti wrote:
> > > Hi Amir,
> > >
> > > On 26-02-2018 08:58 Amir Goldstein wrote:
> > > >
> > > > Gionatan,
> > > >
> > > > First of all, the answer to your question is "just" faster copy.
> > > > reflinking a file is much faster than copy, but it is not O(1).
> > > > I believe cp --reflink can result in cloning part of the file if the
> > > > system crashes mid operation, so in any case, the operation is not
> > > > *atomic* in that sense.
> > > >
> > > > But your questions about quiescing the filesystem and your question
> > > > about the *atomic* nature of the clone operation are two very different
> > > > questions.
> > >
> > > can this result in out-of-order writes from the cloned file's point of view?
> > > I mean:
> > > - take a 10-extent file;
> > > - a vm/db/whatever is writing to the file;
> > > - a cp --reflink is executed;
> > > - extents are cloned one-by-one, with extents 1-4 already cloned, 5 is in
> > >   progress;
> > > - the vm/db writes to extent n.1 - this write will *not* be present on the
> > >   cloned file;
> > > - application writes to extent n.6 which will be cloned shortly;
> > > - the cloned file ends with the later write to extent n.6 but not the
> > >   previous on extent n.1;
> > > - bad things happen!
> > >
> > > If the above is true, then cp --reflink can't be used even for
> > > relaxed-consistency backup/clones.
> > >
> > > > What you seem to *think* xfs reflink does, it does not actually do.
> > > > xfs reflink does NOT reflink the file in-memory data.
> > > > xfs reflink "only" reflinks the file on-disk data.
> > > > Right now, if you write a large file without fsync and clone it, you
> > > > might as well get a clone of unallocated or partly fallocated file with
> > > > zero or stale data.
> > >
> > > Oh, I absolutely do not expect reflink/clone to work on in-memory data.
> > > I *surely* expect dirty, not committed data to be lost: this is the very
> > > reason I wrote about crash-consistent backup.
> > >
> > > In short: is cloning/reflink the same as "pulling the plug" for the cloned
> > > file? I mean:
> > > - a successful clone (so, a non-interrupted/crashed one) is akin to an
> > >   atomic process for the cloned file;
> > > - async writes/dirty data are lost;
> > > - fsynced writes are preserved;
> > > - writes are not reordered/committed out of order.
> > >
> > > Maybe the entire discussion is skewed by the fact that, in some cases, I am
> > > willing to relax my consistency model to include a crash-consistent backup
> > > option. Fact is, in the virtualization world there are many backup
> > > utilities/applications which *use* this model, and I wondered if a cp
> > > --reflink would give similar results without the hassle.
> > >
> > > Maybe the entire crash-vs-application consistency is out of place in a
> > > filesystem mailing list, where you (rightfully!!!) strive for
> > > perfect/maximum data consistency (and I *really* appreciate that). However,
> > > given the recent reflinking work on XFS, I wonder if I can put this to
> > > "good use" when it is considered stable.
> >
> > The way reflink is supposed to work wrt consistency is:
> >
> > 1. lock out all new io/fallocate activity on both inodes (iolock/mmaplock)
> > 2. wait for all directio to complete
> > 3. fsync both files (write all the dirty pagecache to disk)
>
> My point is that vfs_clone_file_range is not running fsync(2)
> operations.
>
> It's a fdatawrite_and_wait() call, which submits dirty data to disk
> and waits for it, but does *not flush volatile storage caches*.
> IOWs, it's not a data integrity operation.
>
> Hence while the reflink now has "data on disk" and can clone the
> extents, neither the data nor the extents being cloned are stable
> and won't be until an fsync operation is performed on either the
> reflink source or destination file....
>
> > 4. lock both inodes (ilock)
> > 5. clone each extent atomically
> > 6. unlock ilock
> > 7. unlock iolock/mmaplock
> >
> > So at least in theory the cloned file will match whatever the host saw
> > on disk and page cache at the time the reflink call was initiated.
> > I say 'in theory' because there could be bugs.
>
> Still no cache flushes. Hence even after the clone has run,
> you can still lose the data (and extents!) from the host file....

TBH I was assuming that the host doesn't go down in these scenarios, so
we were only concerned about getting the guest to flush everything it
had. But Dave is right, if you need the host to maintain data integrity
too, then you need to fsync both the src and dest fds too.

--D
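A minimal sketch of "clone, then fsync both the src and dest fds", assuming Linux and a reflink-capable filesystem. The names durable_clone and ficlone are hypothetical; FICLONE is the ioctl request number from <linux/fs.h>. The clone step is passed in as a parameter only so the fsync ordering can be exercised on filesystems without reflink support.

```python
import fcntl
import os
from typing import Callable

FICLONE = 0x40049409  # _IOW(0x94, 9, int), as defined in <linux/fs.h>


def ficlone(src_fd: int, dst_fd: int) -> None:
    # Ask the kernel to share all of src's extents with dst (whole-file clone).
    fcntl.ioctl(dst_fd, FICLONE, src_fd)


def durable_clone(src: str, dst: str,
                  clone: Callable[[int, int], None] = ficlone) -> None:
    """Clone src into dst, then fsync both so the result survives a host crash."""
    src_fd = os.open(src, os.O_RDONLY)
    try:
        dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            clone(src_fd, dst_fd)
            # The clone itself is not a data integrity operation, so flush
            # both files (data, metadata, device cache) explicitly afterwards.
            os.fsync(dst_fd)
            os.fsync(src_fd)
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

With the default clone step the ioctl raises OSError on filesystems that cannot share extents, so a caller would see the failure rather than get a silent full copy.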
* Re: Reflink (cow) copy of busy files
From: Gionatan Danti @ 2018-02-27 8:06 UTC
To: Dave Chinner, Darrick J. Wong; +Cc: Amir Goldstein, linux-xfs, Gionatan Danti

On 27/02/2018 01:33, Dave Chinner wrote:
> On Mon, Feb 26, 2018 at 09:26:01AM -0800, Darrick J. Wong wrote:
>
> My point is that vfs_clone_file_range is not running fsync(2)
> operations.
>
> It's a fdatawrite_and_wait() call, which submits dirty data to disk
> and waits for it, but does *not flush volatile storage caches*.
> IOWs, it's not a data integrity operation.
>
> Hence while the reflink now has "data on disk" and can clone the
> extents, neither the data nor the extents being cloned are stable
> and won't be until an fsync operation is performed on either the
> reflink source or destination file....
>
> Still no cache flushes. Hence even after the clone has run,
> you can still lose the data (and extents!) from the host file....

Am I right in saying that you are speaking about a *host* crash during or just after the clone?

Even in such a case, only the newly created file clone should be lost/corrupted, while the original file will *not* be affected, right? Or will an interrupted clone operation (ie: due to a power failure) leave *both* files in an inconsistent state?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: Reflink (cow) copy of busy files
From: Dave Chinner @ 2018-02-27 22:04 UTC
To: Gionatan Danti; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs

On Tue, Feb 27, 2018 at 09:06:25AM +0100, Gionatan Danti wrote:
> On 27/02/2018 01:33, Dave Chinner wrote:
> > On Mon, Feb 26, 2018 at 09:26:01AM -0800, Darrick J. Wong wrote:
> >
> > My point is that vfs_clone_file_range is not running fsync(2)
> > operations.
> >
> > It's a fdatawrite_and_wait() call, which submits dirty data to disk
> > and waits for it, but does *not flush volatile storage caches*.
> > IOWs, it's not a data integrity operation.
> >
> > Hence while the reflink now has "data on disk" and can clone the
> > extents, neither the data nor the extents being cloned are stable
> > and won't be until an fsync operation is performed on either the
> > reflink source or destination file....
> >
> > Still no cache flushes. Hence even after the clone has run,
> > you can still lose the data (and extents!) from the host file....
>
> Am I right in saying that you are speaking about a *host* crash during
> or just after the clone?
>
> Even in such a case, only the newly created file clone should be
> lost/corrupted, while the original file will *not* be affected,
> right? Or will an interrupted clone operation (ie: due to a power
> failure) leave *both* files in an inconsistent state?

A host crash can lose data from the original file when it is
configured in writeback mode (as you've said you are using). If the
clone is there, both source and clone should be fully intact. If
it's not, then you will have lost data from the original image file.

But, really, why risk losing data or filesystem corruption by trying
to take shortcuts?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Reflink (cow) copy of busy files
From: Gionatan Danti @ 2018-02-28 7:08 UTC
To: Dave Chinner; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs, g.danti

On 27-02-2018 23:04 Dave Chinner wrote:
> A host crash can lose data from the original file when it is
> configured in writeback mode (as you've said you are using). If the
> clone is there, both source and clone should be fully intact. If
> it's not, then you will have lost data from the original image file.

I have difficulty grasping how a system crash during a cp --reflink could corrupt the source file. As per Darrick's explanation, new writes on the original file should be blocked/queued during the copy. Even if this is not the case, fsync writes should complete only when the data has successfully landed on the disk platter.

Losing a few seconds of async writes should not be a problem in many environments (this is the very reasoning behind providing Qemu/KVM with a working writeback option).

Clearly a crash during the copy *will* produce an invalid destination file, but this cannot be avoided (after all, the system crashed!).

> But, really, why risk losing data or filesystem corruption by trying
> to take shortcuts?

Losing data and filesystem corruption are two *very* different things. On many VMs, I can afford losing some seconds of async writes; obviously, fsync writes (whose loss can lead to filesystem corruption) must *not* be lost under *any* condition.

The point of the discussion is that if a cp --reflink is suitable for hot backup, it would be an extremely fast and convenient method to take a "cheap" snapshot of key files. But if an interrupted copy can lead to total loss of the *original* file/filesystem, then this is clearly the wrong idea.

Am I missing something?
Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: Reflink (cow) copy of busy files
From: Darrick J. Wong @ 2018-02-28 17:07 UTC
To: Gionatan Danti; +Cc: Dave Chinner, Amir Goldstein, linux-xfs

On Wed, Feb 28, 2018 at 08:08:47AM +0100, Gionatan Danti wrote:
> On 27-02-2018 23:04 Dave Chinner wrote:
> > A host crash can lose data from the original file when it is
> > configured in writeback mode (as you've said you are using). If the
> > clone is there, both source and clone should be fully intact. If
> > it's not, then you will have lost data from the original image file.
>
> I have difficulty grasping how a system crash during a cp --reflink could
> corrupt the source file.
> As per Darrick's explanation, new writes on the original file should be
> blocked/queued during the copy. Even if this is not the case, fsync writes
> should complete only when the data has successfully landed on the disk
> platter.

reflink performs (more or less) a fdatasync of the source and dest file
before it starts so that any dirty pages backed by delayed allocation
reservation will be allocated and written to disk, but it doesn't do the
"force all dirty metadata out to log" action that distinguishes
fdatasync from fsync. That is a deliberate design decision because:

1) fsync is fairly heavyweight,
2) customers might have disposable environments where it is preferable
   to lose srcfile and destfile over paying performance penalties
   all the time, and
3) if you need srcfile to be completely stable on disk, you needed to
   call fsync anyway, and nothing prevents you from doing so before
   calling copy_file_range/clone_file_range if that is part of your
   operational requirements.

In other words, if at a certain point you can't afford to lose the
source file due to a host crash, you have to call fsync, as has been the
case for ages. reflink does not itself call fsync, nor does it increase
the chances of losing any file contents that weren't fsync'd before the
host went down.

--D

> Losing a few seconds of async writes should not be a problem in many
> environments (this is the very reasoning behind providing Qemu/KVM with a
> working writeback option).
>
> Clearly a crash during the copy *will* produce an invalid destination file,
> but this cannot be avoided (after all, the system crashed!).
>
> > But, really, why risk losing data or filesystem corruption by trying
> > to take shortcuts?
>
> Losing data and filesystem corruption are two *very* different things. On
> many VMs, I can afford losing some seconds of async writes; obviously, fsync
> writes (whose loss can lead to filesystem corruption) must *not* be lost
> under *any* condition.
>
> The point of the discussion is that if a cp --reflink is suitable for hot
> backup, it would be an extremely fast and convenient method to take a "cheap"
> snapshot of key files. But if an interrupted copy can lead to total loss of
> the *original* file/filesystem, then this is clearly the wrong idea.
>
> Am I missing something?
> Thanks.
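Point 3 above (fsync the source yourself before cloning, if a stable source is part of your requirements) might look like the following sketch. The helper names fsync_path and clone_stable_source are hypothetical, and the clone step is left as a caller-supplied function, for example one that runs cp --reflink=always.

```python
import os
from typing import Callable


def fsync_path(path: str) -> None:
    """Flush one file's data and metadata to stable storage (fsync, not fdatasync)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def clone_stable_source(src: str, dst: str,
                        clone: Callable[[str, str], None]) -> None:
    """Make src durable first, then run the caller-supplied clone step."""
    # Paying the fsync cost here is the caller's choice; the clone operation
    # itself does not do it.
    fsync_path(src)
    clone(src, dst)
```

A caller that does not care about the source surviving a host crash can skip the helper and clone directly, which matches the "disposable environments" case in point 2.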
* Re: Reflink (cow) copy of busy files
From: Gionatan Danti @ 2018-02-28 18:27 UTC
To: Darrick J. Wong; +Cc: Dave Chinner, Amir Goldstein, linux-xfs, g.danti

On 28-02-2018 18:07 Darrick J. Wong wrote:
> reflink performs (more or less) a fdatasync of the source and dest file
> before it starts so that any dirty pages backed by delayed allocation
> reservation will be allocated and written to disk, but it doesn't do the
> "force all dirty metadata out to log" action that distinguishes
> fdatasync from fsync. That is a deliberate design decision because:
>
> 1) fsync is fairly heavyweight,
> 2) customers might have disposable environments where it is preferable
>    to lose srcfile and destfile over paying performance penalties
>    all the time, and
> 3) if you need srcfile to be completely stable on disk, you needed to
>    call fsync anyway, and nothing prevents you from doing so before
>    calling copy_file_range/clone_file_range if that is part of your
>    operational requirements.
>
> In other words, if at a certain point you can't afford to lose the
> source file due to a host crash, you have to call fsync, as has been the
> case for ages. reflink does not itself call fsync, nor does it increase
> the chances of losing any file contents that weren't fsync'd before the
> host went down.

Ok, this is exactly what I expected.

To add some context: Qemu/KVM added safe barrier/fsync passing years ago, so when a guest issues an fsync+barrier operation (ie: after key operations, such as a journal update or a COMMIT) it is immediately passed to the host, which issues a real fsync+barrier on the backing file. In other words, the host's writeback cache is used like a disk's volatile DRAM cache (which needs to be flushed at specific intervals). See: https://www.static.linuxfound.org/jp_uploads/JLS2009/jls09_hellwig.pdf

Back to the original argument: are guest/user initiated fsyncs+barriers honored even *during a cp --reflink copy*? If so, I can't see any shortcoming in using reflinking to hot copy a busy file. Sure, I risk losing async writes (which are in the host's writeback cache *or* in the disk's unflushed volatile DRAM cache), but this is nothing more (or less) than a normal, interrupted copy.

Am I right in saying that? Maybe encapsulating the reflink copy between two fsync calls is a good idea?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
* Re: Reflink (cow) copy of busy files
From: Amir Goldstein @ 2018-02-26 20:29 UTC
To: Gionatan Danti; +Cc: Dave Chinner, linux-xfs

On Mon, Feb 26, 2018 at 10:26 AM, Gionatan Danti <g.danti@assyoma.it> wrote:
> Hi Amir,
>
> On 26-02-2018 08:58 Amir Goldstein wrote:
>>
>> Gionatan,
>>
>> First of all, the answer to your question is "just" faster copy.
>> reflinking a file is much faster than copy, but it is not O(1).
>> I believe cp --reflink can result in cloning part of the file if the
>> system crashes mid operation, so in any case, the operation is not
>> *atomic* in that sense.
>>
>> But your questions about quiescing the filesystem and your question
>> about the *atomic* nature of the clone operation are two very different
>> questions.
>
> can this result in out-of-order writes from the cloned file's point of view?
> I mean:
> - take a 10-extent file;
> - a vm/db/whatever is writing to the file;
> - a cp --reflink is executed;
> - extents are cloned one-by-one, with extents 1-4 already cloned, 5 is in
>   progress;
> - the vm/db writes to extent n.1 - this write will *not* be present on the
>   cloned file;
> - application writes to extent n.6 which will be cloned shortly;

As Darrick explained, new writes to both files should be blocked
throughout the clone operation.

> - the cloned file ends with the later write to extent n.6 but not the
>   previous on extent n.1;
> - bad things happen!
>
> If the above is true, then cp --reflink can't be used even for
> relaxed-consistency backup/clones.
>
>> What you seem to *think* xfs reflink does, it does not actually do.
>> xfs reflink does NOT reflink the file in-memory data.
>> xfs reflink "only" reflinks the file on-disk data.
>> Right now, if you write a large file without fsync and clone it, you
>> might as well get a clone of unallocated or partly fallocated file with
>> zero or stale data.
>
> Oh, I absolutely do not expect reflink/clone to work on in-memory data.
> I *surely* expect dirty, not committed data to be lost: this is the very
> reason I wrote about crash-consistent backup.
>

I was also expecting dirty data to be lost, but according to Darrick that
is not the case for page cache dirty data, although in all likelihood,
your VM is using direct IO to write to the image file.

> In short: is cloning/reflink the same as "pulling the plug" for the cloned
> file? I mean:
> - a successful clone (so, a non-interrupted/crashed one) is akin to an
>   atomic process for the cloned file;
> - async writes/dirty data are lost;
> - fsynced writes are preserved;
> - writes are not reordered/committed out of order.
>

Got it now. According to Darrick's description, the answer seems to be
that your assumptions about clone are correct.

> Maybe the entire discussion is skewed by the fact that, in some cases, I am
> willing to relax my consistency model to include a crash-consistent backup
> option. Fact is, in the virtualization world there are many backup
> utilities/applications which *use* this model, and I wondered if a cp
> --reflink would give similar results without the hassle.
>
> Maybe the entire crash-vs-application consistency is out of place in a
> filesystem mailing list, where you (rightfully!!!) strive for
> perfect/maximum data consistency (and I *really* appreciate that). However,
> given the recent reflinking work on XFS, I wonder if I can put this to
> "good use" when it is considered stable.
>

I think your use case is interesting and this is definitely the right
place to ask these questions.

Cheers,
Amir.
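The ordering argument (writes blocked for the whole clone, so the clone cannot contain a later write without an earlier one) can be illustrated with a toy model. This is not XFS code: ToyFile, its single lock standing in for iolock/mmaplock, and the writer loop are all hypothetical and only mimic the 10-extent scenario described above.

```python
import threading


class ToyFile:
    """Toy stand-in for a file: a list of 'extents' guarded by one lock."""

    def __init__(self, n_extents: int) -> None:
        self.extents = [0] * n_extents
        self.iolock = threading.Lock()  # plays the role of iolock/mmaplock

    def write(self, index: int, value: int) -> None:
        with self.iolock:  # writers wait while a clone is in progress
            self.extents[index] = value

    def clone(self) -> list:
        with self.iolock:  # held for the whole extent-by-extent copy
            return [extent for extent in self.extents]


def writer(f: ToyFile, rounds: int) -> None:
    # Each round writes extent n.1 (index 0) first, then extent n.6 (index 5).
    for seq in range(1, rounds + 1):
        f.write(0, seq)
        f.write(5, seq)


def run_demo(rounds: int = 2000, clones: int = 200) -> list:
    f = ToyFile(10)
    t = threading.Thread(target=writer, args=(f, rounds))
    t.start()
    snapshots = [f.clone() for _ in range(clones)]
    t.join()
    return snapshots
```

Because the lock is held across the whole copy, every snapshot satisfies snapshot[5] <= snapshot[0]: the write to extent n.6 never shows up without the earlier write to extent n.1.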
* Re: Reflink (cow) copy of busy files
From: Gionatan Danti @ 2018-02-26 21:28 UTC
To: Amir Goldstein; +Cc: Dave Chinner, linux-xfs, g.danti

On 26-02-2018 21:29 Amir Goldstein wrote:
> I was also expecting dirty data to be lost, but according to Darrick that
> is not the case for page cache dirty data, although in all likelihood,
> your VM is using direct IO to write to the image file.

Yeah, it was better than expected. For some VMs I use the Qemu/KVM writeback policy rather than "none"/O_DIRECT, so the fact that host-side dirty data will not be lost when reflinking is a pleasant surprise.

> I think your use case is interesting and this is definitely the right
> place to ask these questions.

Great ;)

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
end of thread [~2018-02-28 18:28 UTC]

Thread overview: 24+ messages
2018-02-24 18:20 Reflink (cow) copy of busy files Gionatan Danti
2018-02-24 22:07 ` Dave Chinner
2018-02-24 22:57 ` Gionatan Danti
2018-02-25 2:47 ` Dave Chinner
2018-02-25 11:40 ` Gionatan Danti
2018-02-25 21:13 ` Dave Chinner
2018-02-25 21:58 ` Gionatan Danti
2018-02-26 0:25 ` Dave Chinner
2018-02-26 7:19 ` Gionatan Danti
2018-02-26 7:58 ` Amir Goldstein
2018-02-26 8:26 ` Gionatan Danti
2018-02-26 17:26 ` Darrick J. Wong
2018-02-26 21:23 ` Gionatan Danti
2018-02-26 21:31 ` Darrick J. Wong
2018-02-26 21:39 ` Gionatan Danti
2018-02-27 0:33 ` Dave Chinner
2018-02-27 0:58 ` Darrick J. Wong
2018-02-27 8:06 ` Gionatan Danti
2018-02-27 22:04 ` Dave Chinner
2018-02-28 7:08 ` Gionatan Danti
2018-02-28 17:07 ` Darrick J. Wong
2018-02-28 18:27 ` Gionatan Danti
2018-02-26 20:29 ` Amir Goldstein
2018-02-26 21:28 ` Gionatan Danti