* Reflink (cow) copy of busy files
@ 2018-02-24 18:20 Gionatan Danti
  2018-02-24 22:07 ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Gionatan Danti @ 2018-02-24 18:20 UTC (permalink / raw)
  To: linux-xfs; +Cc: g.danti

Hi all,
I have a question about how CoW/reflink works when used on busy files,
such as VM image files, databases, etc.

In short: can a reflink copy be used to create a crash-consistent snapshot
of, say, a busy VM disk file? Or should the db/vm/whatever be quiesced
before taking the copy (i.e. similarly to how LVM calls fsfreeze during
the snapshot)?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Reflink (cow) copy of busy files
  2018-02-24 18:20 Reflink (cow) copy of busy files Gionatan Danti
@ 2018-02-24 22:07 ` Dave Chinner
  2018-02-24 22:57   ` Gionatan Danti
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2018-02-24 22:07 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

On Sat, Feb 24, 2018 at 07:20:48PM +0100, Gionatan Danti wrote:
> Hi all,
> I have a question about how CoW/reflink works when used on busy files,
> such as VM image files, databases, etc.

Define "busy file", please.

> In short: can reflink-copy be used to create a crash-consistent
> snapshot of, say, a busy vm disk file?

If the file is being actively written, then the clone will not be
consistent.

> Or the db/vm/whatever should
> be quiesced before taking the copy (ie: similarly to how lvm call
> fsfreeze during the snapshot)?

Yes, it's just like any other snapshot process - you have to quiesce
everything that is writing to the file before cloning it. i.e. the
data in the file needs to be in a stable, consistent, unchanging
state if you want the clone to contain consistent data...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Reflink (cow) copy of busy files
  2018-02-24 22:07 ` Dave Chinner
@ 2018-02-24 22:57   ` Gionatan Danti
  2018-02-25  2:47     ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Gionatan Danti @ 2018-02-24 22:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, g.danti

Il 24-02-2018 23:07 Dave Chinner ha scritto:
> Define "busy file", please.

Think about a running virtual machine. Maybe an XFS-based virtual image 
(ie: a CentOS 7 guest).

> If the file is being actively written, then the clone will not be
> consistent.
> 
> Yes, it's just like any other snapshot process - you have to quiesce
> everything that is writing to the file before cloning it. i.e. the
> data in the file needs to be in a stable, consistent, unchanging
> state if you want the clone to contain consistent data...

What level of consistency are we speaking about? I understand that
application-level consistency requires a quiesced filesystem and,
possibly, an application-level agent. But is a quiesced filesystem also a
requisite for a *crash-consistent* (i.e. pull the plug) snapshot?

In other words: would a cp --reflink=always <vmdisk> <snapshot> of a
running virtual machine produce a usable, crash-consistent snapshot, or
does it risk ending up with binary garbage?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Reflink (cow) copy of busy files
  2018-02-24 22:57   ` Gionatan Danti
@ 2018-02-25  2:47     ` Dave Chinner
  2018-02-25 11:40       ` Gionatan Danti
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2018-02-25  2:47 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

On Sat, Feb 24, 2018 at 11:57:32PM +0100, Gionatan Danti wrote:
> Il 24-02-2018 23:07 Dave Chinner ha scritto:
> >Define "busy file", please.
> 
> Think about a running virtual machine. Maybe an XFS-based virtual
> image (ie: a CentOS 7 guest).
> 
> >If the file is being actively written, then the clone will not be
> >consistent.
> >
> >Yes, it's just like any other snapshot process - you have to quiesce
> >everything that is writing to the file before cloning it. i.e. the
> >data in the file needs to be in a stable, consistent, unchanging
> >state if you want the clone to contain consistent data...
> 
> About *what* level of consistency are we speaking? I understand that
> application-level consistency requires a quiesced filesystem and,
> possibly, an application-level agent. But is a quiesced
> filesystem a requisite for a *crash-consistent* (i.e. pull the plug)
> snapshot?

Yes, you have to freeze the filesystem to get a crash-consistent
snapshot of the filesystem.

> In other words: would a cp --reflink=always <vmdisk> <snapshot> of a
> running virtual machine produce a usable, crash-consistent snapshot,
> or does it risk ending up with binary garbage?

You will end up with garbage.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Reflink (cow) copy of busy files
  2018-02-25  2:47     ` Dave Chinner
@ 2018-02-25 11:40       ` Gionatan Danti
  2018-02-25 21:13         ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Gionatan Danti @ 2018-02-25 11:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, g.danti

Il 25-02-2018 03:47 Dave Chinner ha scritto:
> 
> Yes, you have to freeze the filesystem to get a crash-consistent
> snapshot of the filesystem.
> 
> 
> You will end up with garbage.

Ok. Bonus question: am I right in thinking this is due to the CoW copy not
being atomic (i.e. the various extents being in different states until the
copy is finished)?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Reflink (cow) copy of busy files
  2018-02-25 11:40       ` Gionatan Danti
@ 2018-02-25 21:13         ` Dave Chinner
  2018-02-25 21:58           ` Gionatan Danti
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2018-02-25 21:13 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

On Sun, Feb 25, 2018 at 12:40:47PM +0100, Gionatan Danti wrote:
> Il 25-02-2018 03:47 Dave Chinner ha scritto:
> >
> >Yes, you have to freeze the filesystem to get a crash-consistent
> >snapshot of the filesystem.
> >
> >
> >You will end up with garbage.
> 
> Ok. Bonus question: am I right thinking this is due to the CoW copy
> not being atomic (ie: the various extents being in different state
> until the copy is finished)?

This isn't a copy on write issue. This is an issue of the state of
the file and the I/O stack above it at the time the data extents are
shared. There is I/O inflight, and so there's no guarantee that what
is in the extents being shared is consistent. Freezing the
filesystem stops IO in flight, so the extents can be shared while
the filesystem knows it has consistent state on stable storage.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Reflink (cow) copy of busy files
  2018-02-25 21:13         ` Dave Chinner
@ 2018-02-25 21:58           ` Gionatan Danti
  2018-02-26  0:25             ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Gionatan Danti @ 2018-02-25 21:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, g.danti

Il 25-02-2018 22:13 Dave Chinner ha scritto:
> This isn't a copy on write issue. This is an issue of the state of
> the file and the I/O stack above it at the time the data extents are
> shared. There is I/O inflight, and so there's no guarantee that what
> is in the extents being shared is consistent. Freezing the
> filesystem stops IO in flight, so the extents can be shared while
> the filesystem knows it has consistent state on stable storage.

Uhm, this seems to be the very same definition (caveats included) of a
"crash-consistent" snapshot...

Suppose an XFS filesystem used for VM disk images hosting, with running 
VMs. I naively execute a cp --reflink=always copy, stop the original VM 
and start the copied one.

For an atomic snapshot I would expect the data loss to be comparable to a
"power pull" case:
- async writes are lost. After all, they were in the pagecache and never
hit the backing file;
- unacknowledged sync writes are lost. Again, they never successfully
hit the disk;
- acknowledged sync writes (i.e. the ones which returned) are properly
written to the backing file.

If the above is correct, when starting the new (copied) VM, the guest
filesystem will behave as if power was lost: its journal will be replayed
and brought to a consistent state. Applications can/will be affected based
on what they were doing at the time of the reflinked copy, but important
ones (i.e. the ones correctly using fsync), such as databases, will
gracefully recover by replaying their logs.

This should be similar to how LVM snapshot works when no filesystem is 
(directly) layered on top of the volume (ie: volume assigned to a VM).

Still, you warned me that a CoW copy of a running VM will produce
garbage, so I am surely misunderstanding something.
I would greatly appreciate any clarification.

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Reflink (cow) copy of busy files
  2018-02-25 21:58           ` Gionatan Danti
@ 2018-02-26  0:25             ` Dave Chinner
  2018-02-26  7:19               ` Gionatan Danti
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2018-02-26  0:25 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: linux-xfs

On Sun, Feb 25, 2018 at 10:58:16PM +0100, Gionatan Danti wrote:
> Il 25-02-2018 22:13 Dave Chinner ha scritto:
> >This isn't a copy on write issue. This is an issue of the state of
> >the file and the I/O stack above it at the time the data extents are
> >shared. There is I/O inflight, and so there's no guarantee that what
> >is in the extents being shared is consistent. Freezing the
> >filesystem stops IO in flight, so the extents can be shared while
> >the filesystem knows it has consistent state on stable storage.
> 
> Uhm, it seems the very same definition/catches of "crash-consistent"
> snapshot...
>
> Suppose an XFS filesystem used for VM disk images hosting, with
> running VMs. I naively execute a cp --reflink=always copy, stop the
> original VM and start the copied one.
>
> For an atomic snapshot I would expect that dataloss is comparable to
> a "power pull" case:
> - async writes are lost. After all, they were on the pagecache and
> never hit the backing file;
> - unacknowledged sync writes are lost. Again, they never
> successfully hit the disk;
> - acknowledged sync writes (ie: the one which returned) are properly
> written to the backing file.

Acknowledged sync writes are not guaranteed to be stable. They may
still be sitting in volatile caches below the backing file, and so
until there is a cache flush pushed down through all layers of the
storage stack (e.g. fsync on the backing file) those acknowledged
sync writes are not stable. That's one of the things quiescing the
filesystem guarantees, but running reflink to clone the file does
not.
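
A minimal Python sketch of the "fsync on the backing file" step mentioned
above, assuming a Linux host. The helper name durable_write is hypothetical,
and whether the flush really reaches stable media still depends on every
layer below the file honoring cache flushes, which this code cannot verify.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and ask the kernel to flush it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Flush file data and metadata; the kernel issues a device cache flush.
        os.fsync(fd)
    finally:
        os.close(fd)
    # fsync the parent directory so a newly created directory entry is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A write that has merely returned, without a following fsync, carries no such
request to flush the caches below the file.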

IOWs, "properly written" is easy to say but very hard to guarantee.
We cannot make such assumptions about random user configs, nor can we
base recommendations on such assumptions.  If you choose not to
quiesce the filesystems before snapshotting them, then it's your
responsibility to guarantee your storage stack will work correctly.

> If the above is correct, when starting the new (copied) VM, the
> guest filesystem will behave as if power was lost: its journal will be
> replayed and brought to a consistent state.  Applications can/will be
> affected based on what they were doing at the time of the reflinked
> copy, but important ones (i.e. the ones correctly using fsync), such as
> databases, will gracefully recover by replaying their logs.
> 
> This should be similar to how LVM snapshot works when no filesystem
> is (directly) layered on top of the volume (ie: volume assigned to a
> VM).

You still have to quiesce the filesystem when it's on top of a LVM
snapshot volume.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Reflink (cow) copy of busy files
  2018-02-26  0:25             ` Dave Chinner
@ 2018-02-26  7:19               ` Gionatan Danti
  2018-02-26  7:58                 ` Amir Goldstein
  0 siblings, 1 reply; 24+ messages in thread
From: Gionatan Danti @ 2018-02-26  7:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, g.danti

Full disclaimer: maybe my point of view is influenced by thinking in the
context of Qemu/KVM + software RAID (where much work was done to ensure
proper barrier passing) or BBU/NV hardware RAID.

Il 26-02-2018 01:25 Dave Chinner ha scritto:
> Acknowledged sync writes are not guaranteed to be stable. They may
> still be sitting in volatile caches below the backing file, and so
> until there is a cache flush pushed down through all layers of the
> storage stack (e.g. fsync on the backing file) those acknowledged
> sync writes are not stable. That's one of the things quiescing the
> filesystem guarantees, but running reflink to clone the file does
> not.

Sure, but fsync/write barriers that are not passed down will thwart even
"normal" (i.e. not CoW/snapshotted/reflinked) sync writes, and will
inevitably cause problems (i.e. a power loss becomes a big problem). How
is it different for a reflinked copy?

> IOWs, "properly written" is easy to say but very hard to guarantee.
> We cannot make such assumptions about random user configs, nor we
> can base recommendations on such assumptions.  If you choose not to
> quiesce the filesystems before snapshotting them, then it's your
> responsibility to guarantee your storage stack will work correctly.

Absolutely, and I *really* appreciate your advice.

> You still have to quiesce the filesystem when it's on top of a LVM
> snapshot volume.

When the LVM volume is passed to a guest VM, the host cannot quiesce
the filesystem. Host/guest communication can be achieved by means of
a guest agent and a private control channel, but this has its own
problems. I thoroughly tested live, LVM-backed snapshotted VMs and every
time I run them, the guest filesystem replays its log without problems. I
always double-check that the entire I/O stack (from guest down to the
physical disks) honors write barriers, though.

Back to the original question: if a reflinked copy is an *atomic* 
operation on all the data extents comprising a file, and in the context 
of properly passed barriers/fsync, I would think that an unquiesced 
snapshot will work for the (reduced) consistency model of a 
crash-consistent snapshot.

If the reflink copy is not atomic (i.e. the different extents are CoWed
at different times, making it only a "faster copy" rather than a
snapshot) this will *not* work and I will end up with binary garbage (i.e.
writes can be reordered from the snapshot's point of view).

I think it all can be reduced to a single question: putting aside quiescing
problems, is a reflinked copy a true *atomic* snapshot or is it "only" a
faster copy?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Reflink (cow) copy of busy files
  2018-02-26  7:19               ` Gionatan Danti
@ 2018-02-26  7:58                 ` Amir Goldstein
  2018-02-26  8:26                   ` Gionatan Danti
  0 siblings, 1 reply; 24+ messages in thread
From: Amir Goldstein @ 2018-02-26  7:58 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Dave Chinner, linux-xfs

On Mon, Feb 26, 2018 at 9:19 AM, Gionatan Danti <g.danti@assyoma.it> wrote:
> Full disclaimer: maybe my point of view is influenced by thinking in the
> context of Qemu/KVM + software RAID (where much work was done to ensure
> proper barrier passing) or BBU/NV hardware RAID.
>
> Il 26-02-2018 01:25 Dave Chinner ha scritto:
>>
>> Acknowledged sync writes are not guaranteed to be stable. They may
>> still be sitting in volatile caches below the backing file, and so
>> until there is a cache flush pushed down through all layers of the
>> storage stack (e.g. fsync on the backing file) those acknowledged
>> sync writes are not stable. That's one of the things quiescing the
>> filesystem guarantees, but running reflink to clone the file does
>> not.
>
>
> Sure, but fsync/write barriers that are not passed down will thwart even
> "normal" (i.e. not CoW/snapshotted/reflinked) sync writes, and will inevitably
> cause problems (i.e. a power loss becomes a big problem). How is it different
> for a reflinked copy?
>
>> IOWs, "properly written" is easy to say but very hard to guarantee.
>> We cannot make such assumptions about random user configs, nor we
>> can base recommendations on such assumptions.  If you choose not to
>> quiesce the filesystems before snapshotting them, then it's your
>> responsibility to guarantee your storage stack will work correctly.
>
>
> Absolutely, and I *really* appreciate your advice.
>
>> You still have to quiesce the filesystem when it's on top of a LVM
>> snapshot volume.
>
>
> When the LVM volume is passed to a guest VM, the host cannot quiesce the
> filesystem. Host/guest communication can be achieved by means of a guest
> agent and a private control channel, but this has its own problems. I
> thoroughly tested live, LVM-backed snapshotted VMs and every time I run them,
> the guest filesystem replays its log without problems. I always double-check
> that the entire I/O stack (from guest down to the physical disks) honors
> write barriers, though.
>
> Back to the original question: if a reflinked copy is an *atomic* operation
> on all the data extents comprising a file, and in the context of properly
> passed barriers/fsync, I would think that an unquiesced snapshot will work
> for the (reduced) consistency model of a crash-consistent snapshot.
>
> If the reflink copy is not atomic (i.e. the different extents are CoWed at
> different times, making it only a "faster copy" rather than a snapshot) this
> will *not* work and I will end up with binary garbage (i.e. writes can be
> reordered from the snapshot's point of view).
>
> I think it all can be reduced to a single question: putting aside quiescing
> problems, is a reflinked copy a true *atomic* snapshot or is it "only" a
> faster copy?
>

Gionatan,

First of all, the answer to your question is "just" a faster copy.
Reflinking a file is much faster than copying it, but it is not O(1).
I believe cp --reflink can result in cloning only part of the file if the
system crashes mid-operation, so in any case, the operation is not *atomic*
in that sense.

But your questions about quiescing the filesystem and your question
about the *atomic* nature of the clone operation are two very different
questions.

What you seem to *think* xfs reflink does, it does not actually do.
xfs reflink does NOT reflink the file's in-memory data.
xfs reflink "only" reflinks the file's on-disk data.
Right now, if you write a large file without fsync and clone it, you
may well get a clone of an unallocated or partly fallocated file with
zero or stale data.

Going forward, I think there is an intention to "clone" the file in-memory
data as well by sharing the READONLY cache pages between cloned files,
but I don't think dirty pages are going to be shared between clones anyway,
so you are back to square one - you need to get the data on disk before
cloning the file.

Cheers,
Amir.


* Re: Reflink (cow) copy of busy files
  2018-02-26  7:58                 ` Amir Goldstein
@ 2018-02-26  8:26                   ` Gionatan Danti
  2018-02-26 17:26                     ` Darrick J. Wong
  2018-02-26 20:29                     ` Amir Goldstein
  0 siblings, 2 replies; 24+ messages in thread
From: Gionatan Danti @ 2018-02-26  8:26 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Dave Chinner, linux-xfs, g.danti

Hi Amir,

Il 26-02-2018 08:58 Amir Goldstein ha scritto:
> 
> Gionatan,
> 
> First of all, the answer to your question is "just" a faster copy.
> Reflinking a file is much faster than copying it, but it is not O(1).
> I believe cp --reflink can result in cloning only part of the file if the
> system crashes mid-operation, so in any case, the operation is not
> *atomic* in that sense.
> 
> But your questions about quiescing the filesystem and your question
> about the *atomic* nature of the clone operation are two very different
> questions.

Can this result in out-of-order writes from the cloned file's point of
view? I mean:
- take a 10-extent file;
- a vm/db/whatever is writing to the file;
- a cp --reflink is executed;
- extents are cloned one by one, with extents 1-4 already cloned, 5 is in
progress;
- the vm/db writes to extent n.1 - this write will *not* be present in
the cloned file;
- the application writes to extent n.6, which will be cloned shortly;
- the cloned file ends up with the later write to extent n.6 but not the
earlier one to extent n.1;
- bad things happen!

If the above is true, then cp --reflink can't be used even for
relaxed-consistency backups/clones.
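
The hypothetical scenario above can be written down as a toy Python
simulation. It models only the feared behaviour (an extent-by-extent copy with
no locking against concurrent writers); it is not a model of what XFS actually
does, and the function name and data layout are made up for illustration.
Extents are indexed from 0 here, so "extent n.1" is index 0 and "extent n.6"
is index 5.

```python
def clone_extent_by_extent(source, writes_during_copy):
    """Copy `source` one extent at a time while writes land in between.

    `writes_during_copy` maps "number of extents already copied" to a list
    of (extent_index, new_value) writes applied to the source at that point.
    Returns the final state of the live file and of the clone.
    """
    source = list(source)
    clone = []
    for copied in range(len(source)):
        # Apply the writes that arrive just before this extent is copied.
        for index, value in writes_during_copy.get(copied, []):
            source[index] = value
        clone.append(source[copied])
    return source, clone


original = ["old"] * 10
# After 4 extents are copied: write A hits extent 0 (already copied),
# then write B hits extent 5 (not yet copied).
live, clone = clone_extent_by_extent(original, {4: [(0, "A"), (5, "B")]})
print(clone[0], clone[5])  # old B
```

In this model the clone holds the later write B without the earlier write A,
which is exactly the reordering that would break crash consistency.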

> What you seem to *think* xfs reflink does, it does not actually do.
> xfs reflink does NOT reflink the file in-memory data.
> xfs reflink "only" reflinks the file on-disk data.
> Right now, if you write a large file without fsync and clone it, you
> might as well get a clone of unallocated or partly fallocated file with
> zero or stale data.

Oh, I absolutely do not expect reflink/clone to work on in-memory
data. I *surely* expect dirty, uncommitted data to be lost: this is
the very reason I wrote about crash-consistent backups.

In short: is cloning/reflink the same as "pulling the plug" for the
cloned file? I mean:
- a successful clone (so, a non-interrupted/crashed one) is akin to an
atomic process for the cloned file;
- async writes/dirty data are lost;
- fsynced writes are preserved;
- writes are not reordered/committed out of order.

Maybe the entire discussion is skewed by the fact that, in some cases, I 
am willing to relax my consistency model to include a crash-consistent 
backup option. Fact is, in the virtualization world there are many 
backup utilities/applications which *use* this model, and I wondered if 
a cp --reflink would give similar results without the hassle.

Maybe the entire crash-vs-application consistency discussion is out of
place on a filesystem mailing list, where you (rightfully!!!) strive for
perfect/maximum data consistency (and I *really* appreciate that).
However, given the recent reflink work on XFS, I wonder if I can
put this to "good use" when it is considered stable.

> Going forward, I think there is an intention to "clone" the file
> in-memory data as well by sharing the READONLY cache pages between
> cloned files, but I don't think dirty pages are going to be shared
> between clones anyway, so you are back to square one - you need to get
> the data on disk before cloning the file.

Great - I think this would do wonders for cache efficiency...

> 
> Cheers,
> Amir.

Thanks.

PS: sorry if I keep rephrasing the question in different terms. English is
not my primary language, please bear with me :p

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: Reflink (cow) copy of busy files
  2018-02-26  8:26                   ` Gionatan Danti
@ 2018-02-26 17:26                     ` Darrick J. Wong
  2018-02-26 21:23                       ` Gionatan Danti
  2018-02-27  0:33                       ` Dave Chinner
  2018-02-26 20:29                     ` Amir Goldstein
  1 sibling, 2 replies; 24+ messages in thread
From: Darrick J. Wong @ 2018-02-26 17:26 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Amir Goldstein, Dave Chinner, linux-xfs

On Mon, Feb 26, 2018 at 09:26:14AM +0100, Gionatan Danti wrote:
> Hi Amir,
> 
> Il 26-02-2018 08:58 Amir Goldstein ha scritto:
> >
> >Gionatan,
> >
> >First of all, the answer to your question is "just" a faster copy.
> >Reflinking a file is much faster than copying it, but it is not O(1).
> >I believe cp --reflink can result in cloning only part of the file if the
> >system crashes mid-operation, so in any case, the operation is not
> >*atomic* in that sense.
> >
> >But your questions about quiescing the filesystem and your question
> >about the *atomic* nature of the clone operation are two very different
> >questions.
> 
> Can this result in out-of-order writes from the cloned file's point of view?
> I mean:
> - take a 10-extent file;
> - a vm/db/whatever is writing to the file;
> - a cp --reflink is executed;
> - extents are cloned one by one, with extents 1-4 already cloned, 5 is in
> progress;
> - the vm/db writes to extent n.1 - this write will *not* be present in the
> cloned file;
> - the application writes to extent n.6, which will be cloned shortly;
> - the cloned file ends up with the later write to extent n.6 but not the
> earlier one to extent n.1;
> - bad things happen!
> 
> If the above is true, then cp --reflink can't be used even for
> relaxed-consistency backups/clones.
> 
> >What you seem to *think* xfs reflink does, it does not actually do.
> >xfs reflink does NOT reflink the file in-memory data.
> >xfs reflink "only" reflinks the file on-disk data.
> >Right now, if you write a large file without fsync and clone it, you
> >might as well get a clone of unallocated or partly fallocated file with
> >zero or stale data.
> 
> Oh, I absolutely do not expect reflink/clone to work on in-memory data.
> I *surely* expect dirty, uncommitted data to be lost: this is the very
> reason I wrote about crash-consistent backups.
> 
> In short: is cloning/reflink the same as "pulling the plug" for the cloned
> file? I mean:
> - a successful clone (so, a non-interrupted/crashed one) is akin to an
> atomic process for the cloned file;
> - async writes/dirty data are lost;
> - fsynced writes are preserved;
> - writes are not reordered/committed out of order.
> 
> Maybe the entire discussion is skewed by the fact that, in some cases, I am
> willing to relax my consistency model to include a crash-consistent backup
> option. Fact is, in the virtualization world there are many backup
> utilities/applications which *use* this model, and I wondered if a cp
> --reflink would give similar results without the hassle.
> 
> Maybe the entire crash-vs-application consistency discussion is out of place
> on a filesystem mailing list, where you (rightfully!!!) strive for
> perfect/maximum data consistency (and I *really* appreciate that). However,
> given the recent reflink work on XFS, I wonder if I can put this to
> "good use" when it is considered stable.

The way reflink is supposed to work wrt consistency is:

1. lock out all new io/fallocate activity on both inodes (iolock/mmaplock)
2. wait for all directio to complete
3. fsync both files (write all the dirty pagecache to disk)
4. lock both inodes (ilock)
5. clone each extent atomically
6. unlock ilock
7. unlock iolock/mmaplock

So at least in theory the cloned file will match whatever the host saw
on disk and page cache at the time the reflink call was initiated.
I say 'in theory' because there could be bugs.
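
From user space, a whole-file clone is requested with the FICLONE ioctl, which
is what cp --reflink issues; the locking and flushing steps listed above
happen inside the kernel during that one call. A minimal Python sketch,
assuming Linux and a filesystem with reflink support. The helper name
reflink_clone and the set of error codes treated as "unsupported" are
illustrative choices, not a definitive list.

```python
import errno
import fcntl
import os

# _IOW(0x94, 9, int) from <linux/fs.h>, as encoded on common architectures.
FICLONE = 0x40049409

# Errors treated here as "this filesystem cannot reflink these files".
UNSUPPORTED = {errno.EOPNOTSUPP, errno.ENOTTY, errno.EINVAL,
               errno.EXDEV, errno.ENOSYS}


def reflink_clone(src_path: str, dst_path: str) -> bool:
    """Share all extents of src with dst; return False if unsupported."""
    supported = True
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # The ioctl is issued on the destination; its argument is the
            # source file descriptor.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        except OSError as exc:
            if exc.errno not in UNSUPPORTED:
                raise
            supported = False
    if not supported:
        os.unlink(dst_path)  # do not leave an empty destination behind
    return supported
```

Note that nothing in this call quiesces the guest: it only captures what the
host sees for that file at the time of the call.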

Whatever dirty state is in the guest VM stays in that VM, which means
that if you only cp --reflink on the host, the clone you get will
reflect the virtual disk state as if you'd kill -9'd the VM, cloned the
VM disk, and restarted the VM.  Upon restart the log recovers whatever
metadata made it out of the VM.

However, if you tell the guest to freeze the fs before cloning (as Dave
suggested earlier) the guest will flush all its state to the upper level
(the host) and the host will push all that out to disk before cloning.
The snapshot you create should be cleaner because you're effectively
prepaying the recovery costs by flushing everything before taking the
snapshot.
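
One way to script the "freeze the guest, then clone" sequence, as a hedged
Python sketch. It assumes a libvirt-managed guest with the QEMU guest agent
installed, so that virsh domfsfreeze and virsh domfsthaw work; the function
name, its parameters and the injectable run argument are hypothetical and
exist only to keep the example testable.

```python
import subprocess


def snapshot_with_guest_freeze(domain, disk, snapshot, run=subprocess.run):
    """Freeze the guest filesystems, reflink the disk image, then thaw."""
    # Ask the guest agent to flush and freeze all guest filesystems.
    run(["virsh", "domfsfreeze", domain], check=True)
    try:
        # Clone on the host while the guest is quiesced.
        run(["cp", "--reflink=always", disk, snapshot], check=True)
    finally:
        # Always thaw, even if the copy failed, so the guest is not left frozen.
        run(["virsh", "domfsthaw", domain], check=True)
```

The freeze is done in the guest rather than on the host filesystem holding the
image, since the clone itself has to write to that host filesystem.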

Also note that if the host goes down before returning from the syscall,
the log will continue on with whichever extent was being cloned at the
time in order to preserve metadata integrity, but the destination file
will reflect a partial copy.

--D

> >Going forward, I think there is an intention to "clone" the file in-memory
> >data as well by sharing the READONLY cache pages between cloned files,
> >but I don't think dirty pages are going to be shared between clones anyway,
> >so you are back to square one - you need to get the data on disk before
> >cloning the file.
> 
> Great - I think this would do wonders for cache efficiency...
> 
> >
> >Cheers,
> >Amir.
> 
> Thanks.
> 
> PS: sorry if I rephrase the question in different terms. English is not my
> primary language, please bear with me :p
> 
> -- 
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8


* Re: Reflink (cow) copy of busy files
  2018-02-26  8:26                   ` Gionatan Danti
  2018-02-26 17:26                     ` Darrick J. Wong
@ 2018-02-26 20:29                     ` Amir Goldstein
  2018-02-26 21:28                       ` Gionatan Danti
  1 sibling, 1 reply; 24+ messages in thread
From: Amir Goldstein @ 2018-02-26 20:29 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Dave Chinner, linux-xfs

On Mon, Feb 26, 2018 at 10:26 AM, Gionatan Danti <g.danti@assyoma.it> wrote:
> Hi Amir,
>
> Il 26-02-2018 08:58 Amir Goldstein ha scritto:
>>
>>
>> Gionatan,
>>
>> First of all, the answer to your question is "just" a faster copy.
>> Reflinking a file is much faster than copying it, but it is not O(1).
>> I believe cp --reflink can result in cloning only part of the file if the
>> system crashes mid-operation, so in any case, the operation is not
>> *atomic* in that sense.
>>
>> But your questions about quiescing the filesystem and your question
>> about the *atomic* nature of the clone operation are two very different
>> questions.
>
>
> Can this result in out-of-order writes from the cloned file's point of view?
> I mean:
> - take a 10-extent file;
> - a vm/db/whatever is writing to the file;
> - a cp --reflink is executed;
> - extents are cloned one by one, with extents 1-4 already cloned, 5 is in
> progress;
> - the vm/db writes to extent n.1 - this write will *not* be present in the
> cloned file;
> - the application writes to extent n.6, which will be cloned shortly;

As Darrick explained, new writes to both files should be blocked throughout
the clone operation.

> - the cloned file ends up with the later write to extent n.6 but not the
> earlier one to extent n.1;
> - bad things happen!
>
> If the above is true, then cp --reflink can't be used even for
> relaxed-consistency backups/clones.
>
>> What you seem to *think* xfs reflink does, it does not actually do.
>> xfs reflink does NOT reflink the file in-memory data.
>> xfs reflink "only" reflinks the file on-disk data.
>> Right now, if you write a large file without fsync and clone it, you
>> might as well get a clone of unallocated or partly fallocated file with
>> zero or stale data.
>
>
> Oh, I absolutely do not expect reflink/clone to work on in-memory data.
> I *surely* expect dirty, uncommitted data to be lost: this is the very
> reason I wrote about crash-consistent backups.
>

I was also expecting dirty data to be lost, but according to Darrick that is
not the case for dirty page cache data, although in all likelihood, your VM
is using direct IO to write to the image file.

> In short: is cloning/reflink the same as "pulling the plug" for the cloned
> file? I mean:
> - a successful clone (so, a non-interrupted/non-crashed one) is akin to an
> atomic process for the cloned file;
> - async writes/dirty data are lost;
> - fsynced writes are preserved;
> - writes are not reordered/committed out of order.
>

Got it now. According to Darrick's description, the answer seems to be
that your assumptions about clone are correct.

> Maybe the entire discussion is skewed by the fact that, in some cases, I am
> willing to relax my consistency model to include a crash-consistent backup
> option. Fact is, in the virtualization world there are many backup
> utilities/applications which *use* this model, and I wondered if a cp
> --reflink would give similar results without the hassle.
>
> Maybe the entire crash-vs-application consistency question is out of place
> on a filesystem mailing list, where you (rightfully!!!) strive for
> perfect/maximum data consistency (and I *really* appreciate that). However,
> given the recent reflink work on XFS, I wonder if I can put this to
> "good use" when it is considered stable.
>

I think your use case is interesting and this is definitely the right place
to ask these questions.

Cheers,
Amir.

* Re: Reflink (cow) copy of busy files
  2018-02-26 17:26                     ` Darrick J. Wong
@ 2018-02-26 21:23                       ` Gionatan Danti
  2018-02-26 21:31                         ` Darrick J. Wong
  2018-02-27  0:33                       ` Dave Chinner
  1 sibling, 1 reply; 24+ messages in thread
From: Gionatan Danti @ 2018-02-26 21:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Amir Goldstein, Dave Chinner, linux-xfs, g.danti

Il 26-02-2018 18:26 Darrick J. Wong ha scritto:
> The way reflink is supposed to work wrt consistency is:
> 
> 1. lock out all new io/fallocate activity on both inodes 
> (iolock/mmaplock)
> 2. wait for all directio to complete
> 3. fsync both files (write all the dirty pagecache to disk)
> 4. lock both inodes (ilock)
> 5. clone each extent atomically
> 6. unlock ilock
> 7. unlock iolock/mmaplock
> 
> So at least in theory the cloned file will match whatever the host saw
> on disk and page cache at the time the reflink call was initiated.
> I say 'in theory' because there could be bugs.

Great! CoW will be a great addition to XFS once it is considered 
stable.

> Whatever dirty state is in the guest VM stays in that VM, which means
> that if you only cp --reflink on the host, the clone you get will
> reflect the virtual disk state as if you'd kill -9'd the VM, cloned the
> VM disk, and restarted the VM.  Upon restart the log recovers whatever
> metadata made it out of the VM.

Sure, that is what I mean by "crash-consistent".

> However, if you tell the guest to freeze the fs before cloning (as Dave
> suggested earlier) the guest will flush all its state to the upper 
> level
> (the host) and the host will push all that out to disk before cloning.
> The snapshot you create should be cleaner because you're effectively
> prepaying the recovery costs by flushing everything before taking the
> snapshot.

True, and this is "application-level consistency" (which requires a 
guest agent and possibly even an application-specific agent).

> Also note that if the host goes down before returning from the syscall,
> the log will continue on with whichever extent was being cloned at the
> time in order to preserve metadata integrity, but the destination file
> will reflect a partial copy.

Thanks for pointing that out, and for your extremely clear explanation!


-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

* Re: Reflink (cow) copy of busy files
  2018-02-26 20:29                     ` Amir Goldstein
@ 2018-02-26 21:28                       ` Gionatan Danti
  0 siblings, 0 replies; 24+ messages in thread
From: Gionatan Danti @ 2018-02-26 21:28 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Dave Chinner, linux-xfs, g.danti

Il 26-02-2018 21:29 Amir Goldstein ha scritto:
> I was also expecting dirty data to be lost, but according to Darrick
> that is not the case for dirty page cache data, although in all
> likelihood, your VM is using direct IO to write to the image file.

Yeah, it is better than expected. For some VMs I use the Qemu/KVM writeback 
cache policy rather than "none"/O_DIRECT, so the fact that host-side dirty 
data will not be lost when reflinking is a pleasant surprise.

> I think your use case is interesting and this is definitely the right 
> place
> to ask these questions.

Great ;)


-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

* Re: Reflink (cow) copy of busy files
  2018-02-26 21:23                       ` Gionatan Danti
@ 2018-02-26 21:31                         ` Darrick J. Wong
  2018-02-26 21:39                           ` Gionatan Danti
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2018-02-26 21:31 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Amir Goldstein, Dave Chinner, linux-xfs

On Mon, Feb 26, 2018 at 10:23:45PM +0100, Gionatan Danti wrote:
> Il 26-02-2018 18:26 Darrick J. Wong ha scritto:
> >The way reflink is supposed to work wrt consistency is:
> >
> >1. lock out all new io/fallocate activity on both inodes (iolock/mmaplock)
> >2. wait for all directio to complete
> >3. fsync both files (write all the dirty pagecache to disk)
> >4. lock both inodes (ilock)
> >5. clone each extent atomically
> >6. unlock ilock
> >7. unlock iolock/mmaplock
> >
> >So at least in theory the cloned file will match whatever the host saw
> >on disk and page cache at the time the reflink call was initiated.
> >I say 'in theory' because there could be bugs.
> 
> Great! CoW will be a great addition to XFS once it is considered
> stable.
> 
> >Whatever dirty state is in the guest VM stays in that VM, which means
> >that if you only cp --reflink on the host, the clone you get will
> >reflect the virtual disk state as if you'd kill -9'd the VM, cloned the
> >VM disk, and restarted the VM.  Upon restart the log recovers whatever
> >metadata made it out of the VM.
> 
> Sure, that is what I mean by "crash-consistent".
> 
> >However, if you tell the guest to freeze the fs before cloning (as Dave
> >suggested earlier) the guest will flush all its state to the upper level
> >(the host) and the host will push all that out to disk before cloning.
> >The snapshot you create should be cleaner because you're effectively
> >prepaying the recovery costs by flushing everything before taking the
> >snapshot.
> 
> True, and this is "application-level consistency" (which requires a guest
> agent and possibly even an application-specific agent)

I believe qemu-ga takes care of guest fs freeze inside the guest,
and you can invoke it from the host via 'virsh domfsfreeze' or the
--quiesce argument to snapshot-create... but you ought to confirm that
for yourself.
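
A rough sketch of that freeze/clone/thaw sequence in Python, assuming
libvirt's virsh is installed on the host and qemu-ga is running in the
guest. The function name, the domain name and the image paths are
hypothetical, and the `run` parameter exists only so the sequence can be
exercised without libvirt:

```python
import subprocess


def quiesced_reflink(domain, src, dst, run=subprocess.run):
    """Freeze guest filesystems, reflink the image, then thaw.

    `run` defaults to subprocess.run; it is a parameter only so that
    the command sequence can be tested without libvirt installed.
    """
    # Ask the guest agent (through libvirt) to freeze guest filesystems.
    run(["virsh", "domfsfreeze", domain], check=True)
    try:
        # Clone while the guest is quiesced. --reflink=always makes cp
        # fail instead of silently falling back to a slow full copy.
        run(["cp", "--reflink=always", src, dst], check=True)
    finally:
        # Always thaw, even if the copy failed, so the guest is not
        # left frozen.
        run(["virsh", "domfsthaw", domain], check=True)
```

It would be called as, say, quiesced_reflink("vm1", "/images/vm1.img",
"/images/vm1.snap.img"). The try/finally is the important part: a failed
copy should never leave the guest frozen.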

--D

> >Also note that if the host goes down before returning from the syscall,
> >the log will continue on with whichever extent was being cloned at the
> >time in order to preserve metadata integrity, but the destination file
> >will reflect a partial copy.
> 
> Thanks for pointing that out, and for your extremely clear explanation!
> 
> 
> -- 
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8

* Re: Reflink (cow) copy of busy files
  2018-02-26 21:31                         ` Darrick J. Wong
@ 2018-02-26 21:39                           ` Gionatan Danti
  0 siblings, 0 replies; 24+ messages in thread
From: Gionatan Danti @ 2018-02-26 21:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Amir Goldstein, Dave Chinner, linux-xfs, g.danti

Il 26-02-2018 22:31 Darrick J. Wong ha scritto:
> I believe qemu-ga takes care of guest fs freeze inside the guest,
> and you can invoke it from the host via 'virsh domfsfreeze' or the
> --quiesce argument to snapshot-create... but you ought to confirm that
> for yourself.

Sure, and I can confirm it works properly in Linux VMs (Windows VMs are 
another matter, if things have not changed).
However, it also poses a security risk: a malicious agent can block 
your *entire* backup process (even for other virtual machines).

So, in some environments, an agentless backup solution is preferred.
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

* Re: Reflink (cow) copy of busy files
  2018-02-26 17:26                     ` Darrick J. Wong
  2018-02-26 21:23                       ` Gionatan Danti
@ 2018-02-27  0:33                       ` Dave Chinner
  2018-02-27  0:58                         ` Darrick J. Wong
  2018-02-27  8:06                         ` Gionatan Danti
  1 sibling, 2 replies; 24+ messages in thread
From: Dave Chinner @ 2018-02-27  0:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Gionatan Danti, Amir Goldstein, linux-xfs

On Mon, Feb 26, 2018 at 09:26:01AM -0800, Darrick J. Wong wrote:
> On Mon, Feb 26, 2018 at 09:26:14AM +0100, Gionatan Danti wrote:
> > Hi Amir,
> > 
> > Il 26-02-2018 08:58 Amir Goldstein ha scritto:
> > >
> > >Gionatan,
> > >
> > >First of all, the answer to your question is "just" a faster copy.
> > >Reflinking a file is much faster than copying it, but it is not O(1).
> > >I believe cp --reflink can result in cloning only part of the file if
> > >the system crashes mid-operation, so in any case, the operation is not
> > >*atomic* in that sense.
> > >
> > >But your question about quiescing the filesystem and your question
> > >about the *atomic* nature of the clone operation are two very different
> > >questions.
> > 
> > Can this result in out-of-order writes from the cloned file's point of view?
> > I mean:
> > - take a 10-extent file;
> > - a vm/db/whatever is writing to the file;
> > - a cp --reflink is executed;
> > - extents are cloned one by one, with extents 1-4 already cloned and 5 in
> > progress;
> > - the vm/db writes to extent n.1 - this write will *not* be present in the
> > cloned file;
> > - the application writes to extent n.6, which will be cloned shortly;
> > - the cloned file ends up with the later write to extent n.6 but not the
> > earlier one to extent n.1;
> > - bad things happen!
> > 
> > If the above is true, then cp --reflink can't be used even for
> > relaxed-consistency backups/clones.
> > 
> > >What you seem to *think* xfs reflink does, it does not actually do.
> > >xfs reflink does NOT reflink the file's in-memory data.
> > >xfs reflink "only" reflinks the file's on-disk data.
> > >Right now, if you write a large file without fsync and clone it, you
> > >might well get a clone of an unallocated or partly fallocated file with
> > >zero or stale data.
> > 
> > Oh, I absolutely do not expect reflink/clone to work on in-memory data.
> > I *surely* expect dirty, uncommitted data to be lost: this is the very
> > reason I wrote about crash-consistent backups.
> > 
> > In short: is cloning/reflink the same as "pulling the plug" for the cloned
> > file? I mean:
> > - a successful clone (so, a non-interrupted/non-crashed one) is akin to an
> > atomic process for the cloned file;
> > - async writes/dirty data are lost;
> > - fsynced writes are preserved;
> > - writes are not reordered/committed out of order.
> > 
> > Maybe the entire discussion is skewed by the fact that, in some cases, I am
> > willing to relax my consistency model to include a crash-consistent backup
> > option. Fact is, in the virtualization world there are many backup
> > utilities/applications which *use* this model, and I wondered if a cp
> > --reflink would give similar results without the hassle.
> > 
> > Maybe the entire crash-vs-application consistency question is out of place
> > on a filesystem mailing list, where you (rightfully!!!) strive for
> > perfect/maximum data consistency (and I *really* appreciate that). However,
> > given the recent reflink work on XFS, I wonder if I can put this to
> > "good use" when it is considered stable.
> 
> The way reflink is supposed to work wrt consistency is:
> 
> 1. lock out all new io/fallocate activity on both inodes (iolock/mmaplock)
> 2. wait for all directio to complete
> 3. fsync both files (write all the dirty pagecache to disk)

My point is that vfs_clone_file_range is not running fsync(2)
operations.

It's a fdatawrite_and_wait() call, which submits dirty data to disk
and waits for it, but does *not flush volatile storage caches*.
IOWs, it's not a data integrity operation.

Hence while the reflink now has "data on disk" and can clone the
extents, neither the data nor the extents being cloned are stable,
and they won't be until an fsync operation is performed on either the
reflink source or destination file....

> 4. lock both inodes (ilock)
> 5. clone each extent atomically
> 6. unlock ilock
> 7. unlock iolock/mmaplock
> 
> So at least in theory the cloned file will match whatever the host saw
> on disk and page cache at the time the reflink call was initiated.
> I say 'in theory' because there could be bugs.

Still no cache flushes. Hence even after the clone has run,
you can still lose the data (and extents!) from the host file....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Reflink (cow) copy of busy files
  2018-02-27  0:33                       ` Dave Chinner
@ 2018-02-27  0:58                         ` Darrick J. Wong
  2018-02-27  8:06                         ` Gionatan Danti
  1 sibling, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2018-02-27  0:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Gionatan Danti, Amir Goldstein, linux-xfs

On Tue, Feb 27, 2018 at 11:33:48AM +1100, Dave Chinner wrote:
> On Mon, Feb 26, 2018 at 09:26:01AM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 26, 2018 at 09:26:14AM +0100, Gionatan Danti wrote:
> > > Hi Amir,
> > > 
> > > Il 26-02-2018 08:58 Amir Goldstein ha scritto:
> > > >
> > > >Gionatan,
> > > >
> > > >First of all, the answer to your question is "just" a faster copy.
> > > >Reflinking a file is much faster than copying it, but it is not O(1).
> > > >I believe cp --reflink can result in cloning only part of the file if
> > > >the system crashes mid-operation, so in any case, the operation is not
> > > >*atomic* in that sense.
> > > >
> > > >But your question about quiescing the filesystem and your question
> > > >about the *atomic* nature of the clone operation are two very different
> > > >questions.
> > > 
> > > Can this result in out-of-order writes from the cloned file's point of view?
> > > I mean:
> > > - take a 10-extent file;
> > > - a vm/db/whatever is writing to the file;
> > > - a cp --reflink is executed;
> > > - extents are cloned one by one, with extents 1-4 already cloned and 5 in
> > > progress;
> > > - the vm/db writes to extent n.1 - this write will *not* be present in the
> > > cloned file;
> > > - the application writes to extent n.6, which will be cloned shortly;
> > > - the cloned file ends up with the later write to extent n.6 but not the
> > > earlier one to extent n.1;
> > > - bad things happen!
> > > 
> > > If the above is true, then cp --reflink can't be used even for
> > > relaxed-consistency backups/clones.
> > > 
> > > >What you seem to *think* xfs reflink does, it does not actually do.
> > > >xfs reflink does NOT reflink the file's in-memory data.
> > > >xfs reflink "only" reflinks the file's on-disk data.
> > > >Right now, if you write a large file without fsync and clone it, you
> > > >might well get a clone of an unallocated or partly fallocated file with
> > > >zero or stale data.
> > > 
> > > Oh, I absolutely do not expect reflink/clone to work on in-memory data.
> > > I *surely* expect dirty, uncommitted data to be lost: this is the very
> > > reason I wrote about crash-consistent backups.
> > > 
> > > In short: is cloning/reflink the same as "pulling the plug" for the cloned
> > > file? I mean:
> > > - a successful clone (so, a non-interrupted/non-crashed one) is akin to an
> > > atomic process for the cloned file;
> > > - async writes/dirty data are lost;
> > > - fsynced writes are preserved;
> > > - writes are not reordered/committed out of order.
> > > 
> > > Maybe the entire discussion is skewed by the fact that, in some cases, I am
> > > willing to relax my consistency model to include a crash-consistent backup
> > > option. Fact is, in the virtualization world there are many backup
> > > utilities/applications which *use* this model, and I wondered if a cp
> > > --reflink would give similar results without the hassle.
> > > 
> > > Maybe the entire crash-vs-application consistency question is out of place
> > > on a filesystem mailing list, where you (rightfully!!!) strive for
> > > perfect/maximum data consistency (and I *really* appreciate that). However,
> > > given the recent reflink work on XFS, I wonder if I can put this to
> > > "good use" when it is considered stable.
> > 
> > The way reflink is supposed to work wrt consistency is:
> > 
> > 1. lock out all new io/fallocate activity on both inodes (iolock/mmaplock)
> > 2. wait for all directio to complete
> > 3. fsync both files (write all the dirty pagecache to disk)
> 
> My point is that vfs_clone_file_range is not running fsync(2)
> operations.
> 
> It's a fdatawrite_and_wait() call, which submits dirty data to disk
> and waits for it, but does *not flush volatile storage caches*.
> IOWs, it's not a data integrity operation.
> 
> Hence while the reflink now has "data on disk" and can clone the
> extents, neither the data nor the extents being cloned are stable,
> and they won't be until an fsync operation is performed on either the
> reflink source or destination file....
> 
> > 4. lock both inodes (ilock)
> > 5. clone each extent atomically
> > 6. unlock ilock
> > 7. unlock iolock/mmaplock
> > 
> > So at least in theory the cloned file will match whatever the host saw
> > on disk and page cache at the time the reflink call was initiated.
> > I say 'in theory' because there could be bugs.
> 
> Still no cache flushes. Hence even after the clone has run,
> you can still lose the data (and extents!) from the host file....

TBH I was assuming that the host doesn't go down in these scenarios, so
we were only concerned about getting the guest to flush everything it
had.  But Dave is right: if you need the host to maintain data integrity
as well, then you need to fsync both the src and dest fds.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: Reflink (cow) copy of busy files
  2018-02-27  0:33                       ` Dave Chinner
  2018-02-27  0:58                         ` Darrick J. Wong
@ 2018-02-27  8:06                         ` Gionatan Danti
  2018-02-27 22:04                           ` Dave Chinner
  1 sibling, 1 reply; 24+ messages in thread
From: Gionatan Danti @ 2018-02-27  8:06 UTC (permalink / raw)
  To: Dave Chinner, Darrick J. Wong; +Cc: Amir Goldstein, linux-xfs, Gionatan Danti

On 27/02/2018 01:33, Dave Chinner wrote:
> On Mon, Feb 26, 2018 at 09:26:01AM -0800, Darrick J. Wong wrote:
> 
> My point is that vfs_clone_file_range is not running fsync(2)
> operations.
> 
> It's a fdatawrite_and_wait() call, which submits dirty data to disk
> and waits for it, but does *not flush volatile storage caches*.
> IOWs, it's not a data integrity operation.
> 
> Hence while the reflink now has "data on disk" and can clone the
> extents, neither the data nor the extents being cloned are stable,
> and they won't be until an fsync operation is performed on either the
> reflink source or destination file....
> 
> Still no cache flushes. Hence even after the clone has run,
> you can still lose the data (and extents!) from the host file....

Am I right in saying that you are talking about a *host* crash during or 
just after the clone?

Even in such a case, only the newly created file clone should be 
lost/corrupted, while the original file will *not* be affected, right? 
Or will an interrupted clone operation (ie: due to a power failure) 
leave *both* files in an inconsistent state?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

* Re: Reflink (cow) copy of busy files
  2018-02-27  8:06                         ` Gionatan Danti
@ 2018-02-27 22:04                           ` Dave Chinner
  2018-02-28  7:08                             ` Gionatan Danti
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2018-02-27 22:04 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs

On Tue, Feb 27, 2018 at 09:06:25AM +0100, Gionatan Danti wrote:
> On 27/02/2018 01:33, Dave Chinner wrote:
> >On Mon, Feb 26, 2018 at 09:26:01AM -0800, Darrick J. Wong wrote:
> >
> >My point is that vfs_clone_file_range is not running fsync(2)
> >operations.
> >
> >It's a fdatawrite_and_wait() call, which submits dirty data to disk
> >and waits for it, but does *not flush volatile storage caches*.
> >IOWs, it's not a data integrity operation.
> >
> >Hence while the reflink now has "data on disk" and can clone the
> >extents, neither the data nor the extents being cloned are stable,
> >and they won't be until an fsync operation is performed on either the
> >reflink source or destination file....
> >
> >Still no cache flushes. Hence even after the clone has run,
> >you can still lose the data (and extents!) from the host file....
> 
> Am I right in saying that you are talking about a *host* crash during
> or just after the clone?
> 
> Even in such a case, only the newly created file clone should be
> lost/corrupted, while the original file will *not* be affected,
> right? Or will an interrupted clone operation (ie: due to a power
> failure) leave *both* files in an inconsistent state?

A host crash can lose data from the original file when it is
configured in writeback mode (as you've said you are using). If the
clone is there, both source and clone should be fully intact. If
it's not, then you will have lost data from the original image file.

But, really, why risk losing data or filesystem corruption by trying
to take shortcuts?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Reflink (cow) copy of busy files
  2018-02-27 22:04                           ` Dave Chinner
@ 2018-02-28  7:08                             ` Gionatan Danti
  2018-02-28 17:07                               ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Gionatan Danti @ 2018-02-28  7:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs, g.danti

Il 27-02-2018 23:04 Dave Chinner ha scritto:
> A host crash can lose data from the original file when it is
> configured in writeback mode (as you've said you are using). If the
> clone is there, both source and clone should be fully intact. If
> it's not, then you will have lost data from the original image file.

I have difficulty grasping how a system crash during a cp --reflink could 
corrupt the source file.
As per Darrick's explanation, new writes to the original file should be 
blocked/queued during the copy. Even if this is not the case, fsync 
writes should complete only when the data has successfully landed on the disk 
platter. Losing a few seconds of async writes should not be a problem in 
many environments (this is the very reasoning behind providing Qemu/KVM 
with a working writeback option).

Clearly a crash during the copy *will* produce an invalid destination 
file, but this can not be avoided (after all, the system crashed!).

> But, really, why risk losing data or filesystem corruption by trying
> to take shortcuts?

Losing data and filesystem corruption are two *very* different things. 
On many VMs, I can afford to lose a few seconds of async writes; 
obviously, fsync writes (whose loss can lead to filesystem corruption) must 
*not* be lost under *any* condition.

The point of the discussion is that if cp --reflink is suitable for 
hot backup, it would be an extremely fast and convenient method to take 
"cheap" snapshots of key files. But if an interrupted copy can lead to 
total loss of the *original* file/filesystem, then this is clearly the 
wrong idea.

Am I missing something?
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

* Re: Reflink (cow) copy of busy files
  2018-02-28  7:08                             ` Gionatan Danti
@ 2018-02-28 17:07                               ` Darrick J. Wong
  2018-02-28 18:27                                 ` Gionatan Danti
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2018-02-28 17:07 UTC (permalink / raw)
  To: Gionatan Danti; +Cc: Dave Chinner, Amir Goldstein, linux-xfs

On Wed, Feb 28, 2018 at 08:08:47AM +0100, Gionatan Danti wrote:
> Il 27-02-2018 23:04 Dave Chinner ha scritto:
> >A host crash can lose data from the original file when it is
> >configured in writeback mode (as you've said you are using). If the
> >clone is there, both source and clone should be fully intact. If
> >it's not, then you will have lost data from the original image file.
> 
> I have difficulty grasping how a system crash during a cp --reflink could
> corrupt the source file.
> As per Darrick's explanation, new writes to the original file should be
> blocked/queued during the copy. Even if this is not the case, fsync writes
> should complete only when the data has successfully landed on the disk platter.

reflink performs (more or less) a fdatasync of the source and dest file
before it starts so that any dirty pages backed by delayed allocation
reservation will be allocated and written to disk, but it doesn't do the
"force all dirty metadata out to log" action that distinguishes
fdatasync from fsync.  That is a deliberate design decision because:

1) fsync is fairly heavyweight,
2) customers might have disposable environments where it is preferable
   to lose srcfile and destfile over paying performance penalties
   all the time, and
3) if you need srcfile to be completely stable on disk, you needed to
   call fsync anyway, and nothing prevents you from doing so before
   calling copy_file_range/clone_file_range if that is part of your
   operational requirements.

In other words, if at a certain point you can't afford to lose the
source file due to a host crash, you have to call fsync, as has been the
case for ages.  reflink does not itself call fsync, nor does it increase
the chances of losing any file contents that weren't fsync'd before the
host went down.
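
A minimal sketch of that "fsync it yourself" pattern in Python, assuming
a Linux host. FICLONE is the ioctl behind cp --reflink; the function
name durable_clone and the fall-back to an ordinary byte copy when the
ioctl fails are assumptions of this sketch, not something reflink or cp
does for you:

```python
import fcntl
import os
import shutil

# FICLONE from <linux/fs.h>: _IOW(0x94, 9, int). Defined by hand here
# rather than relying on a constant exported by the fcntl module.
FICLONE = 0x40049409


def durable_clone(src_path, dst_path):
    """Clone src to dst, fsyncing the source before and the copy after.

    Returns True if the extents were shared via FICLONE, False if this
    sketch fell back to an ordinary byte copy.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Make the source stable first: the clone itself only writes
        # back dirty data and is not a data integrity operation.
        os.fsync(src.fileno())
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            cloned = True
        except OSError:
            # Reflink not available here (or the ioctl failed for some
            # other reason): copy the bytes instead.
            shutil.copyfileobj(src, dst)
            dst.flush()
            cloned = False
        # Make the new file's data and metadata stable as well.
        os.fsync(dst.fileno())
    return cloned
```

If the new directory entry must also survive a host crash, a stricter
version would additionally fsync the directory containing dst_path.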

--D

> Losing a few seconds of async writes should not be a problem in many
> environments (this is the very reasoning behind providing Qemu/KVM with a
> working writeback option).
> 
> Clearly a crash during the copy *will* produce an invalid destination file,
> but this can not be avoided (after all, the system crashed!).
> 
> >But, really, why risk losing data or filesystem corruption by trying
> >to take shortcuts?
> 
> Losing data and filesystem corruption are two *very* different things. On
> many VMs, I can afford to lose a few seconds of async writes; obviously, fsync
> writes (whose loss can lead to filesystem corruption) must *not* be lost under
> *any* condition.
> 
> The point of the discussion is that if cp --reflink is suitable for hot
> backup, it would be an extremely fast and convenient method to take "cheap"
> snapshots of key files. But if an interrupted copy can lead to total loss of
> the *original* file/filesystem, then this is clearly the wrong idea.
> 
> Am I missing something?
> Thanks.
> 
> -- 
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti@assyoma.it - info@assyoma.it
> GPG public key ID: FF5F32A8

* Re: Reflink (cow) copy of busy files
  2018-02-28 17:07                               ` Darrick J. Wong
@ 2018-02-28 18:27                                 ` Gionatan Danti
  0 siblings, 0 replies; 24+ messages in thread
From: Gionatan Danti @ 2018-02-28 18:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, Amir Goldstein, linux-xfs, g.danti

Il 28-02-2018 18:07 Darrick J. Wong ha scritto:
> reflink performs (more or less) a fdatasync of the source and dest file
> before it starts so that any dirty pages backed by delayed allocation
> reservation will be allocated and written to disk, but it doesn't do 
> the
> "force all dirty metadata out to log" action that distinguishes
> fdatasync from fsync.  That is a deliberate design decision because:
> 
> 1) fsync is fairly heavyweight,
> 2) customers might have disposable environments where it is preferable
>    to lose srcfile and destfile over paying performance penalties
>    all the time, and
> 3) if you need srcfile to be completely stable on disk, you needed to
>    call fsync anyway, and nothing prevents you from doing so before
>    calling copy_file_range/clone_file_range if that is part of your
>    operational requirements.
> 
> In other words, if at a certain point you can't afford to lose the
> source file due to a host crash, you have to call fsync, as has been 
> the
> case for ages.  reflink does not itself call fsync, nor does it 
> increase
> the chances of losing any file contents that weren't fsync'd before the
> host went down.

Ok, this is exactly what I expect.

To add some context: Qemu/KVM added safe barrier/fsync passing years 
ago, so when a guest issues an fsync+barrier operation (ie: after key 
operations, such as a journal update or a COMMIT), it is immediately passed 
to the host, which issues a real fsync+barrier on the backing file. In 
other words, the host's writeback cache is treated like a disk's volatile 
DRAM cache (which needs to be flushed at specific intervals). See: 
https://www.static.linuxfound.org/jp_uploads/JLS2009/jls09_hellwig.pdf

Back to the original argument: are guest/user-initiated fsyncs+barriers 
honored even *during a cp --reflink copy*? If so, I can't see any 
shortcoming in using reflink to hot-copy a busy file. Sure, I risk 
losing async writes (which are in the host's writeback cache *or* in the 
disk's unflushed volatile DRAM cache), but this is nothing more (or 
less) than a normal, interrupted copy. Am I right in saying that?

Maybe wrapping the reflink copy between two fsync calls is a 
good idea?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

end of thread, other threads:[~2018-02-28 18:28 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
2018-02-24 18:20 Reflink (cow) copy of busy files Gionatan Danti
2018-02-24 22:07 ` Dave Chinner
2018-02-24 22:57   ` Gionatan Danti
2018-02-25  2:47     ` Dave Chinner
2018-02-25 11:40       ` Gionatan Danti
2018-02-25 21:13         ` Dave Chinner
2018-02-25 21:58           ` Gionatan Danti
2018-02-26  0:25             ` Dave Chinner
2018-02-26  7:19               ` Gionatan Danti
2018-02-26  7:58                 ` Amir Goldstein
2018-02-26  8:26                   ` Gionatan Danti
2018-02-26 17:26                     ` Darrick J. Wong
2018-02-26 21:23                       ` Gionatan Danti
2018-02-26 21:31                         ` Darrick J. Wong
2018-02-26 21:39                           ` Gionatan Danti
2018-02-27  0:33                       ` Dave Chinner
2018-02-27  0:58                         ` Darrick J. Wong
2018-02-27  8:06                         ` Gionatan Danti
2018-02-27 22:04                           ` Dave Chinner
2018-02-28  7:08                             ` Gionatan Danti
2018-02-28 17:07                               ` Darrick J. Wong
2018-02-28 18:27                                 ` Gionatan Danti
2018-02-26 20:29                     ` Amir Goldstein
2018-02-26 21:28                       ` Gionatan Danti
