linux-fsdevel.vger.kernel.org archive mirror
* [LSF/MM TOPIC] Lazy file reflink
@ 2019-01-25 14:27 Amir Goldstein
  2019-01-28 12:50 ` Jan Kara
  2019-01-31 20:02 ` Chris Murphy
  0 siblings, 2 replies; 13+ messages in thread
From: Amir Goldstein @ 2019-01-25 14:27 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig, Jan Kara

Hi,

I would like to discuss the concept of lazy file reflink.
The use case is backup of a very large read-mostly file.
Backup application would like to read consistent content from the
file, "atomic read" sort of speak.

With a filesystem that supports reflink, that can be done by:
- Create O_TMPFILE
- Reflink origin to temp file
- Backup from temp file
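For illustration, here is a minimal sketch of that sequence using the
existing O_TMPFILE and FICLONE interfaces (error handling trimmed; the
directory and origin paths are whatever the backup application uses):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FICLONE */
#include <unistd.h>

/* Clone origin into an unnamed temp file on the same filesystem and
 * return an fd the backup can read a stable image from. */
static int clone_for_backup(const char *dir, const char *origin)
{
        int src = open(origin, O_RDONLY);
        int tmp = open(dir, O_TMPFILE | O_RDWR, 0600);

        if (src < 0 || tmp < 0)
                return -1;

        /* Share the origin's extents with the temp file (reflink) */
        if (ioctl(tmp, FICLONE, src) < 0) {
                close(tmp);
                tmp = -1;
        }
        close(src);
        return tmp;     /* shared blocks are released when this fd is closed */
}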

However, since the origin file is very likely not to be modified,
the reflink step, which may incur lots of metadata updates, is a waste.
Instead, if the filesystem could be notified that atomic content was
requested (O_ATOMIC|O_RDONLY or O_CLONE|O_RDONLY),
it could defer the reflink to an O_TMPFILE until the origin file is
opened for write or actually modified.

What I just described above is actually already implemented with
Overlayfs snapshots [1], but for many applications overlayfs snapshots
are not a practical solution.

I have based my assumption that reflink of a large file may incur
lots of metadata updates on my limited knowledge of xfs reflink
implementation, but perhaps it is not the case for other filesystems?
(btrfs?) and perhaps the current metadata overhead on reflink of a large
file is an implementation detail that could be optimized in the future?

The point of the matter is that there is no API to make an explicit
request for a "volatile reflink" that does not need to survive power
failure, and that limits the ability of filesystems to optimize this case.

I realize the "atomic read" requirement is somewhat adjacent to
the "atomic write" [2] requirement, if only by name, but I am
not sure how much they really have in common.

A somewhat different approach to the problem is for the application
to use fanotify to register for a pre-modify callback and implement the
lazy reflink by itself. This could work, but it would require extending
the semantics of fanotify, and the application currently needs
CAP_SYS_ADMIN, because it can block access to the file indefinitely.
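For illustration, a rough sketch of what that could look like, assuming a
hypothetical FAN_PRE_MODIFY permission event (no such event exists today;
adding one is exactly the fanotify extension referred to above). tmpfd is
an empty O_TMPFILE descriptor on the same filesystem as the origin:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/fanotify.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

#ifndef FAN_PRE_MODIFY
#define FAN_PRE_MODIFY 0x00080000       /* hypothetical event, made-up value */
#endif

/* Defer the reflink until someone is about to modify the origin. */
static void lazy_reflink(int tmpfd, const char *origin)
{
        int fan = fanotify_init(FAN_CLASS_PRE_CONTENT, O_RDONLY);
        struct fanotify_event_metadata ev;
        struct fanotify_response resp;

        fanotify_mark(fan, FAN_MARK_ADD, FAN_PRE_MODIFY, AT_FDCWD, origin);

        /* Blocks until a writer shows up; the writer waits for our reply. */
        read(fan, &ev, sizeof(ev));

        /* Take the reflink before allowing the modification to proceed. */
        ioctl(tmpfd, FICLONE, ev.fd);

        resp.fd = ev.fd;
        resp.response = FAN_ALLOW;
        write(fan, &resp, sizeof(resp));
        close(ev.fd);
        close(fan);
}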

Would love to get some feedback about the concept from filesystem
developers.

Thanks,
Amir.

[1] https://lwn.net/Articles/719772/
[2] https://lwn.net/Articles/715918/

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-25 14:27 [LSF/MM TOPIC] Lazy file reflink Amir Goldstein
@ 2019-01-28 12:50 ` Jan Kara
  2019-01-28 21:26   ` Dave Chinner
  2019-01-31 20:02 ` Chris Murphy
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Kara @ 2019-01-28 12:50 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong,
	Christoph Hellwig, Jan Kara

Hi,

On Fri 25-01-19 16:27:52, Amir Goldstein wrote:
> I would like to discuss the concept of lazy file reflink.
> The use case is backup of a very large read-mostly file.
> Backup application would like to read consistent content from the
> file, "atomic read" sort of speak.
> 
> With a filesystem that supports reflink, that can be done by:
> - Create O_TMPFILE
> - Reflink origin to temp file
> - Backup from temp file
> 
> However, since the origin file is very likely not to be modified,
> the reflink step, which may incur lots of metadata updates, is a waste.
> Instead, if the filesystem could be notified that atomic content was
> requested (O_ATOMIC|O_RDONLY or O_CLONE|O_RDONLY),
> it could defer the reflink to an O_TMPFILE until the origin file is
> opened for write or actually modified.
> 
> What I just described above is actually already implemented with
> Overlayfs snapshots [1], but for many applications overlayfs snapshots
> are not a practical solution.
> 
> I have based my assumption that reflink of a large file may incur
> lots of metadata updates on my limited knowledge of xfs reflink
> implementation, but perhaps it is not the case for other filesystems?
> (btrfs?) and perhaps the current metadata overhead on reflink of a large
> file is an implementation detail that could be optimized in the future?
> 
> The point of the matter is that there is no API to make an explicit
> request for a "volatile reflink" that does not need to survive power
> failure, and that limits the ability of filesystems to optimize this case.

Well, to me this seems like a relatively rare use case (and performance
gain) for the complexity. Also, the speed of reflink is fs-dependent - e.g.
for btrfs it is rather cheap AFAIK.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-28 12:50 ` Jan Kara
@ 2019-01-28 21:26   ` Dave Chinner
  2019-01-28 22:56     ` Amir Goldstein
  2019-01-31 21:13     ` Matthew Wilcox
  0 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2019-01-28 21:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, lsf-pc, linux-fsdevel, linux-xfs,
	Darrick J. Wong, Christoph Hellwig

On Mon, Jan 28, 2019 at 01:50:44PM +0100, Jan Kara wrote:
> Hi,
> 
> On Fri 25-01-19 16:27:52, Amir Goldstein wrote:
> > I would like to discuss the concept of lazy file reflink.
> > The use case is backup of a very large read-mostly file.
> > Backup application would like to read consistent content from the
> > file, "atomic read" sort of speak.
> > 
> > With a filesystem that supports reflink, that can be done by:
> > - Create O_TMPFILE
> > - Reflink origin to temp file
> > - Backup from temp file
> >
> > However, since the origin file is very likely not to be modified,
> > the reflink step, which may incur lots of metadata updates, is a waste.
> > Instead, if the filesystem could be notified that atomic content was
> > requested (O_ATOMIC|O_RDONLY or O_CLONE|O_RDONLY),
> > it could defer the reflink to an O_TMPFILE until the origin file is
> > opened for write or actually modified.

That makes me want to run screaming for the hills.

> > What I just described above is actually already implemented with
> > Overlayfs snapshots [1], but for many applications overlayfs snapshots
> > are not a practical solution.
> > 
> > I have based my assumption that reflink of a large file may incur
> > lots of metadata updates on my limited knowledge of xfs reflink
> > implementation, but perhaps it is not the case for other filesystems?

Comparatively speaking: compared to copying a large file, reflink is
cheap on any filesystem that implements it. Sure, reflinking on XFS
is CPU limited, IIRC, to ~10-20,000 extents per second per reflink
op per AG, but it's still faster than copying 10-20,000 extents
per second per copy op on all but the very fastest, unloaded nvme
SSDs...
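(Rough, assumed numbers for scale, ignoring AG parallelism: a 1TiB file
averaging 4MiB per extent has ~262,000 extents, so reflinking it at
10-20,000 extents/s costs ~13-26s of CPU, while copying the same 1TiB
even at 2GB/s takes roughly 550s plus the IO bandwidth it consumes.)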

> > (btrfs?) and perhaps the current metadata overhead on reflink of a large
> > file is an implementation detail that could be optimized in the future?
> > 
> > The point of the matter is that there is no API to make an explicit
> > request for a "volatile reflink" that does not need to survive power
> > failure, and that limits the ability of filesystems to optimize this case.
> 
> Well, to me this seems like a relatively rare use case (and performance
> gain) for the complexity. Also, the speed of reflink is fs-dependent - e.g.
> for btrfs it is rather cheap AFAIK.

I suspect that for a "very large read-mostly file" it's still an expensive
operation on btrfs.

Really, though, for this use case it makes more sense to have "per
file freeze" semantics. i.e. if you want a consistent backup image
on snapshot capable storage, the process is usually "freeze
filesystem, snapshot fs, unfreeze fs, do backup from snapshot,
remove snapshot". We can already transparently block incoming
writes/modifications on files via the freeze mechanism, so why not
just extend that to per-file granularity so writes to the "very
large read-mostly file" block while it's being backed up....

Indeed, this would probably only require a simple extension to
FIFREEZE/FITHAW - the parameter is currently ignored, but as defined
by XFS it was a "freeze level". Set this to 0xffffffff and then it
freezes just the fd passed in, not the whole filesystem.
Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define...
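A sketch of how that might look from userspace, assuming the proposed
semantics (FIFREEZE/FITHAW exist today but operate on the whole
filesystem; the per-file freeze level below is hypothetical):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FIFREEZE, FITHAW */
#include <unistd.h>

#define FREEZE_THIS_FILE 0xffffffff     /* proposed: freeze only this fd */

static int backup_while_frozen(const char *path)
{
        unsigned int level = FREEZE_THIS_FILE;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || ioctl(fd, FIFREEZE, &level) < 0)  /* block writers */
                return -1;

        /* ... read a point-in-time image of the file here ... */

        ioctl(fd, FITHAW, &level);      /* let writers continue */
        close(fd);
        return 0;
}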

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-28 21:26   ` Dave Chinner
@ 2019-01-28 22:56     ` Amir Goldstein
  2019-01-29  0:18       ` Dave Chinner
  2019-01-31 21:13     ` Matthew Wilcox
  1 sibling, 1 reply; 13+ messages in thread
From: Amir Goldstein @ 2019-01-28 22:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong,
	Christoph Hellwig

> > > What I just described above is actually already implemented with
> > > Overlayfs snapshots [1], but for many applications overlayfs snapshots
> > > are not a practical solution.
> > >
> > > I have based my assumption that reflink of a large file may incur
> > > lots of metadata updates on my limited knowledge of xfs reflink
> > > implementation, but perhaps it is not the case for other filesystems?
>
> Comparatively speaking: compared to copying a large file, reflink is
> cheap on any filesystem that implements it. Sure, reflinking on XFS
> is CPU limited, IIRC, to ~10-20,000 extents per second per reflink
> op per AG, but it's still faster than copying 10-20,000 extents
> per second per copy op on all but the very fastest, unloaded nvme
> SSDs...
>

I think the concern is the added metadata load on the rest of the
users. The backup app doesn't care about the time it takes to clone
before backup. But this concern is not based on actual numbers.

> Really, though, for this use case it makes more sense to have "per
> file freeze" semantics. i.e. if you want a consistent backup image
> on snapshot capable storage, the process is usually "freeze
> filesystem, snapshot fs, unfreeze fs, do backup from snapshot,
> remove snapshot". We can already transparently block incoming
> writes/modifications on files via the freeze mechanism, so why not
> just extend that to per-file granularity so writes to the "very
> large read-mostly file" block while it's being backed up....
>
> Indeed, this would probably only require a simple extension to
> FIFREEZE/FITHAW - the parameter is currently ignored, but as defined
> by XFS it was a "freeze level". Set this to 0xffffffff and then it
> freezes just the fd passed in, not the whole filesystem.
> Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define...
>

I think it's a good idea to add file freeze semantics to the toolbox
of useful things that could be accomplished with reflink.
Especially with your plans for subvolumes as files.
How is that coming along, by the way?

Anyway, freeze semantics alone won't work for our backup application
that needs to be non-intrusive. Even if writes to the large file are few,
backup may take time, so blocking those few writes for that long is
not acceptable. Blocking the writes for the setup time of a reflink
is exactly what I was proposing and in your analogy, the block
device is frozen only for a short period of time for setting up the
snapshot and not for the duration of the backup.

Thanks,
Amir.

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-28 22:56     ` Amir Goldstein
@ 2019-01-29  0:18       ` Dave Chinner
  2019-01-29  7:18         ` Amir Goldstein
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2019-01-29  0:18 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong,
	Christoph Hellwig

On Tue, Jan 29, 2019 at 12:56:17AM +0200, Amir Goldstein wrote:
> > > > What I just described above is actually already implemented with
> > > > Overlayfs snapshots [1], but for many applications overlayfs snapshots
> > > > are not a practical solution.
> > > >
> > > > I have based my assumption that reflink of a large file may incur
> > > > lots of metadata updates on my limited knowledge of xfs reflink
> > > > implementation, but perhaps it is not the case for other filesystems?
> >
> > Comparatively speaking: compared to copying a large file, reflink is
> > cheap on any filesystem that implements it. Sure, reflinking on XFS
> > is CPU limited, IIRC, to ~10-20,000 extents per second per reflink
> > op per AG, but it's still faster than copying 10-20,000 extents
> > per second per copy op on all but the very fastest, unloaded nvme
> > SSDs...
> >
> 
> I think the concern is the added metadata load on the rest of the
> users. The backup app doesn't care about the time it takes to clone
> before backup. But this concern is not based on actual numbers.

So what is it based on?

> > Really, though, for this use case it makes more sense to have "per
> > file freeze" semantics. i.e. if you want a consistent backup image
> > on snapshot capable storage, the process is usually "freeze
> > filesystem, snapshot fs, unfreeze fs, do backup from snapshot,
> > remove snapshot". We can already transparently block incoming
> > writes/modifications on files via the freeze mechanism, so why not
> > just extend that to per-file granularity so writes to the "very
> > large read-mostly file" block while it's being backed up....
> >
> > Indeed, this would probably only require a simple extension to
> > FIFREEZE/FITHAW - the parameter is currently ignored, but as defined
> > by XFS it was a "freeze level". Set this to 0xffffffff and then it
> > freezes just the fd passed in, not the whole filesystem.
> > Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define...
> >
> 
> I think it's a good idea to add file freeze semantics to the toolbox
> of useful things that could be accomplished with reflink.

reflink is already atomic w.r.t. other writes - in what way does a
"file freeze" have any impact on a reflink operation? that is, apart
from preventing it from being done, because reflink can modify the
source inode on XFS, too....

> Especially with your plans for subvolumes as files.
> How is that coming along, by the way?

If I didn't have to spend so much time fire-fighting broken stuff,
I might make more progress.

> Anyway, freeze semantics alone won't work for our backup application
> that needs to be non-intrusive. Even if writes to the large file are few,
> backup may take time, so blocking those few writes for that long is
> not acceptable.

So, reflink is too expensive because there are only occasional
writes, but blocking that occasional write is too expensive, too,
even though it is rare?

> Blocking the writes for the setup time of a reflink
> is exactly what I was proposing and in your analogy,

No, I proposed a way to provide a -point in time snapshot- of a
file that doesn't require reflink or any other special filesystem
support.

> the block
> device is frozen only for a short period of time for setting up the
> snapshot and not for the duration of the backup.

Right, it's frozen for as long as it takes to set up a -point in
time snapshot- that the backup can be taken from. You don't need
that to reflink a file. You need it if you want to do something
other than a reflink....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-29  0:18       ` Dave Chinner
@ 2019-01-29  7:18         ` Amir Goldstein
  2019-01-29 23:01           ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Amir Goldstein @ 2019-01-29  7:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong,
	Christoph Hellwig

> > I think it's a good idea to add file freeze semantics to the toolbox
> > of useful things that could be accomplished with reflink.
>
> reflink is already atomic w.r.t. other writes - in what way does a
> "file freeze" have any impact on a reflink operation? that is, apart
> from preventing it from being done, because reflink can modify the
> source inode on XFS, too....
>

- create O_TMPFILE
- freeze source file
- read and calculate hash from source file
- likely unfreeze and skip reflink+backup

For the unlikely case, the application could copy_file_range
before unfreezing, and that means that reflinking the source should
be allowed while the file is frozen, that is, while the file *data* is frozen.

That means that a file freeze API needs to be able to express whether both
metadata and data freeze are required.
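A sketch of that flow, assuming the hypothetical per-file freeze ioctls
named earlier in the thread (FI_FREEZE_FILE/FI_THAW_FILE do not exist
today) and leaving the hashing helpers as assumed stubs:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

#define FI_FREEZE_FILE  _IO('X', 200)   /* hypothetical per-file freeze */
#define FI_THAW_FILE    _IO('X', 201)   /* hypothetical per-file thaw */

int hash_file(int fd, unsigned char digest[32]);        /* assumed helpers, */
int hash_changed(const char *name, const unsigned char digest[32]); /* not shown */

/* Returns an fd to a volatile clone if the file changed, else -1. */
static int clone_if_changed(int dirfd, const char *name)
{
        unsigned char digest[32];
        int changed;
        int src = openat(dirfd, name, O_RDONLY);
        int tmp = openat(dirfd, ".", O_TMPFILE | O_RDWR, 0600);

        ioctl(src, FI_FREEZE_FILE, 0);          /* block data modifications */
        hash_file(src, digest);                 /* read while data is stable */
        changed = hash_changed(name, digest);
        if (changed)
                ioctl(tmp, FICLONE, src);       /* reflink under write protection */
        ioctl(src, FI_THAW_FILE, 0);            /* writers may proceed */

        close(src);
        if (!changed)
                close(tmp);
        return changed ? tmp : -1;      /* back up lazily from tmp later */
}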

> > Anyway, freeze semantics alone won't work for our backup application
> > that needs to be non-intrusive. Even if writes to the large file are few,
> > backup may take time, so blocking those few writes for that long is
> > not acceptable.
>
> So, reflink is too expensive because there are only occasional
> writes, but blocking that occasional write is too expensive, too,
> even though it is rare?
>

All right. I admit to having presented a weak example, but I am
not submitting a patch to be merged. I am proposing a discussion
on what I think is a gap in the existing API. The feedback of "what are
the measurable benefits?" is expected, but I brought this up
anyway, without concrete measurable figures, to hear what others
have to say. And frankly, I quite like the file freeze suggestion, so
I am glad that I did.

Besides, even if existing filesystems implement reflink "fast enough",
this is not at all mandated by the API.

> > Blocking the writes for the setup time of a reflink
> > is exactly what I was proposing and in your analogy,
>
> No, I proposed a way to provide a -point in time snapshot- of a
> file that doesn't require reflink or any other special filesystem
> support.
>
> > the block
> > device is frozen only for a short period of time for setting up the
> > snapshot and not for the duration of the backup.
>
> Right, it's frozen for as long as it takes to set up a -point in
> time snapshot- that the backup can be taken from. You don't need
> that to reflink a file. You need it if you want to do something
> other than a reflink....
>

Correct. As I wrote above, that could be used for conditional
copy or conditional reflink on a filesystem where reflink has a
measurable cost.

Bottom line: I completely agree with you that "file freeze" is sufficient
for the case I presented, as long as reflink is allowed while the file is frozen.
IOW, break the existing compound API freeze+reflink+unfreeze into
individual operations to give the user more control.

Thanks,
Amir.

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-29  7:18         ` Amir Goldstein
@ 2019-01-29 23:01           ` Dave Chinner
  2019-01-30 13:30             ` Amir Goldstein
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2019-01-29 23:01 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong,
	Christoph Hellwig

On Tue, Jan 29, 2019 at 09:18:57AM +0200, Amir Goldstein wrote:
> > > I think it's a good idea to add file freeze semantics to the toolbox
> > > of useful things that could be accomplished with reflink.
> >
> > reflink is already atomic w.r.t. other writes - in what way does a
> > "file freeze" have any impact on a reflink operation? that is, apart
> > from preventing it from being done, because reflink can modify the
> > source inode on XFS, too....
> >
> 
> - create O_TMPFILE
> - freeze source file
> - read and calculate hash from source file
> - likely unfreeze and skip reflink+backup

IF you can read the file to determine if you need to back up the
file while it's frozen, then why do you need to reflink it
before you read the file to back it up? It's the /exact same
read operation/ on the file.

I'm not sure what the hell you actually want now, because you're
contradicting yourself again by saying you want to read the entire file
while it's frozen and blocking writes, after you just said that's
not acceptable and why reflink is required....

> > > Anyway, freeze semantics alone won't work for our backup application
> > > that needs to be non-intrusive. Even if writes to the large file are few,
> > > backup may take time, so blocking those few writes for that long is
> > > not acceptable.
> >
> > So, reflink is too expensive because there are only occasional
> > writes, but blocking that occasional write is too expensive, too,
> > even though it is rare?
> >
> 
> All right. I admit to having presented a weak example, but I am
> not submitting a patch to be merged. I am proposing a discussion
> on what I think is a gap in the existing API. The feedback of "what are
> the measurable benefits?" is expected, but I brought this up
> anyway, without concrete measurable figures, to hear what others
> have to say. And frankly, I quite like the file freeze suggestion, so
> I am glad that I did.
> 
> Besides, even if existing filesystems implement reflink "fast enough",
> this is not at all mandated by the API.

All the API mandates is that it is atomic w.r.t. other IO and
that's all that really matters. Everything else is optional,
including performance.

Correct functionality first, performance second.

> Bottom line: I completely agree with you that "file freeze" is sufficient
> for the case I presented, as long as reflink is allowed while the file is frozen.
> IOW, break the existing compound API freeze+reflink+unfreeze into
> individual operations to give the user more control.

I don't think that's a good idea. If we allow "metadata" to be
unfrozen, but only freeze data, does that mean we allow modifying
owner, perms, attributes, etc., but then don't allow truncate? What
about preallocation over holes? That doesn't change data, and it's
only a metadata modification. What about background dedupe? That
sort of thing is a can of worms that I don't want to go anywhere
near. Either the file is frozen (i.e. effectively immutable but
blocks modifications rather than EPERMs) or it's a normal, writeable
file - madness lies within any other boundary...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-29 23:01           ` Dave Chinner
@ 2019-01-30 13:30             ` Amir Goldstein
  2019-01-31 20:25               ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Amir Goldstein @ 2019-01-30 13:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong,
	Christoph Hellwig

On Wed, Jan 30, 2019 at 1:01 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Jan 29, 2019 at 09:18:57AM +0200, Amir Goldstein wrote:
> > > > I think it's a good idea to add file freeze semantics to the toolbox
> > > > of useful things that could be accomplished with reflink.
> > >
> > > reflink is already atomic w.r.t. other writes - in what way does a
> > > "file freeze" have any impact on a reflink operation? that is, apart
> > > from preventing it from being done, because reflink can modify the
> > > source inode on XFS, too....
> > >
> >
> > - create O_TMPFILE
> > - freeze source file
> > - read and calculate hash from source file
> > - likely unfreeze and skip reflink+backup
>
> IF you can read the file to determine if you need to back up the
> file while it's frozen, then why do you need to reflink it
> before you read the file to back it up? It's the /exact same
> read operation/ on the file.
>
> I'm not sure what the hell you actually want now, because you're
> contradicting yourself again by saying you want to read the entire file
> while it's frozen and blocking writes, after you just said that's
> not acceptable and why reflink is required....
>

Mmm. I've tried to stick to a simplified description and picked one specific
use case of a large file that was not a good example.
A colleague has corrected me that we are more concerned about the cost of
reflinking many millions of files just to find out that they were not changed,
and even just to back them up.
It ended up being confusing. If I contradicted myself, it is because my
description was often missing "sometimes".

Let's start over:
For small files that can fit in application buffers, freeze is probably enough.
For very large files, your assertion that reflink is cheap compared to reading
the file is correct, so there is probably no justification for lazy reflink.

In the middle, there are files that we can analyse fast enough to determine
which parts of the file have been modified and send only the modified parts.
The send can happen quite a bit later.
If we analysed the frozen file, we wouldn't want the file to change before
sending out the modified parts, hence we would want a reflink that is
consistent with the file that we analysed.

For that case, I think we need freeze+reflink, where reflink is done under
write protection. If I am wrong, show me how.

> > Bottom line: I completely agree with you that "file freeze" is sufficient
> > for the case I presented, as long as reflink is allowed while the file is frozen.
> > IOW, break the existing compound API freeze+reflink+unfreeze into
> > individual operations to give the user more control.
>
> I don't think that's a good idea. If we allow "metadata" to be
> unfrozen, but only freeze data, does that mean we allow modifying
> owner, perms, attributes, etc., but then don't allow truncate? What
> about preallocation over holes? That doesn't change data, and it's
> only a metadata modification. What about background dedupe? That
> sort of thing is a can of worms that I don't want to go anywhere
> near. Either the file is frozen (i.e. effectively immutable but
> blocks modifications rather than EPERMs) or it's a normal, writeable
> file - madness lies within any other boundary...
>

OK, the data/metadata dichotomy may be the wrong one to use here,
because it is not well defined.
We all know that truncate and fallocate need to be serialized with
data modifications, and filesystems already do that, so whatever you
call it, we all know what a data change means.
I agree that data sometimes requires consistency with metadata.

The fact is that the backup application is interested in the file content
and metadata, but NOT in the file's on-disk layout.

generic_remap_file_range_prep() does not require that the source
file is not immutable. Does XFS? I don't know if "immutable"
has ever been defined w.r.t. file layout on disk - has it?
I reckon btrfs re-balancing would not stop at migrating "immutable"
file blocks, would it?

Still madness? Or sparks of sanity?

Thanks,
Amir.

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-25 14:27 [LSF/MM TOPIC] Lazy file reflink Amir Goldstein
  2019-01-28 12:50 ` Jan Kara
@ 2019-01-31 20:02 ` Chris Murphy
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2019-01-31 20:02 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong,
	Christoph Hellwig, Jan Kara

On Fri, Jan 25, 2019 at 7:28 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Hi,
>
> I would like to discuss the concept of lazy file reflink.
> The use case is backup of a very large read-mostly file.
> Backup application would like to read consistent content from the
> file, "atomic read" sort of speak.

If it's even a few thousand such files, let alone millions, whether XFS
or Btrfs, you're talking about a lot of metadata writes (hence I
sorta understand the request for lazy+volatile reflink). But this
quickly becomes a metric f-ton of data; it's in effect a duplicate of
each file's metadata, which includes a list of its extents. So in simple
cases it can stay unwritten, but not in every case can you be sure the
operation fits in memory.

An example from my sysroot:
36.87GiB data extents
1.12GiB filesystem metadata

If I reflink copy that whole file system, it translates into 1.12GiB
metadata read and then 1.12GiB written. If it's a Btrfs snapshot of
the containing subvolumes, it's maybe 128KiB written per snapshot. The
reflink copy is only cheap compared to a full data copy. It's not that
cheap compared to snapshots. It sounds to me like a lazy reflink copy
is no longer lazy if it has to write out to disk because it can't all
fit in memory, or if it ends up evicting something else from memory and
slowing things down.

A Btrfs snapshot is cheaper than an LVM thinp snapshot, which comes
with a need to then mount that snapshot's filesystem in order to do
the backup. But if the file system is big enough that there are long
mount times, chances are you're talking about a lot of data to back up
as well, which means a lot of metadata to read and then write out, unless
you're lucky enough to have gobs of RAM.

So *shrug* I'm not seeing a consistent optimization with lazy reflink.
It'll be faster if we're not talking about a lot of data in the first
place.

> I have based my assumption that reflink of a large file may incur
> lots of metadata updates on my limited knowledge of xfs reflink
> implementation, but perhaps it is not the case for other filesystems?
> (btrfs?) and perhaps the current metadata overhead on reflink of a large
> file is an implementation detail that could be optimized in the future?

The optimum use case is maybe a few hundred big files. With tens of
thousands to millions, I think you start creating a lot of
competition for memory, with the ensuing consequences. Something has to
be evicted. Either the lazy reflink is a lower priority and it
functionally becomes a partial or full reflink by writing out to block
devices; or it's a higher priority and kicks something else out. No
free lunch.


-- 
Chris Murphy

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-30 13:30             ` Amir Goldstein
@ 2019-01-31 20:25               ` Chris Murphy
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2019-01-31 20:25 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: lsf-pc, linux-fsdevel, linux-xfs

On Wed, Jan 30, 2019 at 6:30 AM Amir Goldstein <amir73il@gmail.com> wrote:

> generic_remap_file_range_prep() does not require that the source
> file is not immutable. Does XFS? I don't know if "immutable"
> has ever been defined w.r.t. file layout on disk - has it?
> I reckon btrfs re-balancing would not stop at migrating "immutable"
> file blocks, would it?

chattr +i does not pin the file to a particular physical location on a
device or even on a particular device, at least on Btrfs. A balance
operation, including device replace, or device add+remove, or changing
the raid profile, isn't inhibited - such operations happen at the
block group level, somewhat similar to LVM pvmove.

A directory containing an immutable file cannot be deleted until the
immutable flag is unset; but a subvolume containing an immutable file
is deleted without complaint when using 'btrfs sub del'. I'm pretty
sure that's expected. Further, it looks like 'rmdir' and 'rm -rf' will
now remove a subvolume; but when it contains an immutable
file, 'rm -rf' fails just as if it were a directory, even though 'btrfs
sub del' succeeds.

--
Chris Murphy

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-28 21:26   ` Dave Chinner
  2019-01-28 22:56     ` Amir Goldstein
@ 2019-01-31 21:13     ` Matthew Wilcox
  2019-02-01 13:49       ` Amir Goldstein
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2019-01-31 21:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Amir Goldstein, lsf-pc, linux-fsdevel, linux-xfs,
	Darrick J. Wong, Christoph Hellwig

On Tue, Jan 29, 2019 at 08:26:43AM +1100, Dave Chinner wrote:
> Really, though, for this use case it makes more sense to have "per
> file freeze" semantics. i.e. if you want a consistent backup image
> on snapshot capable storage, the process is usually "freeze
> filesystem, snapshot fs, unfreeze fs, do backup from snapshot,
> remove snapshot". We can already transparently block incoming
> writes/modifications on files via the freeze mechanism, so why not
> just extend that to per-file granularity so writes to the "very
> large read-mostly file" block while it's being backed up....
> 
> Indeed, this would probably only require a simple extension to
> FIFREEZE/FITHAW - the parameter is currently ignored, but as defined
> by XFS it was a "freeze level". Set this to 0xffffffff and then it
> freezes just the fd passed in, not the whole filesystem.
> Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define...

This sounds like you want a lease (aka oplock), which we already have
implemented.
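For reference, a minimal sketch of the existing lease API (a read lease
is broken - with notification to the holder - when another process opens
the file for writing or truncates it):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hold a read lease while backing up; SIGIO (the default lease-break
 * signal) tells us a writer wants in and we have /proc/sys/fs/lease-break-time
 * seconds to finish up and release. */
static int backup_under_lease(const char *path)
{
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fcntl(fd, F_SETLEASE, F_RDLCK) < 0)
                return -1;

        /* ... back up the file; a SIGIO handler can trigger early cleanup ... */

        fcntl(fd, F_SETLEASE, F_UNLCK); /* release the lease */
        close(fd);
        return 0;
}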

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-31 21:13     ` Matthew Wilcox
@ 2019-02-01 13:49       ` Amir Goldstein
  2019-04-27 21:46         ` Amir Goldstein
  0 siblings, 1 reply; 13+ messages in thread
From: Amir Goldstein @ 2019-02-01 13:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jan Kara, lsf-pc, linux-fsdevel, linux-xfs,
	Darrick J. Wong, Christoph Hellwig

On Thu, Jan 31, 2019 at 11:13 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 29, 2019 at 08:26:43AM +1100, Dave Chinner wrote:
> > Really, though, for this use case it makes more sense to have "per
> > file freeze" semantics. i.e. if you want a consistent backup image
> > on snapshot capable storage, the process is usually "freeze
> > filesystem, snapshot fs, unfreeze fs, do backup from snapshot,
> > remove snapshot". We can already transparently block incoming
> > writes/modifications on files via the freeze mechanism, so why not
> > just extend that to per-file granularity so writes to the "very
> > large read-mostly file" block while it's being backed up....
> >
> > Indeed, this would probably only require a simple extension to
> > FIFREEZE/FITHAW - the parameter is currently ignored, but as defined
> > by XFS it was a "freeze level". Set this to 0xffffffff and then it
> > freezes just the fd passed in, not the whole filesystem.
> > Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define...
>
> This sounds like you want a lease (aka oplock), which we already have
> implemented.

Yes, it's possibly true.
I think that it could make sense to skip the reflink optimization for files that
are open for write in our workloads. I'll need to check with my peers.

Thanks,
Amir.

* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-02-01 13:49       ` Amir Goldstein
@ 2019-04-27 21:46         ` Amir Goldstein
  0 siblings, 0 replies; 13+ messages in thread
From: Amir Goldstein @ 2019-04-27 21:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jan Kara, lsf-pc, linux-fsdevel, linux-xfs,
	Darrick J. Wong, Christoph Hellwig, Miklos Szeredi

On Fri, Feb 1, 2019 at 9:49 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Jan 31, 2019 at 11:13 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jan 29, 2019 at 08:26:43AM +1100, Dave Chinner wrote:
> > > Really, though, for this use case it makes more sense to have "per
> > > file freeze" semantics. i.e. if you want a consistent backup image
> > > on snapshot capable storage, the process is usually "freeze
> > > filesystem, snapshot fs, unfreeze fs, do backup from snapshot,
> > > remove snapshot". We can already transparently block incoming
> > > writes/modifications on files via the freeze mechanism, so why not
> > > just extend that to per-file granularity so writes to the "very
> > > large read-mostly file" block while it's being backed up....
> > >
> > > Indeed, this would probably only require a simple extension to
> > > FIFREEZE/FITHAW - the parameter is currently ignored, but as defined
> > > by XFS it was a "freeze level". Set this to 0xffffffff and then it
> > > freezes just the fd passed in, not the whole filesystem.
> > > Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define...
> >
> > This sounds like you want a lease (aka oplock), which we already have
> > implemented.
>
> Yes, it's possibly true.
> I think that it could make sense to skip the reflink optimization for files that
> are open for write in our workloads. I'll need to check with my peers.
>

Getting back to this.
Since the topic got a slot in the LSF agenda, here are my talking points.

First of all, I would like to rewrite the subject. "Lazy clone" was a specific
use case I had, and the discussion mostly revolved around the viability
of that use case, but I have other use cases.

The core topic would perhaps be better described as "file pre-modification
callback".
We already have several of those - fsnotify, leases/oplocks - but they are
inadequate for some use cases, namely when the file is already open for
write and has writable maps.

One use case I have is taking a VFS-level snapshot when there are open
files with writable maps.
Another similar use case is a filesystem change journal, which I presented
last year: https://lwn.net/Articles/755277/

Another use case presented by Miklos is cache coherency between
guest and host in virtio-fs.

I envision something like an fsnotify pre-modification one-shot permission
event that is emitted only once when inode data is dirtied, after flushing
the file's dirty data.
Depending on the use case, it may need to be combined with a file
freeze/thaw API, or simply emit the event immediately after flushing
dirty data if the inode is dirty.

For the cache coherency use case, that would mean that the client's
(i.e. guest's) cache is valid for as long as the host inode remains non-dirty.
I am not sure if this is sufficient to meet virtio-fs requirements, but
I think this is pretty similar to the way that network filesystems'
client-server cache coherency works, but with finer
granularity (break the oplock/lease on dirtying instead of on open).

I would like to discuss possible ways to implement this API
and hear other people's concerns and other possible use cases.

Thanks,
Amir.

Thread overview: 13 messages
2019-01-25 14:27 [LSF/MM TOPIC] Lazy file reflink Amir Goldstein
2019-01-28 12:50 ` Jan Kara
2019-01-28 21:26   ` Dave Chinner
2019-01-28 22:56     ` Amir Goldstein
2019-01-29  0:18       ` Dave Chinner
2019-01-29  7:18         ` Amir Goldstein
2019-01-29 23:01           ` Dave Chinner
2019-01-30 13:30             ` Amir Goldstein
2019-01-31 20:25               ` Chris Murphy
2019-01-31 21:13     ` Matthew Wilcox
2019-02-01 13:49       ` Amir Goldstein
2019-04-27 21:46         ` Amir Goldstein
2019-01-31 20:02 ` Chris Murphy
