* [LSF/MM TOPIC] Lazy file reflink @ 2019-01-25 14:27 Amir Goldstein 2019-01-28 12:50 ` Jan Kara 2019-01-31 20:02 ` Chris Murphy 0 siblings, 2 replies; 13+ messages in thread From: Amir Goldstein @ 2019-01-25 14:27 UTC (permalink / raw) To: lsf-pc Cc: linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig, Jan Kara Hi, I would like to discuss the concept of lazy file reflink. The use case is backup of a very large read-mostly file. The backup application would like to read consistent content from the file, an "atomic read" so to speak. With a filesystem that supports reflink, that can be done by: - Create O_TMPFILE - Reflink origin to temp file - Backup from temp file However, since the origin file is very likely not to be modified, the reflink step, which may incur lots of metadata updates, is a waste. Instead, if the filesystem could be notified that atomic content was requested (O_ATOMIC|O_RDONLY or O_CLONE|O_RDONLY), the filesystem could defer the reflink to an O_TMPFILE until the origin file is opened for write or actually modified. What I just described above is actually already implemented with Overlayfs snapshots [1], but for many applications overlayfs snapshots are not a practical solution. I have based my assumption that reflink of a large file may incur lots of metadata updates on my limited knowledge of the xfs reflink implementation, but perhaps it is not the case for other filesystems? (btrfs?) And perhaps the current metadata overhead on reflink of a large file is an implementation detail that could be optimized in the future? The crux of the matter is that there is no API to make an explicit request for a "volatile reflink" that does not need to survive power failure, and that limits the ability of filesystems to optimize this case. I realize the "atomic read" requirement is somewhat adjacent to the "atomic write" [2] requirement, if only by name, but I am not sure how much they really share in common? 
A somewhat different approach to the problem is for the application to use fanotify to register for a pre-modify callback and implement the lazy reflink by itself. This could work, but it would require extending the semantics of fanotify, and the application currently needs CAP_SYS_ADMIN, because it can block access to the file indefinitely. Would love to get some feedback about the concept from filesystem developers. Thanks, Amir. [1] https://lwn.net/Articles/719772/ [2] https://lwn.net/Articles/715918/ ^ permalink raw reply [flat|nested] 13+ messages in thread
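The create/reflink/backup sequence described above can be sketched as follows (an illustrative sketch, not code from the thread; FICLONE is the reflink ioctl from linux/fs.h, and the fallback to a plain data copy for filesystems without reflink support is my own addition):

```python
import errno
import fcntl
import shutil
import tempfile

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from linux/fs.h

def point_in_time_copy(src_path, tmp_dir="."):
    """Return an unnamed temp file holding a stable image of src_path.

    Preferred path: reflink (FICLONE) into the temp file, sharing
    extents instead of copying data, so the backup can read a
    consistent image even if the origin is modified afterwards.
    Fallback: a plain data copy where reflink is unsupported.
    """
    tmp = tempfile.TemporaryFile(dir=tmp_dir)  # uses O_TMPFILE where available
    with open(src_path, "rb") as src:
        try:
            fcntl.ioctl(tmp.fileno(), FICLONE, src.fileno())
        except OSError as e:
            if e.errno not in (errno.EOPNOTSUPP, errno.EINVAL,
                               errno.ENOTTY, errno.EXDEV):
                raise
            shutil.copyfileobj(src, tmp)  # filesystem without reflink
    tmp.seek(0)
    return tmp
```

The backup then reads from the returned file object; on a reflink-capable filesystem the clone costs O(extents) of metadata work rather than an O(bytes) data copy, which is exactly the cost being discussed.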
* Re: [LSF/MM TOPIC] Lazy file reflink 2019-01-25 14:27 [LSF/MM TOPIC] Lazy file reflink Amir Goldstein @ 2019-01-28 12:50 ` Jan Kara 2019-01-28 21:26 ` Dave Chinner 2019-01-31 20:02 ` Chris Murphy 1 sibling, 1 reply; 13+ messages in thread From: Jan Kara @ 2019-01-28 12:50 UTC (permalink / raw) To: Amir Goldstein Cc: lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig, Jan Kara Hi, On Fri 25-01-19 16:27:52, Amir Goldstein wrote: > I would like to discuss the concept of lazy file reflink. > The use case is backup of a very large read-mostly file. > Backup application would like to read consistent content from the > file, "atomic read" sort of speak. > > With filesystem that supports reflink, that can be done by: > - Create O_TMPFILE > - Reflink origin to temp file > - Backup from temp file > > However, since the origin file is very likely not to be modified, > the reflink step, that may incur lots of metadata updates, is a waste. > Instead, if filesystem could be notified that atomic content was > requested (O_ATOMIC|O_RDONLY or O_CLONE|O_RDONLY), > filesystem could defer reflink to an O_TMPFILE until origin file is > open for write or actually modified. > > What I just described above is actually already implemented with > Overlayfs snapshots [1], but for many applications overlayfs snapshots > it is not a practical solution. > > I have based my assumption that reflink of a large file may incur > lots of metadata updates on my limited knowledge of xfs reflink > implementation, but perhaps it is not the case for other filesystems? > (btrfs?) and perhaps the current metadata overhead on reflink of a large > file is an implementation detail that could be optimized in the future? > > The point of the matter is that there is no API to make an explicit > request for a "volatile reflink" that does not need to survive power > failure and that limits the ability of filesytems to optimize this case. 
Well, to me this seems like a relatively rare use case (and a small performance gain) for the complexity. Also, the speed of reflink is fs dependent - e.g. for btrfs it is rather cheap AFAIK. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [LSF/MM TOPIC] Lazy file reflink 2019-01-28 12:50 ` Jan Kara @ 2019-01-28 21:26 ` Dave Chinner 2019-01-28 22:56 ` Amir Goldstein 2019-01-31 21:13 ` Matthew Wilcox 0 siblings, 2 replies; 13+ messages in thread From: Dave Chinner @ 2019-01-28 21:26 UTC (permalink / raw) To: Jan Kara Cc: Amir Goldstein, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig On Mon, Jan 28, 2019 at 01:50:44PM +0100, Jan Kara wrote: > Hi, > > On Fri 25-01-19 16:27:52, Amir Goldstein wrote: > > I would like to discuss the concept of lazy file reflink. > > The use case is backup of a very large read-mostly file. > > Backup application would like to read consistent content from the > > file, "atomic read" sort of speak. > > > > With filesystem that supports reflink, that can be done by: > > - Create O_TMPFILE > > - Reflink origin to temp file > > - Backup from temp file > > > > However, since the origin file is very likely not to be modified, > > the reflink step, that may incur lots of metadata updates, is a waste. > > Instead, if filesystem could be notified that atomic content was > > requested (O_ATOMIC|O_RDONLY or O_CLONE|O_RDONLY), > > filesystem could defer reflink to an O_TMPFILE until origin file is > > open for write or actually modified. That makes me want to run screaming for the hills. > > What I just described above is actually already implemented with > > Overlayfs snapshots [1], but for many applications overlayfs snapshots > > it is not a practical solution. > > > > I have based my assumption that reflink of a large file may incur > > lots of metadata updates on my limited knowledge of xfs reflink > > implementation, but perhaps it is not the case for other filesystems? Comparatively speaking: compared to copying a large file, reflink is cheap on any filesystem that implements it. 
Sure, reflinking on XFS is CPU limited, IIRC, to ~10-20,000 extents per second per reflink op per AG, but it's still faster than copying 10-20,000 extents per second per copy op on all but the very fastest, unloaded nvme SSDs... > > (btrfs?) and perhaps the current metadata overhead on reflink of a large > > file is an implementation detail that could be optimized in the future? > > > > The point of the matter is that there is no API to make an explicit > > request for a "volatile reflink" that does not need to survive power > > failure and that limits the ability of filesytems to optimize this case. > > Well, to me this seems like a relatively rare usecase (and performance > gain) for the complexity. Also the speed of reflink is fs dependent - e.g. > for btrfs it is rather cheap AFAIK. I suspect for "very large read-mostly file" it's still an expensive operation on btrfs. Really, though, for this use case it makes more sense to have "per file freeze" semantics. i.e. if you want a consistent backup image on snapshot capable storage, the process is usually "freeze filesystem, snapshot fs, unfreeze fs, do backup from snapshot, remove snapshot". We can already transparently block incoming writes/modifications on files via the freeze mechanism, so why not just extend that to per-file granularity so writes to the "very large read-mostly file" block while it's being backed up.... Indeed, this would probably only require a simple extension to FIFREEZE/FITHAW - the parameter is currently ignored, but as defined by XFS it was a "freeze level". Set this to 0xffffffff and then it freezes just the fd passed in, not the whole filesystem. Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
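For reference, FIFREEZE/FITHAW are plain _IOWR ioctls taking an int argument that the kernel currently ignores; the sketch below (mine, not from the thread) reproduces their encoding, with the proposed per-file freeze level shown as a hypothetical constant:

```python
import struct

# Reimplementation of the <asm-generic/ioctl.h> _IOWR macro.
_IOC_NRBITS, _IOC_TYPEBITS, _IOC_SIZEBITS = 8, 8, 14
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = _IOC_NRSHIFT + _IOC_NRBITS
_IOC_SIZESHIFT = _IOC_TYPESHIFT + _IOC_TYPEBITS
_IOC_DIRSHIFT = _IOC_SIZESHIFT + _IOC_SIZEBITS
_IOC_READ, _IOC_WRITE = 2, 1

def _IOWR(type_chr, nr, size):
    return ((_IOC_READ | _IOC_WRITE) << _IOC_DIRSHIFT
            | ord(type_chr) << _IOC_TYPESHIFT
            | size << _IOC_SIZESHIFT
            | nr << _IOC_NRSHIFT)

# From linux/fs.h: freeze/thaw take an (ignored) int "freeze level".
FIFREEZE = _IOWR('X', 119, struct.calcsize('i'))
FITHAW   = _IOWR('X', 120, struct.calcsize('i'))

# Hypothetical per-file freeze level from the proposal above.
FREEZE_LEVEL_FILE = 0xffffffff
```

Today `fcntl.ioctl(fd, FIFREEZE, struct.pack('i', level))` freezes the whole filesystem behind fd (and requires CAP_SYS_ADMIN); under the proposal, a level of 0xffffffff would freeze only the file the fd refers to.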
* Re: [LSF/MM TOPIC] Lazy file reflink 2019-01-28 21:26 ` Dave Chinner @ 2019-01-28 22:56 ` Amir Goldstein 2019-01-29 0:18 ` Dave Chinner 2019-01-31 21:13 ` Matthew Wilcox 1 sibling, 1 reply; 13+ messages in thread From: Amir Goldstein @ 2019-01-28 22:56 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig > > > What I just described above is actually already implemented with > > > Overlayfs snapshots [1], but for many applications overlayfs snapshots > > > it is not a practical solution. > > > > > > I have based my assumption that reflink of a large file may incur > > > lots of metadata updates on my limited knowledge of xfs reflink > > > implementation, but perhaps it is not the case for other filesystems? > > Comparitively speaking: compared to copying a large file, reflink is > cheap on any filesystem that implements it. Sure, reflinking on XFS > is CPU limited, IIRC, to ~10-20,000 extents per second per reflink > op per AG, but it's still faster than copying 10-20,000 extents > per second per copy op on all but the very fastest, unloaded nvme > SSDs... > I think the concern is the added metadata load on the rest of the users. The backup app doesn't care about the time it takes to clone before backup. But this concern is not based on actual numbers. > Really, though, for this use case it's make more sense to have "per > file freeze" semantics. i.e. if you want a consistent backup image > on snapshot capable storage, the process is usually "freeze > filesystem, snapshot fs, unfreeze fs, do backup from snapshot, > remove snapshot". We can already transparently block incoming > writes/modifications on files via the freeze mechanism, so why not > just extend that to per-file granularity so writes to the "very > large read-mostly file" block while it's being backed up.... 
> > Indeed, this would probably only require a simple extension to > FIFREEZE/FITHAW - the parameter is currently ignored, but as defined > by XFS it was a "freeze level". Set this to 0xffffffff and then it > freezes just the fd passed in, not the whole filesystem. > Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define... > I think it's a good idea to add file freeze semantics to the toolbox of useful things that could be accomplished with reflink. Especially with your plans for subvolumes as files. How is that coming along, by the way? Anyway, freeze semantics alone won't work for our backup application that needs to be non-intrusive. Even if writes to the large file are few, backup may take time, so blocking those few writes for that long is not acceptable. Blocking the writes for the setup time of a reflink is exactly what I was proposing, and in your analogy, the block device is frozen only for a short period of time for setting up the snapshot and not for the duration of the backup. Thanks, Amir. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [LSF/MM TOPIC] Lazy file reflink 2019-01-28 22:56 ` Amir Goldstein @ 2019-01-29 0:18 ` Dave Chinner 2019-01-29 7:18 ` Amir Goldstein 0 siblings, 1 reply; 13+ messages in thread From: Dave Chinner @ 2019-01-29 0:18 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig On Tue, Jan 29, 2019 at 12:56:17AM +0200, Amir Goldstein wrote: > > > > What I just described above is actually already implemented with > > > > Overlayfs snapshots [1], but for many applications overlayfs snapshots > > > > it is not a practical solution. > > > > > > > > I have based my assumption that reflink of a large file may incur > > > > lots of metadata updates on my limited knowledge of xfs reflink > > > > implementation, but perhaps it is not the case for other filesystems? > > > > Comparitively speaking: compared to copying a large file, reflink is > > cheap on any filesystem that implements it. Sure, reflinking on XFS > > is CPU limited, IIRC, to ~10-20,000 extents per second per reflink > > op per AG, but it's still faster than copying 10-20,000 extents > > per second per copy op on all but the very fastest, unloaded nvme > > SSDs... > > > > I think the concern is the added metadata load on the rest of the > users. Backup app doesn't care about the time it consumes to clone > before backup. But this concern is not based on actual numbers. So what is it based on? > > Really, though, for this use case it's make more sense to have "per > > file freeze" semantics. i.e. if you want a consistent backup image > > on snapshot capable storage, the process is usually "freeze > > filesystem, snapshot fs, unfreeze fs, do backup from snapshot, > > remove snapshot". We can already transparently block incoming > > writes/modifications on files via the freeze mechanism, so why not > > just extend that to per-file granularity so writes to the "very > > large read-mostly file" block while it's being backed up.... 
> > > > Indeed, this would probably only require a simple extension to > > FIFREEZE/FITHAW - the parameter is currently ignored, but as defined > > by XFS it was a "freeze level". Set this to 0xffffffff and then it > > freezes just the fd passed in, not the whole filesystem. > > Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define... > > > > I think it's a good idea to add file freeze semantics to the toolbox > of useful things that could be accomplished with reflink. reflink is already atomic w.r.t. other writes - in what way does a "file freeze" have any impact on a reflink operation? that is, apart from preventing it from being done, because reflink can modify the source inode on XFS, too.... > Especially with your plans for subvolumes as files > How is that coming along by the way?. If I didn't have to spend so much time fire-fighting broken stuff, I might make more progress. > Anyway, freeze semantics alone won't work for our backup application > that needs to be non intrusive. Even if writes to large file are few, > backup may take time, so blocking those few write for that long is > not acceptable. So, reflink is too expensive because there are only occasional writes, but blocking that occasional write is too expensive, too, even though it is rare? > Blocking the writes for the setup time of a reflink > is exactly what I was proposing and in your analogy, No, I proposed a way to provide a -point in time snapshot- of a file that doesn't require reflink or any other special filesystem support. > the block > device is frozen only for a short period of time for setting up the > snapshot and not for the duration of the backup. Right, it's frozen for as long as it takes to set up a -point in time snapshot- that the backup can be taken from. You don't need that to reflink a file. You need it if you want to do something other than a reflink.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [LSF/MM TOPIC] Lazy file reflink 2019-01-29 0:18 ` Dave Chinner @ 2019-01-29 7:18 ` Amir Goldstein 2019-01-29 23:01 ` Dave Chinner 0 siblings, 1 reply; 13+ messages in thread From: Amir Goldstein @ 2019-01-29 7:18 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig > > I think it's a good idea to add file freeze semantics to the toolbox > > of useful things that could be accomplished with reflink. > > reflink is already atomic w.r.t. other writes - in what way does a > "file freeze" have any impact on a reflink operation? that is, apart > from preventing it from being done, because reflink can modify the > source inode on XFS, too.... > - create O_TMPFILE - freeze source file - read and calculate hash from source file - likely unfreeze and skip reflink+backup For the unlikely case, the application could copy_file_range before unfreeze, and that means that reflink of the source should be allowed while the file is frozen, that is, while the file *data* is frozen. That means that the file freeze API needs to be able to express whether both metadata and data freeze are required. > > Anyway, freeze semantics alone won't work for our backup application > > that needs to be non intrusive. Even if writes to large file are few, > > backup may take time, so blocking those few write for that long is > > not acceptable. > > So, reflink is too expensive because there are only occasional > writes, but blocking that occasional write is too expensive, too, > even though it is rare? > All right. I admit to having presented a weak example, but I am not submitting a patch to be merged. I am proposing a discussion on what I think is a gap in the existing API. The feedback of "what are the measurable benefits?" is expected, but I brought this up anyway, without concrete measurable figures, to hear what others have to say. And frankly, I quite like the file freeze suggestion, so I am glad that I did. 
Besides, even if existing filesystems implement reflink fast "enough", this is not at all mandated by the API. > > Blocking the writes for the setup time of a reflink > > is exactly what I was proposing and in your analogy, > > No, I proposed a way to provide a -point in time snapshot- of a > file that doesn't require reflink or any other special filesystem > support. > > > the block > > device is frozen only for a short period of time for setting up the > > snapshot and not for the duration of the backup. > > Right, it's frozen for as long as it takes to set up a -point in > time snapshot- that the backup can be taken from. You don't need > that to reflink a file. You need it if you want to do something > other than a reflink.... > Correct. As I wrote above, that could be used for conditional copy or conditional reflink on a filesystem where reflink has a measurable cost. Bottom line: I completely agree with you that "file freeze" is sufficient for the case I presented, as long as reflink is allowed while the file is frozen. IOW, break the existing compound API freeze+reflink+unfreeze into individual operations to give more control over to the user. Thanks, Amir. ^ permalink raw reply [flat|nested] 13+ messages in thread
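The conditional flow Amir describes - freeze, hash, and only clone or copy when the content actually changed - could be sketched roughly like this (illustrative only; `freeze_file`/`thaw_file` stand in for the proposed, still hypothetical per-file freeze API and default to no-ops):

```python
import hashlib

def file_digest(path, chunk=1 << 20):
    """Hash file contents incrementally so huge files need no big buffer."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def maybe_backup(path, last_digest, freeze_file=lambda p: None,
                 thaw_file=lambda p: None, do_backup=lambda p: None):
    """Freeze, hash, and only reflink/back up when content changed.

    freeze_file/thaw_file are placeholders for the hypothetical
    per-file freeze API discussed in the thread; by default they
    do nothing.  Returns (digest, backed_up).
    """
    freeze_file(path)
    try:
        digest = file_digest(path)
        if digest == last_digest:
            return digest, False    # unchanged: skip reflink+backup
        do_backup(path)             # e.g. reflink while still frozen
        return digest, True
    finally:
        thaw_file(path)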
* Re: [LSF/MM TOPIC] Lazy file reflink 2019-01-29 7:18 ` Amir Goldstein @ 2019-01-29 23:01 ` Dave Chinner 2019-01-30 13:30 ` Amir Goldstein 0 siblings, 1 reply; 13+ messages in thread From: Dave Chinner @ 2019-01-29 23:01 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig On Tue, Jan 29, 2019 at 09:18:57AM +0200, Amir Goldstein wrote: > > > I think it's a good idea to add file freeze semantics to the toolbox > > > of useful things that could be accomplished with reflink. > > > > reflink is already atomic w.r.t. other writes - in what way does a > > "file freeze" have any impact on a reflink operation? that is, apart > > from preventing it from being done, because reflink can modify the > > source inode on XFS, too.... > > > > - create O_TMPFILE > - freeze source file > - read and calculate hash from source file > - likely unfreeze and skip reflink+backup IF you can read the file to determine if you need to back up the file while it's frozen, then why do you need to reflink it before you read the file to back it up? It's the /exact same read operation/ on the file. I'm not sure what the hell you actually want now, because you're contradicting yourself again by saying you want to read the entire file while it's frozen and blocking writes, after you just said that's not acceptable and why reflink is required.... > > > Anyway, freeze semantics alone won't work for our backup application > > > that needs to be non intrusive. Even if writes to large file are few, > > > backup may take time, so blocking those few write for that long is > > > not acceptable. > > > > So, reflink is too expensive because there are only occasional > > writes, but blocking that occasional write is too expensive, too, > > even though it is rare? > > > > All right. I admit to have presented a weak example, but I am > not submitting a patch to be merged. I am proposing a discussion > on what I think is a gap in existing API. 
The feedback of "what is > the measurable benefits?" is well expected, but I brought this up > anyway, without concrete measurable figures to hear what others > have to say. And frankly, I quite like the file freeze suggestion, so > I am glad that I did. > > Besides, even if existing filesystems implement reflink fast "enough", > this is not at all an mandated by the API. All the API mandates is that it is atomic w.r.t. other IO and that's all that really matters. Everything else is optional, including performance. Correct functionality first, performance second. > Bottom line: I completely agree with you that "file freeze" is sufficient > for the case I presented, as long as reflink is allowed while file is frozen. > IOW, break the existing compound API freeze+reflink+unfreeze to > individual operations to give more control over to user. I don't think that's a good idea. If we allow "metadata" to be unfrozen, but only freeze data, does that mean we allow modifying owner, perms, attributes, etc, but then don't allow truncate? What about preallocation over holes? That doesn't change data, and it's only a metadata modification. What about background dedupe? That sort of thing is a can of worms that I don't want to go anywhere near. Either the file is frozen (i.e. effectively immutable but blocks modifications rather than EPERMs) or it's a normal, writeable file - madness lies within any other boundary... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [LSF/MM TOPIC] Lazy file reflink 2019-01-29 23:01 ` Dave Chinner @ 2019-01-30 13:30 ` Amir Goldstein 2019-01-31 20:25 ` Chris Murphy 0 siblings, 1 reply; 13+ messages in thread From: Amir Goldstein @ 2019-01-30 13:30 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig On Wed, Jan 30, 2019 at 1:01 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Jan 29, 2019 at 09:18:57AM +0200, Amir Goldstein wrote: > > > > I think it's a good idea to add file freeze semantics to the toolbox > > > > of useful things that could be accomplished with reflink. > > > > > > reflink is already atomic w.r.t. other writes - in what way does a > > > "file freeze" have any impact on a reflink operation? that is, apart > > > from preventing it from being done, because reflink can modify the > > > source inode on XFS, too.... > > > > > > > - create O_TMPFILE > > - freeze source file > > - read and calculate hash from source file > > - likely unfreeze and skip reflink+backup > > IF you can read the file to determine if you need to back up the > file while it's frozen, then why do you need to reflink it > before you read the file to back it up? It's the /exact same > read operation/ on the file. > > I'm not sure what the hell you actually want now, because you've > contradicting yourself again by saying you want read the entire file > while it's frozen and blocking writes, after you just said that's > not acceptible and why reflink is required.... > Mmm. I've tried to stick to a simplified description and picked one specific use case of a large file that was not a good example. A colleague has corrected me that we are more concerned over the cost of reflinking many millions of files just to find out that they were not changed, and even just to back them up. Ended up being confusing. If I contradicted myself, it is because my description was often missing "sometimes". 
Let's start over: For small files that can fit in application buffers, freeze is probably enough. For very large files, your assertion that reflink is cheap compared to reading the file is correct, so there is probably no justification for lazy reflink. In the middle, there are files that we can analyse fast enough to determine which parts of the file have been modified and send the modified parts. The send can happen much later. If we analysed the frozen file, we wouldn't want the file to change before sending out the modified parts, hence we would want a reflink that is consistent with the file that we analysed. For that case, I think we need freeze+reflink, where reflink is done under write protection. If I am wrong, show me how. > > Bottom line: I completely agree with you that "file freeze" is sufficient > > for the case I presented, as long as reflink is allowed while file is frozen. > > IOW, break the existing compound API freeze+reflink+unfreeze to > > individual operations to give more control over to user. > > I don't think that's a good idea. If we allow "metadata" to be > unfrozen, but only freeze data, does that mean we allow modifying > owner, perms, attributes, etc, but then don't allow truncate. What > about preallocation over holes? That doesn't change data, and it's > only a metadata modification. What about background dedupe? That > sort of thing is a can of worms that I don't want to go anywhere > near. Either the file is frozen (i.e. effectively immutable but > blocks modifications rather than EPERMs) or it's a normal, writeable > file - madness lies within any other boundary... > OK, the data/metadata dichotomy may be wrong to use here, because it is not well defined. We all know that truncate and fallocate need to be serialized with data modifications and filesystems already do that, so whatever you call it, we all know what "data changes" means. I agree that data sometimes requires consistency with metadata. 
The fact is that the backup application is interested in the file content and metadata, but NOT the file's disk layout. generic_remap_file_range_prep() does not require that the source file is not immutable. Does XFS? I don't know if "immutable" has ever been defined w.r.t. file layout on disk. Has it? I reckon btrfs re-balancing would not stop at migrating "immutable" file blocks, would it? Still madness? Or sparks of sanity? Thanks, Amir. ^ permalink raw reply [flat|nested] 13+ messages in thread
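The middle case above - analysing which parts of a file changed so only those are sent - could build on a per-block hash manifest along these lines (an illustrative sketch, not an API from the thread):

```python
import hashlib

BLOCK = 4096

def block_manifest(path):
    """Map block index -> content hash, for later comparison."""
    manifest = {}
    with open(path, "rb") as f:
        idx = 0
        while block := f.read(BLOCK):
            manifest[idx] = hashlib.sha256(block).digest()
            idx += 1
    return manifest

def changed_ranges(old, new):
    """Block indexes whose hash differs, or that appeared/vanished."""
    common = set(old) & set(new)
    return sorted((set(old) ^ set(new))
                  | {i for i in common if old[i] != new[i]})
```

Under the freeze+reflink scheme, the manifest would be computed from the frozen file, and taking the reflink before thawing guarantees that the blocks sent later match what was analysed.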
* Re: [LSF/MM TOPIC] Lazy file reflink 2019-01-30 13:30 ` Amir Goldstein @ 2019-01-31 20:25 ` Chris Murphy 0 siblings, 0 replies; 13+ messages in thread From: Chris Murphy @ 2019-01-31 20:25 UTC (permalink / raw) To: Amir Goldstein; +Cc: lsf-pc, linux-fsdevel, linux-xfs On Wed, Jan 30, 2019 at 6:30 AM Amir Goldstein <amir73il@gmail.com> wrote: > generic_remap_file_range_prep() does not require that source > file is not immutable. Does XFS? I don't know if "immutable" > has ever been defined w.r.t file layout on disk. has it? > I recon btrfs re-balancing would not stop at migrating "immutable" > file blocks would it? chattr +i does not pin the file to a particular physical location on a device or even on a particular device, at least on Btrfs. A balance operation, including device replace, or device add+remove, or changing the raid profile, isn't inhibited - such operations happen at the block group level, somewhat similar to LVM pvmove. A directory containing an immutable file cannot be deleted until immutable flag is unset; but a subvolume containing an immutable file is deleted without complaint when using 'btrfs sub del'. I'm pretty sure that's expected. Further, looks like now 'rmdir' and 'rm -rf' will remove a subvolume; but in the case of it containing an immutable file, 'rm -rf' fails just as if it were a directory even though 'btrfs sub del' succeeds. -- Chris Murphy ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [LSF/MM TOPIC] Lazy file reflink 2019-01-28 21:26 ` Dave Chinner 2019-01-28 22:56 ` Amir Goldstein @ 2019-01-31 21:13 ` Matthew Wilcox 2019-02-01 13:49 ` Amir Goldstein 1 sibling, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2019-01-31 21:13 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Amir Goldstein, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig On Tue, Jan 29, 2019 at 08:26:43AM +1100, Dave Chinner wrote: > Really, though, for this use case it's make more sense to have "per > file freeze" semantics. i.e. if you want a consistent backup image > on snapshot capable storage, the process is usually "freeze > filesystem, snapshot fs, unfreeze fs, do backup from snapshot, > remove snapshot". We can already transparently block incoming > writes/modifications on files via the freeze mechanism, so why not > just extend that to per-file granularity so writes to the "very > large read-mostly file" block while it's being backed up.... > > Indeed, this would probably only require a simple extension to > FIFREEZE/FITHAW - the parameter is currently ignored, but as defined > by XFS it was a "freeze level". Set this to 0xffffffff and then it > freezes just the fd passed in, not the whole filesystem. > Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define... This sounds like you want a lease (aka oplock), which we already have implemented. ^ permalink raw reply [flat|nested] 13+ messages in thread
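A read lease is taken with fcntl(); a minimal sketch (mine, not from the thread) of using one to get a window for consistent reads:

```python
import fcntl
import os

def take_read_lease(path):
    """Open path read-only and take an F_RDLCK lease on it.

    While the lease is held, an open() for write by another process
    is delayed (up to lease-break-time) and the lease holder gets
    SIGIO, giving it a window to finish reading a consistent image.
    """
    fd = os.open(path, os.O_RDONLY)
    fcntl.fcntl(fd, fcntl.F_SETLEASE, fcntl.F_RDLCK)
    return fd

def drop_lease(fd):
    fcntl.fcntl(fd, fcntl.F_SETLEASE, fcntl.F_UNLCK)
    os.close(fd)
```

A real lease holder also installs a SIGIO handler and must release the lease within /proc/sys/fs/lease-break-time seconds once a writer shows up - which is where this differs from a freeze: the writer is delayed, not blocked indefinitely.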
* Re: [LSF/MM TOPIC] Lazy file reflink 2019-01-31 21:13 ` Matthew Wilcox @ 2019-02-01 13:49 ` Amir Goldstein 2019-04-27 21:46 ` Amir Goldstein 0 siblings, 1 reply; 13+ messages in thread From: Amir Goldstein @ 2019-02-01 13:49 UTC (permalink / raw) To: Matthew Wilcox Cc: Dave Chinner, Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig On Thu, Jan 31, 2019 at 11:13 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, Jan 29, 2019 at 08:26:43AM +1100, Dave Chinner wrote: > > Really, though, for this use case it's make more sense to have "per > > file freeze" semantics. i.e. if you want a consistent backup image > > on snapshot capable storage, the process is usually "freeze > > filesystem, snapshot fs, unfreeze fs, do backup from snapshot, > > remove snapshot". We can already transparently block incoming > > writes/modifications on files via the freeze mechanism, so why not > > just extend that to per-file granularity so writes to the "very > > large read-mostly file" block while it's being backed up.... > > > > Indeed, this would probably only require a simple extension to > > FIFREEZE/FITHAW - the parameter is currently ignored, but as defined > > by XFS it was a "freeze level". Set this to 0xffffffff and then it > > freezes just the fd passed in, not the whole filesystem. > > Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define... > > This sounds like you want a lease (aka oplock), which we already have > implemented. Yes, it's possibly true. I think that it could make sense to skip the reflink optimization for files that are open for write in our workloads. I'll need to check with my peers. Thanks, Amir. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [LSF/MM TOPIC] Lazy file reflink 2019-02-01 13:49 ` Amir Goldstein @ 2019-04-27 21:46 ` Amir Goldstein 0 siblings, 0 replies; 13+ messages in thread From: Amir Goldstein @ 2019-04-27 21:46 UTC (permalink / raw) To: Matthew Wilcox Cc: Dave Chinner, Jan Kara, lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong, Christoph Hellwig, Miklos Szeredi On Fri, Feb 1, 2019 at 9:49 AM Amir Goldstein <amir73il@gmail.com> wrote: > > On Thu, Jan 31, 2019 at 11:13 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Tue, Jan 29, 2019 at 08:26:43AM +1100, Dave Chinner wrote: > > > Really, though, for this use case it's make more sense to have "per > > > file freeze" semantics. i.e. if you want a consistent backup image > > > on snapshot capable storage, the process is usually "freeze > > > filesystem, snapshot fs, unfreeze fs, do backup from snapshot, > > > remove snapshot". We can already transparently block incoming > > > writes/modifications on files via the freeze mechanism, so why not > > > just extend that to per-file granularity so writes to the "very > > > large read-mostly file" block while it's being backed up.... > > > > > > Indeed, this would probably only require a simple extension to > > > FIFREEZE/FITHAW - the parameter is currently ignored, but as defined > > > by XFS it was a "freeze level". Set this to 0xffffffff and then it > > > freezes just the fd passed in, not the whole filesystem. > > > Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define... > > > > This sounds like you want a lease (aka oplock), which we already have > > implemented. > > Yes, its possibly true. > I think that it could make sense to skip the reflink optimization for files that > are open for write in our workloads. I'll need to check with my peers. > Getting back to this. Since the topic got a slot in the LSF agenda, here are my talking points. First of all, I would like to rewrite the subject. 
"lazy clone" was a specific use case I had and the discussion mostly revolved around the viability of this use case, but I have other use cases. The core topic perhaps would be better described as "file pre-modification callback". We already have several of those: fsnotify, leases/oplocks, but they are inadequate for some use cases, namely when the file is already open for write and have writable maps. One use case I have is taking a VFS level snapshot when there are open files with writable maps. Another similar use case is filesystem change journal, which I presented last year: https://lwn.net/Articles/755277/ Another use case presented by Miklos is cache coherency between guest and host in virtio-fs. I envision something like fsnotify pre modification one shot permission event that is emitted only once when inode data is dirtied after flushing file's dirty data. Depending on the use case, it may need to be combined with a file freeze/thaw API or simply emit the event immediately after flushing dirty data if inode is dirty. For the cache coherency use case, that would mean that client (i.e. guest) is valid for as long as host inode remains non-dirty. Not sure if this is sufficient to meet virtio-fs requirements, but I think this is pretty much similar to the way that networking filesystems client-server cache coherency works, but with finer granularity (break oplock/lease on dirtying instead of on open). I would like to discuss possible ways to implement this API and hear other people's concerns and other possible use cases. Thanks, Amir. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [LSF/MM TOPIC] Lazy file reflink
  2019-01-25 14:27 [LSF/MM TOPIC] Lazy file reflink Amir Goldstein
  2019-01-28 12:50 ` Jan Kara
@ 2019-01-31 20:02 ` Chris Murphy
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2019-01-31 20:02 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: lsf-pc, linux-fsdevel, linux-xfs, Darrick J. Wong,
      Christoph Hellwig, Jan Kara

On Fri, Jan 25, 2019 at 7:28 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Hi,
>
> I would like to discuss the concept of lazy file reflink.
> The use case is backup of a very large read-mostly file.
> Backup application would like to read consistent content from the
> file, an "atomic read" so to speak.

If it's even a few thousand such files, let alone millions, whether XFS
or Btrfs, you're talking about a lot of metadata writes (hence I sorta
understand the request for a lazy+volatile reflink). But this quickly
becomes a metric f ton of data: it's in effect a duplicate of each
file's metadata, which includes a list of its extents. So in simple
cases it can stay unwritten, but you can't be sure in every case that
the operation fits in memory.

An example from my sysroot:

  36.87GiB  data extents
   1.12GiB  filesystem metadata

If I reflink-copy that whole filesystem, it translates into 1.12GiB of
metadata read and then 1.12GiB written. If it's a Btrfs snapshot of the
containing subvolumes, it's maybe 128KiB written per snapshot. The
reflink copy is only cheap compared to a full data copy; it's not that
cheap compared to snapshots.

It sounds to me like a lazy reflink copy is no longer lazy if it has to
write out to disk because it can't all fit in memory, or if it ends up
evicting something else from memory and slowing things down that way.

A Btrfs snapshot is cheaper than an LVM thinp snapshot, which comes
with a need to then mount that snapshot's filesystem in order to do
the backup.
But if the filesystem is big enough that there are long mount times,
chances are you're talking about a lot of data to back up as well,
which means a lot of metadata to read and then write out, unless
you're lucky enough to have gobs of RAM. So *shrug* I'm not seeing a
consistent optimization with lazy reflink. It'll be faster if we're
not talking about a lot of data in the first place.

> I have based my assumption that reflink of a large file may incur
> lots of metadata updates on my limited knowledge of xfs reflink
> implementation, but perhaps it is not the case for other filesystems?
> (btrfs?) and perhaps the current metadata overhead on reflink of a large
> file is an implementation detail that could be optimized in the future?

The optimum use case is maybe a few hundred big files. Tens of
thousands to millions - I think you start creating a lot of
competition for memory, and the ensuing consequences. Something has to
be evicted. Either the lazy reflink is a lower priority, and it
functionally becomes a partial or full reflink by writing out to block
devices; or it's a higher priority and kicks something else out. No
free lunch.

-- 
Chris Murphy
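For reference, the eager flow from the opening mail (create an O_TMPFILE, reflink the origin into it, back up from the temp file) is already expressible with today's APIs, and its per-file metadata cost is what the discussion above is weighing. A sketch assuming Linux; the function name is mine, the FICLONE value is _IOW(0x94, 9, int) from &lt;linux/fs.h&gt;, and a plain-copy fallback covers filesystems that cannot reflink:

```python
import fcntl
import os
import shutil
import tempfile

FICLONE = 0x40049409  # _IOW(0x94, 9, int), from <linux/fs.h>

def open_backup_source(path):
    """Return a readable file object holding a stable copy of path.

    Preferred: an anonymous O_TMPFILE in the same directory, cloned
    from the origin with FICLONE (shares extents, no data copy - but,
    as discussed above, still a metadata write per file).
    Fallback: a full copy into an unlinked temp file.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    with open(path, "rb") as src:
        try:
            tmp = os.open(dirname, os.O_TMPFILE | os.O_RDWR)
        except (AttributeError, OSError):
            tmp = -1  # no O_TMPFILE support here; use the fallback
        if tmp != -1:
            try:
                fcntl.ioctl(tmp, FICLONE, src.fileno())
                return os.fdopen(tmp, "rb")
            except OSError:
                os.close(tmp)  # e.g. EOPNOTSUPP on ext4 or tmpfs
        out = tempfile.TemporaryFile(dir=dirname)
        shutil.copyfileobj(src, out)
        out.seek(0)
        return out
```

Either way the returned object is a separate inode, so later writes to the origin do not disturb the backup's view; the lazy-reflink proposal is about skipping this work for the files that turn out never to be written.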
end of thread, other threads:[~2019-04-27 21:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-25 14:27 [LSF/MM TOPIC] Lazy file reflink Amir Goldstein
2019-01-28 12:50 ` Jan Kara
2019-01-28 21:26 ` Dave Chinner
2019-01-28 22:56 ` Amir Goldstein
2019-01-29  0:18 ` Dave Chinner
2019-01-29  7:18 ` Amir Goldstein
2019-01-29 23:01 ` Dave Chinner
2019-01-30 13:30 ` Amir Goldstein
2019-01-31 20:25 ` Chris Murphy
2019-01-31 21:13 ` Matthew Wilcox
2019-02-01 13:49 ` Amir Goldstein
2019-04-27 21:46 ` Amir Goldstein
2019-01-31 20:02 ` Chris Murphy