linux-fsdevel.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>,
	lsf-pc@lists.linux-foundation.org,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [LSF/MM TOPIC] Lazy file reflink
Date: Tue, 29 Jan 2019 08:26:43 +1100	[thread overview]
Message-ID: <20190128212642.GQ4205@dastard> (raw)
In-Reply-To: <20190128125044.GC27972@quack2.suse.cz>

On Mon, Jan 28, 2019 at 01:50:44PM +0100, Jan Kara wrote:
> Hi,
> 
> On Fri 25-01-19 16:27:52, Amir Goldstein wrote:
> > I would like to discuss the concept of lazy file reflink.
> > The use case is backup of a very large read-mostly file.
> > A backup application would like to read consistent content from the
> > file - an "atomic read", so to speak.
> > 
> > With filesystem that supports reflink, that can be done by:
> > - Create O_TMPFILE
> > - Reflink origin to temp file
> > - Backup from temp file
> >
> > However, since the origin file is very likely not to be modified,
> > the reflink step, which may incur lots of metadata updates, is a waste.
> > Instead, if filesystem could be notified that atomic content was
> > requested (O_ATOMIC|O_RDONLY or O_CLONE|O_RDONLY),
> > filesystem could defer reflink to an O_TMPFILE until origin file is
> > open for write or actually modified.

That makes me want to run screaming for the hills.

> > What I just described above is actually already implemented with
> > overlayfs snapshots [1], but for many applications overlayfs
> > snapshots are not a practical solution.
> > 
> > I have based my assumption that reflink of a large file may incur
> > lots of metadata updates on my limited knowledge of xfs reflink
> > implementation, but perhaps it is not the case for other filesystems?

Comparatively speaking: compared to copying a large file, reflink is
cheap on any filesystem that implements it. Sure, reflinking on XFS
is CPU limited, IIRC, to ~10-20,000 extents per second per reflink
op per AG, but it's still faster than copying 10-20,000 extents
per second per copy op on all but the very fastest, unloaded nvme
SSDs...

> > (btrfs?) and perhaps the current metadata overhead on reflink of a large
> > file is an implementation detail that could be optimized in the future?
> > 
> > The point of the matter is that there is no API to make an explicit
> > request for a "volatile reflink" that does not need to survive power
> > failure and that limits the ability of filesytems to optimize this case.
> 
> Well, to me this seems like a relatively rare usecase (and performance
> gain) for the complexity. Also the speed of reflink is fs dependent - e.g.
> for btrfs it is rather cheap AFAIK.

I suspect for "very large read-mostly file" it's still an expensive
operation on btrfs.

Really, though, for this use case it'd make more sense to have "per
file freeze" semantics. i.e. if you want a consistent backup image
on snapshot capable storage, the process is usually "freeze
filesystem, snapshot fs, unfreeze fs, do backup from snapshot,
remove snapshot". We can already transparently block incoming
writes/modifications on files via the freeze mechanism, so why not
just extend that to per-file granularity so writes to the "very
large read-mostly file" block while it's being backed up....

Indeed, this would probably only require a simple extension to
FIFREEZE/FITHAW - the parameter is currently ignored, but as defined
by XFS it was a "freeze level". Set this to 0xffffffff and then it
freezes just the fd passed in, not the whole filesystem.
Alternatively, FI_FREEZE_FILE/FI_THAW_FILE is simple to define...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 13+ messages
2019-01-25 14:27 [LSF/MM TOPIC] Lazy file reflink Amir Goldstein
2019-01-28 12:50 ` Jan Kara
2019-01-28 21:26   ` Dave Chinner [this message]
2019-01-28 22:56     ` Amir Goldstein
2019-01-29  0:18       ` Dave Chinner
2019-01-29  7:18         ` Amir Goldstein
2019-01-29 23:01           ` Dave Chinner
2019-01-30 13:30             ` Amir Goldstein
2019-01-31 20:25               ` Chris Murphy
2019-01-31 21:13     ` Matthew Wilcox
2019-02-01 13:49       ` Amir Goldstein
2019-04-27 21:46         ` Amir Goldstein
2019-01-31 20:02 ` Chris Murphy
