From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: webmaster@zedlx.com
Cc: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
linux-btrfs@vger.kernel.org
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Wed, 11 Sep 2019 17:42:11 -0400
Message-ID: <20190911214211.GC22121@hungrycats.org>
In-Reply-To: <20190911160101.Horde.mYR8sgLb1dgpIs3fD4D5Cfy@server53.web-hosting.com>
On Wed, Sep 11, 2019 at 04:01:01PM -0400, webmaster@zedlx.com wrote:
>
> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
> > On 2019-09-11 13:20, webmaster@zedlx.com wrote:
> > >
> > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > >
> > > > On 2019-09-10 19:32, webmaster@zedlx.com wrote:
> > > > >
> > > > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > > > >
> > >
> > > > >
> > > > > === I CHALLENGE you and anyone else on this mailing list: ===
> > > > >
> > > > > - Show me an example where splitting an extent requires
> > > > > unsharing, and this split is needed to defrag.
> > > > >
> > > > > Make it clear, write it yourself, I don't want any machine-made outputs.
> > > > >
> > > > Start with the above comment about all writes unsharing the
> > > > region being written to.
> > > >
> > > > Now, extrapolating from there:
> > > >
> > > > Assume you have two files, A and B, each consisting of 64
> > > > filesystem blocks in a single shared extent. Now assume somebody
> > > > writes a few bytes to the middle of file B, right around the
> > > > boundary between blocks 31 and 32, and that you get similar
> > > > writes to file A straddling blocks 14-15 and 47-48.
> > > >
> > > > After all of that, file A will be 5 extents:
> > > >
> > > > * A reflink to blocks 0-13 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 14-15
> > > > * A reflink to blocks 16-46 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 47-48
> > > > * A reflink to blocks 49-63 of the original extent.
> > > >
> > > > And file B will be 3 extents:
> > > >
> > > > * A reflink to blocks 0-30 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 31-32.
> > > > * A reflink to blocks 33-63 of the original extent.
> > > >
> > > > Note that there are a total of four contiguous sequences of
> > > > blocks that are common between both files:
> > > >
> > > > * 0-13
> > > > * 16-30
> > > > * 33-46
> > > > * 49-63
> > > >
> > > > There is no way to completely defragment either file without
> > > > splitting the original extent (which is still there, just not
> > > > fully referenced by either file) unless you rewrite the whole
> > > > file to a new single extent (which would, of course, completely
> > > > unshare the whole file). In fact, if you want to ensure that
> > > > those shared regions stay reflinked, there's no way to
> > > > defragment either file without _increasing_ the number of
> > > > extents in that file (either file would need 7 extents to
> > > > properly share only those 4 regions), and even then only one of
> > > > the files could be fully defragmented.
> > > >
> > > > Such a situation generally won't happen if you're just dealing
> > > > with read-only snapshots, but is not unusual when dealing with
> > > > regular files that are reflinked (which is not an uncommon
> > > > situation on some systems, as a lot of people have `cp` aliased
> > > > to reflink things whenever possible).
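The block arithmetic above can be checked mechanically. Here is a small, purely illustrative Python model of that scenario (not btrfs code; all names are made up): a file is a list of (extent_id, offset) block references, a CoW write replaces the written blocks with a fresh extent, and the shared runs are the maximal contiguous ranges where both files still reference the original extent.

```python
# Toy model of CoW extent splitting; a block reference is (extent_id, offset).
ORIG = "orig"

def make_file(n=64):
    return [(ORIG, i) for i in range(n)]

def cow_write(f, first, last, new_id):
    # Overwrite blocks first..last with a fresh extent (CoW unshares them).
    for i in range(first, last + 1):
        f[i] = (new_id, i - first)

def extent_runs(f):
    # Count maximal runs of blocks contiguous within one extent.
    runs = 1
    for prev, cur in zip(f, f[1:]):
        if not (cur[0] == prev[0] and cur[1] == prev[1] + 1):
            runs += 1
    return runs

def shared_runs(a, b):
    # Maximal runs where both files still reference the original extent.
    runs, in_run, start = [], False, 0
    for i, (x, y) in enumerate(zip(a, b)):
        same = x == y == (ORIG, i)
        if same and not in_run:
            start, in_run = i, True
        elif not same and in_run:
            runs.append((start, i - 1))
            in_run = False
    if in_run:
        runs.append((start, len(a) - 1))
    return runs

A, B = make_file(), make_file()
cow_write(A, 14, 15, "a1")
cow_write(A, 47, 48, "a2")
cow_write(B, 31, 32, "b1")

print(extent_runs(A))     # 5 extents in file A
print(extent_runs(B))     # 3 extents in file B
print(shared_runs(A, B))  # the four surviving shared runs
```

Running it reproduces the counts in the example: five extents in A, three in B, and exactly four shared runs.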
> > >
> > > Well, thank you very much for writing this example. Your example is
> > > certainly not minimal, as it seems to me that one write to file A
> > > and one write to file B would be sufficient to prove your point, so
> > > there is one extra write in the example, but that's OK.
> > >
> > > Your example proves that I was wrong. I admit: it is impossible to
> > > perfectly defrag one subvolume (in the way I imagined it should be
> > > done).
> > > Why? Because, as in your example, there can be files within a SINGLE
> > > subvolume which share their extents with each other. I didn't
> > > consider such a case.
> > >
> > > On the other hand, I judge this issue to be mostly irrelevant. Why?
> > > Because most of the file sharing will be between subvolumes, not
> > > within a subvolume.
>
> > Not necessarily. Even ignoring the case of data deduplication (which
> > needs to be considered if you care at all about enterprise usage, and is
> > part of the whole point of using a CoW filesystem), there are existing
> > applications that actively use reflinks, either directly or indirectly
> > (via things like the `copy_file_range` system call), and the number of
> > such applications is growing.
>
> The same argument goes here: if data deduplication was performed, then the
> user has specifically requested it.
> Therefore, since it was the user's will, the defrag has to honor it, and so
> the defrag must not unshare deduplicated extents, because the user wants
> them shared. This might prevent a perfect defrag, but that is exactly what
> the user has requested, either directly or indirectly, by some policy he
> has chosen.
>
> If an application actively creates reflinked-copies, then we can assume it
> does so according to user's will, therefore it is also a command by user and
> defrag should honor it by not unsharing and by being imperfect.
>
> Now, you might point out that, in the case of data deduplication, we now
> have a case where most sharing might be within-subvolume, invalidating my
> assertion that most sharing will be between subvolumes. But this is an
> invalid (more precisely, irrelevant) argument. Why? Because the defrag
> operation has to focus on doing what it can do, while honoring the user's
> will. All within-subvolume sharing is user-requested, therefore it cannot
> be part of the argument to unshare.
>
> You can't both perfectly defrag and honor deduplication. Therefore, the
> defrag has to do the best possible thing while still honoring user's will.
> <<<!!! So, the fact that the deduplication was performed is actually the
> reason FOR not unsharing, not against it, as you made it look in that
> paragraph. !!!>>>
IMHO the current kernel 'defrag' API shouldn't be used any more. We need
a tool that handles dedupe and defrag at the same time, for precisely
this reason: currently the two operations have no knowledge of each
other, and duplicate or reverse each other's work. You don't need to defrag
an extent if you can find a duplicate, and you don't want to use fragmented
extents as dedupe sources.
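The reversal is easy to see in a toy model (hypothetical code, nothing like the real kernel interfaces): a defrag that rewrites a file into one fresh extent destroys sharing, so running it after a whole-file dedupe simply undoes the dedupe.

```python
# Toy model: a file maps logical blocks to (extent_id, offset); `extents`
# plays the role of the disk.  All names and semantics are illustrative.
extents = {}
next_id = 0

def new_extent(data):
    global next_id
    eid = f"e{next_id}"
    next_id += 1
    extents[eid] = data
    return eid

def defrag(f):
    # Rewrite the whole file into one fresh contiguous extent (unshares!).
    data = [extents[eid][off] for eid, off in f]
    eid = new_extent(data)
    return [(eid, i) for i in range(len(data))]

def dedupe(a, b):
    # Naive whole-file dedupe: if contents match, point b at a's blocks.
    da = [extents[e][o] for e, o in a]
    db = [extents[e][o] for e, o in b]
    return a[:] if da == db else b

def shared_blocks(a, b):
    return sum(1 for x, y in zip(a, b) if x == y)

# Two files with identical content, stored fragmented and unshared:
# eight one-block extents each.
A = [(new_extent(["x"]), 0) for _ in range(8)]
B = [(new_extent(["x"]), 0) for _ in range(8)]

B = dedupe(A, B)
assert shared_blocks(A, B) == 8   # dedupe created sharing
B = defrag(B)
assert shared_blocks(A, B) == 0   # a sharing-oblivious defrag reversed it
```

A combined tool could instead notice that B's blocks duplicate A's and treat the dedupe itself as the "defrag" of B.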
> If the system unshares automatically after deduplication, then the user will
> need to run deduplication again. Ridiculous!
>
> > > When a user creates a reflink to a file in the same subvolume, he is
> > > willingly denying himself the assurance of a perfect defrag.
> > > Because, as your example proves, if there are a few writes to BOTH
> > > files, it becomes impossible to defrag perfectly. So, if the user
> > > creates such reflinks, it's his own wish and his own fault.
>
> > The same argument can be made about snapshots. It's an invalid argument
> > in both cases though because it's not always the user who's creating the
> > reflinks or snapshots.
>
> Um, I don't agree.
>
> 1) Actually, it is always the user who is creating reflinks, and snapshots,
> too. Ultimately, it's always the user who does absolutely everything,
> because a computer is supposed to be under his full control. But, in the
> case of reflink-copies, this is even more true
> because reflinks are not an essential feature for normal OS operation, at
> least as far as today's OSes go. Every OS has to copy files around. Every OS
> requires the copy operation. No current OS requires the reflinked-copy
> operation in order to function.
If we don't do reflinks all day, every day, our disks fill up in a matter
of hours...
> 2) A user can make any number of snapshots and subvolumes, but he can at any
> time select one subvolume as a focus of the defrag operation, and that
> subvolume can be perfectly defragmented without any unsharing (except that
> the internal-reflinked files won't be perfectly defragmented).
> Therefore, the snapshotting operation can never jeopardize a perfect defrag.
> The user can make many snapshots without any fear (I'd say a total of 100
> snapshots at any point in time is a good and reasonable limit).
>
> > > Such situations will occur only in some specific circumstances:
> > > a) when the user is reflinking manually
> > > b) when a file is copied from one subvolume into a different file in
> > > a different subvolume.
> > >
> > > The situation a) is unusual in normal use of the filesystem. Even
> > > when it occurs, it is the explicit command given by the user, so he
> > > should be willing to accept all the consequences, even the bad ones
> > > like imperfect defrag.
> > >
> > > The situation b) is possible, but as far as I know copies are
> > > currently not done that way in btrfs. There should probably be an
> > > option to reflink-copy files from another subvolume; that would be
> > > good.
> > >
> > > But anyway, it doesn't matter. Because most of the sharing will be
> > > between subvolumes, not within a subvolume. So, if there is some
> > > in-subvolume sharing, the defrag won't be 100% perfect, but that's a
> > > minor point. Unimportant.
>
> > You're focusing too much on your own use case here.
>
> It's so easy to say that. But you really don't know. You might be wrong. I
> might be the objective one, and you might be giving me some
> groupthink-induced, badly thought-out conclusions from years ago, which were
> never rechecked because that's so hard to do. And then everybody just
> repeats it and it becomes the truth. As Goebbels said, if you repeat
> anything enough times, it becomes the truth.
>
> > Not everybody uses snapshots, and there are many people who are using
> > reflinks very actively within subvolumes, either for deduplication or
> > because it saves time and space when dealing with multiple copies of
> > mostly identical trees of files.
>
> Yes, I guess there are many such users. Doesn't matter. What you are
> proposing is that the defrag should break all their reflinks and
> deduplicated data they painstakingly created. Come on!
>
> Or, maybe the defrag should unshare to gain performance? Yes, but only WHEN
> THE USER REQUESTS IT. So the defrag can unshare, but only by request. Since
> this means the user is reversing his previous command to not unshare, it
> has to be explicitly requested by the user, not be part of the default
> defrag operation.
>
>
> > As mentioned in the previous email, we actually did have a (mostly)
> > working reflink-aware defrag a few years back. It got removed because
> > it had serious performance issues. Note that we're not talking a few
> > seconds of extra time to defrag a full tree here, we're talking
> > double-digit _minutes_ of extra time to defrag a moderate sized (low
> > triple digit GB) subvolume with dozens of snapshots, _if you were lucky_
> > (if you weren't, you would be looking at potentially multiple _hours_ of
> > runtime for the defrag). The performance scaled inversely proportionate
> > to the number of reflinks involved and the total amount of data in the
> > subvolume being defragmented, and was pretty bad even in the case of
> > only a couple of snapshots.
> >
> > Ultimately, there are a couple of issues at play here:
>
> I'll reply to this in another post. This one is getting a bit too long.
>
>