From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: webmaster@zedlx.com
Cc: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
linux-btrfs@vger.kernel.org
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Wed, 11 Sep 2019 17:42:11 -0400
Message-ID: <20190911214211.GC22121@hungrycats.org>
In-Reply-To: <20190911160101.Horde.mYR8sgLb1dgpIs3fD4D5Cfy@server53.web-hosting.com>
On Wed, Sep 11, 2019 at 04:01:01PM -0400, webmaster@zedlx.com wrote:
>
> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
> > On 2019-09-11 13:20, webmaster@zedlx.com wrote:
> > >
> > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > >
> > > > On 2019-09-10 19:32, webmaster@zedlx.com wrote:
> > > > >
> > > > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > > > >
> > >
> > > > >
> > > > > === I CHALLENGE you and anyone else on this mailing list: ===
> > > > >
> > > > > - Show me an example where splitting an extent requires
> > > > > unsharing, and this split is needed to defrag.
> > > > >
> > > > > Make it clear, write it yourself, I don't want any machine-made outputs.
> > > > >
> > > > Start with the above comment about all writes unsharing the
> > > > region being written to.
> > > >
> > > > Now, extrapolating from there:
> > > >
> > > > Assume you have two files, A and B, each consisting of 64
> > > > filesystem blocks in a single shared extent. Now assume somebody
> > > > writes a few bytes to the middle of file B, right around the
> > > > boundary between blocks 31 and 32, and that you get similar
> > > > writes to file A straddling blocks 14-15 and 47-48.
> > > >
> > > > After all of that, file A will be 5 extents:
> > > >
> > > > * A reflink to blocks 0-13 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 14-15
> > > > * A reflink to blocks 16-46 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 47-48
> > > > * A reflink to blocks 49-63 of the original extent.
> > > >
> > > > And file B will be 3 extents:
> > > >
> > > > * A reflink to blocks 0-30 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 31-32.
> > > > * A reflink to blocks 33-63 of the original extent.
> > > >
> > > > Note that there are a total of four contiguous sequences of
> > > > blocks that are common between both files:
> > > >
> > > > * 0-13
> > > > * 16-30
> > > > * 33-46
> > > > * 49-63
> > > >
> > > > There is no way to completely defragment either file without
> > > > splitting the original extent (which is still there, just not
> > > > fully referenced by either file) unless you rewrite the whole
> > > > file to a new single extent (which would, of course, completely
> > > > unshare the whole file). In fact, if you want to ensure that
> > > > those shared regions stay reflinked, there's no way to
> > > > defragment either file without _increasing_ the number of
> > > > extents in that file (either file would need 7 extents to
> > > > properly share only those 4 regions), and even then only one of
> > > > the files could be fully defragmented.
> > > >
> > > > Such a situation generally won't happen if you're just dealing
> > > > with read-only snapshots, but is not unusual when dealing with
> > > > regular files that are reflinked (which is not an uncommon
> > > > situation on some systems, as a lot of people have `cp` aliased
> > > > to reflink things whenever possible).
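The block arithmetic above can be checked mechanically. Here is a small, purely illustrative Python model of that scenario (not btrfs code; all names are made up): a file is a list of (extent_id, offset) block references, a CoW write replaces the written blocks with a fresh extent, and the shared runs are the maximal contiguous ranges where both files still reference the original extent.

```python
# Toy model of CoW extent splitting; a block reference is (extent_id, offset).
ORIG = "orig"

def make_file(n=64):
    return [(ORIG, i) for i in range(n)]

def cow_write(f, first, last, new_id):
    # Overwrite blocks first..last with a fresh extent (CoW unshares them).
    for i in range(first, last + 1):
        f[i] = (new_id, i - first)

def extent_runs(f):
    # Count maximal runs of blocks contiguous within one extent.
    runs = 1
    for prev, cur in zip(f, f[1:]):
        if not (cur[0] == prev[0] and cur[1] == prev[1] + 1):
            runs += 1
    return runs

def shared_runs(a, b):
    # Maximal runs where both files still reference the original extent.
    runs, in_run, start = [], False, 0
    for i, (x, y) in enumerate(zip(a, b)):
        same = x == y == (ORIG, i)
        if same and not in_run:
            start, in_run = i, True
        elif not same and in_run:
            runs.append((start, i - 1))
            in_run = False
    if in_run:
        runs.append((start, len(a) - 1))
    return runs

A, B = make_file(), make_file()
cow_write(A, 14, 15, "a1")
cow_write(A, 47, 48, "a2")
cow_write(B, 31, 32, "b1")

print(extent_runs(A))     # 5 extents in file A
print(extent_runs(B))     # 3 extents in file B
print(shared_runs(A, B))  # the four surviving shared runs
```

Running it reproduces the counts in the example: five extents in A, three in B, and exactly four shared runs.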
> > >
> > > Well, thank you very much for writing this example. Your example is
> > > certainly not minimal, as it seems to me that one write to file A
> > > and one write to file B would be sufficient to prove your point, so
> > > there is one extra write in the example, but that's OK.
> > >
> > > Your example proves that I was wrong. I admit: it is impossible to
> > > perfectly defrag one subvolume (in the way I imagined it should be
> > > done).
> > > Why? Because, as in your example, there can be files within a SINGLE
> > > subvolume which share their extents with each other. I didn't
> > > consider such a case.
> > >
> > > On the other hand, I judge this issue to be mostly irrelevant. Why?
> > > Because most of the file sharing will be between subvolumes, not
> > > within a subvolume.
>
> > Not necessarily. Even ignoring the case of data deduplication (which
> > needs to be considered if you care at all about enterprise usage, and is
> > part of the whole point of using a CoW filesystem), there are existing
> > applications that actively use reflinks, either directly or indirectly
> > (via things like the `copy_file_range` system call), and the number of
> > such applications is growing.
>
> The same argument goes here: if data deduplication was performed, then the
> user has specifically requested it.
> Therefore, since it was the user's will, the defrag has to honor it, and so
> the defrag must not unshare deduplicated extents, because the user wants
> them shared. This might prevent a perfect defrag, but that is exactly what
> the user has requested, either directly or indirectly, by some policy he
> has chosen.
>
> If an application actively creates reflinked-copies, then we can assume it
> does so according to user's will, therefore it is also a command by user and
> defrag should honor it by not unsharing and by being imperfect.
>
> Now, you might point out that, in the case of data deduplication, we now
> have a case where most sharing might be within-subvolume, invalidating my
> assertion that most sharing will be between subvolumes. But this is an
> invalid (more precisely, irrelevant) argument. Why? Because the defrag
> operation has to focus on doing what it can do, while honoring the user's
> will. All within-subvolume sharing is user-requested, therefore it cannot
> be part of the argument to unshare.
>
> You can't both perfectly defrag and honor deduplication. Therefore, the
> defrag has to do the best possible thing while still honoring user's will.
> <<<!!! So, the fact that the deduplication was performed is actually the
> reason FOR not unsharing, not against it, as you made it look in that
> paragraph. !!!>>>
IMHO the current kernel 'defrag' API shouldn't be used any more. We need
a tool that handles dedupe and defrag at the same time, for precisely
this reason: currently the two operations have no knowledge of each
other, and duplicate or reverse each other's work. You don't need to defrag
an extent if you can find a duplicate, and you don't want to use fragmented
extents as dedupe sources.
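The reversal is easy to see in a toy model (hypothetical code, nothing like the real kernel interfaces): a defrag that rewrites a file into one fresh extent destroys sharing, so running it after a whole-file dedupe simply undoes the dedupe.

```python
# Toy model: a file maps logical blocks to (extent_id, offset); `extents`
# plays the role of the disk.  All names and semantics are illustrative.
extents = {}
next_id = 0

def new_extent(data):
    global next_id
    eid = f"e{next_id}"
    next_id += 1
    extents[eid] = data
    return eid

def defrag(f):
    # Rewrite the whole file into one fresh contiguous extent (unshares!).
    data = [extents[eid][off] for eid, off in f]
    eid = new_extent(data)
    return [(eid, i) for i in range(len(data))]

def dedupe(a, b):
    # Naive whole-file dedupe: if contents match, point b at a's blocks.
    da = [extents[e][o] for e, o in a]
    db = [extents[e][o] for e, o in b]
    return a[:] if da == db else b

def shared_blocks(a, b):
    return sum(1 for x, y in zip(a, b) if x == y)

# Two files with identical content, stored fragmented and unshared:
# eight one-block extents each.
A = [(new_extent(["x"]), 0) for _ in range(8)]
B = [(new_extent(["x"]), 0) for _ in range(8)]

B = dedupe(A, B)
assert shared_blocks(A, B) == 8   # dedupe created sharing
B = defrag(B)
assert shared_blocks(A, B) == 0   # a sharing-oblivious defrag reversed it
```

A combined tool could instead notice that B's blocks duplicate A's and treat the dedupe itself as the "defrag" of B.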
> If the system unshares automatically after deduplication, then the user will
> need to run deduplication again. Ridiculous!
>
> > > When a user creates a reflink to a file in the same subvolume, he is
> > > willingly denying himself the assurance of a perfect defrag.
> > > Because, as your example proves, if there are a few writes to BOTH
> > > files, it becomes impossible to defrag perfectly. So, if the user
> > > creates such reflinks, it's his own wish and his own fault.
>
> > The same argument can be made about snapshots. It's an invalid argument
> > in both cases though because it's not always the user who's creating the
> > reflinks or snapshots.
>
> Um, I don't agree.
>
> 1) Actually, it is always the user who is creating reflinks, and snapshots,
> too. Ultimately, it's always the user who does absolutely everything,
> because a computer is supposed to be under his full control. But, in the
> case of reflink-copies, this is even more true
> because reflinks are not an essential feature for normal OS operation, at
> least as far as today's OSes go. Every OS has to copy files around. Every OS
> requires the copy operation. No current OS requires the reflinked-copy
> operation in order to function.
If we don't do reflinks all day, every day, our disks fill up in a matter
of hours...
> 2) A user can make any number of snapshots and subvolumes, but he can at any
> time select one subvolume as a focus of the defrag operation, and that
> subvolume can be perfectly defragmented without any unsharing (except that
> the internal-reflinked files won't be perfectly defragmented).
> Therefore, the snapshotting operation can never jeopardize a perfect defrag.
> The user can make many snapshots without any fear (I'd say a total of 100
> snapshots at any point in time is a good and reasonable limit).
>
> > > Such situations will occur only in some specific circumstances:
> > > a) when the user is reflinking manually
> > > b) when a file is copied from one subvolume into a different file in
> > > a different subvolume.
> > >
> > > The situation a) is unusual in normal use of the filesystem. Even
> > > when it occurs, it is the explicit command given by the user, so he
> > > should be willing to accept all the consequences, even the bad ones
> > > like imperfect defrag.
> > >
> > > The situation b) is possible, but as far as I know copies are
> > > currently not done that way in btrfs. There should probably be an
> > > option to reflink-copy files from another subvolume; that would be
> > > good.
> > >
> > > But anyway, it doesn't matter. Because most of the sharing will be
> > > between subvolumes, not within a subvolume. So, if there is some
> > > in-subvolume sharing, the defrag won't be 100% perfect, but that's a
> > > minor point. Unimportant.
>
> > You're focusing too much on your own use case here.
>
> It's so easy to say that. But you really don't know. You might be wrong. I
> might be the objective one, and you might be giving me some
> groupthink-induced, badly thought-out conclusions from years ago, which were
> never rechecked because that's so hard to do. And then everybody just
> repeats it and it becomes the truth. As Goebbels said, if you repeat
> anything enough times, it becomes the truth.
>
> > Not everybody uses snapshots, and there are many people who are using
> > reflinks very actively within subvolumes, either for deduplication or
> > because it saves time and space when dealing with multiple copies of
> > mostly identical trees of files.
>
> Yes, I guess there are many such users. Doesn't matter. What you are
> proposing is that the defrag should break all their reflinks and
> deduplicated data they painstakingly created. Come on!
>
> Or, maybe the defrag should unshare to gain performance? Yes, but only WHEN
> THE USER REQUESTS IT. So the defrag can unshare, but only by request. Since
> this means the user is reversing his previous command to not unshare, it
> has to be explicitly requested by the user, not be part of the default
> defrag operation.
>
>
> > As mentioned in the previous email, we actually did have a (mostly)
> > working reflink-aware defrag a few years back. It got removed because
> > it had serious performance issues. Note that we're not talking a few
> > seconds of extra time to defrag a full tree here, we're talking
> > double-digit _minutes_ of extra time to defrag a moderate sized (low
> > triple digit GB) subvolume with dozens of snapshots, _if you were lucky_
> > (if you weren't, you would be looking at potentially multiple _hours_ of
> > runtime for the defrag). The performance scaled inversely proportionate
> > to the number of reflinks involved and the total amount of data in the
> > subvolume being defragmented, and was pretty bad even in the case of
> > only a couple of snapshots.
> >
> > Ultimately, there are a couple of issues at play here:
>
> I'll reply to this in another post. This one is getting a bit too long.
>
>