Linux-BTRFS Archive on lore.kernel.org
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: webmaster@zedlx.com
Cc: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Wed, 11 Sep 2019 17:42:11 -0400
Message-ID: <20190911214211.GC22121@hungrycats.org> (raw)
In-Reply-To: <20190911160101.Horde.mYR8sgLb1dgpIs3fD4D5Cfy@server53.web-hosting.com>

On Wed, Sep 11, 2019 at 04:01:01PM -0400, webmaster@zedlx.com wrote:
> 
> Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
> > On 2019-09-11 13:20, webmaster@zedlx.com wrote:
> > > 
> > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > > 
> > > > On 2019-09-10 19:32, webmaster@zedlx.com wrote:
> > > > > 
> > > > > Quoting "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> > > > > 
> > > 
> > > > > 
> > > > > === I CHALLENGE you and anyone else on this mailing list: ===
> > > > > 
> > > > >  - Show me an example where splitting an extent requires
> > > > > unsharing, and this split is needed to defrag.
> > > > > 
> > > > > Make it clear, write it yourself, I don't want any machine-made outputs.
> > > > > 
> > > > Start with the above comment about all writes unsharing the
> > > > region being written to.
> > > > 
> > > > Now, extrapolating from there:
> > > > 
> > > > Assume you have two files, A and B, each consisting of 64
> > > > filesystem blocks in single shared extent.  Now assume somebody
> > > > writes a few bytes to the middle of file B, right around the
> > > > boundary between blocks 31 and 32, and that you get similar
> > > > writes to file A straddling blocks 14-15 and 47-48.
> > > > 
> > > > After all of that, file A will be 5 extents:
> > > > 
> > > > * A reflink to blocks 0-13 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 14-15
> > > > * A reflink to blocks 16-46 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 47-48
> > > > * A reflink to blocks 49-63 of the original extent.
> > > > 
> > > > And file B will be 3 extents:
> > > > 
> > > > * A reflink to blocks 0-30 of the original extent.
> > > > * A single isolated extent consisting of the new blocks 31-32.
> > > > * A reflink to blocks 33-63 of the original extent.
> > > > 
> > > > Note that there are a total of four contiguous sequences of
> > > > blocks that are common between both files:
> > > > 
> > > > * 0-13
> > > > * 16-30
> > > > * 33-46
> > > > * 49-63
> > > > 
> > > > There is no way to completely defragment either file without
> > > > splitting the original extent (which is still there, just not
> > > > fully referenced by either file) unless you rewrite the whole
> > > > file to a new single extent (which would, of course, completely
> > > > unshare the whole file).  In fact, if you want to ensure that
> > > > those shared regions stay reflinked, there's no way to
> > > > defragment either file without _increasing_ the number of
> > > > extents in that file (either file would need 7 extents to
> > > > properly share only those 4 regions), and even then only one of
> > > > the files could be fully defragmented.
> > > > 
> > > > Such a situation generally won't happen if you're just dealing
> > > > with read-only snapshots, but is not unusual when dealing with
> > > > regular files that are reflinked (which is not an uncommon
> > > > situation on some systems, as a lot of people have `cp` aliased
> > > > to reflink things whenever possible).
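The block arithmetic in the example above can be checked mechanically. This is a minimal sketch (hypothetical Python modeling blocks as a boolean array, not btrfs code) that recomputes which runs of blocks remain shared after the three writes, and the resulting extent count if all shared runs are to stay reflinked:

```python
def shared_runs(writes_a, writes_b, nblocks=64):
    """Given the inclusive block ranges rewritten in each file, return
    the maximal runs of blocks still shared between the two files."""
    shared = [True] * nblocks
    for start, end in writes_a + writes_b:
        for blk in range(start, end + 1):
            shared[blk] = False          # a write unshares its blocks
    runs, run_start = [], None
    for blk in range(nblocks + 1):
        if blk < nblocks and shared[blk]:
            if run_start is None:
                run_start = blk          # a shared run begins
        elif run_start is not None:
            runs.append((run_start, blk - 1))  # a shared run ends
            run_start = None
    return runs

# File A was rewritten at blocks 14-15 and 47-48, file B at 31-32.
runs = shared_runs([(14, 15), (47, 48)], [(31, 32)])
print(runs)               # -> [(0, 13), (16, 30), (33, 46), (49, 63)]
# 4 shared runs separated by 3 private runs per file:
print(2 * len(runs) - 1)  # -> 7 extents per file to keep all 4 runs shared
```

This reproduces the four shared regions and the claim that either file would need 7 extents to preserve all of the sharing.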
> > > 
> > > Well, thank you very much for writing this example. Your example is
> > > certainly not minimal, as it seems to me that one write to file A
> > > and one write to file B would be sufficient to prove your point,
> > > so there we have one extra write in the example, but that's OK.
> > > 
> > > Your example proves that I was wrong. I admit: it is impossible to
> > > perfectly defrag one subvolume (in the way I imagined it should be
> > > done).
> > > Why? Because, as in your example, there can be files within a SINGLE
> > > subvolume which share their extents with each other. I didn't
> > > consider such a case.
> > > 
> > > On the other hand, I judge this issue to be mostly irrelevant. Why?
> > > Because most of the file sharing will be between subvolumes, not
> > > within a subvolume.
> 
> > Not necessarily. Even ignoring the case of data deduplication (which
> > needs to be considered if you care at all about enterprise usage, and is
> > part of the whole point of using a CoW filesystem), there are existing
> > applications that actively use reflinks, either directly or indirectly
> > (via things like the `copy_file_range` system call), and the number of
> > such applications is growing.
> 
> The same argument goes here: If data-deduplication was performed, then the
> user has specifically requested it.
> Therefore, since it was the user's will, the defrag has to honor it, and so
> defrag must not unshare deduplicated extents because the user wants them
> shared. This might prevent a perfect defrag, but that is exactly what the
> user has requested, either directly or indirectly, by some policy he has
> chosen.
> 
> If an application actively creates reflinked-copies, then we can assume it
> does so according to user's will, therefore it is also a command by user and
> defrag should honor it by not unsharing and by being imperfect.
> 
> Now, you might point out that, in case of data-deduplication, we now have a
> case where most sharing might be within-subvolume, invalidating my assertion
> that most sharing will be between-subvolumes. But this is an invalid (more
> precisely, irrelevant) argument. Why? Because the defrag operation has to
> focus on doing what it can do, while honoring user's will. All
> within-subvolume sharing is user-requested, therefore it cannot be part of
> the argument to unshare.
> 
> You can't both perfectly defrag and honor deduplication. Therefore, the
> defrag has to do the best possible thing while still honoring user's will.
> <<<!!! So, the fact that the deduplication was performed is actually the
> reason FOR not unsharing, not against it, as you made it look in that
> paragraph. !!!>>>

IMHO the current kernel 'defrag' API shouldn't be used any more.  We need
a tool that handles dedupe and defrag at the same time, for precisely
this reason:  currently the two operations have no knowledge of each
other and duplicate or reverse each other's work.  You don't need to defrag
an extent if you can find a duplicate, and you don't want to use fragmented
extents as dedupe sources.
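That combined pass can be sketched as one decision per distinct content, so dedupe and defrag cannot undo each other's work. The toy model below is hypothetical Python with a made-up `FRAG_THRESHOLD` and a simplified `Extent` record, not real btrfs code or any actual kernel API:

```python
from dataclasses import dataclass

FRAG_THRESHOLD = 4  # fragments per extent; arbitrary illustrative cutoff

@dataclass
class Extent:
    data: bytes      # content backing this extent (stands in for a hash)
    fragments: int   # how many discontiguous on-disk pieces it has

def process(extents):
    """Toy model of a combined dedupe+defrag pass: one decision per
    distinct content, so the two operations never reverse each other."""
    by_content = {}
    for e in extents:
        by_content.setdefault(e.data, []).append(e)
    actions = []
    for data, copies in by_content.items():
        if len(copies) > 1:
            # Duplicate content: dedupe all copies against the
            # least-fragmented one -- never a fragmented source.
            source = min(copies, key=lambda e: e.fragments)
            actions.append(("dedupe", data, source.fragments))
        elif copies[0].fragments > FRAG_THRESHOLD:
            # Unique and fragmented: worth rewriting contiguously.
            actions.append(("defrag", data, 1))
    return actions

print(process([Extent(b"A", 8), Extent(b"A", 2), Extent(b"B", 9)]))
# -> [('dedupe', b'A', 2), ('defrag', b'B', 1)]
```

The duplicated content is never defragmented (dedupe makes that work moot), and the fragmented source is never propagated, which is exactly the interaction the paragraph above says the current split APIs get wrong.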

> If the system unshares automatically after deduplication, then the user will
> need to run deduplication again. Ridiculous!
> 
> > > When a user creates a reflink to a file in the same subvolume, he is
> > > willingly denying himself the assurance of a perfect defrag.
> > > Because, as your example proves, if there are a few writes to BOTH
> > > files, it gets impossible to defrag perfectly. So, if the user
> > > creates such reflinks, it's his own wish and his own fault.
> 
> > The same argument can be made about snapshots.  It's an invalid argument
> > in both cases though because it's not always the user who's creating the
> > reflinks or snapshots.
> 
> Um, I don't agree.
> 
> 1) Actually, it is always the user who is creating reflinks, and snapshots,
> too. Ultimately, it's always the user who does absolutely everything,
> because a computer is supposed to be under his full control. But, in the
> case of reflink-copies, this is even more true, because reflinks are not
> an essential feature for normal OS operation, at
> least as far as today's OSes go. Every OS has to copy files around. Every OS
> requires the copy operation. No current OS requires the reflinked-copy
> operation in order to function.

If we don't do reflinks all day, every day, our disks fill up in a matter
of hours...

> 2) A user can make any number of snapshots and subvolumes, but he can at any
> time select one subvolume as a focus of the defrag operation, and that
> subvolume can be perfectly defragmented without any unsharing (except that
> the internal-reflinked files won't be perfectly defragmented).
> Therefore, the snapshotting operation can never jeopardize a perfect defrag.
> The user can make many snapshots without any fears (I'd say a total of 100
> snapshots at any point in time is a good and reasonable limit).
> 
> > > Such situations will occur only in some specific circumstances:
> > > a) when the user is reflinking manually
> > > b) when a file is copied from one subvolume into a different file in
> > > a different subvolume.
> > > 
> > > The situation a) is unusual in normal use of the filesystem. Even
> > > when it occurs, it is the explicit command given by the user, so he
> > > should be willing to accept all the consequences, even the bad ones
> > > like imperfect defrag.
> > > 
> > > The situation b) is possible, but as far as I know copies are
> > > currently not done that way in btrfs. There should probably be the
> > > option to reflink-copy files from another subvolume, that would be
> > > good.
> > > 
> > > But anyway, it doesn't matter. Because most of the sharing will be
> > > between subvolumes, not within subvolume. So, if there is some
> > > in-subvolume sharing, the defrag won't be 100% perfect, but that's a
> > > minor point. Unimportant.
> 
> > You're focusing too much on your own use case here.
> 
> It's so easy to say that. But you really don't know. You might be wrong. I
> might be the objective one, and you might be giving me some
> groupthink-induced, badly thought out conclusions from years ago, which was
> never rechecked because that's so hard to do. And then everybody just
> repeats it and it becomes the truth. As Goebbels said, if you repeat
> anything enough times, it becomes the truth.
> 
> > Not everybody uses snapshots, and there are many people who are using
> > reflinks very actively within subvolumes, either for deduplication or
> > because it saves time and space when dealing with multiple copies of
> > mostly identical trees of files.
> 
> Yes, I guess there are many such users. Doesn't matter. What you are
> proposing is that the defrag should break all their reflinks and
> deduplicated data they painstakingly created. Come on!
> 
> Or, maybe the defrag should unshare to gain performance? Yes, but only WHEN
> USER REQUESTS IT. So the defrag can unshare,
> but only by request. Since this means that user is reversing his previous
> command to not unshare, this has to be explicitly requested by the user, not
> part of the default defrag operation.
> 
> 
> > As mentioned in the previous email, we actually did have a (mostly)
> > working reflink-aware defrag a few years back.  It got removed because
> > it had serious performance issues.  Note that we're not talking a few
> > seconds of extra time to defrag a full tree here, we're talking
> > double-digit _minutes_ of extra time to defrag a moderate sized (low
> > triple digit GB) subvolume with dozens of snapshots, _if you were lucky_
> > (if you weren't, you would be looking at potentially multiple _hours_ of
> > runtime for the defrag).  The performance scaled inversely proportionate
> > to the number of reflinks involved and the total amount of data in the
> > subvolume being defragmented, and was pretty bad even in the case of
> > only a couple of snapshots.
> > 
> > Ultimately, there are a couple of issues at play here:
> 
> I'll reply to this in another post. This one is getting a bit too long.
> 
> 
