Linux-BTRFS Archive on lore.kernel.org
From: General Zed <general-zed@zedlx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Mon, 16 Sep 2019 22:30:39 -0400
Message-ID: <20190916223039.Horde.HZvhrBkQsN12DF6IDemvio6@server53.web-hosting.com> (raw)
In-Reply-To: <20190917004916.GD24379@hungrycats.org>


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Mon, Sep 16, 2019 at 07:42:51AM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > Your defrag ideas are interesting, but you should spend a lot more
>> > time learning the btrfs fundamentals before continuing.  Right now
>> > you do not understand what btrfs is capable of doing easily, and what
>> > requires such significant rework in btrfs to implement that the result
>> > cannot be considered the same filesystem.  This is impairing the quality
>> > of your design proposals and reducing the value of your contribution
>> > significantly.
>>
>> Ok, that was a shot at me; and I admit, guilty as charged. I barely have a
>> clue about btrfs.
>> Now it's my turn to shoot. Apparently, the people who are implementing the
>> btrfs defrag, or at least the ones that responded to my post, seem to have
>> no clue about how on-demand defrag solutions typically work. I had to
>> explain the usual tricks involved in defragmentation, and it was like
>> talking to complete rookies. None of you even considered a full-featured
>> defrag solution; all you are doing is partial solutions.
>
> Take a look at btrfs RAID5/6 some time, if you want to see rookie mistakes...
>
>> And, you all got lost in implementation details. How many times have I been
>> told here that some operation cannot be performed, and then it turned out
>> the opposite. You have all sunk into some strange state of mind where every
>> possible excuse is being made in order not to start working on a better,
>> holistic defrag solution.
>>
>> And you even misunderstood me when I said "holistic defrag"; you thought I
>> was talking about a full defrag. No. A full defrag is a defrag performed on
>> all the data. A holistic defrag can be performed on only some data, but it
>> is holistic in the sense that it uses whole information about a filesystem,
>> not just a partial view of it. A holistic defrag is better than a partial
>> defrag: it is faster and produces better results, and it can defrag a wider
>> spectrum of cases. Why? Because a holistic defrag takes everything into
>> account.
>
> What I'm looking for is a quantitative approach.  Sort the filesystem
> regions by how bad they are (in terms of measurable negative outcomes
> like poor read performance, pathological metadata updates, and future
> allocation performance), then apply mitigation in increasing order of
> cost-benefit ratio (or at least filter by cost-benefit ratio if you can't
> sort without reading the whole filesystem) until a minimum threshold
> is reached, then stop.  This lets the mitigation scale according to
> the available maintenance window, i.e. if you have 5% of a day for
> maintenance, you attack the worst 5% of the filesystem, then stop.

Any good defrag solution should be able to prioritize work on the most  
fragmented parts of the filesystem. That's a given.
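The budget-bounded, cost-benefit-ordered mitigation Zygo describes above can be sketched roughly as follows. Everything here is illustrative, not btrfs code: the region metrics, the cost model, and the names are hypothetical placeholders.

```python
# Hypothetical sketch: sort regions by benefit-per-cost, mitigate the
# best-value ones until the maintenance budget runs out or the
# remaining work drops below a worthwhile threshold.
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    badness: float   # measured harm, e.g. extra seeks per read
    cost: float      # estimated I/O cost of fixing this region

def mitigate(regions, budget, min_ratio=1.0):
    """Fix the best-value regions within the given maintenance budget."""
    # Highest benefit-per-cost first.
    candidates = sorted(regions, key=lambda r: r.badness / r.cost,
                        reverse=True)
    fixed = []
    for r in candidates:
        if r.badness / r.cost < min_ratio:
            break               # everything left isn't worth its cost
        if r.cost > budget:
            continue            # too expensive for this window
        budget -= r.cost
        fixed.append(r.name)
    return fixed
```

With a 5%-of-a-day window, `budget` would simply be 5% of the nightly I/O allowance, so the scheme scales to whatever maintenance time is available.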

> In that respect I think we might be coming toward the same point, but
> from different directions:  you seem to think the problem is easy to
> solve at scale,

Yes!

> and I think that's impossible so I start from designs
> that make forward progress with a fixed allocation of resources.

Well, that's not useless, but it's kind of meh. Waste of time. Solve  
the problem like a real man! Shoot with thermonuclear weapons only!

>> So I think you should all inform yourselves a little better about the
>> various defrag algorithms and solutions that exist. Apparently, you have
>> all lost sight of the big picture. You can't see the forest for the trees.
>
> I can see the woods, but any solution that starts with "enumerate all
> the trees" will be met with extreme skepticism, unless it can do that
> enumeration incrementally.

I think that I'm close to a solution that only needs to scan the  
free-space tree in its entirety at the start. All other trees can be  
only partially scanned at the start; as the defrag progresses, it will  
eventually go through all of them (except when defragging only a part  
of the partition). If a partition is to be only partially defragged,  
then those trees do not need to be read in their entirety. Only the  
free-space tree needs to be read in full at the start (plus the  
virtual-to-physical address translation trees, which are small, I guess).
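The scheme above (eager free-space scan, lazy everything else) can be sketched like this. The `FilesystemView` class and its methods are hypothetical stand-ins, not btrfs interfaces:

```python
# Illustrative sketch: read the free-space tree in full up front, and
# walk every other tree on demand, so a partial defrag never pays for
# a full metadata scan. All structures here are hypothetical.
class FilesystemView:
    def __init__(self, fs):
        self.fs = fs
        # Read in its entirety at start: where the holes are.
        self.free_space = fs.read_free_space_tree()
        self._subtree_cache = {}

    def subtree(self, root_id):
        # Other trees are fetched lazily, one subtree at a time, and
        # cached so repeated visits don't re-read metadata.
        if root_id not in self._subtree_cache:
            self._subtree_cache[root_id] = self.fs.read_subtree(root_id)
        return self._subtree_cache[root_id]
```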

>> Well, no. Perhaps the word "defrag" can have a wider and narrower sense. So
>> in a narrower sense, "defrag" means what you just wrote. In that sense, the
>> word "defrag" means practically the same as "merge", so why not just use the
>> word "merge" to remove any ambiguities. The "merge" is the only operation
>> that decreases the number of fragments (besides "delete"). Perhaps you meant
>> move&merge. But, commonly, the word "defrag" is used in a wider sense, which
>> is not the one you described.
>
> This is fairly common on btrfs:  the btrfs words don't mean the same as
> other words, causing confusion.  How many copies are there in a btrfs
> 4-disk raid1 array?

2 copies of everything, except the superblock which has 2-6 copies.
>
>> > > > Dedupe on btrfs also requires the ability to split and merge extents;
>> > > > otherwise, we can't dedupe an extent that contains a combination of
>> > > > unique and duplicate data.  If we try to just move references around
>> > > > without splitting extents into all-duplicate and all-unique extents,
>> > > > the duplicate blocks become unreachable, but are not  
>> deallocated.  If we
>> > > > only split extents, fragmentation overhead gets bad.  Before creating
>> > > > thousands of references to an extent, it is worthwhile to  
>> merge it with
>> > > > as many of its neighbors as possible, ideally by picking the biggest
>> > > > existing garbage-free extents available so we don't have to do defrag.
>> > > > As we examine each extent in the filesystem, it may be best to send
>> > > > to defrag, dedupe, or garbage collection--sometimes more than one of
>> > > > those.
>> > >
>> > > This is solved simply by always running defrag before dedupe.
>> >
>> > Defrag and dedupe in separate passes is nonsense on btrfs.
>>
>> Defrag can be run without dedupe.
>
> Yes, but if you're planning to run both on the same filesystem, they
> had better be aware of each other.

On-demand defrag doesn't need to be aware of on-demand dedupe, or  
only in the sense that dedupe should be shut down while defrag is  
running.

Perhaps you were referring to an on-the-fly dedupe. In that case, yes.

>> Now, how to organize dedupe? I didn't think about it yet. I'll leave it to
>> you, but it seems to me that defrag should be involved there. And, my defrag
>> solution would help there very, very much.
>
> I can't see defrag in isolation as anything but counterproductive to
> dedupe (and vice versa).

Share-preserving defrag can't be harmful to dedupe.

> A critical feature of the dedupe is to do extent splits along duplicate
> content boundaries, so that you're not limited to deduping only
> whole-extent matches.  This is especially necessary on btrfs because
> you can't split an extent in place--if you find a partial match,
> you have to find a new home for the unique data, which means you
> get a lot of little fragments that are inevitably distant from their
> logically adjacent neighbors which themselves were recently replaced
> with a physically distant identical extent.
>
> Sometimes both copies of the data suck (both have many fragments
> or uncollected garbage), and at that point you want to do some
> preprocessing--copy the data to make the extent you want, then use
> dedupe to replace both bad extents with your new good one.  That's an
> opportunistic extent merge and it needs some defrag logic to do proper
> cost estimation.
>
> If you have to copy 64MB of unique data to dedupe a 512K match, the extent
> split cost is far higher than if you have a 2MB extent with 512K match.
> So there should be sysadmin-tunable parameters that specify how much
> to spend on diminishing returns:  maybe you don't deduplicate anything
> that saves less than 1% of the required copy bytes, because you have
> lots of duplicates in the filesystem and you are willing to spend 1% of
> your disk space to not be running dedupe all day.  Similarly you don't
> defragment (or move for any reason) extents unless that move gives you
> significantly better read performance or consolidate diffuse allocations
> across metadata pages, because there are millions of extents to choose
> from and it's not necessary to optimize them all.
>
> On the other hand, if you find you _must_ move the 64MB of data for
> other reasons (e.g. to consolidate free space) then you do want to do
> the dedupe because it will make the extent move slightly faster (63.5MB
> of data + reflink instead of 64MB copy).  So you definitely want one
> program looking at both things.
>
> Maybe there's a way to plug opportunistic dedupe into a defrag algorithm
> the same way there's a way to plug opportunistic defrag into a dedupe
> algorithm.  I don't know, I'm coming at this from the dedupe side.
> If the solution looks anything like "run both separately" then I'm
> not interested.

I would suggest one of the two following simple solutions:
    a) the on-demand defrag should be run BEFORE AND AFTER the  
on-demand dedupe.
or b) the on-demand defrag should be run BEFORE the on-demand dedupe,  
and on-demand dedupe uses defrag functionality to defrag while dedupe  
is in progress.

So I guess you were thinking about solution b) all along when you  
said that dedupe and defrag need to be related.
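The sysadmin-tunable threshold Zygo describes above (don't deduplicate when the savings are a tiny fraction of the copy work needed to split the extent) boils down to a small cost test. A rough sketch, with an illustrative function name and default ratio:

```python
# Hypothetical cost model: a partial-extent match on btrfs forces the
# unique remainder of the extent to be copied out. Skip the dedupe if
# the bytes saved are under a tunable fraction of the bytes copied.
def worth_deduping(extent_len, match_len, min_savings_ratio=0.01):
    """Return True if a partial-extent match justifies the split copy."""
    copy_bytes = extent_len - match_len   # unique data that must move
    if copy_bytes <= 0:
        return True                       # whole-extent match: free win
    return match_len / copy_bytes >= min_savings_ratio
```

This reproduces the numbers in the quoted example: a 512K match inside a 64MB extent saves well under 1% of the copy work and is skipped, while the same match inside a 2MB extent saves about a third and is taken.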

>> > Extent splitting in-place is not possible on btrfs, so extent boundary
>> > changes necessarily involve data copies.  Reference counting is done
>> > by extent in btrfs, so it is only possible to free complete extents.
>>
>> Great, there is reference counting in btrfs. That helps. Good design.
>
> Well, I say "reference counting" because I'm simplifying for an audience
> that does not yet all know the low-level details.  The counter, such as
> it is, gives values "zero" or "more than zero."  You never know exactly
> how many references there are without doing the work to enumerate them.
> The "is extent unique" function in btrfs runs the enumeration loop until
> the second reference is found or the supply of references is exhausted,
> whichever comes first.  It's a tradeoff to make snapshots fast.

Well, that's a disappointment.
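The early-exit uniqueness test described above (enumerate backrefs, stop at the second one) looks roughly like this. `iter_references` is a hypothetical stand-in for the backref-walking machinery, not a real btrfs call:

```python
# Sketch of the "is extent unique" check: btrfs has no true refcount,
# so uniqueness is decided by enumerating references and bailing out
# as soon as a second one turns up. Names here are illustrative.
def extent_is_unique(extent, iter_references):
    """True iff at most one reference to the extent exists."""
    count = 0
    for _ref in iter_references(extent):
        count += 1
        if count >= 2:
            return False   # second reference found: stop enumerating
    return True
```

The tradeoff is visible in the shape of the loop: snapshot creation never has to touch a counter, but answering "exactly how many references?" costs a full enumeration.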

> When a reference is created to a new extent, it refers to the entire
> extent.  References can refer to parts of extents (the reference has an
> offset and length field), so when an extent is partially overwritten, the
> extent is not modified.  Only the reference is modified, to make it refer
> to a subset of the extent (references in other snapshots are not changed,
> and the extent data itself is immutable).  This makes POSIX fast, but it
> creates some headaches related to garbage collection, dedupe, defrag, etc.

Ok, got it. Thanks.
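The offset/length reference scheme in the quoted paragraph can be modelled in a few lines. This is only an illustrative model of the behaviour described (immutable extent data, references narrowed on partial overwrite), not btrfs's on-disk layout:

```python
# Illustrative model: extent payload is immutable; a reference carries
# (offset, length) into it. A partial overwrite narrows only the live
# reference; references held by snapshots are untouched.
from dataclasses import dataclass

@dataclass
class Extent:
    start: int     # on-disk address
    length: int    # immutable payload size

@dataclass
class Ref:
    extent: Extent
    offset: int    # byte offset into the extent
    length: int    # bytes of the extent this file view uses

def overwrite_tail(ref, new_tail_len):
    """A file overwrites its last new_tail_len bytes: the old reference
    shrinks to the surviving prefix; the extent itself is not modified,
    so other snapshots still see the full range."""
    ref.length -= new_tail_len
    return ref
```

This is exactly where the garbage-collection headache comes from: once every reference covers only a subset, the uncovered blocks are dead weight that can only be reclaimed by freeing the complete extent.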



