Re: Feature requests: online backup - defrag - change RAID level

From: General Zed <general-zed@zedlx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Tue, 17 Sep 2019 06:07:24 -0400
Message-ID: <20190917060724.Horde.2JvifSdoEgszEJI8_4CFSH8@server53.web-hosting.com>
In-Reply-To: <20190917053055.GG24379@hungrycats.org>

Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Mon, Sep 16, 2019 at 10:30:39PM -0400, General Zed wrote:
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > and I think that's impossible so I start from designs
>> > that make forward progress with a fixed allocation of resources.
>>
>> Well, that's not useless, but it's kind of meh. Waste of time. Solve the
>> problem like a real man! Shoot with thermonuclear weapons only!
>
> I have thermonuclear weapons:  the metadata trees in my filesystems.  ;)
>
>> > > So I think you should all inform yourselves a little better about various
>> > > defrag algorithms and solutions that exist. Apparently, you have all lost
>> > > sight of the big picture. You can't see the wood for the trees.
>> >
>> > I can see the woods, but any solution that starts with "enumerate all
>> > the trees" will be met with extreme skepticism, unless it can do that
>> > enumeration incrementally.
>>
>> I think that I'm close to a solution that only needs to scan the free-space
>> tree in its entirety at the start. All other trees can be only partially
>> scanned. I mean, at the start. As the defrag progresses, it will go through all
>> the trees (except when defragging only a part of the partition). If a
>> partition is to be only partially defragged, then the trees do not need to
>> be read in their entirety. Only the free-space tree needs to be read in its
>> entirety at the start (and the virtual-physical address translation trees,
>> which are small, I guess).
>
> I doubt that on a 50TB filesystem you need to read the whole tree...are
> you going to globally optimize 50TB at once?  That will take a while.

I need to read the whole free-space tree to find the few regions with
the most free space. Those will be used as destinations for the
defragmented data.

If a mostly free region of sufficient size (a few GB) can be found
sooner, then there is no need to read the entire free-space tree. But
on a disk with less than 15% free space, it would be advisable to read
the entire free-space tree to find the less crowded regions of the
filesystem.
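
A rough sketch of that first pass, just to show the idea. It assumes
the free-space tree has already been read into a list of
(start, length) pairs and that the address space is cut into
fixed-size regions. The function name, the parameters and the
fixed-size-region layout are all made up for illustration; none of it
is btrfs code.

```python
import heapq
from collections import defaultdict


def top_free_regions(free_extents, region_size, k):
    """Return the k regions with the most free bytes.

    free_extents: iterable of (start, length) pairs, in bytes (assumed input).
    region_size:  size of each fixed region, in bytes (assumed layout).
    Result: list of (region_index, free_bytes), largest first.
    """
    free_per_region = defaultdict(int)
    for start, length in free_extents:
        end = start + length
        # A free extent may cross region boundaries, so split it up.
        while start < end:
            region = start // region_size
            chunk_end = min(end, (region + 1) * region_size)
            free_per_region[region] += chunk_end - start
            start = chunk_end
    return heapq.nlargest(k, free_per_region.items(), key=lambda item: item[1])


if __name__ == "__main__":
    extents = [(0, 10), (90, 40), (250, 50)]
    print(top_free_regions(extents, region_size=100, k=2))
```

The early exit mentioned above (stop as soon as one sufficiently empty
region shows up) would be a check inside the loop. I left it out to
keep the sketch short.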

> Start with a 100GB sliding window, maybe.

There will be something similar to a sliding window (in virtual
address space). The likely size of the window for "typical" desktops
is around 1 GB, no more. In complicated filesystems, it will be
smaller. Really, you don't need a very big sliding window, because
little can be gained by enlarging it. For a 256 GB drive, a small
sliding window is quite fine. The size of the sliding window can be
dynamically adjusted depending on several factors, but mostly on
available RAM and filesystem complexity (the number of extents and
reflinks in the sliding window).

So, the defrag will be tunable by specifying the amount of RAM it may
use. If you give it insufficient RAM, it will slow down considerably.
I would recommend a minimum of 400 MB of RAM for typical desktops.
But this should be tested on an actual implementation; I'm just
guessing at this point. It could be better or worse.
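
To make the RAM/window trade-off concrete, here is a guess at how the
window size could be derived from the RAM budget. It assumes the
defrag keeps one fixed-size record in memory per extent or reflink in
the window. The 128 bytes per record, the 16 MB lower bound and all
the names are invented placeholders, not measurements.

```python
GIB = 1 << 30
MIB = 1 << 20


def window_bytes(ram_budget, items_per_gib, bytes_per_item=128,
                 max_window=GIB, min_window=16 * MIB):
    """Pick a sliding-window size whose bookkeeping fits in ram_budget.

    ram_budget:     bytes of RAM the defrag is allowed to use.
    items_per_gib:  extents plus reflinks per GiB (filesystem complexity).
    bytes_per_item: assumed in-memory cost of tracking one item.
    """
    if items_per_gib <= 0:
        return max_window
    affordable_items = ram_budget // bytes_per_item
    size = affordable_items * GIB // items_per_gib
    # Clamp: never above ~1 GiB, never below a small working minimum.
    return max(min_window, min(max_window, size))


if __name__ == "__main__":
    print(window_bytes(400 * MIB, items_per_gib=1_000_000))
    print(window_bytes(400 * MIB, items_per_gib=100_000_000))
```

With these placeholder numbers, a simple filesystem gets the full
1 GiB window and a very complicated one gets a much smaller window.
That is the behaviour I have in mind, whatever the real constants
turn out to be.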

This sliding window won't be a perfect one (it can have
discontinuities and fragments), and a small amount of data that is
not in the sliding window but in logically adjacent areas will also
be scanned.

So, I'm designing a defrag that is fast, can use little RAM, and can
work in low free-space conditions. It can work on huge filesystems
and can handle a good number of pathological cases. It preserves all
file data sharing.

I hope that at least someone will be satisfied.

>> > This is fairly common on btrfs:  the btrfs words don't mean the same as
>> > other words, causing confusion.  How many copies are there in a btrfs
>> > 4-disk raid1 array?
>>
>> 2 copies of everything, except the superblock which has 2-6 copies.
>
> Good, you can enter the clubhouse.  A lot of new btrfs users are surprised
> it's less than 4.
>
>> > > > > This is solved simply by always running defrag before dedupe.
>> > > > Defrag and dedupe in separate passes is nonsense on btrfs.
>> > > Defrag can be run without dedupe.
>> > Yes, but if you're planning to run both on the same filesystem, they
>> > had better be aware of each other.
>>
>> On-demand defrag doesn't need to be aware of on-demand dedupe. Or, only in
>> the sense that dedupe should be shut down while defrag is running.
>>
>> Perhaps you were referring to an on-the-fly dedupe. In that case, yes.
>
> My dedupe runs continuously (well, polling with incremental scan).
> It doesn't shut down.

Ah... so I suggest that the defrag should temporarily shut down  
dedupe, at least in the initial versions of defrag. Once both defrag  
and dedupe are working standalone, the merging effort can begin.

>> > > Now, how to organize dedupe? I didn't think about it yet. I'll leave it to
>> > > you, but it seems to me that defrag should be involved there. And, my defrag
>> > > solution would help there very, very much.
>> >
>> > I can't see defrag in isolation as anything but counterproductive to
>> > dedupe (and vice versa).
>>
>> Share-preserving defrag can't be harmful to dedupe.
>
> Sure it can.  Dedupe needs to split extents by content, and btrfs only
> supports that by copying.  If defrag is making new extents bigger before
> dedupe gets to them, there is more work for dedupe when it needs to make
> extents smaller again.
>
>> I would suggest one of the two following simple solutions:
>>    a) the on-demand defrag should be run BEFORE AND AFTER the on-demand
>> dedupe.
>> or b) the on-demand defrag should be run BEFORE the on-demand dedupe, and
>> on-demand dedupe uses defrag functionality to defrag while dedupe is in
>> progress.
>>
>> So I guess you were thinking about the solution b) all the time when you
>> said that dedupe and defrag need to be related.
>
> Well, both would be running continuously in the same process, so
> they would negotiate with each other as required.  Dedupe runs first
> on new extents to create a plan for increasing extent sharing, then
> defrag creates a plan for sufficient logical/physical contiguity of
> those extents after dedupe has cut them into content-aligned pieces.
> Extents that are entirely duplicate simply disappear and do not form
> part of the defrag workload (at least until it is time to defragment
> free space...).  Both plans are combined and optimized, then the final
> data relocation command sequence is sent to the filesystem.

I think that this kind of close dedupe-defrag integration should
mostly be left to dedupe developers. First, both defrag and dedupe
should work perfectly on their own. Then, an interface to defrag
should be made available to dedupe developers. I think that the
batch-update functionality (it takes lots of extents and an empty
free-space region, then writes the defragmented extents to the given
region) is of particular interest to dedupe.
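
As an illustration of what I mean by batch update, here is a sketch of
the planning step only. It takes a set of extents and an empty
destination region and assigns each extent a new, contiguous physical
position in file order. The Extent and Move types and the function
name are hypothetical, and the sketch ignores sharing, alignment and
the actual data copy.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    file_id: int
    logical_offset: int
    physical_start: int
    length: int


class Move(NamedTuple):
    extent: Extent
    new_physical_start: int


def plan_batch_update(extents, region_start, region_length):
    """Lay the given extents out back to back inside an empty region."""
    total = sum(e.length for e in extents)
    if total > region_length:
        raise ValueError("extents do not fit in the destination region")
    moves = []
    cursor = region_start
    # File order first, then logical order within each file, so that each
    # file ends up physically contiguous in the destination region.
    for e in sorted(extents, key=lambda e: (e.file_id, e.logical_offset)):
        moves.append(Move(e, cursor))
        cursor += e.length
    return moves


if __name__ == "__main__":
    batch = [
        Extent(1, 4096, 900_000, 4096),
        Extent(1, 0, 500_000, 4096),
        Extent(2, 0, 700_000, 8192),
    ]
    for move in plan_batch_update(batch, 1_000_000, 65_536):
        print(move)
```

A dedupe tool could call something like this after it has decided
which extents survive, and then hand the resulting move list to
whatever performs the relocation.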

>> > > > Extent splitting in-place is not possible on btrfs, so extent boundary
>> > > > changes necessarily involve data copies.  Reference counting is done
>> > > > by extent in btrfs, so it is only possible to free complete extents.
>> > >
>> > > Great, there is reference counting in btrfs. That helps. Good design.
>> >
>> > Well, I say "reference counting" because I'm simplifying for an audience
>> > that does not yet all know the low-level details.  The counter, such as
>> > it is, gives values "zero" or "more than zero."  You never know exactly
>> > how many references there are without doing the work to enumerate them.
>> > The "is extent unique" function in btrfs runs the enumeration loop until
>> > the second reference is found or the supply of references is exhausted,
>> > whichever comes first.  It's a tradeoff to make snapshots fast.
>>
>> Well, that's a disappointment.
>>
>> > When a reference is created to a new extent, it refers to the entire
>> > extent.  References can refer to parts of extents (the reference has an
>> > offset and length field), so when an extent is partially overwritten, the
>> > extent is not modified.  Only the reference is modified, to make it refer
>> > to a subset of the extent (references in other snapshots are not changed,
>> > and the extent data itself is immutable).  This makes POSIX fast, but it
>> > creates some headaches related to garbage collection, dedupe, defrag, etc.
>>
>> Ok, got it. Thanks.
>>
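
To check that I understood the "is extent unique" description quoted
above: the test enumerates references and stops at the second one, so
it never learns the full count. A toy model of that loop follows. The
function name and the idea of passing references in as a plain
iterator are my own simplification, not the btrfs implementation.

```python
def is_extent_unique(references):
    """Return True if fewer than two references exist.

    references: any iterable that enumerates the references to one extent.
    Enumeration stops as soon as a second reference is seen, so the cost
    does not grow with the total number of references.
    """
    count = 0
    for _ in references:
        count += 1
        if count == 2:
            return False
    return True


if __name__ == "__main__":
    print(is_extent_unique(iter(["snapshot-a"])))
    print(is_extent_unique(iter(["snapshot-a", "snapshot-b", "snapshot-c"])))
```

So the answer is always "shared or not shared", never "shared N
times", which matches the "zero" or "more than zero" counter
described above.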