From: General Zed <general-zed@zedlx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Chris Murphy <lists@colorremedies.com>,
Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Fri, 13 Sep 2019 21:56:42 -0400
Message-ID: <20190913215642.Horde.MvjbFry-r1RYjoGFfEha7aE@server53.web-hosting.com>
In-Reply-To: <20190914005655.GH22121@hungrycats.org>
Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>> > >
>> > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>> > >
>> > > > Don't forget you have to write new checksum and free space tree pages.
>> > > > In the worst case, you'll need about 1GB of new metadata pages for each
>> > > > 128MB you defrag (though you get to delete 99.5% of them immediately
>> > > > after).
>> > >
>> > > Yes, here we are debating some worst-case scenario which is actually
>> > > impossible in practice due to various reasons.
>> >
>> > No, it's quite possible. A log file written slowly on an active
>> > filesystem above a few TB will do that accidentally. Every now and then
>> > I hit that case. It can take several hours to do a logrotate on spinning
>> > arrays because of all the metadata fetches and updates associated with
>> > worst-case file delete. Long enough to watch the delete happen, and
>> > even follow along in the source code.
>> >
>> > I guess if I did a proactive defrag every few hours, it might take less
>> > time to do the logrotate, but that would mean spreading out all the
>> > seeky IO load during the day instead of getting it all done at night.
>> > Logrotate does the same job as defrag in this case (replacing a file in
>> > thousands of fragments spread across the disk with a few large fragments
>> > close together), except logrotate gets better compression.
>> >
>> > To be more accurate, the example I gave above is the worst case you
>> > can expect from normal user workloads. If I throw in some reflinks
>> > and snapshots, I can make it arbitrarily worse, until the entire disk
>> > is consumed by the metadata update of a single extent defrag.
>> >
>>
>> I can't believe I am considering this case.
>>
>> So, we have a 1TB log file "ultralog" split into 256 million 4 KB extents
>> scattered randomly over the entire disk. We have 512 GB of free RAM and 2%
>> free disk space. The file needs to be defragmented.
>>
>> In order to do that, defrag needs to be able to copy-move multiple extents
>> in one batch, and update the metadata.
>>
>> The metadata has a total of at least 256 million entries. Each entry must
>> hold at least a pointer to the extent (8 bytes) and a checksum (8 bytes);
>> in reality, there is probably a lot of other data per entry.
>
> It's about 48KB per 4K extent, plus a few hundred bytes on average for each
> reference.
>
>> The metadata is organized as a b-tree. Therefore, nearby nodes should
>> contain data of consecutive file extents.
>
> It's 48KB per item. As you remove the original data extents, you will
> be touching a 16KB page in three trees for each extent that is removed:
> Free space tree, csum tree, and extent tree. This happens after the
> merged extent is created. It is part of the cleanup operation that
> gets rid of the original 4K extents.
>
> Because the file was written very slowly on a big filesystem, the extents
> are scattered pessimally all over the virtual address space, not packed
> close together. If there are a few hundred extent allocations between
> each log extent, then they will all occupy separate metadata pages.
> When it is time to remove them, each of these pages must be updated.
> This can be hit in a number of places in btrfs, including overwrite
> and delete.
>
> There's also 60ish bytes per extent in any subvol trees the file
> actually appears in, but you do get locality in that one (the key is
> inode and offset, so nothing can get between them and space them apart).
> That's 12GB and change (you'll probably completely empty most of the
> updated subvol metadata pages, so we can expect maybe 5 pages to remain
> including root and interior nodes). I haven't been unlucky enough to
> get a "natural" 12GB, but I got over 1GB a few times recently.
>
> Reflinks can be used to multiply that 12GB arbitrarily--you only get
> locality if the reflinks are consecutive in (inode, offset) space,
> so if the reflinks are scattered across subvols or files, they won't
> share pages.
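
To put rough numbers on the above (an upper-bound sketch: it assumes the full
three 16KB pages per removed 4K extent and no page sharing between
neighbouring extents):

    # Upper-bound model for the figures quoted above: removing each
    # scattered 4K extent dirties one 16KB page in each of three trees
    # (free space tree, csum tree, extent tree). Real numbers are lower
    # because some extents share metadata pages.
    EXTENT = 4 * 1024    # file bytes per fragment
    PAGE = 16 * 1024     # btrfs metadata node size
    TREES = 3            # free space tree, csum tree, extent tree

    def meta_written(data_bytes):
        return (data_bytes // EXTENT) * TREES * PAGE

    print(meta_written(128 * 1024**2) // 1024**2)  # 1536 (MB) -- same order as
                                                   # the ~1GB-per-128MB figure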
>
>> The trick, in this case, is to select one part of "ultralog" which is
>> localized in the metadata, and defragment it. Repeating this step will
>> ultimately defragment the entire file.
>>
>> So, the defrag selects some part of the metadata which is entirely a descendant
>> of some b-tree node not far from the bottom of the b-tree. It selects it such
>> that the required update to the metadata is less than, let's say, 64 MB, and
>> simultaneously the affected "ultralog" file fragments total less than 512 MB
>> (therefore, less than 128 thousand metadata leaf entries, each pointing to a
>> 4 KB fragment). Then it finds all the file extents pointed to by that part
>> of the metadata. They are consecutive (as file fragments), because that is
>> precisely how this part of the metadata was selected. Now the defrag can
>> safely copy-move those fragments to a new area and update the metadata.
>>
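A minimal sketch of that batch-selection step (hypothetical structure, not a
real btrfs API; it assumes the defrag can walk the file's leaf entries in key
order and estimate the metadata cost of each):

    # Walk leaf entries in (inode, offset) key order and cut the batch
    # as soon as either budget would be exceeded. Entries taken in key
    # order are consecutive file fragments, so the batch can be
    # copy-moved and its metadata updated as one unit.
    META_BUDGET = 64 * 1024**2    # max metadata update per batch
    DATA_BUDGET = 512 * 1024**2   # max file bytes moved per batch

    def next_batch(leaf_entries):
        """leaf_entries: (offset, length, meta_cost) tuples in key order."""
        batch, data, meta = [], 0, 0
        for offset, length, meta_cost in leaf_entries:
            if data + length > DATA_BUDGET or meta + meta_cost > META_BUDGET:
                break
            batch.append((offset, length))
            data, meta = data + length, meta + meta_cost
        return batch
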
>> In order to quickly select that small part of the metadata, the defrag needs
>> a metadata cache that can hold somewhat more than 128 thousand localized
>> metadata leaf entries. That definitely fits into 128 MB of RAM.
>>
>> Of course, there are many other small issues there, but this outlines the
>> general procedure.
>>
>> Problem solved?
>
> Problem missed completely. The forward reference updates were the only
> easy part.
>
> My solution is to detect this is happening in real time, and merge the
> extents while they're still too few to be a problem. Now you might be
> thinking "but doesn't that mean you'll merge the same data blocks over
> and over, wasting iops?" but really it's a perfectly reasonable trade
> considering the interest rates those unspent iops can collect on btrfs.
> If the target minimum extent size is 192K, you turn this 12GB problem into
> a 250MB one, and the 1GB problem that actually occurs becomes trivial.
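
The 12GB-to-250MB claim is just the extent-count ratio (rough arithmetic):

    # Merging on the fly so no extent is smaller than 192K replaces 48
    # of the 4K extents with one, so the metadata dirtied to clean them
    # up later shrinks by the same factor.
    ratio = (192 * 1024) // (4 * 1024)   # 48
    print(12 * 1024 // ratio)            # 12GB / 48 = 256 (MB), i.e. ~250MB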
>
> Another solution would be to get the allocator to reserve some space
> near growing files reserved for use by those files, so that the small
> fragments don't explode across the address space. Then we'd get locality
> in all four btrees. Other filesystems have heuristics all over their
> allocators to do things like this--btrfs seems to have a very minimal
> allocator that could stand much improvement.
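
The shape of such a reservation heuristic might look like this (purely
illustrative; this is not btrfs's allocator, and the names and the 8MB window
are invented):

    # Keep a per-file reservation window just past the file's last
    # allocation and serve small appends from it, so a slowly growing
    # file stays contiguous instead of scattering across the address
    # space. Assumes appends smaller than the window.
    RESERVE = 8 * 1024**2    # invented window size

    reservations = {}        # inode -> (next_free_offset, bytes_remaining)

    def alloc(inode, length, free_space_alloc):
        start, remaining = reservations.get(inode, (None, 0))
        if remaining < length:                  # window empty or exhausted
            start = free_space_alloc(RESERVE)   # one contiguous region
            remaining = RESERVE
        reservations[inode] = (start + length, remaining - length)
        return start
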
Ok, a fine solution. Basically, you improve the autodefrag to detect
this specific situation.
Another way to solve this issue is to run the on-demand defrag
sufficiently often. You order the defrag to defragment only that one
specific file, or you order it to find and defrag only the 0.1% most
fragmented files (and the pathological file should fall within that
0.1%).
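
Picking that 0.1% is easy to prototype from userspace; a sketch (filefrag(8)
is a real tool, the ranking script itself is hypothetical):

    # Count each file's extents with filefrag(8) and return the most
    # fragmented 0.1%, i.e. the set the on-demand defrag above would be
    # pointed at.
    import os, subprocess

    def extent_count(path):
        out = subprocess.run(["filefrag", path], capture_output=True, text=True).stdout
        return int(out.rsplit(":", 1)[1].split()[0])   # "<path>: N extents found"

    def worst_files(root, fraction=0.001):
        paths = [os.path.join(d, f) for d, _, names in os.walk(root) for f in names]
        paths.sort(key=extent_count, reverse=True)
        return paths[:max(1, int(len(paths) * fraction))]
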
But this is a very specific and rare case that we are discussing here.
So, that's it.