From: General Zed <firstname.lastname@example.org> To: Zygo Blaxell <email@example.com> Cc: Chris Murphy <firstname.lastname@example.org>, Btrfs BTRFS <email@example.com> Subject: Re: Feature requests: online backup - defrag - change RAID level Date: Fri, 13 Sep 2019 21:56:42 -0400 Message-ID: <20190913215642.Horde.MvjbFry-r1RYjoGFfEha7aE@server53.web-hosting.com> (raw) In-Reply-To: <20190914005655.GH22121@hungrycats.org> Quoting Zygo Blaxell <firstname.lastname@example.org>: > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote: >> >> Quoting Zygo Blaxell <email@example.com>: >> >> > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote: >> > > >> > > Quoting Zygo Blaxell <firstname.lastname@example.org>: >> > > >> > > > Don't forget you have to write new checksum and free space tree pages. >> > > > In the worst case, you'll need about 1GB of new metadata >> pages for each >> > > > 128MB you defrag (though you get to delete 99.5% of them immediately >> > > > after). >> > > >> > > Yes, here we are debating some worst-case scenaraio which is actually >> > > imposible in practice due to various reasons. >> > >> > No, it's quite possible. A log file written slowly on an active >> > filesystem above a few TB will do that accidentally. Every now and then >> > I hit that case. It can take several hours to do a logrotate on spinning >> > arrays because of all the metadata fetches and updates associated with >> > worst-case file delete. Long enough to watch the delete happen, and >> > even follow along in the source code. >> > >> > I guess if I did a proactive defrag every few hours, it might take less >> > time to do the logrotate, but that would mean spreading out all the >> > seeky IO load during the day instead of getting it all done at night. >> > Logrotate does the same job as defrag in this case (replacing a file in >> > thousands of fragments spread across the disk with a few large fragments >> > close together), except logrotate gets better compression. >> > >> > To be more accurate, the example I gave above is the worst case you >> > can expect from normal user workloads. If I throw in some reflinks >> > and snapshots, I can make it arbitrarily worse, until the entire disk >> > is consumed by the metadata update of a single extent defrag. >> > >> >> I can't believe I am considering this case. >> >> So, we have a 1TB log file "ultralog" split into 256 million 4 KB extents >> randomly over the entire disk. We have 512 GB free RAM and 2% free disk >> space. The file needs to be defragmented. >> >> In order to do that, defrag needs to be able to copy-move multiple extents >> in one batch, and update the metadata. >> >> The metadata has a total of at least 256 million entries, each of some size, >> but each one should hold at least a pointer to the extent (8 bytes) and a >> checksum (8 bytes): In reality, it could be that there is a lot of other >> data there per entry. > > It's about 48KB per 4K extent, plus a few hundred bytes on average for each > reference. > >> The metadata is organized as a b-tree. Therefore, nearby nodes should >> contain data of consecutive file extents. > > It's 48KB per item. As you remove the original data extents, you will > be touching a 16KB page in three trees for each extent that is removed: > Free space tree, csum tree, and extent tree. This happens after the > merged extent is created. It is part of the cleanup operation that > gets rid of the original 4K extents. > > Because the file was written very slowly on a big filesystem, the extents > are scattered pessimally all over the virtual address space, not packed > close together. If there are a few hundred extent allocations between > each log extent, then they will all occupy separate metadata pages. > When it is time to remove them, each of these pages must be updated. > This can be hit in a number of places in btrfs, including overwrite > and delete. > > There's also 60ish bytes per extent in any subvol trees the file > actually appears in, but you do get locality in that one (the key is > inode and offset, so nothing can get between them and space them apart). > That's 12GB and change (you'll probably completely empty most of the > updated subvol metadata pages, so we can expect maybe 5 pages to remain > including root and interior nodes). I haven't been unlucky enough to > get a "natural" 12GB, but I got over 1GB a few times recently. > > Reflinks can be used to multiply that 12GB arbitrarily--you only get > locality if the reflinks are consecutive in (inode, offset) space, > so if the reflinks are scattered across subvols or files, they won't > share pages. > >> The trick, in this case, is to select one part of "ultralog" which is >> localized in the metadata, and defragment it. Repeating this step will >> ultimately defragment the entire file. >> >> So, the defrag selects some part of metadata which is entirely a descendant >> of some b-tree node not far from the bottom of b-tree. It selects it such >> that the required update to the metadata is less than, let's say, 64 MB, and >> simultaneously the affected "ultralog" file fragments total less han 512 MB >> (therefore, less than 128 thousand metadata leaf entries, each pointing to a >> 4 KB fragment). Then it finds all the file extents pointed to by that part >> of metadata. They are consecutive (as file fragments), because we have >> selected such part of metadata. Now the defrag can safely copy-move those >> fragments to a new area and update the metadata. >> >> In order to quickly select that small part of metadata, the defrag needs a >> metatdata cache that can hold somewhat more than 128 thousand localized >> metadata leaf entries. That fits into 128 MB RAM definitely. >> >> Of course, there are many other small issues there, but this outlines the >> general procedure. >> >> Problem solved? > > Problem missed completely. The forward reference updates were the only > easy part. > > My solution is to detect this is happening in real time, and merge the > extents while they're still too few to be a problem. Now you might be > thinking "but doesn't that mean you'll merge the same data blocks over > and over, wasting iops?" but really it's a perfectly reasonable trade > considering the interest rates those unspent iops can collect on btrfs. > If the target minimum extent size is 192K, you turn this 12GB problem into > a 250MB one, and the 1GB problem that actually occurs becomes trivial. > > Another solution would be to get the allocator to reserve some space > near growing files reserved for use by those files, so that the small > fragments don't explode across the address space. Then we'd get locality > in all four btrees. Other filesystems have heuristics all over their > allocators to do things like this--btrfs seems to have a very minimal > allocator that could stand much improvement. Ok, a fine solution. Basically, you improve the autodefrag to detect this specific situation. Another way to solve this issue is to run the on-demand defrag sufficiently often. You order the defrag to defragment only that one specific file, or you order it to find and defrag only 0.1% of most fragmented files (and the pathological file should fall within those 0.1%). But, this is a very specific and rare case that we are talking about here. So, that's it.
next prev parent reply index Thread overview: 111+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-09-09 2:55 zedlryqc 2019-09-09 3:51 ` Qu Wenruo 2019-09-09 11:25 ` zedlryqc 2019-09-09 12:18 ` Qu Wenruo 2019-09-09 12:28 ` Qu Wenruo 2019-09-09 17:11 ` webmaster 2019-09-10 17:39 ` Andrei Borzenkov 2019-09-10 22:41 ` webmaster 2019-09-09 15:29 ` Graham Cobb 2019-09-09 17:24 ` Remi Gauvin 2019-09-09 19:26 ` webmaster 2019-09-10 19:22 ` Austin S. Hemmelgarn 2019-09-10 23:32 ` webmaster 2019-09-11 12:02 ` Austin S. Hemmelgarn 2019-09-11 16:26 ` Zygo Blaxell 2019-09-11 17:20 ` webmaster 2019-09-11 18:19 ` Austin S. Hemmelgarn 2019-09-11 20:01 ` webmaster 2019-09-11 21:42 ` Zygo Blaxell 2019-09-13 1:33 ` General Zed 2019-09-11 21:37 ` webmaster 2019-09-12 11:31 ` Austin S. Hemmelgarn 2019-09-12 19:18 ` webmaster 2019-09-12 19:44 ` Chris Murphy 2019-09-12 21:34 ` General Zed 2019-09-12 22:28 ` Chris Murphy 2019-09-12 22:57 ` General Zed 2019-09-12 23:54 ` Zygo Blaxell 2019-09-13 0:26 ` General Zed 2019-09-13 3:12 ` Zygo Blaxell 2019-09-13 5:05 ` General Zed 2019-09-14 0:56 ` Zygo Blaxell 2019-09-14 1:50 ` General Zed 2019-09-14 4:42 ` Zygo Blaxell 2019-09-14 4:53 ` Zygo Blaxell 2019-09-15 17:54 ` General Zed 2019-09-16 22:51 ` Zygo Blaxell 2019-09-17 1:03 ` General Zed 2019-09-17 1:34 ` General Zed 2019-09-17 1:44 ` Chris Murphy 2019-09-17 4:55 ` Zygo Blaxell 2019-09-17 4:19 ` Zygo Blaxell 2019-09-17 3:10 ` General Zed 2019-09-17 4:05 ` General Zed 2019-09-14 1:56 ` General Zed [this message] 2019-09-13 5:22 ` General Zed 2019-09-13 6:16 ` General Zed 2019-09-13 6:58 ` General Zed 2019-09-13 9:25 ` General Zed 2019-09-13 17:02 ` General Zed 2019-09-14 0:59 ` Zygo Blaxell 2019-09-14 1:28 ` General Zed 2019-09-14 4:28 ` Zygo Blaxell 2019-09-15 18:05 ` General Zed 2019-09-16 23:05 ` Zygo Blaxell 2019-09-13 7:51 ` General Zed 2019-09-13 11:04 ` Austin S. Hemmelgarn 2019-09-13 20:43 ` Zygo Blaxell 2019-09-14 0:20 ` General Zed 2019-09-14 18:29 ` Chris Murphy 2019-09-14 23:39 ` Zygo Blaxell 2019-09-13 11:09 ` Austin S. Hemmelgarn 2019-09-13 17:20 ` General Zed 2019-09-13 18:20 ` General Zed 2019-09-12 19:54 ` Austin S. Hemmelgarn 2019-09-12 22:21 ` General Zed 2019-09-13 11:53 ` Austin S. Hemmelgarn 2019-09-13 16:54 ` General Zed 2019-09-13 18:29 ` Austin S. Hemmelgarn 2019-09-13 19:40 ` General Zed 2019-09-14 15:10 ` Jukka Larja 2019-09-12 22:47 ` General Zed 2019-09-11 21:37 ` Zygo Blaxell 2019-09-11 23:21 ` webmaster 2019-09-12 0:10 ` Remi Gauvin 2019-09-12 3:05 ` webmaster 2019-09-12 3:30 ` Remi Gauvin 2019-09-12 3:33 ` Remi Gauvin 2019-09-12 5:19 ` Zygo Blaxell 2019-09-12 21:23 ` General Zed 2019-09-14 4:12 ` Zygo Blaxell 2019-09-16 11:42 ` General Zed 2019-09-17 0:49 ` Zygo Blaxell 2019-09-17 2:30 ` General Zed 2019-09-17 5:30 ` Zygo Blaxell 2019-09-17 10:07 ` General Zed 2019-09-17 23:40 ` Zygo Blaxell 2019-09-18 4:37 ` General Zed 2019-09-18 18:00 ` Zygo Blaxell 2019-09-10 23:58 ` webmaster 2019-09-09 23:24 ` Qu Wenruo 2019-09-09 23:25 ` webmaster 2019-09-09 16:38 ` webmaster 2019-09-09 23:44 ` Qu Wenruo 2019-09-10 0:00 ` Chris Murphy 2019-09-10 0:51 ` Qu Wenruo 2019-09-10 0:06 ` webmaster 2019-09-10 0:48 ` Qu Wenruo 2019-09-10 1:24 ` webmaster 2019-09-10 1:48 ` Qu Wenruo 2019-09-10 3:32 ` webmaster 2019-09-10 14:14 ` Nikolay Borisov 2019-09-10 22:35 ` webmaster 2019-09-11 6:40 ` Nikolay Borisov 2019-09-10 22:48 ` webmaster 2019-09-10 23:14 ` webmaster 2019-09-11 0:26 ` webmaster 2019-09-11 0:36 ` webmaster 2019-09-11 1:00 ` webmaster 2019-09-10 11:12 ` Austin S. Hemmelgarn 2019-09-09 3:12 webmaster
Reply instructions: You may reply publically to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20190913215642.Horde.MvjbFry-r1RYjoGFfEha7aE@server53.web-hosting.com \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Linux-BTRFS Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \ email@example.com public-inbox-index linux-btrfs Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs AGPL code for this site: git clone https://public-inbox.org/public-inbox.git