From: Zygo Blaxell <firstname.lastname@example.org>
To: General Zed <email@example.com>
Cc: Chris Murphy <firstname.lastname@example.org>,
    "Austin S. Hemmelgarn" <email@example.com>,
    Btrfs BTRFS <firstname.lastname@example.org>
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Fri, 13 Sep 2019 20:56:55 -0400
Message-ID: <20190914005655.GH22121@hungrycats.org>
In-Reply-To: <20190913010552.Horde.cUL303XsYbqREB5g0iiCDKd@server53.web-hosting.com>

On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
>
> Quoting Zygo Blaxell <email@example.com>:
>
> > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> > >
> > > Quoting Zygo Blaxell <firstname.lastname@example.org>:
> > >
> > > > Don't forget you have to write new checksum and free space tree pages.
> > > > In the worst case, you'll need about 1GB of new metadata pages for each
> > > > 128MB you defrag (though you get to delete 99.5% of them immediately
> > > > after).
> > >
> > > Yes, here we are debating some worst-case scenario which is actually
> > > impossible in practice due to various reasons.
> >
> > No, it's quite possible.  A log file written slowly on an active
> > filesystem above a few TB will do that accidentally.  Every now and then
> > I hit that case.  It can take several hours to do a logrotate on spinning
> > arrays because of all the metadata fetches and updates associated with
> > worst-case file delete.  Long enough to watch the delete happen, and
> > even follow along in the source code.
> >
> > I guess if I did a proactive defrag every few hours, it might take less
> > time to do the logrotate, but that would mean spreading out all the
> > seeky IO load during the day instead of getting it all done at night.
> > Logrotate does the same job as defrag in this case (replacing a file in
> > thousands of fragments spread across the disk with a few large fragments
> > close together), except logrotate gets better compression.
> >
> > To be more accurate, the example I gave above is the worst case you
> > can expect from normal user workloads.  If I throw in some reflinks
> > and snapshots, I can make it arbitrarily worse, until the entire disk
> > is consumed by the metadata update of a single extent defrag.
> >
>
> I can't believe I am considering this case.
>
> So, we have a 1TB log file "ultralog" split into 256 million 4 KB extents
> randomly over the entire disk. We have 512 GB free RAM and 2% free disk
> space. The file needs to be defragmented.
>
> In order to do that, defrag needs to be able to copy-move multiple extents
> in one batch, and update the metadata.
>
> The metadata has a total of at least 256 million entries, each of some size,
> but each one should hold at least a pointer to the extent (8 bytes) and a
> checksum (8 bytes). In reality, it could be that there is a lot of other
> data there per entry.

It's about 48KB per 4K extent, plus a few hundred bytes on average for each
reference.

> The metadata is organized as a b-tree. Therefore, nearby nodes should
> contain data of consecutive file extents.

It's 48KB per item.  As you remove the original data extents, you will
be touching a 16KB page in three trees for each extent that is removed:
free space tree, csum tree, and extent tree.  This happens after the
merged extent is created.  It is part of the cleanup operation that
gets rid of the original 4K extents.

Because the file was written very slowly on a big filesystem, the extents
are scattered pessimally all over the virtual address space, not packed
close together.  If there are a few hundred extent allocations between
each log extent, then they will all occupy separate metadata pages.
When it is time to remove them, each of these pages must be updated.

This can be hit in a number of places in btrfs, including overwrite
and delete.

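The 48KB-per-extent figure above can be reproduced with a little arithmetic. This is only a rough sketch, assuming the worst case described in the message (every removed 4K extent lands on its own 16KB page in each of the three trees); the function and constant names are made up for illustration and are not btrfs code.

```python
# Back-of-the-envelope model of the worst case described above.
# Assumption: each removed 4K extent dirties one separate 16KB page in
# each of three trees (free space, csum, extent). Names are illustrative.
PAGE_SIZE = 16 * 1024     # metadata page size given in the message
TREES_TOUCHED = 3         # free space tree, csum tree, extent tree
EXTENT_SIZE = 4 * 1024    # fragment size in the "ultralog" example


def worst_case_metadata_bytes(data_bytes, extent_size=EXTENT_SIZE):
    """Metadata bytes rewritten when every extent hits its own pages."""
    extents = data_bytes // extent_size
    return extents * TREES_TOUCHED * PAGE_SIZE


per_extent = worst_case_metadata_bytes(EXTENT_SIZE)
per_128mib = worst_case_metadata_bytes(128 * 1024 * 1024)

print(per_extent // 1024, "KiB of metadata per 4 KiB extent")
print(per_128mib / 2**30, "GiB of metadata per 128 MiB defragmented")
```

Under these assumptions the model gives 48 KiB per extent and 1.5 GiB per 128 MiB of data, the same order of magnitude as the "about 1GB" per 128MB quoted earlier in the thread.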
There's also 60ish bytes per extent in any subvol trees the file
actually appears in, but you do get locality in that one (the key is
inode and offset, so nothing can get between them and space them apart).
That's 12GB and change (you'll probably completely empty most of the
updated subvol metadata pages, so we can expect maybe 5 pages to remain
including root and interior nodes).  I haven't been unlucky enough to
get a "natural" 12GB, but I got over 1GB a few times recently.

Reflinks can be used to multiply that 12GB arbitrarily--you only get
locality if the reflinks are consecutive in (inode, offset) space,
so if the reflinks are scattered across subvols or files, they won't
share pages.

> The trick, in this case, is to select one part of "ultralog" which is
> localized in the metadata, and defragment it. Repeating this step will
> ultimately defragment the entire file.
>
> So, the defrag selects some part of metadata which is entirely a descendant
> of some b-tree node not far from the bottom of b-tree. It selects it such
> that the required update to the metadata is less than, let's say, 64 MB, and
> simultaneously the affected "ultralog" file fragments total less than 512 MB
> (therefore, less than 128 thousand metadata leaf entries, each pointing to a
> 4 KB fragment). Then it finds all the file extents pointed to by that part
> of metadata. They are consecutive (as file fragments), because we have
> selected such part of metadata. Now the defrag can safely copy-move those
> fragments to a new area and update the metadata.
>
> In order to quickly select that small part of metadata, the defrag needs a
> metadata cache that can hold somewhat more than 128 thousand localized
> metadata leaf entries. That fits into 128 MB RAM definitely.
>
> Of course, there are many other small issues there, but this outlines the
> general procedure.
>
> Problem solved?

Problem missed completely.  The forward reference updates were the only
easy part.

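To illustrate why the forward references are the easy part, here is a toy model of leaf locality. It assumes items sit in sorted key order with a fixed number of items per leaf; the figures (200 items per leaf, 300 unrelated items between consecutive log extents) are hypothetical, chosen only to match "a few hundred extent allocations between each log extent", and none of this is btrfs code.

```python
# Toy model: how many leaf pages hold a file's items, depending on
# whether its keys are packed together or interleaved with other items.
# All constants are hypothetical and chosen for illustration only.

def leaves_touched(positions, items_per_leaf):
    """Count distinct leaves holding items at these sorted-key positions."""
    return len({pos // items_per_leaf for pos in positions})


ITEMS_PER_LEAF = 200   # hypothetical leaf capacity
N_EXTENTS = 1200       # extents belonging to the log file
GAP = 300              # hypothetical unrelated items between log extents

# Subvol tree: keys are (inode, offset), so the file's items are adjacent.
packed = leaves_touched(range(N_EXTENTS), ITEMS_PER_LEAF)

# Trees keyed by position in the address space: items are spread out.
scattered = leaves_touched(range(0, N_EXTENTS * GAP, GAP), ITEMS_PER_LEAF)

print("packed keys:   ", packed, "leaves")
print("scattered keys:", scattered, "leaves")
```

In this model the packed case touches 6 leaves while the scattered case touches one leaf per extent, which is the gap between the subvol tree and the other three trees described above.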
My solution is to detect that this is happening in real time, and merge
the extents while they're still too few to be a problem.  Now you might
be thinking "but doesn't that mean you'll merge the same data blocks
over and over, wasting iops?" but really it's a perfectly reasonable
trade considering the interest rates those unspent iops can collect on
btrfs.  If the target minimum extent size is 192K, you turn this 12GB
problem into a 250MB one, and the 1GB problem that actually occurs
becomes trivial.

Another solution would be to get the allocator to reserve some space
near growing files for use by those files, so that the small fragments
don't explode across the address space.  Then we'd get locality in all
four btrees.  Other filesystems have heuristics all over their
allocators to do things like this--btrfs seems to have a very minimal
allocator that could stand much improvement.
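The 12GB to 250MB figure follows from a simple ratio. A minimal sketch, assuming the metadata cost scales linearly with the number of extents (my reading of the argument, not something stated explicitly in the message); the function name is illustrative.

```python
# Sketch of the scaling argument: raising the minimum extent size from
# 4K to 192K divides the number of extents, and (by assumption) the
# worst-case metadata cost, by 192K / 4K = 48.

def scaled_metadata_cost(cost_bytes, old_extent, new_extent):
    """Scale a metadata cost by the reduction in extent count."""
    return cost_bytes * old_extent // new_extent


OLD_EXTENT = 4 * 1024
NEW_EXTENT = 192 * 1024

big = scaled_metadata_cost(12 * 10**9, OLD_EXTENT, NEW_EXTENT)
small = scaled_metadata_cost(1 * 10**9, OLD_EXTENT, NEW_EXTENT)

print(big / 10**6, "MB for the 12GB case")
print(small / 10**6, "MB for the 1GB case")
```

With these inputs the 12GB case shrinks to 250 MB and the 1GB case to roughly 21 MB.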