From: General Zed <firstname.lastname@example.org> To: Zygo Blaxell <email@example.com> Cc: Chris Murphy <firstname.lastname@example.org>, Btrfs BTRFS <email@example.com> Subject: Re: Feature requests: online backup - defrag - change RAID level Date: Mon, 16 Sep 2019 21:03:17 -0400 Message-ID: <20190916210317.Horde.CLwHiAXP00_WIX7YMxFiew3@server53.web-hosting.com> (raw) In-Reply-To: <20190916225126.GB24379@hungrycats.org> Quoting Zygo Blaxell <firstname.lastname@example.org>: > On Sun, Sep 15, 2019 at 01:54:07PM -0400, General Zed wrote: >> >> Quoting Zygo Blaxell <email@example.com>: >> >> > On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote: >> > > >> > > Quoting Zygo Blaxell <firstname.lastname@example.org>: >> > > >> > > > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote: >> > > > > >> > > > > Quoting Zygo Blaxell <email@example.com>: >> > > > > >> > > > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote: >> > > > > > > >> > > > > > > Quoting Zygo Blaxell <firstname.lastname@example.org>: >> > > > > > > >> > > > > > > > Don't forget you have to write new checksum and free space >> > > tree pages. >> > > > > > > > In the worst case, you'll need about 1GB of new metadata pages >> > > > > for each >> > > > > > > > 128MB you defrag (though you get to delete 99.5% of them >> > > immediately >> > > > > > > > after). >> > > > > > > >> > > > > > > Yes, here we are debating some worst-case scenaraio which is >> > > actually >> > > > > > > imposible in practice due to various reasons. >> > > > > > >> > > > > > No, it's quite possible. A log file written slowly on an active >> > > > > > filesystem above a few TB will do that accidentally. Every >> > > now and then >> > > > > > I hit that case. It can take several hours to do a logrotate >> > > on spinning >> > > > > > arrays because of all the metadata fetches and updates >> associated with >> > > > > > worst-case file delete. Long enough to watch the delete >> happen, and >> > > > > > even follow along in the source code. >> > > > > > >> > > > > > I guess if I did a proactive defrag every few hours, it might >> > > take less >> > > > > > time to do the logrotate, but that would mean spreading >> out all the >> > > > > > seeky IO load during the day instead of getting it all >> done at night. >> > > > > > Logrotate does the same job as defrag in this case (replacing >> > > a file in >> > > > > > thousands of fragments spread across the disk with a few large >> > > fragments >> > > > > > close together), except logrotate gets better compression. >> > > > > > >> > > > > > To be more accurate, the example I gave above is the >> worst case you >> > > > > > can expect from normal user workloads. If I throw in >> some reflinks >> > > > > > and snapshots, I can make it arbitrarily worse, until the >> entire disk >> > > > > > is consumed by the metadata update of a single extent defrag. >> > > > > > >> > > > > >> > > > > I can't believe I am considering this case. >> > > > > >> > > > > So, we have a 1TB log file "ultralog" split into 256 million 4 >> > > KB extents >> > > > > randomly over the entire disk. We have 512 GB free RAM and >> 2% free disk >> > > > > space. The file needs to be defragmented. >> > > > > >> > > > > In order to do that, defrag needs to be able to copy-move >> > > multiple extents >> > > > > in one batch, and update the metadata. >> > > > > >> > > > > The metadata has a total of at least 256 million entries, each >> > > of some size, >> > > > > but each one should hold at least a pointer to the extent (8 >> > > bytes) and a >> > > > > checksum (8 bytes): In reality, it could be that there is a >> lot of other >> > > > > data there per entry. >> > > > >> > > > It's about 48KB per 4K extent, plus a few hundred bytes on average >> > > for each >> > > > reference. >> > > >> > > Sorry, could you be more clear there? An file fragment/extent that holds >> > > file data can be any >> > > size up to 128 MB. What metadata is there per every file >> fragment/extent? >> > > >> > > Because "48 KB per 4 K extent" ... cannot decode what you mean. >> > >> > An extent has 3 associated records in btrfs, not including its references. >> > The first two exist while the extent exists, the third appears after it >> > is removed. >> > >> > - extent tree: location, size of extent, pointers to backref trees. >> > Length is around 60 bytes plus the size of the backref pointer list. >> >> Wait.. and where are the reflinks? Backrefs are there for going up the tree, >> but where are reflinks for going down the tree? > > Reflinks are the forward references--there is no other kind of forward > reference in btrfs (contrast with other filesystems which use one data > structure for single references and another for multiple references). > > There are two distinct objects with similar names: extent data items, > and extent ref items. > > A file consists of an inode item followed by extent ref items (aka > reflinks) in a subvol tree keyed by (inode, offset) pairs. Subvol tree > pages can be shared with other subvol trees to make snapshots. Ok, so a reflink contains a virtual address. Did I get that right? All extent ref items are reflinks which contain a 4 KB aligned address because the extents have that same alignment. Did I get that right? Virtual addresses are 8-bytes in size? I hope that virtual addresses are not wasteful of address space (that is, many top bits in an 8 bit virtual address are all zero). > Extent data items are stored in a single tree (with other trees using > the same keys) that just lists which parts of the filesystem are occupied, > how long they are, and what data/metadata they contain. Each extent > item contains a list of references to one of four kinds of object that > refers to the extent item (aka backrefs). The free space tree is the > inverse of the extent data tree. Ok, so there is an "extent tree" keyed by virtual addresses. Items there contain extent data. But, how are nodes in this extent tree addressed (how do you travel from the parent to the child)? I guess by full virtual address, i.e. by a reflink, but this reflink can point within-extent, meaning its address is not 4 KB aligned. Or, an alternative explanation: each whole metadata extent is a single node. Each node is often half-full to allow for various tree operations to be performed. Due to there being many items per each node, there is additional CPU processing effort required when updating a node. > Each extent ref item is a reference to an extent data item, but it > also contains all the information required to access the data. For > normal read operations the extent data tree can be ignored (though > you still need to do a lookup in the csum tree to verify csums. So, for normal reads, the information in subvol tree is sufficient. >> So, you are saying that backrefs are already in the extent tree (or >> reachable from it). I didn't know that, that information makes my defrag >> much simpler to implement and describe. Someone in this thread has >> previously mislead me to believe that backref information is not easily >> available. > > The backref isn't a precise location--it just tells you which metadata > blocks are holding at least one reference to the extent. Some CPU > and linear searching has to be done to resolve that fully to an (inode, > offset) pair in the subvol tree(s). It's a tradeoff to make normal POSIX > go faster, because you don't need to update the extent tree again when > you do some operations on the forward ref side, even though they add or > remove references. e.g. creating a snapshot does not change the backrefs > list on individual extents--it creates two roots sharing a subset of the > subvol trees' branches. This reads like a mayor fu**** to me. I don't get it. If a backref doesn't point to an exact item, than CPU has to scan the entire 16 KB metadata extent to find the matching reflink. However, this would imply that all the items in a metadata extent are always valid (not stale from older versions of metadata). This then implies that, when an item of a metadata extent is updated, all the parents of all the items in the same extent have to be updated. Now, that would be such a waste, wouldn't it? Especially if the metadata extent is allowed to contain stale items. An alternative explanation: all the b-trees have 16 KB nodes, where each node matches a metadata extent. Therefore, the entire node has a single parent in a particular tree. This means all virtual addresses are always 4 K aligned, furthermore, all virtual addresses that point to metadata extents are 16 K aligned. 16 KB is a pretty big for a tree node. I wonder why was this size selected vs. 4 KB nodes? But, it doesn't matter. >> > - csum tree: location, 1 or more 4-byte csums packed in an array. >> > Length of item is number of extent data blocks * 4 bytes plus a >> > 168-bit header (ish...csums from adjacent extents may be packed >> > using a shared header) >> > >> > - free space tree: location, size of free space. This appears >> > when the extent is deleted. It may be merged with adjacent >> > records. Length is maybe 20 bytes? >> > >> > Each page contains a few hundred items, so if there are a few hundred >> > unrelated extents between extents in the log file, each log file extent >> > gets its own metadata page in each tree. >> >> As far as I can understand it, the extents in the extent tree are indexed >> (keyed) by inode&offset. Therefore, no matter how many unrelated extents >> there are between (physical locations of data) extents in the log file, the >> log file extent tree entries will (generally speaking) be localized, because >> multiple extent entries (extent items) are bunched tohgether in one 16 KB >> metadata extent node. > > No, extents in the extent tree are indexed by virtual address (roughly the > same as physical address over small scales, let's leave the device tree > out of it for now). The subvol trees are organized the way you are > thinking of. So, I guess that the virtual-to-physical address translation tables are always loaded in memory and that this translation is very fast? And the translation in the opposite direction, too. Anyway, thanks for explaining this all to me, makes it all much more clear.
next prev parent reply index Thread overview: 111+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-09-09 2:55 zedlryqc 2019-09-09 3:51 ` Qu Wenruo 2019-09-09 11:25 ` zedlryqc 2019-09-09 12:18 ` Qu Wenruo 2019-09-09 12:28 ` Qu Wenruo 2019-09-09 17:11 ` webmaster 2019-09-10 17:39 ` Andrei Borzenkov 2019-09-10 22:41 ` webmaster 2019-09-09 15:29 ` Graham Cobb 2019-09-09 17:24 ` Remi Gauvin 2019-09-09 19:26 ` webmaster 2019-09-10 19:22 ` Austin S. Hemmelgarn 2019-09-10 23:32 ` webmaster 2019-09-11 12:02 ` Austin S. Hemmelgarn 2019-09-11 16:26 ` Zygo Blaxell 2019-09-11 17:20 ` webmaster 2019-09-11 18:19 ` Austin S. Hemmelgarn 2019-09-11 20:01 ` webmaster 2019-09-11 21:42 ` Zygo Blaxell 2019-09-13 1:33 ` General Zed 2019-09-11 21:37 ` webmaster 2019-09-12 11:31 ` Austin S. Hemmelgarn 2019-09-12 19:18 ` webmaster 2019-09-12 19:44 ` Chris Murphy 2019-09-12 21:34 ` General Zed 2019-09-12 22:28 ` Chris Murphy 2019-09-12 22:57 ` General Zed 2019-09-12 23:54 ` Zygo Blaxell 2019-09-13 0:26 ` General Zed 2019-09-13 3:12 ` Zygo Blaxell 2019-09-13 5:05 ` General Zed 2019-09-14 0:56 ` Zygo Blaxell 2019-09-14 1:50 ` General Zed 2019-09-14 4:42 ` Zygo Blaxell 2019-09-14 4:53 ` Zygo Blaxell 2019-09-15 17:54 ` General Zed 2019-09-16 22:51 ` Zygo Blaxell 2019-09-17 1:03 ` General Zed [this message] 2019-09-17 1:34 ` General Zed 2019-09-17 1:44 ` Chris Murphy 2019-09-17 4:55 ` Zygo Blaxell 2019-09-17 4:19 ` Zygo Blaxell 2019-09-17 3:10 ` General Zed 2019-09-17 4:05 ` General Zed 2019-09-14 1:56 ` General Zed 2019-09-13 5:22 ` General Zed 2019-09-13 6:16 ` General Zed 2019-09-13 6:58 ` General Zed 2019-09-13 9:25 ` General Zed 2019-09-13 17:02 ` General Zed 2019-09-14 0:59 ` Zygo Blaxell 2019-09-14 1:28 ` General Zed 2019-09-14 4:28 ` Zygo Blaxell 2019-09-15 18:05 ` General Zed 2019-09-16 23:05 ` Zygo Blaxell 2019-09-13 7:51 ` General Zed 2019-09-13 11:04 ` Austin S. Hemmelgarn 2019-09-13 20:43 ` Zygo Blaxell 2019-09-14 0:20 ` General Zed 2019-09-14 18:29 ` Chris Murphy 2019-09-14 23:39 ` Zygo Blaxell 2019-09-13 11:09 ` Austin S. Hemmelgarn 2019-09-13 17:20 ` General Zed 2019-09-13 18:20 ` General Zed 2019-09-12 19:54 ` Austin S. Hemmelgarn 2019-09-12 22:21 ` General Zed 2019-09-13 11:53 ` Austin S. Hemmelgarn 2019-09-13 16:54 ` General Zed 2019-09-13 18:29 ` Austin S. Hemmelgarn 2019-09-13 19:40 ` General Zed 2019-09-14 15:10 ` Jukka Larja 2019-09-12 22:47 ` General Zed 2019-09-11 21:37 ` Zygo Blaxell 2019-09-11 23:21 ` webmaster 2019-09-12 0:10 ` Remi Gauvin 2019-09-12 3:05 ` webmaster 2019-09-12 3:30 ` Remi Gauvin 2019-09-12 3:33 ` Remi Gauvin 2019-09-12 5:19 ` Zygo Blaxell 2019-09-12 21:23 ` General Zed 2019-09-14 4:12 ` Zygo Blaxell 2019-09-16 11:42 ` General Zed 2019-09-17 0:49 ` Zygo Blaxell 2019-09-17 2:30 ` General Zed 2019-09-17 5:30 ` Zygo Blaxell 2019-09-17 10:07 ` General Zed 2019-09-17 23:40 ` Zygo Blaxell 2019-09-18 4:37 ` General Zed 2019-09-18 18:00 ` Zygo Blaxell 2019-09-10 23:58 ` webmaster 2019-09-09 23:24 ` Qu Wenruo 2019-09-09 23:25 ` webmaster 2019-09-09 16:38 ` webmaster 2019-09-09 23:44 ` Qu Wenruo 2019-09-10 0:00 ` Chris Murphy 2019-09-10 0:51 ` Qu Wenruo 2019-09-10 0:06 ` webmaster 2019-09-10 0:48 ` Qu Wenruo 2019-09-10 1:24 ` webmaster 2019-09-10 1:48 ` Qu Wenruo 2019-09-10 3:32 ` webmaster 2019-09-10 14:14 ` Nikolay Borisov 2019-09-10 22:35 ` webmaster 2019-09-11 6:40 ` Nikolay Borisov 2019-09-10 22:48 ` webmaster 2019-09-10 23:14 ` webmaster 2019-09-11 0:26 ` webmaster 2019-09-11 0:36 ` webmaster 2019-09-11 1:00 ` webmaster 2019-09-10 11:12 ` Austin S. Hemmelgarn 2019-09-09 3:12 webmaster
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20190916210317.Horde.CLwHiAXP00_WIX7YMxFiew3@server53.web-hosting.com \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Linux-BTRFS Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \ email@example.com public-inbox-index linux-btrfs Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs AGPL code for this site: git clone https://public-inbox.org/public-inbox.git