Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: General Zed <general-zed@zedlx.com>
Cc: Chris Murphy <lists@colorremedies.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Tue, 17 Sep 2019 00:19:11 -0400
Message-ID: <20190917041911.GE24379@hungrycats.org> (raw)
In-Reply-To: <20190916210317.Horde.CLwHiAXP00_WIX7YMxFiew3@server53.web-hosting.com>

On Mon, Sep 16, 2019 at 09:03:17PM -0400, General Zed wrote:
> 
> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > Reflinks are the forward references--there is no other kind of forward
> > reference in btrfs (contrast with other filesystems which use one data
> > structure for single references and another for multiple references).
> > 
> > There are two distinct objects with similar names:  extent data items,
> > and extent ref items.
> > 
> > A file consists of an inode item followed by extent ref items (aka
> > reflinks) in a subvol tree keyed by (inode, offset) pairs.  Subvol tree
> > pages can be shared with other subvol trees to make snapshots.
> 
> Ok, so a reflink contains a virtual address. Did I get that right?

Virtual address of the extent (i.e. the beginning of the original contiguous
write) plus offset and length within the extent.  This allows an extent ref
to reference part of an extent.

> All extent ref items are reflinks which contain a 4 KB aligned address
> because the extents have that same alignment. Did I get that right?
> 
> Virtual addresses are 8-bytes in size?

The addresses are 8 bytes.  There's an offset and length in the ref which
IIRC are 4 bytes each (extents have several limits on their size, all below
2GB).  For compressed extents there's also a decompressed length.

> I hope that virtual addresses are not wasteful of address space (that is,
> many top bits in an 8 bit virtual address are all zero).

Block groups that are deleted are never reissued the same virtual address,
though I know of no reason they couldn't be.

> > Extent data items are stored in a single tree (with other trees using
> > the same keys) that just lists which parts of the filesystem are occupied,
> > how long they are, and what data/metadata they contain.  Each extent
> > item contains a list of references to one of four kinds of object that
> > refers to the extent item (aka backrefs).  The free space tree is the
> > inverse of the extent data tree.
> 
> Ok, so there is an "extent tree" keyed by virtual addresses. Items there
> contain extent data.
> 
> But, how are nodes in this extent tree addressed (how do you travel from the
> parent to the child)? I guess by full virtual address, i.e. by a reflink,
> but this reflink can point within-extent, meaning its address is not 4 KB
> aligned.

Metadata items are also stored in extents.  There's a device tree to bootstrap.

End addresses of extent refs can be not 4K-aligned (i.e. the trailing
bytes at EOF).  All other addresses must be 4K-aligned.

> Or, an alternative explanation:
> each whole metadata extent is a single node. Each node is often half-full to
> allow for various tree operations to be performed. Due to there being many
> items per each node, there is additional CPU processing effort required when
> updating a node.
> 
> > Each extent ref item is a reference to an extent data item, but it
> > also contains all the information required to access the data.  For
> > normal read operations the extent data tree can be ignored (though
> > you still need to do a lookup in the csum tree to verify csums.
> 
> So, for normal reads, the information in subvol tree is sufficient.
> 
> > > So, you are saying that backrefs are already in the extent tree (or
> > > reachable from it). I didn't know that, that information makes my defrag
> > > much simpler to implement and describe. Someone in this thread has
> > > previously mislead me to believe that backref information is not easily
> > > available.
> > 
> > The backref isn't a precise location--it just tells you which metadata
> > blocks are holding at least one reference to the extent.  Some CPU
> > and linear searching has to be done to resolve that fully to an (inode,
> > offset) pair in the subvol tree(s).  It's a tradeoff to make normal POSIX
> > go faster, because you don't need to update the extent tree again when
> > you do some operations on the forward ref side, even though they add or
> > remove references.  e.g. creating a snapshot does not change the backrefs
> > list on individual extents--it creates two roots sharing a subset of the
> > subvol trees' branches.
> 
> This reads like a mayor fu**** to me.
> 
> I don't get it. If a backref doesn't point to an exact item, than CPU has to
> scan the entire 16 KB metadata extent to find the matching reflink. However,
> this would imply that all the items in a metadata extent are always valid
> (not stale from older versions of metadata). This then implies that, when an
> item of a metadata extent is updated, all the parents of all the items in
> the same extent have to be updated. Now, that would be such a waste,
> wouldn't it? Especially if the metadata extent is allowed to contain stale
> items.

All parents are updated in all trees that are updated.  It's the
wandering-trees filesystem.  The extent tree gets a _lot_ of updates,
because its items do change for every new subvol item that is created
or deleted.

But you're getting into the depths I'm not clear on.  There are 4 kinds of
parent node (or 4 branches in the case statement that follows backrefs).
Some of them (like snapshot leaf nodes) are very static, not changing
without creating new nodes all the way up to the top of a subvol tree.
Others refer fairly directly and specifically to a (subvol, inode, offset)
tuple.

> An alternative explanation: all the b-trees have 16 KB nodes, where each
> node matches a metadata extent. Therefore, the entire node has a single
> parent in a particular tree.
> 
> This means all virtual addresses are always 4 K aligned, furthermore, all
> virtual addresses that point to metadata extents are 16 K aligned.
> 
> 16 KB is a pretty big for a tree node. I wonder why was this size selected
> vs. 4 KB nodes? But, it doesn't matter.

Probably due to a performance benchmark someone ran around 2007-2009...?

It is useful for subvol trees where small files (2K or less after
compression) are stored inline.

> > > > 	- csum tree:  location, 1 or more 4-byte csums packed in an array.
> > > > 	Length of item is number of extent data blocks * 4 bytes plus a
> > > > 	168-bit header (ish...csums from adjacent extents may be packed
> > > > 	using a shared header)
> > > >
> > > > 	- free space tree:  location, size of free space.  This appears
> > > > 	when the extent is deleted.  It may be merged with adjacent
> > > > 	records.  Length is maybe 20 bytes?
> > > >
> > > > Each page contains a few hundred items, so if there are a few hundred
> > > > unrelated extents between extents in the log file, each log file extent
> > > > gets its own metadata page in each tree.
> > > 
> > > As far as I can understand it, the extents in the extent tree are indexed
> > > (keyed) by inode&offset. Therefore, no matter how many unrelated extents
> > > there are between (physical locations of data) extents in the log file, the
> > > log file extent tree entries will (generally speaking) be localized, because
> > > multiple extent entries (extent items) are bunched tohgether in one 16 KB
> > > metadata extent node.
> > 
> > No, extents in the extent tree are indexed by virtual address (roughly the
> > same as physical address over small scales, let's leave the device tree
> > out of it for now).  The subvol trees are organized the way you are
> > thinking of.
> 
> So, I guess that the virtual-to-physical address translation tables are
> always loaded in memory and that this translation is very fast? And the
> translation in the opposite direction, too.

Block groups typically consist of 1GB chunks from each disk, so the
maximum number of block groups is typically total disk size / 1GB.
(It's possible to create smaller chunks if you do a lot of small
incrementing resizes but nobody does that).  So a 45TB filesystem might
have a 3.5MB translation btree.

Translation in the other direction is rare--everything in btrfs uses the virtual
addresses.  IO error reports do the reverse transformation to turn it into a
(device id, sector) pair...that's the only use case I can think of.  Well,
I guess the block group allocator must deal with block addresses too.

> Anyway, thanks for explaining this all to me, makes it all much more clear.
> 
> 
> 

  parent reply index

Thread overview: 111+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-09  2:55 zedlryqc
2019-09-09  3:51 ` Qu Wenruo
2019-09-09 11:25   ` zedlryqc
2019-09-09 12:18     ` Qu Wenruo
2019-09-09 12:28       ` Qu Wenruo
2019-09-09 17:11         ` webmaster
2019-09-10 17:39           ` Andrei Borzenkov
2019-09-10 22:41             ` webmaster
2019-09-09 15:29       ` Graham Cobb
2019-09-09 17:24         ` Remi Gauvin
2019-09-09 19:26         ` webmaster
2019-09-10 19:22           ` Austin S. Hemmelgarn
2019-09-10 23:32             ` webmaster
2019-09-11 12:02               ` Austin S. Hemmelgarn
2019-09-11 16:26                 ` Zygo Blaxell
2019-09-11 17:20                 ` webmaster
2019-09-11 18:19                   ` Austin S. Hemmelgarn
2019-09-11 20:01                     ` webmaster
2019-09-11 21:42                       ` Zygo Blaxell
2019-09-13  1:33                         ` General Zed
2019-09-11 21:37                     ` webmaster
2019-09-12 11:31                       ` Austin S. Hemmelgarn
2019-09-12 19:18                         ` webmaster
2019-09-12 19:44                           ` Chris Murphy
2019-09-12 21:34                             ` General Zed
2019-09-12 22:28                               ` Chris Murphy
2019-09-12 22:57                                 ` General Zed
2019-09-12 23:54                                   ` Zygo Blaxell
2019-09-13  0:26                                     ` General Zed
2019-09-13  3:12                                       ` Zygo Blaxell
2019-09-13  5:05                                         ` General Zed
2019-09-14  0:56                                           ` Zygo Blaxell
2019-09-14  1:50                                             ` General Zed
2019-09-14  4:42                                               ` Zygo Blaxell
2019-09-14  4:53                                                 ` Zygo Blaxell
2019-09-15 17:54                                                 ` General Zed
2019-09-16 22:51                                                   ` Zygo Blaxell
2019-09-17  1:03                                                     ` General Zed
2019-09-17  1:34                                                       ` General Zed
2019-09-17  1:44                                                       ` Chris Murphy
2019-09-17  4:55                                                         ` Zygo Blaxell
2019-09-17  4:19                                                       ` Zygo Blaxell [this message]
2019-09-17  3:10                                                     ` General Zed
2019-09-17  4:05                                                       ` General Zed
2019-09-14  1:56                                             ` General Zed
2019-09-13  5:22                                         ` General Zed
2019-09-13  6:16                                         ` General Zed
2019-09-13  6:58                                         ` General Zed
2019-09-13  9:25                                           ` General Zed
2019-09-13 17:02                                             ` General Zed
2019-09-14  0:59                                             ` Zygo Blaxell
2019-09-14  1:28                                               ` General Zed
2019-09-14  4:28                                                 ` Zygo Blaxell
2019-09-15 18:05                                                   ` General Zed
2019-09-16 23:05                                                     ` Zygo Blaxell
2019-09-13  7:51                                         ` General Zed
2019-09-13 11:04                                     ` Austin S. Hemmelgarn
2019-09-13 20:43                                       ` Zygo Blaxell
2019-09-14  0:20                                         ` General Zed
2019-09-14 18:29                                       ` Chris Murphy
2019-09-14 23:39                                         ` Zygo Blaxell
2019-09-13 11:09                                   ` Austin S. Hemmelgarn
2019-09-13 17:20                                     ` General Zed
2019-09-13 18:20                                       ` General Zed
2019-09-12 19:54                           ` Austin S. Hemmelgarn
2019-09-12 22:21                             ` General Zed
2019-09-13 11:53                               ` Austin S. Hemmelgarn
2019-09-13 16:54                                 ` General Zed
2019-09-13 18:29                                   ` Austin S. Hemmelgarn
2019-09-13 19:40                                     ` General Zed
2019-09-14 15:10                                       ` Jukka Larja
2019-09-12 22:47                             ` General Zed
2019-09-11 21:37                   ` Zygo Blaxell
2019-09-11 23:21                     ` webmaster
2019-09-12  0:10                       ` Remi Gauvin
2019-09-12  3:05                         ` webmaster
2019-09-12  3:30                           ` Remi Gauvin
2019-09-12  3:33                             ` Remi Gauvin
2019-09-12  5:19                       ` Zygo Blaxell
2019-09-12 21:23                         ` General Zed
2019-09-14  4:12                           ` Zygo Blaxell
2019-09-16 11:42                             ` General Zed
2019-09-17  0:49                               ` Zygo Blaxell
2019-09-17  2:30                                 ` General Zed
2019-09-17  5:30                                   ` Zygo Blaxell
2019-09-17 10:07                                     ` General Zed
2019-09-17 23:40                                       ` Zygo Blaxell
2019-09-18  4:37                                         ` General Zed
2019-09-18 18:00                                           ` Zygo Blaxell
2019-09-10 23:58             ` webmaster
2019-09-09 23:24         ` Qu Wenruo
2019-09-09 23:25         ` webmaster
2019-09-09 16:38       ` webmaster
2019-09-09 23:44         ` Qu Wenruo
2019-09-10  0:00           ` Chris Murphy
2019-09-10  0:51             ` Qu Wenruo
2019-09-10  0:06           ` webmaster
2019-09-10  0:48             ` Qu Wenruo
2019-09-10  1:24               ` webmaster
2019-09-10  1:48                 ` Qu Wenruo
2019-09-10  3:32                   ` webmaster
2019-09-10 14:14                     ` Nikolay Borisov
2019-09-10 22:35                       ` webmaster
2019-09-11  6:40                         ` Nikolay Borisov
2019-09-10 22:48                     ` webmaster
2019-09-10 23:14                   ` webmaster
2019-09-11  0:26               ` webmaster
2019-09-11  0:36                 ` webmaster
2019-09-11  1:00                 ` webmaster
2019-09-10 11:12     ` Austin S. Hemmelgarn
2019-09-09  3:12 webmaster

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190917041911.GE24379@hungrycats.org \
    --to=ce3g8jdj@umail.furryterror.org \
    --cc=general-zed@zedlx.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=lists@colorremedies.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org linux-btrfs@archiver.kernel.org
	public-inbox-index linux-btrfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox