Linux-BTRFS Archive on lore.kernel.org
From: Chris Murphy <lists@colorremedies.com>
To: General Zed <general-zed@zedlx.com>
Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Chris Murphy <lists@colorremedies.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Mon, 16 Sep 2019 19:44:31 -0600
Message-ID: <CAJCQCtRW8ObeQ5nL_Q9t-7rXDtOk5TQLcZnhH6bGRMA-puUVNw@mail.gmail.com>
In-Reply-To: <20190916210317.Horde.CLwHiAXP00_WIX7YMxFiew3@server53.web-hosting.com>

On Mon, Sep 16, 2019 at 7:03 PM General Zed <general-zed@zedlx.com> wrote:
>
> Ok, so a reflink contains a virtual address. Did I get that right?
>
> All extent ref items are reflinks which contain a 4 KB aligned address
> because the extents have that same alignment. Did I get that right?
>
> Virtual addresses are 8-bytes in size?
>
> I hope that virtual addresses are not wasteful of address space (that
> is, many top bits in an 8 bit virtual address are all zero).

All addresses in Btrfs are in a linear (logical) address space. This is
all much easier to follow once you've familiarized yourself with a few
things:

https://btrfs.wiki.kernel.org/index.php/On-disk_Format

You probably don't need to know the literal on-disk format at a sector
level. There is a human-readable form available with 'btrfs
inspect-internal dump-tree'. Create a Btrfs file system, and dump the
tree so you can see what it looks like totally empty. Mount it. Copy
over a file. Unmount it. Dump tree. You don't actually have to unmount
it, but unmounting keeps Btrfs's regular commit intervals from moving
things around while you look. Make a directory. Dump tree. Add a file
to the directory. Dump tree. Move the file. Dump tree.
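That walkthrough might look something like this (a sketch only: the
loop device, image path, and mount point are placeholders, and
mkfs.btrfs, mount, and dump-tree all need root):

```shell
# Scratch filesystem on a loop device; /dev/loop0, /tmp/btrfs.img and
# /mnt/test are placeholders -- substitute your own.
truncate -s 1G /tmp/btrfs.img
losetup /dev/loop0 /tmp/btrfs.img

mkfs.btrfs /dev/loop0
btrfs inspect-internal dump-tree /dev/loop0   # totally empty filesystem

mkdir -p /mnt/test
mount /dev/loop0 /mnt/test
cp /etc/hostname /mnt/test/file1
umount /mnt/test
btrfs inspect-internal dump-tree /dev/loop0   # one file added

mount /dev/loop0 /mnt/test
mkdir /mnt/test/dir
umount /mnt/test
btrfs inspect-internal dump-tree /dev/loop0   # directory added

mount /dev/loop0 /mnt/test
mv /mnt/test/file1 /mnt/test/dir/file1
umount /mnt/test
btrfs inspect-internal dump-tree /dev/loop0   # file moved
```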

You'll see the relationship between the superblock and all the trees.
You'll see nodes and leaves, and figure out the difference between
them.

Make a reflink. Dump tree. Make a snapshot. Dump tree.
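For example, on the same scratch filesystem (paths and device are
placeholders; `sync` stands in for unmounting between dumps):

```shell
# Reflink a file, then snapshot the top-level subvolume, dumping the
# tree after each step to compare the resulting items.
cp --reflink=always /mnt/test/dir/file1 /mnt/test/file1.clone
sync
btrfs inspect-internal dump-tree /dev/loop0   # both files reference the same extent

btrfs subvolume snapshot /mnt/test /mnt/test/snap
sync
btrfs inspect-internal dump-tree /dev/loop0   # a new root item appears
```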


> > Extent data items are stored in a single tree (with other trees using
> > the same keys) that just lists which parts of the filesystem are occupied,
> > how long they are, and what data/metadata they contain.  Each extent
> > item contains a list of references to one of four kinds of object that
> > refers to the extent item (aka backrefs).  The free space tree is the
> > inverse of the extent data tree.
>
> Ok, so there is an "extent tree" keyed by virtual addresses. Items
> there contain extent data.
>
> But, how are nodes in this extent tree addressed (how do you travel
> from the parent to the child)? I guess by full virtual address, i.e.
> by a reflink, but this reflink can point within-extent, meaning its
> address is not 4 KB aligned.
>
> Or, an alternative explanation:
> each whole metadata extent is a single node. Each node is often
> half-full to allow for various tree operations to be performed. Due to
> there being many items per each node, there is additional CPU
> processing effort required when updating a node.

A reflink is like a file-based snapshot: it's a file with its own
inode and the other metadata you expect for a file, but it points to
the same extents as another file. Offhand, other than age, I'm not
sure there's any difference between the structure of the original file
and a reflink of that file. Find out. Make a reflink, dump tree.
Delete the original file. Dump tree.
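A sketch of that experiment (placeholder paths and device; needs
root):

```shell
# Reflink a file, then delete the original, dumping the tree before
# and after to watch who still references the extent.
cp --reflink=always /mnt/test/original /mnt/test/clone
sync
btrfs inspect-internal dump-tree /dev/loop0   # compare the two files' EXTENT_DATA items
rm /mnt/test/original
sync
btrfs inspect-internal dump-tree /dev/loop0   # extent survives, referenced only by the clone
```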




>
> > Each extent ref item is a reference to an extent data item, but it
> > also contains all the information required to access the data.  For
> > normal read operations the extent data tree can be ignored (though
> > you still need to do a lookup in the csum tree to verify csums.
>
> So, for normal reads, the information in subvol tree is sufficient.

A subvolume is a file tree. A snapshot is a prepopulated subvolume.
It's interesting to snapshot a subvolume. Dump tree. Modify just one
thing in the snapshot. Dump tree.
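Sketched out (placeholder paths and device; needs root):

```shell
# Snapshot a subvolume, then modify one file in the snapshot, dumping
# the tree each time to see the shared branches diverge.
btrfs subvolume create /mnt/test/subvol
echo data > /mnt/test/subvol/f
btrfs subvolume snapshot /mnt/test/subvol /mnt/test/snap
sync
btrfs inspect-internal dump-tree /dev/loop0   # two roots sharing the same leaves
echo change >> /mnt/test/snap/f
sync
btrfs inspect-internal dump-tree /dev/loop0   # the modified leaf is CoW'd under the snapshot's root
```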



>
> >> So, you are saying that backrefs are already in the extent tree (or
> >> reachable from it). I didn't know that, that information makes my defrag
> >> much simpler to implement and describe. Someone in this thread has
> >> previously mislead me to believe that backref information is not easily
> >> available.
> >
> > The backref isn't a precise location--it just tells you which metadata
> > blocks are holding at least one reference to the extent.  Some CPU
> > and linear searching has to be done to resolve that fully to an (inode,
> > offset) pair in the subvol tree(s).  It's a tradeoff to make normal POSIX
> > go faster, because you don't need to update the extent tree again when
> > you do some operations on the forward ref side, even though they add or
> > remove references.  e.g. creating a snapshot does not change the backrefs
> > list on individual extents--it creates two roots sharing a subset of the
> > subvol trees' branches.
>
> This reads like a mayor fu**** to me.
>
> I don't get it. If a backref doesn't point to an exact item, than CPU
> has to scan the entire 16 KB metadata extent to find the matching
> reflink. However, this would imply that all the items in a metadata
> extent are always valid (not stale from older versions of metadata).
> This then implies that, when an item of a metadata extent is updated,
> all the parents of all the items in the same extent have to be
> updated. Now, that would be such a waste, wouldn't it? Especially if
> the metadata extent is allowed to contain stale items.
>
> An alternative explanation: all the b-trees have 16 KB nodes, where
> each node matches a metadata extent. Therefore, the entire node has a
> single parent in a particular tree.
>
> This means all virtual addresses are always 4 K aligned, furthermore,
> all virtual addresses that point to metadata extents are 16 K aligned.
>
> 16 KB is a pretty big for a tree node. I wonder why was this size
> selected vs. 4 KB nodes? But, it doesn't matter.

4KiB used to be the default node size. 16KiB was benchmarked, found to
be faster, and made the default. Nodes can optionally be 32KiB or
64KiB on x86.

The Btrfs sector size must match the architecture's page size. The
node size can't be smaller than the sector size. And the leaf size
must be the same as the node size.
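Node size is chosen at mkfs time and can't be changed afterward. A
sketch (placeholder device; needs root):

```shell
# 32KiB nodes/leaves; -f overwrites any existing filesystem signature.
mkfs.btrfs -f --nodesize 32768 /dev/loop0

# Sector size must match the page size (4KiB on x86).
mkfs.btrfs -f --sectorsize 4096 --nodesize 16384 /dev/loop0
```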

> So, I guess that the virtual-to-physical address translation tables
> are always loaded in memory and that this translation is very fast?
> And the translation in the opposite direction, too.

That's the job of the chunk tree and device tree. That's how the
multiple-device support magic happens: files and extent information
don't have to deal with where the data physically lives, because that's
the job of other trees.

Add a device. Dump tree. Do a balance convert to raid1 for data and
metadata. Dump tree.
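Sketched with a second loop device (all paths and device names are
placeholders; needs root):

```shell
# Attach a second device, then convert data and metadata to raid1,
# dumping the tree at each step to watch the chunk/device trees change.
truncate -s 1G /tmp/btrfs2.img
losetup /dev/loop1 /tmp/btrfs2.img

btrfs device add /dev/loop1 /mnt/test
btrfs inspect-internal dump-tree /dev/loop0   # chunk and device trees now span two devices

btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/test
btrfs inspect-internal dump-tree /dev/loop0   # chunks rewritten with the raid1 profile
```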

It's sometimes also useful to dump the super, which is in a sense the
topmost part of the tree.
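For instance (placeholder device; needs root):

```shell
btrfs inspect-internal dump-super /dev/loop0       # primary superblock
btrfs inspect-internal dump-super -fa /dev/loop0   # -f: full dump, -a: all superblock copies
```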

-- 
Chris Murphy

