Linux-BTRFS Archive on lore.kernel.org
From: General Zed <general-zed@zedlx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Chris Murphy <lists@colorremedies.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Mon, 16 Sep 2019 21:34:51 -0400
Message-ID: <20190916213451.Horde.CVGC1tsP9Z8Q-7hBi7l1ymH@server53.web-hosting.com>
In-Reply-To: <20190916210317.Horde.CLwHiAXP00_WIX7YMxFiew3@server53.web-hosting.com>


Quoting General Zed <general-zed@zedlx.com>:

> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>
>> On Sun, Sep 15, 2019 at 01:54:07PM -0400, General Zed wrote:
>>>
>>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>
>>>> On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote:
>>>> >
>>>> > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>> >
>>>> > > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
>>>> > > >
>>>> > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>> > > >
>>>> > > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>>> > > > > >
>>>> > > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>>> > > > > >
>>>> > > > > > > Don't forget you have to write new checksum and free space
>>>> > > > > > > tree pages.  In the worst case, you'll need about 1GB of new
>>>> > > > > > > metadata pages for each 128MB you defrag (though you get to
>>>> > > > > > > delete 99.5% of them immediately after).
>>>> > > > > >
>>>> > > > > > Yes, here we are debating some worst-case scenario which is
>>>> > > > > > actually impossible in practice due to various reasons.
>>>> > > > >
>>>> > > > > No, it's quite possible.  A log file written slowly on an active
>>>> > > > > filesystem above a few TB will do that accidentally.  Every now
>>>> > > > > and then I hit that case.  It can take several hours to do a
>>>> > > > > logrotate on spinning arrays because of all the metadata fetches
>>>> > > > > and updates associated with worst-case file delete.  Long enough
>>>> > > > > to watch the delete happen, and even follow along in the source
>>>> > > > > code.
>>>> > > > >
>>>> > > > > I guess if I did a proactive defrag every few hours, it might
>>>> > > > > take less time to do the logrotate, but that would mean spreading
>>>> > > > > out all the seeky IO load during the day instead of getting it
>>>> > > > > all done at night.  Logrotate does the same job as defrag in this
>>>> > > > > case (replacing a file in thousands of fragments spread across
>>>> > > > > the disk with a few large fragments close together), except
>>>> > > > > logrotate gets better compression.
>>>> > > > >
>>>> > > > > To be more accurate, the example I gave above is the worst case
>>>> > > > > you can expect from normal user workloads.  If I throw in some
>>>> > > > > reflinks and snapshots, I can make it arbitrarily worse, until
>>>> > > > > the entire disk is consumed by the metadata update of a single
>>>> > > > > extent defrag.
>>>> > > > >
>>>> > > >
>>>> > > > I can't believe I am considering this case.
>>>> > > >
>>>> > > > So, we have a 1TB log file "ultralog" split into 256 million 4 KB
>>>> > > > extents randomly over the entire disk. We have 512 GB free RAM and
>>>> > > > 2% free disk space. The file needs to be defragmented.
>>>> > > >
>>>> > > > In order to do that, defrag needs to be able to copy-move multiple
>>>> > > > extents in one batch, and update the metadata.
>>>> > > >
>>>> > > > The metadata has a total of at least 256 million entries, each of
>>>> > > > some size, but each one should hold at least a pointer to the
>>>> > > > extent (8 bytes) and a checksum (8 bytes). In reality, it could be
>>>> > > > that there is a lot of other data there per entry.
>>>> > >
>>>> > > It's about 48KB per 4K extent, plus a few hundred bytes on average
>>>> > > for each reference.
>>>> >
>>>> > Sorry, could you be more clear there? A file fragment/extent that
>>>> > holds file data can be any size up to 128 MB. What metadata is there
>>>> > for each file fragment/extent?
>>>> >
>>>> > Because "48 KB per 4 K extent" ... I cannot decode what you mean.
>>>>
>>>> An extent has 3 associated records in btrfs, not including its references.
>>>> The first two exist while the extent exists, the third appears after it
>>>> is removed.
>>>>
>>>> 	- extent tree:  location, size of extent, pointers to backref trees.
>>>> 	Length is around 60 bytes plus the size of the backref pointer list.
>>>
>>> Wait... and where are the reflinks? Backrefs are there for going up the
>>> tree, but where are reflinks for going down the tree?
>>
>> Reflinks are the forward references--there is no other kind of forward
>> reference in btrfs (contrast with other filesystems which use one data
>> structure for single references and another for multiple references).
>>
>> There are two distinct objects with similar names:  extent data items,
>> and extent ref items.
>>
>> A file consists of an inode item followed by extent ref items (aka
>> reflinks) in a subvol tree keyed by (inode, offset) pairs.  Subvol tree
>> pages can be shared with other subvol trees to make snapshots.
>
> Ok, so a reflink contains a virtual address. Did I get that right?
>
> All extent ref items are reflinks which contain a 4 KB aligned  
> address because the extents have that same alignment. Did I get that  
> right?
>
> Virtual addresses are 8 bytes in size?
>
> I hope that virtual addresses are not wasteful of address space
> (that is, many top bits in an 8-byte virtual address are all zero).
>
>> Extent data items are stored in a single tree (with other trees using
>> the same keys) that just lists which parts of the filesystem are occupied,
>> how long they are, and what data/metadata they contain.  Each extent
>> item contains a list of references to one of four kinds of object that
>> refers to the extent item (aka backrefs).  The free space tree is the
>> inverse of the extent data tree.
>
> Ok, so there is an "extent tree" keyed by virtual addresses. Items  
> there contain extent data.
>
> But, how are nodes in this extent tree addressed (how do you travel  
> from the parent to the child)? I guess by full virtual address, i.e.  
> by a reflink, but this reflink can point within-extent, meaning its  
> address is not 4 KB aligned.
>
> Or, an alternative explanation:
> each whole metadata extent is a single node. Each node is often  
> half-full to allow for various tree operations to be performed. Because
> there are many items in each node, there is additional CPU processing
> effort required when updating a node.
>
>> Each extent ref item is a reference to an extent data item, but it
>> also contains all the information required to access the data.  For
>> normal read operations the extent data tree can be ignored (though
>> you still need to do a lookup in the csum tree to verify csums).
>
> So, for normal reads, the information in subvol tree is sufficient.
>
>>> So, you are saying that backrefs are already in the extent tree (or
>>> reachable from it). I didn't know that, that information makes my defrag
>>> much simpler to implement and describe. Someone in this thread has
>>> previously misled me to believe that backref information is not easily
>>> available.
>>
>> The backref isn't a precise location--it just tells you which metadata
>> blocks are holding at least one reference to the extent.  Some CPU
>> and linear searching has to be done to resolve that fully to an (inode,
>> offset) pair in the subvol tree(s).  It's a tradeoff to make normal POSIX
>> go faster, because you don't need to update the extent tree again when
>> you do some operations on the forward ref side, even though they add or
>> remove references.  e.g. creating a snapshot does not change the backrefs
>> list on individual extents--it creates two roots sharing a subset of the
>> subvol trees' branches.
>
> This reads like a major fu**** to me.
>
> I don't get it. If a backref doesn't point to an exact item, then the
> CPU has to scan the entire 16 KB metadata extent to find the
> matching reflink. However, this would imply that all the items in a  
> metadata extent are always valid (not stale from older versions of  
> metadata). This then implies that, when an item of a metadata extent  
> is updated, all the parents of all the items in the same extent have  
> to be updated. Now, that would be such a waste, wouldn't it?  
> Especially if the metadata extent is allowed to contain stale items.
>
> An alternative explanation: all the b-trees have 16 KB nodes, where  
> each node matches a metadata extent. Therefore, the entire node has  
> a single parent in a particular tree.
>
> This means all virtual addresses are always 4 K aligned;
> furthermore, all virtual addresses that point to metadata extents
> are 16 K aligned.
>
> 16 KB is pretty big for a tree node. I wonder why this size was
> selected vs. 4 KB nodes? But it doesn't matter.
>
>>>> 	- csum tree:  location, 1 or more 4-byte csums packed in an array.
>>>> 	Length of item is number of extent data blocks * 4 bytes plus a
>>>> 	168-bit header (ish...csums from adjacent extents may be packed
>>>> 	using a shared header)
>>>>
>>>> 	- free space tree:  location, size of free space.  This appears
>>>> 	when the extent is deleted.  It may be merged with adjacent
>>>> 	records.  Length is maybe 20 bytes?
>>>>
>>>> Each page contains a few hundred items, so if there are a few hundred
>>>> unrelated extents between extents in the log file, each log file extent
>>>> gets its own metadata page in each tree.
>>>
>>> As far as I can understand it, the extents in the extent tree are indexed
>>> (keyed) by inode&offset. Therefore, no matter how many unrelated extents
>>> there are between (physical locations of data) extents in the log file, the
>>> log file extent tree entries will (generally speaking) be  
>>> localized, because
>>> multiple extent entries (extent items) are bunched together in one 16 KB
>>> metadata extent node.
>>
>> No, extents in the extent tree are indexed by virtual address (roughly the
>> same as physical address over small scales, let's leave the device tree
>> out of it for now).  The subvol trees are organized the way you are
>> thinking of.
>
> So, I guess that the virtual-to-physical address translation tables  
> are always loaded in memory and that this translation is very fast?  
> And the translation in the opposite direction, too.
>
> Anyway, thanks for explaining this all to me, makes it all much more clear.
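
To check my understanding, here is a toy model of how I now picture these
records relating to each other. This is only an illustration of the
explanation above; the names and fields are mine, not the actual btrfs
on-disk structures.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ExtentRefItem:
    """A "reflink": lives in a subvol tree, keyed by (inode, file offset)."""
    inode: int
    file_offset: int
    extent_vaddr: int    # virtual address of the data extent (4 KB aligned)
    length: int

@dataclass
class ExtentItem:
    """Lives in the extent tree, keyed by virtual address."""
    vaddr: int
    length: int
    backrefs: List[int] = field(default_factory=list)
    # Backrefs only say which roots/metadata blocks hold at least one
    # reference; resolving them to exact (inode, offset) pairs takes a search.

@dataclass
class CsumItem:
    """Lives in the csum tree: one 4-byte csum per 4 KB data block."""
    vaddr: int
    csums: List[int] = field(default_factory=list)

def read_mapping(subvol: Dict[Tuple[int, int], ExtentRefItem],
                 inode: int, file_offset: int) -> int:
    """Normal reads resolve (inode, offset) through the subvol tree alone;
    the extent tree is only needed for backref walks (plus csum lookups)."""
    return subvol[(inode, file_offset)].extent_vaddr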

Taking into account the explanation that b-tree nodes are 16 KB in size,
this is what my imagined defrag would do:

This defrag makes batched updates to metadata. In the most common cases, a
batched update (with a single commit) will have the following properties
(a rough sketch of the batching loop follows the list):

- it moves many file data extents from one place to another. However,
the update is mostly localized in space (for common fragmentation
cases), meaning that only a small number of fragments per update will
be significantly scattered. Therefore, updates to the extent tree are
localized.

- the free space tree - mostly localized updates, for the same reason as above

- the checksum tree - the same as above

- the subvol tree - a (potentially large) number of files is affected
per update. In most cases of fragmentation, the dispersion of
fragments will not be high. However, due to the number of files,
updating this tree looks like the most performance-troubling part of
all. The problem is that while there is only a small dispersion per
file, when this gets multiplied by the number of files, it can get
bad. Some dispersion in this tree is to be expected. In complex cases,
the size of a batch can effectively be limited by the size of the
changes to the subvol tree.

- the expected dispersion in the subvol tree can get better after
multiple defrags, if the defrag is made to approximately order the
files on disk according to their position in the filesystem directory
tree. That is, files from the same directory should be grouped
together on disk in order to speed up future defrags.
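
Here is the rough sketch of the batching loop I have in mind, in Python
just for illustration. Every name, the cost model, and the 128 MB budget
are hypothetical placeholders, not an existing btrfs or kernel interface.

METADATA_BUDGET = 128 * 1024 * 1024   # bytes reserved for pending tree updates
ITEMS_PER_EXTENT = 4                  # assumed: extent, csum, free-space, subvol items
AVG_ITEM_COST = 200                   # assumed average bytes of metadata per item

def estimate_metadata_cost(extent_len):
    """Guess how much pending metadata one relocated extent adds."""
    return ITEMS_PER_EXTENT * AVG_ITEM_COST

def commit_batch(batch):
    """Stand-in for: copy the data to its new location, rewrite the affected
    extent/csum/free-space/subvol items, then issue one transaction commit."""
    print(f"commit: {len(batch)} extents relocated in one transaction")

def defrag_pass(candidate_extent_lengths):
    batch, pending = [], 0
    for length in candidate_extent_lengths:   # extents chosen for relocation
        cost = estimate_metadata_cost(length)
        if batch and pending + cost > METADATA_BUDGET:
            commit_batch(batch)               # flush when the budget is reached
            batch, pending = [], 0
        batch.append(length)
        pending += cost
    if batch:
        commit_batch(batch)

For example, defrag_pass([4096] * 1_000_000) would relocate a million 4 KB
fragments in a handful of commits under this (made-up) cost model.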

So, each b-tree node is 16 KB, but I guess most nodes are only
half-full, so the average amount of live data per node is 8 KB. If the
defrag reserves 128 MB for a metadata update computation, it could
update about 16000 (16 thousand) metadata nodes per commit. It would
be interesting to try to predict how much file data, on average, is
referenced by such a 16000-node commit.
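
The node arithmetic behind that 16000 figure, for the record (the fill
factor is of course just my guess):

NODE_SIZE = 16 * 1024                 # 16 KB b-tree nodes
AVG_FILL = 0.5                        # guess: nodes are about half full
BUDGET = 128 * 1024 * 1024            # 128 MB reserved per commit

avg_live_per_node = int(NODE_SIZE * AVG_FILL)   # ~8 KB of live items per node
nodes_per_commit = BUDGET // avg_live_per_node
print(nodes_per_commit)               # 16384, i.e. roughly 16 thousand nodes

With "a few hundred" items per node (per the earlier description), one
commit could touch on the order of a few million tree items; how much file
data that corresponds to depends entirely on the fragment sizes involved.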

Ok, let's do some pure guessing. For a home user with a 256 GB root or
home partition, I would guess the average will be well over 500 MB of
file data per commit. It depends on how long it has been since the
last defrag. If defrag runs weekly or daily, then I guess well over
1 GB of file data per commit.

As for servers, there are so many different kinds that I'll just skip
the guessing.


