From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: General Zed <general-zed@zedlx.com>
Cc: Chris Murphy <lists@colorremedies.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Sat, 14 Sep 2019 00:53:11 -0400
Message-ID: <20190914045311.GM22121@hungrycats.org>
In-Reply-To: <20190914044219.GL22121@hungrycats.org>

On Sat, Sep 14, 2019 at 12:42:19AM -0400, Zygo Blaxell wrote:
> On Fri, Sep 13, 2019 at 09:50:38PM -0400, General Zed wrote:
> > 
> > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > 
> > > On Fri, Sep 13, 2019 at 01:05:52AM -0400, General Zed wrote:
> > > > 
> > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > > 
> > > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> > > > > >
> > > > > > Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
> > > > > >
> > > > > > > Don't forget you have to write new checksum and free space tree pages.
> > > > > > > In the worst case, you'll need about 1GB of new metadata pages for each
> > > > > > > 128MB you defrag (though you get to delete 99.5% of them immediately
> > > > > > > after).
> > > > > >
> > > > > > Yes, here we are debating some worst-case scenario which is actually
> > > > > > impossible in practice due to various reasons.
> > > > >
> > > > > No, it's quite possible.  A log file written slowly on an active
> > > > > filesystem above a few TB will do that accidentally.  Every now and then
> > > > > I hit that case.  It can take several hours to do a logrotate on spinning
> > > > > arrays because of all the metadata fetches and updates associated with
> > > > > worst-case file delete.  Long enough to watch the delete happen, and
> > > > > even follow along in the source code.
> > > > >
> > > > > I guess if I did a proactive defrag every few hours, it might take less
> > > > > time to do the logrotate, but that would mean spreading out all the
> > > > > seeky IO load during the day instead of getting it all done at night.
> > > > > Logrotate does the same job as defrag in this case (replacing a file in
> > > > > thousands of fragments spread across the disk with a few large fragments
> > > > > close together), except logrotate gets better compression.
> > > > >
> > > > > To be more accurate, the example I gave above is the worst case you
> > > > > can expect from normal user workloads.  If I throw in some reflinks
> > > > > and snapshots, I can make it arbitrarily worse, until the entire disk
> > > > > is consumed by the metadata update of a single extent defrag.
> > > > >
> > > > 
> > > > I can't believe I am considering this case.
> > > > 
> > > > So, we have a 1TB log file "ultralog" split into 256 million 4 KB extents
> > > > randomly over the entire disk. We have 512 GB free RAM and 2% free disk
> > > > space. The file needs to be defragmented.
> > > > 
> > > > In order to do that, defrag needs to be able to copy-move multiple extents
> > > > in one batch, and update the metadata.
> > > > 
> > > > The metadata has a total of at least 256 million entries, each of some size,
> > > > but each one should hold at least a pointer to the extent (8 bytes) and a
> > > > checksum (8 bytes); in reality, there is probably a lot of other data there
> > > > per entry.
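
As a rough sanity check on the numbers above, the poster's own lower bound
(16 bytes per entry) already works out to a few GiB of metadata for this
scenario.  A minimal sketch in Python, using only the assumptions stated in
the text:

    # Back-of-envelope for the 1 TB / 4 KB scenario above.  The 16 bytes per
    # entry is the lower bound assumed in the text, not a btrfs on-disk size.
    extents = 2**40 // 2**12     # 1 TB file in 4 KB extents (the "256 million" above)
    min_entry_bytes = 8 + 8      # pointer (8 bytes) + checksum (8 bytes)
    print(extents, "extents")
    print(extents * min_entry_bytes / 2**30, "GiB of metadata, as a bare minimum")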
> > > 
> > > It's about 48KB per 4K extent, plus a few hundred bytes on average for each
> > > reference.
> > 
> > Sorry, could you be more clear there? A file fragment/extent that holds
> > file data can be any size up to 128 MB. What metadata is there for every
> > file fragment/extent?
> > 
> > Because "48 KB per 4 K extent" ... I cannot decode what you mean.
> 
> An extent has 3 associated records in btrfs, not including its references.
> The first two exist while the extent exists, the third appears after it
> is removed.
> 
> 	- extent tree:  location, size of extent, pointers to backref trees.
> 	Length is around 60 bytes plus the size of the backref pointer list.
> 
> 	- csum tree:  location, 1 or more 4-byte csums packed in an array.
> 	Length of item is number of extent data blocks * 4 bytes plus a
> 	168-bit header (ish...csums from adjacent extents may be packed
> 	using a shared header)
> 
> 	- free space tree:  location, size of free space.  This appears
> 	when the extent is deleted.  It may be merged with adjacent
> 	records.  Length is maybe 20 bytes?
> 
> Each page contains a few hundred items, so if there are a few hundred
> unrelated extents between extents in the log file, each log file extent
> gets its own metadata page in each tree.
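
To make the "48KB per 4K extent" figure above concrete: three trees times one
16 KB page each is 48 KB, and the same assumption (no two log extents sharing
a metadata page) also reproduces the earlier "about 1GB per 128MB defragged"
estimate.  A rough back-of-envelope in Python:

    # Worst-case metadata page writes, assuming each scattered 4 KB extent
    # touches its own 16 KB page in the extent, csum and free space trees.
    page = 16 * 1024
    trees = 3
    print(trees * page // 1024, "KB of metadata pages per 4 KB extent")   # 48 KB

    batch = 128 * 2**20 // 4096   # 4 KB extents in a 128 MB defrag batch
    # Not every extent necessarily dirties a fresh page in all three trees,
    # so the earlier "about 1GB" estimate falls between these two bounds:
    print(batch * 2 * page / 2**30, "-", batch * 3 * page / 2**30, "GiB")  # 1.0 - 1.5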
> 
> > Another question is: what is the average size of metadata extents?
> 
> Metadata extents are all 16K.
> 
> > > > The metadata is organized as a b-tree. Therefore, nearby nodes should
> > > > contain data of consecutive file extents.
> > > 
> > > It's 48KB per item.
> > 
> > What's the "item"?
> 
> Items are the objects stored in the trees.  So one extent item, one csum
> item, and one free space tree item, all tied to the 4K extent from the
> log file.
> 
> > > As you remove the original data extents, you will
> > > be touching a 16KB page in three trees for each extent that is removed:
> > > Free space tree, csum tree, and extent tree.  This happens after the
> > > merged extent is created.  It is part of the cleanup operation that
> > > gets rid of the original 4K extents.
> > 
> > Ok, but how big are the free space tree and the csum tree?
> 
> At least 12GB in the worst-case example.

The filesystem where I was hitting the 1GB-metadata issue has 79GB
of metadata.  An average of 500-1000 new extents are created on the
filesystem between log file page writes, maybe up to 5000 during peak
activity periods.
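
A toy illustration of why that write pattern hits the worst case: with a few
hundred items per 16 KB metadata page and 500-1000 unrelated extents allocated
between consecutive log writes, consecutive log extents cannot land on the
same page.  (The 300 below is an assumed stand-in for "a few hundred".)

    # Toy model: consecutive log-file extents share a metadata page only if
    # fewer unrelated extents than items-per-page were created between them.
    items_per_page = 300              # "a few hundred items" per 16 KB page
    extents_between_writes = 500      # low end of the 500-1000 figure above
    print("consecutive log extents can share a page:",
          extents_between_writes < items_per_page)   # False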

> > Also, when moving a file to defragment it, there should still be some
> > locality even in the free space tree.
> 
> It is only guaranteed in the defrag result, because the defrag result is
> the only thing that is necessarily physically contiguous.
> 
> > And the csum tree, it should be ordered similarly to the free space tree, right?
> 
> They are all ordered by extent physical address (same physical blocks,
> same metadata item key).
> 
> > > Because the file was written very slowly on a big filesystem, the extents
> > > are scattered pessimally all over the virtual address space, not packed
> > > close together.  If there are a few hundred extent allocations between
> > > each log extent, then they will all occupy separate metadata pages.
> > 
> > Ok, now you are talking about your pathological case. Let's consider it.
> > 
> > Note that there is very little that can be done in the case that you are
> > describing. In order to defrag such a file, either the defrag will take many
> > small steps and therefore it will be slow (because each step needs to
> > perform an update to the metadata), or the defrag can do it in one big step
> > and use a huge amount of RAM.
> > 
> > So, the best thing to be done in this situation is to allow the user to
> > specify the amount of RAM that defrag is allowed to use, so that the user
> > decides which of the two (slow defrag or lots of RAM) he wants.
> > 
> > There is no way around it. There is no better defrag than the one that has
> > ALL the information at hand; that one will be the fastest and the best defrag.
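
To make that proposal concrete, here is a minimal sketch (Python pseudocode,
not an existing btrfs interface) of a defrag loop bounded by a user-supplied
RAM budget; est_metadata_cost() and relocate() are hypothetical callbacks
standing in for the real work:

    # Hypothetical RAM-bounded defrag loop: the user picks the memory budget,
    # the defragmenter packs as many extents per batch as the budget allows,
    # and each batch ends with a metadata update/commit via relocate().
    def defrag(file_extents, ram_budget_bytes, est_metadata_cost, relocate):
        batch, cost = [], 0
        for extent in file_extents:
            extra = est_metadata_cost(extent)    # metadata this extent will dirty
            if batch and cost + extra > ram_budget_bytes:
                relocate(batch)                  # copy-move + metadata update + commit
                batch, cost = [], 0
            batch.append(extent)
            cost += extra
        if batch:
            relocate(batch)

    # e.g. defrag(extents, 64 * 2**20, lambda e: 48 * 1024, do_relocate)

A small budget gives the "many small steps, slower" behaviour; a large budget
gives the "one big step, lots of RAM" behaviour described above.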
> > 
> > > When it is time to remove them, each of these pages must be updated.
> > > This can be hit in a number of places in btrfs, including overwrite
> > > and delete.
> > > 
> > > There's also 60ish bytes per extent in any subvol trees the file
> > > actually appears in, but you do get locality in that one (the key is
> > > inode and offset, so nothing can get between them and space them apart).
> > > That's 12GB and change (you'll probably completely empty most of the
> > > updated subvol metadata pages, so we can expect maybe 5 pages to remain
> > > including root and interior nodes).  I haven't been unlucky enough to
> > > get a "natural" 12GB, but I got over 1GB a few times recently.
> > 
> > The thing that I figured out (and I have already written it down in another
> > post) is that the defrag can CHOOSE AT WILL how large an update to the metadata
> > it wants to perform (within the limit of available RAM). The defrag can select,
> > by itself, the most efficient way to proceed while still honoring the
> > user-supplied limit on RAM.
> 
> Yeah, it can update half the reflinks and pause for a commit, or similar.
> If there's a power failure then there will be a duplicate extent with some
> of the references to one copy and some to the other, but this is probably
> rare enough not to matter.
> 
> > > Reflinks can be used to multiply that 12GB arbitrarily--you only get
> > > locality if the reflinks are consecutive in (inode, offset) space,
> > > so if the reflinks are scattered across subvols or files, they won't
> > > share pages.
> > 
> > OK.
> > 
> > Yes, given a sufficiently pathological case, the defrag will take forever.
> > There is nothing unexpected there. I agree on that point. The defrag always
> > functions within certain prerequisites.
> > 
> > > > The trick, in this case, is to select one part of "ultralog" which is
> > > > localized in the metadata, and defragment it. Repeating this step will
> > > > ultimately defragment the entire file.
> > > > 
> > > > So, the defrag selects some part of metadata which is entirely a descendant
> > > > of some b-tree node not far from the bottom of b-tree. It selects it such
> > > > that the required update to the metadata is less than, let's say, 64 MB, and
> > > > simultaneously the affected "ultralog" file fragments total less than 512 MB
> > > > (therefore, less than 128 thousand metadata leaf entries, each pointing to a
> > > > 4 KB fragment). Then it finds all the file extents pointed to by that part
> > > > of metadata. They are consecutive (as file fragments), because we have
> > > > selected such part of metadata. Now the defrag can safely copy-move those
> > > > fragments to a new area and update the metadata.
> > > > 
> > > > In order to quickly select that small part of metadata, the defrag needs a
> > > > metadata cache that can hold somewhat more than 128 thousand localized
> > > > metadata leaf entries. That definitely fits into 128 MB of RAM.
> > > > 
> > > > Of course, there are many other small issues there, but this outlines the
> > > > general procedure.
> > > > 
> > > > Problem solved?
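
A compact sketch of that subtree-at-a-time procedure (Python pseudocode; the
helper callbacks are hypothetical, and the 64 MB / 512 MB thresholds are the
ones proposed in the text, not btrfs constants):

    # Walk the metadata b-tree, pick subtrees near the leaves whose pending
    # metadata update stays under ~64 MB and whose file data stays under
    # ~512 MB, relocate that run of consecutive fragments, update metadata.
    def defrag_in_steps(subtrees, est_meta, est_data, extents_of,
                        copy_move, update_meta,
                        max_meta=64 * 2**20, max_data=512 * 2**20):
        for subtree in subtrees:
            if est_meta(subtree) <= max_meta and est_data(subtree) <= max_data:
                extents = extents_of(subtree)    # consecutive fragments of the file
                copy_move(extents)               # write them out contiguously
                update_meta(subtree, extents)    # then update and commit the new refs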
> > 
> > > Problem missed completely.  The forward reference updates were the only
> > > easy part.
> > 
> > Oh, I'll reply in another mail; this one is getting too tiring.
> > 
> > 
