Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
From: General Zed <general-zed@zedlx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Chris Murphy <lists@colorremedies.com>,
	"Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Feature requests: online backup - defrag - change RAID level
Date: Fri, 13 Sep 2019 01:22:24 -0400
Message-ID: <20190913012224.Horde.29BsHKRl-F2PB5EG-U0wHMA@server53.web-hosting.com> (raw)
In-Reply-To: <20190913031242.GF22121@hungrycats.org>


Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:

> On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
>>
>> Quoting Zygo Blaxell <ce3g8jdj@umail.furryterror.org>:
>>
>> You mean: all metadata size is 156 GB on one of your systems. However, you
>> don't typically have to put ALL metadata in RAM.
>> You need just some parts needed for defrag operation. So, for defrag, what
>> you really need is just some large metadata cache present in RAM. I would
>> say that if such a metadata cache is using 128 MB (for 2 TB disk) to 2 GB
>> (for 156 GB disk), than the defrag will run sufficiently fast.
>
> You're missing something (metadata requirement for delete?) in those
> estimates.
>
> Total metadata size does not affect how much metadata cache you need
> to defragment one extent quickly.  That number is a product of factors
> including input and output and extent size ratio, the ratio of various
> metadata item sizes to the metadata page size, and the number of trees you
> have to update (number of reflinks + 3 for extent, csum, and free space
> trees).
>
> It follows from the above that if you're joining just 2 unshared extents
> together, the total metadata required is well under a MB.
>
> If you're defragging a 128MB journal file with 32768 4K extents, it can
> create several GB of new metadata and spill out of RAM cache (which is
> currently capped at 512MB for assorted reasons).  Add reflinks and you
> might need more cache, or take a performance hit.  Yes, a GB might be
> the total size of all your metadata, but if you run defrag on a 128MB
> log file you could rewrite all of your filesystem's metadata in a single
> transaction (almost...you probably won't need to update the device or
> uuid trees).
>
> If you want to pipeline multiple extents per commit to avoid seeking,
> you need to multiply the above numbers by the size of the pipeline.
>
> You can also reduce the metadata cache requirement by reducing the output
> extent size.  A 16MB target extent size requires only 64MB of cache for
> the logfile case.
>
>> > Don't forget you have to write new checksum and free space tree pages.
>> > In the worst case, you'll need about 1GB of new metadata pages for each
>> > 128MB you defrag (though you get to delete 99.5% of them immediately
>> > after).
>>
>> Yes, here we are debating some worst-case scenaraio which is actually
>> imposible in practice due to various reasons.
>
> No, it's quite possible.  A log file written slowly on an active
> filesystem above a few TB will do that accidentally.  Every now and then
> I hit that case.  It can take several hours to do a logrotate on spinning
> arrays because of all the metadata fetches and updates associated with
> worst-case file delete.  Long enough to watch the delete happen, and
> even follow along in the source code.
>
> I guess if I did a proactive defrag every few hours, it might take less
> time to do the logrotate, but that would mean spreading out all the
> seeky IO load during the day instead of getting it all done at night.
> Logrotate does the same job as defrag in this case (replacing a file in
> thousands of fragments spread across the disk with a few large fragments
> close together), except logrotate gets better compression.
>
> To be more accurate, the example I gave above is the worst case you
> can expect from normal user workloads.  If I throw in some reflinks
> and snapshots, I can make it arbitrarily worse, until the entire disk
> is consumed by the metadata update of a single extent defrag.
>

In fact, I overcomplicated it in my previous answer.

So, we have a 1TB log file "ultralog" split into 256 million 4 KB  
extents randomly over the entire disk. We have 512 GB free RAM and 2%  
free disk space. The file needs to be defragmented.

We select some (any) consecutive 512 MB of file segments. They are  
certainly localized in the metadata, because we are talking about an  
ordered b-tree. We write those 512 MB of file extents to another place  
on the partition, defragmented (don't update b-tree yet). Then defrag  
calculates the fuse (merge) operation on those written extents. Then  
it calculates which metadata updates are necessary. Since we have  
selected (at the start) consecutive 512 MB of file segments, the  
updates to metadata are certainly localazed. The defrag writes out, in  
new extents, the required changes to metadata, then updates the super  
to commit.

Easy.


  parent reply index

Thread overview: 111+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-09  2:55 zedlryqc
2019-09-09  3:51 ` Qu Wenruo
2019-09-09 11:25   ` zedlryqc
2019-09-09 12:18     ` Qu Wenruo
2019-09-09 12:28       ` Qu Wenruo
2019-09-09 17:11         ` webmaster
2019-09-10 17:39           ` Andrei Borzenkov
2019-09-10 22:41             ` webmaster
2019-09-09 15:29       ` Graham Cobb
2019-09-09 17:24         ` Remi Gauvin
2019-09-09 19:26         ` webmaster
2019-09-10 19:22           ` Austin S. Hemmelgarn
2019-09-10 23:32             ` webmaster
2019-09-11 12:02               ` Austin S. Hemmelgarn
2019-09-11 16:26                 ` Zygo Blaxell
2019-09-11 17:20                 ` webmaster
2019-09-11 18:19                   ` Austin S. Hemmelgarn
2019-09-11 20:01                     ` webmaster
2019-09-11 21:42                       ` Zygo Blaxell
2019-09-13  1:33                         ` General Zed
2019-09-11 21:37                     ` webmaster
2019-09-12 11:31                       ` Austin S. Hemmelgarn
2019-09-12 19:18                         ` webmaster
2019-09-12 19:44                           ` Chris Murphy
2019-09-12 21:34                             ` General Zed
2019-09-12 22:28                               ` Chris Murphy
2019-09-12 22:57                                 ` General Zed
2019-09-12 23:54                                   ` Zygo Blaxell
2019-09-13  0:26                                     ` General Zed
2019-09-13  3:12                                       ` Zygo Blaxell
2019-09-13  5:05                                         ` General Zed
2019-09-14  0:56                                           ` Zygo Blaxell
2019-09-14  1:50                                             ` General Zed
2019-09-14  4:42                                               ` Zygo Blaxell
2019-09-14  4:53                                                 ` Zygo Blaxell
2019-09-15 17:54                                                 ` General Zed
2019-09-16 22:51                                                   ` Zygo Blaxell
2019-09-17  1:03                                                     ` General Zed
2019-09-17  1:34                                                       ` General Zed
2019-09-17  1:44                                                       ` Chris Murphy
2019-09-17  4:55                                                         ` Zygo Blaxell
2019-09-17  4:19                                                       ` Zygo Blaxell
2019-09-17  3:10                                                     ` General Zed
2019-09-17  4:05                                                       ` General Zed
2019-09-14  1:56                                             ` General Zed
2019-09-13  5:22                                         ` General Zed [this message]
2019-09-13  6:16                                         ` General Zed
2019-09-13  6:58                                         ` General Zed
2019-09-13  9:25                                           ` General Zed
2019-09-13 17:02                                             ` General Zed
2019-09-14  0:59                                             ` Zygo Blaxell
2019-09-14  1:28                                               ` General Zed
2019-09-14  4:28                                                 ` Zygo Blaxell
2019-09-15 18:05                                                   ` General Zed
2019-09-16 23:05                                                     ` Zygo Blaxell
2019-09-13  7:51                                         ` General Zed
2019-09-13 11:04                                     ` Austin S. Hemmelgarn
2019-09-13 20:43                                       ` Zygo Blaxell
2019-09-14  0:20                                         ` General Zed
2019-09-14 18:29                                       ` Chris Murphy
2019-09-14 23:39                                         ` Zygo Blaxell
2019-09-13 11:09                                   ` Austin S. Hemmelgarn
2019-09-13 17:20                                     ` General Zed
2019-09-13 18:20                                       ` General Zed
2019-09-12 19:54                           ` Austin S. Hemmelgarn
2019-09-12 22:21                             ` General Zed
2019-09-13 11:53                               ` Austin S. Hemmelgarn
2019-09-13 16:54                                 ` General Zed
2019-09-13 18:29                                   ` Austin S. Hemmelgarn
2019-09-13 19:40                                     ` General Zed
2019-09-14 15:10                                       ` Jukka Larja
2019-09-12 22:47                             ` General Zed
2019-09-11 21:37                   ` Zygo Blaxell
2019-09-11 23:21                     ` webmaster
2019-09-12  0:10                       ` Remi Gauvin
2019-09-12  3:05                         ` webmaster
2019-09-12  3:30                           ` Remi Gauvin
2019-09-12  3:33                             ` Remi Gauvin
2019-09-12  5:19                       ` Zygo Blaxell
2019-09-12 21:23                         ` General Zed
2019-09-14  4:12                           ` Zygo Blaxell
2019-09-16 11:42                             ` General Zed
2019-09-17  0:49                               ` Zygo Blaxell
2019-09-17  2:30                                 ` General Zed
2019-09-17  5:30                                   ` Zygo Blaxell
2019-09-17 10:07                                     ` General Zed
2019-09-17 23:40                                       ` Zygo Blaxell
2019-09-18  4:37                                         ` General Zed
2019-09-18 18:00                                           ` Zygo Blaxell
2019-09-10 23:58             ` webmaster
2019-09-09 23:24         ` Qu Wenruo
2019-09-09 23:25         ` webmaster
2019-09-09 16:38       ` webmaster
2019-09-09 23:44         ` Qu Wenruo
2019-09-10  0:00           ` Chris Murphy
2019-09-10  0:51             ` Qu Wenruo
2019-09-10  0:06           ` webmaster
2019-09-10  0:48             ` Qu Wenruo
2019-09-10  1:24               ` webmaster
2019-09-10  1:48                 ` Qu Wenruo
2019-09-10  3:32                   ` webmaster
2019-09-10 14:14                     ` Nikolay Borisov
2019-09-10 22:35                       ` webmaster
2019-09-11  6:40                         ` Nikolay Borisov
2019-09-10 22:48                     ` webmaster
2019-09-10 23:14                   ` webmaster
2019-09-11  0:26               ` webmaster
2019-09-11  0:36                 ` webmaster
2019-09-11  1:00                 ` webmaster
2019-09-10 11:12     ` Austin S. Hemmelgarn
2019-09-09  3:12 webmaster

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190913012224.Horde.29BsHKRl-F2PB5EG-U0wHMA@server53.web-hosting.com \
    --to=general-zed@zedlx.com \
    --cc=ahferroin7@gmail.com \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=lists@colorremedies.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org linux-btrfs@archiver.kernel.org
	public-inbox-index linux-btrfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox