From: Hans van Kranenburg <Hans.van.Kranenburg@mendix.com>
To: Lionel Bouton <lionel-subscription@bouton.name>,
	"Wilson, Ellis" <ellisw@panasas.com>,
	BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: BTRFS Mount Delay Time Graph
Date: Mon, 3 Dec 2018 22:22:37 +0000	[thread overview]
Message-ID: <57bdbd3d-e605-7dad-4b9a-d78a2f657dce@mendix.com> (raw)
In-Reply-To: <4746e8ba-b20c-01e2-379e-b76f0d2ab5a7@bouton.name>

Hi,

On 12/3/18 8:56 PM, Lionel Bouton wrote:
> 
> On 03/12/2018 at 19:20, Wilson, Ellis wrote:
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS 
>> filesystem as it grows.  I finally had (made) time for this, and the 
>> attached is the result of my testing.  The image is a fairly 
>> self-explanatory graph, and the raw data is also attached in 
>> comma-delimited format for the more curious.  The columns are: 
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their 
>> own directory.  Files generated in parallel each epoch (25 at the same 
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments of 
>> 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount.  "Real" time used 
>> of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs 
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as 
>> bad as I originally thought it would be (perhaps as a result of using 
>> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
>> to comment if folks more intimately familiar with BTRFS think this is 
>> due to the very large files I've used.

Probably, yes. What happens during mount is that all block group items
are read from the extent tree. Instead of being nicely grouped together,
they are scattered all over the place, at their virtual address, in
between all the normal extent items.

So, mount time depends on the cold random read iops your storage can
deliver, on the size of the extent tree, and on the number of block
groups. Your extent tree has more items in it if you have more extents.
So, yes, writing a lot of 4KiB files should, I think, have a similar
effect to writing a lot of 128MiB files that are each still stored in a
single extent.
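
As a very rough back-of-envelope, purely to illustrate the scaling
(assuming data block groups of about 1GiB): 60TB is on the order of
60000 block groups. If even half of those block group item lookups need
their own cold random read of a millisecond or so, that already adds up
to roughly 30 seconds, which is the ballpark of the mount time reported
above. With more extent items in between, more of those lookups land on
different leaves, and the number goes up.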

>>  I can redo the test with much 
>> more realistic data if people have legitimate reason to think it will 
>> drastically change the result.
> 
> We are hosting some large BTRFS filesystems on Ceph (RBD used by
> QEMU/KVM). I believe the delay is heavily linked to the number of files.
> (I didn't check whether snapshots matter; I suspect they do, but not as
> much as the number of "original" files, at least if you don't heavily
> modify existing files but mostly create new ones, as we do.)
> As an example, we have a filesystem with 20TB of used space and 4
> subvolumes hosting multiple millions of files/directories (probably
> 10-20 million in total; I didn't check the exact number recently, as
> simply counting the files is a very long process) and 40 snapshots for
> each volume. Mount takes about 15 minutes.
> We have virtual machines that we don't reboot as often as we would like
> because of these slow mount times.
> 
> If you want to study this, you could:
> - graph the delay for various individual file sizes (instead of 25x10GB,
> create 2,500 x 100MB and 250,000 x 1MB files between each run and
> compare to the original result)
> - graph the delay vs. the number of snapshots (probably starting with a
> large number of files in the initial subvolume, so that you start with
> a non-trivial mount delay)
> You may want to study the impact of the differences between snapshots by
> comparing snapshots taken without modifications and snapshots made at
> various stages of your subvolume growth.
> 
> Note: recently I tried upgrading from 4.9 to 4.14 kernels, various
> tunings of the IO queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines), and BTRFS mount options
> (space_cache=v2,ssd_spread), but there wasn't any measurable improvement
> in mount time (I did manage to reduce the amount of IO requests by half
> on one server in production, although more tests are needed to isolate
> the cause).
> I didn't expect much for the mount times; it seems to me that mount is
> mostly constrained by the BTRFS on-disk structures needed at mount time
> and by how the filesystem reads them (for example, it doesn't benefit at
> all from large IO queue depths, which probably means that each read
> depends on previous ones, which prevents io-schedulers from optimizing
> anything).

Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982

What the code does here is start at the beginning of the extent tree and
search forward until it sees the first BLOCK_GROUP_ITEM (which is not
that far away). Then, based on the information in that item, it computes
where the next one will be (just after the end of its vaddr+length),
jumps over all the normal extent items, and searches again near where the
next block group item has to be. So, yes, that means that the reads
depend on each other.

Two possible ways to improve this:

1. Walk the chunk tree (which has all related items packed together)
instead, to find out at which locations in the extent tree the block
group items are located, and then start fetching those items in parallel.
If you have storage with a lot of rotating rust that can deliver many
more random reads when you ask for more of them at the same time, this
alone can already cause a massive speedup.

2. Move the block group items somewhere else, where they can nicely be
grouped together, so that the number of metadata pages that have to be
read is minimal. Quoting from the link below, this is "slightly tricky
[...] but there are no fundamental obstacles".

https://www.spinics.net/lists/linux-btrfs/msg71766.html

I think the main obstacle here is finding a developer with enough
experience and time to do it. :)

For fun, you can also just read the block group metadata after dropping
caches each time, which should give similar relative timing results to
mounting the filesystem again. (Well, as long as disk IO wait is the
main slowdown, of course.)
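
For example: drop caches with "echo 3 > /proc/sys/vm/drop_caches" (as
root), and then run "time ./bg_via_chunks.py /btrfs", using the mount
point from your own setup.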

Attached are two example programs, using python-btrfs.

* bg_after_another.py does the same thing as the kernel code I just linked.
* bg_via_chunks.py looks them up based on chunk tree info.

The time it takes once option 2 above is implemented should be very
similar to just reading the chunk tree. (Remove the block group lookup
from bg_via_chunks and run that.)

What's still missing is changing the bg_via_chunks one to kick off the
block group searches in parallel; with that, you can predict how long
mounting would take if option 1 were implemented.
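
Something like the following could be a starting point. It's an untested
sketch, just to illustrate the idea; the worker count and the choice to
give every worker its own FileSystem object are arbitrary, not something
I benchmarked:

#!/usr/bin/python3

import concurrent.futures
import sys

import btrfs

if len(sys.argv) < 2:
    print("Usage: {} <mountpoint>".format(sys.argv[0]))
    sys.exit(1)

path = sys.argv[1]
num_workers = 16  # arbitrary; tune to what the storage can keep busy

# The chunk tree has all its items packed together, so listing the
# chunks is cheap.
fs = btrfs.FileSystem(path)
chunks = [(chunk.vaddr, chunk.length) for chunk in fs.chunks()]


def lookup_block_groups(part):
    # Every worker gets its own FileSystem (its own fd), so we don't
    # have to think about sharing one file descriptor between threads.
    worker_fs = btrfs.FileSystem(path)
    for vaddr, length in part:
        worker_fs.block_group(vaddr, length)
        print('.', end='', flush=True)


# Hand each worker a round-robin slice of the chunk list, so the block
# group searches (and the random reads behind them) run concurrently.
parts = [chunks[i::num_workers] for i in range(num_workers)]
with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as pool:
    list(pool.map(lookup_block_groups, parts))
print()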

\:D/

-- 
Hans van Kranenburg

[-- Attachment #2: bg_after_another.py --]
[-- Type: text/x-python; name="bg_after_another.py", Size: 798 bytes --]

#!/usr/bin/python3

import btrfs
import sys

if len(sys.argv) < 2:
    print("Usage: {} <mountpoint>".format(sys.argv[0]))
    sys.exit(1)


tree = btrfs.ctree.EXTENT_TREE_OBJECTID
min_key = btrfs.ctree.Key(0, 0, 0)  # start at the very beginning of the extent tree
bufsize = btrfs.utils.SZ_4K


def first_block_group_after(fs, key):
    # Search forward from 'key' and return the header of the first
    # BLOCK_GROUP_ITEM found, skipping over all normal extent items.
    for header, data in btrfs.ioctl.search_v2(fs.fd, tree, key, buf_size=bufsize):
        if header.type == btrfs.ctree.BLOCK_GROUP_ITEM_KEY:
            return header


fs = btrfs.FileSystem(sys.argv[1])
while True:
    header = first_block_group_after(fs, min_key)
    if header is None:
        break
    # A block group item's key is (vaddr, BLOCK_GROUP_ITEM_KEY, length),
    # so the next one can only start after vaddr + length: search there next.
    min_key = btrfs.ctree.Key(header.objectid + header.offset,
                              btrfs.ctree.BLOCK_GROUP_ITEM_KEY, 0)
    print('.', end='', flush=True)

print()

[-- Attachment #3: bg_via_chunks.py --]
[-- Type: text/x-python; name="bg_via_chunks.py", Size: 306 bytes --]

#!/usr/bin/python3

import btrfs
import sys

if len(sys.argv) < 2:
    print("Usage: {} <mountpoint>".format(sys.argv[0]))
    sys.exit(1)

fs = btrfs.FileSystem(sys.argv[1])
# The chunk tree has all chunk items packed closely together, so walking it
# is cheap. For every chunk, look up the matching block group item in the
# extent tree; those lookups are the scattered reads that dominate mount time.
for chunk in fs.chunks():
    fs.block_group(chunk.vaddr, chunk.length)
    print('.', end='', flush=True)

print()

