linux-btrfs.vger.kernel.org archive mirror
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Felix Niederwanger <felix@feldspaten.org>,
	Johannes Thumshirn <jthumshirn@suse.de>, Qu Wenruo <wqu@suse.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time
Date: Wed, 9 Oct 2019 19:00:18 +0800	[thread overview]
Message-ID: <43b50e0b-03dd-3b93-adaa-dc467d9cbad1@gmx.com> (raw)
In-Reply-To: <cba919dd-1b77-5757-cf37-760ad6718fe3@feldspaten.org>





On 2019/10/9 16:08, Felix Niederwanger wrote:
> Hi Qu,
> 
> I'm afraid the system is now in a very different configuration and
> already in production, which makes it impossible to run specific tests
> on it.
> 
> At the time the problem occurred, we were using about 57/82 TB, filled
> with approx. 20-30 million individual files of varying sizes. I
> created a histogram of the most common file sizes for a small subset:
> 
> # find . -type f -print0 | xargs -0 ls -l | \
>     awk '{size[int(log($5)/log(2))]++} END {for (i in size) printf("%10d %3d\n", 2^i, size[i])}' | \
>     sort -n
> 
>          0 7992
>          1 3072
>          2   4
>          8 1536
>        128   2
>        512  22
>       1024 4600
>       2048 3341
>       4096 5671
>       8192 940
>      16384 6535
>      32768 700
>      65536  17
>     131072 3843
>     262144 3362
>     524288 2143
>    1048576 1169
>    2097152 856
>    4194304 168
>    8388608 5579
>   16777216 4052
>   33554432 604
>   67108864 890
> 
> 26240 out of 57098 files (45%) are <=4k in size. This should be
> representative of most of the files on the affected volume.
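For reference, the same power-of-two histogram can be computed without relying on awk's log(0) behaviour for zero-byte files; a sketch, assuming GNU find (`-printf`) and gawk:

```shell
#!/bin/sh
# Power-of-two file size histogram, like the quoted one-liner, but with
# empty files counted in an explicit bucket instead of feeding 0 to log().
# Uses find -printf (GNU find) to avoid parsing ls output.
find . -type f -printf '%s\n' | awk '
    $1 == 0 { zero++; next }
            { size[int(log($1)/log(2))]++ }
    END {
        printf("%10d %6d\n", 0, zero)
        for (i in size) printf("%10d %6d\n", 2^i, size[i])
    }' | sort -n
```

Each output row is "bucket-floor count", so a 5-byte file lands in the `4` bucket (2^2).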

That explains the problem.

There are 3 main factors contributing to the mount time:
- Number of block groups
  This directly affects how many tree searches we need to do.

- Number of extents
  This affects how random the block group iteration will be.

- Disk IOPS performance
  Since bg iteration is mostly random IO, it's the IOPS that determine
  the overall mount time.
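As a rough back-of-the-envelope sketch (the leaf count and seek latency below are illustrative assumptions, not measurements from this system): if every extent tree leaf holding block group items costs one random read, mount time scales roughly as leaves × random-read latency:

```shell
#!/bin/sh
# Hypothetical example numbers: 20000 extent tree leaves to visit,
# 15 ms average random-read latency on a spinning RAID.
leaves=20000
seek_ms=15
echo "estimated mount time: $(( leaves * seek_ms / 1000 )) s"
```

With those assumed numbers the estimate is 300 s, i.e. the right order of magnitude for a 5-6 minute mount.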

In your case, your fs seems to check all the boxes.

Anyway, this still fits my assumptions. Although it's a pity that we
can't get a real-world benchmark, it still adds to the motivation for
the bg-tree feature.

Thanks,
Qu
> 
> The affected system is now in production (xfs on the same RAID6) and as
> I left university, it's unfortunately impossible to run any more tests
> on that particular system.
> 
> Greetings,
> Felix
> 
> 
> On 10/9/19 9:43 AM, Qu Wenruo wrote:
>>
>> On 2019/10/9 15:07, Felix Niederwanger wrote:
>>> Hey Johannes,
>>>
>>> glad to hear back from you :-)
>>>
>>> As discussed, I'll try to elaborate on the setup where we experienced
>>> the issue that mounting the btrfs volume takes more than 5 minutes. The
>>> initial bug report is at https://bugzilla.opensuse.org/show_bug.cgi?id=1143865
>>>
>>> Physical device:             Hardware RAID controller ARECA-1883 PCIe 3.0 to SAS/SATA 12Gb RAID Controller
>>>                              Hardware RAID6+HotSpare, 8 TB Seagate IronWolf NAS HDD
>>> Installed System:            OPENSUSE LEAP 15.1
>>> Disks:                       / is on a separate DOM
>>>                              /dev/sda1 is the affected btrfs volume
>>>
>>> Disk layout
>>>
>>> sda      8:0    0 98.2T  0 disk 
>>> └─sda1   8:1    0 81.9T  0 part /ESO-RAID
>> How much space is used?
>>
>> With enough space used (especially your tens of TB), it's pretty
>> easy to have so many block group items that they overload the extent tree.
>>
>> There is a tool that makes the problem easier to explore:
>> https://github.com/adam900710/btrfs-progs/tree/account_bgs
>>
>> You can compile the btrfs-corrupt-block tool, and then:
>> # ./btrfs-corrupt-block -X /dev/sda1
>>
>> It's recommended to call it with the fs unmounted.
>>
>> Then it should output something like:
>> extent_tree: total=1080 leaves=1025
>>
>> Then please post that line to surprise us.
>>
>> It shows how many unique tree blocks need to be read from disk just
>> to iterate the block group items.
>>
>> You could consider it as how many random IOs need to be done, each of
>> nodesize (normally 16K).
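To put the leaves= number in perspective, a small sketch reusing the example output above (leaves=1025) and assuming the default 16K nodesize:

```shell
#!/bin/sh
# Turn the tool's "leaves=" count into an IO estimate: each leaf is one
# random read of nodesize bytes. Values are the example ones from the
# sample output above, not from the affected system.
leaves=1025
nodesize=16384
echo "random reads: $leaves, metadata read: $(( leaves * nodesize / 1024 )) KiB"
```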
>>
>>> sdb      8:16   0   59G  0 disk 
>>> ├─sdb1   8:17   0    1G  0 part /boot
>>> ├─sdb2   8:18   0  5.9G  0 part [SWAP]
>>> └─sdb3   8:19   0 52.1G  0 part /
>>>
>>> System configuration : Opensuse LEAP 15.1 with "Server" configuration,
>>> installed NFS server.
>>>
>>> I copied data from the old NAS (separate server, xfs volume) to the new
>>> btrfs volume using rsync.
>> If you are willing to test and have enough spare space, you could try
>> my latest bg-tree feature to see whether it solves the problem.
>>
>> My not-so-optimistic guess is that the feature would reduce the mount
>> time to around 1 min; my average guess is around 30 s.
>>
>> Thanks,
>> Qu
>>
>>> Then I performed a system update with zypper, rebooted, and ran into
>>> the problems described in
>>> https://bugzilla.opensuse.org/show_bug.cgi?id=1143865. In short: boot
>>> failed because mounting /ESO-RAID ran into a 5-minute timeout. A manual
>>> mount worked (but took up to 6 minutes) and the filesystem was then
>>> completely unresponsive. See the bug report for more details about what
>>> became unresponsive.
>>>
>>> A movie of the failing boot process is still on my webserver:
>>> ftp://feldspaten.org/dump/20190803_btrfs_balance_issue/btrfs_openctree_failed.mp4
>>>
>>>
>>> I hope this helps to reproduce the issue. Feel free to contact me
>>> if you need further details.
>>>
>>> Greetings,
>>> Felix :-)
>>>
>>>
>>> On 10/8/19 11:47 AM, Johannes Thumshirn wrote:
>>>> On 08/10/2019 11:26, Qu Wenruo wrote:
>>>>> On 2019/10/8 17:14, Johannes Thumshirn wrote:
>>>>>>> [[Benchmark]]
>>>>>>> Since I have upgraded my rig to all NVME storage, there is no HDD
>>>>>>> test result.
>>>>>>>
>>>>>>> Physical device:	NVMe SSD
>>>>>>> VM device:		VirtIO block device, backed by a sparse file
>>>>>>> Nodesize:		4K  (to bump up tree height)
>>>>>>> Extent data size:	4M
>>>>>>> Fs size used:		1T
>>>>>>>
>>>>>>> All file extents on disk are 4M in size, preallocated to reduce space usage
>>>>>>> (as the VM uses a loopback block device backed by a sparse file).
>>>>>> Do you have some additional details about the test setup? I tried to
>>>>>> do the same (testing) for a bug Felix (added to Cc) reported to me at
>>>>>> the ALPSS Conference, and I couldn't reproduce the issue.
>>>>>>
>>>>>> My testing was a 100TB sparse file passed into a VM and running this
>>>>>> script to touch all blockgroups:
>>>>> Here is my test scripts:
>>>>> ---
>>>>> #!/bin/bash
>>>>>
>>>>> dev="/dev/vdb"
>>>>> mnt="/mnt/btrfs"
>>>>>
>>>>> nr_subv=16
>>>>> nr_extents=16384
>>>>> extent_size=$((4 * 1024 * 1024)) # 4M
>>>>>
>>>>> _fail()
>>>>> {
>>>>>         echo "!!! FAILED: $@ !!!"
>>>>>         exit 1
>>>>> }
>>>>>
>>>>> fill_one_subv()
>>>>> {
>>>>>         path=$1
>>>>>         if [ -z $path ]; then
>>>>>                 _fail "wrong parameter for fill_one_subv"
>>>>>         fi
>>>>>         btrfs subv create $path || _fail "create subv"
>>>>>
>>>>>         for i in $(seq 0 $((nr_extents - 1))); do
>>>>>                 fallocate -o $((i * $extent_size)) -l $extent_size \
>>>>>                         $path/file || _fail "fallocate"
>>>>>         done
>>>>> }
>>>>>
>>>>> declare -a pids
>>>>> umount $mnt &> /dev/null
>>>>> umount $dev &> /dev/null
>>>>>
>>>>> #~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
>>>>> mkfs.btrfs -f -n 4k $dev
>>>>> mount $dev $mnt -o nospace_cache
>>>>>
>>>>> for i in $(seq 1 $nr_subv); do
>>>>>         fill_one_subv $mnt/subv_${i} &
>>>>>         pids[$i]=$!
>>>>> done
>>>>>
>>>>> for i in $(seq 1 $nr_subv); do
>>>>>         wait ${pids[$i]}
>>>>> done
>>>>> sync
>>>>> umount $dev
>>>>>
>>>>> ---
>>>>>
>>>>>> #!/bin/sh
>>>>>>
>>>>>> FILE=/mnt/test
>>>>>>
>>>>>> add_dirty_bg() {
>>>>>>         off="$1"
>>>>>>         len="$2"
>>>>>>         touch $FILE
>>>>>>         xfs_io -c "falloc $off $len" $FILE
>>>>>>         rm $FILE
>>>>>> }
>>>>>>
>>>>>> mkfs.btrfs /dev/vda
>>>>>> mount /dev/vda /mnt
>>>>>>
>>>>>> for ((i = 1; i < 100000; i++)); do
>>>>>>         add_dirty_bg $i"G" "1G"
>>>>>> done
>>>>> This won't really build a problematic enough extent tree layout.
>>>>>
>>>>> A 1G fallocate will only create 8 128M file extents, thus 8 EXTENT_ITEMs.
>>>>>
>>>>> Thus a leaf (16K by default) can still contain a lot of BLOCK_GROUP_ITEMs
>>>>> together.
>>>>>
>>>>> To build a case that really shows the problem, you'll need a lot of
>>>>> EXTENT_ITEMs/METADATA_ITEMs to fill the gaps between the
>>>>> BLOCK_GROUP_ITEMs.
>>>>>
>>>>> My test script does that, but it may still not represent the real world,
>>>>> as the real world can produce even smaller extents due to snapshots.
>>>>>
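The leaf-packing arithmetic behind this, as a sketch (structure sizes assumed from the btrfs on-disk format: 101-byte leaf header, 25-byte item header, 24-byte BLOCK_GROUP_ITEM payload):

```shell
#!/bin/sh
# How many BLOCK_GROUP_ITEMs fit in one leaf when no extent items are
# interleaved between them. Sizes assumed from the btrfs on-disk format.
nodesize=16384     # default leaf/node size
leaf_header=101    # struct btrfs_header
item_header=25     # struct btrfs_item (disk key + offset + size)
bg_item=24         # struct btrfs_block_group_item payload
echo "BG items per empty leaf: $(( (nodesize - leaf_header) / (item_header + bg_item) ))"
```

Hundreds of block group items per leaf if nothing separates them, versus roughly one per leaf once enough EXTENT_ITEMs fill the gaps, which is exactly the layout the test script above tries to create.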
>>>> Ah, thanks for the explanation. I'll give your test script a try.
>>>>
>>>>
> 




Thread overview: 19+ messages
2019-10-08  4:49 [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time Qu Wenruo
2019-10-08  4:49 ` [PATCH v2 1/3] btrfs: block-group: Refactor btrfs_read_block_groups() Qu Wenruo
2019-10-08  9:08   ` Johannes Thumshirn
2019-10-09 12:08   ` Anand Jain
2019-10-09 12:14     ` Qu WenRuo
2019-10-09 13:07   ` [PATCH " Qu Wenruo
2019-10-09 14:25     ` Filipe Manana
2019-10-08  4:49 ` [PATCH v2 2/3] btrfs: disk-io: Remove unnecessary check before freeing chunk root Qu Wenruo
2019-10-08  8:30   ` Johannes Thumshirn
2019-10-10  2:00   ` Anand Jain
2019-10-10  2:21     ` Qu Wenruo
2019-10-08  4:49 ` [PATCH v2 3/3] btrfs: Introduce new incompat feature, BG_TREE, to speed up mount time Qu Wenruo
2019-10-08  9:09   ` Johannes Thumshirn
2019-10-08  9:14 ` [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce " Johannes Thumshirn
2019-10-08  9:26   ` Qu Wenruo
2019-10-08  9:47     ` Johannes Thumshirn
     [not found]       ` <b4821d86-eeb9-f21c-66aa-480df2d3a13d@feldspaten.org>
2019-10-09  7:43         ` Qu Wenruo
2019-10-09  8:08           ` Felix Niederwanger
2019-10-09 11:00             ` Qu Wenruo [this message]
