From: Felix Niederwanger <felix@feldspaten.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	Johannes Thumshirn <jthumshirn@suse.de>, Qu Wenruo <wqu@suse.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 0/3] btrfs: Introduce new incompat feature BG_TREE to hugely reduce mount time
Date: Wed, 9 Oct 2019 10:08:24 +0200
Message-ID: <cba919dd-1b77-5757-cf37-760ad6718fe3@feldspaten.org>
In-Reply-To: <6f17fcd6-c576-adad-fb20-defcfde4efb5@gmx.com>



Hi Qu,

I'm afraid the system is now in a very different configuration and
already in production, which makes it impossible to run specific tests
on it.

At the time the problem occurred, about 57 of the 82 TB were filled
with approximately 20-30 million individual files of varying sizes. I
created a histogram of the most common file sizes for a small subset:

# find . -type f -print0 | xargs -0 ls -l | \
    awk '{size[int(log($5)/log(2))]++} END {for (i in size) printf("%10d %3d\n", 2^i, size[i])}' | \
    sort -n
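
(Here $5 is the size column of the ls -l output; each file lands in
the bucket of the largest power of two not exceeding its size, with
zero-size files in the 0 bucket.)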

         0 7992
         1 3072
         2   4
         8 1536
       128   2
       512  22
      1024 4600
      2048 3341
      4096 5671
      8192 940
     16384 6535
     32768 700
     65536  17
    131072 3843
    262144 3362
    524288 2143
   1048576 1169
   2097152 856
   4194304 168
   8388608 5579
  16777216 4052
  33554432 604
  67108864 890

26240 out of 57098 files (~46%) are <=4k in size. This should be
representative of most of the files on the affected volume.
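
For reference, that share can be recomputed from the histogram output
above (a minimal sketch; "hist.txt" is a hypothetical file holding the
two-column output):

# awk '$1<=4096 {s+=$2} {t+=$2} END {printf "%d/%d (%.0f%%)\n", s, t, 100*s/t}' hist.txt
26240/57098 (46%)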

The affected system is now in production (xfs on the same RAID6), and
since I have left the university, it is unfortunately impossible to run
any more tests on that particular machine.

Greetings,
Felix


On 10/9/19 9:43 AM, Qu Wenruo wrote:
>
> On 2019/10/9 3:07 PM, Felix Niederwanger wrote:
>> Hey Johannes,
>>
>> glad to hear back from you :-)
>>
>> As discussed, I will try to elaborate on the setup where we experienced
>> the issue that a btrfs mount takes more than 5 minutes. The initial bug
>> report is at https://bugzilla.opensuse.org/show_bug.cgi?id=1143865
>>
>> Physical device:             ARECA-1883 PCIe 3.0 to SAS/SATA 12Gb hardware RAID controller
>>                              Hardware RAID6 + hot spare, 8 TB Seagate IronWolf NAS HDDs
>> Installed system:            openSUSE Leap 15.1
>> Disks:                       / is on a separate DOM
>>                              /dev/sda1 is the affected btrfs volume
>>
>> Disk layout
>>
>> sda      8:0    0 98.2T  0 disk 
>> └─sda1   8:1    0 81.9T  0 part /ESO-RAID
> How much space is used?
>
> With enough space used (especially your tens of TB), it's pretty
> easy to have enough block group items to overload the extent tree.
>
> There is a better tool to explore the problem more easily:
> https://github.com/adam900710/btrfs-progs/tree/account_bgs
>
> You can compile the btrfs-corrupt-block tool, and then:
> # ./btrfs-corrupt-block -X /dev/sda1
>
> It's recommended to call it with the fs unmounted.
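>
> A minimal build-and-run sketch (assuming the usual btrfs-progs
> autotools flow; adjust branch and paths as needed):
>
> $ git clone -b account_bgs https://github.com/adam900710/btrfs-progs.git
> $ cd btrfs-progs
> $ ./autogen.sh && ./configure --disable-documentation && make btrfs-corrupt-block
> # ./btrfs-corrupt-block -X /dev/sda1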
>
> Then it should output something like:
> extent_tree: total=1080 leaves=1025
>
> Then please post that line to surprise us.
>
> It shows how many unique tree blocks need to be read from disk just
> for iterating the block group items.
>
> You can think of it as the number of random reads of nodesize
> (normally 16K) that have to be done.
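>
> As a back-of-envelope model (assuming ~10 ms per random 16K read on
> rotating disks): mount time ~= leaves x 10 ms, so a 5-minute mount
> would correspond to roughly 30,000 extent tree leaves.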
>
>> sdb      8:16   0   59G  0 disk 
>> ├─sdb1   8:17   0    1G  0 part /boot
>> ├─sdb2   8:18   0  5.9G  0 part [SWAP]
>> └─sdb3   8:19   0 52.1G  0 part /
>>
>> System configuration: openSUSE Leap 15.1 with the "Server" role and
>> the NFS server installed.
>>
>> I copied data from the old NAS (separate server, xfs volume) to the new
>> btrfs volume using rsync.
> If you are willing to test and have enough spare space, you could try
> my latest bg-tree feature, to see if it solves the problem.
>
> My not-so-optimistic guess is that the feature would reduce the mount
> time to around 1 min; my average guess is around 30s.
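>
> A minimal test sketch (assuming the mkfs.btrfs built from my bg-tree
> branch; /dev/sdX is a placeholder spare device, and the feature is
> incompat, so older kernels will refuse to mount the result):
>
> # ~/btrfs-progs/mkfs.btrfs -f -O bg-tree /dev/sdX
> # time mount /dev/sdX /mnt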
>
> Thanks,
> Qu
>
>> Then I performed a system update with zypper, rebooted, and ran into
>> the problems described in
>> https://bugzilla.opensuse.org/show_bug.cgi?id=1143865. In short: boot
>> failed because mounting /ESO-RAID ran into a 5-minute timeout. A manual
>> mount worked (but took up to 6 minutes), and the filesystem was
>> completely unresponsive. See the bug report for more details about what
>> became unresponsive.
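>>
>> As a side note, the per-mount timeout can be raised from fstab (a
>> hedged sketch, untested on that system; see systemd.mount(5)):
>>
>> /dev/sda1  /ESO-RAID  btrfs  defaults,x-systemd.mount-timeout=15min  0  0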
>>
>> A movie of the failing boot process is still on my webserver:
>> ftp://feldspaten.org/dump/20190803_btrfs_balance_issue/btrfs_openctree_failed.mp4
>>
>>
>> I hope this helps to reproduce the issue. Feel free to contact me
>> if you need further details.
>>
>> Greetings,
>> Felix :-)
>>
>>
>> On 10/8/19 11:47 AM, Johannes Thumshirn wrote:
>>> On 08/10/2019 11:26, Qu Wenruo wrote:
>>>> On 2019/10/8 5:14 PM, Johannes Thumshirn wrote:
>>>>>> [[Benchmark]]
>>>>>> Since I have upgraded my rig to all-NVMe storage, there is no HDD
>>>>>> test result.
>>>>>>
>>>>>> Physical device:	NVMe SSD
>>>>>> VM device:		VirtIO block device, backed by a sparse file
>>>>>> Nodesize:		4K  (to bump up tree height)
>>>>>> Extent data size:	4M
>>>>>> Fs size used:		1T
>>>>>>
>>>>>> All file extents on disk are 4M in size, preallocated to reduce
>>>>>> space usage (as the VM uses a loopback block device backed by a
>>>>>> sparse file).
>>>>> Do you have some additional details about the test setup? I tried
>>>>> to do the same (testing) for a bug Felix (added to Cc) reported to
>>>>> me at the ALPSS Conference, and I couldn't reproduce the issue.
>>>>>
>>>>> My test was a 100TB sparse file passed into a VM, running this
>>>>> script to touch all block groups:
>>>> Here is my test scripts:
>>>> ---
>>>> #!/bin/bash
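>>>> #
>>>> # Fill $nr_subv subvolumes in parallel, each with $nr_extents
>>>> # fallocated extents of $extent_size bytes, so that EXTENT_ITEMs end
>>>> # up interleaved with BLOCK_GROUP_ITEMs all over the extent tree.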
>>>>
>>>> dev="/dev/vdb"
>>>> mnt="/mnt/btrfs"
>>>>
>>>> nr_subv=16
>>>> nr_extents=16384
>>>> extent_size=$((4 * 1024 * 1024)) # 4M
>>>>
>>>> _fail()
>>>> {
>>>>         echo "!!! FAILED: $@ !!!"
>>>>         exit 1
>>>> }
>>>>
>>>> fill_one_subv()
>>>> {
>>>>         path=$1
>>>>         if [ -z "$path" ]; then
>>>>                 _fail "wrong parameter for fill_one_subv"
>>>>         fi
>>>>         btrfs subv create $path || _fail "create subv"
>>>>
>>>>         for i in $(seq 0 $((nr_extents - 1))); do
>>>>                 fallocate -o $((i * $extent_size)) -l $extent_size \
>>>>                         $path/file || _fail "fallocate"
>>>>         done
>>>> }
>>>>
>>>> declare -a pids
>>>> umount $mnt &> /dev/null
>>>> umount $dev &> /dev/null
>>>>
>>>> #~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
>>>> mkfs.btrfs -f -n 4k $dev
>>>> mount $dev $mnt -o nospace_cache
>>>>
>>>> for i in $(seq 1 $nr_subv); do
>>>>         fill_one_subv $mnt/subv_${i} &
>>>>         pids[$i]=$!
>>>> done
>>>>
>>>> for i in $(seq 1 $nr_subv); do
>>>>         wait ${pids[$i]}
>>>> done
>>>> sync
>>>> umount $dev
>>>>
>>>> ---
>>>>
>>>>> #!/bin/bash
>>>>>
>>>>> FILE=/mnt/test
>>>>>
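>>>>> # Allocate $len bytes at offset $off and delete the file again,
>>>>> # leaving behind a (nearly) empty block group.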
>>>>> add_dirty_bg() {
>>>>>         off="$1"
>>>>>         len="$2"
>>>>>         touch $FILE
>>>>>         xfs_io -c "falloc $off $len" $FILE
>>>>>         rm $FILE
>>>>> }
>>>>>
>>>>> mkfs.btrfs /dev/vda
>>>>> mount /dev/vda /mnt
>>>>>
>>>>> for ((i = 1; i < 100000; i++)); do
>>>>>         add_dirty_bg $i"G" "1G"
>>>>> done
>>>> This won't really build a good enough extent tree layout.
>>>>
>>>> 1G fallocate will only cause 8 128M file extents, thus 8 EXTENT_ITEMs.
>>>>
>>>> Thus a leaf (16K by default) can still hold a lot of BLOCK_GROUP
>>>> items together.
>>>>
>>>> To build a case that really shows the problem, you'll need a lot of
>>>> EXTENT_ITEMs/METADATA_ITEMs to fill the gaps between the BLOCK_GROUP
>>>> items.
>>>>
>>>> My test script does that, but it may still not represent the real
>>>> world, as the real world can produce even smaller extents due to
>>>> snapshots.
>>>>
>>> Ah, thanks for the explanation. I'll give your test script a try.
>>>
>>>


