On 2019/10/9 3:07 PM, Felix Niederwanger wrote:
> Hey Johannes,
>
> glad to hear back from you :-)
>
> As discussed, I'll try to elaborate on the setup where we experienced
> the issue that a btrfs mount takes more than 5 minutes. The initial
> bug report is at https://bugzilla.opensuse.org/show_bug.cgi?id=1143865
>
> Physical device:   ARECA-1883 PCIe 3.0 to SAS/SATA 12Gb hardware RAID
>                    controller, hardware RAID6 + hot spare,
>                    8 TB Seagate IronWolf NAS HDDs
> Installed system:  openSUSE Leap 15.1
> Disks:             / is on a separate DOM,
>                    /dev/sda1 is the affected btrfs volume
>
> Disk layout:
>
> sda      8:0    0 98.2T  0 disk
> └─sda1   8:1    0 81.9T  0 part /ESO-RAID

How much space is used? With enough space used (especially your tens of
TB), it's pretty easy to end up with so many block group items that
they overload the extent tree.

There is a better tool to explore the problem more easily:
https://github.com/adam900710/btrfs-progs/tree/account_bgs
You can compile the btrfs-corrupt-block tool from that branch and run:

# ./btrfs-corrupt-block -X /dev/sda1

It's recommended to run it with the fs unmounted. It should output
something like:

  extent_tree: total=1080 leaves=1025

Then please post that line to surprise us.

It shows how many unique tree blocks need to be read from disk just to
iterate the block group items. You can think of it as the number of
random reads that have to be done, each of nodesize (normally 16K).
(A rough back-of-the-envelope sketch follows further down in this mail.)

> sdb      8:16   0   59G  0 disk
> ├─sdb1   8:17   0    1G  0 part /boot
> ├─sdb2   8:18   0  5.9G  0 part [SWAP]
> └─sdb3   8:19   0 52.1G  0 part /
>
> System configuration: openSUSE Leap 15.1 with the "Server"
> configuration and the NFS server installed.
>
> I copied data from the old NAS (a separate server with an xfs volume)
> to the new btrfs volume using rsync.

If you are willing to test (and have enough spare space for it), you
could try my latest bg-tree feature to see whether it solves the
problem.

My not-so-optimistic guess is that the feature would reduce the mount
time to around 1 min; my average guess is around 30s.

Thanks,
Qu

> Then I performed a system update with zypper, rebooted and ran into
> the problems described in
> https://bugzilla.opensuse.org/show_bug.cgi?id=1143865. In short: boot
> failed because mounting /ESO-RAID ran into a 5 minute timeout. A
> manual mount worked fine (but took up to 6 minutes) and the filesystem
> was completely unresponsive. See the bug report for more details about
> what became unresponsive.
>
> A movie of the failing boot process is still on my webserver:
> ftp://feldspaten.org/dump/20190803_btrfs_balance_issue/btrfs_openctree_failed.mp4
>
> I hope this helps to reproduce the issue. Feel free to contact me if
> you need further details.
>
> Greetings,
> Felix :-)
>
>
> On 10/8/19 11:47 AM, Johannes Thumshirn wrote:
>> On 08/10/2019 11:26, Qu Wenruo wrote:
>>> On 2019/10/8 5:14 PM, Johannes Thumshirn wrote:
>>>>> [[Benchmark]]
>>>>> Since I have upgraded my rig to all-NVMe storage, there is no HDD
>>>>> test result.
>>>>>
>>>>> Physical device: NVMe SSD
>>>>> VM device: VirtIO block device, backed by a sparse file
>>>>> Nodesize: 4K (to bump up tree height)
>>>>> Extent data size: 4M
>>>>> Fs size used: 1T
>>>>>
>>>>> All file extents on disk are 4M in size, preallocated to reduce
>>>>> space usage (as the VM uses a loopback block device backed by a
>>>>> sparse file).
>>>> Do you have some additional details about the test setup? I tried
>>>> to do the same (testing) for a bug Felix (added to Cc) reported to
>>>> me at the ALPSS Conference and I couldn't reproduce the issue.
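
To put the extent_tree leaf count above into perspective, here is a
rough back-of-the-envelope sketch (not part of any tool; the leaf count
and the per-read latency below are made-up placeholders, so substitute
the real "leaves=" number from btrfs-corrupt-block and a latency that
matches your array):

---
#!/bin/bash
# Back-of-the-envelope only: each extent tree leaf is roughly one random
# nodesize (16K) read during mount, so leaf count * per-read latency
# gives a crude lower bound on the mount time.  Both values below are
# placeholders, not measured numbers.
leaves=30000       # substitute the "leaves=" value reported for extent_tree
latency_ms=10      # assumed latency of one random 16K read on the array
echo "~$(( leaves * latency_ms / 1000 ))s just to read the extent tree leaves"
---
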
>>>>
>>>> My testing was a 100TB sparse file passed into a VM and running
>>>> this script to touch all block groups:
>>> Here is my test script:
>>> ---
>>> #!/bin/bash
>>>
>>> dev="/dev/vdb"
>>> mnt="/mnt/btrfs"
>>>
>>> nr_subv=16
>>> nr_extents=16384
>>> extent_size=$((4 * 1024 * 1024))    # 4M
>>>
>>> _fail()
>>> {
>>>     echo "!!! FAILED: $@ !!!"
>>>     exit 1
>>> }
>>>
>>> fill_one_subv()
>>> {
>>>     path=$1
>>>     if [ -z "$path" ]; then
>>>         _fail "wrong parameter for fill_one_subv"
>>>     fi
>>>     btrfs subv create $path || _fail "create subv"
>>>
>>>     for i in $(seq 0 $((nr_extents - 1))); do
>>>         fallocate -o $((i * extent_size)) -l $extent_size \
>>>             $path/file || _fail "fallocate"
>>>     done
>>> }
>>>
>>> declare -a pids
>>> umount $mnt &> /dev/null
>>> umount $dev &> /dev/null
>>>
>>> #~/btrfs-progs/mkfs.btrfs -f -n 4k $dev -O bg-tree
>>> mkfs.btrfs -f -n 4k $dev
>>> mount $dev $mnt -o nospace_cache
>>>
>>> for i in $(seq 1 $nr_subv); do
>>>     fill_one_subv $mnt/subv_${i} &
>>>     pids[$i]=$!
>>> done
>>>
>>> for i in $(seq 1 $nr_subv); do
>>>     wait ${pids[$i]}
>>> done
>>> sync
>>> umount $dev
>>>
>>> ---
>>>
>>>> #!/bin/bash
>>>>
>>>> FILE=/mnt/test
>>>>
>>>> add_dirty_bg() {
>>>>     off="$1"
>>>>     len="$2"
>>>>     touch $FILE
>>>>     xfs_io -c "falloc $off $len" $FILE
>>>>     rm $FILE
>>>> }
>>>>
>>>> mkfs.btrfs /dev/vda
>>>> mount /dev/vda /mnt
>>>>
>>>> for ((i = 1; i < 100000; i++)); do
>>>>     add_dirty_bg $i"G" "1G"
>>>> done
>>> This won't really build a good enough extent tree layout.
>>>
>>> A 1G fallocate only creates 8 128M file extents, thus 8 EXTENT_ITEMs.
>>>
>>> So a leaf (16K by default) can still hold a lot of BLOCK_GROUP items
>>> packed together.
>>>
>>> To build a case that really shows the problem, you need a lot of
>>> EXTENT_ITEMs/METADATA_ITEMs filling the gaps between the BLOCK_GROUP
>>> items.
>>>
>>> My test script does that, but it may still not represent the real
>>> world, as the real world can produce even smaller extents due to
>>> snapshots.
>>>
>> Ah, thanks for the explanation. I'll give your test script a try.
>>
>>
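
In case it helps with timing the reproducer, below is a minimal sketch
of how a cold mount of the filesystem built by the script above could
be measured. It assumes the same placeholder device and mount point as
that script; for a bg-tree comparison, the fs would have to be
re-created with the commented-out mkfs line (which needs btrfs-progs
and a kernel carrying the bg-tree patches) and re-filled first:

---
#!/bin/bash
# Sketch only: measure a cold mount of the test filesystem created by
# the fill script quoted above.  /dev/vdb and /mnt/btrfs are the same
# placeholders used there.
dev="/dev/vdb"
mnt="/mnt/btrfs"

umount $mnt &> /dev/null
echo 3 > /proc/sys/vm/drop_caches    # drop caches so the mount really hits the disk
time mount $dev $mnt -o nospace_cache
umount $mnt
---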