From: devel@roosoft.ltd.uk
To: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: Problems balancing BTRFS
Date: Sat, 23 Nov 2019 12:53:09 +0000
Message-ID: <ed74e33c-5136-a45d-2100-d7741b56ed54@casa-di-locascio.net>
In-Reply-To: <1582e606-ecd4-a908-c139-05aca4551c2e@gmx.com>

On 23/11/2019 00:09, Qu Wenruo wrote:
>
>> On 2019/11/22 11:32 PM, devel@roosoft.ltd.uk wrote:
>> On 22/11/2019 14:07, devel@roosoft.ltd.uk wrote:
>>> On 22/11/2019 13:56, Qu Wenruo wrote:
>>>>> On 2019/11/22 9:20 PM, devel@roosoft.ltd.uk wrote:
>>>>> On 22/11/2019 13:10, Qu Wenruo wrote:
>>>>>>> On 2019/11/22 8:37 PM, devel@roosoft.ltd.uk wrote:
>>>>>>> So been discussing this on IRC but looks like more sage advice is needed.
>>>>>> You're not the only one hitting the bug. (Not sure if that makes you
>>>>>> feel a little better)
>>>>> Hehe.. well, it always helps to know you are not slowly going crazy on your own.
>>>>>
>>>>>> The csum error is from the data reloc tree, which is a tree that records
>>>>>> the new (relocated) data.
>>>>>> So the good news is, your old data is not corrupted, and since we hit
>>>>>> EIO before switching tree blocks, the corrupted data is simply deleted.
>>>>>>
>>>>>> I have also seen the bug using just a single device, with DUP metadata and
>>>>>> SINGLE data, so I believe there is something wrong with the data reloc tree.
>>>>>> The problem here is, I can't find a way to reproduce it, so it will take
>>>>>> us longer to debug.
>>>>>>
>>>>>>
>>>>>> Despite that, have you seen any other problem? Especially ENOSPC (which
>>>>>> needs the enospc_debug mount option to be reported).
>>>>>> The only time I hit it, I was debugging an ENOSPC bug in relocation.
>>>>>>
>>>>> As far as I can tell, the rest of the filesystem works normally; as I
>>>>> showed, scrubs come back clean, etc. I have not actively added much new data,
>>>>> since the whole point is to balance the fs so that a scrub does not take 18 hours.
>>>> Sorry, my point here is: would you like to try the balance again with the
>>>> "enospc_debug" mount option?
>>>>
>>>> As for balance, we can hit ENOSPC without it being reported when there is
>>>> a more serious problem, like the EIO you hit.
>>> Oh, I see. Sure, I can start the balance again.
>>>
>>>
>>>>> So really I am not sure what to do. It only seems to appear during a
>>>>> balance, which as far as I know is a much-needed regular maintenance
>>>>> tool for keeping a fs healthy, which is why it is part of the
>>>>> btrfsmaintenance tools.
>>>> You don't need to be that nervous just because you are not able to balance.
>>>>
>>>> Nowadays, balance is not that necessary any more.
>>>> In the old days, balance was the only way to delete empty block groups,
>>>> but now empty block groups are removed automatically, so balance is
>>>> only there to address unbalanced disk usage or to convert profiles.
>>>>
>>>> In your case, although it's not comfortable to have imbalanced disk
>>>> usage, it won't hurt too much.
>>> Well, going from 1TB to 6TB devices means there is a lot of weighting
>>> going the wrong way. Initially there was only ~200GB on each of the new
>>> disks, which was just unacceptable; it was getting better until I
>>> hit this balance issue. But I am wary of putting on too much new data
>>> in case this is symptomatic of something else.
>>>
>>>
>>>
>>>> So for now, you can just disable balance and call it a day.
>>>> As long as you're still writing into that fs, the fs should become more
>>>> and more balanced.
>>>>
>>>>> Are there some other tests I can run to try and isolate what the problem
>>>>> is?
>>>> Forgot to mention: is it always reproducible? And always on the same
>>>> block group?
>>>>
>>>> Thanks,
>>>> Qu
>>> So far, yes. The balance always fails at the same ino and offset,
>>> making it impossible to continue.
>>>
>>>
>>> Let me run it with debug on and get back to you.
>>>
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>
>>
>>
>> OK, so I mounted the fs with enospc_debug:
>>
>>
>>> /dev/sdb on /mnt/media type btrfs
>>> (rw,relatime,space_cache,enospc_debug,subvolid=1001,subvol=/media)
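>>
>> (For reference, the remount was roughly as follows; this is just a sketch
>> from my setup, and the plain balance command stands in for whatever filters
>> you normally pass:)
>>
>>   # mount -o remount,enospc_debug /mnt/media
>>   # btrfs balance start /mnt/media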
>>
>>
>> I re-ran the balance and it got a little further, but then errored out again.
>>
>>
>> However, I don't see any more info in dmesg:
> OK, so this is not the ENOSPC bug I'm chasing.
>
>> [Fri Nov 22 15:13:40 2019] BTRFS info (device sdb): relocating block
>> group 8963019112448 flags data|raid5
>> [Fri Nov 22 15:14:22 2019] BTRFS info (device sdb): found 61 extents
>> [Fri Nov 22 15:14:41 2019] BTRFS info (device sdb): found 61 extents
>> [Fri Nov 22 15:14:59 2019] BTRFS info (device sdb): relocating block
>> group 8801957838848 flags data|raid5
>> [Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root
>> -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 1
>> [Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root
>> -9 ino 305 off 131764224 csum 0xd009e874 expected csum 0x00000000 mirror 1
>> [Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root
>> -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 2
>> [Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root
>> -9 ino 305 off 131764224 csum 0xd009e874 expected csum 0x00000000 mirror 2
>> [Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root
>> -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 1
>> [Fri Nov 22 15:15:05 2019] BTRFS warning (device sdb): csum failed root
>> -9 ino 305 off 131760128 csum 0x07436c62 expected csum 0x0001cbde mirror 2
>> [Fri Nov 22 15:15:13 2019] BTRFS info (device sdb): balance: ended with
>> status: -5
>>
>>
>> What should I do now to get more information on the issue?
> Not exactly.
>
> But I have an idea to see if it's really a certain block group causing
> the problem.
>
> 1. Get the block group/chunk list.
>    You can go the traditional way, using "btrfs ins dump-tree", or use a
>    more advanced tool to get the block group/chunk list.
>
>    If you go the manual way, it's something like:
>    # btrfs ins dump-tree -t chunk <device>
>    item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 290455552) itemoff 15785 itemsize 80
>                 length 1073741824 owner 2 stripe_len 65536 type DATA
>                 io_align 65536 io_width 65536 sector_size 4096
>                 num_stripes 1 sub_stripes 1
>                         stripe 0 devid 1 offset 290455552
>                         dev_uuid b929fabe-c291-4fd8-a01e-c67259d202ed
>
>
>    In the above case, 290455552 is the chunk's logical bytenr and
>    1073741824 is its length. Record them all.
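>
>    (A quick way to pull out just the CHUNK_ITEM and length lines; treat this
>    as a sketch and replace /dev/sdb with any device in the fs:)
>    # btrfs ins dump-tree -t chunk /dev/sdb | grep -E 'CHUNK_ITEM|length'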
>
> 2. Use the vrange filter.
>    Btrfs balance can balance only certain block groups; e.g. you can use
>    vrange=290455552..1364197375 to relocate the block group above.
>
>    So you can try to relocate block groups one by one manually.
>    I recommend relocating block group 8801957838848 first, as it looks
>    like the offending one.
>
>    If you can relocate that block group manually, then something must be
>    wrong with the multi-block-group relocation sequence.
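>
>    With the example chunk above, that would be something like the following
>    (a sketch; for block group 8801957838848 substitute its own bytenr and
>    bytenr + length - 1 taken from the dump-tree output):
>    # btrfs balance start -dvrange=290455552..1364197375 <mountpoint>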
>
> Thanks,
> Qu
>>
>> Thanks.
>>
>>
>>

OK, just a follow-up. As you can see, the original metadata was RAID1 and
sitting on 2 drives. That is how it currently works, though from what I can
see changes are in the works. I was not happy with that, however, so I
decided to balance it with -mconvert=raid10 instead and use the other 2
drives as well. That worked with no issues at all. So I decided to try
another normal data balance. I stepped the usage filter from 5 to 95 in
increments of 5, and not until it hit 95 did it actually do anything; then it
moved just 2 chunks, which took about 2 minutes, and that was it. No more
balancing needed.
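
For the record, the commands were roughly the following (a sketch from memory;
the mount point is from my setup):

  # btrfs balance start -mconvert=raid10 /mnt/media
  # btrfs balance start -dusage=5 /mnt/media
  # btrfs balance start -dusage=10 /mnt/media
  ...
  # btrfs balance start -dusage=95 /mnt/media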


I'm not sure exactly what the issue was, but I suspect that replacing the
drive did not also replace its place in the metadata pool, which left some
devices with no metadata on them at all, and all sorts of weirdness ensued.
Given that scrub passes, check passes, and all devices are now being used for
system, metadata and data, I have started writing more data to the fs, and as
expected it is starting to balance across the 4 devices on its own.
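
(For anyone following along, the sanity checks were roughly these; again a
sketch, using my mount point and one of my devices, with btrfs check run while
the fs was unmounted:)

  # btrfs scrub start -Bd /mnt/media
  # btrfs check --readonly /dev/sdb
  # btrfs filesystem usage /mnt/media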


So keep this oddity around for reference, but as far as I can see, converting
the metadata from RAID1 to RAID10 solved my issues.


Thanks for all the pointers, guys. I appreciate not feeling alone on this.


Cheers



Don Alex



