From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Balance loops: what we know so far
Date: Wed, 13 May 2020 14:36:21 +0800 [thread overview]
Message-ID: <437bf0bc-b308-5b41-67cc-5b84ba6a88d2@gmx.com> (raw)
In-Reply-To: <20200513050204.GX10769@hungrycats.org>
On 2020/5/13 1:02 PM, Zygo Blaxell wrote:
> On Wed, May 13, 2020 at 10:28:37AM +0800, Qu Wenruo wrote:
>>
>>
[...]
>> I'm a little surprised that it's using the LOGICAL_INO ioctl, not just
>> TREE_SEARCH.
>
> Tree search can't read shared backrefs because they refer directly to
> disk blocks, not to object/type/offset tuples. It would be nice to have
> an ioctl that can read a metadata block (or even a data block) by bytenr.
Sorry, I meant that we can use tree search on the extent tree, and only
look up the backrefs for the specified bytenr to inspect it.

In this hanging case, what I really want is to check whether the tree
block 4374646833152 belongs to the data reloc tree.
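A minimal sketch of that approach from Python, via the raw
BTRFS_IOC_TREE_SEARCH ioctl: look up just the extent item for one bytenr
in the extent tree, whose inline backrefs say which tree owns the block.
The constants and struct layout are taken from my reading of the kernel
UAPI headers, so double-check them against linux/btrfs.h; note also that
TREE_SEARCH walks the live tree, not the commit root, which is exactly
the gap being discussed here.

```python
import fcntl, struct

# Assumed values from the kernel UAPI headers; verify before relying on them.
BTRFS_EXTENT_TREE_OBJECTID = 2
BTRFS_EXTENT_ITEM_KEY = 168      # data (and old-format metadata) extents
BTRFS_METADATA_ITEM_KEY = 169    # skinny metadata extents
# _IOWR(0x94, 17, struct btrfs_ioctl_search_args); sizeof(args) == 4096
BTRFS_IOC_TREE_SEARCH = (3 << 30) | (4096 << 16) | (0x94 << 8) | 17
KEY_FMT = "=7Q4I4Q"              # struct btrfs_ioctl_search_key, 104 bytes

def pack_search_key(bytenr):
    """Build a search key covering exactly one bytenr in the extent tree."""
    return struct.pack(
        KEY_FMT,
        BTRFS_EXTENT_TREE_OBJECTID,   # tree_id
        bytenr, bytenr,               # min/max objectid == the bytenr
        0, 2**64 - 1,                 # min/max offset
        0, 2**64 - 1,                 # min/max transid
        BTRFS_EXTENT_ITEM_KEY,        # min_type
        BTRFS_METADATA_ITEM_KEY,      # max_type
        1,                            # nr_items: one extent is enough
        0,                            # unused (alignment)
        0, 0, 0, 0)                   # reserved

def search_one_extent(fd, bytenr):
    """Issue the ioctl on any fd in the filesystem; returns the number of
    items found and the raw result buffer (item headers + extent item
    with inline backrefs)."""
    buf = bytearray(pack_search_key(bytenr).ljust(4096, b"\0"))
    fcntl.ioctl(fd, BTRFS_IOC_TREE_SEARCH, buf)
    nr_found = struct.unpack_from("=I", buf, 64)[0]  # kernel rewrites nr_items
    return nr_found, bytes(buf[struct.calcsize(KEY_FMT):])
```

Parsing the inline backrefs out of the returned extent item is left out;
the point is only that one TREE_SEARCH call per bytenr is enough to see
the owner, without LOGICAL_INO.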
>
> Or even better, just a fd that can be obtained by some ioctl to access
> the btrfs virtual address space with pread().
If we want the content of 4374646833152, we can simply run
btrfs ins dump-tree -b 4374646833152.

Since balance works on the commit root, that tree block should still be
on disk at the time of the loop, and dump-tree can read it.

Another way to work around this is to provide a full extent tree dump.
(I know this is a bad idea, but it would still be faster than going back
and forth over mail.)
>
>> I guess if we could get a plain tree-search based tool (one that only
>> searches the commit root, which is exactly what balance is based on),
>> it would be easier to do the digging.
>
> That would be nice. I have an application for it. ;)
>
>>> OSError: [Errno 22] Invalid argument
>>>
>>> root@tester:~# btrfs ins log 4368594108416 /media/testfs/
>>> /media/testfs//snap-1589258042/testhost/var/log/messages.6.lzma
>>> /media/testfs//current/testhost/var/log/messages.6.lzma
>>> /media/testfs//snap-1589249822/testhost/var/log/messages.6.lzma
>>> ERROR: ino paths ioctl: No such file or directory
>>> /media/testfs//snap-1589249547/testhost/var/log/messages.6.lzma
>>> ERROR: ino paths ioctl: No such file or directory
>>> /media/testfs//snap-1589248407/testhost/var/log/messages.6.lzma
>>> /media/testfs//snap-1589256422/testhost/var/log/messages.6.lzma
>>> ERROR: ino paths ioctl: No such file or directory
>>> /media/testfs//snap-1589251322/testhost/var/log/messages.6.lzma
>>> /media/testfs//snap-1589251682/testhost/var/log/messages.6.lzma
>>> /media/testfs//snap-1589253842/testhost/var/log/messages.6.lzma
>>> /media/testfs//snap-1589246727/testhost/var/log/messages.6.lzma
>>> /media/testfs//snap-1589258582/testhost/var/log/messages.6.lzma
>>> /media/testfs//snap-1589244027/testhost/var/log/messages.6.lzma
>>> /media/testfs//snap-1589245227/testhost/var/log/messages.6.lzma
>>> ERROR: ino paths ioctl: No such file or directory
>>> ERROR: ino paths ioctl: No such file or directory
>>> /media/testfs//snap-1589246127/testhost/var/log/messages.6.lzma
>>> /media/testfs//snap-1589247327/testhost/var/log/messages.6.lzma
>>> ERROR: ino paths ioctl: No such file or directory
>>>
>>> Hmmm, I wonder if there's a problem with deleted snapshots?
>>
>> Yes, that's also what I'm guessing.
>>
>> The cleanup of the data reloc tree doesn't look correct to me.
>>
>> Thanks for the new clues,
>> Qu
>
> Here's a fun one:
>
> 1. Delete all the files on a filesystem where balance loops
> have occurred.
I tried with a newly created fs, and failed to reproduce it.
>
> 2. Verify there are no data blocks (one data block group
> with used = 0):
>
> # show_block_groups.py /testfs/
> block group vaddr 435969589248 length 1073741824 flags METADATA|RAID1 used 180224 used_pct 0
> block group vaddr 4382686969856 length 33554432 flags SYSTEM|RAID1 used 16384 used_pct 0
> block group vaddr 4383794266112 length 1073741824 flags DATA used 0 used_pct 0
>
> 3. Create a new file with a single reference in the only (root) subvol:
> # head -c 1024m > file
> # sync
> # show_block_groups.py .
> block group vaddr 435969589248 length 1073741824 flags METADATA|RAID1 used 1245184 used_pct 0
> block group vaddr 4382686969856 length 33554432 flags SYSTEM|RAID1 used 16384 used_pct 0
> block group vaddr 4384868007936 length 1073741824 flags DATA used 961708032 used_pct 90
> block group vaddr 4385941749760 length 1073741824 flags DATA used 112033792 used_pct 10
>
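(For reference, the used_pct column in these dumps is just the
used/length ratio of the block group; a one-liner reproducing it,
assuming show_block_groups.py rounds to the nearest percent, which
matches the numbers above:)

```python
def used_pct(used, length):
    # Percentage of a block group in use, rounded to the nearest percent.
    # Assumes the script rounds rather than truncates; e.g. the 90%-full
    # DATA block group above is 961708032 / 1073741824 = 89.57%.
    return round(100 * used / length)
```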
> 4. Run balance, and it immediately loops on a single extent:
> # btrfs balance start -d .
> [Wed May 13 00:41:58 2020] BTRFS info (device dm-0): balance: start -d
> [Wed May 13 00:41:58 2020] BTRFS info (device dm-0): relocating block group 4385941749760 flags data
> [Wed May 13 00:42:00 2020] BTRFS info (device dm-0): found 1 extents, loops 1, stage: move data extents
> [Wed May 13 00:42:00 2020] BTRFS info (device dm-0): found 1 extents, loops 2, stage: update data pointers
> [Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 3, stage: update data pointers
> [Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 4, stage: update data pointers
> [Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 5, stage: update data pointers
> [Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 6, stage: update data pointers
> [Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 7, stage: update data pointers
> [Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 8, stage: update data pointers
> [Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 9, stage: update data pointers
> [Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 10, stage: update data pointers
> [Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 11, stage: update data pointers
> [etc...]
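(The looping signature above — the same "found N extents" count
repeating in the same stage — is easy to watch for mechanically. A
hypothetical dmesg-watching helper, not part of any existing tool, that
flags a balance as stuck once a (count, stage) pair repeats more than a
threshold number of times:)

```python
import re

# Matches the balance progress lines btrfs logs to dmesg.
LINE_RE = re.compile(r"found (\d+) extents, loops (\d+), stage: (.+)$")

def detect_balance_loop(dmesg_lines, threshold=3):
    """Return True if any (extent count, stage) pair repeats more than
    `threshold` times, the signature of the loops in this thread."""
    seen = {}
    for line in dmesg_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        extents, stage = m.group(1), m.group(3)
        seen[(extents, stage)] = seen.get((extents, stage), 0) + 1
        if seen[(extents, stage)] > threshold:
            return True
    return False
```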
>
> I tried it 3 more times and there was no loop. The 5th try looped again.
I tried 10 times with no reproduction.

I guess there are some other factors involved; maybe a newly created fs
won't trigger it?
BTW, for the reproducible (i.e. looping) case, would you mind dumping
the data reloc root?

My current guess is that some orphan cleanup doesn't get kicked off, but
that shouldn't affect metadata with my patch :(
Thanks,
Qu
>
> There might be a correlation with cancels. After a fresh boot, I can
> often balance a few dozen block groups before there's a loop, but if I
> cancel a balance, the next balance almost always loops.
>