* Balance loops: what we know so far
@ 2020-04-11 21:14 Zygo Blaxell
  2020-04-27  7:07 ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2020-04-11 21:14 UTC (permalink / raw)
  To: linux-btrfs


Since 5.1, btrfs has been prone to getting stuck in semi-infinite loops
in balance and device shrink/remove:

	[Sat Apr 11 16:59:32 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
	[Sat Apr 11 16:59:33 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
	[Sat Apr 11 16:59:35 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers

This is a block group while it's looping, as seen by python-btrfs:

	# share/python-btrfs/examples/show_block_group_contents.py 1934913175552 /media/testfs/
	block group vaddr 1934913175552 length 1073741824 flags DATA used 939167744 used_pct 87
	extent vaddr 1934913175552 length 134217728 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585767972864 count 1
	extent vaddr 1935047393280 length 5591040 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769349120 count 1
	extent vaddr 1935052984320 length 134217728 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769349120 count 1
	extent vaddr 1935187202048 length 122064896 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769349120 count 1
	extent vaddr 1935309266944 length 20414464 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769365504 count 1
	extent vaddr 1935329681408 length 60555264 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769365504 count 1
	extent vaddr 1935390236672 length 9605120 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769381888 count 1
	extent vaddr 1935399841792 length 4538368 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769381888 count 1
	extent vaddr 1935404380160 length 24829952 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769381888 count 1
	extent vaddr 1935429210112 length 7999488 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769398272 count 1
	extent vaddr 1935437209600 length 6426624 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769627648 count 1
	extent vaddr 1935443636224 length 28676096 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769644032 count 1
	extent vaddr 1935472312320 length 8101888 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769644032 count 1
	extent vaddr 1935480414208 length 20455424 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769644032 count 1
	extent vaddr 1935500869632 length 10215424 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769660416 count 1
	extent vaddr 1935511085056 length 10792960 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769676800 count 1
	extent vaddr 1935521878016 length 6066176 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769709568 count 1
	extent vaddr 1935527944192 length 80896000 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769725952 count 1
	extent vaddr 1935608840192 length 134217728 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769742336 count 1
	extent vaddr 1935743057920 length 106102784 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769742336 count 1
	extent vaddr 1935849160704 length 3125248 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769742336 count 1
	extent vaddr 1935852285952 length 57344 refs 1 gen 113299 flags DATA
	    inline shared data backref parent 1585769742336 count 1

All of the extent data backrefs are removed by the balance, but the
loop keeps trying to get rid of the shared data backrefs.  It has
no effect on them, but keeps trying anyway.

This is "semi-infinite" because it is possible for the balance to
terminate if something removes those 29 extents (e.g. by looking up the
extent vaddrs with 'btrfs ins log' then feeding the references to 'btrfs
fi defrag' will reduce the number of inline shared data backref objects.
When it's reduced all the way to zero, balance starts up again, usually
promptly getting stuck on the very next block group.  If the _only_
thing running on the filesystem is balance, it will not stop looping.

Bisection points to commit d2311e698578 "btrfs: relocation: Delay reloc
tree deletion after merge_reloc_roots" as the first commit where the
balance loops can be reproduced.
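
(For reference, a bisection like that runs roughly as below; using
v5.0/v5.1 as the good/bad endpoints is an assumption based on the
kernel versions mentioned in this thread.)

	git bisect start
	git bisect bad v5.1    # balance loops reproduce
	git bisect good v5.0   # balance completes
	# at each step: build and boot the kernel, run a balance that
	# normally loops, then report "git bisect good" or "git bisect bad"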

I tested with commit 59b2c371052c "btrfs: check commit root generation
in should_ignore_root" as well as the rest of misc-next, but the balance
loops are still easier to reproduce than to avoid.

Once it starts happening on a filesystem, it seems to happen very
frequently.  It is not possible to reshape a RAID array of more than a
few hundred GB on kernels after 5.0.  I can get maybe 50-100 block groups
completed in a resize or balance after a fresh boot, then balance gets
stuck in loops after that.  With the fast balance cancel patches it's
possible to recover from the loop, but futile, since the next balance
will almost always also loop, even if it is passed a different block
group.  I've had to downgrade to 5.0 or 4.19 to complete any RAID
reshaping work.
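
(For reference, "passed a different block group" means restricting the
balance with the vrange filter; a sketch, using the looping block group
shown above, whose range is its start vaddr plus its 1 GiB length:)

	btrfs balance cancel /media/testfs/
	btrfs balance start -dvrange=1934913175552..1935986917376 /media/testfs/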


* Re: Balance loops: what we know so far
  2020-04-11 21:14 Balance loops: what we know so far Zygo Blaxell
@ 2020-04-27  7:07 ` Qu Wenruo
  2020-04-28  4:55   ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2020-04-27  7:07 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs





On 2020/4/12 上午5:14, Zygo Blaxell wrote:
> Since 5.1, btrfs has been prone to getting stuck in semi-infinite loops
> in balance and device shrink/remove:
> 
> 	[Sat Apr 11 16:59:32 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> 	[Sat Apr 11 16:59:33 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> 	[Sat Apr 11 16:59:35 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> 
> This is a block group while it's looping, as seen by python-btrfs:
> 
> 	# share/python-btrfs/examples/show_block_group_contents.py 1934913175552 /media/testfs/
> 	block group vaddr 1934913175552 length 1073741824 flags DATA used 939167744 used_pct 87
> 	extent vaddr 1934913175552 length 134217728 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585767972864 count 1
> 	extent vaddr 1935047393280 length 5591040 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769349120 count 1
> 	extent vaddr 1935052984320 length 134217728 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769349120 count 1
> 	extent vaddr 1935187202048 length 122064896 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769349120 count 1
> 	extent vaddr 1935309266944 length 20414464 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769365504 count 1
> 	extent vaddr 1935329681408 length 60555264 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769365504 count 1
> 	extent vaddr 1935390236672 length 9605120 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769381888 count 1
> 	extent vaddr 1935399841792 length 4538368 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769381888 count 1
> 	extent vaddr 1935404380160 length 24829952 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769381888 count 1
> 	extent vaddr 1935429210112 length 7999488 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769398272 count 1
> 	extent vaddr 1935437209600 length 6426624 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769627648 count 1
> 	extent vaddr 1935443636224 length 28676096 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769644032 count 1
> 	extent vaddr 1935472312320 length 8101888 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769644032 count 1
> 	extent vaddr 1935480414208 length 20455424 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769644032 count 1
> 	extent vaddr 1935500869632 length 10215424 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769660416 count 1
> 	extent vaddr 1935511085056 length 10792960 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769676800 count 1
> 	extent vaddr 1935521878016 length 6066176 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769709568 count 1
> 	extent vaddr 1935527944192 length 80896000 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769725952 count 1
> 	extent vaddr 1935608840192 length 134217728 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769742336 count 1
> 	extent vaddr 1935743057920 length 106102784 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769742336 count 1
> 	extent vaddr 1935849160704 length 3125248 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769742336 count 1
> 	extent vaddr 1935852285952 length 57344 refs 1 gen 113299 flags DATA
> 	    inline shared data backref parent 1585769742336 count 1
> 
> All of the extent data backrefs are removed by the balance, but the
> loop keeps trying to get rid of the shared data backrefs.  It has
> no effect on them, but keeps trying anyway.

I guess this gives a pretty good clue.

I was always thinking about the reloc tree, but in your case, it's the
data reloc tree that owns them.

In that case, the data reloc tree is only cleaned up at the end of
btrfs_relocate_block_group().

Thus it is never cleaned up until we exit the balance loop.

I'm not sure why this is happening only after I extended the lifespan of
the reloc tree (not the data reloc tree).

But anyway, would you like to give the following patch a try?
https://patchwork.kernel.org/patch/11511241/

It should make us exit the balance as long as we have no extra extents
to relocate.
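
(If it helps, the patch can be pulled straight into a kernel tree,
assuming the usual patchwork mbox endpoint is available for this patch
ID:)

	curl -sL https://patchwork.kernel.org/patch/11511241/mbox/ | git am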

Thanks,
Qu

> 
> This is "semi-infinite" because it is possible for the balance to
> terminate if something removes those 29 extents (e.g. by looking up the
> extent vaddrs with 'btrfs ins log' then feeding the references to 'btrfs
> fi defrag' will reduce the number of inline shared data backref objects.
> When it's reduced all the way to zero, balance starts up again, usually
> promptly getting stuck on the very next block group.  If the _only_
> thing running on the filesystem is balance, it will not stop looping.
> 
> Bisection points to commit d2311e698578 "btrfs: relocation: Delay reloc
> tree deletion after merge_reloc_roots" as the first commit where the
> balance loops can be reproduced.
> 
> I tested with commit 59b2c371052c "btrfs: check commit root generation
> in should_ignore_root" as well as the rest of misc-next, but the balance
> loops are still easier to reproduce than to avoid.
> 
> Once it starts happening on a filesystem, it seems to happen very
> frequently.  It is not possible to reshape a RAID array of more than a
> few hundred GB on kernels after 5.0.  I can get maybe 50-100 block groups
> completed in a resize or balance after a fresh boot, then balance gets
> stuck in loops after that.  With the fast balance cancel patches it's
> possibly to recover from the loop, but futile, since the next balance
> will almost always also loop, even if it is passed a different block
> group.  I've had to downgrade to 5.0 or 4.19 to complete any RAID
> reshaping work.
> 



* Re: Balance loops: what we know so far
  2020-04-27  7:07 ` Qu Wenruo
@ 2020-04-28  4:55   ` Zygo Blaxell
  2020-04-28  9:54     ` Qu Wenruo
  2020-05-11  8:31     ` Qu Wenruo
  0 siblings, 2 replies; 32+ messages in thread
From: Zygo Blaxell @ 2020-04-28  4:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs


On Mon, Apr 27, 2020 at 03:07:29PM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/4/12 上午5:14, Zygo Blaxell wrote:
> > Since 5.1, btrfs has been prone to getting stuck in semi-infinite loops
> > in balance and device shrink/remove:
> > 
> > 	[Sat Apr 11 16:59:32 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> > 	[Sat Apr 11 16:59:33 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> > 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> > 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> > 	[Sat Apr 11 16:59:35 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> > 
> > This is a block group while it's looping, as seen by python-btrfs:
> > 
> > 	# share/python-btrfs/examples/show_block_group_contents.py 1934913175552 /media/testfs/
> > 	block group vaddr 1934913175552 length 1073741824 flags DATA used 939167744 used_pct 87
> > 	extent vaddr 1934913175552 length 134217728 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585767972864 count 1
> > 	extent vaddr 1935047393280 length 5591040 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769349120 count 1
> > 	extent vaddr 1935052984320 length 134217728 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769349120 count 1
> > 	extent vaddr 1935187202048 length 122064896 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769349120 count 1
> > 	extent vaddr 1935309266944 length 20414464 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769365504 count 1
> > 	extent vaddr 1935329681408 length 60555264 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769365504 count 1
> > 	extent vaddr 1935390236672 length 9605120 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769381888 count 1
> > 	extent vaddr 1935399841792 length 4538368 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769381888 count 1
> > 	extent vaddr 1935404380160 length 24829952 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769381888 count 1
> > 	extent vaddr 1935429210112 length 7999488 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769398272 count 1
> > 	extent vaddr 1935437209600 length 6426624 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769627648 count 1
> > 	extent vaddr 1935443636224 length 28676096 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769644032 count 1
> > 	extent vaddr 1935472312320 length 8101888 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769644032 count 1
> > 	extent vaddr 1935480414208 length 20455424 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769644032 count 1
> > 	extent vaddr 1935500869632 length 10215424 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769660416 count 1
> > 	extent vaddr 1935511085056 length 10792960 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769676800 count 1
> > 	extent vaddr 1935521878016 length 6066176 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769709568 count 1
> > 	extent vaddr 1935527944192 length 80896000 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769725952 count 1
> > 	extent vaddr 1935608840192 length 134217728 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769742336 count 1
> > 	extent vaddr 1935743057920 length 106102784 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769742336 count 1
> > 	extent vaddr 1935849160704 length 3125248 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769742336 count 1
> > 	extent vaddr 1935852285952 length 57344 refs 1 gen 113299 flags DATA
> > 	    inline shared data backref parent 1585769742336 count 1
> > 
> > All of the extent data backrefs are removed by the balance, but the
> > loop keeps trying to get rid of the shared data backrefs.  It has
> > no effect on them, but keeps trying anyway.
> 
> I guess this shows a pretty good clue.
> 
> I was always thinking about the reloc tree, but in your case, it's data
> reloc tree owning them.

In that case, yes.  Metadata balances loop too, in the "move data extents"
stage, while data balances loop in the "update data pointers" stage.

> In that case, data reloc tree is only cleaned up at the end of
> btrfs_relocate_block_group().
> 
> Thus it is never cleaned up until we exit the balance loop.
> 
> I'm not sure why this is happening only after I extended the lifespan of
> reloc tree (not data reloc tree).

I have been poking around with printk to trace what it's doing in the
looping and non-looping cases.  It seems to be very similar up to
calling merge_reloc_root, merge_reloc_roots, unset_reloc_control,
btrfs_block_rsv_release, btrfs_commit_transaction, clean_dirty_subvols,
btrfs_free_block_rsv.  In the looping cases, everything up to those
functions seems the same on every loop except the first one.

In the non-looping cases, those functions do something different than
the looping cases:  the extents disappear in the next loop, and the
balance finishes.

I haven't figured out _what_ is different yet.  I need more cycles to
look at it.
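
(As an aside, an alternative to ad-hoc printk for comparing the call
flow is the ftrace function_graph tracer; a minimal sketch, assuming
debugfs is mounted at /sys/kernel/debug and the symbol has not been
inlined away:)

	cd /sys/kernel/debug/tracing
	echo btrfs_relocate_block_group > set_graph_function
	echo function_graph > current_tracer
	echo 1 > tracing_on
	# capture one looping and one non-looping balance, then compare
	cp trace /tmp/balance-trace.txt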

Your extend-the-lifespan-of-reloc-tree patch moves one of the
functions--clean_dirty_subvols (or btrfs_drop_snapshot)--to a different
place in the call sequence.  It was called in merge_reloc_roots before
the transaction commit; now it's called in relocate_block_group after
the transaction commit.  My guess is that the problem lies somewhere in
how the behavior
of these functions has been changed by calling them in a different
sequence.

> But anyway, would you like to give a try of the following patch?
> https://patchwork.kernel.org/patch/11511241/

I'm not sure how this patch could work.  We are hitting the found_extents
counter every time through the loop.  It's returning thousands of extents
each time.

> It should make us exit the the balance so long as we have no extra
> extent to relocate.

The problem is not that we have no extents to relocate.  The problem is
that we don't successfully get rid of the extents we do find, so we keep
finding them over and over again.

In testing, the patch has no effect:

	[Mon Apr 27 23:36:15 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:36:21 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:36:27 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:36:32 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:36:38 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:36:44 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:36:50 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:36:56 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:37:01 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:37:07 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:37:13 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:37:19 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:37:24 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
	[Mon Apr 27 23:37:30 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers

The above is the tail end of 3320 loops on a single block group.

I switched to a metadata block group and it's on the 9th loop:

	# btrfs balance start -mconvert=raid1 /media/testfs/
	[Tue Apr 28 00:09:47 2020] BTRFS info (device dm-0): found 34977 extents, stage: move data extents
	[Tue Apr 28 00:12:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
	[Tue Apr 28 00:18:46 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
	[Tue Apr 28 00:23:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
	[Tue Apr 28 00:25:54 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
	[Tue Apr 28 00:28:17 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
	[Tue Apr 28 00:30:35 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
	[Tue Apr 28 00:32:45 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
	[Tue Apr 28 00:37:01 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents


> Thanks,
> Qu
> 
> > 
> > This is "semi-infinite" because it is possible for the balance to
> > terminate if something removes those 29 extents (e.g. by looking up the
> > extent vaddrs with 'btrfs ins log' then feeding the references to 'btrfs
> > fi defrag' will reduce the number of inline shared data backref objects.
> > When it's reduced all the way to zero, balance starts up again, usually
> > promptly getting stuck on the very next block group.  If the _only_
> > thing running on the filesystem is balance, it will not stop looping.
> > 
> > Bisection points to commit d2311e698578 "btrfs: relocation: Delay reloc
> > tree deletion after merge_reloc_roots" as the first commit where the
> > balance loops can be reproduced.
> > 
> > I tested with commit 59b2c371052c "btrfs: check commit root generation
> > in should_ignore_root" as well as the rest of misc-next, but the balance
> > loops are still easier to reproduce than to avoid.
> > 
> > Once it starts happening on a filesystem, it seems to happen very
> > frequently.  It is not possible to reshape a RAID array of more than a
> > few hundred GB on kernels after 5.0.  I can get maybe 50-100 block groups
> > completed in a resize or balance after a fresh boot, then balance gets
> > stuck in loops after that.  With the fast balance cancel patches it's
> > possibly to recover from the loop, but futile, since the next balance
> > will almost always also loop, even if it is passed a different block
> > group.  I've had to downgrade to 5.0 or 4.19 to complete any RAID
> > reshaping work.
> > 
> 





* Re: Balance loops: what we know so far
  2020-04-28  4:55   ` Zygo Blaxell
@ 2020-04-28  9:54     ` Qu Wenruo
  2020-04-28 14:51       ` Zygo Blaxell
  2020-05-04 18:54       ` Andrea Gelmini
  2020-05-11  8:31     ` Qu Wenruo
  1 sibling, 2 replies; 32+ messages in thread
From: Qu Wenruo @ 2020-04-28  9:54 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs





On 2020/4/28 下午12:55, Zygo Blaxell wrote:
> On Mon, Apr 27, 2020 at 03:07:29PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2020/4/12 上午5:14, Zygo Blaxell wrote:
>>> Since 5.1, btrfs has been prone to getting stuck in semi-infinite loops
>>> in balance and device shrink/remove:
>>>
>>> 	[Sat Apr 11 16:59:32 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
>>> 	[Sat Apr 11 16:59:33 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
>>> 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
>>> 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
>>> 	[Sat Apr 11 16:59:35 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
>>>
>>> This is a block group while it's looping, as seen by python-btrfs:
>>>
>>> 	# share/python-btrfs/examples/show_block_group_contents.py 1934913175552 /media/testfs/
>>> 	block group vaddr 1934913175552 length 1073741824 flags DATA used 939167744 used_pct 87
[...]
>>>
>>> All of the extent data backrefs are removed by the balance, but the
>>> loop keeps trying to get rid of the shared data backrefs.  It has
>>> no effect on them, but keeps trying anyway.
>>
>> I guess this shows a pretty good clue.
>>
>> I was always thinking about the reloc tree, but in your case, it's data
>> reloc tree owning them.
> 
> In that case, yes.  Metadata balances loop too, in the "move data extents"
> stage while data balances loop in the "update data pointers" stage.

Would you please take an image dump of the fs when the runaway balance
happens?

Dumps from both a metadata block group loop and a data block group loop
would greatly help.
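
(That is, a btrfs-image metadata dump, roughly like this; the device
path here is a placeholder and -s sanitizes file names:)

	btrfs-image -c9 -t4 -s /dev/mapper/vg-testfs /tmp/testfs-meta.img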

> 
>> In that case, data reloc tree is only cleaned up at the end of
>> btrfs_relocate_block_group().
>>
>> Thus it is never cleaned up until we exit the balance loop.
>>
>> I'm not sure why this is happening only after I extended the lifespan of
>> reloc tree (not data reloc tree).
> 
> I have been poking around with printk to trace what it's doing in the
> looping and non-looping cases.  It seems to be very similar up to
> calling merge_reloc_root, merge_reloc_roots, unset_reloc_control,
> btrfs_block_rsv_release, btrfs_commit_transaction, clean_dirty_subvols,
> btrfs_free_block_rsv.  In the looping cases, everything up to those
> functions seems the same on every loop except the first one.
> 
> In the non-looping cases, those functions do something different than
> the looping cases:  the extents disappear in the next loop, and the
> balance finishes.
> 
> I haven't figured out _what_ is different yet.  I need more cycles to
> look at it.
> 
> Your extend-the-lifespan-of-reloc-tree patch moves one of the
> functions--clean_dirty_subvols (or btrfs_drop_snapshot)--to a different
> place in the call sequence.  It was in merge_reloc_roots before the
> transaction commit, now it's in relocate_block_group after transaction
> commit.  My guess is that the problem lies somewhere in how the behavior
> of these functions has been changed by calling them in a different
> sequence.
> 
>> But anyway, would you like to give a try of the following patch?
>> https://patchwork.kernel.org/patch/11511241/
> 
> I'm not sure how this patch could work.  We are hitting the found_extents
> counter every time through the loop.  It's returning thousands of extents
> each time.
> 
>> It should make us exit the the balance so long as we have no extra
>> extent to relocate.
> 
> The problem is not that we have no extents to relocate.  The problem is
> that we don't successfully get rid of the extents we do find, so we keep
> finding them over and over again.

That's very strange.

As you can see, relocate_block_group() will clean up the reloc trees.

This means either we have reloc trees that are still in use and not
cleaned up, or some tracing mechanism is not working properly.

Anyway, an image dump, with the dead-looping block group specified,
would provide a good hint for this long-standing problem.

Thanks,
Qu

> 
> In testing, the patch has no effect:
> 
> 	[Mon Apr 27 23:36:15 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:36:21 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:36:27 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:36:32 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:36:38 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:36:44 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:36:50 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:36:56 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:37:01 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:37:07 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:37:13 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:37:19 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:37:24 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 	[Mon Apr 27 23:37:30 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> 
> The above is the tail end of 3320 loops on a single block group.
> 
> I switched to a metadata block group and it's on the 9th loop:
> 
> 	# btrfs balance start -mconvert=raid1 /media/testfs/
> 	[Tue Apr 28 00:09:47 2020] BTRFS info (device dm-0): found 34977 extents, stage: move data extents
> 	[Tue Apr 28 00:12:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> 	[Tue Apr 28 00:18:46 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> 	[Tue Apr 28 00:23:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> 	[Tue Apr 28 00:25:54 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> 	[Tue Apr 28 00:28:17 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> 	[Tue Apr 28 00:30:35 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> 	[Tue Apr 28 00:32:45 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> 	[Tue Apr 28 00:37:01 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> 
> 
>> Thanks,
>> Qu
>>
>>>
>>> This is "semi-infinite" because it is possible for the balance to
>>> terminate if something removes those 29 extents (e.g. by looking up the
>>> extent vaddrs with 'btrfs ins log' then feeding the references to 'btrfs
>>> fi defrag' will reduce the number of inline shared data backref objects.
>>> When it's reduced all the way to zero, balance starts up again, usually
>>> promptly getting stuck on the very next block group.  If the _only_
>>> thing running on the filesystem is balance, it will not stop looping.
>>>
>>> Bisection points to commit d2311e698578 "btrfs: relocation: Delay reloc
>>> tree deletion after merge_reloc_roots" as the first commit where the
>>> balance loops can be reproduced.
>>>
>>> I tested with commit 59b2c371052c "btrfs: check commit root generation
>>> in should_ignore_root" as well as the rest of misc-next, but the balance
>>> loops are still easier to reproduce than to avoid.
>>>
>>> Once it starts happening on a filesystem, it seems to happen very
>>> frequently.  It is not possible to reshape a RAID array of more than a
>>> few hundred GB on kernels after 5.0.  I can get maybe 50-100 block groups
>>> completed in a resize or balance after a fresh boot, then balance gets
>>> stuck in loops after that.  With the fast balance cancel patches it's
>>> possibly to recover from the loop, but futile, since the next balance
>>> will almost always also loop, even if it is passed a different block
>>> group.  I've had to downgrade to 5.0 or 4.19 to complete any RAID
>>> reshaping work.
>>>
>>
> 
> 
> 



* Re: Balance loops: what we know so far
  2020-04-28  9:54     ` Qu Wenruo
@ 2020-04-28 14:51       ` Zygo Blaxell
  2020-04-29  5:34         ` Qu Wenruo
  2020-05-04 18:54       ` Andrea Gelmini
  1 sibling, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2020-04-28 14:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Apr 28, 2020 at 05:54:21PM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/4/28 下午12:55, Zygo Blaxell wrote:
> > On Mon, Apr 27, 2020 at 03:07:29PM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2020/4/12 上午5:14, Zygo Blaxell wrote:
> >>> Since 5.1, btrfs has been prone to getting stuck in semi-infinite loops
> >>> in balance and device shrink/remove:
> >>>
> >>> 	[Sat Apr 11 16:59:32 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> 	[Sat Apr 11 16:59:33 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> 	[Sat Apr 11 16:59:35 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>
> >>> This is a block group while it's looping, as seen by python-btrfs:
> >>>
> >>> 	# share/python-btrfs/examples/show_block_group_contents.py 1934913175552 /media/testfs/
> >>> 	block group vaddr 1934913175552 length 1073741824 flags DATA used 939167744 used_pct 87
> [...]
> >>>
> >>> All of the extent data backrefs are removed by the balance, but the
> >>> loop keeps trying to get rid of the shared data backrefs.  It has
> >>> no effect on them, but keeps trying anyway.
> >>
> >> I guess this shows a pretty good clue.
> >>
> >> I was always thinking about the reloc tree, but in your case, it's data
> >> reloc tree owning them.
> > 
> > In that case, yes.  Metadata balances loop too, in the "move data extents"
> > stage while data balances loop in the "update data pointers" stage.
> 
> Would you please take an image dump of the fs when runaway balance happened?
> 
> Both metadata and data block group loop would greatly help.

There are two problems with this:

	1) my smallest test filesystems have 29GB of metadata,

	2) the problem is not reproducible with an image.

I've tried using VM snapshots to put a filesystem into a reproducible
looping state.  A block group that loops on one boot doesn't repeatably
loop on another boot from the same initial state; however, once a
booted system starts looping, it continues to loop even if the balance
is cancelled and I restart balance on the same block group or on other
random block groups.

I have production filesystems with tens of thousands of block groups
and almost all of them loop (as I said before, I cannot complete any
RAID reshapes with 5.1+ kernels).  They can't _all_ be bad.

Cancelling a balance (usually) doesn't recover.  Rebooting does.
The change that triggered this changes the order of operations in
the kernel.  That smells like a runtime thing to me.

> >> In that case, data reloc tree is only cleaned up at the end of
> >> btrfs_relocate_block_group().
> >>
> >> Thus it is never cleaned up until we exit the balance loop.
> >>
> >> I'm not sure why this is happening only after I extended the lifespan of
> >> reloc tree (not data reloc tree).
> > 
> > I have been poking around with printk to trace what it's doing in the
> > looping and non-looping cases.  It seems to be very similar up to
> > calling merge_reloc_root, merge_reloc_roots, unset_reloc_control,
> > btrfs_block_rsv_release, btrfs_commit_transaction, clean_dirty_subvols,
> > btrfs_free_block_rsv.  In the looping cases, everything up to those
> > functions seems the same on every loop except the first one.
> > 
> > In the non-looping cases, those functions do something different than
> > the looping cases:  the extents disappear in the next loop, and the
> > balance finishes.
> > 
> > I haven't figured out _what_ is different yet.  I need more cycles to
> > look at it.
> > 
> > Your extend-the-lifespan-of-reloc-tree patch moves one of the
> > functions--clean_dirty_subvols (or btrfs_drop_snapshot)--to a different
> > place in the call sequence.  It was in merge_reloc_roots before the
> > transaction commit, now it's in relocate_block_group after transaction
> > commit.  My guess is that the problem lies somewhere in how the behavior
> > of these functions has been changed by calling them in a different
> > sequence.
> > 
> >> But anyway, would you like to give a try of the following patch?
> >> https://patchwork.kernel.org/patch/11511241/
> > 
> > I'm not sure how this patch could work.  We are hitting the found_extents
> > counter every time through the loop.  It's returning thousands of extents
> > each time.
> > 
> >> It should make us exit the the balance so long as we have no extra
> >> extent to relocate.
> > 
> > The problem is not that we have no extents to relocate.  The problem is
> > that we don't successfully get rid of the extents we do find, so we keep
> > finding them over and over again.
> 
> That's very strange.
> 
> As you can see, for relocate_block_group(), it will cleanup reloc trees.
> 
> This means either we have reloc trees in use and not cleaned up, or some
> tracing mechanism is not work properly.

Can you point out where in the kernel that happens?  If we throw some
printks at it we might see something.

> Anyway, if image dump with the dead looping block group specified, it
> would provide good hint to this long problem.
> 
> Thanks,
> Qu
> 
> > 
> > In testing, the patch has no effect:
> > 
> > 	[Mon Apr 27 23:36:15 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:21 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:27 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:32 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:38 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:44 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:50 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:56 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:01 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:07 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:13 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:19 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:24 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:30 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 
> > The above is the tail end of 3320 loops on a single block group.
> > 
> > I switched to a metadata block group and it's on the 9th loop:
> > 
> > 	# btrfs balance start -mconvert=raid1 /media/testfs/
> > 	[Tue Apr 28 00:09:47 2020] BTRFS info (device dm-0): found 34977 extents, stage: move data extents
> > 	[Tue Apr 28 00:12:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:18:46 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:23:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:25:54 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:28:17 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:30:35 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:32:45 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:37:01 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 
> > 
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> This is "semi-infinite" because it is possible for the balance to
> >>> terminate if something removes those 29 extents (e.g. by looking up the
> >>> extent vaddrs with 'btrfs ins log' then feeding the references to 'btrfs
> >>> fi defrag' will reduce the number of inline shared data backref objects.
> >>> When it's reduced all the way to zero, balance starts up again, usually
> >>> promptly getting stuck on the very next block group.  If the _only_
> >>> thing running on the filesystem is balance, it will not stop looping.
> >>>
> >>> Bisection points to commit d2311e698578 "btrfs: relocation: Delay reloc
> >>> tree deletion after merge_reloc_roots" as the first commit where the
> >>> balance loops can be reproduced.
> >>>
> >>> I tested with commit 59b2c371052c "btrfs: check commit root generation
> >>> in should_ignore_root" as well as the rest of misc-next, but the balance
> >>> loops are still easier to reproduce than to avoid.
> >>>
> >>> Once it starts happening on a filesystem, it seems to happen very
> >>> frequently.  It is not possible to reshape a RAID array of more than a
> >>> few hundred GB on kernels after 5.0.  I can get maybe 50-100 block groups
> >>> completed in a resize or balance after a fresh boot, then balance gets
> >>> stuck in loops after that.  With the fast balance cancel patches it's
> >>> possibly to recover from the loop, but futile, since the next balance
> >>> will almost always also loop, even if it is passed a different block
> >>> group.  I've had to downgrade to 5.0 or 4.19 to complete any RAID
> >>> reshaping work.
> >>>
> >>
> > 
> > 
> > 
> 





* Re: Balance loops: what we know so far
  2020-04-28 14:51       ` Zygo Blaxell
@ 2020-04-29  5:34         ` Qu Wenruo
  2020-04-29 12:23           ` Sebastian Döring
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2020-04-29  5:34 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs





On 2020/4/28 下午10:51, Zygo Blaxell wrote:
> On Tue, Apr 28, 2020 at 05:54:21PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2020/4/28 下午12:55, Zygo Blaxell wrote:
>>> On Mon, Apr 27, 2020 at 03:07:29PM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2020/4/12 上午5:14, Zygo Blaxell wrote:
>>>>> Since 5.1, btrfs has been prone to getting stuck in semi-infinite loops
>>>>> in balance and device shrink/remove:
>>>>>
>>>>> 	[Sat Apr 11 16:59:32 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
>>>>> 	[Sat Apr 11 16:59:33 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
>>>>> 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
>>>>> 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
>>>>> 	[Sat Apr 11 16:59:35 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
>>>>>
>>>>> This is a block group while it's looping, as seen by python-btrfs:
>>>>>
>>>>> 	# share/python-btrfs/examples/show_block_group_contents.py 1934913175552 /media/testfs/
>>>>> 	block group vaddr 1934913175552 length 1073741824 flags DATA used 939167744 used_pct 87
>> [...]
>>>>>
>>>>> All of the extent data backrefs are removed by the balance, but the
>>>>> loop keeps trying to get rid of the shared data backrefs.  It has
>>>>> no effect on them, but keeps trying anyway.
>>>>
>>>> I guess this shows a pretty good clue.
>>>>
>>>> I was always thinking about the reloc tree, but in your case, it's data
>>>> reloc tree owning them.
>>>
>>> In that case, yes.  Metadata balances loop too, in the "move data extents"
>>> stage while data balances loop in the "update data pointers" stage.
>>
>> Would you please take an image dump of the fs when runaway balance happened?
>>
>> Both metadata and data block group loop would greatly help.
> 
> There's two problems with this:
> 
> 	1) my smallest test filesystems have 29GB of metadata,
> 
> 	2) the problem is not reproducible with an image.

OK, then when such a loop happens, what we need is:
- Which block group is looping
  Obviously

- The extent tree dump of that block group
  You have done a pretty good job on that.

- Follow the backref to reach the root
  In the balance case, the offending backref should always be using the
  parent bytenr, so you need to walk up step by step to reach the root,
  which should have a backref pointing back to itself.

- Root tree dump of the looping fs.
  To find out which reloc tree it belongs to.
  And show the ref number of that reloc tree.
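
(A rough way to produce the extent tree and root tree dumps above; the
device path is a placeholder, and the extent tree dump can then be
filtered down to the looping block group's vaddr range:)

	btrfs inspect-internal dump-tree -t extent /dev/mapper/vg-testfs > extent-tree.txt
	btrfs inspect-internal dump-tree -t root /dev/mapper/vg-testfs > root-tree.txt
	# reloc tree roots should show up in the root tree dump as TREE_RELOC items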

> 
> I've tried using VM snapshots to put a filesystem into a reproducible
> looping state.  A block group that loops on one boot doesn't repeatably
> loop on another boot from the same initial state; however, once a
> booted system starts looping, it continues to loop even if the balance
> is cancelled, and I restart balance on the same block group or other
> random block groups.
> 
> I have production filesystems with tens of thousands of block groups
> and almost all of them loop (as I said before, I cannot complete any
> RAID reshapes with 5.1+ kernels).  They can't _all_ be bad.
> 
> Cancelling a balance (usually) doesn't recover.  Rebooting does.
> The change that triggered this changes the order of operations in
> the kernel.  That smells like a runtime thing to me.
> 
>>>> In that case, data reloc tree is only cleaned up at the end of
>>>> btrfs_relocate_block_group().
>>>>
>>>> Thus it is never cleaned up until we exit the balance loop.
>>>>
>>>> I'm not sure why this is happening only after I extended the lifespan of
>>>> reloc tree (not data reloc tree).
>>>
>>> I have been poking around with printk to trace what it's doing in the
>>> looping and non-looping cases.  It seems to be very similar up to
>>> calling merge_reloc_root, merge_reloc_roots, unset_reloc_control,
>>> btrfs_block_rsv_release, btrfs_commit_transaction, clean_dirty_subvols,
>>> btrfs_free_block_rsv.  In the looping cases, everything up to those
>>> functions seems the same on every loop except the first one.
>>>
>>> In the non-looping cases, those functions do something different than
>>> the looping cases:  the extents disappear in the next loop, and the
>>> balance finishes.
>>>
>>> I haven't figured out _what_ is different yet.  I need more cycles to
>>> look at it.
>>>
>>> Your extend-the-lifespan-of-reloc-tree patch moves one of the
>>> functions--clean_dirty_subvols (or btrfs_drop_snapshot)--to a different
>>> place in the call sequence.  It was in merge_reloc_roots before the
>>> transaction commit, now it's in relocate_block_group after transaction
>>> commit.  My guess is that the problem lies somewhere in how the behavior
>>> of these functions has been changed by calling them in a different
>>> sequence.
>>>
>>>> But anyway, would you like to give a try of the following patch?
>>>> https://patchwork.kernel.org/patch/11511241/
>>>
>>> I'm not sure how this patch could work.  We are hitting the found_extents
>>> counter every time through the loop.  It's returning thousands of extents
>>> each time.
>>>
>>>> It should make us exit the the balance so long as we have no extra
>>>> extent to relocate.
>>>
>>> The problem is not that we have no extents to relocate.  The problem is
>>> that we don't successfully get rid of the extents we do find, so we keep
>>> finding them over and over again.
>>
>> That's very strange.
>>
>> As you can see, for relocate_block_group(), it will cleanup reloc trees.
>>
>> This means either we have reloc trees in use and not cleaned up, or some
>> tracing mechanism is not work properly.
> 
> Can you point out where in the kernel that happens?  If we throw some
> printks at it we might see something.

The merge happens at merge_reloc_roots().
The cleanup happens at clean_dirty_subvols().

In theory, all reloc trees will either be marked as orphan (due to new
changes in the original subvolume) or merged in merge_reloc_roots(),
and then queued for cleanup.
Then in clean_dirty_subvols() all the reloc trees get cleaned up.

If there are any reloc trees remaining after clean_dirty_subvols(),
they could be the cause of such a loop.

Thanks,
Qu
> 
>> Anyway, if image dump with the dead looping block group specified, it
>> would provide good hint to this long problem.
>>
>> Thanks,
>> Qu
>>
>>>
>>> In testing, the patch has no effect:
>>>
>>> 	[Mon Apr 27 23:36:15 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:36:21 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:36:27 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:36:32 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:36:38 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:36:44 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:36:50 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:36:56 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:37:01 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:37:07 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:37:13 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:37:19 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:37:24 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>> 	[Mon Apr 27 23:37:30 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
>>>
>>> The above is the tail end of 3320 loops on a single block group.
>>>
>>> I switched to a metadata block group and it's on the 9th loop:
>>>
>>> 	# btrfs balance start -mconvert=raid1 /media/testfs/
>>> 	[Tue Apr 28 00:09:47 2020] BTRFS info (device dm-0): found 34977 extents, stage: move data extents
>>> 	[Tue Apr 28 00:12:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
>>> 	[Tue Apr 28 00:18:46 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
>>> 	[Tue Apr 28 00:23:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
>>> 	[Tue Apr 28 00:25:54 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
>>> 	[Tue Apr 28 00:28:17 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
>>> 	[Tue Apr 28 00:30:35 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
>>> 	[Tue Apr 28 00:32:45 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
>>> 	[Tue Apr 28 00:37:01 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
>>>
>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> This is "semi-infinite" because it is possible for the balance to
>>>>> terminate if something removes those 29 extents (e.g. by looking up the
>>>>> extent vaddrs with 'btrfs ins log' then feeding the references to 'btrfs
>>>>> fi defrag' will reduce the number of inline shared data backref objects.
>>>>> When it's reduced all the way to zero, balance starts up again, usually
>>>>> promptly getting stuck on the very next block group.  If the _only_
>>>>> thing running on the filesystem is balance, it will not stop looping.
>>>>>
>>>>> Bisection points to commit d2311e698578 "btrfs: relocation: Delay reloc
>>>>> tree deletion after merge_reloc_roots" as the first commit where the
>>>>> balance loops can be reproduced.
>>>>>
>>>>> I tested with commit 59b2c371052c "btrfs: check commit root generation
>>>>> in should_ignore_root" as well as the rest of misc-next, but the balance
>>>>> loops are still easier to reproduce than to avoid.
>>>>>
>>>>> Once it starts happening on a filesystem, it seems to happen very
>>>>> frequently.  It is not possible to reshape a RAID array of more than a
>>>>> few hundred GB on kernels after 5.0.  I can get maybe 50-100 block groups
>>>>> completed in a resize or balance after a fresh boot, then balance gets
>>>>> stuck in loops after that.  With the fast balance cancel patches it's
>>>>> possibly to recover from the loop, but futile, since the next balance
>>>>> will almost always also loop, even if it is passed a different block
>>>>> group.  I've had to downgrade to 5.0 or 4.19 to complete any RAID
>>>>> reshaping work.
>>>>>
>>>>
>>>
>>>
>>>
>>
> 
> 
> 



* Re: Balance loops: what we know so far
  2020-04-29  5:34         ` Qu Wenruo
@ 2020-04-29 12:23           ` Sebastian Döring
  0 siblings, 0 replies; 32+ messages in thread
From: Sebastian Döring @ 2020-04-29 12:23 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

I think I'm also running into this kind of balancing loop. I'm seeing
thousands of these repeating messages in syslog:


Apr 29 11:56:49 rockpro64 kernel: [12044.645247] BTRFS info (device dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:49 rockpro64 kernel: [12045.134649] BTRFS info (device dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:50 rockpro64 kernel: [12045.648773] BTRFS info (device dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:50 rockpro64 kernel: [12046.172767] BTRFS info (device dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:51 rockpro64 kernel: [12046.666079] BTRFS info (device dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:51 rockpro64 kernel: [12047.199652] BTRFS info (device dm-3): found 36652 extents, stage: move data extents


Can I help you to debug this in any way? I'm not too familiar with the
internal btrfs tools.

Best regards,
Sebastian


On Wed, Apr 29, 2020 at 07:36, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/4/28 下午10:51, Zygo Blaxell wrote:
> > On Tue, Apr 28, 2020 at 05:54:21PM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2020/4/28 下午12:55, Zygo Blaxell wrote:
> >>> On Mon, Apr 27, 2020 at 03:07:29PM +0800, Qu Wenruo wrote:
> >>>>
> >>>>
> >>>> On 2020/4/12 上午5:14, Zygo Blaxell wrote:
> >>>>> Since 5.1, btrfs has been prone to getting stuck in semi-infinite loops
> >>>>> in balance and device shrink/remove:
> >>>>>
> >>>>>   [Sat Apr 11 16:59:32 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>>   [Sat Apr 11 16:59:33 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>>   [Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>>   [Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>>   [Sat Apr 11 16:59:35 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>>
> >>>>> This is a block group while it's looping, as seen by python-btrfs:
> >>>>>
> >>>>>   # share/python-btrfs/examples/show_block_group_contents.py 1934913175552 /media/testfs/
> >>>>>   block group vaddr 1934913175552 length 1073741824 flags DATA used 939167744 used_pct 87
> >> [...]
> >>>>>
> >>>>> All of the extent data backrefs are removed by the balance, but the
> >>>>> loop keeps trying to get rid of the shared data backrefs.  It has
> >>>>> no effect on them, but keeps trying anyway.
> >>>>
> >>>> I guess this shows a pretty good clue.
> >>>>
> >>>> I was always thinking about the reloc tree, but in your case, it's data
> >>>> reloc tree owning them.
> >>>
> >>> In that case, yes.  Metadata balances loop too, in the "move data extents"
> >>> stage while data balances loop in the "update data pointers" stage.
> >>
> >> Would you please take an image dump of the fs when the runaway balance happens?
> >>
> >> Both a metadata and a data block group loop would greatly help.
> >
> > There are two problems with this:
> >
> >       1) my smallest test filesystems have 29GB of metadata,
> >
> >       2) the problem is not reproducible with an image.
>
> OK, then when such a loop happens, what we need is (a command-line
> sketch follows after this list):
> - Which block group is looping
>   Obviously
>
> - The extent tree dump of that block group
>   You have done a pretty good job on that.
>
> - Follow the backref to reach the root
>   In the balance case, the offending backref should always be using the
>   parent bytenr, so you need to go step by step up to the root, which
>   should have a backref pointing back to itself.
>
> - A root tree dump of the looping fs
>   To find out which reloc tree it belongs to, and to show the ref count
>   of that reloc tree.
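
A rough command-line sketch of the dumps being asked for above, assuming
a btrfs-progs with the 'btrfs inspect-internal dump-tree' subcommand;
the device name /dev/dm-0 is only a placeholder, and ideally the
filesystem is unmounted when dumping:

        # extent tree dump: page to the looping block group's vaddr to see
        # its EXTENT_ITEMs and their inline backrefs
        btrfs inspect-internal dump-tree -t extent /dev/dm-0 | less

        # root tree dump: look for the TREE_RELOC and DATA_RELOC_TREE root
        # items and their ref counts
        btrfs inspect-internal dump-tree -t root /dev/dm-0
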
>
> >
> > I've tried using VM snapshots to put a filesystem into a reproducible
> > looping state.  A block group that loops on one boot doesn't repeatably
> > loop on another boot from the same initial state; however, once a
> > booted system starts looping, it continues to loop even if the balance
> > is cancelled, and I restart balance on the same block group or other
> > random block groups.
> >
> > I have production filesystems with tens of thousands of block groups
> > and almost all of them loop (as I said before, I cannot complete any
> > RAID reshapes with 5.1+ kernels).  They can't _all_ be bad.
> >
> > Cancelling a balance (usually) doesn't recover.  Rebooting does.
> > The change that triggered this changes the order of operations in
> > the kernel.  That smells like a runtime thing to me.
> >
> >>>> In that case, data reloc tree is only cleaned up at the end of
> >>>> btrfs_relocate_block_group().
> >>>>
> >>>> Thus it is never cleaned up until we exit the balance loop.
> >>>>
> >>>> I'm not sure why this is happening only after I extended the lifespan of
> >>>> reloc tree (not data reloc tree).
> >>>
> >>> I have been poking around with printk to trace what it's doing in the
> >>> looping and non-looping cases.  It seems to be very similar up to
> >>> calling merge_reloc_root, merge_reloc_roots, unset_reloc_control,
> >>> btrfs_block_rsv_release, btrfs_commit_transaction, clean_dirty_subvols,
> >>> btrfs_free_block_rsv.  In the looping cases, everything up to those
> >>> functions seems the same on every loop except the first one.
> >>>
> >>> In the non-looping cases, those functions do something different than
> >>> the looping cases:  the extents disappear in the next loop, and the
> >>> balance finishes.
> >>>
> >>> I haven't figured out _what_ is different yet.  I need more cycles to
> >>> look at it.
> >>>
> >>> Your extend-the-lifespan-of-reloc-tree patch moves one of the
> >>> functions--clean_dirty_subvols (or btrfs_drop_snapshot)--to a different
> >>> place in the call sequence.  It was in merge_reloc_roots before the
> >>> transaction commit, now it's in relocate_block_group after transaction
> >>> commit.  My guess is that the problem lies somewhere in how the behavior
> >>> of these functions has been changed by calling them in a different
> >>> sequence.
> >>>
> >>>> But anyway, would you like to give a try of the following patch?
> >>>> https://patchwork.kernel.org/patch/11511241/
> >>>
> >>> I'm not sure how this patch could work.  We are hitting the found_extents
> >>> counter every time through the loop.  It's returning thousands of extents
> >>> each time.
> >>>
> >>>> It should make us exit the balance so long as we have no extra
> >>>> extent to relocate.
> >>>
> >>> The problem is not that we have no extents to relocate.  The problem is
> >>> that we don't successfully get rid of the extents we do find, so we keep
> >>> finding them over and over again.
> >>
> >> That's very strange.
> >>
> >> As you can see, relocate_block_group() will clean up the reloc trees.
> >>
> >> This means either we have reloc trees in use and not cleaned up, or some
> >> tracing mechanism is not working properly.
> >
> > Can you point out where in the kernel that happens?  If we throw some
> > printks at it we might see something.
>
> The merge happens at merge_reloc_roots().
> The cleanup happens at clean_dirty_subvols().
>
> In theory, all reloc trees will either be marked as orphan (due to new
> changes in the original subvolume) or be merged in merge_reloc_roots(),
> and queued for cleanup.
> Then in clean_dirty_subvols() all the reloc trees get cleaned up.
>
> If there are any reloc trees remaining after clean_dirty_subvols(),
> that could be the cause of such a loop.
>
> Thanks,
> Qu
> >
> >> Anyway, an image dump with the dead-looping block group specified
> >> would provide a good hint for this long-standing problem.
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> In testing, the patch has no effect:
> >>>
> >>>     [Mon Apr 27 23:36:15 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:36:21 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:36:27 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:36:32 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:36:38 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:36:44 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:36:50 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:36:56 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:37:01 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:37:07 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:37:13 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:37:19 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:37:24 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>     [Mon Apr 27 23:37:30 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>
> >>> The above is the tail end of 3320 loops on a single block group.
> >>>
> >>> I switched to a metadata block group and it's on the 9th loop:
> >>>
> >>>     # btrfs balance start -mconvert=raid1 /media/testfs/
> >>>     [Tue Apr 28 00:09:47 2020] BTRFS info (device dm-0): found 34977 extents, stage: move data extents
> >>>     [Tue Apr 28 00:12:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>>     [Tue Apr 28 00:18:46 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>>     [Tue Apr 28 00:23:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>>     [Tue Apr 28 00:25:54 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>>     [Tue Apr 28 00:28:17 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>>     [Tue Apr 28 00:30:35 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>>     [Tue Apr 28 00:32:45 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>>     [Tue Apr 28 00:37:01 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>>
> >>>
> >>>> Thanks,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> This is "semi-infinite" because it is possible for the balance to
> >>>>> terminate if something removes those 29 extents (e.g. looking up the
> >>>>> extent vaddrs with 'btrfs ins log' and feeding the references to 'btrfs
> >>>>> fi defrag' will reduce the number of inline shared data backref objects).
> >>>>> When it's reduced all the way to zero, balance starts up again, usually
> >>>>> promptly getting stuck on the very next block group.  If the _only_
> >>>>> thing running on the filesystem is balance, it will not stop looping.
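
A minimal sketch of that workaround, assuming the filesystem is mounted
at /media/testfs and reusing one of the vaddrs shown earlier purely as an
example ('btrfs ins log' is the abbreviated form of 'btrfs
inspect-internal logical-resolve'):

        # resolve the looping extent back to file paths, then rewrite those
        # files so their shared backrefs go away; resolution can fail for
        # paths that only existed in since-deleted snapshots
        btrfs inspect-internal logical-resolve 1934913175552 /media/testfs |
        while read -r path; do
                btrfs filesystem defragment "$path"
        done
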
> >>>>>
> >>>>> Bisection points to commit d2311e698578 "btrfs: relocation: Delay reloc
> >>>>> tree deletion after merge_reloc_roots" as the first commit where the
> >>>>> balance loops can be reproduced.
> >>>>>
> >>>>> I tested with commit 59b2c371052c "btrfs: check commit root generation
> >>>>> in should_ignore_root" as well as the rest of misc-next, but the balance
> >>>>> loops are still easier to reproduce than to avoid.
> >>>>>
> >>>>> Once it starts happening on a filesystem, it seems to happen very
> >>>>> frequently.  It is not possible to reshape a RAID array of more than a
> >>>>> few hundred GB on kernels after 5.0.  I can get maybe 50-100 block groups
> >>>>> completed in a resize or balance after a fresh boot, then balance gets
> >>>>> stuck in loops after that.  With the fast balance cancel patches it's
> >>>>> possible to recover from the loop, but futile, since the next balance
> >>>>> will almost always also loop, even if it is passed a different block
> >>>>> group.  I've had to downgrade to 5.0 or 4.19 to complete any RAID
> >>>>> reshaping work.
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >
> >
> >
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-04-28  9:54     ` Qu Wenruo
  2020-04-28 14:51       ` Zygo Blaxell
@ 2020-05-04 18:54       ` Andrea Gelmini
  2020-05-04 23:48         ` Qu Wenruo
  1 sibling, 1 reply; 32+ messages in thread
From: Andrea Gelmini @ 2020-05-04 18:54 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Zygo Blaxell, Linux BTRFS

On Tue, Apr 28, 2020 at 12:00, Qu Wenruo
<quwenruo.btrfs@gmx.com> wrote:

>
> Would you please take an image dump of the fs when the runaway balance happens?
>
> Both a metadata and a data block group loop would greatly help.

Hi Qu, and thanks a lot for your work.

I have a VirtualBox machine (in save-state) with the problem running.

So, I can send it to you, if you are comfortable with VirtualBox, or we
can work together on my machine.

Short story:
a) the running system is a live CD Ubuntu 20.04 with kernel 5.4.0-21;
b) a partition of 10 GB (not /) is mounted;
c) it was shrunk to 4 GB;
d) then I removed files and left just 59 MB of data;
e) I started balance -m (not -d), and it loops forever (or, at least, I
left it running for hours, which I guess should be enough).

If I reset it, at restart there is no more loop.
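
A rough shell sketch of that sequence (the device name, mount point and
the path being deleted are only placeholders for what is described above,
not the exact session):

        # b) mount the ~10G btrfs partition (not /)
        mount /dev/sdb1 /mnt/test

        # c) shrink the filesystem to 4G
        btrfs filesystem resize 4G /mnt/test

        # d) remove files until only ~59M of data is left
        rm -rf /mnt/test/most-of-the-data

        # e) metadata balance (-m, not -d); this is the step that loops
        btrfs balance start -m /mnt/test
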

Thanks again,
Gelma

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-04 18:54       ` Andrea Gelmini
@ 2020-05-04 23:48         ` Qu Wenruo
  2020-05-05  9:10           ` Andrea Gelmini
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2020-05-04 23:48 UTC (permalink / raw)
  To: Andrea Gelmini; +Cc: Zygo Blaxell, Linux BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 1465 bytes --]



On 2020/5/5 2:54 AM, Andrea Gelmini wrote:
> On Tue, Apr 28, 2020 at 12:00, Qu Wenruo
> <quwenruo.btrfs@gmx.com> wrote:
> 
>>
>> Would you please take an image dump of the fs when the runaway balance happens?
>>
>> Both a metadata and a data block group loop would greatly help.
> 
> Hi Qu, and thanks a lot for your work.
> 
> I have a VirtualBox machine (in save-state) with the problem running.
> 
> So, I can send it to you, if you are comfortable with VirtualBox, or we
> can work together on my machine.

I mean, btrfs-image dump of the umounted fs.
(btrfs-image can compress the metadata, and won't include data, thus it
can be way smaller than the image)
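
For reference, such a metadata-only dump of an unmounted filesystem can
be taken with something like the following (the device and output paths
are placeholders; -c9 picks the compression level, -t4 the number of
threads, and -s can be added to sanitize file names):

        btrfs-image -c9 -t4 /dev/sdb1 /tmp/btrfs-metadata.img
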

> 
> Short story:
> a) the running system is a live CD Ubuntu 20.04 with kernel 5.4.0-21;
> b) a partition of 10 GB (not /) is mounted;
> c) it was shrunk to 4 GB;
> d) then I removed files and left just 59 MB of data;
> e) I started balance -m (not -d), and it loops forever (or, at least, I
> left it running for hours, which I guess should be enough).

At this stage, the image should be pretty small.
You can try restarting the system, booting into the live CD, and dumping
the image.

It won't matter that much even if you can't reproduce the loop after the
restart; the point is to find out what's preventing the metadata balance
from continuing. So the image dump should be good enough.

Thanks,
Qu

> 
> If I reset it, at restart there is no more loop.
> 
> Thanks again,
> Gelma
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-04 23:48         ` Qu Wenruo
@ 2020-05-05  9:10           ` Andrea Gelmini
  2020-05-06  5:58             ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: Andrea Gelmini @ 2020-05-05  9:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Zygo Blaxell, Linux BTRFS

On Tue, May 5, 2020 at 01:48, Qu Wenruo
<quwenruo.btrfs@gmx.com> wrote:

> I mean, btrfs-image dump of the umounted fs.
> (btrfs-image can compress the metadata, and won't include data, thus it
> can be way smaller than the image)

No problem to give you complete image, data and metadata.

> At this stage, the image should be pretty small.
> You can try restarting the system, booting into the live CD, and dumping the image.

Just to give you more possibly useful info, here¹ you'll find all the
files. I guess you'll figure out how to use them without my explanation,
but anyway, just for the record:

- video.mkv : shows how I bring the save-state of the VM back up at the
moment I saw the loop;
- Virtualbox-Files.tar.bz2 : contains all the VirtualBox VM files, so
you can run it on your own. You can also view the command history;
- ubuntu-iso.tar : just to make it easier, it's simply the Ubuntu ISO
I'm using, with the right path, in case you want to replicate my side;
- sda1.btrfs.dd.bz2 : a dd of the whole BTRFS partition, taken just after
the reset of the VM, without mounting it. So, if I mount it, I see
this:

[268398.481278] BTRFS: device fsid
cdbf6911-63f6-410e-9d22-a0376dfcc8ce devid 1 transid 32588 /dev/loop35
scanned by systemd-udevd (392357)
[268404.681217] BTRFS info (device loop35): disk space caching is enabled
[268404.681221] BTRFS info (device loop35): has skinny extents
[268404.694708] BTRFS info (device loop35): enabling ssd optimizations
[268404.700398] BTRFS info (device loop35): checking UUID tree
[268404.722430] BTRFS info (device loop35): balance: resume -musage=2 -susage=2
[268404.722523] BTRFS info (device loop35): relocating block group
12546211840 flags metadata|dup
[268404.752675] BTRFS info (device loop35): found 21 extents
[268404.771509] BTRFS info (device loop35): relocating block group
12512657408 flags system|dup
[268404.792802] BTRFS info (device loop35): found 1 extents
[268404.819137] BTRFS info (device loop35): relocating block group
12210667520 flags metadata|dup
[268404.858634] BTRFS info (device loop35): found 26 extents
[268404.915321] BTRFS info (device loop35): balance: ended with status: 0

Just in case:
1aced5ec23425845a79966d9a78033aa  sda1.btrfs.dd.bz2
93c9a2d50395383090dd031015ca8e89  ubuntu-iso.tar
e9b3d49439cc41cd34fba078cf99c1b8  video.mkv
663b9a55bbfdb51fc8a77569445a9102  Virtualbox-Files.tar.bz2

Hope it helps,
Gelma

¹ http://mail.gelma.net/btrfs-vm/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-05  9:10           ` Andrea Gelmini
@ 2020-05-06  5:58             ` Qu Wenruo
  2020-05-06 18:24               ` Andrea Gelmini
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2020-05-06  5:58 UTC (permalink / raw)
  To: Andrea Gelmini; +Cc: Zygo Blaxell, Linux BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 3676 bytes --]



On 2020/5/5 5:10 PM, Andrea Gelmini wrote:
> On Tue, May 5, 2020 at 01:48, Qu Wenruo
> <quwenruo.btrfs@gmx.com> wrote:
> 
>> I mean, btrfs-image dump of the umounted fs.
>> (btrfs-image can compress the metadata, and won't include data, thus it
>> can be way smaller than the image)
> 
> No problem to give you complete image, data and metadata.

Thanks for your dump.

Do you still have the initial looping dmesg?
The point here is to locate the offending block group.

But anyway, your dump is already very interesting.

- The dump has no reloc tree
  Which means the balance should have already finished and all reloc
  trees should already be cleaned up.
  In theory, we should be able to continue to the next chunk.
  But in reality we're not.

- There is a metadata chunk with a data reloc tree leaf
  If we're looping on block group 12210667520, then it's possible that
  we're dead-looping on that block group, with "found 1 extents".
  From the bytenr it indeed looks like that is the case.
  But still, the initial dead-loop dmesg would provide better proof of
  this.
  (Sorry, not sure if I could convert the vbox format to qcow2 nor how
   to resume it to KVM/qemu, thus I still have to bother you to dump the
   dmesg)

If that's the case, it would be very interesting to see how we're handling
the data reloc tree.
It would be well worth digging into then.

Thanks,
Qu

> 
>> At this stage, the image should be pretty small.
>> You can try restarting the system, booting into the live CD, and dumping the image.
> 
> Just to give you more possibly useful info, here¹ you'll find all the
> files. I guess you'll figure out how to use them without my explanation,
> but anyway, just for the record:
> 
> - video.mkv : shows how I bring the save-state of the VM back up at the
> moment I saw the loop;
> - Virtualbox-Files.tar.bz2 : contains all the VirtualBox VM files, so
> you can run it on your own. You can also view the command history;
> - ubuntu-iso.tar : just to make it easier, it's simply the Ubuntu ISO
> I'm using, with the right path, in case you want to replicate my side;
> - sda1.btrfs.dd.bz2 : a dd of the whole BTRFS partition, taken just after
> the reset of the VM, without mounting it. So, if I mount it, I see
> this:
> 
> [268398.481278] BTRFS: device fsid
> cdbf6911-63f6-410e-9d22-a0376dfcc8ce devid 1 transid 32588 /dev/loop35
> scanned by systemd-udevd (392357)
> [268404.681217] BTRFS info (device loop35): disk space caching is enabled
> [268404.681221] BTRFS info (device loop35): has skinny extents
> [268404.694708] BTRFS info (device loop35): enabling ssd optimizations
> [268404.700398] BTRFS info (device loop35): checking UUID tree
> [268404.722430] BTRFS info (device loop35): balance: resume -musage=2 -susage=2
> [268404.722523] BTRFS info (device loop35): relocating block group
> 12546211840 flags metadata|dup
> [268404.752675] BTRFS info (device loop35): found 21 extents
> [268404.771509] BTRFS info (device loop35): relocating block group
> 12512657408 flags system|dup
> [268404.792802] BTRFS info (device loop35): found 1 extents
> [268404.819137] BTRFS info (device loop35): relocating block group
> 12210667520 flags metadata|dup
> [268404.858634] BTRFS info (device loop35): found 26 extents
> [268404.915321] BTRFS info (device loop35): balance: ended with status: 0
> 
> Just in case:
> 1aced5ec23425845a79966d9a78033aa  sda1.btrfs.dd.bz2
> 93c9a2d50395383090dd031015ca8e89  ubuntu-iso.tar
> e9b3d49439cc41cd34fba078cf99c1b8  video.mkv
> 663b9a55bbfdb51fc8a77569445a9102  Virtualbox-Files.tar.bz2
> 
> Hope it helps,
> Gelma
> 
> ¹ http://mail.gelma.net/btrfs-vm/
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-06  5:58             ` Qu Wenruo
@ 2020-05-06 18:24               ` Andrea Gelmini
  2020-05-07  9:59                 ` Andrea Gelmini
  2020-05-08  6:33                 ` Qu Wenruo
  0 siblings, 2 replies; 32+ messages in thread
From: Andrea Gelmini @ 2020-05-06 18:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Zygo Blaxell, Linux BTRFS

On Wed, May 6, 2020 at 07:58, Qu Wenruo
<quwenruo.btrfs@gmx.com> wrote:
> Thanks for your dump.
>
> Do you still have the initial looping dmesg?

Yeap, you can find the complete syslog here:
http://mail.gelma.net/btrfs-vm/syslog.txt


>   If we're looping on block group 12210667520, then it's possible that

Oh, well, from the syslog above:
May  3 17:14:25 ubuntu kernel: [  586.735870] BTRFS info (device
sda1): balance: start -musage=2 -susage=2
May  3 17:14:25 ubuntu kernel: [  586.735974] BTRFS info (device
sda1): relocating block group 12479102976 flags system|dup
May  3 17:14:25 ubuntu kernel: [  586.833331] BTRFS info (device
sda1): relocating block group 12210667520 flags metadata|dup
May  3 17:14:25 ubuntu kernel: [  586.939172] BTRFS info (device
sda1): found 26 extents
May  3 17:14:25 ubuntu kernel: [  587.021568] BTRFS info (device
sda1): found 1 extents
(and then repeating lines like this)

>   (Sorry, not sure if I could convert the vbox format to qcow2 nor how
>    to resume it to KVM/qemu, thus I still have to bother you to dump the
>    dmesg)

I guess there is no way to migrate the saved state of VB to QEMU.
Anyway, I have prepared a minimal Lubuntu 20.04. It starts up, runs
VirtualBox and lets you replicate what I see.
It's ~40GB. If you cat it onto /dev/usb-stick you can boot it up (there
is no way to run it inside KVM because VB needs access to VT-x).

I'm uploading it. It takes a lot of hours. I'll tell you when it's done.

> If that's the case, it would be very interesting to see how we're handling
> the data reloc tree.
> It would be well worth digging into then.

Good!

Ciao,
Gelma

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-06 18:24               ` Andrea Gelmini
@ 2020-05-07  9:59                 ` Andrea Gelmini
  2020-05-08  6:33                 ` Qu Wenruo
  1 sibling, 0 replies; 32+ messages in thread
From: Andrea Gelmini @ 2020-05-07  9:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Zygo Blaxell, Linux BTRFS

On Wed, May 6, 2020 at 20:24, Andrea Gelmini
<andrea.gelmini@gmail.com> wrote:
> I'm uploading it. It takes a lot of hours. I'll tell you when it's done.

Here you can find the complete image:
http://mail.gelma.net/btrfs-vm/live.dd.bz2

md5sum:
ea34712f83a5db647298c33b7b274f77  live.dd.bz2

Ciao,
Gelma

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-06 18:24               ` Andrea Gelmini
  2020-05-07  9:59                 ` Andrea Gelmini
@ 2020-05-08  6:33                 ` Qu Wenruo
  1 sibling, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2020-05-08  6:33 UTC (permalink / raw)
  To: Andrea Gelmini; +Cc: Zygo Blaxell, Linux BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 2506 bytes --]



On 2020/5/7 2:24 AM, Andrea Gelmini wrote:
> On Wed, May 6, 2020 at 07:58, Qu Wenruo
> <quwenruo.btrfs@gmx.com> wrote:
>> Thanks for your dump.
>>
>> Do you still have the initial looping dmesg?
> 
> Yeap, you can find the complete syslog here:
> http://mail.gelma.net/btrfs-vm/syslog.txt

This helps a lot!
> 
> 
>>   If we're looping on block group 12210667520, then it's possible that
> 
> Oh, well, from the syslog above:
> May  3 17:14:25 ubuntu kernel: [  586.735870] BTRFS info (device
> sda1): balance: start -musage=2 -susage=2
> May  3 17:14:25 ubuntu kernel: [  586.735974] BTRFS info (device
> sda1): relocating block group 12479102976 flags system|dup
> May  3 17:14:25 ubuntu kernel: [  586.833331] BTRFS info (device
> sda1): relocating block group 12210667520 flags metadata|dup
> May  3 17:14:25 ubuntu kernel: [  586.939172] BTRFS info (device
> sda1): found 26 extents
> May  3 17:14:25 ubuntu kernel: [  587.021568] BTRFS info (device
> sda1): found 1 extents
> (and then repeating lines like this)

Exactly, this proves my assumption: something strange is happening with
the DATA RELOC TREE.

That last tree block is from the DATA_RELOC tree; normally we would
relocate it by creating a reloc tree for it and swapping tree blocks.

But we don't need to be that complex for the DATA_RELOC tree: unlike a
regular subvolume tree, no snapshot can be created for the DATA_RELOC
tree, thus we can handle it just like the extent tree.

Although I still don't understand why it doesn't work as usual, I'm
working on a POC patch to change the behavior, hoping to solve the problem.

In the meantime, if you hit a similar case, the same kind of dump would
be a great help in improving btrfs.

Thanks,
Qu

> 
>>   (Sorry, not sure if I could convert the vbox format to qcow2 nor how
>>    to resume it to KVM/qemu, thus I still have to bother you to dump the
>>    dmesg)
> 
> I guess there is no way to migrate the saved state of VB to QEMU.
> Anyway, I have prepared a minimal Lubuntu 20.04. It starts up, runs
> VirtualBox and lets you replicate what I see.
> It's ~40GB. If you cat it onto /dev/usb-stick you can boot it up (there
> is no way to run it inside KVM because VB needs access to VT-x).
> 
> I'm uploading it. It takes a lot of hours. I'll tell you when it's done.
> 
> >> If that's the case, it would be very interesting to see how we're handling
> >> the data reloc tree.
> >> It would be well worth digging into then.
> 
> Good!
> 
> Ciao,
> Gelma
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-04-28  4:55   ` Zygo Blaxell
  2020-04-28  9:54     ` Qu Wenruo
@ 2020-05-11  8:31     ` Qu Wenruo
  2020-05-12 13:43       ` Zygo Blaxell
  1 sibling, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2020-05-11  8:31 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1.1: Type: text/plain, Size: 627 bytes --]

Hi Zygo,

Would you like to test this diff?

Although I haven't found a solid reason yet, there is another report, and
with the help of the reporter it turns out that balance hangs at
relocating a DATA_RELOC tree block.

After some more digging, the DATA_RELOC tree doesn't need the REF_COW bit
at all, since we can't create snapshots of the data reloc tree.

By removing the REF_COW bit, we could ensure that the data reloc tree
always gets COWed for relocation (just like the extent tree), which would
hugely reduce the complexity for the data reloc tree.

Not sure if this will help, but it passes my local balance run.

Thanks,
Qu

[-- Attachment #1.1.2: 0001-btrfs-Remove-the-REF_COW-bit-for-data-reloc-tree.patch --]
[-- Type: text/x-patch, Size: 2891 bytes --]

From 82f3b96a68561b2de9712262cb652192b8ea9b1b Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu@suse.com>
Date: Mon, 11 May 2020 16:27:43 +0800
Subject: [PATCH] btrfs: Remove the REF_COW bit for data reloc tree

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c    | 9 ++++++++-
 fs/btrfs/inode.c      | 6 ++++--
 fs/btrfs/relocation.c | 3 ++-
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 56675d3cd23a..cb90966a8aab 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1418,9 +1418,16 @@ static int btrfs_init_fs_root(struct btrfs_root *root)
 	if (ret)
 		goto fail;
 
-	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
+	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID &&
+	    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
 		set_bit(BTRFS_ROOT_REF_COWS, &root->state);
 		btrfs_check_and_init_root_item(&root->root_item);
+	} else if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID) {
+		/*
+		 * Data reloc tree won't be snapshotted, thus it's COW only
+		 * tree, it's needed to set TRACK_DIRTY bit for it.
+		 */
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
 	}
 
 	btrfs_init_free_ino_ctl(root);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5d567082f95a..71841535c7ca 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4129,7 +4129,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 	 * extent just the way it is.
 	 */
 	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
-	    root == fs_info->tree_root)
+	    root == fs_info->tree_root ||
+	    root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
 		btrfs_drop_extent_cache(BTRFS_I(inode), ALIGN(new_size,
 					fs_info->sectorsize),
 					(u64)-1, 0);
@@ -4334,7 +4335,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 
 		if (found_extent &&
 		    (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
-		     root == fs_info->tree_root)) {
+		     root == fs_info->tree_root ||
+		     root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)) {
 			struct btrfs_ref ref = { 0 };
 
 			bytes_deleted += extent_num_bytes;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index f25deca18a5d..a85dd5d465f6 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1087,7 +1087,8 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 		 * if we are modifying block in fs tree, wait for readpage
 		 * to complete and drop the extent cache
 		 */
-		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
+		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
+		    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
 			if (first) {
 				inode = find_next_inode(root, key.objectid);
 				first = 0;
-- 
2.26.2


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-11  8:31     ` Qu Wenruo
@ 2020-05-12 13:43       ` Zygo Blaxell
  2020-05-12 14:11         ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2020-05-12 13:43 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3916 bytes --]

On Mon, May 11, 2020 at 04:31:32PM +0800, Qu Wenruo wrote:
> Hi Zygo,
> 
> Would you like to test this diff?
> 
> Although I haven't found a solid reason yet, there is another report, and
> with the help of the reporter it turns out that balance hangs at
> relocating a DATA_RELOC tree block.
>
> After some more digging, the DATA_RELOC tree doesn't need the REF_COW bit
> at all, since we can't create snapshots of the data reloc tree.
>
> By removing the REF_COW bit, we could ensure that the data reloc tree
> always gets COWed for relocation (just like the extent tree), which would
> hugely reduce the complexity for the data reloc tree.
>
> Not sure if this will help, but it passes my local balance run.

I ran it last night.  It did 30804 loops during a metadata block group
balance, and is now looping on a data block group as I write this.

> Thanks,
> Qu

> From 82f3b96a68561b2de9712262cb652192b8ea9b1b Mon Sep 17 00:00:00 2001
> From: Qu Wenruo <wqu@suse.com>
> Date: Mon, 11 May 2020 16:27:43 +0800
> Subject: [PATCH] btrfs: Remove the REF_COW bit for data reloc tree
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/disk-io.c    | 9 ++++++++-
>  fs/btrfs/inode.c      | 6 ++++--
>  fs/btrfs/relocation.c | 3 ++-
>  3 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 56675d3cd23a..cb90966a8aab 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1418,9 +1418,16 @@ static int btrfs_init_fs_root(struct btrfs_root *root)
>  	if (ret)
>  		goto fail;
>  
> -	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
> +	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID &&
> +	    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
>  		set_bit(BTRFS_ROOT_REF_COWS, &root->state);
>  		btrfs_check_and_init_root_item(&root->root_item);
> +	} else if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID) {
> +		/*
> +		 * Data reloc tree won't be snapshotted, thus it's COW only
> +		 * tree, it's needed to set TRACK_DIRTY bit for it.
> +		 */
> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
>  	}
>  
>  	btrfs_init_free_ino_ctl(root);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5d567082f95a..71841535c7ca 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4129,7 +4129,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
>  	 * extent just the way it is.
>  	 */
>  	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
> -	    root == fs_info->tree_root)
> +	    root == fs_info->tree_root ||
> +	    root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
>  		btrfs_drop_extent_cache(BTRFS_I(inode), ALIGN(new_size,
>  					fs_info->sectorsize),
>  					(u64)-1, 0);
> @@ -4334,7 +4335,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
>  
>  		if (found_extent &&
>  		    (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
> -		     root == fs_info->tree_root)) {
> +		     root == fs_info->tree_root ||
> +		     root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)) {
>  			struct btrfs_ref ref = { 0 };
>  
>  			bytes_deleted += extent_num_bytes;
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index f25deca18a5d..a85dd5d465f6 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -1087,7 +1087,8 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
>  		 * if we are modifying block in fs tree, wait for readpage
>  		 * to complete and drop the extent cache
>  		 */
> -		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
> +		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
> +		    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
>  			if (first) {
>  				inode = find_next_inode(root, key.objectid);
>  				first = 0;
> -- 
> 2.26.2
> 





[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-12 13:43       ` Zygo Blaxell
@ 2020-05-12 14:11         ` Zygo Blaxell
  2020-05-13  2:28           ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2020-05-12 14:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, May 12, 2020 at 09:43:06AM -0400, Zygo Blaxell wrote:
> On Mon, May 11, 2020 at 04:31:32PM +0800, Qu Wenruo wrote:
> > Hi Zygo,
> > 
> > Would you like to test this diff?
> > 
> > Although I haven't found a solid reason yet, there is another report, and
> > with the help of the reporter it turns out that balance hangs at
> > relocating a DATA_RELOC tree block.
> >
> > After some more digging, the DATA_RELOC tree doesn't need the REF_COW bit
> > at all, since we can't create snapshots of the data reloc tree.
> >
> > By removing the REF_COW bit, we could ensure that the data reloc tree
> > always gets COWed for relocation (just like the extent tree), which would
> > hugely reduce the complexity for the data reloc tree.
> >
> > Not sure if this will help, but it passes my local balance run.
> 
> I ran it last night.  It did 30804 loops during a metadata block group
> balance, and is now looping on a data block group as I write this.

Here's the block group that is failing, and some poking around with it:

	root@tester:~# ~/share/python-btrfs/examples/show_block_group_contents.py 4368594108416 /media/testfs/
	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
	    inline shared data backref parent 4374646833152 count 1
	extent vaddr 4368715161600 length 120168448 refs 1 gen 1318394 flags DATA
	    inline shared data backref parent 4374801383424 count 1
	extent vaddr 4368835330048 length 127623168 refs 1 gen 1318394 flags DATA
	    inline shared data backref parent 4374801383424 count 1
	extent vaddr 4368962953216 length 124964864 refs 1 gen 1318394 flags DATA
	    inline shared data backref parent 4374801383424 count 1
	extent vaddr 4369182420992 length 36700160 refs 1 gen 1321064 flags DATA
	    inline extent data backref root 257 objectid 257 offset 822607872 count 1

The extent data backref is unusual--during loops, I don't usually see those.
And...as I write this, it disappeared (it was part of the bees hash table, and
was overwritten).  Now there are 4 extents reported in the balance loop (note:
I added a loop counter to the log message):

	[Tue May 12 09:44:22 2020] BTRFS info (device dm-0): found 5 extents, loops 378, stage: update data pointers
	[Tue May 12 09:44:23 2020] BTRFS info (device dm-0): found 5 extents, loops 379, stage: update data pointers
	[Tue May 12 09:44:24 2020] BTRFS info (device dm-0): found 5 extents, loops 380, stage: update data pointers
	[Tue May 12 09:44:26 2020] BTRFS info (device dm-0): found 5 extents, loops 381, stage: update data pointers
	[Tue May 12 09:44:27 2020] BTRFS info (device dm-0): found 5 extents, loops 382, stage: update data pointers
	[Tue May 12 10:04:49 2020] BTRFS info (device dm-0): found 5 extents, loops 383, stage: update data pointers
	[Tue May 12 10:04:53 2020] BTRFS info (device dm-0): found 4 extents, loops 384, stage: update data pointers
	[Tue May 12 10:04:58 2020] BTRFS info (device dm-0): found 4 extents, loops 385, stage: update data pointers
	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 386, stage: update data pointers
	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 387, stage: update data pointers
	[Tue May 12 10:05:01 2020] BTRFS info (device dm-0): found 4 extents, loops 388, stage: update data pointers

Some of the extents that remain are confusing python-btrfs a little:

	root@tester:~# ~/share/python-btrfs/examples/show_block_group_data_extent_filenames.py 4368594108416 /media/testfs/
	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
	Traceback (most recent call last):
	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 52, in <module>
	    inodes, bytes_missed = logical_to_ino_fn(fs.fd, extent.vaddr)
	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 28, in find_out_about_v1_or_v2
	    inodes, bytes_missed = using_v2(fd, vaddr)
	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 17, in using_v2
	    inodes, bytes_missed = btrfs.ioctl.logical_to_ino_v2(fd, vaddr, ignore_offset=True)
	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 565, in logical_to_ino_v2
	    return _logical_to_ino(fd, vaddr, bufsize, ignore_offset, _v2=True)
	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 581, in _logical_to_ino
	    fcntl.ioctl(fd, IOC_LOGICAL_INO_V2, args)
	OSError: [Errno 22] Invalid argument

	root@tester:~# btrfs ins log 4368594108416 /media/testfs/
	/media/testfs//snap-1589258042/testhost/var/log/messages.6.lzma
	/media/testfs//current/testhost/var/log/messages.6.lzma
	/media/testfs//snap-1589249822/testhost/var/log/messages.6.lzma
	ERROR: ino paths ioctl: No such file or directory
	/media/testfs//snap-1589249547/testhost/var/log/messages.6.lzma
	ERROR: ino paths ioctl: No such file or directory
	/media/testfs//snap-1589248407/testhost/var/log/messages.6.lzma
	/media/testfs//snap-1589256422/testhost/var/log/messages.6.lzma
	ERROR: ino paths ioctl: No such file or directory
	/media/testfs//snap-1589251322/testhost/var/log/messages.6.lzma
	/media/testfs//snap-1589251682/testhost/var/log/messages.6.lzma
	/media/testfs//snap-1589253842/testhost/var/log/messages.6.lzma
	/media/testfs//snap-1589246727/testhost/var/log/messages.6.lzma
	/media/testfs//snap-1589258582/testhost/var/log/messages.6.lzma
	/media/testfs//snap-1589244027/testhost/var/log/messages.6.lzma
	/media/testfs//snap-1589245227/testhost/var/log/messages.6.lzma
	ERROR: ino paths ioctl: No such file or directory
	ERROR: ino paths ioctl: No such file or directory
	/media/testfs//snap-1589246127/testhost/var/log/messages.6.lzma
	/media/testfs//snap-1589247327/testhost/var/log/messages.6.lzma
	ERROR: ino paths ioctl: No such file or directory

Hmmm, I wonder if there's a problem with deleted snapshots?  I have those
nearly continuously in my test environment, which is creating and deleting
snapshots all the time.
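
For context, the snapshot churn being described is roughly of this shape
(a sketch only, reusing the subvolume names visible in the paths above,
with made-up timing and retention; it is not the actual test harness):

        # take a read-only snapshot of the live subvolume (run periodically;
        # the interval and the "keep 20" policy are made up)
        btrfs subvolume snapshot -r /media/testfs/current \
                /media/testfs/snap-$(date +%s)

        # ...and delete the oldest snapshots on another schedule, so a
        # snapshot delete is often still pending while balance runs
        for old in $(ls -d /media/testfs/snap-* | head -n -20); do
                btrfs subvolume delete "$old"
        done
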

	root@tester:~# btrfs ins log 4368594108416 -P /media/testfs/
	inode 20838190 offset 0 root 10347
	inode 20838190 offset 0 root 8013
	inode 20838190 offset 0 root 10332
	inode 20838190 offset 0 root 10330
	inode 20838190 offset 0 root 10331
	inode 20838190 offset 0 root 10328
	inode 20838190 offset 0 root 10329
	inode 20838190 offset 0 root 10343
	inode 20838190 offset 0 root 10333
	inode 20838190 offset 0 root 10334
	inode 20838190 offset 0 root 10336
	inode 20838190 offset 0 root 10338
	inode 20838190 offset 0 root 10325
	inode 20838190 offset 0 root 10349
	inode 20838190 offset 0 root 10320
	inode 20838190 offset 0 root 10321
	inode 20838190 offset 0 root 10322
	inode 20838190 offset 0 root 10323
	inode 20838190 offset 0 root 10324
	inode 20838190 offset 0 root 10326
	inode 20838190 offset 0 root 10327
	root@tester:~# btrfs sub list -d /media/testfs/
	ID 10201 gen 1321166 top level 0 path DELETED
	ID 10210 gen 1321166 top level 0 path DELETED
	ID 10230 gen 1321166 top level 0 path DELETED
	ID 10254 gen 1321166 top level 0 path DELETED
	ID 10257 gen 1321166 top level 0 path DELETED
	ID 10274 gen 1321166 top level 0 path DELETED
	ID 10281 gen 1321166 top level 0 path DELETED
	ID 10287 gen 1321166 top level 0 path DELETED
	ID 10296 gen 1321166 top level 0 path DELETED
	ID 10298 gen 1321166 top level 0 path DELETED
	ID 10299 gen 1321166 top level 0 path DELETED
	ID 10308 gen 1321166 top level 0 path DELETED
	ID 10311 gen 1321166 top level 0 path DELETED
	ID 10313 gen 1321166 top level 0 path DELETED
	ID 10315 gen 1321166 top level 0 path DELETED
	ID 10317 gen 1321166 top level 0 path DELETED
	ID 10322 gen 1321166 top level 0 path DELETED
	ID 10323 gen 1321166 top level 0 path DELETED
	ID 10327 gen 1321166 top level 0 path DELETED
	ID 10328 gen 1321166 top level 0 path DELETED
	ID 10330 gen 1321166 top level 0 path DELETED
	ID 10333 gen 1321166 top level 0 path DELETED


> > Thanks,
> > Qu
> 
> > From 82f3b96a68561b2de9712262cb652192b8ea9b1b Mon Sep 17 00:00:00 2001
> > From: Qu Wenruo <wqu@suse.com>
> > Date: Mon, 11 May 2020 16:27:43 +0800
> > Subject: [PATCH] btrfs: Remove the REF_COW bit for data reloc tree
> > 
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > ---
> >  fs/btrfs/disk-io.c    | 9 ++++++++-
> >  fs/btrfs/inode.c      | 6 ++++--
> >  fs/btrfs/relocation.c | 3 ++-
> >  3 files changed, 14 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 56675d3cd23a..cb90966a8aab 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -1418,9 +1418,16 @@ static int btrfs_init_fs_root(struct btrfs_root *root)
> >  	if (ret)
> >  		goto fail;
> >  
> > -	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
> > +	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID &&
> > +	    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >  		set_bit(BTRFS_ROOT_REF_COWS, &root->state);
> >  		btrfs_check_and_init_root_item(&root->root_item);
> > +	} else if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID) {
> > +		/*
> > +		 * Data reloc tree won't be snapshotted, thus it's COW only
> > +		 * tree, it's needed to set TRACK_DIRTY bit for it.
> > +		 */
> > +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> >  	}
> >  
> >  	btrfs_init_free_ino_ctl(root);
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 5d567082f95a..71841535c7ca 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -4129,7 +4129,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
> >  	 * extent just the way it is.
> >  	 */
> >  	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
> > -	    root == fs_info->tree_root)
> > +	    root == fs_info->tree_root ||
> > +	    root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
> >  		btrfs_drop_extent_cache(BTRFS_I(inode), ALIGN(new_size,
> >  					fs_info->sectorsize),
> >  					(u64)-1, 0);
> > @@ -4334,7 +4335,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
> >  
> >  		if (found_extent &&
> >  		    (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
> > -		     root == fs_info->tree_root)) {
> > +		     root == fs_info->tree_root ||
> > +		     root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)) {
> >  			struct btrfs_ref ref = { 0 };
> >  
> >  			bytes_deleted += extent_num_bytes;
> > diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> > index f25deca18a5d..a85dd5d465f6 100644
> > --- a/fs/btrfs/relocation.c
> > +++ b/fs/btrfs/relocation.c
> > @@ -1087,7 +1087,8 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
> >  		 * if we are modifying block in fs tree, wait for readpage
> >  		 * to complete and drop the extent cache
> >  		 */
> > -		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
> > +		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
> > +		    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >  			if (first) {
> >  				inode = find_next_inode(root, key.objectid);
> >  				first = 0;
> > -- 
> > 2.26.2
> > 
> 
> 
> 
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-12 14:11         ` Zygo Blaxell
@ 2020-05-13  2:28           ` Qu Wenruo
  2020-05-13  5:02             ` Zygo Blaxell
  2020-05-13  5:24             ` Zygo Blaxell
  0 siblings, 2 replies; 32+ messages in thread
From: Qu Wenruo @ 2020-05-13  2:28 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 12819 bytes --]



On 2020/5/12 10:11 PM, Zygo Blaxell wrote:
> On Tue, May 12, 2020 at 09:43:06AM -0400, Zygo Blaxell wrote:
>> On Mon, May 11, 2020 at 04:31:32PM +0800, Qu Wenruo wrote:
>>> Hi Zygo,
>>>
>>> Would you like to test this diff?
>>>
>>> Although I haven't found a solid reason yet, there is another report, and
>>> with the help of the reporter it turns out that balance hangs at
>>> relocating a DATA_RELOC tree block.
>>>
>>> After some more digging, the DATA_RELOC tree doesn't need the REF_COW bit
>>> at all, since we can't create snapshots of the data reloc tree.
>>>
>>> By removing the REF_COW bit, we could ensure that the data reloc tree
>>> always gets COWed for relocation (just like the extent tree), which would
>>> hugely reduce the complexity for the data reloc tree.
>>>
>>> Not sure if this will help, but it passes my local balance run.
>>
>> I ran it last night.  It did 30804 loops during a metadata block group
>> balance, and is now looping on a data block group as I write this.

OK, I'm not that surprised the patch doesn't help.
Still, the patch itself could make sense for removing the REF_COW bit
from the data reloc tree.

But I'm interested in this: after those 30804 loops, it found its way to
the next block group?!

> 
> Here's the block group that is failing, and some poking around with it:

In fact, such poking would be the most valuable part.

> 
> 	root@tester:~# ~/share/python-btrfs/examples/show_block_group_contents.py 4368594108416 /media/testfs/
> 	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
> 	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
> 	    inline shared data backref parent 4374646833152 count 1
> 	extent vaddr 4368715161600 length 120168448 refs 1 gen 1318394 flags DATA
> 	    inline shared data backref parent 4374801383424 count 1
> 	extent vaddr 4368835330048 length 127623168 refs 1 gen 1318394 flags DATA
> 	    inline shared data backref parent 4374801383424 count 1
> 	extent vaddr 4368962953216 length 124964864 refs 1 gen 1318394 flags DATA
> 	    inline shared data backref parent 4374801383424 count 1
> 	extent vaddr 4369182420992 length 36700160 refs 1 gen 1321064 flags DATA
> 	    inline extent data backref root 257 objectid 257 offset 822607872 count 1

One interesting thing is that there are 5 extents during the loop.
The first 4 look like they belong to the data reloc tree, which means they
have been swapped and are waiting to be cleaned up.

The last one belongs to root 257, and looks like it hadn't been relocated yet.

> 
> The extent data backref is unusual--during loops, I don't usually see those.
> And...as I write this, it disappeared (it was part of the bees hash table, and
> was overwritten).  Now there are 4 extents reported in the balance loop (note:
> I added a loop counter to the log message):

Then it means the last one got properly relocated.

The cleanup for the first 4 isn't happening properly.
> 
> 	[Tue May 12 09:44:22 2020] BTRFS info (device dm-0): found 5 extents, loops 378, stage: update data pointers
> 	[Tue May 12 09:44:23 2020] BTRFS info (device dm-0): found 5 extents, loops 379, stage: update data pointers
> 	[Tue May 12 09:44:24 2020] BTRFS info (device dm-0): found 5 extents, loops 380, stage: update data pointers
> 	[Tue May 12 09:44:26 2020] BTRFS info (device dm-0): found 5 extents, loops 381, stage: update data pointers
> 	[Tue May 12 09:44:27 2020] BTRFS info (device dm-0): found 5 extents, loops 382, stage: update data pointers
> 	[Tue May 12 10:04:49 2020] BTRFS info (device dm-0): found 5 extents, loops 383, stage: update data pointers
> 	[Tue May 12 10:04:53 2020] BTRFS info (device dm-0): found 4 extents, loops 384, stage: update data pointers
> 	[Tue May 12 10:04:58 2020] BTRFS info (device dm-0): found 4 extents, loops 385, stage: update data pointers
> 	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 386, stage: update data pointers
> 	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 387, stage: update data pointers
> 	[Tue May 12 10:05:01 2020] BTRFS info (device dm-0): found 4 extents, loops 388, stage: update data pointers
> 
> Some of the extents that remain are confusing python-btrfs a little:
> 
> 	root@tester:~# ~/share/python-btrfs/examples/show_block_group_data_extent_filenames.py 4368594108416 /media/testfs/
> 	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
> 	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
> 	Traceback (most recent call last):
> 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 52, in <module>
> 	    inodes, bytes_missed = logical_to_ino_fn(fs.fd, extent.vaddr)
> 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 28, in find_out_about_v1_or_v2
> 	    inodes, bytes_missed = using_v2(fd, vaddr)
> 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 17, in using_v2
> 	    inodes, bytes_missed = btrfs.ioctl.logical_to_ino_v2(fd, vaddr, ignore_offset=True)
> 	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 565, in logical_to_ino_v2
> 	    return _logical_to_ino(fd, vaddr, bufsize, ignore_offset, _v2=True)
> 	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 581, in _logical_to_ino
> 	    fcntl.ioctl(fd, IOC_LOGICAL_INO_V2, args)

I'm a little surprised that it's using the logical ino ioctl, not just
TREE_SEARCH.

I guess if we could get a plain tree-search based one (it searches only
the commit root, which is exactly what balance is based on), it would be
easier to do the digging.

> 	OSError: [Errno 22] Invalid argument
> 
> 	root@tester:~# btrfs ins log 4368594108416 /media/testfs/
> 	/media/testfs//snap-1589258042/testhost/var/log/messages.6.lzma
> 	/media/testfs//current/testhost/var/log/messages.6.lzma
> 	/media/testfs//snap-1589249822/testhost/var/log/messages.6.lzma
> 	ERROR: ino paths ioctl: No such file or directory
> 	/media/testfs//snap-1589249547/testhost/var/log/messages.6.lzma
> 	ERROR: ino paths ioctl: No such file or directory
> 	/media/testfs//snap-1589248407/testhost/var/log/messages.6.lzma
> 	/media/testfs//snap-1589256422/testhost/var/log/messages.6.lzma
> 	ERROR: ino paths ioctl: No such file or directory
> 	/media/testfs//snap-1589251322/testhost/var/log/messages.6.lzma
> 	/media/testfs//snap-1589251682/testhost/var/log/messages.6.lzma
> 	/media/testfs//snap-1589253842/testhost/var/log/messages.6.lzma
> 	/media/testfs//snap-1589246727/testhost/var/log/messages.6.lzma
> 	/media/testfs//snap-1589258582/testhost/var/log/messages.6.lzma
> 	/media/testfs//snap-1589244027/testhost/var/log/messages.6.lzma
> 	/media/testfs//snap-1589245227/testhost/var/log/messages.6.lzma
> 	ERROR: ino paths ioctl: No such file or directory
> 	ERROR: ino paths ioctl: No such file or directory
> 	/media/testfs//snap-1589246127/testhost/var/log/messages.6.lzma
> 	/media/testfs//snap-1589247327/testhost/var/log/messages.6.lzma
> 	ERROR: ino paths ioctl: No such file or directory
> 
> Hmmm, I wonder if there's a problem with deleted snapshots?

Yes, that's also what I'm guessing.

The cleanup of the data reloc tree doesn't look correct to me.

Thanks for the new clues,
Qu

>  I have those
> nearly continuously in my test environment, which is creating and deleting
> snapshots all the time.
> 
> 	root@tester:~# btrfs ins log 4368594108416 -P /media/testfs/
> 	inode 20838190 offset 0 root 10347
> 	inode 20838190 offset 0 root 8013
> 	inode 20838190 offset 0 root 10332
> 	inode 20838190 offset 0 root 10330
> 	inode 20838190 offset 0 root 10331
> 	inode 20838190 offset 0 root 10328
> 	inode 20838190 offset 0 root 10329
> 	inode 20838190 offset 0 root 10343
> 	inode 20838190 offset 0 root 10333
> 	inode 20838190 offset 0 root 10334
> 	inode 20838190 offset 0 root 10336
> 	inode 20838190 offset 0 root 10338
> 	inode 20838190 offset 0 root 10325
> 	inode 20838190 offset 0 root 10349
> 	inode 20838190 offset 0 root 10320
> 	inode 20838190 offset 0 root 10321
> 	inode 20838190 offset 0 root 10322
> 	inode 20838190 offset 0 root 10323
> 	inode 20838190 offset 0 root 10324
> 	inode 20838190 offset 0 root 10326
> 	inode 20838190 offset 0 root 10327
> 	root@tester:~# btrfs sub list -d /media/testfs/
> 	ID 10201 gen 1321166 top level 0 path DELETED
> 	ID 10210 gen 1321166 top level 0 path DELETED
> 	ID 10230 gen 1321166 top level 0 path DELETED
> 	ID 10254 gen 1321166 top level 0 path DELETED
> 	ID 10257 gen 1321166 top level 0 path DELETED
> 	ID 10274 gen 1321166 top level 0 path DELETED
> 	ID 10281 gen 1321166 top level 0 path DELETED
> 	ID 10287 gen 1321166 top level 0 path DELETED
> 	ID 10296 gen 1321166 top level 0 path DELETED
> 	ID 10298 gen 1321166 top level 0 path DELETED
> 	ID 10299 gen 1321166 top level 0 path DELETED
> 	ID 10308 gen 1321166 top level 0 path DELETED
> 	ID 10311 gen 1321166 top level 0 path DELETED
> 	ID 10313 gen 1321166 top level 0 path DELETED
> 	ID 10315 gen 1321166 top level 0 path DELETED
> 	ID 10317 gen 1321166 top level 0 path DELETED
> 	ID 10322 gen 1321166 top level 0 path DELETED
> 	ID 10323 gen 1321166 top level 0 path DELETED
> 	ID 10327 gen 1321166 top level 0 path DELETED
> 	ID 10328 gen 1321166 top level 0 path DELETED
> 	ID 10330 gen 1321166 top level 0 path DELETED
> 	ID 10333 gen 1321166 top level 0 path DELETED
> 
> 
>>> Thanks,
>>> Qu
>>
>>> From 82f3b96a68561b2de9712262cb652192b8ea9b1b Mon Sep 17 00:00:00 2001
>>> From: Qu Wenruo <wqu@suse.com>
>>> Date: Mon, 11 May 2020 16:27:43 +0800
>>> Subject: [PATCH] btrfs: Remove the REF_COW bit for data reloc tree
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>  fs/btrfs/disk-io.c    | 9 ++++++++-
>>>  fs/btrfs/inode.c      | 6 ++++--
>>>  fs/btrfs/relocation.c | 3 ++-
>>>  3 files changed, 14 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 56675d3cd23a..cb90966a8aab 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -1418,9 +1418,16 @@ static int btrfs_init_fs_root(struct btrfs_root *root)
>>>  	if (ret)
>>>  		goto fail;
>>>  
>>> -	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
>>> +	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID &&
>>> +	    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
>>>  		set_bit(BTRFS_ROOT_REF_COWS, &root->state);
>>>  		btrfs_check_and_init_root_item(&root->root_item);
>>> +	} else if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID) {
>>> +		/*
>>> +		 * Data reloc tree won't be snapshotted, thus it's COW only
>>> +		 * tree, it's needed to set TRACK_DIRTY bit for it.
>>> +		 */
>>> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
>>>  	}
>>>  
>>>  	btrfs_init_free_ino_ctl(root);
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 5d567082f95a..71841535c7ca 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -4129,7 +4129,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
>>>  	 * extent just the way it is.
>>>  	 */
>>>  	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
>>> -	    root == fs_info->tree_root)
>>> +	    root == fs_info->tree_root ||
>>> +	    root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
>>>  		btrfs_drop_extent_cache(BTRFS_I(inode), ALIGN(new_size,
>>>  					fs_info->sectorsize),
>>>  					(u64)-1, 0);
>>> @@ -4334,7 +4335,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
>>>  
>>>  		if (found_extent &&
>>>  		    (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
>>> -		     root == fs_info->tree_root)) {
>>> +		     root == fs_info->tree_root ||
>>> +		     root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)) {
>>>  			struct btrfs_ref ref = { 0 };
>>>  
>>>  			bytes_deleted += extent_num_bytes;
>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>> index f25deca18a5d..a85dd5d465f6 100644
>>> --- a/fs/btrfs/relocation.c
>>> +++ b/fs/btrfs/relocation.c
>>> @@ -1087,7 +1087,8 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
>>>  		 * if we are modifying block in fs tree, wait for readpage
>>>  		 * to complete and drop the extent cache
>>>  		 */
>>> -		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
>>> +		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
>>> +		    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
>>>  			if (first) {
>>>  				inode = find_next_inode(root, key.objectid);
>>>  				first = 0;
>>> -- 
>>> 2.26.2
>>>
>>
>>
>>
>>
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-13  2:28           ` Qu Wenruo
@ 2020-05-13  5:02             ` Zygo Blaxell
  2020-05-13  6:36               ` Qu Wenruo
  2020-05-13  5:24             ` Zygo Blaxell
  1 sibling, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2020-05-13  5:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, May 13, 2020 at 10:28:37AM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/5/12 10:11 PM, Zygo Blaxell wrote:
> > On Tue, May 12, 2020 at 09:43:06AM -0400, Zygo Blaxell wrote:
> >> On Mon, May 11, 2020 at 04:31:32PM +0800, Qu Wenruo wrote:
> >>> Hi Zygo,
> >>>
> >>> Would you like to test this diff?
> >>>
> >>> Although I haven't found a solid reason yet, there is another report and
> >>> with the help from the reporter, it turns out that balance hangs at
> >>> relocating DATA_RELOC tree block.
> >>>
> >>> After some more digging, DATA_RELOC tree doesn't need REF_COW bit at all
> >>> since we can't create snapshot for data reloc tree.
> >>>
> >>> By removing the REF_COW bit, we could ensure that data reloc tree always
> >>> get cowed for relocation (just like extent tree), this would hugely
> >>> reduce the complexity for data reloc tree.
> >>>
> >>> Not sure if this would help, but it passes my local balance run.
> >>
> >> I ran it last night.  It did 30804 loops during a metadata block group
> >> balance, and is now looping on a data block group as I write this.
> 
> OK, not that surprised the patch doesn't help.
> But still, the patch itself could still make sense for removing the
> REFCOW bit for data reloc tree.
> 
> But I'm interested in how, after those 30804 loops, it found its way to the
> next block group?!

No, it got cancelled.  My test scripts start a balance every (hour +/-
random(1 hour)) and cancel a balance on a similar schedule.
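
For reference, a minimal sketch of that kind of start/cancel exercise
(the mountpoint, timings and filters below are placeholders, not my
actual test scripts):

#!/usr/bin/env python3
# Run one copy with "start" and another with "cancel"; each sleeps
# roughly an hour (+/- up to an hour) between btrfs balance
# invocations, so cancels land at arbitrary points of a balance.
import random, subprocess, sys, time

MNT = '/media/testfs'              # placeholder mountpoint

def jitter():
    return 3600 + random.uniform(-3600, 3600)

def loop(action):
    while True:
        time.sleep(jitter())
        cmd = ['btrfs', 'balance', action, MNT]
        if action == 'start':
            cmd.insert(3, '-d')    # data block groups only
        subprocess.run(cmd)

if __name__ == '__main__':
    loop(sys.argv[1])              # "start" or "cancel"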

> > Here's the block group that is failing, and some poking around with it:
> 
> In fact, such poking would be the most valuable part.
> 
> > 
> > 	root@tester:~# ~/share/python-btrfs/examples/show_block_group_contents.py 4368594108416 /media/testfs/
> > 	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
> > 	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
> > 	    inline shared data backref parent 4374646833152 count 1
> > 	extent vaddr 4368715161600 length 120168448 refs 1 gen 1318394 flags DATA
> > 	    inline shared data backref parent 4374801383424 count 1
> > 	extent vaddr 4368835330048 length 127623168 refs 1 gen 1318394 flags DATA
> > 	    inline shared data backref parent 4374801383424 count 1
> > 	extent vaddr 4368962953216 length 124964864 refs 1 gen 1318394 flags DATA
> > 	    inline shared data backref parent 4374801383424 count 1
> > 	extent vaddr 4369182420992 length 36700160 refs 1 gen 1321064 flags DATA
> > 	    inline extent data backref root 257 objectid 257 offset 822607872 count 1
> 
> One interesting thing is, there are 5 extents during the loop.
> The first 4 look like they belong to the data reloc tree, which means they
> have been swapped, waiting to be cleaned up.
> 
> The last one belongs to root 257, and looks like it hadn't been relocated.
> 
> > 
> > The extent data backref is unusual--during loops, I don't usually see those.
> > And...as I write this, it disappeared (it was part of the bees hash table, and
> > was overwritten).  Now there are 4 extents reported in the balance loop (note:
> > I added a loop counter to the log message):
> 
> Then it means the last one got properly relocated.
> 
> The cleanup for the first 4 doesn't happen properly.
> > 
> > 	[Tue May 12 09:44:22 2020] BTRFS info (device dm-0): found 5 extents, loops 378, stage: update data pointers
> > 	[Tue May 12 09:44:23 2020] BTRFS info (device dm-0): found 5 extents, loops 379, stage: update data pointers
> > 	[Tue May 12 09:44:24 2020] BTRFS info (device dm-0): found 5 extents, loops 380, stage: update data pointers
> > 	[Tue May 12 09:44:26 2020] BTRFS info (device dm-0): found 5 extents, loops 381, stage: update data pointers
> > 	[Tue May 12 09:44:27 2020] BTRFS info (device dm-0): found 5 extents, loops 382, stage: update data pointers
> > 	[Tue May 12 10:04:49 2020] BTRFS info (device dm-0): found 5 extents, loops 383, stage: update data pointers
> > 	[Tue May 12 10:04:53 2020] BTRFS info (device dm-0): found 4 extents, loops 384, stage: update data pointers
> > 	[Tue May 12 10:04:58 2020] BTRFS info (device dm-0): found 4 extents, loops 385, stage: update data pointers
> > 	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 386, stage: update data pointers
> > 	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 387, stage: update data pointers
> > 	[Tue May 12 10:05:01 2020] BTRFS info (device dm-0): found 4 extents, loops 388, stage: update data pointers
> > 
> > Some of the extents that remain are confusing python-btrfs a little:
> > 
> > 	root@tester:~# ~/share/python-btrfs/examples/show_block_group_data_extent_filenames.py 4368594108416 /media/testfs/
> > 	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
> > 	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
> > 	Traceback (most recent call last):
> > 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 52, in <module>
> > 	    inodes, bytes_missed = logical_to_ino_fn(fs.fd, extent.vaddr)
> > 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 28, in find_out_about_v1_or_v2
> > 	    inodes, bytes_missed = using_v2(fd, vaddr)
> > 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 17, in using_v2
> > 	    inodes, bytes_missed = btrfs.ioctl.logical_to_ino_v2(fd, vaddr, ignore_offset=True)
> > 	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 565, in logical_to_ino_v2
> > 	    return _logical_to_ino(fd, vaddr, bufsize, ignore_offset, _v2=True)
> > 	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 581, in _logical_to_ino
> > 	    fcntl.ioctl(fd, IOC_LOGICAL_INO_V2, args)
> 
> I'm a little surprised that it's using the logical ino ioctl, not just
> TREE_SEARCH.

Tree search can't read shared backrefs because they refer directly to
disk blocks, not to object/type/offset tuples.  It would be nice to have
an ioctl that can read a metadata block (or even a data block) by bytenr.

Or even better, just a fd that can be obtained by some ioctl to access
the btrfs virtual address space with pread().

> I guess if we could get a plain tree search based one (it only search
> commit root, which is exactly balance based on), it would be easier to
> do the digging.

That would be nice.  I have an application for it.  ;)

> > 	OSError: [Errno 22] Invalid argument
> > 
> > 	root@tester:~# btrfs ins log 4368594108416 /media/testfs/
> > 	/media/testfs//snap-1589258042/testhost/var/log/messages.6.lzma
> > 	/media/testfs//current/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589249822/testhost/var/log/messages.6.lzma
> > 	ERROR: ino paths ioctl: No such file or directory
> > 	/media/testfs//snap-1589249547/testhost/var/log/messages.6.lzma
> > 	ERROR: ino paths ioctl: No such file or directory
> > 	/media/testfs//snap-1589248407/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589256422/testhost/var/log/messages.6.lzma
> > 	ERROR: ino paths ioctl: No such file or directory
> > 	/media/testfs//snap-1589251322/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589251682/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589253842/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589246727/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589258582/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589244027/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589245227/testhost/var/log/messages.6.lzma
> > 	ERROR: ino paths ioctl: No such file or directory
> > 	ERROR: ino paths ioctl: No such file or directory
> > 	/media/testfs//snap-1589246127/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589247327/testhost/var/log/messages.6.lzma
> > 	ERROR: ino paths ioctl: No such file or directory
> > 
> > Hmmm, I wonder if there's a problem with deleted snapshots?
> 
> Yes, also what I'm guessing.
> 
> The cleanup of data reloc tree doesn't look correct to me.
> 
> Thanks for the new clues,
> Qu

Here's a fun one:

1.  Delete all the files on a filesystem where balance loops
have occurred.

2.  Verify there are no data blocks (one data block group
with used = 0):

# show_block_groups.py /testfs/
block group vaddr 435969589248 length 1073741824 flags METADATA|RAID1 used 180224 used_pct 0
block group vaddr 4382686969856 length 33554432 flags SYSTEM|RAID1 used 16384 used_pct 0
block group vaddr 4383794266112 length 1073741824 flags DATA used 0 used_pct 0

3.  Create a new file with a single reference in the only (root) subvol:
# head -c 1024m > file
# sync
# show_block_groups.py .
block group vaddr 435969589248 length 1073741824 flags METADATA|RAID1 used 1245184 used_pct 0
block group vaddr 4382686969856 length 33554432 flags SYSTEM|RAID1 used 16384 used_pct 0
block group vaddr 4384868007936 length 1073741824 flags DATA used 961708032 used_pct 90
block group vaddr 4385941749760 length 1073741824 flags DATA used 112033792 used_pct 10

4.  Run balance, and it immediately loops on a single extent:
# btrfs balance start -d .
[Wed May 13 00:41:58 2020] BTRFS info (device dm-0): balance: start -d
[Wed May 13 00:41:58 2020] BTRFS info (device dm-0): relocating block group 4385941749760 flags data
[Wed May 13 00:42:00 2020] BTRFS info (device dm-0): found 1 extents, loops 1, stage: move data extents
[Wed May 13 00:42:00 2020] BTRFS info (device dm-0): found 1 extents, loops 2, stage: update data pointers
[Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 3, stage: update data pointers
[Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 4, stage: update data pointers
[Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 5, stage: update data pointers
[Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 6, stage: update data pointers
[Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 7, stage: update data pointers
[Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 8, stage: update data pointers
[Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 9, stage: update data pointers
[Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 10, stage: update data pointers
[Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 11, stage: update data pointers
[etc...]

I tried it 3 more times and there was no loop.  The 5th try looped again.

There might be a correlation with cancels.  After a fresh boot, I can
often balance a few dozen block groups before there's a loop, but if I
cancel a balance, the next balance almost always loops.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-13  2:28           ` Qu Wenruo
  2020-05-13  5:02             ` Zygo Blaxell
@ 2020-05-13  5:24             ` Zygo Blaxell
  2020-05-13 11:23               ` Qu Wenruo
  1 sibling, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2020-05-13  5:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, May 13, 2020 at 10:28:37AM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/5/12 10:11 PM, Zygo Blaxell wrote:
> > On Tue, May 12, 2020 at 09:43:06AM -0400, Zygo Blaxell wrote:
> >> On Mon, May 11, 2020 at 04:31:32PM +0800, Qu Wenruo wrote:
> >>> Hi Zygo,
> >>>
> >>> Would you like to test this diff?
> >>>
> >>> Although I haven't found a solid reason yet, there is another report and
> >>> with the help from the reporter, it turns out that balance hangs at
> >>> relocating DATA_RELOC tree block.
> >>>
> >>> After some more digging, DATA_RELOC tree doesn't need REF_COW bit at all
> >>> since we can't create snapshot for data reloc tree.
> >>>
> >>> By removing the REF_COW bit, we could ensure that data reloc tree always
> >>> get cowed for relocation (just like extent tree), this would hugely
> >>> reduce the complexity for data reloc tree.
> >>>
> >>> Not sure if this would help, but it passes my local balance run.
> >>
> >> I ran it last night.  It did 30804 loops during a metadata block group
> >> balance, and is now looping on a data block group as I write this.
> 
> OK, not that surprised the patch doesn't help.
> But still, the patch itself could still make sense for removing the
> REFCOW bit for data reloc tree.
> 
> But I'm interested in how, after those 30804 loops, it found its way to the
> next block group?!
> 
> > 
> > Here's the block group that is failing, and some poking around with it:
> 
> In fact, such poking would be the most valuable part.
> 
> > 
> > 	root@tester:~# ~/share/python-btrfs/examples/show_block_group_contents.py 4368594108416 /media/testfs/
> > 	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
> > 	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
> > 	    inline shared data backref parent 4374646833152 count 1
> > 	extent vaddr 4368715161600 length 120168448 refs 1 gen 1318394 flags DATA
> > 	    inline shared data backref parent 4374801383424 count 1
> > 	extent vaddr 4368835330048 length 127623168 refs 1 gen 1318394 flags DATA
> > 	    inline shared data backref parent 4374801383424 count 1
> > 	extent vaddr 4368962953216 length 124964864 refs 1 gen 1318394 flags DATA
> > 	    inline shared data backref parent 4374801383424 count 1
> > 	extent vaddr 4369182420992 length 36700160 refs 1 gen 1321064 flags DATA
> > 	    inline extent data backref root 257 objectid 257 offset 822607872 count 1
> 
> One interesting thing is, there are 5 extents during the loop.
>> The first 4 look like they belong to the data reloc tree, which means they
> have been swapped, waiting to be cleaned up.
> 
> The last one belongs to root 257, and looks like it hadn't been relocated.
> 
> > 
> > The extent data backref is unusual--during loops, I don't usually see those.
> > And...as I write this, it disappeared (it was part of the bees hash table, and
> > was overwritten).  Now there are 4 extents reported in the balance loop (note:
> > I added a loop counter to the log message):
> 
>> Then it means the last one got properly relocated.

No.  In these cases the extent is removed by other filesystem activity.
If balance gets stuck looping, it can never break out of a loop if it
is the only writer on the filesystem.  I've left it looping for days,
it makes no progress.

When balance is looping, it is always stuck waiting until the extents
are removed by something else.  In this particular case, the extent was
overwritten by another process removing the extent's last reference so
it was no longer part of the block group any more.  It is possible to
break a balance loop by simply deleting all the files with extents in
the block group.

I wrote some scripts that dump out the extents in the looping block
group, find the files they belong to, and run the defrag ioctl on them,
thereby removing all the extents in the block group so the balance loop
will end, without deleting the data.  I used the script for a while and
was able to balance hundreds of block groups more than I would have been
able to without the script; however, the script couldn't run defrag on
extents that were not reachable through open() (e.g. extents referenced
by a deleted snapshot), so it couldn't work around the balance loops in
all cases.
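
The gist of that workaround, as a stripped-down sketch (this is not the
actual script: enumerating the extents in the looping block group is
left to something like show_block_group_contents.py, and all error
handling is omitted):

#!/usr/bin/env python3
# Given a mountpoint and the extent vaddrs of a looping block group,
# resolve each extent to the files that still reference it
# (LOGICAL_INO via the btrfs CLI) and defragment those files, so the
# extents get rewritten outside the stuck block group.  References
# that exist only in deleted snapshots resolve to no path, which is
# exactly the case this workaround cannot fix.
import subprocess, sys

def files_for_extent(mnt, vaddr):
    out = subprocess.run(['btrfs', 'inspect-internal',
                          'logical-resolve', str(vaddr), mnt],
                         capture_output=True, text=True)
    return {line for line in out.stdout.splitlines()
            if line.startswith('/')}

def defrag_looping_extents(mnt, vaddrs):
    for vaddr in vaddrs:
        for path in files_for_extent(mnt, vaddr):
            subprocess.run(['btrfs', 'filesystem', 'defragment', path])

if __name__ == '__main__':
    # usage: defrag_workaround.py <mountpoint> <extent vaddr>...
    defrag_looping_extents(sys.argv[1], sys.argv[2:])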

> The cleanup for the first 4 doesn't happen properly.
> > 
> > 	[Tue May 12 09:44:22 2020] BTRFS info (device dm-0): found 5 extents, loops 378, stage: update data pointers
> > 	[Tue May 12 09:44:23 2020] BTRFS info (device dm-0): found 5 extents, loops 379, stage: update data pointers
> > 	[Tue May 12 09:44:24 2020] BTRFS info (device dm-0): found 5 extents, loops 380, stage: update data pointers
> > 	[Tue May 12 09:44:26 2020] BTRFS info (device dm-0): found 5 extents, loops 381, stage: update data pointers
> > 	[Tue May 12 09:44:27 2020] BTRFS info (device dm-0): found 5 extents, loops 382, stage: update data pointers
> > 	[Tue May 12 10:04:49 2020] BTRFS info (device dm-0): found 5 extents, loops 383, stage: update data pointers
> > 	[Tue May 12 10:04:53 2020] BTRFS info (device dm-0): found 4 extents, loops 384, stage: update data pointers
> > 	[Tue May 12 10:04:58 2020] BTRFS info (device dm-0): found 4 extents, loops 385, stage: update data pointers
> > 	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 386, stage: update data pointers
> > 	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 387, stage: update data pointers
> > 	[Tue May 12 10:05:01 2020] BTRFS info (device dm-0): found 4 extents, loops 388, stage: update data pointers
> > 
> > Some of the extents that remain are confusing python-btrfs a little:
> > 
> > 	root@tester:~# ~/share/python-btrfs/examples/show_block_group_data_extent_filenames.py 4368594108416 /media/testfs/
> > 	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
> > 	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
> > 	Traceback (most recent call last):
> > 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 52, in <module>
> > 	    inodes, bytes_missed = logical_to_ino_fn(fs.fd, extent.vaddr)
> > 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 28, in find_out_about_v1_or_v2
> > 	    inodes, bytes_missed = using_v2(fd, vaddr)
> > 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 17, in using_v2
> > 	    inodes, bytes_missed = btrfs.ioctl.logical_to_ino_v2(fd, vaddr, ignore_offset=True)
> > 	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 565, in logical_to_ino_v2
> > 	    return _logical_to_ino(fd, vaddr, bufsize, ignore_offset, _v2=True)
> > 	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 581, in _logical_to_ino
> > 	    fcntl.ioctl(fd, IOC_LOGICAL_INO_V2, args)
> 
> I'm a little surprised that it's using the logical ino ioctl, not just
> TREE_SEARCH.
> 
> I guess if we could get a plain tree search based one (it only search
> commit root, which is exactly balance based on), it would be easier to
> do the digging.
> 
> > 	OSError: [Errno 22] Invalid argument
> > 
> > 	root@tester:~# btrfs ins log 4368594108416 /media/testfs/
> > 	/media/testfs//snap-1589258042/testhost/var/log/messages.6.lzma
> > 	/media/testfs//current/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589249822/testhost/var/log/messages.6.lzma
> > 	ERROR: ino paths ioctl: No such file or directory
> > 	/media/testfs//snap-1589249547/testhost/var/log/messages.6.lzma
> > 	ERROR: ino paths ioctl: No such file or directory
> > 	/media/testfs//snap-1589248407/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589256422/testhost/var/log/messages.6.lzma
> > 	ERROR: ino paths ioctl: No such file or directory
> > 	/media/testfs//snap-1589251322/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589251682/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589253842/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589246727/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589258582/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589244027/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589245227/testhost/var/log/messages.6.lzma
> > 	ERROR: ino paths ioctl: No such file or directory
> > 	ERROR: ino paths ioctl: No such file or directory
> > 	/media/testfs//snap-1589246127/testhost/var/log/messages.6.lzma
> > 	/media/testfs//snap-1589247327/testhost/var/log/messages.6.lzma
> > 	ERROR: ino paths ioctl: No such file or directory
> > 
> > Hmmm, I wonder if there's a problem with deleted snapshots?
> 
> Yes, also what I'm guessing.
> 
> The cleanup of data reloc tree doesn't look correct to me.
> 
> Thanks for the new clues,
> Qu
> 
> >  I have those
> > nearly continuously in my test environment, which is creating and deleting
> > snapshots all the time.
> > 
> > 	root@tester:~# btrfs ins log 4368594108416 -P /media/testfs/
> > 	inode 20838190 offset 0 root 10347
> > 	inode 20838190 offset 0 root 8013
> > 	inode 20838190 offset 0 root 10332
> > 	inode 20838190 offset 0 root 10330
> > 	inode 20838190 offset 0 root 10331
> > 	inode 20838190 offset 0 root 10328
> > 	inode 20838190 offset 0 root 10329
> > 	inode 20838190 offset 0 root 10343
> > 	inode 20838190 offset 0 root 10333
> > 	inode 20838190 offset 0 root 10334
> > 	inode 20838190 offset 0 root 10336
> > 	inode 20838190 offset 0 root 10338
> > 	inode 20838190 offset 0 root 10325
> > 	inode 20838190 offset 0 root 10349
> > 	inode 20838190 offset 0 root 10320
> > 	inode 20838190 offset 0 root 10321
> > 	inode 20838190 offset 0 root 10322
> > 	inode 20838190 offset 0 root 10323
> > 	inode 20838190 offset 0 root 10324
> > 	inode 20838190 offset 0 root 10326
> > 	inode 20838190 offset 0 root 10327
> > 	root@tester:~# btrfs sub list -d /media/testfs/
> > 	ID 10201 gen 1321166 top level 0 path DELETED
> > 	ID 10210 gen 1321166 top level 0 path DELETED
> > 	ID 10230 gen 1321166 top level 0 path DELETED
> > 	ID 10254 gen 1321166 top level 0 path DELETED
> > 	ID 10257 gen 1321166 top level 0 path DELETED
> > 	ID 10274 gen 1321166 top level 0 path DELETED
> > 	ID 10281 gen 1321166 top level 0 path DELETED
> > 	ID 10287 gen 1321166 top level 0 path DELETED
> > 	ID 10296 gen 1321166 top level 0 path DELETED
> > 	ID 10298 gen 1321166 top level 0 path DELETED
> > 	ID 10299 gen 1321166 top level 0 path DELETED
> > 	ID 10308 gen 1321166 top level 0 path DELETED
> > 	ID 10311 gen 1321166 top level 0 path DELETED
> > 	ID 10313 gen 1321166 top level 0 path DELETED
> > 	ID 10315 gen 1321166 top level 0 path DELETED
> > 	ID 10317 gen 1321166 top level 0 path DELETED
> > 	ID 10322 gen 1321166 top level 0 path DELETED
> > 	ID 10323 gen 1321166 top level 0 path DELETED
> > 	ID 10327 gen 1321166 top level 0 path DELETED
> > 	ID 10328 gen 1321166 top level 0 path DELETED
> > 	ID 10330 gen 1321166 top level 0 path DELETED
> > 	ID 10333 gen 1321166 top level 0 path DELETED
> > 
> > 
> >>> Thanks,
> >>> Qu
> >>
> >>> From 82f3b96a68561b2de9712262cb652192b8ea9b1b Mon Sep 17 00:00:00 2001
> >>> From: Qu Wenruo <wqu@suse.com>
> >>> Date: Mon, 11 May 2020 16:27:43 +0800
> >>> Subject: [PATCH] btrfs: Remove the REF_COW bit for data reloc tree
> >>>
> >>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >>> ---
> >>>  fs/btrfs/disk-io.c    | 9 ++++++++-
> >>>  fs/btrfs/inode.c      | 6 ++++--
> >>>  fs/btrfs/relocation.c | 3 ++-
> >>>  3 files changed, 14 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >>> index 56675d3cd23a..cb90966a8aab 100644
> >>> --- a/fs/btrfs/disk-io.c
> >>> +++ b/fs/btrfs/disk-io.c
> >>> @@ -1418,9 +1418,16 @@ static int btrfs_init_fs_root(struct btrfs_root *root)
> >>>  	if (ret)
> >>>  		goto fail;
> >>>  
> >>> -	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
> >>> +	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID &&
> >>> +	    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >>>  		set_bit(BTRFS_ROOT_REF_COWS, &root->state);
> >>>  		btrfs_check_and_init_root_item(&root->root_item);
> >>> +	} else if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >>> +		/*
> >>> +		 * Data reloc tree won't be snapshotted, thus it's COW only
> >>> +		 * tree, it's needed to set TRACK_DIRTY bit for it.
> >>> +		 */
> >>> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> >>>  	}
> >>>  
> >>>  	btrfs_init_free_ino_ctl(root);
> >>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >>> index 5d567082f95a..71841535c7ca 100644
> >>> --- a/fs/btrfs/inode.c
> >>> +++ b/fs/btrfs/inode.c
> >>> @@ -4129,7 +4129,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
> >>>  	 * extent just the way it is.
> >>>  	 */
> >>>  	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
> >>> -	    root == fs_info->tree_root)
> >>> +	    root == fs_info->tree_root ||
> >>> +	    root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
> >>>  		btrfs_drop_extent_cache(BTRFS_I(inode), ALIGN(new_size,
> >>>  					fs_info->sectorsize),
> >>>  					(u64)-1, 0);
> >>> @@ -4334,7 +4335,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
> >>>  
> >>>  		if (found_extent &&
> >>>  		    (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
> >>> -		     root == fs_info->tree_root)) {
> >>> +		     root == fs_info->tree_root ||
> >>> +		     root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)) {
> >>>  			struct btrfs_ref ref = { 0 };
> >>>  
> >>>  			bytes_deleted += extent_num_bytes;
> >>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> >>> index f25deca18a5d..a85dd5d465f6 100644
> >>> --- a/fs/btrfs/relocation.c
> >>> +++ b/fs/btrfs/relocation.c
> >>> @@ -1087,7 +1087,8 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
> >>>  		 * if we are modifying block in fs tree, wait for readpage
> >>>  		 * to complete and drop the extent cache
> >>>  		 */
> >>> -		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
> >>> +		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
> >>> +		    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >>>  			if (first) {
> >>>  				inode = find_next_inode(root, key.objectid);
> >>>  				first = 0;
> >>> -- 
> >>> 2.26.2
> >>>
> >>
> >>
> >>
> >>
> > 
> > 
> 




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-13  5:02             ` Zygo Blaxell
@ 2020-05-13  6:36               ` Qu Wenruo
  0 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2020-05-13  6:36 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 6388 bytes --]



On 2020/5/13 1:02 PM, Zygo Blaxell wrote:
> On Wed, May 13, 2020 at 10:28:37AM +0800, Qu Wenruo wrote:
>>
>>
[...]
>> I'm a little surprised that it's using the logical ino ioctl, not just
>> TREE_SEARCH.
> 
> Tree search can't read shared backrefs because they refer directly to
> disk blocks, not to object/type/offset tuples.  It would be nice to have
> an ioctl that can read a metadata block (or even a data block) by bytenr.

Sorry, I meant we can use tree search on the extent tree, and only
look up the backrefs for the specified bytenr to inspect them.

In such a hanging case, what I really want is to check whether tree block
4374646833152 belongs to the data reloc tree.
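
For illustration, a bare-bones query of that kind might look like the
sketch below: it asks TREE_SEARCH for everything the extent tree records
under one bytenr (the EXTENT_ITEM or METADATA_ITEM plus any keyed
backref items), which works for a data extent as well as for a tree
block like 4374646833152.  The ioctl number and struct layouts are
hand-copied from the UAPI headers, and the stock ioctl walks the live
root rather than the commit root that balance works from, so treat this
as an assumption-laden sketch rather than a finished tool:

#!/usr/bin/env python3
# Sketch: list the extent-tree items recorded for a single bytenr via
# BTRFS_IOC_TREE_SEARCH; decoding the inline backref payload is left
# out (show_block_group_contents.py already does that part).
import fcntl, os, struct, sys

BTRFS_IOC_TREE_SEARCH = 0xD0009411  # _IOWR(0x94, 17, 4096-byte args)
EXTENT_TREE_OBJECTID = 2
EXTENT_ITEM_KEY = 168               # first key type of interest
SHARED_DATA_REF_KEY = 184           # last key type of interest

# struct btrfs_ioctl_search_key: tree_id, objectid/offset/transid
# min/max, type min/max, nr_items, then unused fields.
SK = struct.Struct('=7Q4L4Q')
# struct btrfs_ioctl_search_header: transid, objectid, offset, type, len
SH = struct.Struct('=3Q2L')

def dump_extent_items(fd, bytenr):
    args = bytearray(4096)
    SK.pack_into(args, 0,
                 EXTENT_TREE_OBJECTID,
                 bytenr, bytenr,             # objectid == bytenr
                 0, 2**64 - 1,               # any offset
                 0, 2**64 - 1,               # any transid
                 EXTENT_ITEM_KEY, SHARED_DATA_REF_KEY,
                 512,                        # nr_items limit
                 0, 0, 0, 0, 0)              # unused fields
    fcntl.ioctl(fd, BTRFS_IOC_TREE_SEARCH, args)
    nr_items = SK.unpack_from(args)[9]
    pos = SK.size
    for _ in range(nr_items):
        hdr = SH.unpack_from(args, pos)      # transid, objectid, offset, type, len
        pos += SH.size
        print('key (%d %d %d) gen %d, %d bytes of item data'
              % (hdr[1], hdr[3], hdr[2], hdr[0], hdr[4]))
        pos += hdr[4]                        # inline backrefs live in the skipped bytes

if __name__ == '__main__':
    # usage: dump_extent_items.py <bytenr> <mountpoint>
    fd = os.open(sys.argv[2], os.O_RDONLY)
    dump_extent_items(fd, int(sys.argv[1]))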

> 
> Or even better, just a fd that can be obtained by some ioctl to access
> the btrfs virtual address space with pread().

If we want the content of 4374646833152, we can simply run btrfs ins
dump-tree -b 4374646833152.

As balance works on the commit root, that tree block should still be on
disk at the time of the loop, and dump-tree can read it.


Another way to work around this is to provide a full extent tree dump.
(I know this is a bad idea, but it would still be faster than going back
and forth over mail.)

> 
>> I guess if we could get a plain tree search based one (it only search
>> commit root, which is exactly balance based on), it would be easier to
>> do the digging.
> 
> That would be nice.  I have an application for it.  ;)
> 
>>> 	OSError: [Errno 22] Invalid argument
>>>
>>> 	root@tester:~# btrfs ins log 4368594108416 /media/testfs/
>>> 	/media/testfs//snap-1589258042/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//current/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589249822/testhost/var/log/messages.6.lzma
>>> 	ERROR: ino paths ioctl: No such file or directory
>>> 	/media/testfs//snap-1589249547/testhost/var/log/messages.6.lzma
>>> 	ERROR: ino paths ioctl: No such file or directory
>>> 	/media/testfs//snap-1589248407/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589256422/testhost/var/log/messages.6.lzma
>>> 	ERROR: ino paths ioctl: No such file or directory
>>> 	/media/testfs//snap-1589251322/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589251682/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589253842/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589246727/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589258582/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589244027/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589245227/testhost/var/log/messages.6.lzma
>>> 	ERROR: ino paths ioctl: No such file or directory
>>> 	ERROR: ino paths ioctl: No such file or directory
>>> 	/media/testfs//snap-1589246127/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589247327/testhost/var/log/messages.6.lzma
>>> 	ERROR: ino paths ioctl: No such file or directory
>>>
>>> Hmmm, I wonder if there's a problem with deleted snapshots?
>>
>> Yes, also what I'm guessing.
>>
>> The cleanup of data reloc tree doesn't look correct to me.
>>
>> Thanks for the new clues,
>> Qu
> 
> Here's a fun one:
> 
> 1.  Delete all the files on a filesystem where balance loops
> have occurred.

I tried with a newly created fs and failed to reproduce.

> 
> 2.  Verify there are no data blocks (one data block group
> with used = 0):
> 
> # show_block_groups.py /testfs/
> block group vaddr 435969589248 length 1073741824 flags METADATA|RAID1 used 180224 used_pct 0
> block group vaddr 4382686969856 length 33554432 flags SYSTEM|RAID1 used 16384 used_pct 0
> block group vaddr 4383794266112 length 1073741824 flags DATA used 0 used_pct 0
> 
> 3.  Create a new file with a single reference in the only (root) subvol:
> # head -c 1024m > file
> # sync
> # show_block_groups.py .
> block group vaddr 435969589248 length 1073741824 flags METADATA|RAID1 used 1245184 used_pct 0
> block group vaddr 4382686969856 length 33554432 flags SYSTEM|RAID1 used 16384 used_pct 0
> block group vaddr 4384868007936 length 1073741824 flags DATA used 961708032 used_pct 90
> block group vaddr 4385941749760 length 1073741824 flags DATA used 112033792 used_pct 10
> 
> 4.  Run balance, and it immediately loops on a single extent:
> # btrfs balance start -d .
> [Wed May 13 00:41:58 2020] BTRFS info (device dm-0): balance: start -d
> [Wed May 13 00:41:58 2020] BTRFS info (device dm-0): relocating block group 4385941749760 flags data
> [Wed May 13 00:42:00 2020] BTRFS info (device dm-0): found 1 extents, loops 1, stage: move data extents
> [Wed May 13 00:42:00 2020] BTRFS info (device dm-0): found 1 extents, loops 2, stage: update data pointers
> [Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 3, stage: update data pointers
> [Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 4, stage: update data pointers
> [Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 5, stage: update data pointers
> [Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 6, stage: update data pointers
> [Wed May 13 00:42:01 2020] BTRFS info (device dm-0): found 1 extents, loops 7, stage: update data pointers
> [Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 8, stage: update data pointers
> [Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 9, stage: update data pointers
> [Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 10, stage: update data pointers
> [Wed May 13 00:42:02 2020] BTRFS info (device dm-0): found 1 extents, loops 11, stage: update data pointers
> [etc...]
> 
> I tried it 3 more times and there was no loop.  The 5th try looped again.

10 loops, no reproduction.
I guess there are some other factors involved; maybe a newly created fs
won't trigger it?

BTW, for the reproducible case (or the looping case), would you like to
dump the data_reloc root?

My current guess is that some orphan cleanup doesn't get kicked off, but it
shouldn't affect metadata with my patch :(

Thanks,
Qu
> 
> There might be a correlation with cancels.  After a fresh boot, I can
> often balance a few dozen block groups before there's a loop, but if I
> cancel a balance, the next balance almost always loops.
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-13  5:24             ` Zygo Blaxell
@ 2020-05-13 11:23               ` Qu Wenruo
  2020-05-13 12:21                 ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2020-05-13 11:23 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 14969 bytes --]



On 2020/5/13 1:24 PM, Zygo Blaxell wrote:
> On Wed, May 13, 2020 at 10:28:37AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2020/5/12 10:11 PM, Zygo Blaxell wrote:
>>> On Tue, May 12, 2020 at 09:43:06AM -0400, Zygo Blaxell wrote:
>>>> On Mon, May 11, 2020 at 04:31:32PM +0800, Qu Wenruo wrote:
>>>>> Hi Zygo,
>>>>>
>>>>> Would you like to test this diff?
>>>>>
>>>>> Although I haven't found a solid reason yet, there is another report and
>>>>> with the help from the reporter, it turns out that balance hangs at
>>>>> relocating DATA_RELOC tree block.
>>>>>
>>>>> After some more digging, DATA_RELOC tree doesn't need REF_COW bit at all
>>>>> since we can't create snapshot for data reloc tree.
>>>>>
>>>>> By removing the REF_COW bit, we could ensure that data reloc tree always
>>>>> get cowed for relocation (just like extent tree), this would hugely
>>>>> reduce the complexity for data reloc tree.
>>>>>
>>>>> Not sure if this would help, but it passes my local balance run.
>>>>
>>>> I ran it last night.  It did 30804 loops during a metadata block group
>>>> balance, and is now looping on a data block group as I write this.
>>
>> OK, not that surprised the patch doesn't help.
>> But still, the patch itself could still make sense for removing the
>> REFCOW bit for data reloc tree.
>>
>> But I'm interested in how, after those 30804 loops, it found its way to the
>> next block group?!
>>
>>>
>>> Here's the block group that is failing, and some poking around with it:
>>
>> In fact, such poking would be the most valuable part.
>>
>>>
>>> 	root@tester:~# ~/share/python-btrfs/examples/show_block_group_contents.py 4368594108416 /media/testfs/
>>> 	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
>>> 	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
>>> 	    inline shared data backref parent 4374646833152 count 1
>>> 	extent vaddr 4368715161600 length 120168448 refs 1 gen 1318394 flags DATA
>>> 	    inline shared data backref parent 4374801383424 count 1
>>> 	extent vaddr 4368835330048 length 127623168 refs 1 gen 1318394 flags DATA
>>> 	    inline shared data backref parent 4374801383424 count 1
>>> 	extent vaddr 4368962953216 length 124964864 refs 1 gen 1318394 flags DATA
>>> 	    inline shared data backref parent 4374801383424 count 1
>>> 	extent vaddr 4369182420992 length 36700160 refs 1 gen 1321064 flags DATA
>>> 	    inline extent data backref root 257 objectid 257 offset 822607872 count 1
>>
>> One interesting thing is, there are 5 extents during the loop.
>> The first 4 look like they belong to the data reloc tree, which means they
>> have been swapped, waiting to be cleaned up.
>>
>> The last one belongs to root 257, and looks like it hadn't been relocated.
>>
>>>
>>> The extent data backref is unusual--during loops, I don't usually see those.
>>> And...as I write this, it disappeared (it was part of the bees hash table, and
>>> was overwritten).  Now there are 4 extents reported in the balance loop (note:
>>> I added a loop counter to the log message):
>>
>> Then it means the last one got properly relocated.
> 
> No.  In these cases the extent is removed by other filesystem activity.
> If balance gets stuck looping, it can never break out of a loop if it
> is the only writer on the filesystem.  I've left it looping for days,
> it makes no progress.
> 
> When balance is looping, it is always stuck waiting until the extents
> are removed by something else.  In this particular case, the extent was
> overwritten by another process removing the extent's last reference so
> it was no longer part of the block group any more.  It is possible to
> break a balance loop by simply deleting all the files with extents in
> the block group.
> 
> I wrote some scripts that dump out the extents in the looping block
> group, find the files they belong to, and run the defrag ioctl on them,
> thereby removing all the extents in the block group so the balance loop
> will end, without deleting the data.

This should definitely be fixed.

>  I used the script for a while and
> was able to balance hundreds of block groups more than I would have been
> able to without the script; however, the script couldn't run defrag on
> extents that were not reachable through open() (e.g. extents referenced
> by a deleted snapshot),

And data reloc tree.

> so it couldn't work around the balance loops in
> all cases.

For the single-data-extent hang you are able to reproduce, can you
send me a binary dump of that fs when it's hanging?

Thanks,
Qu

> 
>> The cleanup for the first 4 doesn't happen properly.
>>>
>>> 	[Tue May 12 09:44:22 2020] BTRFS info (device dm-0): found 5 extents, loops 378, stage: update data pointers
>>> 	[Tue May 12 09:44:23 2020] BTRFS info (device dm-0): found 5 extents, loops 379, stage: update data pointers
>>> 	[Tue May 12 09:44:24 2020] BTRFS info (device dm-0): found 5 extents, loops 380, stage: update data pointers
>>> 	[Tue May 12 09:44:26 2020] BTRFS info (device dm-0): found 5 extents, loops 381, stage: update data pointers
>>> 	[Tue May 12 09:44:27 2020] BTRFS info (device dm-0): found 5 extents, loops 382, stage: update data pointers
>>> 	[Tue May 12 10:04:49 2020] BTRFS info (device dm-0): found 5 extents, loops 383, stage: update data pointers
>>> 	[Tue May 12 10:04:53 2020] BTRFS info (device dm-0): found 4 extents, loops 384, stage: update data pointers
>>> 	[Tue May 12 10:04:58 2020] BTRFS info (device dm-0): found 4 extents, loops 385, stage: update data pointers
>>> 	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 386, stage: update data pointers
>>> 	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 387, stage: update data pointers
>>> 	[Tue May 12 10:05:01 2020] BTRFS info (device dm-0): found 4 extents, loops 388, stage: update data pointers
>>>
>>> Some of the extents that remain are confusing python-btrfs a little:
>>>
>>> 	root@tester:~# ~/share/python-btrfs/examples/show_block_group_data_extent_filenames.py 4368594108416 /media/testfs/
>>> 	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
>>> 	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
>>> 	Traceback (most recent call last):
>>> 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 52, in <module>
>>> 	    inodes, bytes_missed = logical_to_ino_fn(fs.fd, extent.vaddr)
>>> 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 28, in find_out_about_v1_or_v2
>>> 	    inodes, bytes_missed = using_v2(fd, vaddr)
>>> 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 17, in using_v2
>>> 	    inodes, bytes_missed = btrfs.ioctl.logical_to_ino_v2(fd, vaddr, ignore_offset=True)
>>> 	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 565, in logical_to_ino_v2
>>> 	    return _logical_to_ino(fd, vaddr, bufsize, ignore_offset, _v2=True)
>>> 	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 581, in _logical_to_ino
>>> 	    fcntl.ioctl(fd, IOC_LOGICAL_INO_V2, args)
>>
>> I'm a little surprised that it's using the logical ino ioctl, not just
>> TREE_SEARCH.
>>
>> I guess if we could get a plain tree search based one (it only search
>> commit root, which is exactly balance based on), it would be easier to
>> do the digging.
>>
>>> 	OSError: [Errno 22] Invalid argument
>>>
>>> 	root@tester:~# btrfs ins log 4368594108416 /media/testfs/
>>> 	/media/testfs//snap-1589258042/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//current/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589249822/testhost/var/log/messages.6.lzma
>>> 	ERROR: ino paths ioctl: No such file or directory
>>> 	/media/testfs//snap-1589249547/testhost/var/log/messages.6.lzma
>>> 	ERROR: ino paths ioctl: No such file or directory
>>> 	/media/testfs//snap-1589248407/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589256422/testhost/var/log/messages.6.lzma
>>> 	ERROR: ino paths ioctl: No such file or directory
>>> 	/media/testfs//snap-1589251322/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589251682/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589253842/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589246727/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589258582/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589244027/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589245227/testhost/var/log/messages.6.lzma
>>> 	ERROR: ino paths ioctl: No such file or directory
>>> 	ERROR: ino paths ioctl: No such file or directory
>>> 	/media/testfs//snap-1589246127/testhost/var/log/messages.6.lzma
>>> 	/media/testfs//snap-1589247327/testhost/var/log/messages.6.lzma
>>> 	ERROR: ino paths ioctl: No such file or directory
>>>
>>> Hmmm, I wonder if there's a problem with deleted snapshots?
>>
>> Yes, also what I'm guessing.
>>
>> The cleanup of data reloc tree doesn't look correct to me.
>>
>> Thanks for the new clues,
>> Qu
>>
>>>  I have those
>>> nearly continuously in my test environment, which is creating and deleting
>>> snapshots all the time.
>>>
>>> 	root@tester:~# btrfs ins log 4368594108416 -P /media/testfs/
>>> 	inode 20838190 offset 0 root 10347
>>> 	inode 20838190 offset 0 root 8013
>>> 	inode 20838190 offset 0 root 10332
>>> 	inode 20838190 offset 0 root 10330
>>> 	inode 20838190 offset 0 root 10331
>>> 	inode 20838190 offset 0 root 10328
>>> 	inode 20838190 offset 0 root 10329
>>> 	inode 20838190 offset 0 root 10343
>>> 	inode 20838190 offset 0 root 10333
>>> 	inode 20838190 offset 0 root 10334
>>> 	inode 20838190 offset 0 root 10336
>>> 	inode 20838190 offset 0 root 10338
>>> 	inode 20838190 offset 0 root 10325
>>> 	inode 20838190 offset 0 root 10349
>>> 	inode 20838190 offset 0 root 10320
>>> 	inode 20838190 offset 0 root 10321
>>> 	inode 20838190 offset 0 root 10322
>>> 	inode 20838190 offset 0 root 10323
>>> 	inode 20838190 offset 0 root 10324
>>> 	inode 20838190 offset 0 root 10326
>>> 	inode 20838190 offset 0 root 10327
>>> 	root@tester:~# btrfs sub list -d /media/testfs/
>>> 	ID 10201 gen 1321166 top level 0 path DELETED
>>> 	ID 10210 gen 1321166 top level 0 path DELETED
>>> 	ID 10230 gen 1321166 top level 0 path DELETED
>>> 	ID 10254 gen 1321166 top level 0 path DELETED
>>> 	ID 10257 gen 1321166 top level 0 path DELETED
>>> 	ID 10274 gen 1321166 top level 0 path DELETED
>>> 	ID 10281 gen 1321166 top level 0 path DELETED
>>> 	ID 10287 gen 1321166 top level 0 path DELETED
>>> 	ID 10296 gen 1321166 top level 0 path DELETED
>>> 	ID 10298 gen 1321166 top level 0 path DELETED
>>> 	ID 10299 gen 1321166 top level 0 path DELETED
>>> 	ID 10308 gen 1321166 top level 0 path DELETED
>>> 	ID 10311 gen 1321166 top level 0 path DELETED
>>> 	ID 10313 gen 1321166 top level 0 path DELETED
>>> 	ID 10315 gen 1321166 top level 0 path DELETED
>>> 	ID 10317 gen 1321166 top level 0 path DELETED
>>> 	ID 10322 gen 1321166 top level 0 path DELETED
>>> 	ID 10323 gen 1321166 top level 0 path DELETED
>>> 	ID 10327 gen 1321166 top level 0 path DELETED
>>> 	ID 10328 gen 1321166 top level 0 path DELETED
>>> 	ID 10330 gen 1321166 top level 0 path DELETED
>>> 	ID 10333 gen 1321166 top level 0 path DELETED
>>>
>>>
>>>>> Thanks,
>>>>> Qu
>>>>
>>>>> From 82f3b96a68561b2de9712262cb652192b8ea9b1b Mon Sep 17 00:00:00 2001
>>>>> From: Qu Wenruo <wqu@suse.com>
>>>>> Date: Mon, 11 May 2020 16:27:43 +0800
>>>>> Subject: [PATCH] btrfs: Remove the REF_COW bit for data reloc tree
>>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>> ---
>>>>>  fs/btrfs/disk-io.c    | 9 ++++++++-
>>>>>  fs/btrfs/inode.c      | 6 ++++--
>>>>>  fs/btrfs/relocation.c | 3 ++-
>>>>>  3 files changed, 14 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>>> index 56675d3cd23a..cb90966a8aab 100644
>>>>> --- a/fs/btrfs/disk-io.c
>>>>> +++ b/fs/btrfs/disk-io.c
>>>>> @@ -1418,9 +1418,16 @@ static int btrfs_init_fs_root(struct btrfs_root *root)
>>>>>  	if (ret)
>>>>>  		goto fail;
>>>>>  
>>>>> -	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
>>>>> +	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID &&
>>>>> +	    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
>>>>>  		set_bit(BTRFS_ROOT_REF_COWS, &root->state);
>>>>>  		btrfs_check_and_init_root_item(&root->root_item);
>>>>> +	} else if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID) {
>>>>> +		/*
>>>>> +		 * Data reloc tree won't be snapshotted, thus it's COW only
>>>>> +		 * tree, it's needed to set TRACK_DIRTY bit for it.
>>>>> +		 */
>>>>> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
>>>>>  	}
>>>>>  
>>>>>  	btrfs_init_free_ino_ctl(root);
>>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>>> index 5d567082f95a..71841535c7ca 100644
>>>>> --- a/fs/btrfs/inode.c
>>>>> +++ b/fs/btrfs/inode.c
>>>>> @@ -4129,7 +4129,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
>>>>>  	 * extent just the way it is.
>>>>>  	 */
>>>>>  	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
>>>>> -	    root == fs_info->tree_root)
>>>>> +	    root == fs_info->tree_root ||
>>>>> +	    root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
>>>>>  		btrfs_drop_extent_cache(BTRFS_I(inode), ALIGN(new_size,
>>>>>  					fs_info->sectorsize),
>>>>>  					(u64)-1, 0);
>>>>> @@ -4334,7 +4335,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
>>>>>  
>>>>>  		if (found_extent &&
>>>>>  		    (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
>>>>> -		     root == fs_info->tree_root)) {
>>>>> +		     root == fs_info->tree_root ||
>>>>> +		     root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)) {
>>>>>  			struct btrfs_ref ref = { 0 };
>>>>>  
>>>>>  			bytes_deleted += extent_num_bytes;
>>>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>>>> index f25deca18a5d..a85dd5d465f6 100644
>>>>> --- a/fs/btrfs/relocation.c
>>>>> +++ b/fs/btrfs/relocation.c
>>>>> @@ -1087,7 +1087,8 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
>>>>>  		 * if we are modifying block in fs tree, wait for readpage
>>>>>  		 * to complete and drop the extent cache
>>>>>  		 */
>>>>> -		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
>>>>> +		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
>>>>> +		    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
>>>>>  			if (first) {
>>>>>  				inode = find_next_inode(root, key.objectid);
>>>>>  				first = 0;
>>>>> -- 
>>>>> 2.26.2
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-13 11:23               ` Qu Wenruo
@ 2020-05-13 12:21                 ` Zygo Blaxell
  2020-05-14  8:08                   ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2020-05-13 12:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, May 13, 2020 at 07:23:40PM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/5/13 1:24 PM, Zygo Blaxell wrote:
> > On Wed, May 13, 2020 at 10:28:37AM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2020/5/12 10:11 PM, Zygo Blaxell wrote:
> >>> On Tue, May 12, 2020 at 09:43:06AM -0400, Zygo Blaxell wrote:
> >>>> On Mon, May 11, 2020 at 04:31:32PM +0800, Qu Wenruo wrote:
> >>>>> Hi Zygo,
> >>>>>
> >>>>> Would you like to test this diff?
> >>>>>
> >>>>> Although I haven't found a solid reason yet, there is another report and
> >>>>> with the help from the reporter, it turns out that balance hangs at
> >>>>> relocating DATA_RELOC tree block.
> >>>>>
> >>>>> After some more digging, DATA_RELOC tree doesn't need REF_COW bit at all
> >>>>> since we can't create snapshot for data reloc tree.
> >>>>>
> >>>>> By removing the REF_COW bit, we could ensure that data reloc tree always
> >>>>> get cowed for relocation (just like extent tree), this would hugely
> >>>>> reduce the complexity for data reloc tree.
> >>>>>
> >>>>> Not sure if this would help, but it passes my local balance run.
> >>>>
> >>>> I ran it last night.  It did 30804 loops during a metadata block group
> >>>> balance, and is now looping on a data block group as I write this.
> >>
> >> OK, not that surprised the patch doesn't help.
> >> But still, the patch itself could still make sense for removing the
> >> REFCOW bit for data reloc tree.
> >>
> >> But I'm interested in how, after those 30804 loops, it found its way to the
> >> next block group?!
> >>
> >>>
> >>> Here's the block group that is failing, and some poking around with it:
> >>
> >> In fact, such poking would be the most valuable part.
> >>
> >>>
> >>> 	root@tester:~# ~/share/python-btrfs/examples/show_block_group_contents.py 4368594108416 /media/testfs/
> >>> 	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
> >>> 	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
> >>> 	    inline shared data backref parent 4374646833152 count 1
> >>> 	extent vaddr 4368715161600 length 120168448 refs 1 gen 1318394 flags DATA
> >>> 	    inline shared data backref parent 4374801383424 count 1
> >>> 	extent vaddr 4368835330048 length 127623168 refs 1 gen 1318394 flags DATA
> >>> 	    inline shared data backref parent 4374801383424 count 1
> >>> 	extent vaddr 4368962953216 length 124964864 refs 1 gen 1318394 flags DATA
> >>> 	    inline shared data backref parent 4374801383424 count 1
> >>> 	extent vaddr 4369182420992 length 36700160 refs 1 gen 1321064 flags DATA
> >>> 	    inline extent data backref root 257 objectid 257 offset 822607872 count 1
> >>
> >> One interesting thing is, there are 5 extents during the loop.
> >> The first 4 look like they belong to the data reloc tree, which means they
> >> have been swapped, waiting to be cleaned up.
> >>
> >> The last one belongs to root 257, and looks like it hadn't been relocated.
> >>
> >>>
> >>> The extent data backref is unusual--during loops, I don't usually see those.
> >>> And...as I write this, it disappeared (it was part of the bees hash table, and
> >>> was overwritten).  Now there are 4 extents reported in the balance loop (note:
> >>> I added a loop counter to the log message):
> >>
> >> Then it means the last one got properly relocated.
> > 
> > No.  In these cases the extent is removed by other filesystem activity.
> > If balance gets stuck looping, it can never break out of a loop if it
> > is the only writer on the filesystem.  I've left it looping for days,
> > it makes no progress.
> > 
> > When balance is looping, it is always stuck waiting until the extents
> > are removed by something else.  In this particular case, the extent was
> > overwritten by another process removing the extent's last reference so
> > it was no longer part of the block group any more.  It is possible to
> > break a balance loop by simply deleting all the files with extents in
> > the block group.
> > 
> > I wrote some scripts that dump out the extents in the looping block
> > group, find the files they belong to, and run the defrag ioctl on them,
> > thereby removing all the extents in the block group so the balance loop
> > will end, without deleting the data.
> 
> This should definitely be fixed.
> 
> >  I used the script for a while and
> > was able to balance hundreds of block groups more than I would have been
> > able to without the script; however, the script couldn't run defrag on
> > extents that were not reachable through open() (e.g. extents referenced
> > by a deleted snapshot),
> 
> And data reloc tree.
> 
> > so it couldn't work around the balance loops in
> > all cases.
> 
> For the single-data-extent hang you are able to reproduce, can you
> send me a binary dump of that fs when it's hanging?

Oops, I should have thought of that.  Oh well, it's still looping from
my 5th try last night...

Here's a looping image with a single file (this one has 13 extents).
I've stopped the VM in the middle of a balance loop and captured btrfs-image
from the host.  Then resumed the VM to make sure it's still looping.

Kernel log:

	[96199.614869][ T9676] BTRFS info (device dm-0): balance: start -d
	[96199.616086][ T9676] BTRFS info (device dm-0): relocating block group 4396679168000 flags data
	[96199.782217][ T9676] BTRFS info (device dm-0): relocating block group 4395605426176 flags data
	[96199.971118][ T9676] BTRFS info (device dm-0): relocating block group 4394531684352 flags data
	[96220.858317][ T9676] BTRFS info (device dm-0): found 13 extents, loops 1, stage: move data extents
	[...]
	[121403.509718][ T9676] BTRFS info (device dm-0): found 13 extents, loops 131823, stage: update data pointers
	(qemu) stop

btrfs-image URL:

	http://www.furryterror.org/~zblaxell/tmp/.fsinqz/image.bin

> Thanks,
> Qu
> 
> > 
> >> The cleanup for the first 4 doesn't happen properly.
> >>>
> >>> 	[Tue May 12 09:44:22 2020] BTRFS info (device dm-0): found 5 extents, loops 378, stage: update data pointers
> >>> 	[Tue May 12 09:44:23 2020] BTRFS info (device dm-0): found 5 extents, loops 379, stage: update data pointers
> >>> 	[Tue May 12 09:44:24 2020] BTRFS info (device dm-0): found 5 extents, loops 380, stage: update data pointers
> >>> 	[Tue May 12 09:44:26 2020] BTRFS info (device dm-0): found 5 extents, loops 381, stage: update data pointers
> >>> 	[Tue May 12 09:44:27 2020] BTRFS info (device dm-0): found 5 extents, loops 382, stage: update data pointers
> >>> 	[Tue May 12 10:04:49 2020] BTRFS info (device dm-0): found 5 extents, loops 383, stage: update data pointers
> >>> 	[Tue May 12 10:04:53 2020] BTRFS info (device dm-0): found 4 extents, loops 384, stage: update data pointers
> >>> 	[Tue May 12 10:04:58 2020] BTRFS info (device dm-0): found 4 extents, loops 385, stage: update data pointers
> >>> 	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 386, stage: update data pointers
> >>> 	[Tue May 12 10:05:00 2020] BTRFS info (device dm-0): found 4 extents, loops 387, stage: update data pointers
> >>> 	[Tue May 12 10:05:01 2020] BTRFS info (device dm-0): found 4 extents, loops 388, stage: update data pointers
> >>>
> >>> Some of the extents that remain are confusing python-btrfs a little:
> >>>
> >>> 	root@tester:~# ~/share/python-btrfs/examples/show_block_group_data_extent_filenames.py 4368594108416 /media/testfs/
> >>> 	block group vaddr 4368594108416 length 1073741824 flags DATA used 530509824 used_pct 49
> >>> 	extent vaddr 4368594108416 length 121053184 refs 1 gen 1318394 flags DATA
> >>> 	Traceback (most recent call last):
> >>> 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 52, in <module>
> >>> 	    inodes, bytes_missed = logical_to_ino_fn(fs.fd, extent.vaddr)
> >>> 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 28, in find_out_about_v1_or_v2
> >>> 	    inodes, bytes_missed = using_v2(fd, vaddr)
> >>> 	  File "/root/share/python-btrfs/examples/show_block_group_data_extent_filenames.py", line 17, in using_v2
> >>> 	    inodes, bytes_missed = btrfs.ioctl.logical_to_ino_v2(fd, vaddr, ignore_offset=True)
> >>> 	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 565, in logical_to_ino_v2
> >>> 	    return _logical_to_ino(fd, vaddr, bufsize, ignore_offset, _v2=True)
> >>> 	  File "/media/share/python-btrfs/examples/btrfs/ioctl.py", line 581, in _logical_to_ino
> >>> 	    fcntl.ioctl(fd, IOC_LOGICAL_INO_V2, args)
> >>
> >> I'm a little surprised about the it's using logical ino ioctl, not just
> >> TREE_SEARCH.
> >>
> >> I guess if we could get a plain tree search based one (it only search
> >> commit root, which is exactly balance based on), it would be easier to
> >> do the digging.
> >>
> >>> 	OSError: [Errno 22] Invalid argument
> >>>
> >>> 	root@tester:~# btrfs ins log 4368594108416 /media/testfs/
> >>> 	/media/testfs//snap-1589258042/testhost/var/log/messages.6.lzma
> >>> 	/media/testfs//current/testhost/var/log/messages.6.lzma
> >>> 	/media/testfs//snap-1589249822/testhost/var/log/messages.6.lzma
> >>> 	ERROR: ino paths ioctl: No such file or directory
> >>> 	/media/testfs//snap-1589249547/testhost/var/log/messages.6.lzma
> >>> 	ERROR: ino paths ioctl: No such file or directory
> >>> 	/media/testfs//snap-1589248407/testhost/var/log/messages.6.lzma
> >>> 	/media/testfs//snap-1589256422/testhost/var/log/messages.6.lzma
> >>> 	ERROR: ino paths ioctl: No such file or directory
> >>> 	/media/testfs//snap-1589251322/testhost/var/log/messages.6.lzma
> >>> 	/media/testfs//snap-1589251682/testhost/var/log/messages.6.lzma
> >>> 	/media/testfs//snap-1589253842/testhost/var/log/messages.6.lzma
> >>> 	/media/testfs//snap-1589246727/testhost/var/log/messages.6.lzma
> >>> 	/media/testfs//snap-1589258582/testhost/var/log/messages.6.lzma
> >>> 	/media/testfs//snap-1589244027/testhost/var/log/messages.6.lzma
> >>> 	/media/testfs//snap-1589245227/testhost/var/log/messages.6.lzma
> >>> 	ERROR: ino paths ioctl: No such file or directory
> >>> 	ERROR: ino paths ioctl: No such file or directory
> >>> 	/media/testfs//snap-1589246127/testhost/var/log/messages.6.lzma
> >>> 	/media/testfs//snap-1589247327/testhost/var/log/messages.6.lzma
> >>> 	ERROR: ino paths ioctl: No such file or directory
> >>>
> >>> Hmmm, I wonder if there's a problem with deleted snapshots?
> >>
> >> Yes, also what I'm guessing.
> >>
> >> The cleanup of data reloc tree doesn't look correct to me.
> >>
> >> Thanks for the new clues,
> >> Qu
> >>
> >>>  I have those
> >>> nearly continuously in my test environment, which is creating and deleting
> >>> snapshots all the time.
> >>>
> >>> 	root@tester:~# btrfs ins log 4368594108416 -P /media/testfs/
> >>> 	inode 20838190 offset 0 root 10347
> >>> 	inode 20838190 offset 0 root 8013
> >>> 	inode 20838190 offset 0 root 10332
> >>> 	inode 20838190 offset 0 root 10330
> >>> 	inode 20838190 offset 0 root 10331
> >>> 	inode 20838190 offset 0 root 10328
> >>> 	inode 20838190 offset 0 root 10329
> >>> 	inode 20838190 offset 0 root 10343
> >>> 	inode 20838190 offset 0 root 10333
> >>> 	inode 20838190 offset 0 root 10334
> >>> 	inode 20838190 offset 0 root 10336
> >>> 	inode 20838190 offset 0 root 10338
> >>> 	inode 20838190 offset 0 root 10325
> >>> 	inode 20838190 offset 0 root 10349
> >>> 	inode 20838190 offset 0 root 10320
> >>> 	inode 20838190 offset 0 root 10321
> >>> 	inode 20838190 offset 0 root 10322
> >>> 	inode 20838190 offset 0 root 10323
> >>> 	inode 20838190 offset 0 root 10324
> >>> 	inode 20838190 offset 0 root 10326
> >>> 	inode 20838190 offset 0 root 10327
> >>> 	root@tester:~# btrfs sub list -d /media/testfs/
> >>> 	ID 10201 gen 1321166 top level 0 path DELETED
> >>> 	ID 10210 gen 1321166 top level 0 path DELETED
> >>> 	ID 10230 gen 1321166 top level 0 path DELETED
> >>> 	ID 10254 gen 1321166 top level 0 path DELETED
> >>> 	ID 10257 gen 1321166 top level 0 path DELETED
> >>> 	ID 10274 gen 1321166 top level 0 path DELETED
> >>> 	ID 10281 gen 1321166 top level 0 path DELETED
> >>> 	ID 10287 gen 1321166 top level 0 path DELETED
> >>> 	ID 10296 gen 1321166 top level 0 path DELETED
> >>> 	ID 10298 gen 1321166 top level 0 path DELETED
> >>> 	ID 10299 gen 1321166 top level 0 path DELETED
> >>> 	ID 10308 gen 1321166 top level 0 path DELETED
> >>> 	ID 10311 gen 1321166 top level 0 path DELETED
> >>> 	ID 10313 gen 1321166 top level 0 path DELETED
> >>> 	ID 10315 gen 1321166 top level 0 path DELETED
> >>> 	ID 10317 gen 1321166 top level 0 path DELETED
> >>> 	ID 10322 gen 1321166 top level 0 path DELETED
> >>> 	ID 10323 gen 1321166 top level 0 path DELETED
> >>> 	ID 10327 gen 1321166 top level 0 path DELETED
> >>> 	ID 10328 gen 1321166 top level 0 path DELETED
> >>> 	ID 10330 gen 1321166 top level 0 path DELETED
> >>> 	ID 10333 gen 1321166 top level 0 path DELETED
> >>>
> >>>
> >>>>> Thanks,
> >>>>> Qu
> >>>>
> >>>>> From 82f3b96a68561b2de9712262cb652192b8ea9b1b Mon Sep 17 00:00:00 2001
> >>>>> From: Qu Wenruo <wqu@suse.com>
> >>>>> Date: Mon, 11 May 2020 16:27:43 +0800
> >>>>> Subject: [PATCH] btrfs: Remove the REF_COW bit for data reloc tree
> >>>>>
> >>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >>>>> ---
> >>>>>  fs/btrfs/disk-io.c    | 9 ++++++++-
> >>>>>  fs/btrfs/inode.c      | 6 ++++--
> >>>>>  fs/btrfs/relocation.c | 3 ++-
> >>>>>  3 files changed, 14 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >>>>> index 56675d3cd23a..cb90966a8aab 100644
> >>>>> --- a/fs/btrfs/disk-io.c
> >>>>> +++ b/fs/btrfs/disk-io.c
> >>>>> @@ -1418,9 +1418,16 @@ static int btrfs_init_fs_root(struct btrfs_root *root)
> >>>>>  	if (ret)
> >>>>>  		goto fail;
> >>>>>  
> >>>>> -	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
> >>>>> +	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID &&
> >>>>> +	    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >>>>>  		set_bit(BTRFS_ROOT_REF_COWS, &root->state);
> >>>>>  		btrfs_check_and_init_root_item(&root->root_item);
> >>>>> +	} else if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >>>>> +		/*
> >>>>> +		 * Data reloc tree won't be snapshotted, thus it's COW only
> >>>>> +		 * tree, it's needed to set TRACK_DIRTY bit for it.
> >>>>> +		 */
> >>>>> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> >>>>>  	}
> >>>>>  
> >>>>>  	btrfs_init_free_ino_ctl(root);
> >>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >>>>> index 5d567082f95a..71841535c7ca 100644
> >>>>> --- a/fs/btrfs/inode.c
> >>>>> +++ b/fs/btrfs/inode.c
> >>>>> @@ -4129,7 +4129,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
> >>>>>  	 * extent just the way it is.
> >>>>>  	 */
> >>>>>  	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
> >>>>> -	    root == fs_info->tree_root)
> >>>>> +	    root == fs_info->tree_root ||
> >>>>> +	    root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
> >>>>>  		btrfs_drop_extent_cache(BTRFS_I(inode), ALIGN(new_size,
> >>>>>  					fs_info->sectorsize),
> >>>>>  					(u64)-1, 0);
> >>>>> @@ -4334,7 +4335,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
> >>>>>  
> >>>>>  		if (found_extent &&
> >>>>>  		    (test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
> >>>>> -		     root == fs_info->tree_root)) {
> >>>>> +		     root == fs_info->tree_root ||
> >>>>> +		     root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)) {
> >>>>>  			struct btrfs_ref ref = { 0 };
> >>>>>  
> >>>>>  			bytes_deleted += extent_num_bytes;
> >>>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> >>>>> index f25deca18a5d..a85dd5d465f6 100644
> >>>>> --- a/fs/btrfs/relocation.c
> >>>>> +++ b/fs/btrfs/relocation.c
> >>>>> @@ -1087,7 +1087,8 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
> >>>>>  		 * if we are modifying block in fs tree, wait for readpage
> >>>>>  		 * to complete and drop the extent cache
> >>>>>  		 */
> >>>>> -		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
> >>>>> +		if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
> >>>>> +		    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >>>>>  			if (first) {
> >>>>>  				inode = find_next_inode(root, key.objectid);
> >>>>>  				first = 0;
> >>>>> -- 
> >>>>> 2.26.2
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> > 
> > 
> > 
> 




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-13 12:21                 ` Zygo Blaxell
@ 2020-05-14  8:08                   ` Qu Wenruo
  2020-05-14  8:55                     ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2020-05-14  8:08 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 2244 bytes --]



On 2020/5/13 下午8:21, Zygo Blaxell wrote:
> On Wed, May 13, 2020 at 07:23:40PM +0800, Qu Wenruo wrote:
>>
>>
[...]
> 
> Kernel log:
> 
> 	[96199.614869][ T9676] BTRFS info (device dm-0): balance: start -d
> 	[96199.616086][ T9676] BTRFS info (device dm-0): relocating block group 4396679168000 flags data
> 	[96199.782217][ T9676] BTRFS info (device dm-0): relocating block group 4395605426176 flags data
> 	[96199.971118][ T9676] BTRFS info (device dm-0): relocating block group 4394531684352 flags data
> 	[96220.858317][ T9676] BTRFS info (device dm-0): found 13 extents, loops 1, stage: move data extents
> 	[...]
> 	[121403.509718][ T9676] BTRFS info (device dm-0): found 13 extents, loops 131823, stage: update data pointers
> 	(qemu) stop
> 
> btrfs-image URL:
> 
> 	http://www.furryterror.org/~zblaxell/tmp/.fsinqz/image.bin
> 
The image shows several very strange results.

For one, although we're relocating block group 4394531684352, the
previous two block groups didn't really get relocated.

There are still extents there, all belonging to the data reloc tree.

Furthermore, the data reloc tree inode 620 should have been evicted when
the previous block group's relocation finished.

So I suspect something went wrong in the data reloc tree; would you
please try the following diff?
(Either on a vanilla kernel or with my previous useless patch.)

Thanks,
Qu

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 9afc1a6928cf..ef9e18bab6f6 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3498,6 +3498,7 @@ struct inode *create_reloc_inode(struct btrfs_fs_info *fs_info,
        BTRFS_I(inode)->index_cnt = group->start;

        err = btrfs_orphan_add(trans, BTRFS_I(inode));
+       WARN_ON(atomic_read(inode->i_count) != 1);
 out:
        btrfs_put_root(root);
        btrfs_end_transaction(trans);
@@ -3681,6 +3682,7 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 out:
        if (err && rw)
                btrfs_dec_block_group_ro(rc->block_group);
+       WARN_ON(atomic_read(inode->i_count) != 1);
        iput(rc->data_inode);
        btrfs_put_block_group(rc->block_group);
        free_reloc_control(rc);


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-14  8:08                   ` Qu Wenruo
@ 2020-05-14  8:55                     ` Qu Wenruo
  2020-05-14 17:44                       ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2020-05-14  8:55 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 2687 bytes --]



On 2020/5/14 下午4:08, Qu Wenruo wrote:
> 
> 
> On 2020/5/13 下午8:21, Zygo Blaxell wrote:
>> On Wed, May 13, 2020 at 07:23:40PM +0800, Qu Wenruo wrote:
>>>
>>>
> [...]
>>
>> Kernel log:
>>
>> 	[96199.614869][ T9676] BTRFS info (device dm-0): balance: start -d
>> 	[96199.616086][ T9676] BTRFS info (device dm-0): relocating block group 4396679168000 flags data
>> 	[96199.782217][ T9676] BTRFS info (device dm-0): relocating block group 4395605426176 flags data
>> 	[96199.971118][ T9676] BTRFS info (device dm-0): relocating block group 4394531684352 flags data
>> 	[96220.858317][ T9676] BTRFS info (device dm-0): found 13 extents, loops 1, stage: move data extents
>> 	[...]
>> 	[121403.509718][ T9676] BTRFS info (device dm-0): found 13 extents, loops 131823, stage: update data pointers
>> 	(qemu) stop
>>
>> btrfs-image URL:
>>
>> 	http://www.furryterror.org/~zblaxell/tmp/.fsinqz/image.bin
>>
> The image shows several very strange result.
> 
> For one, although we're relocating block group 4394531684352, the
> previous two block groups doesn't really get relocated.
> 
> There are still extents there, all belongs to data reloc tree.
> 
> Furthermore, the data reloc tree inode 620 should be evicted when
> previous block group relocation finishes.
> 
> So I'm considering something went wrong in data reloc tree, would you
> please try the following diff?
> (Either on vanilla kernel or with my previous useless patch)

Oh, my previous testing patch does the wrong inode put for the data reloc
tree, so it could lead to exactly this situation.

Thankfully the v2 for upstream gets the problem fixed.

So we're back where we started: still no faster way to
reproduce the problem...

Thanks,
Qu
> 
> Thanks,
> Qu
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 9afc1a6928cf..ef9e18bab6f6 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3498,6 +3498,7 @@ struct inode *create_reloc_inode(struct
> btrfs_fs_info *fs_info,
>         BTRFS_I(inode)->index_cnt = group->start;
> 
>         err = btrfs_orphan_add(trans, BTRFS_I(inode));
> +       WARN_ON(atomic_read(inode->i_count) != 1);
>  out:
>         btrfs_put_root(root);
>         btrfs_end_transaction(trans);
> @@ -3681,6 +3682,7 @@ int btrfs_relocate_block_group(struct
> btrfs_fs_info *fs_info, u64 group_start)
>  out:
>         if (err && rw)
>                 btrfs_dec_block_group_ro(rc->block_group);
> +       WARN_ON(atomic_read(inode->i_count) != 1);
>         iput(rc->data_inode);
>         btrfs_put_block_group(rc->block_group);
>         free_reloc_control(rc);
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-14  8:55                     ` Qu Wenruo
@ 2020-05-14 17:44                       ` Zygo Blaxell
  2020-05-14 23:43                         ` Qu Wenruo
  2020-05-15  6:57                         ` Qu Wenruo
  0 siblings, 2 replies; 32+ messages in thread
From: Zygo Blaxell @ 2020-05-14 17:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, May 14, 2020 at 04:55:18PM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/5/14 下午4:08, Qu Wenruo wrote:
> > 
> > 
> > On 2020/5/13 下午8:21, Zygo Blaxell wrote:
> >> On Wed, May 13, 2020 at 07:23:40PM +0800, Qu Wenruo wrote:
> >>>
> >>>
> > [...]
> >>
> >> Kernel log:
> >>
> >> 	[96199.614869][ T9676] BTRFS info (device dm-0): balance: start -d
> >> 	[96199.616086][ T9676] BTRFS info (device dm-0): relocating block group 4396679168000 flags data
> >> 	[96199.782217][ T9676] BTRFS info (device dm-0): relocating block group 4395605426176 flags data
> >> 	[96199.971118][ T9676] BTRFS info (device dm-0): relocating block group 4394531684352 flags data
> >> 	[96220.858317][ T9676] BTRFS info (device dm-0): found 13 extents, loops 1, stage: move data extents
> >> 	[...]
> >> 	[121403.509718][ T9676] BTRFS info (device dm-0): found 13 extents, loops 131823, stage: update data pointers
> >> 	(qemu) stop
> >>
> >> btrfs-image URL:
> >>
> >> 	http://www.furryterror.org/~zblaxell/tmp/.fsinqz/image.bin
> >>
> > The image shows several very strange result.
> > 
> > For one, although we're relocating block group 4394531684352, the
> > previous two block groups doesn't really get relocated.
> > 
> > There are still extents there, all belongs to data reloc tree.
> > 
> > Furthermore, the data reloc tree inode 620 should be evicted when
> > previous block group relocation finishes.
> > 
> > So I'm considering something went wrong in data reloc tree, would you
> > please try the following diff?
> > (Either on vanilla kernel or with my previous useless patch)
> 
> Oh, my previous testing patch is doing wrong inode put for data reloc
> tree, thus it's possible to lead to such situation.
> 
> Thankfully the v2 for upstream gets the problem fixed.
> 
> Thus it goes back to the original stage, still no faster way to
> reproduce the problem...

Can we attack the problem by logging kernel activity?  Like can we
log whenever we add or remove items from the data reloc tree, or
why we don't?

I can get a filesystem with a single data block group and a single
(visible) extent that loops, and somehow it's so easy to do that that I'm
having problems making filesystems _not_ do it.  What can we do with that?

What am I (and everyone else with this problem) doing that you are not?
Usually that difference is "I'm running bees" but we're running out of
bugs related to LOGICAL_INO and the dedupe ioctl, and I think other people
are reporting the problem without running bees.  I'm also running balance
cancels, which seem to increase the repro rate (though they might just
be increasing the number of balances tested per day, and there could be
just a fixed percentage of balances that loop).

I will see if I can build a standalone kvm image that generates balance
loops on blank disks.  If I'm successful, you can download it and then
run all the experiments you want.

I also want to see if reverting the extended reloc tree lifespan patch
(d2311e698578 "btrfs: relocation: Delay reloc tree deletion after
merge_reloc_roots") stops the looping on misc-next.  I found that
reverting that patch stops the balance looping on 5.1.21 in an earlier
experiment.  Maybe there are two bugs here, and we've already fixed one,
but the symptom won't go away because some second bug has appeared.


> Thanks,
> Qu
> > 
> > Thanks,
> > Qu
> > 
> > diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> > index 9afc1a6928cf..ef9e18bab6f6 100644
> > --- a/fs/btrfs/relocation.c
> > +++ b/fs/btrfs/relocation.c
> > @@ -3498,6 +3498,7 @@ struct inode *create_reloc_inode(struct
> > btrfs_fs_info *fs_info,
> >         BTRFS_I(inode)->index_cnt = group->start;
> > 
> >         err = btrfs_orphan_add(trans, BTRFS_I(inode));
> > +       WARN_ON(atomic_read(inode->i_count) != 1);
> >  out:
> >         btrfs_put_root(root);
> >         btrfs_end_transaction(trans);
> > @@ -3681,6 +3682,7 @@ int btrfs_relocate_block_group(struct
> > btrfs_fs_info *fs_info, u64 group_start)
> >  out:
> >         if (err && rw)
> >                 btrfs_dec_block_group_ro(rc->block_group);
> > +       WARN_ON(atomic_read(inode->i_count) != 1);
> >         iput(rc->data_inode);
> >         btrfs_put_block_group(rc->block_group);
> >         free_reloc_control(rc);
> > 
> 




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-14 17:44                       ` Zygo Blaxell
@ 2020-05-14 23:43                         ` Qu Wenruo
  2020-05-15  6:57                         ` Qu Wenruo
  1 sibling, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2020-05-14 23:43 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4969 bytes --]



On 2020/5/15 上午1:44, Zygo Blaxell wrote:
> On Thu, May 14, 2020 at 04:55:18PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2020/5/14 下午4:08, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/5/13 下午8:21, Zygo Blaxell wrote:
>>>> On Wed, May 13, 2020 at 07:23:40PM +0800, Qu Wenruo wrote:
>>>>>
>>>>>
>>> [...]
>>>>
>>>> Kernel log:
>>>>
>>>> 	[96199.614869][ T9676] BTRFS info (device dm-0): balance: start -d
>>>> 	[96199.616086][ T9676] BTRFS info (device dm-0): relocating block group 4396679168000 flags data
>>>> 	[96199.782217][ T9676] BTRFS info (device dm-0): relocating block group 4395605426176 flags data
>>>> 	[96199.971118][ T9676] BTRFS info (device dm-0): relocating block group 4394531684352 flags data
>>>> 	[96220.858317][ T9676] BTRFS info (device dm-0): found 13 extents, loops 1, stage: move data extents
>>>> 	[...]
>>>> 	[121403.509718][ T9676] BTRFS info (device dm-0): found 13 extents, loops 131823, stage: update data pointers
>>>> 	(qemu) stop
>>>>
>>>> btrfs-image URL:
>>>>
>>>> 	http://www.furryterror.org/~zblaxell/tmp/.fsinqz/image.bin
>>>>
>>> The image shows several very strange result.
>>>
>>> For one, although we're relocating block group 4394531684352, the
>>> previous two block groups doesn't really get relocated.
>>>
>>> There are still extents there, all belongs to data reloc tree.
>>>
>>> Furthermore, the data reloc tree inode 620 should be evicted when
>>> previous block group relocation finishes.
>>>
>>> So I'm considering something went wrong in data reloc tree, would you
>>> please try the following diff?
>>> (Either on vanilla kernel or with my previous useless patch)
>>
>> Oh, my previous testing patch is doing wrong inode put for data reloc
>> tree, thus it's possible to lead to such situation.
>>
>> Thankfully the v2 for upstream gets the problem fixed.
>>
>> Thus it goes back to the original stage, still no faster way to
>> reproduce the problem...
> 
> Can we attack the problem by logging kernel activity?  Like can we
> log whenever we add or remove items from the data reloc tree, or
> why we don't?

Sure, the only (and tiny) problem is the amount of logging we'd be adding.

I have no problem adding such debugging, either using plain pr_info() or
trace events, although I prefer plain pr_info().

> 
> I can get a filesystem with a single data block group and a single
> (visible) extent that loops, and somehow it's so easy to do that that I'm
> having problems making filesystems _not_ do it.  What can we do with that?

Even without my data reloc tree patch?

> 
> What am I (and everyone else with this problem) doing that you are not?

I just can't reproduce the bug, that's the biggest problem.

> Usually that difference is "I'm running bees" but we're running out of
> bugs related to LOGICAL_INO and the dedupe ioctl, and I think other people
> are reporting the problem without running bees.  I'm also running balance
> cancels, which seem to increase the repro rate (though they might just
> be increasing the number of balances tested per day, and there could be
> just a fixed percentage of balances that loop).

I tend not to add cancel to the test case, as it may add more complexity
and may be a different problem.

> 
> I will see if I can build a standalone kvm image that generates balance
> loops on blank disks.  If I'm successful, you can download it and then
> run all the experiments you want.

That's great.
Looking forward to that.

Thanks,
Qu
> 
> I also want to see if reverting the extended reloc tree lifespan patch
> (d2311e698578 "btrfs: relocation: Delay reloc tree deletion after
> merge_reloc_roots") stops the looping on misc-next.  I found that
> reverting that patch stops the balance looping on 5.1.21 in an earlier
> experiment.  Maybe there are two bugs here, and we've already fixed one,
> but the symptom won't go away because some second bug has appeared.
> 
> 
>> Thanks,
>> Qu
>>>
>>> Thanks,
>>> Qu
>>>
>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>> index 9afc1a6928cf..ef9e18bab6f6 100644
>>> --- a/fs/btrfs/relocation.c
>>> +++ b/fs/btrfs/relocation.c
>>> @@ -3498,6 +3498,7 @@ struct inode *create_reloc_inode(struct
>>> btrfs_fs_info *fs_info,
>>>         BTRFS_I(inode)->index_cnt = group->start;
>>>
>>>         err = btrfs_orphan_add(trans, BTRFS_I(inode));
>>> +       WARN_ON(atomic_read(inode->i_count) != 1);
>>>  out:
>>>         btrfs_put_root(root);
>>>         btrfs_end_transaction(trans);
>>> @@ -3681,6 +3682,7 @@ int btrfs_relocate_block_group(struct
>>> btrfs_fs_info *fs_info, u64 group_start)
>>>  out:
>>>         if (err && rw)
>>>                 btrfs_dec_block_group_ro(rc->block_group);
>>> +       WARN_ON(atomic_read(inode->i_count) != 1);
>>>         iput(rc->data_inode);
>>>         btrfs_put_block_group(rc->block_group);
>>>         free_reloc_control(rc);
>>>
>>
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-14 17:44                       ` Zygo Blaxell
  2020-05-14 23:43                         ` Qu Wenruo
@ 2020-05-15  6:57                         ` Qu Wenruo
  2020-05-15 15:17                           ` Zygo Blaxell
  1 sibling, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2020-05-15  6:57 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4986 bytes --]



On 2020/5/15 上午1:44, Zygo Blaxell wrote:
> On Thu, May 14, 2020 at 04:55:18PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2020/5/14 下午4:08, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/5/13 下午8:21, Zygo Blaxell wrote:
>>>> On Wed, May 13, 2020 at 07:23:40PM +0800, Qu Wenruo wrote:
>>>>>
>>>>>
>>> [...]
>>>>
>>>> Kernel log:
>>>>
>>>> 	[96199.614869][ T9676] BTRFS info (device dm-0): balance: start -d
>>>> 	[96199.616086][ T9676] BTRFS info (device dm-0): relocating block group 4396679168000 flags data
>>>> 	[96199.782217][ T9676] BTRFS info (device dm-0): relocating block group 4395605426176 flags data
>>>> 	[96199.971118][ T9676] BTRFS info (device dm-0): relocating block group 4394531684352 flags data
>>>> 	[96220.858317][ T9676] BTRFS info (device dm-0): found 13 extents, loops 1, stage: move data extents
>>>> 	[...]
>>>> 	[121403.509718][ T9676] BTRFS info (device dm-0): found 13 extents, loops 131823, stage: update data pointers
>>>> 	(qemu) stop
>>>>
>>>> btrfs-image URL:
>>>>
>>>> 	http://www.furryterror.org/~zblaxell/tmp/.fsinqz/image.bin
>>>>
>>> The image shows several very strange result.
>>>
>>> For one, although we're relocating block group 4394531684352, the
>>> previous two block groups doesn't really get relocated.
>>>
>>> There are still extents there, all belongs to data reloc tree.
>>>
>>> Furthermore, the data reloc tree inode 620 should be evicted when
>>> previous block group relocation finishes.
>>>
>>> So I'm considering something went wrong in data reloc tree, would you
>>> please try the following diff?
>>> (Either on vanilla kernel or with my previous useless patch)
>>
>> Oh, my previous testing patch is doing wrong inode put for data reloc
>> tree, thus it's possible to lead to such situation.
>>
>> Thankfully the v2 for upstream gets the problem fixed.
>>
>> Thus it goes back to the original stage, still no faster way to
>> reproduce the problem...
> 
> Can we attack the problem by logging kernel activity?  Like can we
> log whenever we add or remove items from the data reloc tree, or
> why we don't?
> 
> I can get a filesystem with a single data block group and a single
> (visible) extent that loops, and somehow it's so easy to do that that I'm
> having problems making filesystems _not_ do it.  What can we do with that?

OK, finally got it reproduced, but it's way more complex than I thought.

First, we need to cancel a balance.
Then, if the cancel timing is right, the next balance will always hang.
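
Roughly, the sequence is as follows (just a sketch; the sleep only tries
to land the cancel in the middle of a block group, and the mount point is
a placeholder):

	#!/usr/bin/env python3
	# Reproducer sketch: cancel a balance part-way through, then start another
	# balance and watch the kernel log for the "found N extents, loops M" loop.
	# Timing-dependent; /mnt/scratch is a placeholder mount point.
	import subprocess, time

	MNT = "/mnt/scratch"

	# Start a data balance and let it get into the middle of a block group.
	bal = subprocess.Popen(["btrfs", "balance", "start", "-d", MNT])
	time.sleep(5)

	# Cancel it mid-way; this is the step that has to hit the right window.
	subprocess.run(["btrfs", "balance", "cancel", MNT])
	bal.wait()

	# The next balance is the one that can get stuck looping.
	subprocess.run(["btrfs", "balance", "start", "-d", MNT])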

Furthermore, if the kernel has CONFIG_BTRFS_DEBUG compiled in, it reports
a leaking reloc tree, followed by a NULL pointer dereference.

Now since I can reproduce it reliably, I guess I don't need to bother
you every time I have some new things to try.

Thanks for your report!
Qu

> 
> What am I (and everyone else with this problem) doing that you are not?
> Usually that difference is "I'm running bees" but we're running out of
> bugs related to LOGICAL_INO and the dedupe ioctl, and I think other people
> are reporting the problem without running bees.  I'm also running balance
> cancels, which seem to increase the repro rate (though they might just
> be increasing the number of balances tested per day, and there could be
> just a fixed percentage of balances that loop).
> 
> I will see if I can build a standalone kvm image that generates balance
> loops on blank disks.  If I'm successful, you can download it and then
> run all the experiments you want.
> 
> I also want to see if reverting the extended reloc tree lifespan patch
> (d2311e698578 "btrfs: relocation: Delay reloc tree deletion after
> merge_reloc_roots") stops the looping on misc-next.  I found that
> reverting that patch stops the balance looping on 5.1.21 in an earlier
> experiment.  Maybe there are two bugs here, and we've already fixed one,
> but the symptom won't go away because some second bug has appeared.
> 
> 
>> Thanks,
>> Qu
>>>
>>> Thanks,
>>> Qu
>>>
>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>> index 9afc1a6928cf..ef9e18bab6f6 100644
>>> --- a/fs/btrfs/relocation.c
>>> +++ b/fs/btrfs/relocation.c
>>> @@ -3498,6 +3498,7 @@ struct inode *create_reloc_inode(struct
>>> btrfs_fs_info *fs_info,
>>>         BTRFS_I(inode)->index_cnt = group->start;
>>>
>>>         err = btrfs_orphan_add(trans, BTRFS_I(inode));
>>> +       WARN_ON(atomic_read(inode->i_count) != 1);
>>>  out:
>>>         btrfs_put_root(root);
>>>         btrfs_end_transaction(trans);
>>> @@ -3681,6 +3682,7 @@ int btrfs_relocate_block_group(struct
>>> btrfs_fs_info *fs_info, u64 group_start)
>>>  out:
>>>         if (err && rw)
>>>                 btrfs_dec_block_group_ro(rc->block_group);
>>> +       WARN_ON(atomic_read(inode->i_count) != 1);
>>>         iput(rc->data_inode);
>>>         btrfs_put_block_group(rc->block_group);
>>>         free_reloc_control(rc);
>>>
>>
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-15  6:57                         ` Qu Wenruo
@ 2020-05-15 15:17                           ` Zygo Blaxell
  2020-05-18  5:25                             ` Qu Wenruo
  2020-05-20  7:27                             ` Qu Wenruo
  0 siblings, 2 replies; 32+ messages in thread
From: Zygo Blaxell @ 2020-05-15 15:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, May 15, 2020 at 02:57:52PM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/5/15 上午1:44, Zygo Blaxell wrote:
> > On Thu, May 14, 2020 at 04:55:18PM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2020/5/14 下午4:08, Qu Wenruo wrote:
> >>>
> >>>
> >>> On 2020/5/13 下午8:21, Zygo Blaxell wrote:
> >>>> On Wed, May 13, 2020 at 07:23:40PM +0800, Qu Wenruo wrote:
> >>>>>
> >>>>>
> >>> [...]
> >>>>
> >>>> Kernel log:
> >>>>
> >>>> 	[96199.614869][ T9676] BTRFS info (device dm-0): balance: start -d
> >>>> 	[96199.616086][ T9676] BTRFS info (device dm-0): relocating block group 4396679168000 flags data
> >>>> 	[96199.782217][ T9676] BTRFS info (device dm-0): relocating block group 4395605426176 flags data
> >>>> 	[96199.971118][ T9676] BTRFS info (device dm-0): relocating block group 4394531684352 flags data
> >>>> 	[96220.858317][ T9676] BTRFS info (device dm-0): found 13 extents, loops 1, stage: move data extents
> >>>> 	[...]
> >>>> 	[121403.509718][ T9676] BTRFS info (device dm-0): found 13 extents, loops 131823, stage: update data pointers
> >>>> 	(qemu) stop
> >>>>
> >>>> btrfs-image URL:
> >>>>
> >>>> 	http://www.furryterror.org/~zblaxell/tmp/.fsinqz/image.bin
> >>>>
> >>> The image shows several very strange result.
> >>>
> >>> For one, although we're relocating block group 4394531684352, the
> >>> previous two block groups doesn't really get relocated.
> >>>
> >>> There are still extents there, all belongs to data reloc tree.
> >>>
> >>> Furthermore, the data reloc tree inode 620 should be evicted when
> >>> previous block group relocation finishes.
> >>>
> >>> So I'm considering something went wrong in data reloc tree, would you
> >>> please try the following diff?
> >>> (Either on vanilla kernel or with my previous useless patch)
> >>
> >> Oh, my previous testing patch is doing wrong inode put for data reloc
> >> tree, thus it's possible to lead to such situation.
> >>
> >> Thankfully the v2 for upstream gets the problem fixed.
> >>
> >> Thus it goes back to the original stage, still no faster way to
> >> reproduce the problem...
> > 
> > Can we attack the problem by logging kernel activity?  Like can we
> > log whenever we add or remove items from the data reloc tree, or
> > why we don't?
> > 
> > I can get a filesystem with a single data block group and a single
> > (visible) extent that loops, and somehow it's so easy to do that that I'm
> > having problems making filesystems _not_ do it.  What can we do with that?
> 
> OK, finally got it reproduced, but it's way more complex than I thought.
> 
> First, we need to cancel some balance.
> Then if the canceling timing is good, next balance will always hang.

I've seen that, but it doesn't seem to be causative, i.e. you can use
balance cancel to trigger the problem more often, but cancel doesn't
seem to cause the problem itself.

I have been running the fast balance cancel patches on kernel 5.0 (this
is our current production kernel).  Balances can be cancelled on that
kernel with no looping.  I don't know if the cancel leaves reloc trees
in weird states, but the reloc roots merging code manages to clean them
up and break balance out of the loop.

Loops did occur in test runs before fast balance cancels (or balance
cancels at all) and others have reported similar issues without patched
kernels; however, those older observations would be on kernels 5.2 or
5.3 which had severe UAF bugs due to the delayed reloc roots change.

A lot of weird random stuff would happen during balances on older kernels
that stopped after the UAF bug fix in 5.4.14; however, the balance loops
persist.

> Furthermore, if the kernel has CONFIG_BTRFS_DEBUG compiled, the kernel
> would report leaking reloc tree, then followed by NULL pointer dereference.

That I have not seen.  I'm running misc-next, and there were some fixes
for NULL derefs caught by the new reference tracking code.  Maybe it's
already been fixed?

> Now since I can reproduce it reliably, I guess I don't need to bother
> you every time I have some new things to try.
> 
> Thanks for your report!
> Qu
> 
> > 
> > What am I (and everyone else with this problem) doing that you are not?
> > Usually that difference is "I'm running bees" but we're running out of
> > bugs related to LOGICAL_INO and the dedupe ioctl, and I think other people
> > are reporting the problem without running bees.  I'm also running balance
> > cancels, which seem to increase the repro rate (though they might just
> > be increasing the number of balances tested per day, and there could be
> > just a fixed percentage of balances that loop).
> > 
> > I will see if I can build a standalone kvm image that generates balance
> > loops on blank disks.  If I'm successful, you can download it and then
> > run all the experiments you want.
> > 
> > I also want to see if reverting the extended reloc tree lifespan patch
> > (d2311e698578 "btrfs: relocation: Delay reloc tree deletion after
> > merge_reloc_roots") stops the looping on misc-next.  I found that
> > reverting that patch stops the balance looping on 5.1.21 in an earlier
> > experiment.  Maybe there are two bugs here, and we've already fixed one,
> > but the symptom won't go away because some second bug has appeared.

I completed this experiment.  I reverted the delay reloc tree commit,
which required also reverting all the bug fixes on top of delay reloc
tree in later kernels...

	Revert "btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots"
	Revert "btrfs: reloc: Fix NULL pointer dereference due to expanded reloc_root lifespan"
	Revert "btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()"
	Revert "btrfs: relocation: fix use-after-free on dead relocation roots"
	Revert "btrfs: relocation: fix reloc_root lifespan and access"
	Revert "btrfs: reloc: clean dirty subvols if we fail to start a transaction"
	Revert "btrfs: unset reloc control if we fail to recover"
	Revert "btrfs: fix transaction leak in btrfs_recover_relocation"

This test kernel also has fast balance cancel backported:

	btrfs: relocation: Check cancel request after each extent found
	btrfs: relocation: Check cancel request after each data page read
	btrfs: relocation: add error injection points for cancelling balance

My test kernel is based on 5.4.40.  On 5.7-rc kernels there are a lot
of changes for refcounting roots, too many for mere git reverts
to unwind.

I ran it for a while with randomly scheduled balances and cancels: 65
block groups, 47 balance cancels, 20 block groups completed, 0 extra
loops.  With the delay reloc tree commit in place it's normally not more
than 5 block groups before looping starts.

> > 
> >> Thanks,
> >> Qu
> >>>
> >>> Thanks,
> >>> Qu
> >>>
> >>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> >>> index 9afc1a6928cf..ef9e18bab6f6 100644
> >>> --- a/fs/btrfs/relocation.c
> >>> +++ b/fs/btrfs/relocation.c
> >>> @@ -3498,6 +3498,7 @@ struct inode *create_reloc_inode(struct
> >>> btrfs_fs_info *fs_info,
> >>>         BTRFS_I(inode)->index_cnt = group->start;
> >>>
> >>>         err = btrfs_orphan_add(trans, BTRFS_I(inode));
> >>> +       WARN_ON(atomic_read(inode->i_count) != 1);
> >>>  out:
> >>>         btrfs_put_root(root);
> >>>         btrfs_end_transaction(trans);
> >>> @@ -3681,6 +3682,7 @@ int btrfs_relocate_block_group(struct
> >>> btrfs_fs_info *fs_info, u64 group_start)
> >>>  out:
> >>>         if (err && rw)
> >>>                 btrfs_dec_block_group_ro(rc->block_group);
> >>> +       WARN_ON(atomic_read(inode->i_count) != 1);
> >>>         iput(rc->data_inode);
> >>>         btrfs_put_block_group(rc->block_group);
> >>>         free_reloc_control(rc);
> >>>
> >>
> > 
> > 
> > 
> 




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-15 15:17                           ` Zygo Blaxell
@ 2020-05-18  5:25                             ` Qu Wenruo
  2020-05-20  7:27                             ` Qu Wenruo
  1 sibling, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2020-05-18  5:25 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 8203 bytes --]



On 2020/5/15 下午11:17, Zygo Blaxell wrote:
> On Fri, May 15, 2020 at 02:57:52PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2020/5/15 上午1:44, Zygo Blaxell wrote:
>>> On Thu, May 14, 2020 at 04:55:18PM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2020/5/14 下午4:08, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2020/5/13 下午8:21, Zygo Blaxell wrote:
>>>>>> On Wed, May 13, 2020 at 07:23:40PM +0800, Qu Wenruo wrote:
>>>>>>>
>>>>>>>
>>>>> [...]
>>>>>>
>>>>>> Kernel log:
>>>>>>
>>>>>> 	[96199.614869][ T9676] BTRFS info (device dm-0): balance: start -d
>>>>>> 	[96199.616086][ T9676] BTRFS info (device dm-0): relocating block group 4396679168000 flags data
>>>>>> 	[96199.782217][ T9676] BTRFS info (device dm-0): relocating block group 4395605426176 flags data
>>>>>> 	[96199.971118][ T9676] BTRFS info (device dm-0): relocating block group 4394531684352 flags data
>>>>>> 	[96220.858317][ T9676] BTRFS info (device dm-0): found 13 extents, loops 1, stage: move data extents
>>>>>> 	[...]
>>>>>> 	[121403.509718][ T9676] BTRFS info (device dm-0): found 13 extents, loops 131823, stage: update data pointers
>>>>>> 	(qemu) stop
>>>>>>
>>>>>> btrfs-image URL:
>>>>>>
>>>>>> 	http://www.furryterror.org/~zblaxell/tmp/.fsinqz/image.bin
>>>>>>
>>>>> The image shows several very strange result.
>>>>>
>>>>> For one, although we're relocating block group 4394531684352, the
>>>>> previous two block groups doesn't really get relocated.
>>>>>
>>>>> There are still extents there, all belongs to data reloc tree.
>>>>>
>>>>> Furthermore, the data reloc tree inode 620 should be evicted when
>>>>> previous block group relocation finishes.
>>>>>
>>>>> So I'm considering something went wrong in data reloc tree, would you
>>>>> please try the following diff?
>>>>> (Either on vanilla kernel or with my previous useless patch)
>>>>
>>>> Oh, my previous testing patch is doing wrong inode put for data reloc
>>>> tree, thus it's possible to lead to such situation.
>>>>
>>>> Thankfully the v2 for upstream gets the problem fixed.
>>>>
>>>> Thus it goes back to the original stage, still no faster way to
>>>> reproduce the problem...
>>>
>>> Can we attack the problem by logging kernel activity?  Like can we
>>> log whenever we add or remove items from the data reloc tree, or
>>> why we don't?
>>>
>>> I can get a filesystem with a single data block group and a single
>>> (visible) extent that loops, and somehow it's so easy to do that that I'm
>>> having problems making filesystems _not_ do it.  What can we do with that?
>>
>> OK, finally got it reproduced, but it's way more complex than I thought.
>>
>> First, we need to cancel some balance.
>> Then if the canceling timing is good, next balance will always hang.
> 
> I've seen that, but it doesn't seem to be causative, i.e. you can use
> balance cancel to trigger the problem more often, but cancel doesn't
> seem to cause the problem itself.
> 
> I have been running the fast balance cancel patches on kernel 5.0 (this
> is our current production kernel).  Balances can be cancelled on that
> kernel with no looping.  I don't know if the cancel leaves reloc trees
> in weird states, but the reloc roots merging code manages to clean them
> up and break balance out of the loop.
> 
> Loops did occur in test runs before fast balance cancels (or balance
> cancels at all) and others have reported similar issues without patched
> kernels; however, those older observations would be on kernels 5.2 or
> 5.3 which had severe UAF bugs due to the delayed reloc roots change.
> 
> A lot of weird random stuff would happen during balances on older kernels
> that stopped after the UAF bug fix in 5.4.14; however, the balance loops
> persist.
> 
>> Furthermore, if the kernel has CONFIG_BTRFS_DEBUG compiled, the kernel
>> would report leaking reloc tree, then followed by NULL pointer dereference.
> 
> That I have not seen.  I'm running misc-next, and there were some fixes
> for NULL derefs caught by the new reference tracking code.  Maybe it's
> already been fixed?

The latest misc-next still shows the NULL pointer dereference, so it's
something persistent, which should help me pin down the bug.

Thanks,
Qu

> 
>> Now since I can reproduce it reliably, I guess I don't need to bother
>> you every time I have some new things to try.
>>
>> Thanks for your report!
>> Qu
>>
>>>
>>> What am I (and everyone else with this problem) doing that you are not?
>>> Usually that difference is "I'm running bees" but we're running out of
>>> bugs related to LOGICAL_INO and the dedupe ioctl, and I think other people
>>> are reporting the problem without running bees.  I'm also running balance
>>> cancels, which seem to increase the repro rate (though they might just
>>> be increasing the number of balances tested per day, and there could be
>>> just a fixed percentage of balances that loop).
>>>
>>> I will see if I can build a standalone kvm image that generates balance
>>> loops on blank disks.  If I'm successful, you can download it and then
>>> run all the experiments you want.
>>>
>>> I also want to see if reverting the extended reloc tree lifespan patch
>>> (d2311e698578 "btrfs: relocation: Delay reloc tree deletion after
>>> merge_reloc_roots") stops the looping on misc-next.  I found that
>>> reverting that patch stops the balance looping on 5.1.21 in an earlier
>>> experiment.  Maybe there are two bugs here, and we've already fixed one,
>>> but the symptom won't go away because some second bug has appeared.
> 
> I completed this experiment.  I reverted the delay reloc tree commit,
> which required also reverting all the bug fixes on top of delay reloc
> tree in later kernels...
> 
> 	Revert "btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots"
> 	Revert "btrfs: reloc: Fix NULL pointer dereference due to expanded reloc_root lifespan"
> 	Revert "btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()"
> 	Revert "btrfs: relocation: fix use-after-free on dead relocation roots"
> 	Revert "btrfs: relocation: fix reloc_root lifespan and access"
> 	Revert "btrfs: reloc: clean dirty subvols if we fail to start a transaction"
> 	Revert "btrfs: unset reloc control if we fail to recover"
> 	Revert "btrfs: fix transaction leak in btrfs_recover_relocation"
> 
> This test kernel also has fast balance cancel backported:
> 
> 	btrfs: relocation: Check cancel request after each extent found
> 	btrfs: relocation: Check cancel request after each data page read
> 	btrfs: relocation: add error injection points for cancelling balance
> 
> My test kernel is based on 5.4.40.  On 5.7-rc kernels there's a lot
> of changes for refcounting roots that are too much for mere git reverts
> to unwind.
> 
> I ran it for a while with randomly scheduled balances and cancels: 65
> block groups, 47 balance cancels, 20 block groups completed, 0 extra
> loops.  With the delay reloc tree commit in place it's normally not more
> than 5 block groups before looping starts.
> 
>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>>>> index 9afc1a6928cf..ef9e18bab6f6 100644
>>>>> --- a/fs/btrfs/relocation.c
>>>>> +++ b/fs/btrfs/relocation.c
>>>>> @@ -3498,6 +3498,7 @@ struct inode *create_reloc_inode(struct
>>>>> btrfs_fs_info *fs_info,
>>>>>         BTRFS_I(inode)->index_cnt = group->start;
>>>>>
>>>>>         err = btrfs_orphan_add(trans, BTRFS_I(inode));
>>>>> +       WARN_ON(atomic_read(inode->i_count) != 1);
>>>>>  out:
>>>>>         btrfs_put_root(root);
>>>>>         btrfs_end_transaction(trans);
>>>>> @@ -3681,6 +3682,7 @@ int btrfs_relocate_block_group(struct
>>>>> btrfs_fs_info *fs_info, u64 group_start)
>>>>>  out:
>>>>>         if (err && rw)
>>>>>                 btrfs_dec_block_group_ro(rc->block_group);
>>>>> +       WARN_ON(atomic_read(inode->i_count) != 1);
>>>>>         iput(rc->data_inode);
>>>>>         btrfs_put_block_group(rc->block_group);
>>>>>         free_reloc_control(rc);
>>>>>
>>>>
>>>
>>>
>>>
>>
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-15 15:17                           ` Zygo Blaxell
  2020-05-18  5:25                             ` Qu Wenruo
@ 2020-05-20  7:27                             ` Qu Wenruo
  2020-05-21  3:26                               ` Zygo Blaxell
  1 sibling, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2020-05-20  7:27 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 6102 bytes --]



On 2020/5/15 下午11:17, Zygo Blaxell wrote:
>>
>> OK, finally got it reproduced, but it's way more complex than I thought.
>>
>> First, we need to cancel some balance.
>> Then if the canceling timing is good, next balance will always hang.
> 
> I've seen that, but it doesn't seem to be causative, i.e. you can use
> balance cancel to trigger the problem more often, but cancel doesn't
> seem to cause the problem itself.
> 
> I have been running the fast balance cancel patches on kernel 5.0 (this
> is our current production kernel).  Balances can be cancelled on that
> kernel with no looping.  I don't know if the cancel leaves reloc trees
> in weird states, but the reloc roots merging code manages to clean them
> up and break balance out of the loop.

Finally got it pinned down and fixed.
You can fetch the fixes here (2 small fixes would solve the problem):
https://patchwork.kernel.org/project/linux-btrfs/list/?series=290655

The cause is indeed related to that patch.
And it doesn't need a cancel to reproduce; ENOSPC can also trigger it,
which also matches what I'm seeing internally at SUSE.

The fix is small, already passes all my local tests, and I believe David
will push it to upstream soon, but extra testing wouldn't hurt.

Thank you very much for your long term testing and involvement in btrfs!
Qu

> 
> Loops did occur in test runs before fast balance cancels (or balance
> cancels at all) and others have reported similar issues without patched
> kernels; however, those older observations would be on kernels 5.2 or
> 5.3 which had severe UAF bugs due to the delayed reloc roots change.
> 
> A lot of weird random stuff would happen during balances on older kernels
> that stopped after the UAF bug fix in 5.4.14; however, the balance loops
> persist.
> 
>> Furthermore, if the kernel has CONFIG_BTRFS_DEBUG compiled, the kernel
>> would report leaking reloc tree, then followed by NULL pointer dereference.
> 
> That I have not seen.  I'm running misc-next, and there were some fixes
> for NULL derefs caught by the new reference tracking code.  Maybe it's
> already been fixed?
> 
>> Now since I can reproduce it reliably, I guess I don't need to bother
>> you every time I have some new things to try.
>>
>> Thanks for your report!
>> Qu
>>
>>>
>>> What am I (and everyone else with this problem) doing that you are not?
>>> Usually that difference is "I'm running bees" but we're running out of
>>> bugs related to LOGICAL_INO and the dedupe ioctl, and I think other people
>>> are reporting the problem without running bees.  I'm also running balance
>>> cancels, which seem to increase the repro rate (though they might just
>>> be increasing the number of balances tested per day, and there could be
>>> just a fixed percentage of balances that loop).
>>>
>>> I will see if I can build a standalone kvm image that generates balance
>>> loops on blank disks.  If I'm successful, you can download it and then
>>> run all the experiments you want.
>>>
>>> I also want to see if reverting the extended reloc tree lifespan patch
>>> (d2311e698578 "btrfs: relocation: Delay reloc tree deletion after
>>> merge_reloc_roots") stops the looping on misc-next.  I found that
>>> reverting that patch stops the balance looping on 5.1.21 in an earlier
>>> experiment.  Maybe there are two bugs here, and we've already fixed one,
>>> but the symptom won't go away because some second bug has appeared.
> 
> I completed this experiment.  I reverted the delay reloc tree commit,
> which required also reverting all the bug fixes on top of delay reloc
> tree in later kernels...
> 
> 	Revert "btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots"
> 	Revert "btrfs: reloc: Fix NULL pointer dereference due to expanded reloc_root lifespan"
> 	Revert "btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()"
> 	Revert "btrfs: relocation: fix use-after-free on dead relocation roots"
> 	Revert "btrfs: relocation: fix reloc_root lifespan and access"
> 	Revert "btrfs: reloc: clean dirty subvols if we fail to start a transaction"
> 	Revert "btrfs: unset reloc control if we fail to recover"
> 	Revert "btrfs: fix transaction leak in btrfs_recover_relocation"
> 
> This test kernel also has fast balance cancel backported:
> 
> 	btrfs: relocation: Check cancel request after each extent found
> 	btrfs: relocation: Check cancel request after each data page read
> 	btrfs: relocation: add error injection points for cancelling balance
> 
> My test kernel is based on 5.4.40.  On 5.7-rc kernels there's a lot
> of changes for refcounting roots that are too much for mere git reverts
> to unwind.
> 
> I ran it for a while with randomly scheduled balances and cancels: 65
> block groups, 47 balance cancels, 20 block groups completed, 0 extra
> loops.  With the delay reloc tree commit in place it's normally not more
> than 5 block groups before looping starts.
> 
>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>>>> index 9afc1a6928cf..ef9e18bab6f6 100644
>>>>> --- a/fs/btrfs/relocation.c
>>>>> +++ b/fs/btrfs/relocation.c
>>>>> @@ -3498,6 +3498,7 @@ struct inode *create_reloc_inode(struct
>>>>> btrfs_fs_info *fs_info,
>>>>>         BTRFS_I(inode)->index_cnt = group->start;
>>>>>
>>>>>         err = btrfs_orphan_add(trans, BTRFS_I(inode));
>>>>> +       WARN_ON(atomic_read(inode->i_count) != 1);
>>>>>  out:
>>>>>         btrfs_put_root(root);
>>>>>         btrfs_end_transaction(trans);
>>>>> @@ -3681,6 +3682,7 @@ int btrfs_relocate_block_group(struct
>>>>> btrfs_fs_info *fs_info, u64 group_start)
>>>>>  out:
>>>>>         if (err && rw)
>>>>>                 btrfs_dec_block_group_ro(rc->block_group);
>>>>> +       WARN_ON(atomic_read(inode->i_count) != 1);
>>>>>         iput(rc->data_inode);
>>>>>         btrfs_put_block_group(rc->block_group);
>>>>>         free_reloc_control(rc);
>>>>>
>>>>
>>>
>>>
>>>
>>
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Balance loops: what we know so far
  2020-05-20  7:27                             ` Qu Wenruo
@ 2020-05-21  3:26                               ` Zygo Blaxell
  0 siblings, 0 replies; 32+ messages in thread
From: Zygo Blaxell @ 2020-05-21  3:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, May 20, 2020 at 03:27:24PM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/5/15 下午11:17, Zygo Blaxell wrote:
> >>
> >> OK, finally got it reproduced, but it's way more complex than I thought.
> >>
> >> First, we need to cancel some balance.
> >> Then if the canceling timing is good, next balance will always hang.
> > 
> > I've seen that, but it doesn't seem to be causative, i.e. you can use
> > balance cancel to trigger the problem more often, but cancel doesn't
> > seem to cause the problem itself.
> > 
> > I have been running the fast balance cancel patches on kernel 5.0 (this
> > is our current production kernel).  Balances can be cancelled on that
> > kernel with no looping.  I don't know if the cancel leaves reloc trees
> > in weird states, but the reloc roots merging code manages to clean them
> > up and break balance out of the loop.
> 
> Finally got it pinned down and fixed.
> You can fetch the fixes here (2 small fixes would solve the problem):
> https://patchwork.kernel.org/project/linux-btrfs/list/?series=290655
> 
> The cause is indeed related to that patch.
> And it doesn't need cancel to reproduce, ENOSPC can also trigger it,
> which also matches what I see internally from SUSE.

I can confirm a likely correlation between ENOSPC and balance loops,
though I don't have exactly the right test data.  The filesystems where
loops were observed without balance cancels were small, and tended to
fill up from time to time--exactly the conditions when a balance might
be required to free up some space for metadata growth and when balance
might hit ENOSPC.

My test VMs are small and hit ENOSPC from time to time.  When I removed
balance cancels from the test, I guess ENOSPC was still triggering
the looping bug.  That also explains why it took a few days for a test
filesystem to get into a state where looping occurred without balance
cancels--that's how long it takes to fill up the disk with undeleted
snapshots.

The big filesystems didn't hit ENOSPC, but we sometimes used ad-hoc
balance cancels to force balance to stop if it was still running past
the end of a maintenance window.  That didn't happen often enough for an
obvious correlation to pop up, but that's an artifact of our response
to the balance looping.  The only way to fix the loops was a reboot,
and we don't like combining "reboot" and "often" on production machines,
so we downgraded those to 4.19 to wait out the bug.  When we did that,
we lost an important subset of test data pointing directly at how to
reproduce the bug.

Also we have some big servers that have been quietly running 5.4
kernels with no problems at all.  No ENOSPC, no balance cancel, and no
balance loops.  So the correlation holds for the negative conditions
and result too.

Good!  All the above means we had one bug with two trigger conditions,
not two distinct bugs.

> Although the fix is small and already passes all my local tests, and I
> believe David will push it upstream soon, extra testing wouldn't hurt.

I've been running misc-next with these patches on a test VM today.
The test workload contains both ENOSPC and balance cancel events.
No balance loops so far.  Looks like you nailed it.
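
For reference, the test workload boils down to relocating block groups one
at a time while randomly cancelling some of the balances and occasionally
pushing the filesystem toward ENOSPC.  A rough sketch of that loop follows;
the mount point, timings, probabilities, and filler size are all my guesses,
not the actual test script:

	#!/usr/bin/env python3
	# Rough sketch of a balance/cancel/ENOSPC stress loop, not the real harness.
	import random
	import subprocess
	import time

	MNT = "/mnt/testfs"  # assumed test mount point

	def fill_some_space():
	    # Push the filesystem toward ENOSPC now and then (size is arbitrary)
	    subprocess.run(["dd", "if=/dev/urandom", f"of={MNT}/filler",
	                    "bs=1M", "count=2048"])

	while True:
	    # Relocate a single data block group per iteration
	    bal = subprocess.Popen(["btrfs", "balance", "start", "-dlimit=1", MNT])
	    if random.random() < 0.5:
	        # Cancel roughly half of the balances after a random delay
	        time.sleep(random.uniform(1, 30))
	        subprocess.run(["btrfs", "balance", "cancel", MNT])
	    bal.wait()
	    if random.random() < 0.1:
	        fill_some_space()

Something along these lines on a small filesystem should exercise both
trigger conditions discussed above (cancel and ENOSPC) without needing the
full snapshot/dedupe workload.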

> Thank you very much for your long-term testing and involvement in btrfs!

And thank you for the fix!

> Qu
> 
> > 
> > Loops did occur in test runs before fast balance cancels (or balance
> > cancels at all), and others have reported similar issues without patched
> > kernels; however, those older observations would have been on kernels 5.2
> > or 5.3, which had severe UAF bugs due to the delayed reloc roots change.
> > 
> > A lot of weird random stuff would happen during balances on older kernels;
> > that stopped after the UAF bug fix in 5.4.14, but the balance loops
> > persist.
> > 
> >> Furthermore, if the kernel has CONFIG_BTRFS_DEBUG compiled, the kernel
> >> would report a leaking reloc tree, followed by a NULL pointer dereference.
> > 
> > That I have not seen.  I'm running misc-next, and there were some fixes
> > for NULL derefs caught by the new reference tracking code.  Maybe it's
> > already been fixed?
> > 
> >> Now that I can reproduce it reliably, I guess I don't need to bother
> >> you every time I have something new to try.
> >>
> >> Thanks for your report!
> >> Qu
> >>
> >>>
> >>> What am I (and everyone else with this problem) doing that you are not?
> >>> Usually that difference is "I'm running bees" but we're running out of
> >>> bugs related to LOGICAL_INO and the dedupe ioctl, and I think other people
> >>> are reporting the problem without running bees.  I'm also running balance
> >>> cancels, which seem to increase the repro rate (though they might just
> >>> be increasing the number of balances tested per day, and there could be
> >>> just a fixed percentage of balances that loop).
> >>>
> >>> I will see if I can build a standalone kvm image that generates balance
> >>> loops on blank disks.  If I'm successful, you can download it and then
> >>> run all the experiments you want.
> >>>
> >>> I also want to see if reverting the extended reloc tree lifespan patch
> >>> (d2311e698578 "btrfs: relocation: Delay reloc tree deletion after
> >>> merge_reloc_roots") stops the looping on misc-next.  I found that
> >>> reverting that patch stops the balance looping on 5.1.21 in an earlier
> >>> experiment.  Maybe there are two bugs here, and we've already fixed one,
> >>> but the symptom won't go away because some second bug has appeared.
> > 
> > I completed this experiment.  I reverted the delay reloc tree commit,
> > which required also reverting all the bug fixes on top of delay reloc
> > tree in later kernels...
> > 
> > 	Revert "btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots"
> > 	Revert "btrfs: reloc: Fix NULL pointer dereference due to expanded reloc_root lifespan"
> > 	Revert "btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()"
> > 	Revert "btrfs: relocation: fix use-after-free on dead relocation roots"
> > 	Revert "btrfs: relocation: fix reloc_root lifespan and access"
> > 	Revert "btrfs: reloc: clean dirty subvols if we fail to start a transaction"
> > 	Revert "btrfs: unset reloc control if we fail to recover"
> > 	Revert "btrfs: fix transaction leak in btrfs_recover_relocation"
> > 
> > This test kernel also has fast balance cancel backported:
> > 
> > 	btrfs: relocation: Check cancel request after each extent found
> > 	btrfs: relocation: Check cancel request after each data page read
> > 	btrfs: relocation: add error injection points for cancelling balance
> > 
> > My test kernel is based on 5.4.40.  On 5.7-rc kernels there are a lot
> > of changes for refcounting roots that are too much for mere git reverts
> > to unwind.
> > 
> > I ran it for a while with randomly scheduled balances and cancels: 65
> > block groups, 47 balance cancels, 20 block groups completed, 0 extra
> > loops.  With the delay reloc tree commit in place it's normally not more
> > than 5 block groups before looping starts.
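
The revert stack listed above could be scripted along the following lines.
This is only a sketch: the branch being searched and the newest-first revert
order are my assumptions, not details from the thread.

	#!/usr/bin/env python3
	# Sketch only: look up the listed commits by subject on the current test
	# branch and revert them newest-first so later fixes are unwound before
	# the commits they depend on.
	import subprocess

	SUBJECTS = [
	    "btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots",
	    "btrfs: reloc: Fix NULL pointer dereference due to expanded reloc_root lifespan",
	    "btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()",
	    "btrfs: relocation: fix use-after-free on dead relocation roots",
	    "btrfs: relocation: fix reloc_root lifespan and access",
	    "btrfs: reloc: clean dirty subvols if we fail to start a transaction",
	    "btrfs: unset reloc control if we fail to recover",
	    "btrfs: fix transaction leak in btrfs_recover_relocation",
	]

	def sha_for(subject):
	    # Most recent commit on HEAD whose log message contains the exact subject
	    sha = subprocess.check_output(
	        ["git", "log", "-1", "--fixed-strings", "--grep=" + subject,
	         "--format=%H", "HEAD"], text=True).strip()
	    assert sha, "no commit found for: " + subject
	    return sha

	shas = [sha_for(s) for s in SUBJECTS]
	# Sort the set newest-first before reverting
	ordered = subprocess.check_output(
	    ["git", "rev-list", "--no-walk=sorted"] + shas, text=True).split()
	for sha in ordered:
	    subprocess.run(["git", "revert", "--no-edit", sha], check=True)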
> > 
> >>>
> >>>> Thanks,
> >>>> Qu
> >>>>>
> >>>>> Thanks,
> >>>>> Qu
> >>>>>
> >>>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> >>>>> index 9afc1a6928cf..ef9e18bab6f6 100644
> >>>>> --- a/fs/btrfs/relocation.c
> >>>>> +++ b/fs/btrfs/relocation.c
> >>>>> @@ -3498,6 +3498,7 @@ struct inode *create_reloc_inode(struct btrfs_fs_info *fs_info,
> >>>>>         BTRFS_I(inode)->index_cnt = group->start;
> >>>>>
> >>>>>         err = btrfs_orphan_add(trans, BTRFS_I(inode));
> >>>>> +       WARN_ON(atomic_read(&inode->i_count) != 1);
> >>>>>  out:
> >>>>>         btrfs_put_root(root);
> >>>>>         btrfs_end_transaction(trans);
> >>>>> @@ -3681,6 +3682,7 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
> >>>>>  out:
> >>>>>         if (err && rw)
> >>>>>                 btrfs_dec_block_group_ro(rc->block_group);
> >>>>> +       WARN_ON(atomic_read(&inode->i_count) != 1);
> >>>>>         iput(rc->data_inode);
> >>>>>         btrfs_put_block_group(rc->block_group);
> >>>>>         free_reloc_control(rc);
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> > 
> > 
> > 
> 



