From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Balance loops: what we know so far
Date: Tue, 28 Apr 2020 10:51:46 -0400
Message-ID: <20200428145145.GB10796@hungrycats.org>
In-Reply-To: <ea42b9cb-3754-3f47-8d3c-208760e1c2ac@gmx.com>
On Tue, Apr 28, 2020 at 05:54:21PM +0800, Qu Wenruo wrote:
>
>
> On 2020/4/28 下午12:55, Zygo Blaxell wrote:
> > On Mon, Apr 27, 2020 at 03:07:29PM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2020/4/12 上午5:14, Zygo Blaxell wrote:
> >>> Since 5.1, btrfs has been prone to getting stuck in semi-infinite loops
> >>> in balance and device shrink/remove:
> >>>
> >>> [Sat Apr 11 16:59:32 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> [Sat Apr 11 16:59:33 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> [Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> [Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> [Sat Apr 11 16:59:35 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>
> >>> This is a block group while it's looping, as seen by python-btrfs:
> >>>
> >>> # share/python-btrfs/examples/show_block_group_contents.py 1934913175552 /media/testfs/
> >>> block group vaddr 1934913175552 length 1073741824 flags DATA used 939167744 used_pct 87
> [...]
> >>>
> >>> All of the extent data backrefs are removed by the balance, but the
> >>> loop keeps trying to get rid of the shared data backrefs. It has
> >>> no effect on them, but keeps trying anyway.
> >>
> >> I guess this shows a pretty good clue.
> >>
> >> I was always thinking about the reloc tree, but in your case, it's the
> >> data reloc tree owning them.
> >
> > In that case, yes. Metadata balances loop too, in the "move data extents"
> > stage while data balances loop in the "update data pointers" stage.
>
> Would you please take an image dump of the fs when the runaway balance happens?
>
> Dumps of both a looping metadata and a looping data block group would greatly help.
There are two problems with this:
1) my smallest test filesystems have 29GB of metadata,
2) the problem is not reproducible with an image.
I've tried using VM snapshots to put a filesystem into a reproducible
looping state. A block group that loops on one boot doesn't repeatably
loop on another boot from the same initial state; however, once a
booted system starts looping, it keeps looping even after the balance
is cancelled and restarted, whether on the same block group or on
random other block groups.
I have production filesystems with tens of thousands of block groups
and almost all of them loop (as I said before, I cannot complete any
RAID reshapes with 5.1+ kernels). They can't _all_ be bad.
Cancelling a balance (usually) doesn't recover. Rebooting does.
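For concreteness, the cancel/restart cycle I've been testing looks
roughly like this (the mount point and block group vaddr are from the
example earlier in this thread; the vrange end is just vaddr + the
block group's length):

```shell
# Cancel the looping balance (this usually does not clear the loop state):
btrfs balance cancel /media/testfs

# Restart balance against a single block group using the vrange filter,
# giving the block group's virtual address range (vaddr..vaddr+length):
btrfs balance start -dvrange=1934913175552..1935986917376 /media/testfs
```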
The commit that triggered this reorders operations in the kernel.
That smells like a runtime thing to me.
> >> In that case, data reloc tree is only cleaned up at the end of
> >> btrfs_relocate_block_group().
> >>
> >> Thus it is never cleaned up until we exit the balance loop.
> >>
> >> I'm not sure why this is happening only after I extended the lifespan of
> >> reloc tree (not data reloc tree).
> >
> > I have been poking around with printk to trace what it's doing in the
> > looping and non-looping cases. It seems to be very similar up to
> > calling merge_reloc_root, merge_reloc_roots, unset_reloc_control,
> > btrfs_block_rsv_release, btrfs_commit_transaction, clean_dirty_subvols,
> > btrfs_free_block_rsv. In the looping cases, everything up to those
> > functions seems the same on every loop except the first one.
> >
> > In the non-looping cases, those functions do something different than
> > the looping cases: the extents disappear in the next loop, and the
> > balance finishes.
> >
> > I haven't figured out _what_ is different yet. I need more cycles to
> > look at it.
> >
> > Your extend-the-lifespan-of-reloc-tree patch moves one of the
> > functions--clean_dirty_subvols (or btrfs_drop_snapshot)--to a different
> > place in the call sequence. It was in merge_reloc_roots before the
> > transaction commit, now it's in relocate_block_group after transaction
> > commit. My guess is that the problem lies somewhere in how the behavior
> > of these functions has been changed by calling them in a different
> > sequence.
> >
> >> But anyway, would you like to give a try of the following patch?
> >> https://patchwork.kernel.org/patch/11511241/
> >
> > I'm not sure how this patch could work. We are hitting the found_extents
> > counter every time through the loop. It's returning thousands of extents
> > each time.
> >
> >> It should make us exit the balance so long as we have no extra
> >> extent to relocate.
> >
> > The problem is not that we have no extents to relocate. The problem is
> > that we don't successfully get rid of the extents we do find, so we keep
> > finding them over and over again.
>
> That's very strange.
>
> As you can see, relocate_block_group() will clean up reloc trees.
>
> This means either we have reloc trees in use and not cleaned up, or some
> tracing mechanism is not working properly.
Can you point out where in the kernel that happens? If we throw some
printks at it we might see something.
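Something like the following is what I have in mind (a rough sketch
only; the exact call sites in relocate_block_group() /
merge_reloc_roots() and the variables printed are hypothetical, just
the per-pass state worth watching):

```c
/* Hypothetical tracing sketch, not against any particular tree:
 * print what the cleanup path thinks it did on each balance pass.
 * "pass", "found_extents", "merged", and "cleaned" stand in for
 * whatever state the real code tracks at that point. */
pr_info("btrfs reloc: pass %d: found %llu extents, "
        "reloc roots merged=%d, dirty subvols cleaned=%d\n",
        pass, found_extents, merged, cleaned);
```

If the looping and non-looping cases diverge anywhere, it should show
up in counters like these.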
> Anyway, an image dump with the dead-looping block group specified
> would provide a good hint for this long-standing problem.
>
> Thanks,
> Qu
>
> >
> > In testing, the patch has no effect:
> >
> > [Mon Apr 27 23:36:15 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:36:21 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:36:27 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:36:32 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:36:38 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:36:44 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:36:50 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:36:56 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:37:01 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:37:07 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:37:13 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:37:19 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:37:24 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > [Mon Apr 27 23:37:30 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >
> > The above is the tail end of 3320 loops on a single block group.
> >
> > I switched to a metadata block group and it's on the 9th loop:
> >
> > # btrfs balance start -mconvert=raid1 /media/testfs/
> > [Tue Apr 28 00:09:47 2020] BTRFS info (device dm-0): found 34977 extents, stage: move data extents
> > [Tue Apr 28 00:12:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > [Tue Apr 28 00:18:46 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > [Tue Apr 28 00:23:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > [Tue Apr 28 00:25:54 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > [Tue Apr 28 00:28:17 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > [Tue Apr 28 00:30:35 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > [Tue Apr 28 00:32:45 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > [Tue Apr 28 00:37:01 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >
> >
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> This is "semi-infinite" because it is possible for the balance to
> >>> terminate if something removes those 29 extents (e.g. looking up the
> >>> extent vaddrs with 'btrfs ins log' and feeding the references to 'btrfs
> >>> fi defrag' reduces the number of inline shared data backref objects).
> >>> When it's reduced all the way to zero, balance starts up again, usually
> >>> promptly getting stuck on the very next block group. If the _only_
> >>> thing running on the filesystem is balance, it will not stop looping.
> >>>
> >>> Bisection points to commit d2311e698578 "btrfs: relocation: Delay reloc
> >>> tree deletion after merge_reloc_roots" as the first commit where the
> >>> balance loops can be reproduced.
> >>>
> >>> I tested with commit 59b2c371052c "btrfs: check commit root generation
> >>> in should_ignore_root" as well as the rest of misc-next, but the balance
> >>> loops are still easier to reproduce than to avoid.
> >>>
> >>> Once it starts happening on a filesystem, it seems to happen very
> >>> frequently. It is not possible to reshape a RAID array of more than a
> >>> few hundred GB on kernels after 5.0. I can get maybe 50-100 block groups
> >>> completed in a resize or balance after a fresh boot, then balance gets
> >>> stuck in loops after that. With the fast balance cancel patches it's
> >>> possible to recover from the loop, but futile, since the next balance
> >>> will almost always also loop, even if it is passed a different block
> >>> group. I've had to downgrade to 5.0 or 4.19 to complete any RAID
> >>> reshaping work.
> >>>
> >>
> >
> >
> >
>