From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Balance loops: what we know so far
Date: Tue, 28 Apr 2020 10:51:46 -0400
Message-ID: <20200428145145.GB10796@hungrycats.org>
In-Reply-To: <ea42b9cb-3754-3f47-8d3c-208760e1c2ac@gmx.com>

On Tue, Apr 28, 2020 at 05:54:21PM +0800, Qu Wenruo wrote:
> 
> 
> > On 2020/4/28 12:55 PM, Zygo Blaxell wrote:
> > On Mon, Apr 27, 2020 at 03:07:29PM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2020/4/12 5:14 AM, Zygo Blaxell wrote:
> >>> Since 5.1, btrfs has been prone to getting stuck in semi-infinite loops
> >>> in balance and device shrink/remove:
> >>>
> >>> 	[Sat Apr 11 16:59:32 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> 	[Sat Apr 11 16:59:33 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> 	[Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>> 	[Sat Apr 11 16:59:35 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>
> >>> This is a block group while it's looping, as seen by python-btrfs:
> >>>
> >>> 	# share/python-btrfs/examples/show_block_group_contents.py 1934913175552 /media/testfs/
> >>> 	block group vaddr 1934913175552 length 1073741824 flags DATA used 939167744 used_pct 87
> [...]
> >>>
> >>> All of the extent data backrefs are removed by the balance, but the
> >>> loop keeps trying to get rid of the shared data backrefs.  It has
> >>> no effect on them, but keeps trying anyway.
> >>
> >> I guess this shows a pretty good clue.
> >>
> >> I was always thinking about the reloc tree, but in your case, it's data
> >> reloc tree owning them.
> > 
> > In that case, yes.  Metadata balances loop too, in the "move data extents"
> > stage, while data balances loop in the "update data pointers" stage.
> 
> Would you please take an image dump of the fs when the runaway balance happens?
>
> Dumps of both a looping metadata block group and a looping data block group
> would greatly help.

There are two problems with this:

	1) my smallest test filesystems have 29GB of metadata,

	2) the problem is not reproducible with an image.

I've tried using VM snapshots to put a filesystem into a reproducible
looping state.  A block group that loops on one boot doesn't repeatably
loop on another boot from the same initial state; however, once a booted
system starts looping, it keeps looping even if the balance is cancelled
and restarted, whether on the same block group or on other random block
groups.

I have production filesystems with tens of thousands of block groups
and almost all of them loop (as I said before, I cannot complete any
RAID reshapes with 5.1+ kernels).  They can't _all_ be bad.

Cancelling a balance (usually) doesn't recover from the loop; rebooting
does.  The commit that triggered this changes the order of operations in
the kernel, which smells like a runtime problem to me.
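
For catching this early, the dmesg signature above is distinctive enough
to watch for mechanically.  Here's a rough sketch of a watcher in Python
(the 50-repeat threshold and the use of 'dmesg --follow' are arbitrary
choices of mine, not anything coming from the kernel side):

	#!/usr/bin/env python3
	# Rough sketch: flag a likely balance loop when the same
	# "found N extents, stage: ..." line repeats many times in dmesg.
	import re
	import subprocess

	THRESHOLD = 50  # identical repeats before we call it a loop
	pattern = re.compile(
	    r'BTRFS info \(device (\S+)\): found (\d+) extents, stage: (.+)')

	last_key = None
	repeats = 0
	proc = subprocess.Popen(['dmesg', '--follow'],
	                        stdout=subprocess.PIPE, text=True)
	for line in proc.stdout:
	    m = pattern.search(line)
	    if not m:
	        continue
	    key = m.groups()          # (device, extent count, stage)
	    if key == last_key:
	        repeats += 1
	        if repeats == THRESHOLD:
	            print('possible balance loop on %s: %s extents stuck '
	                  'in stage "%s"' % key)
	    else:
	        last_key, repeats = key, 1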

> >> In that case, the data reloc tree is only cleaned up at the end of
> >> btrfs_relocate_block_group().
> >>
> >> Thus it is never cleaned up until we exit the balance loop.
> >>
> >> I'm not sure why this is happening only after I extended the lifespan of
> >> the reloc tree (not the data reloc tree).
> > 
> > I have been poking around with printk to trace what it's doing in the
> > looping and non-looping cases.  It seems to be very similar up to
> > calling merge_reloc_root, merge_reloc_roots, unset_reloc_control,
> > btrfs_block_rsv_release, btrfs_commit_transaction, clean_dirty_subvols,
> > btrfs_free_block_rsv.  In the looping cases, everything up to those
> > functions seems the same on every loop except the first one.
> > 
> > In the non-looping cases, those functions do something different than
> > the looping cases:  the extents disappear in the next loop, and the
> > balance finishes.
> > 
> > I haven't figured out _what_ is different yet.  I need more cycles to
> > look at it.
> > 
> > Your extend-the-lifespan-of-reloc-tree patch moves one of the
> > functions--clean_dirty_subvols (or btrfs_drop_snapshot)--to a different
> > place in the call sequence.  It was in merge_reloc_roots before the
> > transaction commit; now it's in relocate_block_group after the transaction
> > commit.  My guess is that the problem lies somewhere in how the behavior
> > of these functions has been changed by calling them in a different
> > sequence.
> > 
> >> But anyway, would you like to give a try of the following patch?
> >> https://patchwork.kernel.org/patch/11511241/
> > 
> > I'm not sure how this patch could work.  We are hitting the found_extents
> > counter every time through the loop.  It's returning thousands of extents
> > each time.
> > 
> >> It should make us exit the balance so long as we have no extra
> >> extent to relocate.
> > 
> > The problem is not that we have no extents to relocate.  The problem is
> > that we don't successfully get rid of the extents we do find, so we keep
> > finding them over and over again.
> 
> That's very strange.
> 
> As you can see, relocate_block_group() will clean up the reloc trees.
>
> This means either we have reloc trees still in use and not cleaned up, or
> some tracing mechanism is not working properly.

Can you point out where in the kernel that happens?  If we throw some
printks at it we might see something.
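
In the meantime, here's a rough way to watch those paths without patching
printks in: the function_graph tracer.  This is only a sketch; it assumes
tracefs is mounted at /sys/kernel/debug/tracing and that these relocation
symbols show up in available_filter_functions (static functions can be
inlined away on some builds):

	#!/usr/bin/env python3
	# Rough sketch: trace the relocation call graph instead of adding printks.
	import os

	TRACING = '/sys/kernel/debug/tracing'
	FUNCS = ['relocate_block_group', 'merge_reloc_roots', 'clean_dirty_subvols']

	def write(name, value):
	    with open(os.path.join(TRACING, name), 'w') as f:
	        f.write(value + '\n')

	write('current_tracer', 'nop')                # reset before changing filters
	write('set_graph_function', ' '.join(FUNCS))  # limit the graph to these entry points
	write('current_tracer', 'function_graph')
	write('tracing_on', '1')

	# Stream the call graph; interrupt with Ctrl-C when done.
	with open(os.path.join(TRACING, 'trace_pipe')) as pipe:
	    for line in pipe:
	        print(line, end='')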

> Anyway, an image dump with the dead-looping block group specified would
> provide a good hint for this long-standing problem.
> 
> Thanks,
> Qu
> 
> > 
> > In testing, the patch has no effect:
> > 
> > 	[Mon Apr 27 23:36:15 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:21 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:27 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:32 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:38 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:44 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:50 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:36:56 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:01 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:07 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:13 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:19 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:24 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 	[Mon Apr 27 23:37:30 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> > 
> > The above is the tail end of 3320 loops on a single block group.
> > 
> > I switched to a metadata block group and it's on the 9th loop:
> > 
> > 	# btrfs balance start -mconvert=raid1 /media/testfs/
> > 	[Tue Apr 28 00:09:47 2020] BTRFS info (device dm-0): found 34977 extents, stage: move data extents
> > 	[Tue Apr 28 00:12:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:18:46 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:23:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:25:54 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:28:17 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:30:35 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:32:45 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 	[Tue Apr 28 00:37:01 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> > 
> > 
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> This is "semi-infinite" because it is possible for the balance to
> >>> terminate if something removes those 29 extents (e.g. looking up the
> >>> extent vaddrs with 'btrfs ins log' and then feeding the references to
> >>> 'btrfs fi defrag' will reduce the number of inline shared data backref
> >>> objects).  When that count is reduced all the way to zero, balance
> >>> starts up again, usually promptly getting stuck on the very next block
> >>> group.  If the _only_ thing running on the filesystem is balance, it
> >>> will not stop looping.
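
For reference, that manual unstick step can be scripted.  A rough sketch,
assuming the stuck extent vaddrs have already been collected (e.g. from the
show_block_group_contents.py output); 'btrfs ins log' and 'btrfs fi defrag'
are spelled out in their long forms:

	#!/usr/bin/env python3
	# Rough sketch of the workaround described above: resolve each stuck
	# extent vaddr to the files referencing it, then defragment those files
	# so the inline shared data backrefs go away and balance can move on.
	# Usage: pass the mountpoint followed by one or more extent vaddrs.
	import subprocess
	import sys

	def resolve_paths(mountpoint, vaddr):
	    # long form of 'btrfs ins log <vaddr> <mountpoint>'
	    out = subprocess.run(
	        ['btrfs', 'inspect-internal', 'logical-resolve',
	         str(vaddr), mountpoint],
	        capture_output=True, text=True)
	    # one referencing path per line of output; skip empty lines
	    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

	def main():
	    mountpoint, vaddrs = sys.argv[1], sys.argv[2:]
	    paths = set()
	    for vaddr in vaddrs:
	        paths.update(resolve_paths(mountpoint, vaddr))
	    for path in sorted(paths):
	        # long form of 'btrfs fi defrag'; rewriting the extents is what
	        # drops the shared backrefs
	        subprocess.run(['btrfs', 'filesystem', 'defragment', path])

	if __name__ == '__main__':
	    main()

It's the same blunt instrument as running defrag by hand: it unshares
whatever extents it touches, so only worth it to get balance moving again.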
> >>>
> >>> Bisection points to commit d2311e698578 "btrfs: relocation: Delay reloc
> >>> tree deletion after merge_reloc_roots" as the first commit where the
> >>> balance loops can be reproduced.
> >>>
> >>> I tested with commit 59b2c371052c "btrfs: check commit root generation
> >>> in should_ignore_root" as well as the rest of misc-next, but the balance
> >>> loops are still easier to reproduce than to avoid.
> >>>
> >>> Once it starts happening on a filesystem, it seems to happen very
> >>> frequently.  It is not possible to reshape a RAID array of more than a
> >>> few hundred GB on kernels after 5.0.  I can get maybe 50-100 block groups
> >>> completed in a resize or balance after a fresh boot, then balance gets
> >>> stuck in loops after that.  With the fast balance cancel patches it's
> >>> possible to recover from the loop, but futile, since the next balance
> >>> will almost always also loop, even if it is passed a different block
> >>> group.  I've had to downgrade to 5.0 or 4.19 to complete any RAID
> >>> reshaping work.
> >>>
> >>
> > 
> > 
> > 
> 



