From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Josef Bacik <josef@toxicpanda.com>
Cc: Nikolay Borisov <nborisov@suse.com>,
linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: bisected: btrfs dedupe regression in v5.11-rc1
Date: Fri, 7 Jan 2022 13:31:07 -0500 [thread overview]
Message-ID: <YdiG6xYbY0tZ21j9@hungrycats.org> (raw)
In-Reply-To: <Ybz4JI+Kl2J7Py3z@hungrycats.org>
On Fri, Dec 17, 2021 at 03:50:44PM -0500, Zygo Blaxell wrote:
> On Thu, Dec 16, 2021 at 05:07:50PM -0500, Josef Bacik wrote:
> > On Thu, Dec 16, 2021 at 11:29:06PM +0200, Nikolay Borisov wrote:
> > > On 16.12.21 г. 7:33, Zygo Blaxell wrote:
> > > > Again we have "3 locks held" but no list of locks. WTF is 10883 doing?
> > > > Well, first of all it's using 100% CPU in the kernel. Some samples of
> > > > kernel stacks:
> > > >
> > > > # cat /proc/*/task/10883/stack
> > > > [<0>] down_read_nested+0x32/0x140
> > > > [<0>] __btrfs_tree_read_lock+0x2d/0x110
> > > > [<0>] btrfs_tree_read_lock+0x10/0x20
> > > > [<0>] btrfs_search_old_slot+0x627/0x8a0
> > > > [<0>] btrfs_next_old_leaf+0xcb/0x340
> > > > [<0>] find_parent_nodes+0xcd7/0x1c40
> > > > [<0>] btrfs_find_all_leafs+0x63/0xb0
> > > > [<0>] iterate_extent_inodes+0xc8/0x270
> > > > [<0>] iterate_inodes_from_logical+0x9f/0xe0
> > >
> > > That's the real culprit, in this case we are not searching the commit
> > > root hence we've attached to the transaction. So we are doing backref
> > > resolution which either:
> > >
> > > a) Hits some pathological case and loops for very long time, backref
> > > resolution is known to take a lot of time.
> > >
> > > b) We hit a bug in backref resolution and loop forever which again
> > > results in the transaction being kept open.
> > >
> > > Now I wonder why you were able to bisect this to the seemingly unrelated
> > > commit in the vfs code.
> > >
> > > Josef any ideas how to proceed further to debug why backref resolution
> > > takes a long time and if it's just an infinite loop?
> > >
> >
> > It's probably an infinite loop, I'd just start with something like this
> >
> > bpftrace -e 'tracepoint:btrfs:btrfs_prelim_ref_insert { printf("bytenr is %llu", args->bytenr); }'
> >
> > and see if it's spitting out the same shit over and over again. If it is can I
> > get a btrfs inspect-internal dump-tree -e on the device along with the bytenr
> > it's hung up on so I can figure out wtf it's tripping over?
>
> That bpftrace command doesn't output anything after the hang. Before the
> hang, it's thousands and thousands of values changing as bees moves around
> the filesystem.
I left my VM running tests for a few weeks and got some more information.
Or at least more data, I'm not feeling particularly informed by it. :-P
1. It's not a regression. 5.10 has the same symptoms, but about 100x
less often (once a week under these test conditions, compared to once
every 90 minutes or so on 5.11-rc1).
2. Bisection doesn't work, because there are patches that are repeatably
good and bad mixed together, so the bisect algorithm (even with stochastic
enhancement) repeatably picks the wrong commits and converges with
high confidence on nonsense. Instead of bisecting, I picked commits
semi-randomly from 5.11-rc1's patch set, and got these results:
124 3a160a933111 btrfs: drop never met disk total bytes check in verify_one_dev_extent
1x hang, 2x slower
125 bacce86ae8a7 btrfs: drop unused argument step from btrfs_free_extra_devids
1x pass (fast)
126 2766ff61762c btrfs: update the number of bytes used by an inode atomically
1x hang (<20 minutes)
127 7f458a3873ae btrfs: fix race when defragmenting leads to unnecessary IO
1x hang, runs 3x slower
128 5893dfb98f25 btrfs: refactor btrfs_drop_extents() to make it easier to extend
2x hang (<20 minutes)
129 e114c545bb69 btrfs: set the lockdep class for extent buffers on creation
2x pass (but runs 2x slower, both times)
130 3fbaf25817f7 btrfs: pass the owner_root and level to alloc_extent_buffer
1x pass
131 5d81230baa90 btrfs: pass the root owner and level around for readahead
1x pass
132 1b7ec85ef490 btrfs: pass root owner to read_tree_block
1x pass
133 182c79fcb857 btrfs: use btrfs_read_node_slot in btrfs_qgroup_trace_subtree
134 3acfbd6a990c btrfs: use btrfs_read_node_slot in qgroup_trace_new_subtree_blocks
1x hang
135 6b2cb7cb959a btrfs: use btrfs_read_node_slot in qgroup_trace_extent_swap
136 c990ada2a0bb btrfs: use btrfs_read_node_slot in walk_down_tree
1x hang
137 6b3426be27de btrfs: use btrfs_read_node_slot in replace_path
1x hang, 1x pass
138 c975253682e0 btrfs: use btrfs_read_node_slot in do_relocation
1x hang
139 8ef385bbf099 btrfs: use btrfs_read_node_slot in walk_down_reloc_tree
1x hang, 1x pass
140 206983b72a36 btrfs: use btrfs_read_node_slot in btrfs_realloc_node
1x pass
141 bfb484d922a3 btrfs: cleanup extent buffer readahead
1x pass
142 416e3445ef80 btrfs: remove lockdep classes for the fs tree
143 3e48d8d2540d btrfs: discard: reschedule work after sysfs param update
144 df903e5d294f btrfs: don't miss async discards after scheduled work override
145 6e88f116bd4c btrfs: discard: store async discard delay as ns not as jiffies
2x hang
146 e50404a8a699 btrfs: discard: speed up async discard up to iops_limit
[snip]
155 0d01e247a06b btrfs: assert page mapping lock in attach_extent_buffer_page
1x hang, 1x pass
156 bbb86a371791 btrfs: protect fs_info->caching_block_groups by block_group_cache_lock
1x hang
157 e747853cae3a btrfs: load free space cache asynchronously
1x pass
158 4d7240f0abda btrfs: load the free space cache inode extents from commit root
1x hang
159 cd79909bc7cd btrfs: load free space cache into a temporary ctl
2x pass
160 66b53bae46c8 btrfs: cleanup btrfs_discard_update_discardable usage
2x hang, 1x pass
161 2ca08c56e813 btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range
2x pass
162 27d56e62e474 btrfs: update last_byte_to_unpin in switch_commit_roots
2x pass
163 9076dbd5ee83 btrfs: do not shorten unpin len for caching block groups
164 dc5161648693 btrfs: reorder extent buffer members for better packing
2x pass
165 b9729ce014f6 btrfs: locking: rip out path->leave_spinning
166 ac5887c8e013 btrfs: locking: remove all the blocking helpers
167 2ae0c2d80d25 btrfs: scrub: remove local copy of csum_size from context
168 419b791ce760 btrfs: check integrity: remove local copy of csum_size
1x hang, 1x pass
169 713cebfb9891 btrfs: remove unnecessary local variables for checksum size
170 223486c27b36 btrfs: switch cached fs_info::csum_size from u16 to u32
171 55fc29bed8dd btrfs: use cached value of fs_info::csum_size everywhere
172 fe5ecbe818de btrfs: precalculate checksums per leaf once
173 22b6331d9617 btrfs: store precalculated csum_size in fs_info
174 265fdfa6ce0a btrfs: replace s_blocksize_bits with fs_info::sectorsize_bits
175 098e63082b9b btrfs: replace div_u64 by shift in free_space_bitmap_size
2x pass
176 ab108d992b12 btrfs: use precalculated sectorsize_bits from fs_info
[snip]
200 5e8b9ef30392 btrfs: move pos increment and pagecache extension to btrfs_buffered_write
1x pass
201 4e4cabece9f9 btrfs: split btrfs_direct_IO to read and write
[snip]
215 d70bf7484f72 btrfs: unify the ro checking for mount options
1x pass
216 a6889caf6ec6 btrfs: do not start readahead for csum tree when scrubbing non-data block groups
217 a57ad681f12e btrfs: assert we are holding the reada_lock when releasing a readahead zone
218 aa8c1a41a1e6 btrfs: set EXTENT_NORESERVE bits side btrfs_dirty_pages()
219 13f0dd8f7861 btrfs: use round_down while calculating start position in btrfs_dirty_pages()
220 949b32732eab btrfs: use iosize while reading compressed pages
221 eefa45f59379 btrfs: calculate num_pages, reserve_bytes once in btrfs_buffered_write
222 fb8a7e941b1b btrfs: calculate more accurate remaining time to sleep in transaction_kthread
1x pass
There is some repeatability in these results--some commits have a much
lower failure rate than others--but I don't see a reason why the bad
commits are bad or the good commits are good. There are some commits with
locking and concurrency implications, but they're as likely to produce
good as bad results in test. Sometimes there's a consistent change in
test result after a commit that only rearranges function arguments on
the stack.
Maybe what we're looking at is a subtle race that is popping up due
to unrelated changes in the kernel, and disappearing just as often,
and 5.11-rc1 in particular did something innocent that aggravates
it somehow, so all later kernels hit the problem more often than
5.10 did.
3. Somewhere around "7f458a3873ae btrfs: fix race when defragmenting
leads to unnecessary IO" bees starts running about 3x slower than on
earlier kernels. bees is a nightmare of nondeterministically racing
worker threads, so I'm not sure how important this observation is,
but it keeps showing up in the data.
4. I had one machine on 5.10.84 (not a test VM) with a shell process
that got stuck spinning 100% CPU in the kernel on sys_write. bees was
also running, but its threads were all stuck waiting for the shell to
release the transaction. Other crashes on 5.10.8x kernels look more
like the one in this thread, with a logical_ino spinning.
> > If it's not looping there, it may be looping higher up, but I don't see where it
> > would be doing that. Lets start here and work our way up if we need to.
> > Thanks,
> >
> > Josef
next prev parent reply other threads:[~2022-01-07 18:31 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-12-10 18:34 bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one() Zygo Blaxell
2021-12-12 10:03 ` Thorsten Leemhuis
2021-12-13 13:28 ` Nikolay Borisov
2021-12-13 23:12 ` Zygo Blaxell
2021-12-14 11:11 ` Nikolay Borisov
2021-12-14 19:50 ` Zygo Blaxell
2021-12-14 22:25 ` Nikolay Borisov
2021-12-16 5:33 ` Zygo Blaxell
2021-12-16 21:29 ` Nikolay Borisov
2021-12-16 22:07 ` Josef Bacik
2021-12-17 20:50 ` Zygo Blaxell
2022-01-07 18:31 ` Zygo Blaxell [this message]
2022-01-20 14:04 ` bisected: btrfs dedupe regression in v5.11-rc1 Thorsten Leemhuis
2022-01-21 0:27 ` Zygo Blaxell
2022-02-09 12:22 ` Libor Klepáč
2022-02-18 14:46 ` Thorsten Leemhuis
2022-03-06 10:31 ` Thorsten Leemhuis
2022-03-06 23:34 ` Zygo Blaxell
2022-03-07 6:17 ` Thorsten Leemhuis
2021-12-17 5:38 ` bisected: btrfs dedupe regression in v5.11-rc1: 3078d85c9a10 vfs: verify source area in vfs_dedupe_file_range_one() Zygo Blaxell
2022-06-13 8:38 ` Libor Klepáč
2022-06-21 5:08 ` Zygo Blaxell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YdiG6xYbY0tZ21j9@hungrycats.org \
--to=ce3g8jdj@umail.furryterror.org \
--cc=josef@toxicpanda.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=nborisov@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).