Re: "kernel BUG" and segmentation fault with "device delete"

From: Chris Murphy <lists@colorremedies.com>
To: Vladimir Panteleev <thecybershadow@gmail.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>,
	Qu Wenruo <quwenruo.btrfs@gmx.com>
Subject: Re: "kernel BUG" and segmentation fault with "device delete"
Date: Sat, 6 Jul 2019 11:36:13 -0600
Message-ID: <CAJCQCtQVMnP5G=Hp0tnoXuc+_j0Wg3heb1exnNj2nND4Mc3aiw@mail.gmail.com>
In-Reply-To: <0212c1f0-f02d-bf0f-5748-b1332b6bbfad@gmail.com>

On Fri, Jul 5, 2019 at 9:38 PM Vladimir Panteleev
<thecybershadow@gmail.com> wrote:
>
> On 06/07/2019 02.38, Chris Murphy wrote:
> > It's a really good question for developers if there is a good reason
> > to permit rw mount of a volume that's missing two or more devices for
> > raid 1, 10, or 5; and missing three or more for raid6. I cannot think
> > of a good reason to allow degraded,rw mounts for a raid10 missing two
> > devices.
>
> Sorry, the code currently indeed does not permit mounting a RAID10
> filesystem with more than one missing device in rw. I needed to patch my
> kernel to force it to allow it, as I was working on the assumption that
> the two remaining drives contained a copy of all data (which turned out
> to be true).

Oh gotcha. I glossed over that. Ahh yeah, so we're kinda back to end
user sabotage in that case. :-)

The thing about Btrfs is that it has very little predefined on-disk
layout. The only things explicitly assigned locations are the
superblocks. The super points to the start of the root tree and the
chunk tree, and those can start literally anywhere. When block groups
are mirrored, which devices they appear on, and the physical location
on each device, is also not consistent.

In other words, you could do this test a bunch of times and get a
different result each time. And as the file system ages, the layout
becomes even more non-deterministic, so the likelihood of some data
loss when losing two devices on a raid10 very quickly approaches 100%.
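
To put a rough number on that, here is a toy model in Python. It is
only a sketch under assumptions of my own, not a description of the
real chunk allocator: a four-device raid10, where each chunk
independently picks one of the three possible ways to split the four
devices into two mirror pairs, uniformly at random. The names
(PAIRINGS, survives, loss_probability, simulate) are made up for the
illustration. Losing two devices loses data as soon as any one chunk
used exactly those two devices as a mirror pair.

```python
import random

# The three ways to split devices 0..3 into two mirror pairs.
PAIRINGS = [
    ({0, 1}, {2, 3}),
    ({0, 2}, {1, 3}),
    ({0, 3}, {1, 2}),
]


def survives(lost, num_chunks, rng):
    """Return True if no chunk kept both of its copies on the lost devices."""
    lost = set(lost)
    for _ in range(num_chunks):
        # Assumption: the pairing is uniform and independent per chunk.
        pairing = rng.choice(PAIRINGS)
        if lost in pairing:
            return False
    return True


def loss_probability(num_chunks):
    """Closed form under the same assumption: 1 - (2/3)^n."""
    return 1.0 - (2.0 / 3.0) ** num_chunks


def simulate(num_chunks, trials=20000, seed=0):
    """Estimate the loss probability by Monte Carlo, losing devices 0 and 1."""
    rng = random.Random(seed)
    failures = sum(
        not survives((0, 1), num_chunks, rng) for _ in range(trials)
    )
    return failures / trials
```

Under this model a single chunk already gives a 1 in 3 chance of loss,
and loss_probability(10) is roughly 98%. A multi-terabyte volume has
thousands of chunks, so surviving the loss of two devices would take
the placement happening to line up across every one of them.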

>
> > Wow that's really interesting. So you did 'btrfs replace start' for
> > one of the missing drive devid's, with a loop device as the
> > replacement, and that worked and finished?!
>
> Yes, that's right.

I suspect that was luck. There's every reason to believe that in a
repeat scenario you can end up with raid1 block groups only on the two
missing devices.

>
> > Does this three device volume mount rw and not degraded? I guess it
> > must have because 'btrfs fi us' worked on it.
> >
> >          devid    1 size 7.28TiB used 2.71TiB path /dev/sdd1
> >          devid    2 size 7.28TiB used 22.01GiB path /dev/loop0
> >          devid    3 size 7.28TiB used 2.69TiB path /dev/sdf1
>
> Indeed - with the loop device attached, I can mount the filesystem rw
> just fine without any mount flags, with a stock kernel.
>
> > OK so what happens now if you try to 'btrfs device remove /dev/loop0' ?
>
> Unfortunately it fails in the same way (warning followed by "kernel
> BUG"). The same thing happens if I try to rebalance the metadata.

That seems like a legitimate bug even if the way you got to this point
is sorta screwy and definitely an edge case.

>
> > Well there's definitely something screwy if Btrfs needs something on a
> > missing drive, which is indicated by its refusal to remove it from the
> > volume, and yet at same time it's possible to e.g. rsync every file to
> > /dev/null without any errors. That's a bug somewhere.
>
> As I understand, I don't think it actually "needs" any data from that
> device, it's just having trouble updating some metadata as it tries to
> move one redundant copy of the data from there to somewhere else. It's
> not refusing to remove the device either, rather it tries and fails at
> doing so.

I think the developers would say anytime the user space tools permit
an action that results in a kernel warning, it's a bug. The priority
of fixing that bug will of course depend on the likelihood of users
running into it, and the scope of the fix, and the resources required.

>
> > I'm not a developer but a dev very well might need to have a simple
> > reproducer for this in order to locate the problem. But the call trace
> > might tell them what they need to know. I'm not sure.
>
> What I'm going to try to do next is to create another COW layer on top
> of the three devices I have, attach them to a virtual machine, and boot
> that (as it's not fun to reboot the physical machine each time the code
> crashes). Then I could maybe poke the related kernel code to try to
> understand the problem better.

I don't really understand the code, but then also I don't know what's
happening as it tries to remove the device and what logical problems
Btrfs is running into that eventually causes the warning. It might be
there's already confusion with on-disk metadata.

Btrfs debugging isn't enabled in default kernels; it's vaguely
possible that enabling it would reveal more information. And then the integrity
checker can be incredibly verbose, as in so verbose you definitely do
not want to be writing out a persistent kernel message log to the same
Btrfs file system you're checking. The integrity checker also isn't
enabled in distro kernels. It's both a compile time option as well as
a mount time option (separate for metadata only and with data
checking). But I can't give any advice on what mask options to use
that might help reveal what's going on and where Btrfs gets tripped
up. It does look like it's related to the global reserve, which is
something of a misnomer. It's not some separate thing, it's really
space within a metadata block group.

What still would be interesting is if there's a way to reproduce this
layout, where user space tools permit device removal but then the
kernel splats with this warning.

-- 
Chris Murphy