From: Chris Murphy <lists@colorremedies.com>
To: "Agustín DallʼAlba" <agustin@dallalba.com.ar>
Cc: Chris Murphy <lists@colorremedies.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: raid10 corruption while removing failing disk
Date: Tue, 11 Aug 2020 13:17:44 -0600	[thread overview]
Message-ID: <CAJCQCtSdJVw5o2hJ3OyE6-nvM2xpx=nRHLVNSgf9ydD2O--vMQ@mail.gmail.com> (raw)
In-Reply-To: <dc0bea2ee916ce4d1a53fe59869b7b7d8868f617.camel@dallalba.com.ar>

On Mon, Aug 10, 2020 at 11:06 PM Agustín DallʼAlba
<agustin@dallalba.com.ar> wrote:
>
> On Mon, 2020-08-10 at 20:34 -0600, Chris Murphy wrote:
> > On Mon, Aug 10, 2020 at 1:03 AM Agustín DallʼAlba
> > <agustin@dallalba.com.ar> wrote:
> > > Hello!
> > >
> > > The last quarterly scrub on our btrfs filesystem found a few bad
> > > sectors in one of its devices (/dev/sdd), and because there's nobody on
> > > site to replace the failing disk I decided to remove it from the array
> > > with `btrfs device remove` before the problem could get worse.
> >
> > It doesn't much matter if it gets worse, because you still have
> > redundancy on that dying drive until the moment it's completely toast.
> > And btrfs doesn't care if it's spewing read errors.
>
> By 'get worse', I mean another drive failing, and then we'd definitely
> lose data. Because of the pandemic there was (and still is) nobody on
> site to replace the drive, and I won't be able to go there for who
> knows how many months.

Fair point.

> I have a _partial_ dmesg of this time period. It's got a lot of gaps in
> between reboots. I'll send it to you without ccing the list. The
> failing drive is an atrocious WD green for which I forgot to set the
> idle3 timer, that doesn't support SCT ERC and lately just hangs forever
> and requires a power cycle. So there's no way around the slowness. It
> was added in a pinch a year ago because we needed more space. I
> probably should have asked someone to disconnect it and used 'remove
> missing'.

That drive should have '/sys/block/sda/device/timeout' set to at least
120, although I've seen folks on linux-raid@ suggest 180. I don't know
the actual maximum time these drives can spend in "deep recovery".
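For reference, the command timer can be inspected and raised at runtime
(the device name below is an example, and the change does not survive a
reboot; a udev rule is needed to make it permanent):

```shell
# Show the current SCSI command timer for the drive (kernel default: 30 s)
cat /sys/block/sdd/device/timeout

# Raise it so the kernel waits out the drive's internal error recovery
# instead of resetting the link mid-recovery; 180 s per linux-raid@ advice
echo 180 > /sys/block/sdd/device/timeout
```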

As the signal in a sector weakens, the reads get slower. You can
freshen the signal simply by rewriting data. Btrfs doesn't ever do
overwrites, but you can use 'btrfs balance' for this task. Once a year
seems reasonable, or as you notice reads becoming slower. And use a
filtered balance to avoid doing it all at once.
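A filtered balance can restrict each run by chunk count or chunk usage,
so the rewriting happens in small sessions instead of all at once. The
mount point and filter values here are examples, not a prescription:

```shell
# Rewrite at most 10 data chunks this run (the 'limit' filter), then
# repeat in later sessions until the whole filesystem has been refreshed
btrfs balance start -dlimit=10 /mnt/point

# Or target only chunks below a usage threshold, raising the threshold
# on later runs to eventually cover everything
btrfs balance start -dusage=50 -musage=50 /mnt/point
```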


>
> > > # btrfs check --force --readonly /dev/sda
> > > WARNING: filesystem mounted, continuing because of --force
> > > Checking filesystem on /dev/sda
> > > UUID: 4d3acf20-d408-49ab-b0a6-182396a9f27c
> > > checksum verify failed on 10919566688256 found BAB1746E wanted A8A48266
> > > checksum verify failed on 10919566688256 found BAB1746E wanted A8A48266
> >
> > So they aren't at all the same, that's unexpected.
>
> What do you mean by this?

I only fully understood what you meant by this:
>instead of `found BAB1746E wanted A8A48266` it prints `found 0000006E wanted 00000066`

once I re-read the first email that had the full 'btrfs check' output
from the old version. And yeah I don't know why they're different now.


> > My advice is to mount ro, backup (or two copies for important info),
> > and start with a new Btrfs file system and restore. It's not worth
> > repairing.
>
> Sigh, I was expecting I'd have to do this. At least no data was lost,
> and the system still functions even though it's read-only. Do you think
> check --repair is not worth trying? Everything of value is already
> backed up, but restoring it would take many hours of work.

Metadata, RAID10: total=9.00GiB, used=7.57GiB

Ballpark 8 hours for --repair given the metadata size and spinning
drives. Adding --init-extent-tree will add more time, and it's decently
likely to be needed here. The gotcha: you try --repair alone, it fixes
some things, but the extent tree still needs repairing anyway, and that
second pass could be another 8 hours. Or do you reach for the heavy
hammer right away and do both at once to save time? But the heavy
hammer is riskier.
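In command form, the two approaches look like this (device name is a
placeholder; run only against the unmounted filesystem, and only with
backups already in hand):

```shell
# Conservative first pass: fix only what --repair can handle on its own
btrfs check --repair /dev/sdX

# Heavy hammer: also rebuild the extent tree from scratch; riskier, and
# expect many additional hours on spinning drives
btrfs check --repair --init-extent-tree /dev/sdX
```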

Whether you repair or start over, you need a backup, plus two copies of
anything important. To attempt the repair you need to be prepared for
the possibility that things get worse. I'll argue strongly that it's a
bug if things do get worse (i.e. you can no longer mount read-only at
all), but as a risk assessment it has to be considered.


--
Chris Murphy


Thread overview: 17+ messages
2020-08-10  7:03 raid10 corruption while removing failing disk Agustín DallʼAlba
2020-08-10  7:22 ` Nikolay Borisov
2020-08-10  7:38   ` Martin Steigerwald
2020-08-10  7:51     ` Nikolay Borisov
2020-08-10  8:57       ` Martin Steigerwald
2020-08-11  1:30       ` Chris Murphy
2020-08-10  7:59     ` Agustín DallʼAlba
2020-08-10  8:21 ` Nikolay Borisov
2020-08-10 22:24   ` Zygo Blaxell
2020-08-11  1:18   ` Agustín DallʼAlba
2020-08-11  1:48     ` Chris Murphy
2020-08-11  2:34 ` Chris Murphy
2020-08-11  5:06   ` Agustín DallʼAlba
2020-08-11 19:17     ` Chris Murphy [this message]
2020-08-11 20:40       ` Agustín DallʼAlba
2020-08-12  3:03         ` Chris Murphy
2020-08-31 20:05       ` Agustín DallʼAlba
