From: Chris Murphy <lists@colorremedies.com>
To: "Sébastien Luttringer" <seblu@seblu.net>
Cc: Chris Murphy <lists@colorremedies.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Corrupted filesystem, looking for guidance
Date: Mon, 18 Feb 2019 14:06:36 -0700	[thread overview]
Message-ID: <CAJCQCtTq8YLmti_tf0oNaSGn94qvGxs-mQeDdvxddE61L0Rjdg@mail.gmail.com> (raw)
In-Reply-To: <91e2c9ef095eae21f9e88f7b5cf49102571dcba8.camel@seblu.net>

On Mon, Feb 18, 2019 at 1:14 PM Sébastien Luttringer <seblu@seblu.net> wrote:
>
> On Tue, 2019-02-12 at 15:40 -0700, Chris Murphy wrote:
> > On Mon, Feb 11, 2019 at 8:16 PM Sébastien Luttringer <seblu@seblu.net> wrote:
> >
> > FYI: This only does full stripe reads, recomputes parity and overwrites the
> > parity strip. It assumes the data strips are correct, so long as the
> > underlying member devices do not return a read error. And the only way they
> > can return a read error is if their SCT ERC time is less than the kernel's
> > SCSI command timer. Otherwise errors can accumulate.
> >
> > smartctl -l scterc /dev/sdX
> > cat /sys/block/sdX/device/timeout
> >
> > The first must be a lesser value than the second. If the first is disabled
> > and can't be enabled, then the generally accepted assumed maximum time for
> > recoveries is an almost unbelievable 180 seconds; so the second needs to be
> > set to 180 and is not persistent. You'll need a udev rule or startup script
> > to set it at every boot.
> None of my disks' firmware allows ERC to be modified through SCT.
>
>    # smartctl -l scterc /dev/sda
>    smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.20-seblu] (local build)
>    Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
>
>    SCT Error Recovery Control command not supported
>
> I was not aware of that timer. I needed time to read and experiment on this.
> Sorry for the long response time. I hope you didn't time out. :)
>
> After simulating several errors and timeouts with scsi_debug[1],
> fault_injection[2], and dmsetup[3], I don't understand why you suggest this
> could lead to corruption. When a SCSI command times out, the mid-layer[4] does
> several error recovery attempts. These attempts are logged into the kernel ring
> buffer, and at worst the device is put offline.

No. At worst, if the SCSI command timer is reached before the drive's
SCT ERC timeout, the kernel assumes the device is not responding and
does a link reset. On SATA drives, that link reset obliterates the
entire command queue. It's then no longer possible to determine which
sector is having a problem, and therefore not possible to fix it by
overwriting that sector with good data. This is a problem for Btrfs
raid, as well as for md and LVM.
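The check from earlier in the thread can be sketched as a small
script. This is a sketch only: /dev/sda is an assumption, and the
helper just encodes the rule that the drive's recovery limit (smartctl
reports SCT ERC in deciseconds) must stay below the kernel's command
timer (in seconds):

```shell
# Helper: is the drive's recovery limit (deciseconds, as smartctl
# reports SCT ERC) below the kernel's SCSI command timer (seconds)?
erc_below_timer() {
    erc_ds=$1
    timer_s=$2
    [ "$erc_ds" -lt $((timer_s * 10)) ]
}

DEV=sda   # assumption: repeat for every member device

# Drive-side limit; this query fails on drives with no SCT ERC
# support, like the ones in this thread.
command -v smartctl >/dev/null && smartctl -l scterc "/dev/$DEV" || true

# Kernel-side timer; with no configurable ERC, raise it to the assumed
# 180 s worst case. This is not persistent across reboots.
if [ -w "/sys/block/$DEV/device/timeout" ]; then
    echo 180 > "/sys/block/$DEV/device/timeout"
fi
```

For the udev-rule route, a rule along these lines (untested sketch)
would apply the same timeout at every boot:
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", ATTR{device/timeout}="180"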


>
> From my experiment, the md layer has no timeout and waits as long as the
> underlying layer doesn't return, whether during a check or a normal read/write
> attempt.
>
> I understand the benefit of keeping the disk's time to recover from errors
> below the HBA timeout. It prevents the disk from being kicked out of the array.

The md driver tolerates a fixed number or rate (I'm not sure which) of
read errors before a drive is marked faulty. I think it tolerates only
one write failure before marking the drive faulty.

So far there is no faulty concept in Btrfs. There are patches upstream
for this, but I don't know their merge status.


> However, I don't see how this could lead to a difference between check and
> repair in the md layer, or even trigger some corruption between the chunks
> inside a stripe.

It allows bad sectors to accumulate, because they never get repaired.
The only way they can be repaired is if the drive itself gives up on a
sector and reports a discrete uncorrected read error along with the
sector's LBA. That's the only way the md driver knows which md chunk
is affected, where to get a good copy, and which bad copy to overwrite
on the device that returned the read error.

The linux-raid@ list is full of examples of this. And it does
sometimes lead to the loss of the array, in particular with parity
arrays, where such read errors tend to be colocated. A read error in a
stripe is functionally identical to a single-device loss for that
stripe. So if the bad sector isn't repaired, only one more error is
needed and you get a full stripe loss, and it's not recoverable. If
the lost stripe contains only (user) data, you just lose a file. But
if the lost stripe contains file system metadata, it can mean the loss
of the file system on that md array.
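To put a rough number on "only one more error is needed": consumer
drives are commonly specced at one unrecoverable read error per 1e14
bits read. That figure is an assumption, not something from this
thread, but it shows why a degraded rebuild over several terabytes is
risky:

```shell
# Expected number of UREs while reading `tb` terabytes of surviving
# data during a degraded rebuild, assuming the commonly quoted
# consumer-drive spec of one URE per 1e14 bits (1 TB = 8e12 bits).
expected_ures() {
    awk -v tb="$1" 'BEGIN { printf "%.2f\n", tb * 8e12 / 1e14 }'
}

expected_ures 12   # 12 TB to read back -> prints 0.96
```

With close to one expected URE per full pass of a modern array, an
unrepaired bad sector plus a failed drive is not an unlucky corner
case.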


> After reading the whole md(5) manual, I realize how bad it is to rely on the
> md layer to guarantee data integrity. There is no mechanism to know which chunk
> in a stripe is corrupted.

Correct. There is a tool shipped with mdadm (raid6check, if I remember
the name right) that will do this, but only for raid6 arrays.

> I'm wondering if using btrfs raid5, despite its known flaws, is not safer
> than md.

I can't point to a study that would give us the various probabilities
needed to answer this question. In the meantime, I'd say all raid5 is
fraught with peril the instant there's any unhandled corruption or
read error. And it's a very common misconfiguration to have consumer
SATA drives that lack configurable SCT ERC, so they take longer to
produce a read error than the SCSI command timer allows before it
triggers a link reset.


>
> > Further, if the mismatches are consistently in the same sector range, it
> > suggests the repair scrub returned one set of data, and the subsequent check
> > scrub returned different data - that's the only way you get mismatches
> > following a repair scrub.
> It was the same range. That was my understanding too.
>
> I finally got rid of these errors by removing a disk, wiping the superblock,
> and adding it back to the raid. Since then, no check errors (tested twice).

*shrug* I'm not super familiar with all the mdadm features. It's
vaguely possible your md array is using the bad block mapping feature,
and perhaps that's related to this behavior. Something in my memory
tells me this isn't really the best feature to have enabled in every
use case; it's really strictly for continuing to use drives that have
used up all their reserve sectors, which means bad sectors result in
write failures. The bad block mapping allows md to do its own
remapping so there won't be write failures in such a case.

Anyway, raids are complicated; they are something of a Rube Goldberg
contraption. If you don't understand all the possible outcomes, and
aren't prepared for failures, it can lead to panic. And I've read a
lot of panic-induced data loss on linux-raid. It's really common for
people to do a Google search first and get bad advice, like recreating
the array, and then they wonder why their array is wiped... *shrug*

My advice is: don't be in a hurry to fix things when they go wrong.
Collect information. Do things that don't write changes anywhere. Post
all the information to the proper mailing list, working from the
bottom (start) of the storage stack to the top (the file system), and
trust their advice.


>
> > If it's bad RAM, then chances are both copies of metadata will be identically
> > wrong and thus no help in recovery.
> The RAM is not ECC. I tested it recently and no errors were found.

You might check the archives for various memory testing strategies. A
simple hour-long test often won't find the most pernicious memory
errors. At least run it over a weekend.

A quick search for 'austin hemmelgarn memory test compile' found this thread:

Re: btrfs ate my data in just two days, after a fresh install. ram and
disk are ok. it still mounts, but I cannot repair
Wed, May 4, 2016, 10:12 PM


> But I needed more RAM to rsync all the data with hardlinks, so I added a swap
> file on my system disk (an SSD). The filesystem on it is also btrfs, so I used
> a loop device to work around the hole issue.
> I can find some link resets on this drive from the time it was used as a swap
> file. Maybe this could be a reason.

Yeah, if there is a link reset on the drive, the whole command queue
is lost. That can cause a bunch of I/O errors that look scary but are
one-time errors related to the link reset. So you really don't want
link resets happening.

Conversely, many applications get mad if there really is a 180-second
hang while a consumer drive does deep recovery. So it's a catch-22
unless your use case can tolerate it. But hopefully you only rarely
have bad sectors anyway. One nice thing about Btrfs is that you can do
a balance, which causes everything to be written out again and thereby
"refreshes" the sector data with a stronger signal. You probably
shouldn't have to do that too often, maybe once every 12-18 months.
Otherwise, too many bad sectors is a valid warranty claim.


> I think I will remove the md layer and use only BTRFS to be able to recover
> from silent data corruption.

Btrfs on top of md will still repair corrupted metadata if the
metadata profile is DUP.

And in the case of (user) data corruption, it's still not silent.
Btrfs will tell you what file is corrupt and you can recover it from a
backup.

I can't tell you that Btrfs raid5 with a missing/failed drive is any
more reliable than md raid5. In a way it's simpler, so that might be
to your advantage; it really depends on your comfort and experience
with the user space tools.

If you do want to move to strictly Btrfs, I suggest raid5 for data but
raid1 for metadata instead of raid5. Metadata raid5 writes can't
really be assured to be atomic; raid1 metadata is less fragile.
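Assuming a mounted filesystem and btrfs-progs with conversion support,
the profile switch is a single balance with convert filters. The
sketch below only prints the commands instead of running them; the
mountpoint is an assumption, and `soft` skips chunks already in the
target profile:

```shell
MNT=/mnt/array   # assumption: your Btrfs mountpoint

# Print (rather than run) the conversion described above: raid5 for
# data chunks, raid1 for metadata chunks, then verify the profiles.
convert_cmds() {
    echo "btrfs balance start -dconvert=raid5,soft -mconvert=raid1,soft $MNT"
    echo "btrfs filesystem usage $MNT"
}
convert_cmds
```

The second command is just a check that the new profiles actually took
effect once the balance finishes.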

No matter what, keep backups up to date, always be prepared to have to
use them. The main idea of any raid is to just give you some extra
uptime in the face of a failure. And the uptime is for your
applications.

> But I'm curious whether it's possible to repair a broken Btrfs without moving
> the whole dataset somewhere else. It's the second time this has happened to me.
>
> I tried:
> # btrfs check --init-extent-tree /dev/md127
> # btrfs check --clear-space-cache v2 /dev/md127
> # btrfs check --clear-space-cache v1 /dev/md127
> # btrfs rescue super-recover /dev/md127
> # btrfs check -b --repair /dev/md127
> # btrfs check --repair /dev/md127
> # btrfs rescue zero-log /dev/md127

Wrong order. It's not obvious either that it's the wrong order; the
tools don't do a great job of telling us what order to do things in.
Also, all of these involve writes. You really need to understand the
problem first.

zero-log means some last-minute writes will be lost, and it should
only be used if there's difficulty mounting and the kernel errors
point to a problem with log replay.

clear-space-cache is safe; the cache is recreated at the next mount,
so it might result in a slow initial mount after use.

super-recover is safe by itself or with -v. It should be safe with -y
but -y does write changes to disk.

--init-extent-tree is about the biggest hammer in the arsenal. It
fixes only a very specific problem with the extent tree, and usually
it doesn't help, it just makes things worse.

--repair should be safe, but even in the 4.20.1 tools the man page
says it's dangerous and that you should ask on the list before using
it.
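Put together, the safe-first ordering implied above looks roughly like
this. The sketch prints the steps rather than running them, since each
escalation should wait on the actual kernel errors and list advice:

```shell
DEV=/dev/md127   # the device from this thread

# Read-only / low-risk steps first; everything marked 'escalate'
# writes to the device and should only follow list guidance.
triage_order() {
    echo "1: btrfs check --readonly $DEV"
    echo "2: btrfs rescue super-recover -v $DEV"
    echo "3: btrfs check --clear-space-cache v1 $DEV"
    echo "escalate: btrfs rescue zero-log $DEV (only for log-replay mount failures)"
    echo "escalate: btrfs check --repair $DEV (ask on-list first)"
    echo "escalate: btrfs check --init-extent-tree $DEV (last resort)"
}
triage_order
```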


> The detailed output is here [6]. But none of the above allowed me to drop the
> broken part of the btrfs tree to move forward. Is there a way to repair (by
> losing the corrupted data) without needing to drop all the correct data?

Well, at this point, since you ran all those commands, the file system
is different, so you should refresh the thread by posting the current
kernel messages from a normal mount (no options), the output of 'btrfs
check' without --repair, and the output from btrfs-debug-tree. If the
problem is simple enough and a dev has time, they might get you a
file-system-specific patch to apply and it can be fixed. But it's
really important that you stop making changes to the file system in
the meantime. Just gather information. Be deliberate.


--
Chris Murphy


Thread overview: 9+ messages
2019-02-12  3:16 Corrupted filesystem, looking for guidance Sébastien Luttringer
2019-02-12 12:05 ` Austin S. Hemmelgarn
2019-02-12 12:31 ` Artem Mygaiev
2019-02-12 23:50   ` Sébastien Luttringer
2019-02-12 22:57 ` Chris Murphy
     [not found] ` <CAJCQCtQ+b9y7fBXPPhB-gQrHAH-pCzau6nP1OabsC1GNqNnE1w@mail.gmail.com>
2019-02-18 20:14   ` Sébastien Luttringer
2019-02-18 21:06     ` Chris Murphy [this message]
2019-02-23 18:14       ` Sébastien Luttringer
2019-02-24  0:00         ` Chris Murphy
