Re: Corrupted filesystem, looking for guidance

From: "Sébastien Luttringer" <seblu@seblu.net>
To: Chris Murphy <lists@colorremedies.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Corrupted filesystem, looking for guidance
Date: Mon, 18 Feb 2019 21:14:47 +0100
Message-ID: <91e2c9ef095eae21f9e88f7b5cf49102571dcba8.camel@seblu.net>
In-Reply-To: <CAJCQCtQ+b9y7fBXPPhB-gQrHAH-pCzau6nP1OabsC1GNqNnE1w@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6467 bytes --]

On Tue, 2019-02-12 at 15:40 -0700, Chris Murphy wrote:
> On Mon, Feb 11, 2019 at 8:16 PM Sébastien Luttringer <seblu@seblu.net> wrote:
> 
> FYI: This only does full stripe reads, recomputes parity and overwrites the
> parity strip. It assumes the data strips are correct, so long as the
> underlying member devices do not return a read error. And the only way they
> can return a read error is if their SCT ERC time is less than the kernel's
> SCSI command timer. Otherwise errors can accumulate.
> 
> smartctl -l scterc /dev/sdX
> cat /sys/block/sdX/device/timeout
> 
> The first must be a lesser value than the second. If the first is disabled
> and can't be enabled, then the generally accepted assumed maximum time for
> recoveries is an almost unbelievable 180 seconds; so the second needs to be
> set to 180 and is not persistent. You'll need a udev rule or startup script
> to set it at every boot.
None of my disks' firmware allows ERC to be modified through SCT.

   # smartctl -l scterc /dev/sda
   smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.20-seblu] (local build)
   Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

   SCT Error Recovery Control command not supported

I was not aware of that timer. I needed time to read and experiment with this.
Sorry for the long response time. I hope you didn't time out. :)
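
As an illustration of the startup-script approach quoted above, here is a
minimal sketch that writes the command timer value through sysfs. The function
name and the sysfs_root parameter are hypothetical; only the
/sys/block/sdX/device/timeout path and the 180 second value come from the
thread. Writing to the real sysfs file requires root.

```python
from pathlib import Path


def set_command_timeout(device: str, seconds: int = 180,
                        sysfs_root: str = "/sys/block") -> int:
    """Write `seconds` to <sysfs_root>/<device>/device/timeout.

    Returns the value read back from the file so the caller can verify it.
    `sysfs_root` is a parameter only so the function can be exercised
    against a scratch directory instead of the real sysfs.
    """
    timeout_file = Path(sysfs_root) / device / "device" / "timeout"
    timeout_file.write_text(f"{seconds}\n")
    return int(timeout_file.read_text().strip())


# Example (as root, at every boot, for each array member):
#   set_command_timeout("sda")
```

The setting is not persistent, so a script like this would have to run at every
boot for each member device.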

After simulating several errors and timeouts with scsi_debug[1],
fault_injection[2], and dmsetup[3], I don't understand why you suggest this
could lead to corruption. When a SCSI command times out, the mid-layer[4] makes
several error recovery attempts. These attempts are logged to the kernel ring
buffer and at worst the device is put offline.

From my experiments, the md layer has no timeout and waits as long as the
underlying layer doesn't return, either during a check or a normal read/write
attempt.

I understand the benefit of keeping the disk's error recovery time below the
HBA timeout. It prevents the disk from being kicked out of the array.
However, I don't see how this could lead to a difference between check and
repair in the md layer, or even trigger corruption between the chunks inside a
stripe.

> 
> It is sufficient to merely run a check, rather than repair, to trigger the
> proper md RAID fixup from a device read error.
> 
> Getting a mismatch on a check means there's a hardware problem somewhere. The
> mismatch count only tells you there is a mismatch between data strips and
> their parity strip. It doesn't tell you which device is wrong. And if there
> are no read errors, and no link resets, and yet you get mismatches, that
> suggests silent data corruption. 
After reading the whole md (5) manual, I realize how bad it is to rely on the
md layer to guarantee data integrity. There is no mechanism to know which chunk
is corrupted in a stripe.
I'm wondering whether btrfs raid5, despite its known flaws, is not safer than
md.
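
To make the point about parity concrete, here is a small sketch using plain
XOR parity over three made-up data strips. It is a toy model, not the md
implementation: it only shows that a mismatch can be detected without
identifying the bad strip, and that recomputing parity from the data as read
(what the quoted text says repair does) hides the corruption.

```python
from functools import reduce


def xor_parity(strips):
    """Return the byte-wise XOR of equally sized strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def stripe_is_consistent(data_strips, parity):
    """A check only compares recomputed parity with the stored parity."""
    return xor_parity(data_strips) == parity


# Hypothetical stripe: three data strips plus their parity strip.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)

# Silent corruption: one bit flips in the second strip, no read error.
corrupted = [data[0], bytes([data[1][0] ^ 0x01]) + data[1][1:], data[2]]

# The check sees a mismatch, but nothing says which strip is wrong.
mismatch = not stripe_is_consistent(corrupted, parity)

# Rebuilding the second strip from the others would recover the data,
# but only if we already knew that this strip was the bad one.
rebuilt = xor_parity([corrupted[0], corrupted[2], parity])

# Recomputing parity from the data as read makes the stripe consistent
# again while keeping the corrupted strip.
repaired_parity = xor_parity(corrupted)
```

A per-block checksum, as btrfs keeps, is what allows the wrong strip to be
identified instead of guessed.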

> Further, if the mismatches are consistently in the same sector range, it
> suggests the repair scrub returned one set of data, and the subsequent check
> scrub returned different data - that's the only way you get mismatches
> following a repair scrub.
It was the same range. That was my understanding too.

I finally got rid of these errors by removing a disk, wiping its superblock
and adding it back to the raid. Since then, no check errors (tested twice).
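
For reference, a minimal sketch of how a check can be started and its
mismatch count read through the md sysfs files sync_action and mismatch_cnt.
The helper names and the sysfs_root parameter are hypothetical, and the sketch
does not wait for the check to finish; on a real array the count is only
meaningful once sync_action reads "idle" again.

```python
from pathlib import Path


def start_check(array: str, sysfs_root: str = "/sys/block") -> None:
    """Request a read-only scrub by writing "check" to md/sync_action."""
    (Path(sysfs_root) / array / "md" / "sync_action").write_text("check\n")


def mismatch_count(array: str, sysfs_root: str = "/sys/block") -> int:
    """Return the value of md/mismatch_cnt for the given array."""
    counter = Path(sysfs_root) / array / "md" / "mismatch_cnt"
    return int(counter.read_text().strip())


# Example (as root):
#   start_check("md127")
#   ... wait until /sys/block/md127/md/sync_action reads "idle" ...
#   print(mismatch_count("md127"))
```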

> If it's bad RAM, then chances are both copies of metadata will be identically
> wrong and thus no help in recovery.
RAM is not ECC. I tested the RAM recently and no error was found.

But I needed more RAM to rsync all the data w/ hardlinks, so I added a swap
file on my system disk (an ssd). The filesystem on it is also btrfs, so I used
a loop device to work around the hole issue.
I can find some link resets on this drive from the time it held the swap file.
Maybe this could be a cause.

> > How could I save my filesystem? Should I try --repair or --init-csum-tree?
> 
> If it mounts read-only, update your backups. That is the first priority. Be
> prepared to need them. If it will not mount read only anymore then I suggest
> 'btrfs restore' to scrape data out of the volume to a backup while it's still
> possible. Any repair attempt means writing changes, and any writes are
> inherently risky in this situation. So yeah - if the data is important, focus
> on backups first.
Fortunately, data are safe, as I was in the middle of restoring them back to
the server after a first issue with an old BTRFS filesystem[5].

> Next, I expect until the RAID is healthy that it's difficult to make a
> successful repair of the file system. And for the RAID to be healthy, first
> memory and storage hardware needs to be certainly healthy - the fact there
> are mismatches following an md repair scrub directly suggests hardware
> issues. The linux-raid list is usually quite helpful tracking down such
> problems, including which devices are suspect, but they're going to ask the
> same questions about SCT ERC and SCSI command timer values I mentioned
> earlier, and will want to figure out why you're continuing to see mismatches
> even after a repair scrub - not normal.

I think I will remove the md layer and use only BTRFS to be able to recover
from silent data corruption.
But I would still like to be able to repair a broken BTRFS without moving the
whole dataset to another place. It's the second time this has happened to me.

I tried:
# btrfs check --init-extent-tree /dev/md127
# btrfs check --clear-space-cache v2 /dev/md127
# btrfs check --clear-space-cache v1 /dev/md127
# btrfs rescue super-recover /dev/md127
# btrfs check -b --repair /dev/md127
# btrfs check --repair /dev/md127
# btrfs rescue zero-log /dev/md127

The detailed output is here [6]. But none of the above allowed me to drop the
broken part of the btrfs tree to move forward. Is there a way to repair (by
losing corrupted data) without needing to drop all the correct data?

Regards,

[1] http://sg.danny.cz/sg/sdebug26.html
[2] https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt
[3] https://linux.die.net/man/8/dmsetup
[4] https://www.tldp.org/HOWTO/SCSI-Generic-HOWTO/x215.html
[5] https://lore.kernel.org/linux-btrfs/6e66eb52e4c13fc4206d742e1dade38b04592e49.camel@seblu.net/
[6] http://cloud.seblu.net/s/EPieGzGm9xcyQzd

-- 
Sébastien Luttringer

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 821 bytes --]