Re: Corrupted filesystem, looking for guidance

From: "Sébastien Luttringer" <seblu@seblu.net>
To: Chris Murphy <lists@colorremedies.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Corrupted filesystem, looking for guidance
Date: Sat, 23 Feb 2019 19:14:39 +0100
Message-ID: <4fd5e655c49278cf5967b2774ab34e4a0571f722.camel@seblu.net>
In-Reply-To: <CAJCQCtTq8YLmti_tf0oNaSGn94qvGxs-mQeDdvxddE61L0Rjdg@mail.gmail.com>

On Mon, 2019-02-18 at 14:06 -0700, Chris Murphy wrote:
> No at worst what happens if SCSI command timer is reached before the
> drive's SCT ERC timeout, is the kernel assumes the device is not
> responding and does a link reset. That link reset obliterates the
> entire command queue on SATA drives. And that means it's no longer
> possible to determine what sector is having a problem; and therefore
> not possible to fix it by overwriting that sector with good data. This
> is a problem for Btrfs raid, as well as md and LVM.

According to the Timeout Mismatch[1] kernel raid wiki:

  Unfortunately, with desktop drives, they can take over two minutes to 
  give up, while the linux kernel will give up after 30 seconds. At which 
  point, the RAID code recomputes the block and tries to write it back to 
  the disk. The disk is still trying to read the data and fails to 
  respond, so the raid code assumes the drive is dead and kicks it from 
  the array. This is how a single error with these drives can easily kill 
  an array.

I get your point that at worst more than one drive can be kicked out, breaking
the whole raid.

What I don't get is how this could lead to silent sector corruption or let bad
sectors accumulate. A read timeout or a link reset will end with an error that
kicks at least one drive from the array, forcing a full rebuild. No?

I discovered that my SAS drives have no such timeout and don't need an ERC
value to be defined. So I raised the command timer to 180 seconds on my drives
that are SATA and don't support ERC. Thanks a lot for helping me discover this.
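
As a rough sketch of the rule applied above (not the exact script used on this
server): pick the command timer per drive from whether it reports an SCT ERC
limit. The Drive class, the function names and the device names are
hypothetical; the 30 second default comes from the wiki quote, and the sysfs
path is the usual per-device timer location. It assumes smartctl's convention
of expressing ERC limits in tenths of a second.

```python
from dataclasses import dataclass
from typing import Optional

KERNEL_DEFAULT_TIMER_S = 30  # default kernel command timer cited in the wiki quote
LONG_TIMER_S = 180           # value chosen above for drives without ERC


@dataclass
class Drive:
    name: str                       # hypothetical device name, e.g. "sda"
    erc_deciseconds: Optional[int]  # SCT ERC limit in 0.1 s units, None if unsupported


def command_timer_for(drive: Drive) -> int:
    """Return a command timer (seconds) that outlasts the drive's error recovery."""
    if drive.erc_deciseconds is None:
        # No ERC: the drive may retry for minutes, so the kernel must wait longer.
        return LONG_TIMER_S
    erc_seconds = drive.erc_deciseconds / 10
    # Keep the default only if the drive gives up before the kernel does.
    if erc_seconds < KERNEL_DEFAULT_TIMER_S:
        return KERNEL_DEFAULT_TIMER_S
    return LONG_TIMER_S


def sysfs_timeout_path(drive: Drive) -> str:
    """Path of the per-device command timer in sysfs."""
    return f"/sys/block/{drive.name}/device/timeout"


if __name__ == "__main__":
    for d in (Drive("sda", 70), Drive("sdb", None)):
        print(sysfs_timeout_path(d), command_timer_for(d))
```

The sketch only computes the value; writing it to the sysfs file requires root
and does not survive a reboot.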

> *shrug* I'm not super familiar with all the mdadm features. It's
> vaguely possible your md array is using the bad block mapping feature,
> and perhaps that's related to this behavior. Something in my memory is
> telling me that this isn't really the best feature to have enabled in
> every use case; it's really strictly for continuing to use drives that
> have all reserve sectors used up, which means bad sectors result in
> write failures. The bad block mapping allows md to do its own
> remapping so there won't be write failures in such a case.
I didn't check whether this log was empty. As this option is enabled by
default, there is one per disk in my array.

> You might check the archives about various memory testing strategies.
> A simple hour long test often won't find the most pernicious memory
> errors. At least do it over a weekend.
> 
> Quick search austin hemmelgarn memory test compile and I found this thread:
> 
I found it. For 72 hours I ran a variant: an Arch live system compiling a
4.20.10 kernel in a loop, plus 4 memtest86+ instances running inside qemu.
No errors, so the memory looks OK.
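
A minimal sketch of such a compile loop, assuming the build command exits
non-zero when something goes wrong. The function name and the example command
are hypothetical, and a real run would also need to clean the tree between
iterations so that each pass actually recompiles.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Run the command and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def compile_loop(
    cmd: Sequence[str],
    duration_s: float,
    run: Callable[[Sequence[str]], int] = _run,
    clock: Callable[[], float] = time.monotonic,
) -> List[int]:
    """Repeat cmd until duration_s has elapsed; return the failed iteration numbers."""
    failures: List[int] = []
    deadline = clock() + duration_s
    iteration = 0
    while clock() < deadline:
        iteration += 1
        # A build of unchanged sources that fails is a hint of flaky hardware.
        if run(cmd) != 0:
            failures.append(iteration)
    return failures


if __name__ == "__main__":
    # Hypothetical short run; a real test would use the kernel build command
    # and a duration of many hours.
    print(compile_loop(["true"], duration_s=0.1))
```

An empty list after a long run is only weak evidence that the memory is fine;
it is not proof.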

> If you do want to move to strictly Btrfs, I suggest raid5 for data but
> use raid1 for metadata instead of raid5. Metadata raid 5 writes can't
> really be assured to be atomic. Using raid1 metadata is less fragile.
Makes sense. Is raid10 a suitable (atomic) option for metadata? It looks like
its performance is better than raid1's?
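
For reference, the split suggested above (raid5 data, raid1 metadata) could be
expressed as follows. This is a sketch that only builds the argument lists for
mkfs.btrfs and btrfs balance; the helper names, device paths and mount point
are hypothetical, and nothing is executed.

```python
from typing import List


def mkfs_args(devices: List[str], data: str = "raid5", metadata: str = "raid1") -> List[str]:
    """Arguments to create a new filesystem with separate data/metadata profiles."""
    if not devices:
        raise ValueError("at least one device is required")
    return ["mkfs.btrfs", "-d", data, "-m", metadata, *devices]


def convert_args(mountpoint: str, data: str = "raid5", metadata: str = "raid1") -> List[str]:
    """Arguments to convert the profiles of an existing, mounted filesystem."""
    return [
        "btrfs", "balance", "start",
        f"-dconvert={data}",
        f"-mconvert={metadata}",
        mountpoint,
    ]


if __name__ == "__main__":
    # Hypothetical devices and mount point, printed rather than run.
    print(" ".join(mkfs_args(["/dev/sdX", "/dev/sdY", "/dev/sdZ"])))
    print(" ".join(convert_args("/mnt/backup")))
```

Swapping metadata="raid10" into either helper gives the variant asked about
here; whether it is as robust as raid1 is exactly the open question.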

> No matter what, keep backups up to date, always be prepared to have to
> use them. The main idea of any raid is to just give you some extra
> uptime in the face of a failure. And the uptime is for your
> applications.
This server is my backup server. I don't plan to back up the backup dataset on
it, so if I lose it, I lose my backup history.

> --repair should be safe but even in 4.20.1 tools you'll see the man
> page says it's dangerous and you should ask on list before using it.
A few months ago I was strongly advised to ask here before calling repair.
Are you saying that this is no longer necessary?

> Well at this point if you ran those commands the file system is
> different so you should refresh the thread by posting current normal
> mount (no options) kernel messages; and also 'btrfs check' output
> without repair; and also output from btrfs-debug-tree. If the problem
> is simple enough and a dev has time it might be they get you a file
> system specific patch to apply and it can be fixed. But it's really
> important that you stop making changes to the file system in the
> meantime. Just gather information. Be deliberate.
It's a pity that there is still no solution that doesn't involve a human. I
won't request developer time that could be used to improve the filesystem. :)

I'm going to start over. Thanks!

Regards,

[1]https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

-- 
Sébastien "Seblu" Luttringer
