From: Chris Murphy <lists@colorremedies.com>
To: Stephen Conrad <conradsd@gmail.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>,
	Qu Wenruo <quwenruo.btrfs@gmx.com>
Subject: Re: Errors after SATA hard resets: parent transid verify failed, csum mismatch on free space cache
Date: Mon, 6 Jan 2020 16:56:19 -0700
Message-ID: <CAJCQCtRJV1ug8L1Q1pJ5ePJdnFP-osekbqzuDb8QSTQ5b0Tm1Q@mail.gmail.com>
In-Reply-To: <1794842.PFUSC7HjHz@paca>

On Mon, Jan 6, 2020 at 10:32 AM Stephen Conrad <conradsd@gmail.com> wrote:

> In January 2019 the file system remounted read-only.  I have retained the dmesg
> logs and notes on the actions I took at that time.  Summary is that I saw a
> bunch of SATA "link is slow to respond" and "hard resetting link" messages

This is raid1? I can't tell.

Anyway, this is common when the drive's SCT ERC duration is greater
than the SCSI command timer (which defaults to 30s). Consumer drives
often ship with SCT ERC disabled, and you'd have to dig through the
specs to find out what recovery duration that implies, but with HDDs
it's often way more than 30s. The idea is single-drive "deep
recovery": it's better for the drive to become slow than to report
EIO for a bad sector. That may be specious, but it's definitely not a
good idea for RAID. First choice is to set SCT ERC to less than 30s;
70 deciseconds (7 seconds) is common for NAS and enterprise drives.
Second choice, if SCT ERC can't be configured, is to raise the SCSI
command timer to something like 120-180s, which I know is crazy but
there you go. More here:

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
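
As a rough illustration (the device name is a placeholder, adapt it
to your drives), checking and setting SCT ERC looks like:

  # query the current SCT ERC setting
  smartctl -l scterc /dev/sdX

  # set read and write recovery limits to 70 deciseconds (7 seconds)
  smartctl -l scterc,70,70 /dev/sdX

If the drive reports SCT ERC as unsupported, fall back to raising the
SCSI command timer instead (below).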

Note that SCT ERC is a drive firmware setting, whereas the SCSI
command timer is a kernel timer that applies per (whole) block
device. Both reset to their defaults on power cycle. You can use a
udev rule for either of them (you only need one; again, SCT ERC is
preferred if available). SCT ERC is set with smartctl, and the SCSI
command timer is set via sysfs, so you can just echo the value you
want.
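
A rough sketch, with placeholder device names and rule file name:

  # one-off, lost on power cycle: raise the command timer to 180 seconds
  echo 180 > /sys/block/sdX/device/timeout

  # or make it persistent with a udev rule, e.g.
  # /etc/udev/rules.d/60-scsi-timeout.rules
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"

Only do that for drives where SCT ERC can't be set; a similar rule
can run smartctl at boot for drives where it can.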


> followed by BTRFS errors, "BTRFS: Transaction aborted (error -5)", and a stack
> trace.  I tried a few things (unsuccessfully) and had to power down the
> hardware.  After a power cycle the file system mounted R/W as normal.   I then
> ran a btrfs check and btrfs scrub which showed no errors.

Make sure a timeout mismatch doesn't exist, and if you make a change,
redo the scrub to be sure. You really want the drive to give up as
fast as possible so that Btrfs is informed of the bad read, which
includes the bad sector's LBA. Btrfs can then look up what's there,
find a good copy on the mirror, and write over the bad sector,
fixing it up.
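
For example (device name and mountpoint are placeholders):

  # check the command timer currently in effect for the suspect drive
  cat /sys/block/sdX/device/timeout

  # then redo a full scrub with per-device stats
  btrfs scrub start -Bd /mnt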


> In October 2019 I had a similar event.  Logs show SATA "hard resetting link"
> for both drives as well as BTRFS "lost page write due to IO error" messages.

Yeah, it's virtually certain to be a misconfiguration. Very common.
Bad sectors start out recovering pretty quickly, but get slower and
slower, eventually hitting the 30s SCSI command timer, at which point
the kernel does a link reset. And then these things never get fixed
up, because when the link is reset, the entire drive's command queue
is lost. That means no discrete read error, and no LBA for the bad
sector, which Btrfs needs in order to find a good copy of that data
and write it back to the bad sector, thereby fixing it.


> The logs show "forced readonly", "BTRFS: Transaction aborted (error -5)" and a
> stack trace.  After hard power cycle I ran a "btrfs check -p" which resulted
> in a stream of messages like "parent transid verify failed on 2418006753280
> wanted 34457 found 30647" and then the following:
> parent transid verify failed on 2417279598592 wanted 35322 found 29823

Yuck. That's a lot of missing generations. I have no idea how that's
possible, especially if this is raid1 (or DUP) metadata.


> # btrfs dev stat /mnt/
> [/dev/mapper/K1JG82AD].write_io_errs   15799831
> [/dev/mapper/K1JG82AD].read_io_errs    15764242
> [/dev/mapper/K1JG82AD].flush_io_errs   4385

That is incredibly bad. That drive has dropped many millions of
writes, and the flush errors could mean this drive is failing to
adhere to proper write ordering. Maybe Qu can tell us.


> At this point I was still seeing occasional log entries for "parent transid
> verify failed" and "read error corrected" so I decided to upgrade from Debian9
> to Debian10 to get more current tools.  Running a scrub with Debian10 tools I
> saw errors detected and corrected... I also saw sata link issues during the
> scrub...

Right. You'll also find this discussed a ton on the linux-raid@ list
for the very same reason; it's not unique to Btrfs.


> # date
> Mon 09 Dec 2019 10:29:05 PM EST
> # btrfs scrub start -B -d /dev/disk/by-uuid/X
> scrub device /dev/mapper/K1JG82AD (id 1) done
>         scrub started at Sun Dec  8 23:06:59 2019 and finished after 05:46:26
>         total bytes scrubbed: 2.80TiB with 9490467 errors
>         error details: verify=1349 csum=9489118
>         corrected errors: 9490467, uncorrectable errors: 0, unverified errors: 0

Cool!

BTW you'll want to reset those stats with -z, so that you can tell if
these errors start happening again.
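
Something like (mountpoint is a placeholder):

  # print the counters, then reset them to zero
  btrfs device stats -z /mnt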

> 1) How should I interpret these errors?  Seems that btrfs messages are telling
> me that there are an abundance of errors everywhere, but that they are all
> correctable...  Should I panic?  Should I proceed?

Never panic. It often results in misunderstanding the problem and
making bad decisions.

The drive with all the errors has a bunch of bad sectors. Maybe by
now they've been repaired, but that won't be clear until you ensure
the timeout mismatch is addressed. Then the scrub will be reliable,
even if there are stalls on bad sectors that take a damn long time to
either read successfully or fail with the proper error.

What you really want are consumer drives with configurable SCT ERC,
or "NAS" drives, which already default to 70 deciseconds.

>
> 2) Is my file system broken? Is my data corrupted?  Should I be able to scrub
> etc to get back to operation without scary log messages? Can I trust data that
> I copy out now, or need to fall back on old/incomplete backups?

Btrfs will return EIO rather than propagate corrupt data to user
space, so you can trust the data you copy out. This assumes all data
has checksums; if you're using nocow for anything, then there can be
silent corruption.
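
If you're not sure whether anything is nocow, lsattr will show the
'C' (No_COW) attribute; the paths here are just examples:

  lsattr -d /var/lib/mysql     # check a directory's own attributes
  lsattr /path/to/somefile

Anything flagged 'C' has no data checksums, so scrub and normal reads
can't detect corruption in it.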


> 3) What steps are recommended to backup/offload/recover data?  I am considering
> installing the disks into a different machine, then mounting the array read-
> only, and then pulling a full copy of the data...
>
> 4) What steps should I take to clean up the file system errors/messages?  Start
> fresh after full backup, (though I hate the idea of migrating off a redundant
> array onto a single disk in the process)?  Scrub etc?  Evaluate each disk
> independently and rebuild one from the other?

I think all of these are addressed, but let me know if something isn't clear.



-- 
Chris Murphy
