* Is it logical to use a disk that scrub fails but smartctl succeeds?
@ 2019-12-11 13:11 Cerem Cem ASLAN
  2019-12-11 16:00 ` Adam Borowski
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Cerem Cem ASLAN @ 2019-12-11 13:11 UTC (permalink / raw)
  To: Btrfs BTRFS

This is the second time after a year that the server's disk throws
"INPUT OUTPUT ERROR" and "btrfs scrub" finds some uncorrectable errors
along with some corrected errors. However, "smartctl -x" displays
"SMART overall-health self-assessment test result: PASSED".

Should we interpret "btrfs scrub"'s "uncorrectable error count" as
"time to replace the disk" or are those unrelated events?

Thanks in advance.


* Re: Is it logical to use a disk that scrub fails but smartctl succeeds?
  2019-12-11 13:11 Is it logical to use a disk that scrub fails but smartctl succeeds? Cerem Cem ASLAN
@ 2019-12-11 16:00 ` Adam Borowski
  2019-12-12 15:40   ` Cerem Cem ASLAN
  2019-12-11 18:36 ` Zygo Blaxell
  2019-12-11 18:37 ` Chris Murphy
  2 siblings, 1 reply; 11+ messages in thread
From: Adam Borowski @ 2019-12-11 16:00 UTC (permalink / raw)
  To: Cerem Cem ASLAN; +Cc: Btrfs BTRFS

On Wed, Dec 11, 2019 at 04:11:05PM +0300, Cerem Cem ASLAN wrote:
> This is the second time after a year that the server's disk throws
> "INPUT OUTPUT ERROR" and "btrfs scrub" finds some uncorrectable errors
> along with some corrected errors. However, "smartctl -x" displays
> "SMART overall-health self-assessment test result: PASSED".
> 
> Should we interpret "btrfs scrub"'s "uncorrectable error count" as
> "time to replace the disk" or are those unrelated events?

"btrfs scrub" operates on a higher layer, and can detect more errors, some
of which may have a cause elsewhere.  For example, dodgy memory very often
corrupts data this way; you can retry the scrub to see if the corruption
happened during write (so the data is lost) or during read (so retrying
should work).  In that case, you may want to test and/or replace your
memory, motherboard, processor, etc.
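
A quick sketch of that retry (assuming the filesystem is mounted at
/mnt; the path is a placeholder):

    btrfs scrub start -Bd /mnt   # -B: wait for completion, -d: per-device stats
    btrfs device stats /mnt      # cumulative error counters

If the csum errors repeat in the same places, the bad copy is on disk;
if they don't, suspect the read path (RAM, cables, controller).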

Or, the disk's firmware may fail to detect errors.  It's supposed to verify
the disk's internal checksum, but error detection is another place where a
dodgy manufacturer can shave some costs -- either intentionally, or by
neglecting testing.

Or, some buggy software (which may even include btrfs itself, albeit
unlikely) might scribble on wrong areas of the disk.

Or...


Anyway, all you know for sure is that you have _some_ breakage, which a
filesystem without data checksums would fail to detect, allowing silent data
corruption.  Finding the cause is another story.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol,
⣾⠁⢠⠒⠀⣿⡁ 1kg raspberries, 0.4kg sugar; put into a big jar for 1 month.
⢿⡄⠘⠷⠚⠋⠀ Filter out and throw away the fruits (can dump them into a cake,
⠈⠳⣄⠀⠀⠀⠀ etc), let the drink age at least 3-6 months.


* Re: Is it logical to use a disk that scrub fails but smartctl succeeds?
  2019-12-11 13:11 Is it logical to use a disk that scrub fails but smartctl succeeds? Cerem Cem ASLAN
  2019-12-11 16:00 ` Adam Borowski
@ 2019-12-11 18:36 ` Zygo Blaxell
  2019-12-11 18:37 ` Chris Murphy
  2 siblings, 0 replies; 11+ messages in thread
From: Zygo Blaxell @ 2019-12-11 18:36 UTC (permalink / raw)
  To: Cerem Cem ASLAN; +Cc: Btrfs BTRFS

On Wed, Dec 11, 2019 at 04:11:05PM +0300, Cerem Cem ASLAN wrote:
> This is the second time after a year that the server's disk throws
> "INPUT OUTPUT ERROR" and "btrfs scrub" finds some uncorrectable errors
> along with some corrected errors. 

Some minor failures (e.g. a bad sector event every N drive-years)
are expected and recoverable during normal drive operation.  The
expected rates are laid out in the drive model's detailed spec sheets.
More than one failure per drive-year is quite high.  You are correct
to suspect the drive might be failing.

> However, "smartctl -x" displays
> "SMART overall-health self-assessment test result: PASSED".

When firmware reports a "PASSED" result in SMART, it means that there
are no known unrecoverable failures.  PASSED does not mean that there
are no failures, nor does it mean all detected failures will in fact be
recoverable--it simply means that the drive has not failed so badly
that recovery can no longer be attempted.

"FAILED" can indicate the drive firmware has detected a truly
unrecoverable condition, or it has violated a theoretical (but possibly
harmless) constraint.  FAILED could also be the result of a drive
firmware bug.

In practical terms, the health self-assessment is useless, as it has
only two possible results and neither provides actionable information.

Try 'smartctl -t long', then wait some minutes (it will give you an
estimate of how many), then look at the detailed self-test log output from
'smartctl -x'.  The long self-test usually reads all sectors on the disk
and will quantify errors (giving UNC sector counts and locations).
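
A minimal sketch of that sequence (the device name is a placeholder):

    smartctl -t long /dev/sdX    # prints an estimated completion time
    # ...wait that long, then:
    smartctl -x /dev/sdX         # self-test log shows status and LBA of first error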

Spinning hard drives normally grow a few bad sectors over multi-year
time scales.  They are often recoverable: simply write replacement data
into the sector, and the drive will remap the write to a good area of the
disk; however, if there are thousands of errors, and new bad sectors show
up in later tests that were not present in tests from a few days before,
then the drive is likely failing and should be replaced.
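
For a single known-bad LBA taken from the self-test log, that rewrite
can be done directly, as in this sketch (device and LBA are
placeholders; this destroys whatever was stored in that sector, so on a
redundant btrfs profile prefer letting scrub do the repair):

    # the self-test log reports the LBA in 512-byte units
    dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek=123456789 oflag=direct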

> Should we interpret "btrfs scrub"'s "uncorrectable error count" as
> "time to replace the disk" or are those unrelated events?

You need to look at the specific error counts individually, as they
indicate different problems (a sketch of where to read these counters
follows the list).  There are 5 kinds of uncorrectable error:

        - verify errors -> bad drive firmware (buy a different model
        disk, or try disabling write caching) or you are using a virtual
        storage stack (e.g. the btrfs is running in a VM and the VM disk
        image file is not configured correctly).  The disk told btrfs
        a write was completed, but btrfs later read the data and found
        something unexpected in its place.  If the underlying problem
        is not corrected the filesystem will eventually be destroyed.

        - corruption or csum errors -> usually RAM failure, either in the
        host or the hard drive, but can also be a symptom of firmware
        bugs or cable issues.  SMART cannot detect any of these errors
        except bad cables (as SATA/UDMA CRC errors).  Replace hardware,
        but see below for details about *which* hardware.

        - read errors -> one event per year is OK, maybe 2 on a cheap
        drive.  A single failure event should add no more than a few
        hundred unreadable blocks.  Replace hardware on the third event
        in a 365-day period, or if more than 1000 errors per TB are
        found in a single scrub.  Read errors can occur temporarily
        when the drive is outside of its operating temperature range
        (in which case fix the temperature not the drive).

        - write / flush errors -> sector remapping failed, probably
        because there are too many errors on the disk and the remap
        reserved area is now full.  Replace hardware.
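
As a sketch of where to read these counters (the mount point is a
placeholder):

    btrfs scrub start -B /mnt    # end-of-run summary reports read, csum,
                                 # verify and uncorrectable error counts
    btrfs device stats /mnt      # per-device write_io_errs, read_io_errs,
                                 # flush_io_errs, corruption_errs, generation_errs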

Note that the hardware you have to replace is not necessarily the disk.
Bad cables (data or power) and bad power supplies can cause many of
these symptoms.  Host RAM failure can result in csum errors, though
usually btrfs is severely damaged before the csum errors are apparent.
For bad firmware you have to replace the disk with a model that has
different firmware, or upgrade the firmware if possible.

Also note that btrfs raid5/6 profiles have issues that make scrub output
unusable for the purpose of assessing drive health: accurate attribution
of csum failures to devices is not possible, and there's at least one
outstanding btrfs data corruption bug on raid5/6 that will show up in
scrub as csum failures.

> Thanks in advance.



* Re: Is it logical to use a disk that scrub fails but smartctl succeeds?
  2019-12-11 13:11 Is it logical to use a disk that scrub fails but smartctl succeeds? Cerem Cem ASLAN
  2019-12-11 16:00 ` Adam Borowski
  2019-12-11 18:36 ` Zygo Blaxell
@ 2019-12-11 18:37 ` Chris Murphy
  2019-12-11 18:52   ` Chris Murphy
  2 siblings, 1 reply; 11+ messages in thread
From: Chris Murphy @ 2019-12-11 18:37 UTC (permalink / raw)
  To: Cerem Cem ASLAN; +Cc: Btrfs BTRFS

On Wed, Dec 11, 2019 at 6:11 AM Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
>
> This is the second time after a year that the server's disk throws
> "INPUT OUTPUT ERROR" and "btrfs scrub" finds some uncorrectable errors
> along with some corrected errors. However, "smartctl -x" displays
> "SMART overall-health self-assessment test result: PASSED".
>
> Should we interpret "btrfs scrub"'s "uncorrectable error count" as
> "time to replace the disk" or are those unrelated events?
>
> Thanks in advance.

This is a bit old, and there are more recent papers on better
approaches, but as far as SMART attributes correlating with failures
goes, it demonstrates there's a big window where failures can happen
and SMART gives no advance warning.
https://www.usenix.org/legacy/events/fast07/tech/full_papers/pinheiro/pinheiro_old.pdf

If you are doing 'smartctl -t long' or similarly have smartd
configured to do the long test periodically, and if that test never
shows a failure, that means the drive thinks it's doing a good job :D
If you assume the drive's error detection is working, then no errors
detected by the drive means the data on the drive is the data the
drive computed the checksum on. That leaves the drive's own
controller, memory cache, and everything before that (connectors,
cables, logic board controller, logic board RAM, probably not CPU
memory or the CPU itself or you'd have a ton of problems) which could
contribute to corruption of the data that Btrfs could detect that the
drive firmware will assume is correct.

-- 
Chris Murphy


* Re: Is it logical to use a disk that scrub fails but smartctl succeeds?
  2019-12-11 18:37 ` Chris Murphy
@ 2019-12-11 18:52   ` Chris Murphy
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2019-12-11 18:52 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Cerem Cem ASLAN, Btrfs BTRFS

On Wed, Dec 11, 2019 at 11:37 AM Chris Murphy <lists@colorremedies.com> wrote:
>
> On Wed, Dec 11, 2019 at 6:11 AM Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
> >
> > This is the second time after a year that the server's disk throws
> > "INPUT OUTPUT ERROR" and "btrfs scrub" finds some uncorrectable errors
> > along with some corrected errors. However, "smartctl -x" displays
> > "SMART overall-health self-assessment test result: PASSED".
> >
> > Should we interpret "btrfs scrub"'s "uncorrectable error count" as
> > "time to replace the disk" or are those unrelated events?
> >
> > Thanks in advance.
>
> This is a bit old, and there are more recent papers on better
> approaches, but as far as SMART attributes correlating with failures
> goes, it demonstrates there's a big window where failures can happen
> and SMART gives no advance warning.
> https://www.usenix.org/legacy/events/fast07/tech/full_papers/pinheiro/pinheiro_old.pdf
>
> If you are doing 'smartctl -t long' or similarly have smartd
> configured to do the long test periodically, and if that test never
> shows a failure, that means the drive thinks it's doing a good job :D
> If you assume the drive's error detection is working, then no errors
> detected by the drive means the data on the drive is the data the
> drive computed the checksum on. That leaves the drive's own
> controller, memory cache, and everything before that (connectors,
> cables, logic board controller, logic board RAM, probably not CPU
> memory or the CPU itself or you'd have a ton of problems) which could
> contribute to corruption of the data that Btrfs could detect that the
> drive firmware will assume is correct.

The last sentence is a bit sloppy wording.  The drive firmware doesn't
assume the data is correct; it computed a checksum over (likely)
already-corrupt data, so the internal read-back and error detection
based on that internal checksum, recorded with the sector data, report
the data as correct.  Ergo, it has no way of knowing the data is bad.

There is error detection (CRC) used between the logic board controller
and the controller in the drive, because connector and cable errors are
a known source of problems.  This may or may not be recorded or
reported by SMART in attribute 199, and it may or may not get reported
to the kernel (it really should be, and probably usually is).  So if
you have any of those, there's some small but non-zero chance of a
collision where errors are happening but not detected.  This error
detection is really a low bar: it's not intended to compensate for
regular errors induced by a bad cable or connector; it's designed to be
a red flag to take action.
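
Checking that counter is quick, as in this sketch (the device is a
placeholder):

    smartctl -A /dev/sdX | grep -i crc
    # 199 UDMA_CRC_Error_Count ...  a raw value that keeps rising points
    # at the cable/connector path, not the platters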

-- 
Chris Murphy


* Re: Is it logical to use a disk that scrub fails but smartctl succeeds?
  2019-12-11 16:00 ` Adam Borowski
@ 2019-12-12 15:40   ` Cerem Cem ASLAN
  2019-12-12 18:56     ` Chris Murphy
  2019-12-12 19:18     ` Remi Gauvin
  0 siblings, 2 replies; 11+ messages in thread
From: Cerem Cem ASLAN @ 2019-12-12 15:40 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Btrfs BTRFS

Thanks for those quick replies.  It took a bit to put the pieces of
this reply together, though.

I realized that I had made a huge mistake: I relied on a backup
strategy of syncing valuable data between two computers on two sites,
while completely ignoring the possibility of a disk failure occurring
on both sites within the same interval between "btrfs scrub"
examinations.  That is exactly what has happened here.  (We are talking
about more than 6 months, which is of course a long period.  Obviously
not monitoring the filesystem for that long is my fault; I accept that,
and lessons were learned:
https://github.com/ceremcem/monitor-btrfs-disk)
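
(The core of such a periodic check can be as small as the following
sketch, run e.g. from cron; the mount point and the alert command are
placeholders:

    btrfs scrub start -B /mnt && btrfs device stats --check /mnt \
        || echo "btrfs errors on /mnt" | mail -s "disk alert" admin@example.com

`btrfs device stats --check` exits non-zero if any error counter is
non-zero.)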

My first action was determining the corrupted files.  I was wondering
whether insisting on running CouchDB on BTRFS would eventually cause a
failure, so this list of corrupted files might help shed some light on
the cause: https://gist.github.com/ceremcem/b507be2669682857f37039eb9655d7ad

My second action is, as only one disk is present at the moment, to
convert the Single data profile to DUP (which I couldn't do, due to
"Input/output error"s) in order to be able to fix any further
corruption.  In the meantime I'll replace the disk with two new disks
and set up a RAID-1 with them.
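
(For reference, converting an existing filesystem's data profile to DUP
is done with a balance filter, roughly as in this sketch; the mount
point is a placeholder:

    btrfs balance start -dconvert=dup /mnt

and it is a balance like this that aborts here with "Input/output error".)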

While searching for "converting to DUP profile", I noticed that the
man page of btrfs explicitly states:

> In any case, a device that starts to misbehave and repairs from the DUP copy should be replaced! DUP is not backup.

Based on that, the uncorrectable errors (in Single profile) also mean
that we should replace the misbehaving disk.

> Try 'smartctl -t long', then wait some minutes (it will give you an
> estimate of how many), then look at the detailed self-test log output from
> 'smartctl -x'.  The long self-test usually reads all sectors on the disk
> and will quantify errors (giving UNC sector counts and locations).

I tried this, but I couldn't interpret the results.  Here is
the `smartctl -a /dev/sda` output:
https://gist.github.com/1a741135af10f6bebcaf6175c04594df

> You need to look at the specific error counts individually, as they
> indicate different problems.  There are 5 kinds of uncorrectable
> error:

`btrfs scrub` isn't giving us those kinds of details, or is it? How
can we get such a detailed report?

Thank you all for those detailed answers.

On Wed, Dec 11, 2019 at 19:00, Adam Borowski <kilobyte@angband.pl> wrote:

>
> On Wed, Dec 11, 2019 at 04:11:05PM +0300, Cerem Cem ASLAN wrote:
> > This is the second time after a year that the server's disk throws
> > "INPUT OUTPUT ERROR" and "btrfs scrub" finds some uncorrectable errors
> > along with some corrected errors. However, "smartctl -x" displays
> > "SMART overall-health self-assessment test result: PASSED".
> >
> > Should we interpret "btrfs scrub"'s "uncorrectable error count" as
> > "time to replace the disk" or are those unrelated events?
>
> "btrfs scrub" operates on a higher layer, and can detect more errors, some
> of which may have a cause elsewhere.  For example, dodgy memory very often
> corrupts data this way; you can retry the scrub to see if the corruption
> happened during write (so the data is lost) or during read (so retrying
> should work).  In that case, you may want to test and/or replace your
> memory, motherboard, processor, etc.
>
> Or, the disk's firmware may fail to detect errors.  It's supposed to verify
> the disk's internal checksum, but error detection is another place where a
> dodgy manufacturer can shave some costs -- either intentionally, or by
> neglecting testing.
>
> Or, some buggy software (which may even include btrfs itself, albeit
> unlikely) might scribble on wrong areas of the disk.
>
> Or...
>
>
> Anyway, all you know for sure is that you have _some_ breakage, which a
> filesystem without data checksums would fail to detect, allowing silent data
> corruption.  Finding the cause is another story.
>
>
> Meow!
> --
> ⢀⣴⠾⠻⢶⣦⠀ A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol,
> ⣾⠁⢠⠒⠀⣿⡁ 1kg raspberries, 0.4kg sugar; put into a big jar for 1 month.
> ⢿⡄⠘⠷⠚⠋⠀ Filter out and throw away the fruits (can dump them into a cake,
> ⠈⠳⣄⠀⠀⠀⠀ etc), let the drink age at least 3-6 months.


* Re: Is it logical to use a disk that scrub fails but smartctl succeeds?
  2019-12-12 15:40   ` Cerem Cem ASLAN
@ 2019-12-12 18:56     ` Chris Murphy
  2019-12-16 10:36       ` Cerem Cem ASLAN
  2019-12-12 19:18     ` Remi Gauvin
  1 sibling, 1 reply; 11+ messages in thread
From: Chris Murphy @ 2019-12-12 18:56 UTC (permalink / raw)
  To: Cerem Cem ASLAN; +Cc: Adam Borowski, Btrfs BTRFS

On Thu, Dec 12, 2019 at 8:40 AM Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:

> My second action is, as only one disk is present at the moment, to
> convert the Single data profile to DUP (which I couldn't do, due to
> "Input/output error"s) in order to be able to fix any further
> corruption.  In the meantime I'll replace the disk with two new disks
> and set up a RAID-1 with them.
>
> While searching for "converting to DUP profile", I noticed that the
> man page of btrfs explicitly states:
>
> > In any case, a device that starts to misbehave and repairs from the DUP copy should be replaced! DUP is not backup.
>
> Based on that, the uncorrectable errors (in Single profile) also mean
> that we should replace the misbehaving disk.

If the single copy of metadata is critical for getting data off the
drive, and that single copy can't be read due to a UNC error from the
drive, then normal operation of that filesystem isn't possible.  It's
now a data-scraping job.

My tactic would be to mount it ro and try to get all the data off I
can through the normal means (file copy, rsync, whatever), working in
order of priority to improve the chance of successfully getting out
the most important data.  Once this method consistently fails, you'll
need to go to btrfs restore, which is an offline scraping tool and
rather tedious to use.
https://btrfs.wiki.kernel.org/index.php/Restore
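
A sketch of that first pass (device, mount point and destination are
placeholders; rsync skips files it cannot read, reports them, and exits
non-zero, so unreadable files don't silently "succeed"):

    mount -o ro /dev/sdX /mnt
    rsync -a /mnt/most-important/ /backup/most-important/
    rsync -a /mnt/ /backup/rest/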

But if there's even a scant amount of the minimum necessary metadata
intact, you will get your data off, even if there are errors in the
data (one of restore's options is to ignore errors).  Whereas in normal
operation Btrfs won't hand over anything with checksum errors; you get
EIO instead.  So there's a decent chance of getting the data off the
drive this way.
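
If it comes to that, a restore invocation looks roughly like this
sketch (the unmounted device and the destination are placeholders):

    btrfs restore -i /dev/sdX /backup/scraped/  # -i: ignore errors, keep going
    # add -D for a dry run that only lists what would be restored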






>
> > Try 'smartctl -t long', then wait some minutes (it will give you an
> > estimate of how many), then look at the detailed self-test log output from
> > 'smartctl -x'.  The long self-test usually reads all sectors on the disk
> > and will quantify errors (giving UNC sector counts and locations).
>
> I tried this, but I couldn't interpret the results.  Here is
> the `smartctl -a /dev/sda` output:
> https://gist.github.com/1a741135af10f6bebcaf6175c04594df

196 Reallocated_Event_Count 0x0032   050   050   000    Old_age   Always       -       2652
197 Current_Pending_Sector  0x0022   084   084   000    Old_age   Always       -       1167

I don't know how many spare sectors a drive typically has, let alone
this particular make/model.  The numbers are somewhat inflated in that
they are likely counted in 512-byte sectors, so for every bad 4096-byte
physical sector, you see 8 bad sectors reported.
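
(On that reading, the 1167 pending 512-byte sectors here would
correspond to roughly 1167 / 8 ≈ 146 bad 4096-byte physical sectors.)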

First order of priority is to get data off the drive, if you don't
have a current backup.  Second, once you have a backup, demote this
drive entirely, and start your restore from backup using good drives.

Later you can mess around with this marginally reliable drive if you
want, maybe in a raid1 context.  I'm suspicious about whether there are
in fact some UNC write errors.  The most recent error, starting at line
95, says both reads and writes were happening.  Was it the read that
triggered the error?  Probably, in which case you could keep using this
drive while remaining very suspicious of it and the high likelihood
that it will betray you again in the future.  If it was a write error,
you'll see that in dmesg just like read errors, and that's
disqualifying: you pretty much have to take the drive out of use, as it
means it can't remap sectors anymore at all, probably because it has
run out of reserve sectors.

I personally wouldn't mess around with this drive, trying to produce
DUP metadata out of likely corrupt single metadata. It doesn't help
the situation.

Oh, and last but actually I should have mentioned it first, because
you'll want to do this right away: check whether this drive has
configurable SCT ERC.

smartctl -l scterc /dev/

If it does, it might be worth explicitly setting the read timeout to
something crazy, like 180 seconds.  That means a command something
like:

smartctl -l scterc,1800,70

That'll allow up to 180 seconds of read retries, and 7 seconds for
writes (the write value doesn't really matter; you shouldn't be writing
anyway, but there's no point in waiting a long time for writes either).
Maybe, maybe, maybe, on the very off chance, this improves the odds
that the drive firmware can recover data from these bad sectors.
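
(The two values are in units of 100 ms, and smartctl also expects the
device argument, e.g. `smartctl -l scterc,1800,70 /dev/sdX`.)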

By the way, those bad sectors may have developed relatively quickly.
If some piece of surface material broke off and got caught, however
briefly, by the drive head, it could have scraped 300 sectors instantly
before the debris was flung off.  Of course, it's also possible the
drive has been degrading slowly over six months.




-- 
Chris Murphy


* Re: Is it logical to use a disk that scrub fails but smartctl succeeds?
  2019-12-12 15:40   ` Cerem Cem ASLAN
  2019-12-12 18:56     ` Chris Murphy
@ 2019-12-12 19:18     ` Remi Gauvin
  2019-12-12 19:24       ` Chris Murphy
  1 sibling, 1 reply; 11+ messages in thread
From: Remi Gauvin @ 2019-12-12 19:18 UTC (permalink / raw)
  To: Cerem Cem ASLAN, linux-btrfs


On 2019-12-12 10:40 a.m., Cerem Cem ASLAN wrote:

>> Try 'smartctl -t long', then wait some minutes (it will give you an
>> estimate of how many), then look at the detailed self-test log output from
>> 'smartctl -x'.  The long self-test usually reads all sectors on the disk
>> and will quantify errors (giving UNC sector counts and locations).
> 
> I tried this, but I couldn't interpret the results.  Here is
> the `smartctl -a /dev/sda` output:
> https://gist.github.com/1a741135af10f6bebcaf6175c04594df
> 

That drive is toast.  The giveaway here is the over 1000 "Current
Pending Sectors".  There's no point trying to convert this drive to
DUP; it must simply be stopped, and whatever files you can successfully
copy, consider yourself lucky.  The rest depends on your backup...  (I
wasn't clear on why your backup is supposed to be bad; BTRFS should
have caught any errors during the backup and stopped things with I/O
errors.)





* Re: Is it logical to use a disk that scrub fails but smartctl succeeds?
  2019-12-12 19:18     ` Remi Gauvin
@ 2019-12-12 19:24       ` Chris Murphy
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2019-12-12 19:24 UTC (permalink / raw)
  To: Remi Gauvin; +Cc: Cerem Cem ASLAN, linux-btrfs

On Thu, Dec 12, 2019 at 12:18 PM Remi Gauvin <remi@georgianit.com> wrote:
>
> On 2019-12-12 10:40 a.m., Cerem Cem ASLAN wrote:
>
> >> Try 'smartctl -t long', then wait some minutes (it will give you an
> >> estimate of how many), then look at the detailed self-test log output from
> >> 'smartctl -x'.  The long self-test usually reads all sectors on the disk
> >> and will quantify errors (giving UNC sector counts and locations).
> >
> > I tried this, but I couldn't interpret the results.  Here is
> > the `smartctl -a /dev/sda` output:
> > https://gist.github.com/1a741135af10f6bebcaf6175c04594df
> >
>
> That drive is toast.  The giveaway here is the over 1000 "Current
> Pending Sectors".  There's no point trying to convert this drive to
> DUP; it must simply be stopped, and whatever files you can successfully
> copy, consider yourself lucky.  The rest depends on your backup...  (I
> wasn't clear on why your backup is supposed to be bad; BTRFS should
> have caught any errors during the backup and stopped things with I/O
> errors.)
>


Exactly.  It's possible, though, that the backup is missing files as a
result of EIO, if those errors weren't discovered until recently.



-- 
Chris Murphy


* Re: Is it logical to use a disk that scrub fails but smartctl succeeds?
  2019-12-12 18:56     ` Chris Murphy
@ 2019-12-16 10:36       ` Cerem Cem ASLAN
  2019-12-16 16:26         ` Chris Murphy
  0 siblings, 1 reply; 11+ messages in thread
From: Cerem Cem ASLAN @ 2019-12-16 10:36 UTC (permalink / raw)
  Cc: Btrfs BTRFS

Hi,

> But if there's even a scant amount of the minimum necessary metadata
> intact, you will get your data off, even if there are errors in the
> data (one of restore's options is to ignore errors).  Whereas in normal
> operation Btrfs won't hand over anything with checksum errors; you get
> EIO instead.  So there's a decent chance of getting the data off the
> drive this way.
>
> First order of priority is to get data off the drive, if you don't
> have a current backup.  Second, once you have a backup, demote this
> drive entirely, and start your restore from backup using good drives.

+

> That drive is toast.  The giveaway here is the over 1000 "Current
> Pending Sectors".  There's no point trying to convert this drive to
> DUP; it must simply be stopped, and whatever files you can successfully
> copy, consider yourself lucky.

Right after those comments I changed my priority to getting the data
off to a reliable location (rather than converting the profile to DUP)
before replacing the drives.  Luckily, merging the good files from
three mirrored machines made it possible to recover nearly all data
(all important data; only a few unimportant log files were corrupted).
Thanks again and again for this valuable redirection.

> Oh, and last but actually I should have mentioned it first, because
> you'll want to do this right away: check whether this drive has
> configurable SCT ERC.
>
> smartctl -l scterc /dev/

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

It seems the drive has SCT ERC support, but it is disabled.  However,
a weird error is thrown with your exact syntax:

=======> INVALID ARGUMENT TO -l: scterc,1800,70

It's an interesting approach to set up long read-time windows.  I'll
keep this in mind, even though this time I'm determined to build the
correct setup that will make such a data-scraping job unnecessary.


> I wasn't clear on why your backup is supposed to be bad; BTRFS should
> have caught any errors during the backup and stopped things with I/O
> errors.

My strategy was to set up multiple machines that sync with each other
over the network.  The database part was easy, since CouchDB has
synchronization out of the box.  For the rest of the system (I'm using
one LXC container per service) I would use `btrfs send | btrfs receive`
every hour, rotating a single snapshot.  I didn't set up a RAID-1
profile because I thought it wasn't necessary in this context.
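
(For reference, that hourly sync is roughly the following sketch; the
paths, snapshot names and remote host are placeholders:

    btrfs subvolume snapshot -r /data /data/snap.new
    btrfs send -p /data/snap.prev /data/snap.new | ssh backuphost btrfs receive /backup
    # rotate: delete snap.prev, then rename snap.new to snap.prev

The -r (read-only) snapshot is required by btrfs send, and -p sends
only the delta against the previous snapshot.)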

The first problem was that I "hoped" the machine would simply crash
with "DRDY ERR"s when the disk had *any* problem.  I was hoping to be
notified of the very first error by a total system failure.  Obviously
it doesn't work like that: neither the OS nor the applications throw
any error until something attempts to read or write the corrupted
file, so the corruption happened without being noticed.  This was my
mistake, and I learned that I should check the filesystem with `btrfs
scrub`.  The "bad idea" part was this: expecting an immediate
disk-failure notification via a total system crash.

The second problem is that I mistakenly thought `btrfs sub snap` would
throw the same "Input/output error"s that `cp` does.  It turns out this
is not the case, which is totally logical: if we had such checksum
verification while snapshotting, a snapshot operation would take far
too long.  I'm just realizing that.

After monitoring those corruption events, I still think that I don't
need a RAID-1 setup in order not to lose data.  However, a RAID-1
setup would greatly shorten the recovery time of the problematic node.

Now I think the good idea is: use RAID-1, monitor disk health, and be
prepared to replace the disks on the fly.


* Re: Is it logical to use a disk that scrub fails but smartctl succeeds?
  2019-12-16 10:36       ` Cerem Cem ASLAN
@ 2019-12-16 16:26         ` Chris Murphy
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2019-12-16 16:26 UTC (permalink / raw)
  To: Cerem Cem ASLAN; +Cc: Btrfs BTRFS

On Mon, Dec 16, 2019 at 3:36 AM Cerem Cem ASLAN <ceremcem@ceremcem.net> wrote:
>
> > smartctl -l scterc /dev/
>
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled

For daily production use I recommend changing both to 7 seconds.  It's
possible to set up a udev rule for this so it's always in place for
specific drives, keyed by whatever /dev/disk/by-* link you want (WWN,
serial, or label), whereas /dev/sda, /dev/sdb names are not always
reliably assigned during startup.
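
A sketch of such a rule (the file path and the serial string are
illustrative; match on whatever identifies your drives):

    # /etc/udev/rules.d/60-scterc.rules
    ACTION=="add", SUBSYSTEM=="block", ENV{ID_SERIAL}=="WDC_WD40EFRX_WD-XYZ", \
        RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"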

The logic is that it's better to have quick failures.  These produce
discrete errors, with the affected sector's LBA, and Btrfs can act on
them with self-healing, whether during an ordinary read or a scrub.
Self-healing does require redundancy, but even with single-copy data
you'll get a path-to-file reference for the affected file; it's often
easier to just delete that file and copy it back from backup.

Whereas with ERC disabled, it's uncertain what the error timeout is.
With consumer drives, so-called "deep recovery" is possible, which can
take an extraordinary amount of time and manifests as a storage-stack
slowdown.  But by default the kernel's SCSI block layer has a command
timer of its own, 30 seconds; if a command hasn't completed in 30
seconds, the kernel will try to reset the device.  Upon reset, the
entire command queue is lost on SATA drives (on SAS drives just the
delayed command is excised), but in either case it's never discovered
which sector caused the delay.  Essentially the real problem gets
masked by the reset.

The end result is that bad sectors can just get worse and worse
(slower and slower recovery) until the data on them is lost for good,
while in the meantime the storage stack gets hung up on these slow
read delays as the drive firmware keeps retrying reads from marginal
sectors.  There might be a reasonable use case for long recoveries,
e.g. a boot drive with single-copy data and metadata, where it's better
to have slowdowns than to have EIO blow things up in a non-obvious way.
I personally would still favor a short recovery below 30 seconds; that
way I'll see a discrete drive read error along with the blow-up, and
make the connection.  Slowdowns produce no log entries until there's a
link reset by the kernel.

Also, 7 seconds comes from what I typically see on NAS and enterprise
drives, so it's not a random pick, but other values are sane as well,
as long as SCT ERC is less than the SCSI command timer value (which is
per block device; it is not a setting in the drive but a kernel
setting, found in /sys).
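
That kernel-side timer can be inspected and changed like this (a
sketch; sda is a placeholder):

    cat /sys/block/sda/device/timeout    # default: 30 (seconds)
    echo 180 > /sys/block/sda/device/timeout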

A bit older reference, but it is still valid across Linux distros:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/online_storage_reconfiguration_guide/task_controlling-scsi-command-timer-onlining-devices


>
> It seems the drive has SCT ERC support, but it is disabled.  However,
> a weird error is thrown with your exact syntax:
>
> =======> INVALID ARGUMENT TO -l: scterc,1800,70
>
> It's an interesting approach to set up long read-time windows.  I'll
> keep this in mind, even though this time I'm determined to build the
> correct setup that will make such a data-scraping job unnecessary.

It could be a firmware bug *shrug* -- try something else, like:

-l scterc,1200,1200

Maybe it wants them to be identical.


> The first problem was that I "hoped" the machine would simply crash
> with "DRDY ERR"s when the disk had *any* problem.

Right.  So instead, look through the logs for anything suggesting
there have been link resets (typically from libata, though what this
error looks like exactly depends on what drives you have).  Link resets
prevent the drive-specific error from appearing; hence you want the
drive's internal firmware to give up on error recovery before the
kernel gives up on the delayed command.

More here, which itself has a pile of links to this same issue
affecting md arrays:

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch



-- 
Chris Murphy

