linux-btrfs.vger.kernel.org archive mirror
* btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
@ 2019-06-23 20:45 Zygo Blaxell
  2019-06-24  0:46 ` Qu Wenruo
  2019-06-24  2:45 ` Remi Gauvin
  0 siblings, 2 replies; 10+ messages in thread
From: Zygo Blaxell @ 2019-06-23 20:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 19143 bytes --]

On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote:
> On 2019/6/20 上午7:45, Zygo Blaxell wrote:
> > On Sun, Jun 16, 2019 at 12:05:21AM +0200, Claudius Winkel wrote:
> >> What should I do now ... to use btrfs safely? Should i not use it with
> >> DM-crypt
> > 
> > You might need to disable write caching on your drives, i.e. hdparm -W0.
> 
> This is quite troublesome.
> 
> Disabling write cache normally means performance impact.

The drives I've found that need write cache disabled aren't particularly
fast to begin with, so disabling write cache doesn't harm their
performance very much.  All the speed gains of write caching are lost
when someone has to spend time doing a forced restore from backup after
transid-verify failure.  If you really do need performance, there are
drives with working firmware available that don't cost much more.
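
For reference, the knob itself is one command per drive, and the setting
is usually not persistent across power cycles, so it has to be reapplied
from a boot script or udev rule (/dev/sdX below is a placeholder):

    # Check the current setting; hdparm reports "write-caching =  1 (on)".
    hdparm -W /dev/sdX

    # Turn off the volatile write cache.  FLUSH/FUA still work; the drive
    # just stops acknowledging writes before they reach stable media.
    hdparm -W0 /dev/sdX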

> And disabling it normally would hide the true cause (if it's something
> btrfs' fault).

This is true; however, even if a hypothetical btrfs bug existed,
disabling write caching is an immediately deployable workaround, and
there's currently no other solution other than avoiding drives with
bad firmware.

There could be improvements possible for btrfs to work around bad
firmware...if someone's willing to donate their sanity to get inside
the heads of firmware bugs, and can find a way to fix it that doesn't
make things worse for everyone with working firmware.

> > I have a few drives in my collection that don't have working write cache.
> > They are usually fine, but when otherwise minor failure events occur (e.g.
> > bad cables, bad power supply, failing UNC sectors) then the write cache
> > doesn't behave correctly, and any filesystem or database on the drive
> > gets trashed.
> 
> Normally this shouldn't be the case, as long as the fs has correct
> journal and flush/barrier.

If you are asking the question:

        "Are there some currently shipping retail hard drives that are
        orders of magnitude more likely to corrupt data after simple
        power failures than other drives?"

then the answer is:

	"Hell, yes!  How could there NOT be?"

It wouldn't take very much capital investment or time to find this out
in lab conditions.  Just killing power every 25 minutes while running a
btrfs stress test should do it--or have a UPS hardware failure in ops;
the effect is the same.  Bad drives will show up in a few hours; good
drives take much longer--long enough that, statistically, the good drives
will probably fail outright before btrfs gets corrupted.

> If it's really the hardware to blame, then it means its flush/fua is not
> implemented properly at all, thus the possibility of a single power loss
> leading to corruption should be VERY VERY high.

That exactly matches my observations.  Only a few disks fail at all,
but the ones that do fail do so very often:  60% of corruptions at
10 power failures or less, 100% at 30 power failures or more.

> >  This isn't normal behavior, but the problem does affect
> > the default configuration of some popular mid-range drive models from
> > top-3 hard disk vendors, so it's quite common.
> 
> Would you like to share the info and test methodology to determine it's
> the device to blame? (maybe in another thread)

It's basic data mining on operations failure event logs.

We track events like filesystem corruption, data loss, other hardware
failure, operator errors, power failures, system crashes, dmesg error
messages, etc., and count how many times each failure occurs in systems
with which hardware components.  When a failure occurs, we break the
affected system apart and place its components into other systems or
test machines to isolate which component is causing the failure (e.g. a
failing power supply could create RAM corruption events and disk failure
events, so we move the hardware around to see where the failure goes).
If the same component is involved in repeatable failure events, the
correlation jumps out of the data and we know that component is bad.
We can also do correlations by attributes of the components, i.e. vendor,
model, size, firmware revision, manufacturing date, and correlate
vendor-model-size-firmware to btrfs transid verify failures across
a fleet of different systems.
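
The collection side is nothing exotic; per host it's roughly the sketch
below (the smartctl -i fields are what we parse; everything else--file
names, log shipping, the event database--is whatever your ops tooling
already provides):

    # Inventory: one line per drive with family/model/firmware/serial.
    for dev in /dev/sd[a-z]; do
        id=$(smartctl -i "$dev" | awk -F': +' '
            /^(Model Family|Device Model|Serial Number|Firmware Version)/ {
                printf "%s=%s;", $1, $2 }')
        echo "$(date -Is),$(hostname),$dev,$id"
    done >> /var/log/drive-inventory.csv

    # Failure events: count transid failures in the kernel log and ship
    # them to the same database, tagged with host and filesystem.
    journalctl -k | grep -c 'parent transid verify failed'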

I can go to the data and get a list of all the drive model and firmware
revisions that have been installed in machines with 0 "parent transid
verify failed" events since 2014, and are still online today:

        Device Model: CT240BX500SSD1 Firmware Version: M6CR013
        Device Model: Crucial_CT1050MX300SSD1 Firmware Version: M0CR060
        Device Model: HP SSD S700 Pro 256GB Firmware Version: Q0824G
        Device Model: INTEL SSDSC2KW256G8 Firmware Version: LHF002C
        Device Model: KINGSTON SA400S37240G Firmware Version: R0105A
        Device Model: ST12000VN0007-2GS116 Firmware Version: SC60
        Device Model: ST5000VN0001-1SF17X Firmware Version: AN02
        Device Model: ST8000VN0002-1Z8112 Firmware Version: SC61
        Device Model: TOSHIBA-TR200 Firmware Version: SBFA12.2
        Device Model: WDC WD121KRYZ-01W0RB0 Firmware Version: 01.01H01
        Device Model: WDC WDS250G2B0A-00SM50 Firmware Version: X61190WD
        Model Family: SandForce Driven SSDs Device Model: KINGSTON SV300S37A240G Firmware Version: 608ABBF0
        Model Family: Seagate IronWolf Device Model: ST10000VN0004-1ZD101 Firmware Version: SC60
        Model Family: Seagate NAS HDD Device Model: ST4000VN000-1H4168 Firmware Version: SC44
        Model Family: Seagate NAS HDD Device Model: ST8000VN0002-1Z8112 Firmware Version: SC60
        Model Family: Toshiba 2.5" HDD MK..59GSXP (AF) Device Model: TOSHIBA MK3259GSXP Firmware Version: GN003J
        Model Family: Western Digital Gold Device Model: WDC WD101KRYZ-01JPDB0 Firmware Version: 01.01H01
        Model Family: Western Digital Green Device Model: WDC WD10EZRX-00L4HB0 Firmware Version: 01.01A01
        Model Family: Western Digital Re Device Model: WDC WD2000FYYZ-01UL1B1 Firmware Version: 01.01K02
        Model Family: Western Digital Red Device Model: WDC WD50EFRX-68MYMN1 Firmware Version: 82.00A82
        Model Family: Western Digital Red Device Model: WDC WD80EFZX-68UW8N0 Firmware Version: 83.H0A83
        Model Family: Western Digital Red Pro Device Model: WDC WD6002FFWX-68TZ4N0 Firmware Version: 83.H0A83

So far so good.  The above list of drive model-vendor-firmware have
collectively had hundreds of drive-power-failure events in the last 5
years, so we have been giving the firmware a fair workout [1].

Now let's look for some bad stuff.  How about a list of drives that were
involved in parent transid verify failure events occurring within 1-10
power cycles after mkfs events:

	Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 Firmware Version: 80.00A80

Change the query to 1-30 power cycles, and we get another model with
the same firmware version string:

	Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 Firmware Version: 80.00A80

Removing the upper bound on power cycle count doesn't find any more.
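
The queries themselves are trivial once the events are in a database; a
sketch with a hypothetical schema (the real event table has many more
columns) looks like this:

    sqlite3 events.db <<'SQL'
    -- drive model/firmware combinations implicated in transid failures
    -- within N power cycles of mkfs; adjust BETWEEN for the 1-10, 1-30,
    -- and unbounded runs above.
    SELECT model_family, device_model, firmware, COUNT(*) AS failures
      FROM transid_failure_events
     WHERE power_cycles_since_mkfs BETWEEN 1 AND 10
     GROUP BY model_family, device_model, firmware
     ORDER BY failures DESC;
    SQL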

The drives running 80.00A80 are all in fairly similar condition: no errors
in SMART, the drive was apparently healthy at the time of failure (no
unusual speed variations, no unexpected drive resets, or any of the other
things that happen to these drives as they age and fail, but that are
not reported as official errors on the models without TLER).  There are
multiple transid-verify failures logged in multiple very different host
systems (e.g. Intel 1U server in a data center, AMD desktop in an office,
hardware ages a few years apart).  This is a consistent and repeatable
behavior that does not correlate to any other attribute.

Now, if you've been reading this far, you might wonder why the previous
two ranges were lower-bounded at 1 power cycle, and the reason is because
I have another firmware in the data set with _zero_ power cycles between
mkfs and failure:

	Model Family: Western Digital Caviar Black Device Model: WDC WD1002FAEX-00Z3A0 Firmware Version: 05.01D05

These drives have 0 power fail events between mkfs and "parent transid
verify failed" events, i.e. it's not necessary to have a power failure
at all for these drives to unrecoverably corrupt btrfs.  In all cases the
failure occurs on the same days as "Current Pending Sector" and "Offline
UNC sector" SMART events.  The WD Black firmware seems to be OK with write
cache enabled most of the time (there's years in the log data without any
transid-verify failures), but the WD Black will drop its write cache when
it sees a UNC sector, and btrfs notices the failure a few hours later.

Recently I've been asking people who show up on IRC with btrfs
transid-verify failures (excluding those with obvious symptoms of host
RAM failure) what drives they have.  So far all the users who have
participated in this
totally unscientific survey have WD Green 2TB and WD Black hard drives
with the same firmware revisions as above.  The most recent report was
this week.  I guess there are a lot of drives with these firmwares still
in inventories out there.

The data says there are at least 2 firmware versions in the wild which
account for 100% of the btrfs transid-verify failures.  These are only 8%
of the total fleet of disks in my data set, but they are punching far
above their weight in terms of failure event count.

I first observed these correlations back in 2016.  We had a lot of WD
Green and Black drives in service at the time--too many to replace or
upgrade them all early--so I looked for a workaround to force the
drives to behave properly.  Since it looked like a write ordering issue,
I disabled the write cache on drives with these firmware versions, and
found that the transid-verify filesystem failures stopped immediately
(they had been bi-weekly events with write cache enabled).

That was 3 years ago, and there are no new transid-verify failures
logged since then.  The drives are still online today with filesystems
mkfsed in 2016.

One bias to be aware of from this data set:  it goes back further than 5
years, and we use the data to optimize hardware costs including the cost
of ops failures.  You might notice there are no Seagate Barracudas[2] in
the data, while there are the similar WD models.  In an unbiased sample
of hard drives, there are likely to be more bad firmware revisions than
found in this data set.  I found 2, and that's a lower bound on the real
number out there.

> Your idea on hardware's faulty FLUSH/FUA implementation could definitely
> cause exactly the same problem, but the last time I asked similar
> problem to fs-devel, there is no proof for such possibility.

Well, correlation isn't proof, it's true; however, if a behavior looks
like a firmware bug, and quacks like a firmware bug, and is otherwise
indistinguishable from a firmware bug, then it's probably a firmware bug.

I don't know if any of these problems are really device firmware bugs or
Linux bugs, particularly in the WD Black case.  That's a question for
someone who can collect some of these devices and do deeper analysis.

In particular, my data is not sufficient to rule out either of these two
theories for the WD Black:

	1.  Linux doesn't use FLUSH/FUA correctly when there are IO errors
	/ drive resets / other things that happen around the times that
	drives have bad sectors, but it is OK as long as there are no
	cached writes that need to be flushed, or

	2.  It's just a bug in one particular drive firmware revision,
	Linux is doing the right thing with FLUSH/FUA and the firmware
	is not.

For the bad WD Green/Red firmware it's much simpler:  those firmware
revisions fail while the drive is not showing any symptoms of defects.
AFAIK there's nothing happening on these drives for Linux code to get
confused about that doesn't also happen on every other drive firmware.

Maybe it's a firmware bug WD already fixed back in 2014, and it just
takes a decade for all the old drives to work their way through the
supply chain and service lifetime.

> The problem is always a ghost to chase, extra info would greatly help us
> to pin it down.

This lack of information is a bit frustrating.  It's not particularly
hard or expensive to collect this data, but I've had to collect it
myself because I don't know of any reliable source I could buy it from.

I found two bad firmwares by accident when I wasn't looking for bad
firmware.  If I'd known where to look, I could have found them much
faster: I had the necessary failure event observations within a few
months after starting the first btrfs pilot projects, but I wasn't
expecting to find firmware bugs, so I didn't recognize them until there
were double-digit failure counts.

WD Green and Black are low-cost consumer hard drives under $250.
One drive of each size in both product ranges comes to a total price
of around $1200 on Amazon.  Lots of end users will have these drives,
and some of them will want to use btrfs, but some of the drives apparently
do not have working write caching.  We should at least know which ones
those are, maybe make a kernel blacklist to disable the write caching
feature on some firmware versions by default.
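
Until the kernel grows such a list, a udev rule keyed on the model and
firmware strings does the same job from userspace.  A sketch--check the
exact ID_MODEL/ID_REVISION values with "udevadm info --query=property
/dev/sdX" first, since the strings below are my guess at the formatting:

    cat > /etc/udev/rules.d/99-broken-write-cache.rules <<'EOF'
    # Disable the volatile write cache on known-bad model/firmware combos.
    ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd?", ENV{ID_MODEL}=="WDC_WD20EZRX-00DC0B0", ENV{ID_REVISION}=="80.00A80", RUN+="/usr/sbin/hdparm -W0 /dev/%k"
    EOF
    udevadm control --reload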

A modestly funded deliberate search project could build a map of firmware
reliability in currently shipping retail hard drives from all three
big vendors, and keep it updated as new firmware revisions come out.
Sort of like Backblaze's hard drive reliability stats, except you don't
need a thousand drives to test firmware--one or two will suffice most of
the time [3].  The data can probably be scraped from end user reports
(if you have enough of them to filter out noise) and existing ops logs
(better, if their methodology is sound) too.



> Thanks,
> Qu

[1] Pedants will notice that some of these drive firmwares range in age
from 6 months to 7 years, and neither of those numbers is 5 years, and
the power failure rate is implausibly high for a data center environment.
Some of the devices live in offices and laptops, and the power failures
are not evenly distributed across the fleet.  It's entirely possible that
some newer device in the 0-failures list will fail horribly next week.
Most of the NAS and DC devices and all the SSDs have not had any UNC
sector events in the fleet yet, and they could still turn out to be
ticking time bombs like the WD Black once they start to grow sector
defects.  The data does _not_ say that all of those 0-failure firmwares
are bug free under identical conditions--it says that, in a race to
be the first ever firmware to demonstrate bad behavior, the firmwares
in the 0-failures list haven't left the starting line yet, while the 2
firmwares in the multi-failures list both seem to be trying to _win_.

[2] We had a few surviving Seagate Barracudas in 2016, but over 85% of
those built before 2015 had failed by 2016, and none of the survivors
are still online today.  In practical terms, it doesn't matter if a
pre-2015 Barracuda has correct power-failing write-cache behavior when
the drive hardware typically dies more often than the host's office has
power interruptions.

[3] OK, maybe it IS hard to find WD Black drives to test at the _exact_
moment they are remapping UNC sectors...tap one gently with a hammer,
maybe, or poke a hole in the air filter to let a bit of dust in?

> > After turning off write caching, btrfs can keep running on these problem
> > drive models until they get too old and broken to spin up any more.
> > With write caching turned on, these drive models will eat a btrfs every
> > few months.
> > 
> > 
> >> Or even use ZFS instead...
> >>
> >> Am 11/06/2019 um 15:02 schrieb Qu Wenruo:
> >>>
> >>> On 2019/6/11 下午6:53, claudius@winca.de wrote:
> >>>> HI Guys,
> >>>>
> >>>> you are my last try. I was so happy to use BTRFS but now i really hate
> >>>> it....
> >>>>
> >>>>
> >>>> Linux CIA 4.15.0-51-generic #55-Ubuntu SMP Wed May 15 14:27:21 UTC 2019
> >>>> x86_64 x86_64 x86_64 GNU/Linux
> >>>> btrfs-progs v4.15.1
> >>> So old kernel and old progs.
> >>>
> >>>> btrfs fi show
> >>>> Label: none  uuid: 9622fd5c-5f7a-4e72-8efa-3d56a462ba85
> >>>>          Total devices 1 FS bytes used 4.58TiB
> >>>>          devid    1 size 7.28TiB used 4.59TiB path /dev/mapper/volume1
> >>>>
> >>>>
> >>>> dmesg
> >>>>
> >>>> [57501.267526] BTRFS info (device dm-5): trying to use backup root at
> >>>> mount time
> >>>> [57501.267528] BTRFS info (device dm-5): disk space caching is enabled
> >>>> [57501.267529] BTRFS info (device dm-5): has skinny extents
> >>>> [57507.511830] BTRFS error (device dm-5): parent transid verify failed
> >>>> on 2069131051008 wanted 4240 found 5115
> >>> Some metadata CoW is not recorded correctly.
> >>>
> >>> Hopes you didn't every try any btrfs check --repair|--init-* or anything
> >>> other than --readonly.
> >>> As there is a long exiting bug in btrfs-progs which could cause similar
> >>> corruption.
> >>>
> >>>
> >>>
> >>>> [57507.518764] BTRFS error (device dm-5): parent transid verify failed
> >>>> on 2069131051008 wanted 4240 found 5115
> >>>> [57507.519265] BTRFS error (device dm-5): failed to read block groups: -5
> >>>> [57507.605939] BTRFS error (device dm-5): open_ctree failed
> >>>>
> >>>>
> >>>> btrfs check /dev/mapper/volume1
> >>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
> >>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
> >>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
> >>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
> >>>> Ignoring transid failure
> >>>> extent buffer leak: start 2024985772032 len 16384
> >>>> ERROR: cannot open file system
> >>>>
> >>>>
> >>>>
> >>>> im not able to mount it anymore.
> >>>>
> >>>>
> >>>> I found the drive in RO the other day and realized somthing was wrong
> >>>> ... i did a reboot and now i cant mount anmyore
> >>> Btrfs extent tree must has been corrupted at that time.
> >>>
> >>> Full recovery back to fully RW mountable fs doesn't look possible.
> >>> As metadata CoW is completely screwed up in this case.
> >>>
> >>> Either you could use btrfs-restore to try to restore the data into
> >>> another location.
> >>>
> >>> Or try my kernel branch:
> >>> https://github.com/adam900710/linux/tree/rescue_options
> >>>
> >>> It's an older branch based on v5.1-rc4.
> >>> But it has some extra new mount options.
> >>> For your case, you need to compile the kernel, then mount it with "-o
> >>> ro,rescue=skip_bg,rescue=no_log_replay".
> >>>
> >>> If it mounts (as RO), then do all your salvage.
> >>> It should be a faster than btrfs-restore, and you can use all your
> >>> regular tool to backup.
> >>>
> >>> Thanks,
> >>> Qu
> >>>
> >>>>
> >>>> any help
> 




[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]


* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
  2019-06-23 20:45 btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible) Zygo Blaxell
@ 2019-06-24  0:46 ` Qu Wenruo
  2019-06-24  4:29   ` Zygo Blaxell
  2019-06-24 17:31   ` Chris Murphy
  2019-06-24  2:45 ` Remi Gauvin
  1 sibling, 2 replies; 10+ messages in thread
From: Qu Wenruo @ 2019-06-24  0:46 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 21201 bytes --]



On 2019/6/24 上午4:45, Zygo Blaxell wrote:
> On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote:
>> On 2019/6/20 上午7:45, Zygo Blaxell wrote:
>>> On Sun, Jun 16, 2019 at 12:05:21AM +0200, Claudius Winkel wrote:
>>>> What should I do now ... to use btrfs safely? Should i not use it with
>>>> DM-crypt
>>>
>>> You might need to disable write caching on your drives, i.e. hdparm -W0.
>>
>> This is quite troublesome.
>>
>> Disabling write cache normally means performance impact.
> 
> The drives I've found that need write cache disabled aren't particularly
> fast to begin with, so disabling write cache doesn't harm their
> performance very much.  All the speed gains of write caching are lost
> when someone has to spend time doing a forced restore from backup after
> transid-verify failure.  If you really do need performance, there are
> drives with working firmware available that don't cost much more.
> 
>> And disabling it normally would hide the true cause (if it's something
>> btrfs' fault).
> 
> This is true; however, even if a hypothetical btrfs bug existed,
> disabling write caching is an immediately deployable workaround, and
> there's currently no other solution other than avoiding drives with
> bad firmware.
> 
> There could be improvements possible for btrfs to work around bad
> firmware...if someone's willing to donate their sanity to get inside
> the heads of firmware bugs, and can find a way to fix it that doesn't
> make things worse for everyone with working firmware.
> 
>>> I have a few drives in my collection that don't have working write cache.
>>> They are usually fine, but when otherwise minor failure events occur (e.g.
>>> bad cables, bad power supply, failing UNC sectors) then the write cache
>>> doesn't behave correctly, and any filesystem or database on the drive
>>> gets trashed.
>>
>> Normally this shouldn't be the case, as long as the fs has correct
>> journal and flush/barrier.
> 
> If you are asking the question:
> 
>         "Are there some currently shipping retail hard drives that are
>         orders of magnitude more likely to corrupt data after simple
>         power failures than other drives?"
> 
> then the answer is:
> 
> 	"Hell, yes!  How could there NOT be?"
> 
> It wouldn't take very much capital investment or time to find this out
> in lab conditions.  Just killing power every 25 minutes while running a
> btrfs stress test should do it--or have a UPS hardware failure in ops;
> the effect is the same.  Bad drives will show up in a few hours; good
> drives take much longer--long enough that, statistically, the good drives
> will probably fail outright before btrfs gets corrupted.

Now it sounds like we really need a good way to do such tests (something
more elegant than just random power failures, i.e. a more controlled system).

> 
>> If it's really the hardware to blame, then it means its flush/fua is not
>> implemented properly at all, thus the possibility of a single power loss
>> leading to corruption should be VERY VERY high.
> 
> That exactly matches my observations.  Only a few disks fail at all,
> but the ones that do fail do so very often:  60% of corruptions at
> 10 power failures or less, 100% at 30 power failures or more.
> 
>>>  This isn't normal behavior, but the problem does affect
>>> the default configuration of some popular mid-range drive models from
>>> top-3 hard disk vendors, so it's quite common.
>>
>> Would you like to share the info and test methodology to determine it's
>> the device to blame? (maybe in another thread)
> 
> It's basic data mining on operations failure event logs.
> 
> We track events like filesystem corruption, data loss, other hardware
> failure, operator errors, power failures, system crashes, dmesg error
> messages, etc., and count how many times each failure occurs in systems
> with which hardware components.  When a failure occurs, we break the
> affected system apart and place its components into other systems or
> test machines to isolate which component is causing the failure (e.g. a
> failing power supply could create RAM corruption events and disk failure
> events, so we move the hardware around to see where the failure goes).
> If the same component is involved in repeatable failure events, the
> correlation jumps out of the data and we know that component is bad.
> We can also do correlations by attributes of the components, i.e. vendor,
> model, size, firmware revision, manufacturing date, and correlate
> vendor-model-size-firmware to btrfs transid verify failures across
> a fleet of different systems.
> 
> I can go to the data and get a list of all the drive model and firmware
> revisions that have been installed in machines with 0 "parent transid
> verify failed" events since 2014, and are still online today:
> 
>         Device Model: CT240BX500SSD1 Firmware Version: M6CR013
>         Device Model: Crucial_CT1050MX300SSD1 Firmware Version: M0CR060
>         Device Model: HP SSD S700 Pro 256GB Firmware Version: Q0824G
>         Device Model: INTEL SSDSC2KW256G8 Firmware Version: LHF002C
>         Device Model: KINGSTON SA400S37240G Firmware Version: R0105A
>         Device Model: ST12000VN0007-2GS116 Firmware Version: SC60
>         Device Model: ST5000VN0001-1SF17X Firmware Version: AN02
>         Device Model: ST8000VN0002-1Z8112 Firmware Version: SC61
>         Device Model: TOSHIBA-TR200 Firmware Version: SBFA12.2
>         Device Model: WDC WD121KRYZ-01W0RB0 Firmware Version: 01.01H01
>         Device Model: WDC WDS250G2B0A-00SM50 Firmware Version: X61190WD
>         Model Family: SandForce Driven SSDs Device Model: KINGSTON SV300S37A240G Firmware Version: 608ABBF0
>         Model Family: Seagate IronWolf Device Model: ST10000VN0004-1ZD101 Firmware Version: SC60
>         Model Family: Seagate NAS HDD Device Model: ST4000VN000-1H4168 Firmware Version: SC44
>         Model Family: Seagate NAS HDD Device Model: ST8000VN0002-1Z8112 Firmware Version: SC60
>         Model Family: Toshiba 2.5" HDD MK..59GSXP (AF) Device Model: TOSHIBA MK3259GSXP Firmware Version: GN003J
>         Model Family: Western Digital Gold Device Model: WDC WD101KRYZ-01JPDB0 Firmware Version: 01.01H01
>         Model Family: Western Digital Green Device Model: WDC WD10EZRX-00L4HB0 Firmware Version: 01.01A01
>         Model Family: Western Digital Re Device Model: WDC WD2000FYYZ-01UL1B1 Firmware Version: 01.01K02
>         Model Family: Western Digital Red Device Model: WDC WD50EFRX-68MYMN1 Firmware Version: 82.00A82
>         Model Family: Western Digital Red Device Model: WDC WD80EFZX-68UW8N0 Firmware Version: 83.H0A83
>         Model Family: Western Digital Red Pro Device Model: WDC WD6002FFWX-68TZ4N0 Firmware Version: 83.H0A83

At least there are a lot of GOOD disks, what a relief.

> 
> So far so good.  The above list of drive model-vendor-firmware have
> collectively had hundreds of drive-power-failure events in the last 5
> years, so we have been giving the firmware a fair workout [1].
> 
> Now let's look for some bad stuff.  How about a list of drives that were
> involved in parent transid verify failure events occurring within 1-10
> power cycles after mkfs events:
> 
> 	Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 Firmware Version: 80.00A80
> 
> Change the query to 1-30 power cycles, and we get another model with
> the same firmware version string:
> 
> 	Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 Firmware Version: 80.00A80
> 
> Removing the upper bound on power cycle count doesn't find any more.
> 
> The drives running 80.00A80 are all in fairly similar condition: no errors
> in SMART, the drive was apparently healthy at the time of failure (no
> unusual speed variations, no unexpected drive resets, or any of the other
> things that happen to these drives as they age and fail, but that are
> not reported as official errors on the models without TLER).  There are
> multiple transid-verify failures logged in multiple very different host
> systems (e.g. Intel 1U server in a data center, AMD desktop in an office,
> hardware ages a few years apart).  This is a consistent and repeatable
> behavior that does not correlate to any other attribute.
> 
> Now, if you've been reading this far, you might wonder why the previous
> two ranges were lower-bounded at 1 power cycle, and the reason is because
> I have another firmware in the data set with _zero_ power cycles between
> mkfs and failure:
> 
> 	Model Family: Western Digital Caviar Black Device Model: WDC WD1002FAEX-00Z3A0 Firmware Version: 05.01D05
> 
> These drives have 0 power fail events between mkfs and "parent transid
> verify failed" events, i.e. it's not necessary to have a power failure
> at all for these drives to unrecoverably corrupt btrfs.  In all cases the
> failure occurs on the same days as "Current Pending Sector" and "Offline
> UNC sector" SMART events.  The WD Black firmware seems to be OK with write
> cache enabled most of the time (there's years in the log data without any
> transid-verify failures), but the WD Black will drop its write cache when
> it sees a UNC sector, and btrfs notices the failure a few hours later.
> 
> Recently I've been asking people who show up on IRC with btrfs
> transid-verify failures (excluding those with obvious symptoms of host
> RAM failure) what drives they have.  So far all the users who have
> participated in this
> totally unscientific survey have WD Green 2TB and WD Black hard drives
> with the same firmware revisions as above.  The most recent report was
> this week.  I guess there are a lot of drives with these firmwares still
> in inventories out there.
> 
> The data says there are at least 2 firmware versions in the wild which
> account for 100% of the btrfs transid-verify failures.  These are only 8%
> of the total fleet of disks in my data set, but they are punching far
> above their weight in terms of failure event count.
> 
> I first observed these correlations back in 2016.  We had a lot of WD
> Green and Black drives in service at the time--too many to replace or
> upgrade them all early--so I looked for a workaround to force the
> drives to behave properly.  Since it looked like a write ordering issue,
> I disabled the write cache on drives with these firmware versions, and
> found that the transid-verify filesystem failures stopped immediately
> (they had been bi-weekly events with write cache enabled).

So the worst-case scenario really does happen in the real world: badly
implemented flush/fua in the firmware.
Btrfs has no way to fix such a low-level problem.


BTW, have you seen any corruption using the bad drives (with write cache
enabled) under a traditional journal-based fs like XFS/EXT4?

Btrfs relies more on the hardware to implement barrier/flush properly,
or CoW can be easily ruined.
If the firmware is only tested (if tested at all) against such
filesystems, it may be the vendor's problem.

> 
> That was 3 years ago, and there are no new transid-verify failures
> logged since then.  The drives are still online today with filesystems
> mkfsed in 2016.
> 
> One bias to be aware of from this data set:  it goes back further than 5
> years, and we use the data to optimize hardware costs including the cost
> of ops failures.  You might notice there are no Seagate Barracudas[2] in
> the data, while there are the similar WD models.  In an unbiased sample
> of hard drives, there are likely to be more bad firmware revisions than
> found in this data set.  I found 2, and that's a lower bound on the real
> number out there.
> 
>> Your idea on hardware's faulty FLUSH/FUA implementation could definitely
>> cause exactly the same problem, but the last time I asked similar
>> problem to fs-devel, there is no proof for such possibility.
> 
> Well, correlation isn't proof, it's true; however, if a behavior looks
> like a firmware bug, and quacks like a firmware bug, and is otherwise
> indistinguishable from a firmware bug, then it's probably a firmware bug.
> 
> I don't know if any of these problems are really device firmware bugs or
> Linux bugs, particularly in the WD Black case.  That's a question for
> someone who can collect some of these devices and do deeper analysis.
> 
> In particular, my data is not sufficient to rule out either of these two
> theories for the WD Black:
> 
> 	1.  Linux doesn't use FLUSH/FUA correctly when there are IO errors
> 	/ drive resets / other things that happen around the times that
> 	drives have bad sectors, but it is OK as long as there are no
> 	cached writes that need to be flushed, or
> 
> 	2.  It's just a bug in one particular drive firmware revision,
> 	Linux is doing the right thing with FLUSH/FUA and the firmware
> 	is not.
> 
> For the bad WD Green/Red firmware it's much simpler:  those firmware
> revisions fail while the drive is not showing any symptoms of defects.
> AFAIK there's nothing happening on these drives for Linux code to get
> confused about that doesn't also happen on every other drive firmware.
> 
> Maybe it's a firmware bug WD already fixed back in 2014, and it just
> takes a decade for all the old drives to work their way through the
> supply chain and service lifetime.
> 
>> The problem is always a ghost to chase, extra info would greatly help us
>> to pin it down.
> 
> This lack of information is a bit frustrating.  It's not particularly
> hard or expensive to collect this data, but I've had to collect it
> myself because I don't know of any reliable source I could buy it from.
> 
> I found two bad firmwares by accident when I wasn't looking for bad
> firmware.  If I'd known where to look, I could have found them much
> faster: I had the necessary failure event observations within a few
> months after starting the first btrfs pilot projects, but I wasn't
> expecting to find firmware bugs, so I didn't recognize them until there
> were double-digit failure counts.
> 
> WD Green and Black are low-cost consumer hard drives under $250.
> One drive of each size in both product ranges comes to a total price
> of around $1200 on Amazon.  Lots of end users will have these drives,
> and some of them will want to use btrfs, but some of the drives apparently
> do not have working write caching.  We should at least know which ones
> those are, maybe make a kernel blacklist to disable the write caching
> feature on some firmware versions by default.

To me, the problem isn't finding someone to test these drives, but how
convincing the test methodology is and how accessible the test devices
would be.

Your statistics carry a lot of weight, but it took you years and tons of
disks to expose the problem; it's not something that can be reproduced easily.

On the other hand, if we're going to reproduce power failures quickly and
reliably in a lab environment, then how?
A software-based SATA power cutoff? Or a hardware-controllable SATA power cable?
And how do we make sure it's the flush/fua that isn't implemented properly?

It may take us quite some time to start a similar project (maybe needing
extra hardware development).

But indeed, a project to do 3rd-party SATA hard disk testing looks like a
very interesting candidate for my hackweek project next year.

Thanks,
Qu

> 
> A modestly funded deliberate search project could build a map of firmware
> reliability in currently shipping retail hard drives from all three
> big vendors, and keep it updated as new firmware revisions come out.
> Sort of like Backblaze's hard drive reliability stats, except you don't
> need a thousand drives to test firmware--one or two will suffice most of
> the time [3].  The data can probably be scraped from end user reports
> (if you have enough of them to filter out noise) and existing ops logs
> (better, if their methodology is sound) too.
> 
> 
> 
>> Thanks,
>> Qu
> 
> [1] Pedants will notice that some of these drive firmwares range in age
> from 6 months to 7 years, and neither of those numbers is 5 years, and
> the power failure rate is implausibly high for a data center environment.
> Some of the devices live in offices and laptops, and the power failures
> are not evenly distributed across the fleet.  It's entirely possible that
> some newer device in the 0-failures list will fail horribly next week.
> Most of the NAS and DC devices and all the SSDs have not had any UNC
> sector events in the fleet yet, and they could still turn out to be
> ticking time bombs like the WD Black once they start to grow sector
> defects.  The data does _not_ say that all of those 0-failure firmwares
> are bug free under identical conditions--it says that, in a race to
> be the first ever firmware to demonstrate bad behavior, the firmwares
> in the 0-failures list haven't left the starting line yet, while the 2
> firmwares in the multi-failures list both seem to be trying to _win_.
> 
> [2] We had a few surviving Seagate Barracudas in 2016, but over 85% of
> those built before 2015 had failed by 2016, and none of the survivors
> are still online today.  In practical terms, it doesn't matter if a
> pre-2015 Barracuda has correct power-failing write-cache behavior when
> the drive hardware typically dies more often than the host's office has
> power interruptions.
> 
> [3] OK, maybe it IS hard to find WD Black drives to test at the _exact_
> moment they are remapping UNC sectors...tap one gently with a hammer,
> maybe, or poke a hole in the air filter to let a bit of dust in?
> 
>>> After turning off write caching, btrfs can keep running on these problem
>>> drive models until they get too old and broken to spin up any more.
>>> With write caching turned on, these drive models will eat a btrfs every
>>> few months.
>>>
>>>
>>>> Or even use ZFS instead...
>>>>
>>>> Am 11/06/2019 um 15:02 schrieb Qu Wenruo:
>>>>>
>>>>> On 2019/6/11 下午6:53, claudius@winca.de wrote:
>>>>>> HI Guys,
>>>>>>
>>>>>> you are my last try. I was so happy to use BTRFS but now i really hate
>>>>>> it....
>>>>>>
>>>>>>
>>>>>> Linux CIA 4.15.0-51-generic #55-Ubuntu SMP Wed May 15 14:27:21 UTC 2019
>>>>>> x86_64 x86_64 x86_64 GNU/Linux
>>>>>> btrfs-progs v4.15.1
>>>>> So old kernel and old progs.
>>>>>
>>>>>> btrfs fi show
>>>>>> Label: none  uuid: 9622fd5c-5f7a-4e72-8efa-3d56a462ba85
>>>>>>          Total devices 1 FS bytes used 4.58TiB
>>>>>>          devid    1 size 7.28TiB used 4.59TiB path /dev/mapper/volume1
>>>>>>
>>>>>>
>>>>>> dmesg
>>>>>>
>>>>>> [57501.267526] BTRFS info (device dm-5): trying to use backup root at
>>>>>> mount time
>>>>>> [57501.267528] BTRFS info (device dm-5): disk space caching is enabled
>>>>>> [57501.267529] BTRFS info (device dm-5): has skinny extents
>>>>>> [57507.511830] BTRFS error (device dm-5): parent transid verify failed
>>>>>> on 2069131051008 wanted 4240 found 5115
>>>>> Some metadata CoW is not recorded correctly.
>>>>>
>>>>> Hopes you didn't every try any btrfs check --repair|--init-* or anything
>>>>> other than --readonly.
>>>>> As there is a long exiting bug in btrfs-progs which could cause similar
>>>>> corruption.
>>>>>
>>>>>
>>>>>
>>>>>> [57507.518764] BTRFS error (device dm-5): parent transid verify failed
>>>>>> on 2069131051008 wanted 4240 found 5115
>>>>>> [57507.519265] BTRFS error (device dm-5): failed to read block groups: -5
>>>>>> [57507.605939] BTRFS error (device dm-5): open_ctree failed
>>>>>>
>>>>>>
>>>>>> btrfs check /dev/mapper/volume1
>>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
>>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
>>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
>>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
>>>>>> Ignoring transid failure
>>>>>> extent buffer leak: start 2024985772032 len 16384
>>>>>> ERROR: cannot open file system
>>>>>>
>>>>>>
>>>>>>
>>>>>> im not able to mount it anymore.
>>>>>>
>>>>>>
>>>>>> I found the drive in RO the other day and realized somthing was wrong
>>>>>> ... i did a reboot and now i cant mount anmyore
>>>>> Btrfs extent tree must has been corrupted at that time.
>>>>>
>>>>> Full recovery back to fully RW mountable fs doesn't look possible.
>>>>> As metadata CoW is completely screwed up in this case.
>>>>>
>>>>> Either you could use btrfs-restore to try to restore the data into
>>>>> another location.
>>>>>
>>>>> Or try my kernel branch:
>>>>> https://github.com/adam900710/linux/tree/rescue_options
>>>>>
>>>>> It's an older branch based on v5.1-rc4.
>>>>> But it has some extra new mount options.
>>>>> For your case, you need to compile the kernel, then mount it with "-o
>>>>> ro,rescue=skip_bg,rescue=no_log_replay".
>>>>>
>>>>> If it mounts (as RO), then do all your salvage.
>>>>> It should be a faster than btrfs-restore, and you can use all your
>>>>> regular tool to backup.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> any help
>>
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
  2019-06-23 20:45 btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible) Zygo Blaxell
  2019-06-24  0:46 ` Qu Wenruo
@ 2019-06-24  2:45 ` Remi Gauvin
  2019-06-24  4:37   ` Zygo Blaxell
  1 sibling, 1 reply; 10+ messages in thread
From: Remi Gauvin @ 2019-06-24  2:45 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs

On 2019-06-23 4:45 p.m., Zygo Blaxell wrote:

> 	Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 Firmware Version: 80.00A80
> 
> Change the query to 1-30 power cycles, and we get another model with
> the same firmware version string:
> 
> 	Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 Firmware Version: 80.00A80
> 

> 
> These drives have 0 power fail events between mkfs and "parent transid
> verify failed" events, i.e. it's not necessary to have a power failure
> at all for these drives to unrecoverably corrupt btrfs.  In all cases the
> failure occurs on the same days as "Current Pending Sector" and "Offline
> UNC sector" SMART events.  The WD Black firmware seems to be OK with write
> cache enabled most of the time (there's years in the log data without any
> transid-verify failures), but the WD Black will drop its write cache when
> it sees a UNC sector, and btrfs notices the failure a few hours later.
> 

First, thank you very much for sharing.  I've seen you mention problems
with common consumer drives several times before, but seeing one
specific identified problem firmware version is *very* valuable info.

I have a question about the Black Drives dropping the cache on UNC
error.  If a transid id error like that occurred on a BTRFS RAID 1,
would BTRFS find the correct metadata on the 2nd drive, or does it stop
dead on 1 transid failure?




* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
  2019-06-24  0:46 ` Qu Wenruo
@ 2019-06-24  4:29   ` Zygo Blaxell
  2019-06-24  5:39     ` Qu Wenruo
  2019-06-24 17:31   ` Chris Murphy
  1 sibling, 1 reply; 10+ messages in thread
From: Zygo Blaxell @ 2019-06-24  4:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 14745 bytes --]

On Mon, Jun 24, 2019 at 08:46:06AM +0800, Qu Wenruo wrote:
> On 2019/6/24 上午4:45, Zygo Blaxell wrote:
> > On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote:
> >> On 2019/6/20 上午7:45, Zygo Blaxell wrote:
[...]
> So the worst-case scenario really does happen in the real world: badly
> implemented flush/fua in the firmware.
> Btrfs has no way to fix such a low-level problem.
> 
> BTW, have you seen any corruption using the bad drives (with write cache
> enabled) under a traditional journal-based fs like XFS/EXT4?

Those filesystems don't make full-filesystem data integrity guarantees
like btrfs does, and there's no ext4 equivalent of dup metadata for
self-repair (even metadata csums in ext4 are a recent invention).
Ops didn't record failure events when e2fsck quietly repaired unexpected
filesystem inconsistencies.  On ext3, maybe data corruption happens
because of drive firmware bugs, or maybe the application just didn't use
fsync properly.  Maybe two disks in md-RAID1 have different contents
because they had slightly different IO timings.  Who knows?  There's no
way to tell from passive ops failure monitoring.

On btrfs with flushoncommit, every data anomaly (e.g. backups not
matching origin hosts, obviously corrupted files, scrub failures, etc)
is a distinct failure event.  Differences between disk contents in RAID1
arrays are failure events.  We can put disks with two different firmware
versions in a RAID1 pair, and btrfs will tell us if they disagree, use
the correct one to fix the broken one, or tell us they're both wrong
and it's time to warm up the backups.

In 2013 I had some big RAID10 arrays of WD Green 2TB disks using ext3/4
and mdadm, and there were a *lot* of data corruption events.  So many
events that we didn't have the capacity to investigate them before new
ones came in.  File restore requests for corrupted data were piling up
faster than they could be processed, and we had no systematic way to tell
whether the origin or backup file was correct when they were different.
Those problems eventually expedited our migration to btrfs, because btrfs
let us do deeper and more uniform data collection to see where all the
corruption was coming from.  While changing filesystems, we moved all
the data onto new disks that happened to not have firmware bugs, and all
the corruption abruptly disappeared (well, except for data corrupted by
bugs in btrfs itself, but now those are fixed too).  We didn't know
what was happening until years later when the smaller/cheaper systems
had enough failures to make noticeable patterns.

I would not be surprised if we were having firmware corruption problems
with ext3/ext4 the whole time those RAID10 arrays existed.  Alas, we were
not capturing firmware revision data at the time (only vendor/model),
and we only started capturing firmware revisions after all the old
drives were recycled.  I don't know exactly what firmware versions were
in those arrays...though I do have a short list of suspects.  ;)

> Btrfs relies more on the hardware to implement barrier/flush properly,
> or CoW can be easily ruined.
> If the firmware is only tested (if tested at all) against such
> filesystems, it may be the vendor's problem.
[...]
> > WD Green and Black are low-cost consumer hard drives under $250.
> > One drive of each size in both product ranges comes to a total price
> > of around $1200 on Amazon.  Lots of end users will have these drives,
> > and some of them will want to use btrfs, but some of the drives apparently
> > do not have working write caching.  We should at least know which ones
> > those are, maybe make a kernel blacklist to disable the write caching
> > feature on some firmware versions by default.
> 
> To me, the problem isn't finding someone to test these drives, but how
> convincing the test methodology is and how accessible the test devices
> would be.
> 
> Your statistics carry a lot of weight, but it took you years and tons of
> disks to expose the problem; it's not something that can be reproduced easily.
>
> On the other hand, if we're going to reproduce power failures quickly and
> reliably in a lab environment, then how?
> A software-based SATA power cutoff? Or a hardware-controllable SATA power cable?

You might be overthinking this a bit.  Software-controlled switched
PDUs (or if you're a DIY enthusiast, some PowerSwitch Tails and a
Raspberry Pi) can turn the AC power on and off on a test box.  Get a
cheap desktop machine, put as many different drives into it as it can
hold, start writing test patterns, kill mains power to the whole thing,
power it back up, analyze the data that is now present on disk, log the
result over the network, repeat.  This is the most accurate simulation,
since it replicates all the things that happen during a typical end-user's
power failure, only much more often.  Hopefully all the hardware involved
is designed to handle this situation already.  A standard office PC is
theoretically designed for 1000 cycles (200 working days over 5 years)
and should be able to test 60 drives (6 SATA ports, 10 sets of drives
tested 100 cycles each).  The hardware is all standard equipment in any
IT department.
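
The controller side of that loop is a few lines of shell; something like
the following, where "pdu" stands in for whatever CLI or HTTP API your
switched PDU actually has (that part is entirely site-specific):

    #!/bin/sh
    # Power-cycle torture loop.  The test box is set up to mount its btrfs
    # filesystems at boot, restart the write workload, and report any
    # "parent transid verify failed" or scrub errors over the network.
    OUTLET=3
    cycle=0
    while :; do
        cycle=$((cycle + 1))
        echo "cycle $cycle $(date -Is)" >> power-cycle.log
        pdu on  "$OUTLET"
        sleep 1500              # ~25 minutes of writes
        pdu off "$OUTLET"       # cut mains power mid-write
        sleep 30                # let the PSU drain before the next cycle
    done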

You only need special-purpose hardware if the general-purpose stuff
is failing in ways that aren't interesting (e.g. host RAM is corrupted
during writes so the drive writes garbage, or the power supply breaks
before 1000 cycles).  Some people build elaborate hard disk torture
rigs that mess with input voltages, control temperature and vibration,
etc. to try to replicate the effects of aging, but these setups
aren't representative of typical end-user environments and the results
will only be interesting to hardware makers.

We expect most drives to work and it seems that they do most of the
time--it is the drives that fail most frequently that are interesting.
The drives that fail most frequently are also the easiest to identify
in testing--by definition, they will reproduce failures faster than
the others.

Even if there is an intermittent firmware bug that only appears under
rare conditions, if it happens with lower probability than drive hardware
failure then it's not particularly important.  The target hardware failure
rate for hard drives is 0.1% over the warranty period according to the
specs for many models.  If one drive's hardware is going to fail
with p < 0.001, then maybe the firmware bug makes it lose data at p =
0.00075 instead of p = 0.00050.  Users won't care about this--they'll
use RAID to contain the damage, or just accept the failure risks of a
single-disk system.  Filesystem failures that occur after the drive has
degraded to the point of being unusable are not interesting at all.

> And how do we make sure it's the flush/fua that isn't implemented properly?

Is it necessary?  The drive could write garbage on the disk, or write
correct data to the wrong physical location, when the voltage drops at
the wrong time.  The drive electronics/firmware are supposed to implement
measures to prevent that, and who knows whether they try, and whether
they are successful?  The data corruption that results from the above
events is technically not a flush/fua failure, since it's not a write
reordering or a premature command completion notification to the host,
but it's still data corruption on power failure.

Drives can fail in multiple ways, and it's hard (even for hard disk
engineering teams) to really know what is going on while the power supply
goes out of spec.  To an end user, it doesn't matter why the drive fails,
only that it does fail.  Once you have *enough* drives, some of them
are always failing, and it just becomes a question of balancing the
different risks and mitigation costs (i.e. pick a drive that doesn't
fail so much, and a filesystem that tolerates the failure modes that
happen to average or better drives, and maybe use RAID1 with a mix of
drive vendors to avoid having both mirrors hit by a common firmware bug).

To make sure btrfs is using flush/fua correctly, log the sequence of block
writes and fua/flush commands, then replay that sequence one operation
at a time, and make sure the filesystem correctly recovers after each
operation.  That doesn't need or even want hardware, though--it's better
work for a VM that can operate on block-level snapshots of the filesystem.
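
The kernel's dm-log-writes target plus the replay-log tool shipped in
xfstests (src/log-writes/) already do most of this; the rough shape of a
run is below, with the caveat that the replay-log flags are from memory
and should be checked against the xfstests sources:

    # Record every write/FLUSH/FUA issued to the device under test.
    DEV=/dev/vdb; LOGDEV=/dev/vdc; REPLAY=/dev/vdd
    dmsetup create log --table "0 $(blockdev --getsz $DEV) log-writes $DEV $LOGDEV"

    mkfs.btrfs -f /dev/mapper/log
    mount /dev/mapper/log /mnt
    # ...run the write workload...
    dmsetup message log 0 mark end
    umount /mnt
    dmsetup remove log

    # Replay the recorded sequence onto a scratch device, running a
    # filesystem check after each step.
    replay-log --log "$LOGDEV" --replay "$REPLAY" \
               --check 1 --fsck "btrfs check $REPLAY"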

> It may take us quite some time to start a similar project (maybe needing
> extra hardware development).
> 
> But indeed, a project to do 3rd-party SATA hard disk testing looks like a
> very interesting candidate for my hackweek project next year.
> 
> Thanks,
> Qu
> 
> > 
> > A modestly funded deliberate search project could build a map of firmware
> > reliability in currently shipping retail hard drives from all three
> > big vendors, and keep it updated as new firmware revisions come out.
> > Sort of like Backblaze's hard drive reliability stats, except you don't
> > need a thousand drives to test firmware--one or two will suffice most of
> > the time [3].  The data can probably be scraped from end user reports
> > (if you have enough of them to filter out noise) and existing ops logs
> > (better, if their methodology is sound) too.
> > 
> > 
> > 
> >> Thanks,
> >> Qu
> > 
> > [1] Pedants will notice that some of these drive firmwares range in age
> > from 6 months to 7 years, and neither of those numbers is 5 years, and
> > the power failure rate is implausibly high for a data center environment.
> > Some of the devices live in offices and laptops, and the power failures
> > are not evenly distributed across the fleet.  It's entirely possible that
> > some newer device in the 0-failures list will fail horribly next week.
> > Most of the NAS and DC devices and all the SSDs have not had any UNC
> > sector events in the fleet yet, and they could still turn out to be
> > ticking time bombs like the WD Black once they start to grow sector
> > defects.  The data does _not_ say that all of those 0-failure firmwares
> > are bug free under identical conditions--it says that, in a race to
> > be the first ever firmware to demonstrate bad behavior, the firmwares
> > in the 0-failures list haven't left the starting line yet, while the 2
> > firmwares in the multi-failures list both seem to be trying to _win_.
> > 
> > [2] We had a few surviving Seagate Barracudas in 2016, but over 85% of
> > those built before 2015 had failed by 2016, and none of the survivors
> > are still online today.  In practical terms, it doesn't matter if a
> > pre-2015 Barracuda has correct power-failing write-cache behavior when
> > the drive hardware typically dies more often than the host's office has
> > power interruptions.
> > 
> > [3] OK, maybe it IS hard to find WD Black drives to test at the _exact_
> > moment they are remapping UNC sectors...tap one gently with a hammer,
> > maybe, or poke a hole in the air filter to let a bit of dust in?
> > 
> >>> After turning off write caching, btrfs can keep running on these problem
> >>> drive models until they get too old and broken to spin up any more.
> >>> With write caching turned on, these drive models will eat a btrfs every
> >>> few months.
> >>>
> >>>
> >>>> Or even use ZFS instead...
> >>>>
> >>>> Am 11/06/2019 um 15:02 schrieb Qu Wenruo:
> >>>>>
> >>>>> On 2019/6/11 下午6:53, claudius@winca.de wrote:
> >>>>>> HI Guys,
> >>>>>>
> >>>>>> you are my last try. I was so happy to use BTRFS but now i really hate
> >>>>>> it....
> >>>>>>
> >>>>>>
> >>>>>> Linux CIA 4.15.0-51-generic #55-Ubuntu SMP Wed May 15 14:27:21 UTC 2019
> >>>>>> x86_64 x86_64 x86_64 GNU/Linux
> >>>>>> btrfs-progs v4.15.1
> >>>>> So old kernel and old progs.
> >>>>>
> >>>>>> btrfs fi show
> >>>>>> Label: none  uuid: 9622fd5c-5f7a-4e72-8efa-3d56a462ba85
> >>>>>>          Total devices 1 FS bytes used 4.58TiB
> >>>>>>          devid    1 size 7.28TiB used 4.59TiB path /dev/mapper/volume1
> >>>>>>
> >>>>>>
> >>>>>> dmesg
> >>>>>>
> >>>>>> [57501.267526] BTRFS info (device dm-5): trying to use backup root at
> >>>>>> mount time
> >>>>>> [57501.267528] BTRFS info (device dm-5): disk space caching is enabled
> >>>>>> [57501.267529] BTRFS info (device dm-5): has skinny extents
> >>>>>> [57507.511830] BTRFS error (device dm-5): parent transid verify failed
> >>>>>> on 2069131051008 wanted 4240 found 5115
> >>>>> Some metadata CoW is not recorded correctly.
> >>>>>
> >>>>> Hopes you didn't every try any btrfs check --repair|--init-* or anything
> >>>>> other than --readonly.
> >>>>> As there is a long exiting bug in btrfs-progs which could cause similar
> >>>>> corruption.
> >>>>>
> >>>>>
> >>>>>
> >>>>>> [57507.518764] BTRFS error (device dm-5): parent transid verify failed
> >>>>>> on 2069131051008 wanted 4240 found 5115
> >>>>>> [57507.519265] BTRFS error (device dm-5): failed to read block groups: -5
> >>>>>> [57507.605939] BTRFS error (device dm-5): open_ctree failed
> >>>>>>
> >>>>>>
> >>>>>> btrfs check /dev/mapper/volume1
> >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
> >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
> >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
> >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
> >>>>>> Ignoring transid failure
> >>>>>> extent buffer leak: start 2024985772032 len 16384
> >>>>>> ERROR: cannot open file system
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> im not able to mount it anymore.
> >>>>>>
> >>>>>>
> >>>>>> I found the drive in RO the other day and realized somthing was wrong
> >>>>>> ... i did a reboot and now i cant mount anmyore
> >>>>> Btrfs extent tree must has been corrupted at that time.
> >>>>>
> >>>>> Full recovery back to fully RW mountable fs doesn't look possible.
> >>>>> As metadata CoW is completely screwed up in this case.
> >>>>>
> >>>>> Either you could use btrfs-restore to try to restore the data into
> >>>>> another location.
> >>>>>
> >>>>> Or try my kernel branch:
> >>>>> https://github.com/adam900710/linux/tree/rescue_options
> >>>>>
> >>>>> It's an older branch based on v5.1-rc4.
> >>>>> But it has some extra new mount options.
> >>>>> For your case, you need to compile the kernel, then mount it with "-o
> >>>>> ro,rescue=skip_bg,rescue=no_log_replay".
> >>>>>
> >>>>> If it mounts (as RO), then do all your salvage.
> >>>>> It should be faster than btrfs-restore, and you can use all your
> >>>>> regular tools to back up.
> >>>>>
> >>>>> Thanks,
> >>>>> Qu
> >>>>>
> >>>>>>
> >>>>>> any help
> >>
> > 
> > 
> > 
> 




[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
  2019-06-24  2:45 ` Remi Gauvin
@ 2019-06-24  4:37   ` Zygo Blaxell
  2019-06-24  5:27     ` Zygo Blaxell
  0 siblings, 1 reply; 10+ messages in thread
From: Zygo Blaxell @ 2019-06-24  4:37 UTC (permalink / raw)
  To: Remi Gauvin; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2129 bytes --]

On Sun, Jun 23, 2019 at 10:45:50PM -0400, Remi Gauvin wrote:
> On 2019-06-23 4:45 p.m., Zygo Blaxell wrote:
> 
> > 	Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 Firmware Version: 80.00A80
> > 
> > Change the query to 1-30 power cycles, and we get another model with
> > the same firmware version string:
> > 
> > 	Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 Firmware Version: 80.00A80
> > 
> 
> > 
> > These drives have 0 power fail events between mkfs and "parent transid
> > verify failed" events, i.e. it's not necessary to have a power failure
> > at all for these drives to unrecoverably corrupt btrfs.  In all cases the
> > failure occurs on the same days as "Current Pending Sector" and "Offline
> > UNC sector" SMART events.  The WD Black firmware seems to be OK with write
> > cache enabled most of the time (there's years in the log data without any
> > transid-verify failures), but the WD Black will drop its write cache when
> > it sees a UNC sector, and btrfs notices the failure a few hours later.
> > 
> 
> First, thank you very much for sharing.  I've seen you mention several
> times before problems with common consumer drives, but seeing one
> specific identified problem firmware version is *very* valuable info.
> 
> I have a question about the Black drives dropping the cache on UNC
> error.  If a transid error like that occurred on a BTRFS RAID 1,
> would BTRFS find the correct metadata on the 2nd drive, or does it stop
> dead on 1 transid failure?

Well, the 2nd drive has to have correct metadata--if you are mirroring
a pair of disks with the same firmware bug, that's not likely to happen.

There is a bench test that will demonstrate the transid verify self-repair
procedure: disconnect one half of a RAID1 array, write for a while, then
reconnect and do a scrub.  btrfs should self-repair all the metadata on
the disconnected drive until it all matches the connected one.  Some of
the data blocks might be hosed though (due to CRC32 collisions), so
don't do this test on data you care about.
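
For anyone who wants to try it, here is a minimal sketch of that bench test
on a throwaway mirror--sdX/sdY are placeholders, and mkfs will destroy
whatever is on them:

    mkfs.btrfs -f -m raid1 -d raid1 /dev/sdX /dev/sdY
    mount /dev/sdX /mnt/test
    # ...physically disconnect /dev/sdY, keep writing to /mnt/test for a
    # while, then reconnect the drive...
    btrfs scrub start -Bd /mnt/test    # -B waits, -d prints per-device stats
    btrfs device stats /mnt/test       # corrected vs. uncorrectable counters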

> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
  2019-06-24  4:37   ` Zygo Blaxell
@ 2019-06-24  5:27     ` Zygo Blaxell
  0 siblings, 0 replies; 10+ messages in thread
From: Zygo Blaxell @ 2019-06-24  5:27 UTC (permalink / raw)
  To: Remi Gauvin; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3427 bytes --]

On Mon, Jun 24, 2019 at 12:37:51AM -0400, Zygo Blaxell wrote:
> On Sun, Jun 23, 2019 at 10:45:50PM -0400, Remi Gauvin wrote:
> > On 2019-06-23 4:45 p.m., Zygo Blaxell wrote:
> > 
> > > 	Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 Firmware Version: 80.00A80
> > > 
> > > Change the query to 1-30 power cycles, and we get another model with
> > > the same firmware version string:
> > > 
> > > 	Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 Firmware Version: 80.00A80
> > > 
> > 
> > > 
> > > These drives have 0 power fail events between mkfs and "parent transid
> > > verify failed" events, i.e. it's not necessary to have a power failure
> > > at all for these drives to unrecoverably corrupt btrfs.  In all cases the
> > > failure occurs on the same days as "Current Pending Sector" and "Offline
> > > UNC sector" SMART events.  The WD Black firmware seems to be OK with write
> > > cache enabled most of the time (there's years in the log data without any
> > > transid-verify failures), but the WD Black will drop its write cache when
> > > it sees a UNC sector, and btrfs notices the failure a few hours later.
> > > 
> > 
> > First, thank you very much for sharing.  I've seen you mention several
> > times before problems with common consumer drives, but seeing one
> > specific identified problem firmware version is *very* valuable info.
> > 
> > I have a question about the Black drives dropping the cache on UNC
> > error.  If a transid error like that occurred on a BTRFS RAID 1,
> > would BTRFS find the correct metadata on the 2nd drive, or does it stop
> > dead on 1 transid failure?
> 
> Well, the 2nd drive has to have correct metadata--if you are mirroring
> a pair of disks with the same firmware bug, that's not likely to happen.

OK, I forgot the Black case is a little complicated...

I guess if you had two WD Black drives and they had all their UNC sector
events at different times, then the btrfs RAID1 repair should still
work with write cache enabled.  That seems kind of risky, though--what
if something bumps the machine and both disks get UNC sectors at once?

Alternatives, in roughly increasing order of risk:

	1.  Disable write caching on both Blacks in the pair

	2.  Replace both Blacks with drives in the 0-failure list

	3.  Replace one Black with a Seagate Firecuda or WD Red Pro
	(any other 0-failure drive will do, but these have similar
	performance specs to Black) to ensure firmware diversity

	4.  Find some Black drives with different firmware that have UNC
	sectors and see what happens with write caching during sector
	remap events:  if they behave well, enable write caching on
	all drives with matching firmware, disable if not

	5.  Leave write caching on for now, but as soon as any Black
	reports UNC sectors or reallocation events in SMART data, turn
	write caching off for the remainder of the drive's service life
	(a rough monitoring sketch for this follows below).
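
The monitoring for option 5 could be as small as a cron job--a rough
sketch only, assuming smartmontools is installed and that attributes 5/197
report plain numeric raw values on these drives (device names are examples):

    #!/bin/sh
    # Disable the write cache as soon as SMART shows remapped or pending sectors.
    for dev in /dev/sda /dev/sdb; do
        bad=$(smartctl -A "$dev" | awk '$1 == 5 || $1 == 197 { sum += $10 } END { print sum + 0 }')
        if [ "$bad" -gt 0 ]; then
            logger "wc-guard: $bad reallocated/pending sectors on $dev, disabling write cache"
            hdparm -W0 "$dev"   # not persistent--rerun from cron so it sticks across reboots
        fi
    done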

> There is a bench test that will demonstrate the transid verify self-repair
> procedure: disconnect one half of a RAID1 array, write for a while, then
> reconnect and do a scrub.  btrfs should self-repair all the metadata on
> the disconnected drive until it all matches the connected one.  Some of
> the data blocks might be hosed though (due to CRC32 collisions), so
> don't do this test on data you care about.
> 
> > 
> > 



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
  2019-06-24  4:29   ` Zygo Blaxell
@ 2019-06-24  5:39     ` Qu Wenruo
  0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2019-06-24  5:39 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 7291 bytes --]



On 2019/6/24 下午12:29, Zygo Blaxell wrote:
[...]
> 
>> Btrfs is relying more on the hardware to implement barrier/flush properly,
>> or CoW can be easily ruined.
>> If the firmware is only tested (if tested) against such fs, it may be
>> the problem of the vendor.
> [...]
>>> WD Green and Black are low-cost consumer hard drives under $250.
>>> One drive of each size in both product ranges comes to a total price
>>> of around $1200 on Amazon.  Lots of end users will have these drives,
>>> and some of them will want to use btrfs, but some of the drives apparently
>>> do not have working write caching.  We should at least know which ones
>>> those are, maybe make a kernel blacklist to disable the write caching
>>> feature on some firmware versions by default.
>>
>> To me, the problem isn't for anyone to test these drives, but how
>> convincing the test methodology is and how accessible the test device
>> would be.
>>
>> Your statistics have a lot of weight, but it took you years and tons of
>> disks to expose the problem; it's not something that can be reproduced easily.
>>
>> On the other hand, if we're going to reproduce power failure quickly and
>> reliably in a lab environment, then how?
>> A software-based SATA power cutoff? Or a hardware-controllable SATA power cable?
> 
> You might be overthinking this a bit.  Software-controlled switched
> PDUs (or if you're a DIY enthusiast, some PowerSwitch Tails and a
> Raspberry Pi) can turn the AC power on and off on a test box.  Get a
> cheap desktop machine, put as many different drives into it as it can
> hold, start writing test patterns, kill mains power to the whole thing,
> power it back up, analyze the data that is now present on disk, log the
> result over the network, repeat.  This is the most accurate simulation,
> since it replicates all the things that happen during a typical end-user's
> power failure, only much more often.

To me, this is not as good a methodology as it looks.
It simulates the most common real-world power-loss case, but I'd say
it's less reliable at pinning down the incorrect behavior.
(And there's extra time wasted on POST, booting into the OS and things like that.)

My idea is an SBC-based controller controlling the power cable of the
disk, and another system (or the same SBC, if it supports SATA) running
a regular workload, with dm-log-writes recording every write operation.
Then kill the power to the disk.

Then compare the on-disk data against the dm-log-writes log to see how
the data differs.

From the viewpoint of an end user this is definitely overkill, but at
least to me it could prove how bad the firmware is, leaving the vendor
no excuse to dodge the bullet, and maybe do them a favor by pinning
down the sequence leading to corruption.

Although there are a lot of untested things which can go wrong:
- How does the kernel handle an unresponsive disk?
- Will dm-log-writes record and handle errors correctly?
- Is there anything special the SATA controller will do?

But at least this is going to be a very interesting project.
I already have a rockpro64 SBC with a SATA PCIe card; I just need to
craft a GPIO-controlled switch to kill SATA power.
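
Roughly, the disk-side test could look like this--a sketch only; the GPIO
number, device names and the replay-log invocation (from xfstests) are
assumptions to double-check:

    TESTDEV=/dev/sdX    # the disk whose power gets cut
    LOGDEV=/dev/sdY     # scratch device that records the write log
    dmsetup create powertest \
        --table "0 $(blockdev --getsz $TESTDEV) log-writes $TESTDEV $LOGDEV"
    mkfs.btrfs -f /dev/mapper/powertest
    mount /dev/mapper/powertest /mnt/test
    fsstress -d /mnt/test -n 100000 -p 4 &   # any steady write workload will do
    sleep 60
    echo 0 > /sys/class/gpio/gpio17/value    # previously exported GPIO driving the relay
    # Afterwards, replay the log onto a third disk and compare/check the fs
    # between steps, e.g. with xfstests' replay-log:
    #   replay-log --log $LOGDEV --replay /dev/sdZ   # step with --limit, fsck between steps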

>  Hopefully all the hardware involved
> is designed to handle this situation already.  A standard office PC is
> theoretically designed for 1000 cycles (200 working days over 5 years)
> and should be able to test 60 drives (6 SATA ports, 10 sets of drives
> tested 100 cycles each).  The hardware is all standard equipment in any
> IT department.
> 
> You only need special-purpose hardware if the general-purpose stuff
> is failing in ways that aren't interesting (e.g. host RAM is corrupted
> during writes so the drive writes garbage, or the power supply breaks
> before 1000 cycles).  Some people build elaborate hard disk torture
> rigs that mess with input voltages, control temperature and vibration,
> etc. to try to replicate the effects of aging, but these setups
> aren't representative of typical end-user environments and the results
> will only be interesting to hardware makers.
> 
> We expect most drives to work and it seems that they do most of the
> time--it is the drives that fail most frequently that are interesting.
> The drives that fail most frequently are also the easiest to identify
> in testing--by definition, they will reproduce failures faster than
> the others.
> 
> Even if there is an intermittent firmware bug that only appears under
> rare conditions, if it happens with lower probability than drive hardware
> failure then it's not particularly important.  The target hardware failure
> rate for hard drives is 0.1% over the warranty period according to the
> specs for many models.  If one drive's hardware is going to fail
> with p < 0.001, then maybe the firmware bug makes it lose data at p =
> 0.00075 instead of p = 0.00050.  Users won't care about this--they'll
> use RAID to contain the damage, or just accept the failure risks of a
> single-disk system.  Filesystem failures that occur after the drive has
> degraded to the point of being unusable are not interesting at all.
> 
>> And how to make sure it's the flush/fua not implemented properly?
> 
> Is it necessary?  The drive could write garbage on the disk, or write
> correct data to the wrong physical location, when the voltage drops at
> the wrong time.  The drive electronics/firmware are supposed to implement
> measures to prevent that, and who knows whether they try, and whether
> they are successful?  The data corruption that results from the above
> events is technically not a flush/fua failure, since it's not a write
> reordering or a premature command completion notification to the host,
> but it's still data corruption on power failure.
> 
> Drives can fail in multiple ways, and it's hard (even for hard disk
> engineering teams) to really know what is going on while the power supply
> goes out of spec.  To an end user, it doesn't matter why the drive fails,
> only that it does fail.  Once you have *enough* drives, some of them
> are always failing, and it just becomes a question of balancing the
> different risks and mitigation costs (i.e. pick a drive that doesn't
> fail so much, and a filesystem that tolerates the failure modes that
> happen to average or better drives, and maybe use RAID1 with a mix of
> drive vendors to avoid having both mirrors hit by a common firmware bug).
> 
> To make sure btrfs is using flush/fua correctly, log the sequence of block
> writes and fua/flush commands, then replay that sequence one operation
> at a time, and make sure the filesystem correctly recovers after each
> operation.  That doesn't need or even want hardware, though--it's better
> work for a VM that can operate on block-level snapshots of the filesystem.

That's already what we're doing, dm-log-writes.
And we failed to expose major problems.

All the fsync-related bugs, like the ones Filipe is always fixing, can't be
easily exposed by a random workload even with dm-log-writes.
Most of these bugs need a special corner case to hit, but IIRC so far no
transid problem has been caused by one.

But anyway, thanks for your info, we see some hope in pinning down the
problem.

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
  2019-06-24  0:46 ` Qu Wenruo
  2019-06-24  4:29   ` Zygo Blaxell
@ 2019-06-24 17:31   ` Chris Murphy
  2019-06-26  2:30     ` Zygo Blaxell
  2019-07-02 13:32     ` Andrea Gelmini
  1 sibling, 2 replies; 10+ messages in thread
From: Chris Murphy @ 2019-06-24 17:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Zygo Blaxell, Btrfs BTRFS

On Sun, Jun 23, 2019 at 7:52 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2019/6/24 上午4:45, Zygo Blaxell wrote:
> > I first observed these correlations back in 2016.  We had a lot of WD
> > Green and Black drives in service at the time--too many to replace or
> > upgrade them all early--so I looked for a workaround to force the
> > drives to behave properly.  Since it looked like a write ordering issue,
> > I disabled the write cache on drives with these firmware versions, and
> > found that the transid-verify filesystem failures stopped immediately
> > (they had been bi-weekly events with write cache enabled).
>
> So the worst scenario really happens in real world, badly implemented
> flush/fua from firmware.
> Btrfs has no way to fix such low level problem.

Right. The questions I have: should Btrfs (or any file system) be able
to detect such devices and still protect the data? i.e. for the file
system to somehow be more suspicious, without impacting performance,
and go read-only sooner so that at least read-only mount can work? Or
is this so much work for such a tiny edge case that it's not worth it?

Arguably the hardware is some kind of zombie saboteur. It's not
totally dead, it gives the impression that it's working most of the
time, and then silently fails to do what we think it should in an
extraordinary departure from specs and expectations.

Are there other failure cases that could look like this and therefore
worth handling? As storage stacks get more complicated with ever more
complex firmware, and firmware updates in the field, it might be
useful to have at least one file system that can detect such problems
sooner than others and go read-only to prevent further problems?


> BTW, do you have any corruption using the bad drives (with write cache)
> with a traditional journal-based fs like XFS/EXT4?
>
> Btrfs is relying more on the hardware to implement barrier/flush properly,
> or CoW can be easily ruined.
> If the firmware is only tested (if tested) against such fs, it may be
> the problem of the vendor.

I think we can definitely say this is a vendor problem. But the
question still is whether the file system has a role in at least
disqualifying hardware when it knows it's acting up before the file
system is thoroughly damaged?

I also wonder how ext4 and XFS will behave. In some ways they might
tolerate the problem without noticing it for longer, where instead of
kernel space recognizing it, it's actually user space / application
layer that gets confused first, if it's bogus data that's being
returned. Filesystem metadata is a relatively small target for such
corruption when the file system mostly does overwrites.

I also wonder how ZFS handles this. Both in the single device case,
and in the RAIDZ case.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
  2019-06-24 17:31   ` Chris Murphy
@ 2019-06-26  2:30     ` Zygo Blaxell
  2019-07-02 13:32     ` Andrea Gelmini
  1 sibling, 0 replies; 10+ messages in thread
From: Zygo Blaxell @ 2019-06-26  2:30 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 9715 bytes --]

On Mon, Jun 24, 2019 at 11:31:35AM -0600, Chris Murphy wrote:
> On Sun, Jun 23, 2019 at 7:52 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> >
> >
> > On 2019/6/24 上午4:45, Zygo Blaxell wrote:
> > > I first observed these correlations back in 2016.  We had a lot of WD
> > > Green and Black drives in service at the time--too many to replace or
> > > upgrade them all early--so I looked for a workaround to force the
> > > drives to behave properly.  Since it looked like a write ordering issue,
> > > I disabled the write cache on drives with these firmware versions, and
> > > found that the transid-verify filesystem failures stopped immediately
> > > (they had been bi-weekly events with write cache enabled).
> >
> > So the worst scenario really happens in real world, badly implemented
> > flush/fua from firmware.
> > Btrfs has no way to fix such low level problem.
> 
> Right. The questions I have: should Btrfs (or any file system) be able
> to detect such devices and still protect the data? i.e. for the file
> system to somehow be more suspicious, without impacting performance,
> and go read-only sooner so that at least read-only mount can work? 

Part of the point of UNC sector remapping, especially in consumer
hard drives, is that filesystems _don't_ notice it (health monitoring
daemons might notice SMART events, but it's intentionally transparent to
applications and filesystems).  The alternative is that one bad sector
throws an I/O error at an application that is not prepared to handle it,
or forces the filesystem RO, or triggers a full-device RAID data rebuild.

Of course that all goes sideways if the firmware loses its mind (and
write cache) during UNC sector remapping.

> Or is this so much work for such a tiny edge case that it's not worth it?
> 
> Arguably the hardware is some kind of zombie saboteur. It's not
> totally dead, it gives the impression that it's working most of the
> time, and then silently fails to do what we think it should in an
> extraordinary departure from specs and expectations.

> Are there other failure cases that could look like this and therefore
> worth handling? 

In some ways firmware bugs are just another hardware failure.  Hard disks
are free to have any sector unreadable at any time, or one day the
entire disk could just decide not to spin up any more, or non-ECC
RAM in the embedded controller board could flip some bits at random.
These are all standard failure modes that btrfs detects (and, with an
intact mirror available, automatically corrects).

Firmware bugs are different quantitatively:  they turn
common-but-recoverable failure events into common-and-catastrophic
failure events.  Most people expect catastrophic failure events to
be less common, but manufacturing is hard, and sometimes they are not.
Entire production runs of hard drives can die early due to a manufacturing
equipment miscalibration or a poor choice of electrical component.

> As storage stacks get more complicated with ever more
> complex firmware, and firmware updates in the field, it might be
> useful to have at least one file system that can detect such problems
> sooner than others and go read-only to prevent further problems?

I thought we already had one:  btrfs.  Probably ZFS too.

The problem with parent transid verify failure is that it is only
detected after the filesystem is already damaged.  It's too late to go
RO then; you need a time machine to get the data back.

We could maybe make some more pessimistic assumptions about how stable
new data is so that we can recover from damage in new data far beyond what
flush/fua expectations permit.  AFAIK the Green only fails during a
power failure, so btrfs could keep the last N filesystem transid trees
intact at all times, and during mount btrfs could verify the integrity
of the last transaction and roll back to an earlier transid if there
was a failure.  This has been attempted before, and it has various new
ENOSPC failure modes, and it requires modifications to some already
very complex btrfs code, but if we waved a magic wand and a complete,
debugged implementation of this appeared with a reasonable memory and/or
iops overhead, it would work on the Green drives.
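
As an aside, btrfs already ships a much coarser version of this idea: the
super block keeps a few backup root pointers, and "-o ro,usebackuproot"
will try them at mount time.  It only helps if the old roots' blocks
haven't been reused yet--which is exactly what can't be guaranteed
here--but it costs nothing to try before reaching for btrfs-restore
(device name is a placeholder):

    btrfs inspect-internal dump-super -f /dev/sdX   # shows the backup_roots entries
    mount -o ro,usebackuproot /dev/sdX /mnt         # fall back to an older tree root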

The WD Black is a different beast:  some sequence of writes is lost
when a UNC sector is encountered, but the drive doesn't report the loss
immediately (if it did, btrfs would already go RO before the end of the
transaction, and the metadata tree would remain intact).  The loss is
only detected some time after, during reads which might be thousands of
transids later.  

Both of these approaches have a problem:  when the workaround is used,
the filesystem rolls back to an earlier state, including user data.
In some cases that might not be a good thing, e.g. rolling back 1000
transids on a mail store or OLTP database, or rolling back datacow
files while _not_ rolling back nodatacow files.

btrfs already writes two complete copies of the metadata with dup
metadata, but firmware bugs can kill both copies.  btrfs could hold the
last 256MB of metadata writes in RAM (or whatever amount of RAM is bigger
than the drive cache), and replay those writes or verify the metadata
trees whenever a bad sector is reported or the drive does a bus reset.
This would work if the write cache is dropped during a read, but if the
firmware silently drops the write cache while remapping a UNC sector then
btrfs will not be able to detect the event and would not know to replay
the write log.  This kind of solution seems expensive, and maybe a little
silly, and might not even work against all possible drive firmware bugs
(what if the drive indefinitely postpones some writes, so 256MB isn't
enough RAM log?).

Also, a more meta observation:  we don't know this is what is really
happening in the firmware.  There are clearly problems observed
when multiple events occur concurrently, but there are several possible
mechanisms that could lead to the behavior, and nowhere in my data is
enough information to determine which one is correct.  So if a drive
has a firmware bug that just redirects a cache write to an entirely
random address on the disk (e.g. it corrupts or overruns an internal RAM
buffer) the symptoms will match the observed behavior, but none of these
workaround strategies will work.  You'd need to have a RAID1 mirror in a
different disk to protect against arbitrary data loss anywhere in a
single drive--and btrfs can already support that because it's a normal
behavior for all hard drives.

The cost of these workarounds has to be weighed against the impact
(how many drives are out there with these firmware bugs) and compared
with the cost of other solutions that already exist.  A heterogeneous
RAID1 solves this problem--unless you are unlucky and get two different
firmwares with the same bug.
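
Setting that up is cheap--a sketch, with sdX and sdY standing in for one
drive from each vendor:

    mkfs.btrfs -m raid1 -d raid1 /dev/sdX /dev/sdY
    mount /dev/sdX /mnt
    btrfs filesystem df /mnt    # confirm both Data and Metadata show RAID1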

It may be possible that the best workaround is also the simplest, and also
works for all filesystems at once:  turn the write cache off for drives
where it doesn't work.  CoW filesystems write in big contiguous sorted
chunks, and that gets most of the benefit of write reordering before the
drive gets the data, so there is less to lose if the drive cannot reorder.
An overwriting filesystem writes in smaller, scattered chunks with more
seeking, and can get more benefit from write caching in the drive.

> > BTW, do you have any corruption using the bad drives (with write cache)
> > with a traditional journal-based fs like XFS/EXT4?
> >
> > Btrfs is relying more on the hardware to implement barrier/flush properly,
> > or CoW can be easily ruined.
> > If the firmware is only tested (if tested) against such fs, it may be
> > the problem of the vendor.
> 
> I think we can definitely say this is a vendor problem. But the
> question still is whether the file system has a role in at least
> disqualifying hardware when it knows it's acting up before the file
> system is thoroughly damaged?

How does a filesystem know the device is acting up without letting the
device damage the filesystem first?  i.e. how do you do this without
maintaining a firmware revision blacklist?  Some sort of extended
self-test during mkfs?  Or something an admin can run online, like a
balance or scrub?  That would not catch the WD Black firmware revisions
that need a bad sector to make the bad behavior appear.
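
If someone did maintain such a blacklist, the userspace version is tiny:
a udev rule keyed on the model/firmware strings that runs hdparm.  A sketch
using the firmware version from my logs--the exact udev property names are
worth double-checking on a real system:

    # /etc/udev/rules.d/90-bad-firmware-no-wcache.rules
    ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ENV{ID_MODEL}=="WDC_WD20EZRX-00DC0B0", ENV{ID_REVISION}=="80.00A80", RUN+="/sbin/hdparm -W0 /dev/%k"

    # then reload the rules:
    udevadm control --reload-rules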

> I also wonder how ext4 and XFS will behave. In some ways they might
> tolerate the problem without noticing it for longer, where instead of
> kernel space recognizing it, it's actually user space / application
> layer that gets confused first, if it's bogus data that's being
> returned. Filesystem metadata is a relatively small target for such
> corruption when the file system mostly does overwrites.

The worst case on those filesystems is less bad than btrfs (for the
filesystem--the user data is trashed in ways that are not reported and
might be difficult to detect).

btrfs checks everything--metadata and user data--and stops when
unrecoverable failure is detected, so the logical result is that btrfs
stops on firmware bugs.  That's a design feature or horrible flaw,
depending on what the user's goals are.

ext4 optimizes for availability and performance (simplicity ended with
ext3) and intentionally ignores some possible failure modes (ext4 makes no
attempt to verify user data integrity at all, and even metadata checksums
are optional).  XFS protects itself similarly, but not user data.
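
For reference, ext4's metadata checksums have to be asked for (or come
from a recent enough mke2fs default)--a sketch, with the device name as a
placeholder; user data still gets no checksum either way:

    mkfs.ext4 -O metadata_csum /dev/sdX
    tune2fs -O metadata_csum /dev/sdX         # or enable it on an existing, unmounted fs
    dumpe2fs -h /dev/sdX | grep -i features   # check that metadata_csum is listed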

> I also wonder how ZFS handles this. Both in the single device case,
> and in the RAIDZ case.
> 
> 
> -- 
> Chris Murphy

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
  2019-06-24 17:31   ` Chris Murphy
  2019-06-26  2:30     ` Zygo Blaxell
@ 2019-07-02 13:32     ` Andrea Gelmini
  1 sibling, 0 replies; 10+ messages in thread
From: Andrea Gelmini @ 2019-07-02 13:32 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Zygo Blaxell, Btrfs BTRFS

On Mon, Jun 24, 2019 at 11:31:35AM -0600, Chris Murphy wrote:
> Right. The questions I have: should Btrfs (or any file system) be able
> to detect such devices and still protect the data? i.e. for the file

I have more than 600 industrial machines all around the world.
After a few fs corruptions (ext4) I found the culprit: the SSDs
(chosen by the provider) cheating about flush/sync.

Well, forcing data=journal at mount fixed the problem. Same SSDs, for
years now, no more problems at all.
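
The change is just the mount option--roughly, with device and mount point
as placeholders:

    # /etc/fstab
    /dev/sdX1  /data  ext4  defaults,data=journal  0  2

    # after the next mount, verify it took effect:
    findmnt -no OPTIONS /data | grep -o data=journal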

Personally I don't really care about performance. Resilience first,
then options to fix things even if the hardware is in the middle of
nowhere, without needing to go on site.

Thanks a lot for your work,
Andrea


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-07-02 13:32 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-23 20:45 btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible) Zygo Blaxell
2019-06-24  0:46 ` Qu Wenruo
2019-06-24  4:29   ` Zygo Blaxell
2019-06-24  5:39     ` Qu Wenruo
2019-06-24 17:31   ` Chris Murphy
2019-06-26  2:30     ` Zygo Blaxell
2019-07-02 13:32     ` Andrea Gelmini
2019-06-24  2:45 ` Remi Gauvin
2019-06-24  4:37   ` Zygo Blaxell
2019-06-24  5:27     ` Zygo Blaxell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).