On 2019/6/24 4:45 AM, Zygo Blaxell wrote:
> On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote:
>> On 2019/6/20 7:45 AM, Zygo Blaxell wrote:
>>> On Sun, Jun 16, 2019 at 12:05:21AM +0200, Claudius Winkel wrote:
>>>> What should I do now ... to use btrfs safely? Should i not use it with
>>>> DM-crypt
>>>
>>> You might need to disable write caching on your drives, i.e. hdparm -W0.
>>
>> This is quite troublesome.
>>
>> Disabling write cache normally means performance impact.
>
> The drives I've found that need write cache disabled aren't particularly
> fast to begin with, so disabling write cache doesn't harm their
> performance very much. All the speed gains of write caching are lost
> when someone has to spend time doing a forced restore from backup after
> transid-verify failure. If you really do need performance, there are
> drives with working firmware available that don't cost much more.
>
>> And disabling it normally would hide the true cause (if it's something
>> btrfs' fault).
>
> This is true; however, even if a hypothetical btrfs bug existed,
> disabling write caching is an immediately deployable workaround, and
> there's currently no other solution other than avoiding drives with
> bad firmware.
>
> There could be improvements possible for btrfs to work around bad
> firmware...if someone's willing to donate their sanity to get inside
> the heads of firmware bugs, and can find a way to fix it that doesn't
> make things worse for everyone with working firmware.
>
>>> I have a few drives in my collection that don't have working write cache.
>>> They are usually fine, but when otherwise minor failure events occur (e.g.
>>> bad cables, bad power supply, failing UNC sectors) then the write cache
>>> doesn't behave correctly, and any filesystem or database on the drive
>>> gets trashed.
>>
>> Normally this shouldn't be the case, as long as the fs has correct
>> journal and flush/barrier.
>
> If you are asking the question:
>
> "Are there some currently shipping retail hard drives that are
> orders of magnitude more likely to corrupt data after simple
> power failures than other drives?"
>
> then the answer is:
>
> "Hell, yes! How could there NOT be?"
>
> It wouldn't take very much capital investment or time to find this out
> in lab conditions. Just kill power every 25 minutes while running a
> btrfs stress-test should do it--or have a UPS hardware failure in ops,
> the effect is the same. Bad drives will show up in a few hours, good
> drives take much longer--long enough that, statistically, the good drives
> will probably fail outright before btrfs gets corrupted.

Now it sounds like we really need a good way to do such tests: something
more elegant and controlled than just relying on random power failures.

>
>> If it's really the hardware to blame, then it means its flush/fua is not
>> implemented properly at all, thus the possibility of a single power loss
>> leading to corruption should be VERY VERY high.
>
> That exactly matches my observations. Only a few disks fail at all,
> but the ones that do fail do so very often: 60% of corruptions at
> 10 power failures or less, 100% at 30 power failures or more.
>
>>> This isn't normal behavior, but the problem does affect
>>> the default configuration of some popular mid-range drive models from
>>> top-3 hard disk vendors, so it's quite common.
>>
>> Would you like to share the info and test methodology to determine it's
>> the device to blame? (maybe in another thread)
> It's basic data mining on operations failure event logs.
>
> We track events like filesystem corruption, data loss, other hardware
> failure, operator errors, power failures, system crashes, dmesg error
> messages, etc., and count how many times each failure occurs in systems
> with which hardware components. When a failure occurs, we break the
> affected system apart and place its components into other systems or
> test machines to isolate which component is causing the failure (e.g. a
> failing power supply could create RAM corruption events and disk failure
> events, so we move the hardware around to see where the failure goes).
> If the same component is involved in repeatable failure events, the
> correlation jumps out of the data and we know that component is bad.
> We can also do correlations by attributes of the components, i.e. vendor,
> model, size, firmware revision, manufacturing date, and correlate
> vendor-model-size-firmware to btrfs transid verify failures across
> a fleet of different systems.
>
> I can go to the data and get a list of all the drive model and firmware
> revisions that have been installed in machines with 0 "parent transid
> verify failed" events since 2014, and are still online today:
>
> Device Model: CT240BX500SSD1 Firmware Version: M6CR013
> Device Model: Crucial_CT1050MX300SSD1 Firmware Version: M0CR060
> Device Model: HP SSD S700 Pro 256GB Firmware Version: Q0824G
> Device Model: INTEL SSDSC2KW256G8 Firmware Version: LHF002C
> Device Model: KINGSTON SA400S37240G Firmware Version: R0105A
> Device Model: ST12000VN0007-2GS116 Firmware Version: SC60
> Device Model: ST5000VN0001-1SF17X Firmware Version: AN02
> Device Model: ST8000VN0002-1Z8112 Firmware Version: SC61
> Device Model: TOSHIBA-TR200 Firmware Version: SBFA12.2
> Device Model: WDC WD121KRYZ-01W0RB0 Firmware Version: 01.01H01
> Device Model: WDC WDS250G2B0A-00SM50 Firmware Version: X61190WD
> Model Family: SandForce Driven SSDs Device Model: KINGSTON SV300S37A240G Firmware Version: 608ABBF0
> Model Family: Seagate IronWolf Device Model: ST10000VN0004-1ZD101 Firmware Version: SC60
> Model Family: Seagate NAS HDD Device Model: ST4000VN000-1H4168 Firmware Version: SC44
> Model Family: Seagate NAS HDD Device Model: ST8000VN0002-1Z8112 Firmware Version: SC60
> Model Family: Toshiba 2.5" HDD MK..59GSXP (AF) Device Model: TOSHIBA MK3259GSXP Firmware Version: GN003J
> Model Family: Western Digital Gold Device Model: WDC WD101KRYZ-01JPDB0 Firmware Version: 01.01H01
> Model Family: Western Digital Green Device Model: WDC WD10EZRX-00L4HB0 Firmware Version: 01.01A01
> Model Family: Western Digital Re Device Model: WDC WD2000FYYZ-01UL1B1 Firmware Version: 01.01K02
> Model Family: Western Digital Red Device Model: WDC WD50EFRX-68MYMN1 Firmware Version: 82.00A82
> Model Family: Western Digital Red Device Model: WDC WD80EFZX-68UW8N0 Firmware Version: 83.H0A83
> Model Family: Western Digital Red Pro Device Model: WDC WD6002FFWX-68TZ4N0 Firmware Version: 83.H0A83

At least there are a lot of GOOD disks, what a relief.

>
> So far so good. The above list of drive model-vendor-firmware have
> collectively had hundreds of drive-power-failure events in the last 5
> years, so we have been giving the firmware a fair workout [1].
>
> Now let's look for some bad stuff.
> How about a list of drives that were
> involved in parent transid verify failure events occurring within 1-10
> power cycles after mkfs events:
>
> Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 Firmware Version: 80.00A80
>
> Change the query to 1-30 power cycles, and we get another model with
> the same firmware version string:
>
> Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 Firmware Version: 80.00A80
>
> Removing the upper bound on power cycle count doesn't find any more.
>
> The drives running 80.00A80 are all in fairly similar condition: no errors
> in SMART, the drive was apparently healthy at the time of failure (no
> unusual speed variations, no unexpected drive resets, or any of the other
> things that happen to these drives as they age and fail, but that are
> not reported as official errors on the models without TLER). There are
> multiple transid-verify failures logged in multiple very different host
> systems (e.g. Intel 1U server in a data center, AMD desktop in an office,
> hardware ages a few years apart). This is a consistent and repeatable
> behavior that does not correlate to any other attribute.
>
> Now, if you've been reading this far, you might wonder why the previous
> two ranges were lower-bounded at 1 power cycle, and the reason is because
> I have another firmware in the data set with _zero_ power cycles between
> mkfs and failure:
>
> Model Family: Western Digital Caviar Black Device Model: WDC WD1002FAEX-00Z3A0 Firmware Version: 05.01D05
>
> These drives have 0 power fail events between mkfs and "parent transid
> verify failed" events, i.e. it's not necessary to have a power failure
> at all for these drives to unrecoverably corrupt btrfs. In all cases the
> failure occurs on the same days as "Current Pending Sector" and "Offline
> UNC sector" SMART events. The WD Black firmware seems to be OK with write
> cache enabled most of the time (there's years in the log data without any
> transid-verify failures), but the WD Black will drop its write cache when
> it sees a UNC sector, and btrfs notices the failure a few hours later.
>
> Recently I've been asking people on IRC who present btrfs filesystems
> with transid-verify failures (excluding those with obvious symptoms of
> host RAM failure). So far all the users who have participated in this
> totally unscientific survey have WD Green 2TB and WD Black hard drives
> with the same firmware revisions as above. The most recent report was
> this week. I guess there are lot of drives with these firmwares still
> in inventories out there.
>
> The data says there's at least 2 firmware versions in the wild which
> have 100% of the btrfs transid-verify failures. These are only 8%
> of the total fleet of disks in my data set, but they are punching far
> above their weight in terms of failure event count.
>
> I first observed these correlations back in 2016. We had a lot of WD
> Green and Black drives in service at the time--too many to replace or
> upgrade them all early--so I looked for a workaround to force the
> drives to behave properly. Since it looked like a write ordering issue,
> I disabled the write cache on drives with these firmware versions, and
> found that the transid-verify filesystem failures stopped immediately
> (they had been bi-weekly events with write cache enabled).

So the worst-case scenario really does happen in the real world: badly
implemented flush/fua in the firmware. Btrfs has no way to fix such a
low-level problem.
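
Until drives with such firmware age out of service, the workaround you
describe can at least be automated. A rough, untested sketch below: it
disables the volatile write cache at boot for any drive whose
model/firmware pair is on a local blacklist (the entries are just the
ones from your data above), and it assumes smartmontools and hdparm are
installed. Since the write cache setting is normally not persistent
across power cycles, it would have to run from a boot-time service or
rc.local.

  #!/bin/sh
  # Sketch: disable the volatile write cache on drives whose model/firmware
  # pair is on a local "known bad" list.  Entries taken from this thread;
  # extend as needed.
  BAD_FIRMWARE="
  WDC WD20EZRX-00DC0B0:80.00A80
  WDC WD40EFRX-68WT0N0:80.00A80
  WDC WD1002FAEX-00Z3A0:05.01D05
  "
  for dev in /dev/sd[a-z]; do
      [ -b "$dev" ] || continue
      model=$(smartctl -i "$dev" | sed -n 's/^Device Model: *//p')
      fw=$(smartctl -i "$dev" | sed -n 's/^Firmware Version: *//p')
      if echo "$BAD_FIRMWARE" | grep -qx "$model:$fw"; then
          echo "$dev ($model, fw $fw): disabling write cache"
          hdparm -W0 "$dev"
      fi
  done

A udev rule keyed on the same attributes would be nicer, but IIRC the
sysfs model/rev attributes are truncated, so the match strings would
need to be checked against udevadm info output first.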
BTW, have you seen any corruption when using the bad drives (with write
cache enabled) under a traditional journal-based fs like XFS/EXT4?
Btrfs relies more heavily on the hardware implementing barrier/flush
properly, or its CoW can easily be ruined. If the firmware is only
tested (if it is tested at all) against such filesystems, that would
point to a problem on the vendor's side.

>
> That was 3 years ago, and there are no new transid-verify failures
> logged since then. The drives are still online today with filesystems
> mkfsed in 2016.
>
> One bias to be aware of from this data set: it goes back further than 5
> years, and we use the data to optimize hardware costs including the cost
> of ops failures. You might notice there are no Seagate Barracudas[2] in
> the data, while there are the similar WD models. In an unbiased sample
> of hard drives, there are likely to be more bad firmware revisions than
> found in this data set. I found 2, and that's a lower bound on the real
> number out there.
>
>> Your idea on hardware's faulty FLUSH/FUA implementation could definitely
>> cause exactly the same problem, but the last time I asked similar
>> problem to fs-devel, there is no proof for such possibility.
>
> Well, correlation isn't proof, it's true; however, if a behavior looks
> like a firmware bug, and quacks like a firmware bug, and is otherwise
> indistinguishable from a firmware bug, then it's probably a firmware bug.
>
> I don't know if any of these problems are really device firmware bugs or
> Linux bugs, particularly in the WD Black case. That's a question for
> someone who can collect some of these devices and do deeper analysis.
>
> In particular, my data is not sufficient to rule out either of these two
> theories for the WD Black:
>
> 1. Linux doesn't use FLUSH/FUA correctly when there are IO errors
> / drive resets / other things that happen around the times that
> drives have bad sectors, but it is OK as long as there are no
> cached writes that need to be flushed, or
>
> 2. It's just a bug in one particular drive firmware revision,
> Linux is doing the right thing with FLUSH/FUA and the firmware
> is not.
>
> For the bad WD Green/Red firmware it's much simpler: those firmware
> revisions fail while the drive is not showing any symptoms of defects.
> AFAIK there's nothing happening on these drives for Linux code to get
> confused about that doesn't also happen on every other drive firmware.
>
> Maybe it's a firmware bug WD already fixed back in 2014, and it just
> takes a decade for all the old drives to work their way through the
> supply chain and service lifetime.
>
>> The problem is always a ghost to chase, extra info would greatly help us
>> to pin it down.
>
> This lack of information is a bit frustrating. It's not particularly
> hard or expensive to collect this data, but I've had to collect it
> myself because I don't know of any reliable source I could buy it from.
>
> I found two bad firmwares by accident when I wasn't looking for bad
> firmware. If I'd known where to look, I could have found them much
> faster: I had the necessary failure event observations within a few
> months after starting the first btrfs pilot projects, but I wasn't
> expecting to find firmware bugs, so I didn't recognize them until there
> were double-digit failure counts.
>
> WD Green and Black are low-cost consumer hard drives under $250.
> One drive of each size in both product ranges comes to a total price
> of around $1200 on Amazon.
> Lots of end users will have these drives,
> and some of them will want to use btrfs, but some of the drives apparently
> do not have working write caching. We should at least know which ones
> those are, maybe make a kernel blacklist to disable the write caching
> feature on some firmware versions by default.

To me, the problem isn't finding someone to test these drives, but how
convincing the test methodology is and how accessible the test hardware
would be. Your statistics carry a lot of weight, but it took you years
and tons of disks to expose the problem; that's not something that can
be reproduced easily.

On the other hand, if we want to reproduce power failures quickly and
reliably in a lab environment, then how? A software-triggered SATA power
cutoff? A hardware-controllable SATA power cable? And how do we make
sure it is really flush/fua that is not implemented properly? (Two rough
sketches of what I have in mind are further down in this mail.)

It may take us quite some time to start a similar project (maybe it
even needs some extra hardware development). But indeed, a project to
do 3rd-party SATA hard disk testing looks like a very interesting
candidate for my hackweek project next year.

Thanks,
Qu

>
> A modestly funded deliberate search project could build a map of firmware
> reliability in currently shipping retail hard drives from all three
> big vendors, and keep it updated as new firmware revisions come out.
> Sort of like Backblaze's hard drive reliability stats, except you don't
> need a thousand drives to test firmware--one or two will suffice most of
> the time [3]. The data can probably be scraped from end user reports
> (if you have enough of them to filter out noise) and existing ops logs
> (better, if their methodology is sound) too.
>
>
>
>> Thanks,
>> Qu
>
> [1] Pedants will notice that some of these drive firmwares range in age
> from 6 months to 7 years, and neither of those numbers is 5 years, and
> the power failure rate is implausibly high for a data center environment.
> Some of the devices live in offices and laptops, and the power failures
> are not evenly distributed across the fleet. It's entirely possible that
> some newer device in the 0-failures list will fail horribly next week.
> Most of the NAS and DC devices and all the SSDs have not had any UNC
> sector events in the fleet yet, and they could still turn out to be
> ticking time bombs like the WD Black once they start to grow sector
> defects. The data does _not_ say that all of those 0-failure firmwares
> are bug free under identical conditions--it says that, in a race to
> be the first ever firmware to demonstrate bad behavior, the firmwares
> in the 0-failures list haven't left the starting line yet, while the 2
> firmwares in the multi-failures list both seem to be trying to _win_.
>
> [2] We had a few surviving Seagate Barracudas in 2016, but over 85% of
> those built before 2015 had failed by 2016, and none of the survivors
> are still online today. In practical terms, it doesn't matter if a
> pre-2015 Barracuda has correct power-failing write-cache behavior when
> the drive hardware typically dies more often than the host's office has
> power interruptions.
>
> [3] OK, maybe it IS hard to find WD Black drives to test at the _exact_
> moment they are remapping UNC sectors...tap one gently with a hammer,
> maybe, or poke a hole in the air filter to let a bit of dust in?
>
>>> After turning off write caching, btrfs can keep running on these problem
>>> drive models until they get too old and broken to spin up any more.
>>> With write caching turned on, these drive models will eat a btrfs every
>>> few months.
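
Coming back to the reproduction question above: the filesystem side of
it can already be exercised purely in software. Running a stress test
on top of dm-flakey with its drop_writes feature simulates a crash by
silently dropping every write issued after the cut-over point, with no
special hardware involved. If btrfs survives thousands of those
simulated crashes but a real drive with write cache enabled dies after
a handful of real power cuts, the firmware becomes the prime suspect.
A rough, untested sketch (/dev/sdX1 and /mnt are only placeholders, and
the workload is left out):

  DEV=/dev/sdX1
  SIZE=$(blockdev --getsz "$DEV")                  # size in 512-byte sectors
  GOOD="0 $SIZE flakey $DEV 0 180 0"               # pass-through table
  BAD="0 $SIZE flakey $DEV 0 0 180 1 drop_writes"  # silently drop all writes

  dmsetup create flaky --table "$GOOD"
  mkfs.btrfs -f /dev/mapper/flaky
  mount /dev/mapper/flaky /mnt
  # ... run a write-heavy workload against /mnt in the background ...

  # Simulated power cut: drop writes from this point on, then throw away
  # the dirty state by unmounting while the drops are active.
  dmsetup suspend --nolockfs flaky
  dmsetup load flaky --table "$BAD"
  dmsetup resume flaky
  umount /mnt

  # "Power back on": restore the pass-through table and see if btrfs survived.
  dmsetup suspend flaky
  dmsetup load flaky --table "$GOOD"
  dmsetup resume flaky
  mount /dev/mapper/flaky /mnt        # then scrub / compare the data

IIRC xfstests already wraps exactly this pattern in its dmflakey
helpers, so reusing that infrastructure would save reinventing it. Of
course this only proves what btrfs sends to a well-behaved device; it
says nothing about the drive itself.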
>>>
>>>
>>>> Or even use ZFS instead...
>>>>
>>>> On 11/06/2019 at 15:02, Qu Wenruo wrote:
>>>>>
>>>>> On 2019/6/11 6:53 PM, claudius@winca.de wrote:
>>>>>> HI Guys,
>>>>>>
>>>>>> you are my last try. I was so happy to use BTRFS but now i really hate
>>>>>> it....
>>>>>>
>>>>>>
>>>>>> Linux CIA 4.15.0-51-generic #55-Ubuntu SMP Wed May 15 14:27:21 UTC 2019
>>>>>> x86_64 x86_64 x86_64 GNU/Linux
>>>>>> btrfs-progs v4.15.1
>>>>> So old kernel and old progs.
>>>>>
>>>>>> btrfs fi show
>>>>>> Label: none  uuid: 9622fd5c-5f7a-4e72-8efa-3d56a462ba85
>>>>>>         Total devices 1 FS bytes used 4.58TiB
>>>>>>         devid    1 size 7.28TiB used 4.59TiB path /dev/mapper/volume1
>>>>>>
>>>>>>
>>>>>> dmesg
>>>>>>
>>>>>> [57501.267526] BTRFS info (device dm-5): trying to use backup root at
>>>>>> mount time
>>>>>> [57501.267528] BTRFS info (device dm-5): disk space caching is enabled
>>>>>> [57501.267529] BTRFS info (device dm-5): has skinny extents
>>>>>> [57507.511830] BTRFS error (device dm-5): parent transid verify failed
>>>>>> on 2069131051008 wanted 4240 found 5115
>>>>> Some metadata CoW is not recorded correctly.
>>>>>
>>>>> Hopes you didn't every try any btrfs check --repair|--init-* or anything
>>>>> other than --readonly.
>>>>> As there is a long exiting bug in btrfs-progs which could cause similar
>>>>> corruption.
>>>>>
>>>>>
>>>>>
>>>>>> [57507.518764] BTRFS error (device dm-5): parent transid verify failed
>>>>>> on 2069131051008 wanted 4240 found 5115
>>>>>> [57507.519265] BTRFS error (device dm-5): failed to read block groups: -5
>>>>>> [57507.605939] BTRFS error (device dm-5): open_ctree failed
>>>>>>
>>>>>>
>>>>>> btrfs check /dev/mapper/volume1
>>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
>>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
>>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
>>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115
>>>>>> Ignoring transid failure
>>>>>> extent buffer leak: start 2024985772032 len 16384
>>>>>> ERROR: cannot open file system
>>>>>>
>>>>>>
>>>>>> im not able to mount it anymore.
>>>>>>
>>>>>>
>>>>>> I found the drive in RO the other day and realized somthing was wrong
>>>>>> ... i did a reboot and now i cant mount anmyore
>>>>> Btrfs extent tree must has been corrupted at that time.
>>>>>
>>>>> Full recovery back to fully RW mountable fs doesn't look possible.
>>>>> As metadata CoW is completely screwed up in this case.
>>>>>
>>>>> Either you could use btrfs-restore to try to restore the data into
>>>>> another location.
>>>>>
>>>>> Or try my kernel branch:
>>>>> https://github.com/adam900710/linux/tree/rescue_options
>>>>>
>>>>> It's an older branch based on v5.1-rc4.
>>>>> But it has some extra new mount options.
>>>>> For your case, you need to compile the kernel, then mount it with "-o
>>>>> ro,rescue=skip_bg,rescue=no_log_replay".
>>>>>
>>>>> If it mounts (as RO), then do all your salvage.
>>>>> It should be a faster than btrfs-restore, and you can use all your
>>>>> regular tool to backup.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> any help
>>
>
>
>
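
P.S. For the drive side of the question (whether the firmware really
honours FLUSH/FUA), software alone can't prove anything; the power has
to be cut for real. The check itself can still be simple: keep
appending numbered records to a file on the drive under test, fsync
each one, and only then record the sequence number somewhere you
already trust; after the power cut, every acknowledged record must
still be there. A rough, untested sketch; all paths are placeholders,
and a fancier version would checksum each record:

  #!/bin/sh
  # Run "ack-test.sh write" on the machine with the drive under test, cut
  # power at a random moment, reboot, remount, then run "ack-test.sh verify".
  TEST=/mnt/test/seq.dat       # lives on the drive under test
  ACKED=/mnt/trusted/acked     # lives on a trusted device (or a remote host)
  case "$1" in
  write)
      i=0
      while :; do
          i=$((i + 1))
          printf 'record %010d\n' "$i" >> "$TEST"
          sync "$TEST" || exit 1         # fsync(2) the file (coreutils sync with a FILE arg)
          printf '%s\n' "$i" > "$ACKED"  # acknowledged only after the fsync
          sync "$ACKED"
      done
      ;;
  verify)
      last=$(cat "$ACKED")
      have=$(grep -c '^record [0-9]\{10\}$' "$TEST")
      echo "acknowledged: $last  found intact: $have"
      [ "$have" -ge "$last" ] && echo OK || echo "FAIL: acknowledged writes were lost"
      ;;
  esac

If I remember correctly this is more or less what the old diskchecker.pl
tool does over the network, so that could be another starting point.
Combined with a controllable power switch (even a cheap relay on the
drive's power connector) this could run unattended in a loop, which is
basically the hackweek project mentioned above.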