From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
Date: Mon, 24 Jun 2019 13:39:39 +0800	[thread overview]
Message-ID: <c6a0f62c-6bcc-a291-02cd-734da4a5f951@gmx.com> (raw)
In-Reply-To: <20190624042926.GA11820@hungrycats.org>


On 2019/6/24 12:29 PM, Zygo Blaxell wrote:
[...]
> 
>> Btrfs relies more heavily on the hardware implementing barrier/flush
>> properly, or CoW can easily be ruined.
>> If the firmware is only tested (if tested at all) against such
>> filesystems, it may be the vendor's problem.
> [...]
>>> WD Green and Black are low-cost consumer hard drives under $250.
>>> One drive of each size in both product ranges comes to a total price
>>> of around $1200 on Amazon.  Lots of end users will have these drives,
>>> and some of them will want to use btrfs, but some of the drives apparently
>>> do not have working write caching.  We should at least know which ones
>>> those are, maybe make a kernel blacklist to disable the write caching
>>> feature on some firmware versions by default.
>>
>> To me, the problem isn't for someone to test these drives, but how
>> convincing the test methodology is and how accessible the test devices
>> would be.
>>
>> Your statistics carry a lot of weight, but it took you years and tons of
>> disks to gather them; that is not something which can be reproduced easily.
>>
>> On the other hand, if we're going to reproduce power failures quickly and
>> reliably in a lab environment, then how?
>> A software-based SATA power cutoff? Or a hardware-controllable SATA power
>> cable?
> 
> You might be overthinking this a bit.  Software-controlled switched
> PDUs (or if you're a DIY enthusiast, some PowerSwitch Tails and a
> Raspberry Pi) can turn the AC power on and off on a test box.  Get a
> cheap desktop machine, put as many different drives into it as it can
> hold, start writing test patterns, kill mains power to the whole thing,
> power it back up, analyze the data that is now present on disk, log the
> result over the network, repeat.  This is the most accurate simulation,
> since it replicates all the things that happen during a typical end-user's
> power failure, only much more often.

To me, this is not as good a methodology as it may look.
It simulates the most common real-world power-loss case, but I'd say it's
less reliable at pinning down the incorrect behavior.
(And extra time is wasted on POST, booting into the OS, and so on.)

My idea is to have an SBC-based controller switching the power cable of
the disk, and another system (or the same SBC, if it supports SATA)
running a regular workload, with dm-log-writes recording every write
operation.
Then kill the power to the disk.

Then compare the on-disk data against the dm-log-writes log to see how
they differ.
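
For reference, the recording side could look roughly like the sketch
below. The device paths are only placeholders and the script is just an
illustration of the idea, not something I have actually run:

  # Duplicate every write to the disk under test into a log device via the
  # dm-log-writes target, so the exact write/FLUSH/FUA sequence can be
  # replayed and inspected later.  Device paths are placeholders.
  import subprocess

  DATA_DEV = "/dev/sdb"    # disk under test (placeholder)
  LOG_DEV = "/dev/sdc1"    # device that stores the write log (placeholder)

  def sectors(dev):
      # Device size in 512-byte sectors, as dmsetup expects.
      out = subprocess.run(["blockdev", "--getsz", dev],
                           capture_output=True, text=True, check=True)
      return int(out.stdout.strip())

  # Table format: "<start> <length> log-writes <data_dev> <log_dev>"
  table = "0 %d log-writes %s %s" % (sectors(DATA_DEV), DATA_DEV, LOG_DEV)
  subprocess.run(["dmsetup", "create", "logwrites-test", "--table", table],
                 check=True)

  # ... mkfs.btrfs and the regular workload then run on
  # /dev/mapper/logwrites-test ...

  # Drop a named mark into the log right before cutting the disk power, so
  # the replay side knows where the cut happened.
  subprocess.run(["dmsetup", "message", "logwrites-test", "0",
                  "mark", "power-cut"], check=True)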

From the viewpoint of an end user this is definitely overkill, but at
least to me it could prove how bad the firmware is, leaving no excuse for
the vendor to dodge the bullet, and maybe even doing them a favor by
pinning down the sequence that leads to the corruption.

Although there are a lot of untested things which could go wrong:
- How does the kernel handle an unresponsive disk?
- Will dm-log-writes record and handle errors correctly?
- Is there anything special the SATA controller will do?

But at least this is going to be a very interesting project.
I already have a rockpro64 SBC with a SATA PCIe card; I just need to craft
a GPIO-controlled switch to kill the SATA power.
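
The power-cut side could be as simple as one GPIO line driving a relay or
MOSFET in the drive's power rail. A minimal sketch, assuming sysfs GPIO
and an arbitrary pin number (the real one depends on the board and the
wiring):

  # Toggle a GPIO pin through sysfs to cut the disk's power mid-write.
  # Pin 60 is only an example; the relay on the SATA power rail is assumed.
  import time

  GPIO = "60"
  BASE = "/sys/class/gpio"

  def write(path, value):
      with open(path, "w") as f:
          f.write(value)

  # Export the pin and configure it as an output driving the relay.
  write(BASE + "/export", GPIO)
  write(BASE + "/gpio" + GPIO + "/direction", "out")

  write(BASE + "/gpio" + GPIO + "/value", "1")   # disk powered
  time.sleep(300)                                # let the workload run a while
  write(BASE + "/gpio" + GPIO + "/value", "0")   # kill SATA power abruptly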

>  Hopefully all the hardware involved
> is designed to handle this situation already.  A standard office PC is
> theoretically designed for 1000 cycles (200 working days over 5 years)
> and should be able to test 60 drives (6 SATA ports, 10 sets of drives
> tested 100 cycles each).  The hardware is all standard equipment in any
> IT department.
> 
> You only need special-purpose hardware if the general-purpose stuff
> is failing in ways that aren't interesting (e.g. host RAM is corrupted
> during writes so the drive writes garbage, or the power supply breaks
> before 1000 cycles).  Some people build elaborate hard disk torture
> rigs that mess with input voltages, control temperature and vibration,
> etc. to try to replicate the effects of aging, but these setups
> aren't representative of typical end-user environments and the results
> will only be interesting to hardware makers.
> 
> We expect most drives to work and it seems that they do most of the
> time--it is the drives that fail most frequently that are interesting.
> The drives that fail most frequently are also the easiest to identify
> in testing--by definition, they will reproduce failures faster than
> the others.
> 
> Even if there is an intermittent firmware bug that only appears under
> rare conditions, if it happens with lower probability than drive hardware
> failure then it's not particularly important.  The target hardware failure
> rate for hard drives is 0.1% over the warranty period according to the
> specs for many models.  If one drive's hardware is going to fail
> with p < 0.001, then maybe the firmware bug makes it lose data at p =
> 0.00075 instead of p = 0.00050.  Users won't care about this--they'll
> use RAID to contain the damage, or just accept the failure risks of a
> single-disk system.  Filesystem failures that occur after the drive has
> degraded to the point of being unusable are not interesting at all.
> 
>> And how do we make sure it's flush/fua that isn't implemented properly?
> 
> Is it necessary?  The drive could write garbage on the disk, or write
> correct data to the wrong physical location, when the voltage drops at
> the wrong time.  The drive electronics/firmware are supposed to implement
> measures to prevent that, and who knows whether they try, and whether
> they are successful?  The data corruption that results from the above
> events is technically not a flush/fua failure, since it's not a write
> reordering or a premature command completion notification to the host,
> but it's still data corruption on power failure.
> 
> Drives can fail in multiple ways, and it's hard (even for hard disk
> engineering teams) to really know what is going on while the power supply
> goes out of spec.  To an end user, it doesn't matter why the drive fails,
> only that it does fail.  Once you have *enough* drives, some of them
> are always failing, and it just becomes a question of balancing the
> different risks and mitigation costs (i.e. pick a drive that doesn't
> fail so much, and a filesystem that tolerates the failure modes that
> happen to average or better drives, and maybe use RAID1 with a mix of
> drive vendors to avoid having both mirrors hit by a common firmware bug).
> 
> To make sure btrfs is using flush/fua correctly, log the sequence of block
> writes and fua/flush commands, then replay that sequence one operation
> at a time, and make sure the filesystem correctly recovers after each
> operation.  That doesn't need or even want hardware, though--it's better
> work for a VM that can operate on block-level snapshots of the filesystem.

That's already what we're doing with dm-log-writes, and it has failed to
expose major problems.
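
(For the record, the replay/verification side looks roughly like the
sketch below; replay-log comes from the log-writes tools, the options are
from memory, and the device paths are placeholders, so treat it only as an
approximation. In practice it is wrapped in a loop that replays a few
entries at a time and runs the checks after each step.)

  # Replay the logged writes up to the power-cut mark onto a scratch
  # device, then check whether btrfs on the result is consistent.
  import subprocess

  LOG_DEV = "/dev/sdc1"    # log recorded by dm-log-writes (placeholder)
  REPLAY_DEV = "/dev/sdd"  # scratch device to replay onto (placeholder)

  subprocess.run(["replay-log", "--log", LOG_DEV,
                  "--replay", REPLAY_DEV,
                  "--end-mark", "power-cut"], check=True)

  subprocess.run(["btrfs", "check", "--readonly", REPLAY_DEV], check=True)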

All the fsync-related bugs, like the ones Filipe is always fixing, can't
be easily exposed by a random workload even with dm-log-writes.
Most of these bugs need a special corner case to hit, and IIRC so far no
transid problem has been exposed this way.

But anyway, thanks for the info; we see some hope of pinning down the
problem.

Thanks,
Qu


