On Mon, Jun 24, 2019 at 11:31:35AM -0600, Chris Murphy wrote:
> On Sun, Jun 23, 2019 at 7:52 PM Qu Wenruo wrote:
> >
> >
> >
> > On 2019/6/24 4:45 AM, Zygo Blaxell wrote:
> > > I first observed these correlations back in 2016. We had a lot of WD
> > > Green and Black drives in service at the time--too many to replace or
> > > upgrade them all early--so I looked for a workaround to force the
> > > drives to behave properly. Since it looked like a write ordering issue,
> > > I disabled the write cache on drives with these firmware versions, and
> > > found that the transid-verify filesystem failures stopped immediately
> > > (they had been bi-weekly events with write cache enabled).
> >
> > So the worst scenario really happens in the real world: badly implemented
> > flush/fua from firmware.
> > Btrfs has no way to fix such a low-level problem.
>
> Right. The questions I have: should Btrfs (or any file system) be able
> to detect such devices and still protect the data? i.e. for the file
> system to somehow be more suspicious, without impacting performance,
> and go read-only sooner so that at least read-only mount can work?

Part of the point of UNC sector remapping, especially in consumer hard
drives, is that filesystems _don't_ notice it (health monitoring daemons
might notice SMART events, but it's intentionally transparent to
applications and filesystems). The alternative is that one bad sector
throws an error at an application that is not prepared to handle it, or
forces the filesystem RO, or triggers a full-device RAID data rebuild.

Of course that all goes sideways if the firmware loses its mind (and its
write cache) during UNC sector remapping.

> Or is this so much work for such a tiny edge case that it's not worth it?
>
> Arguably the hardware is some kind of zombie saboteur. It's not
> totally dead, it gives the impression that it's working most of the
> time, and then silently fails to do what we think it should in an
> extraordinary departure from specs and expectations.
> Are there other failure cases that could look like this and therefore
> worth handling?

In some ways firmware bugs are just another hardware failure. Hard disks
are free to have any sector unreadable at any time, or one day the entire
disk could just decide not to spin up any more, or non-ECC RAM in the
embedded controller board could flip some bits at random. These are all
standard failure modes that btrfs detects (and, with an intact mirror
available, automatically corrects).

Firmware bugs are different quantitatively: they turn
common-but-recoverable failure events into common-and-catastrophic
failure events. Most people expect catastrophic failures to be rare, but
manufacturing is hard, and sometimes they are not. Entire production runs
of hard drives can die early due to a manufacturing equipment
miscalibration or a poor choice of electrical component.

> As storage stacks get more complicated with ever more
> complex firmware, and firmware updates in the field, it might be
> useful to have at least one file system that can detect such problems
> sooner than others and go read-only to prevent further problems?

I thought we already had one: btrfs. Probably ZFS too.

The problem with parent transid verify failure is that it is detected
only after the filesystem is already damaged. It's too late to go RO
then; you need a time machine to get the data back.
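For what it's worth, "detection" today mostly means watching the kernel
log after the fact. A rough sketch of that kind of monitor (untested
Python, assumes a systemd journal; the alert action is just a placeholder
print):

#!/usr/bin/env python3
# Untested sketch: watch the kernel log for btrfs "parent transid verify
# failed" messages.  By the time this fires the metadata is already
# damaged, so it can only alert--it cannot prevent the damage.
import re
import subprocess

PATTERN = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)")

def watch():
    # "journalctl -kf -o cat" follows kernel messages on systemd systems;
    # substitute "dmesg --follow" or a syslog tail elsewhere.
    with subprocess.Popen(["journalctl", "-kf", "-o", "cat"],
                          stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            m = PATTERN.search(line)
            if m:
                bytenr, wanted, found = m.groups()
                # Placeholder action: print.  A real monitor would page
                # someone or force the filesystem read-only here.
                print(f"transid mismatch at {bytenr}: "
                      f"wanted {wanted}, found {found}")

if __name__ == "__main__":
    watch()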
We could maybe make some more pessimistic assumptions about how stable
new data is, so that we can recover from damage to new data far beyond
what flush/fua expectations permit. AFAIK the WD Green only fails during
a power failure, so btrfs could keep the last N filesystem transid trees
intact at all times, and during mount btrfs could verify the integrity of
the last transaction and roll back to an earlier transid if there was a
failure. This has been attempted before; it has various new ENOSPC
failure modes, and it requires modifications to some already very complex
btrfs code, but if we waved a magic wand and a complete, debugged
implementation of this appeared with reasonable memory and/or iops
overhead, it would work on the Green drives.

The WD Black is a different beast: some sequence of writes is lost when a
UNC sector is encountered, but the drive doesn't report the loss
immediately (if it did, btrfs would already go RO before the end of the
transaction, and the metadata tree would remain intact). The loss is only
detected some time afterward, during reads that might come thousands of
transids later, so a rollback-style workaround would have to keep far
more history and roll back much further.

Both of these approaches have a problem: when the workaround is used, the
filesystem rolls back to an earlier state, including user data. In some
cases that might not be a good thing, e.g. rolling back 1000 transids on
a mail store or OLTP database, or rolling back datacow files while _not_
rolling back nodatacow files.

btrfs already writes two complete copies of the metadata with dup
metadata, but firmware bugs can kill both copies. btrfs could hold the
last 256MB of metadata writes in RAM (or whatever amount of RAM is bigger
than the drive cache), and replay those writes or verify the metadata
trees whenever a bad sector is reported or the drive does a bus reset.
This would work if the write cache is dropped during a read, but if the
firmware silently drops the write cache while remapping a UNC sector then
btrfs will not be able to detect the event and would not know to replay
the write log. This kind of solution seems expensive, and maybe a little
silly, and might not even work against all possible drive firmware bugs
(what if the drive indefinitely postpones some writes, so 256MB isn't
enough RAM for the log?).

Also, a more meta observation: we don't know that this is what is really
happening in the firmware. Problems are clearly observed when multiple
events occur concurrently, but there are several possible mechanisms that
could lead to the behavior, and my data doesn't contain enough
information to determine which one is correct. So if a drive has a
firmware bug that just redirects a cached write to an entirely random
address on the disk (e.g. it corrupts or overruns an internal RAM
buffer), the symptoms will match the observed behavior, but none of these
workaround strategies will work. You'd need a RAID1 mirror on a different
disk to protect against arbitrary data loss anywhere in a single
drive--and btrfs already supports that, because arbitrary single-drive
data loss is a normal failure mode for all hard drives.

The cost of these workarounds has to be weighed against the impact (how
many drives are out there with these firmware bugs) and compared with the
cost of other solutions that already exist. A RAID1 across heterogeneous
drives solves this problem--unless you are unlucky and get two different
firmwares with the same bug. It may be that the best workaround is also
the simplest, and also works for all filesystems at once: turn the write
cache off for drives where it doesn't work.
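That's something that can be scripted today. A minimal sketch (untested
Python, assumes hdparm is installed; the blacklist entries are
placeholders, not real firmware revisions):

#!/usr/bin/env python3
# Untested sketch: disable the on-drive write cache for drives whose
# model/firmware pair is on a local blacklist.  The entries below are
# placeholders, not real firmware revisions.
import re
import subprocess
import sys

# Placeholder blacklist of (model substring, firmware substring) pairs.
BAD_FIRMWARE = [
    ("EXAMPLE-MODEL", "EXAMPLE-FW"),
]

def identify(dev):
    """Return (model, firmware) as reported by 'hdparm -I'."""
    out = subprocess.run(["hdparm", "-I", dev], capture_output=True,
                         text=True, check=True).stdout
    model = re.search(r"Model Number:\s+(.+)", out)
    fw = re.search(r"Firmware Revision:\s+(.+)", out)
    return (model.group(1).strip() if model else "",
            fw.group(1).strip() if fw else "")

def main(devices):
    for dev in devices:
        model, fw = identify(dev)
        if any(m in model and f in fw for m, f in BAD_FIRMWARE):
            # "hdparm -W 0" asks the drive to disable its volatile
            # write cache.
            subprocess.run(["hdparm", "-W", "0", dev], check=True)
            print(f"{dev}: write cache disabled ({model} / {fw})")
        else:
            print(f"{dev}: not on the blacklist ({model} / {fw})")

if __name__ == "__main__":
    main(sys.argv[1:])

The setting generally doesn't survive a power cycle, so it has to be
reapplied at boot (udev rule, hdparm.conf, or similar).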
CoW filesystems write in big contiguous sorted chunks, and that gets most
of the benefit of write reordering before the drive sees the data, so
there is less to lose if the drive cannot reorder. An overwriting
filesystem writes in smaller, scattered chunks with more seeking, and can
get more benefit from write caching in the drive.

> > BTW, do you have any corruption using the bad drives (with write cache)
> > with traditional journal-based fs like XFS/EXT4?
> >
> > Btrfs is relying more on the hardware to implement barrier/flush properly,
> > or CoW can be easily ruined.
> > If the firmware is only tested (if tested) against such fs, it may be
> > the problem of the vendor.
>
> I think we can definitely say this is a vendor problem. But the
> question still is whether the file system has a role in at least
> disqualifying hardware when it knows it's acting up before the file
> system is thoroughly damaged?

How does a filesystem know the device is acting up without letting the
device damage the filesystem first? i.e. how do you do this without
maintaining a firmware revision blacklist? Some sort of extended
self-test during mkfs? Or something an admin can run online, like a
balance or scrub? That would not catch the WD Black firmware revisions
that need a bad sector to make the bad behavior appear.

> I also wonder how ext4 and XFS will behave. In some ways they might
> tolerate the problem without noticing it for longer, where instead of
> kernel space recognizing it, it's actually user space / application
> layer that gets confused first, if it's bogus data that's being
> returned. Filesystem metadata is a relatively small target for such
> corruption when the file system mostly does overwrites.

The worst case on those filesystems is less bad for the filesystem than
it is on btrfs--but the user data is trashed in ways that are not
reported and might be difficult to detect. btrfs checks
everything--metadata and user data--and stops when unrecoverable failure
is detected, so the logical result is that btrfs stops on firmware bugs.
That's a design feature or a horrible flaw, depending on what the user's
goals are.

ext4 optimizes for availability and performance (simplicity ended with
ext3) and intentionally ignores some possible failure modes (ext4 makes
no attempt to verify user data integrity at all, and even metadata
checksums are optional). XFS protects itself similarly, but not user
data.

> I also wonder how ZFS handles this. Both in the single device case,
> and in the RAIDZ case.
>
>
> --
> Chris Murphy