On Sat, Nov 09, 2019 at 11:00:42AM +0100, Richard Weinberger wrote:
> ----- Original Message -----
> > From: "Zygo Blaxell"
> > The most striking thing about the description of your setup is that you
> > have ECC RAM and you have a scrub regime to detect errors...but you have
> > both a huge gap in error detection coverage and a mechanism to propagate
> > errors across what is supposed to be a fault isolation boundary because
> > you're using mdadm raid1 instead of btrfs raid1.  If one of your disks
> > goes bad, not only will it break your filesystem, but you won't know
> > which disk you need to replace.
>
> I didn't claim that my setup is perfect. What strikes me a little is that
> the only possible explanation from your side are super corner cases like
> silent data corruption within an enterprise disk, followed by silent failure
> of my RAID1, etc..

These are not super corner cases.  This is the point you keep missing.

The first event (silent data corruption) is not unusual, and the
integrity mechanisms you are relying on to detect failures (mdadm
checkarray and SMART) detect this failure mode poorly or not at all.
Your setup goes directly from "double the risk of silent data
corruption by using mdadm-RAID1" to "detect data corruption after the
fact by using btrfs" with *nothing* in between.

Your setup is almost an incubator for reproducing this kind of issue.
We use a similar arrangement of systems and drives in the test lab
when we want to quickly reproduce data corruption problems with real
drive firmware (as opposed to just corrupting the disks ourselves).
The main difference is that we don't use md-raid1 for this, because
identifying which sectors were corrupted and which were correct under
md-raid1 is a slow, unreliable, often manual process.  btrfs-raid1
spits out the physical sector address and device in the kernel log, so
all we have to do is read the blocks on both drives to confirm the
good and bad contents.

Silent data corruption in hard drives happens once or twice a year
somewhere in the fleet (among all the other, more visible failure
modes like UNC sectors and total disk failures).  We've identified 4
models of WD hard drive that are capable of returning mangled data on
reads, half of those without reporting a read error or any SMART
indication.  We also have Toshiba, Hitachi, and Seagate hard drives,
but so far no silent corruption failures from those vendors.  I'm sure
they can have data corruption too, but after a decade of testing the
score so far is WD 4, everyone else 0.

I don't know what you're using for "RAID health check", but if it's
mdadm's checkarray script, note that checkarray does not report
corruption errors.  It does kick off the md mismatch check, and
mismatches are counted by the kernel in /sys/block/*/md/mismatch_cnt,
but they are *not* reported in an email alert, the kernel log,
/proc/mdstat, or mdadm -D output.  Mismatch counts are also not
preserved across reboots, array resyncs, or from one check to the
next: in all of these events the counter resets to zero, and mdadm's
one and only indicator of past corruption events is lost.

Unless you have been going out of your way to scrape mismatch_cnt
values out of /sys/block/*/md, you have ignored all corruption errors
mdadm might have detected so far.  You might find some if you run a
check now, though.
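If you want to start capturing that history, something like the rough
sketch below is enough (assuming Python is on the box and the arrays
use the standard /sys/block/md*/md sysfs layout; the log path and the
script itself are just examples, not anything mdadm ships):

#!/usr/bin/env python3
# Append each array's mismatch_cnt to a persistent log so the value
# survives the next check/resync/reboot (all of which reset the counter).
# Run from cron right after each scheduled checkarray run.
import glob
import time

def read_sysfs(path):
    with open(path) as f:
        return f.read().strip()

timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
for cnt_path in glob.glob("/sys/block/md*/md/mismatch_cnt"):
    array = cnt_path.split("/")[3]                      # e.g. "md0"
    mismatches = int(read_sysfs(cnt_path))
    action = read_sysfs(cnt_path.rsplit("/", 1)[0] + "/sync_action")
    # /var/log/md-mismatch.log is an example location, use whatever
    # persistent log you already collect.
    with open("/var/log/md-mismatch.log", "a") as log:
        log.write("%s %s sync_action=%s mismatch_cnt=%d\n"
                  % (timestamp, array, action, mismatches))
    if mismatches:
        print("WARNING: %s reported %d mismatched sectors" % (array, mismatches))

Kicking off a check by hand is the same thing checkarray does under
the hood: write "check" to /sys/block/mdX/md/sync_action as root, wait
for sync_action to go back to "idle", then read mismatch_cnt before
anything else resets it.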
On the other hand, if you *have* been logging mismatch_cnt since
before the first btrfs error was reported, and it's been consistently
zero, then something more interesting may be happening.

> I fully agree that such things *can* happen but it is not the most likely
> kind of failure.

Data corruption in the drive is the best fit for the symptoms
presented so far.  Such corruption events happen every few years on
production systems, usually when we introduce a new drive model to the
fleet and it turns out to have a much higher than normal failure rate.
Sometimes it happens to otherwise good drives as they fail (i.e. we
get corrupted data first, then a few weeks or months later the SMART
errors start, or the drive just stops).  The likelihood of it
happening in your setup is doubled: with mdadm-RAID1 there are two
drives, and a corrupted sector on either of them can be the copy
returned to btrfs.

The hypothetical bug in scrub you suggest elsewhere in this thread
doesn't happen in the field, and it seems to be difficult to implement
deliberately, much less accidentally.  Historically this has been some
of the most correct and robust btrfs code.  Failures here are not
likely at all.

> All devices are being checked by SMART. Sure, SMART could also be
> lying to me, but...

SMART doesn't lie when it doesn't report problems.  To be able to lie,
SMART would first have to know the truth.  A SMART pass means the
power supply is working, the drive firmware successfully booted, and
the firmware didn't record any recognized failure events in a log.
There is a world of difference between "didn't record any recognized
failure events" and "didn't have any failures": the gap includes cases
like "had failures but couldn't record them because of the failure"
and "had failures the firmware didn't recognize".  Enterprise drives
are not immune to these problems; if anything, they're more vulnerable
due to higher SoC and firmware complexity.

> Thanks,
> //richard