From: Wols Lists
Subject: RFC - Raid error detection and auto-recovery (was Fault tolerance with badblocks)
Date: Wed, 10 May 2017 14:26:12 +0100
To: linux-raid, Nix

This discussion seems to have become a bit heated, but I think we have
the following:

FACT: linux md raid can do error detection, but doesn't. Why not? It
seems people are worried about the performance hit.

FACT: linux md raid can do automatic error correction, but doesn't. Why
not? It seems people are more worried about the problems it could cause
than the problems it would fix.

OBSERVATION: The kernel guys seem to get fixated on kernel performance
and miss the bigger picture. At the end of the day, the most important
thing on the computer is the USER'S DATA. If we can't protect that,
users will throw the computer in the bin. Or replace linux with
Windows. Or something like that.

And when there's a problem, it all too often comes across as "the
kernel guys CAN fix it but WON'T". The ext2/3/4 transition is a case in
point. So is the current frustration where the kernel guys say "user
data is the application's problem" while the postgresql guys are asking
"how can we guarantee integrity when you won't give us the tools we
need to guarantee our data is safe?". This situation smacks of the same
arrogance, sorry: "We can save your data, but we won't."

FURTHER FACTUAL TIDBITS: The usual response seems to be to push the
problem somewhere else, for example "the user should keep backups". BUT
HOW? I've investigated!

Let's say I buy a spare drive for my backup. But I installed raid to
avoid being at the mercy of a single drive, and now I am again, because
my backup is a single drive! BIG FAIL.

Okay, I'll buy two drives and have a backup raid. But what if my backup
raid is reporting a mismatch count too? Now I have TWO copies whose
integrity I can't vouch for. Double the trouble. BIG FAIL.

Tape is cheap, you say? No bl***ding way!!! I've just done a quick
investigation, and for the price of a tape drive I could probably turn
my 2x3TB raid-1 into a 3x3TB raid-5, AND buy sufficient disks to
implement a raid-based grandfather/father/son backup procedure, and
STILL have some change left over. (I am using cheapie desktop drives,
but I could probably afford cheap NAS drives with that money.)

PROPOSAL: Enable integrity checking.

We need to create something like /sys/md/array/verify_data_on_read. If
that's set to true and we can check integrity (ie not raid-0), then
rather than reading just the data disks, we read the entire stripe,
check the mirror or parity, and then decide what to do. If we can
return error-corrected data, obviously we do. I think we should return
an error if we can't, no?

We can't set this by default - the *potential* performance hit is too
great. But now the sysadmin can choose between performance and
integrity, rather than the present state where he has no choice. And in
reality, I don't think a system like mine would even notice! Low
read/write activity, and masses of spare ram - chances are most of my
disk activity is cached and doesn't go anywhere near the raid code.

The kernel code size impact is minimal, I suspect. All the code
required is probably there, it just needs a little "re-purposing".
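To make the parity case concrete, here's a rough userspace sketch of
the check the read path would have to do for raid-5: XOR the data
chunks together and compare the result against the parity chunk. (This
is illustration only, not the md code - the function name and layout
are made up.)

#include <stdint.h>
#include <stdio.h>

/* Does this raid-5 stripe verify? The parity chunk must equal the
 * XOR of all the data chunks. Returns 0 if consistent, -1 if not. */
static int stripe_verifies(const uint8_t *data[], int ndata,
                           const uint8_t *parity, size_t chunk_size)
{
	for (size_t off = 0; off < chunk_size; off++) {
		uint8_t x = 0;
		for (int i = 0; i < ndata; i++)
			x ^= data[i][off];
		if (x != parity[off])
			return -1;	/* mismatch - stripe is suspect */
	}
	return 0;
}

int main(void)
{
	uint8_t d0[8] = "AAAAAAA", d1[8] = "BBBBBBB", p[8];
	const uint8_t *data[] = { d0, d1 };

	for (size_t i = 0; i < sizeof(p); i++)
		p[i] = d0[i] ^ d1[i];	/* build good parity */

	printf("clean stripe: %d\n", stripe_verifies(data, 2, p, 8));
	d1[3] ^= 0x40;			/* simulate a corrupt sector */
	printf("after corruption: %d\n", stripe_verifies(data, 2, p, 8));
	return 0;
}

For raid-1 the equivalent check is even simpler - just a memcmp() of
the mirrored copies.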
PROPOSAL: Enable automatic correction.

Likewise, create /sys/md/array/correct_data_on_read. This won't work if
verify_data_on_read is not set, and likewise it will not be set by
default. IFF we can reconstruct the data - from a 3-or-more raid-1
mirror, or a raid-6 - it will rewrite the corrected stripe.

RATIONALE: NEVER THROW AWAY USER DATA IF YOU CAN RECONSTRUCT IT!!!

This gives control to the sysadmin. At the end of the day it should be
*his* call, not the devs', whether verify-on-read is worth the
performance hit. (Successful reconstructions should be logged...)

Likewise, while correct_data_on_read could mess up the array if the
error isn't actually on the drive, that should be the sysadmin's call,
not the devs'. And because we only rewrite if we think we have
successfully recreated the data, the chances of it messing up are
actually quite small. Because verify_data_on_read is set, that
addresses Neil's concern about changing the data underneath an app -
the app has been given the corrected data, so we write the corrected
data back to disk.

NOTES: From Peter Anvin's paper it seems that the chance of wrongly
identifying a single-disk error is low. And it's even lower if we look
for the clues he mentions. Because we only correct those errors we are
sure we've correctly identified, other sources of corruption shouldn't
get fed back to the disk.

This makes an error-correcting scrub easy :-) Run as an overnight
script...

echo 1 > /sys/md/root/verify_data_on_read
echo 1 > /sys/md/root/correct_data_on_read
tar -c / > /dev/null
echo 0 > /sys/md/root/correct_data_on_read
echo 0 > /sys/md/root/verify_data_on_read

Coders and code welcome... :-)

Cheers,
Wol
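PS: and here's a rough userspace illustration (again, made-up names,
not the md code) of why the 3-or-more raid-1 case can correct rather
than merely detect: with three mirrors, two copies that agree outvote
the third, so we know *which* copy is bad before we rewrite anything.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Three-way raid-1 vote. Returns the index of the bad copy (to be
 * rewritten from a good one), -1 if all copies agree, or -2 if all
 * three differ and we must NOT guess. */
static int mirror_vote(const uint8_t *c[3], size_t len)
{
	int eq01 = !memcmp(c[0], c[1], len);
	int eq02 = !memcmp(c[0], c[2], len);
	int eq12 = !memcmp(c[1], c[2], len);

	if (eq01 && eq02)
		return -1;	/* all consistent, nothing to do */
	if (eq01)
		return 2;	/* copies 0 and 1 outvote copy 2 */
	if (eq02)
		return 1;
	if (eq12)
		return 0;
	return -2;		/* no majority: return a read error */
}

int main(void)
{
	uint8_t a[4] = "abc", b[4] = "abc", x[4] = "abX";
	const uint8_t *mirrors[3] = { a, b, x };

	printf("bad copy: %d\n", mirror_vote(mirrors, 4)); /* prints 2 */
	return 0;
}

The -2 case is exactly why a plain two-disk raid-1 or a raid-5 can only
detect a mismatch, not correct it: there's no majority to vote with,
which is why correct_data_on_read only rewrites on 3-plus-way mirrors
and raid-6.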