Message-ID: <44CF9217.6040609@slaphack.com>
Date: Tue, 01 Aug 2006 12:40:39 -0500
From: David Masover
To: Alan Cox
CC: Adrian Ulrich, "Horst H. von Brand", bernd-schubert@gmx.de,
	reiserfs-list@namesys.com, jbglaw@lug-owl.de, clay.barnes@gmail.com,
	rudy@edsons.demon.nl, ipso@snappymail.ca, reiser@namesys.com,
	lkml@lpbproductions.com, jeff@garzik.org, tytso@mit.edu,
	linux-kernel@vger.kernel.org
Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
In-Reply-To: <1154452770.15540.65.camel@localhost.localdomain>
References: <200607312314.37863.bernd-schubert@gmx.de>
	<200608011428.k71ESIuv007094@laptop13.inf.utfsm.cl>
	<20060801165234.9448cb6f.reiser4@blinkenlights.ch>
	<1154446189.15540.43.camel@localhost.localdomain>
	<44CF84F0.8080303@slaphack.com>
	<1154452770.15540.65.camel@localhost.localdomain>

Alan Cox wrote:
> On Tue, 2006-08-01 at 11:44 -0500, David Masover wrote:
>> Yikes. Undetected.
>>
>> Wait, what? Disks, at least, would be protected by RAID. Are you
>> telling me RAID won't detect such an error?
>
> Yes.
>
> RAID deals with the case where a device fails. RAID 1 with 2 disks can
> in theory detect an internal inconsistency but cannot fix it.

Still, if it does that, that should be enough. The scary part wasn't
that there's an internal inconsistency, but that you wouldn't know.

And it can fix it, if you can figure out which disk went bad. Or give
it 3 disks and it should be entirely automatic -- admin gets paged,
admin hotswaps in a new disk, done.

>> we're OK with that, so long as our filesystems are robust enough. If
>> it's an _undetected_ error, doesn't that cause way more problems
>> (impossible problems) than FS corruption? Ok, your FS is fine -- but
>> now your bank database shows $1k less on random accounts -- is that ok?
>
> Not really, no. Your bank is probably using a machine (hopefully using
> a machine) with ECC memory, ECC cache and the like. The UDMA and SATA
> storage subsystems use CRC checksums between the controller and the
> device. SCSI uses various similar systems -- some older ones just use
> a parity bit, so they have only a 50/50 chance of noticing a bit error.
>
> Similarly, the media itself is recorded with a lot of FEC (forward
> error correction), so it will spot most changes.
>
> Unfortunately, when you throw this lot together with astronomical
> amounts of data, you get burned now and then, especially as most
> systems are not using ECC RAM, do not have ECC on the CPU registers,
> and may not even have ECC on the caches in the disks.

It seems like the hardware is the place to fix it, not the software. If
the software can fix it easily, great. But I'd much rather rely on the
hardware looking after itself, because when hardware goes bad, all bets
are off.
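Just to make the 3-disk point concrete, here's a toy sketch in plain C
(made-up names, in-memory "disks", and certainly not how md or any real
RAID implementation resolves mismatches) of how a majority vote across
three mirrored copies both detects and repairs a single bad copy:

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 16
#define NMIRRORS   3

/* Return the index of the copy that disagrees with the other two,
 * -1 if all copies match, or -2 if no two copies agree. */
static int find_bad_mirror(unsigned char copies[NMIRRORS][BLOCK_SIZE])
{
	int ab = memcmp(copies[0], copies[1], BLOCK_SIZE) == 0;
	int ac = memcmp(copies[0], copies[2], BLOCK_SIZE) == 0;
	int bc = memcmp(copies[1], copies[2], BLOCK_SIZE) == 0;

	if (ab && ac)
		return -1;	/* all three copies agree */
	if (ab)
		return 2;	/* 0 and 1 agree, so 2 is the odd one out */
	if (ac)
		return 1;
	if (bc)
		return 0;
	return -2;		/* all three differ: detectable, not repairable */
}

int main(void)
{
	unsigned char copies[NMIRRORS][BLOCK_SIZE];
	int i, bad;

	/* "Write" the same block to all three mirrors... */
	for (i = 0; i < NMIRRORS; i++)
		memset(copies[i], 0xAA, BLOCK_SIZE);

	/* ...then silently corrupt one byte on one of them. */
	copies[1][5] ^= 0x40;

	bad = find_bad_mirror(copies);
	if (bad >= 0) {
		printf("mirror %d disagrees; rewriting it from the majority\n", bad);
		memcpy(copies[bad], copies[(bad + 1) % NMIRRORS], BLOCK_SIZE);
	} else if (bad == -1) {
		printf("all mirrors agree\n");
	} else {
		printf("no majority: error detected but not correctable\n");
	}
	return 0;
}

With two copies, all you learn is that *something* disagrees; with
three, you also know which copy to throw away, which is why the
page-the-admin, hotswap-a-disk scenario can be automatic. Anyway, back
to the hardware side of it.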
Specifically, you do mention lots of hardware solutions that just
aren't always used. Storage itself seems to be getting cheap enough
that it's worth stepping back a year or two on the Moore's Law curve
and spending the difference on reliability.

>>> The sort of changes this needs hit the block layer and every fs.
>>
>> Seems it would need to hit every application also...
>
> Depending on how far you propagate it. Some people working with huge
> data sets already write and check user-level CRC values for this
> reason (in fact bitkeeper does it, for one example). It should be
> relatively cheap to get much of that benefit without doing it
> application to application, just as TCP gets most of its benefit
> without going app to app.

And yet, if you can do that, I suspect you can, should, and must do it
at a lower level than the FS. Again, FS robustness is good, but if the
disk itself is going, what good is having your directory (mostly)
intact if the files themselves have random corruption? If you can't
trust the disk, you need more than just an FS which can mostly survive
hardware failure. You also need the FS itself (or maybe the block
layer) to support bad block relocation and all that good stuff, or you
need your apps designed to do that job by themselves. It just doesn't
make sense to me to do this at the FS level.

You mention TCP -- OK, but if TCP is doing its job, I shouldn't also
need to implement checksums and other robustness at the protocol layer
(HTTP, FTP, SSH), should I? Because in this analogy, TCP is the "block
layer" and a protocol is the "fs". As I understand it, TCP only lets
the protocol/application know about a problem when something's
seriously FUBARed and it has to drop the connection. Similarly, the FS
(and the apps) shouldn't have to know about hardware problems until
there's really nothing more to be done about them, at which point the
right thing to do is for the FS and apps to go "oh shit" and drop what
they're doing, while the admin replaces hardware and restores from
backup. Or brings a backup server online, or...

I guess my main point is that _undetected_ problems are serious, but if
you can detect them, and you have at least a bit of redundancy, you
should be fine. For instance, if your RAID reports errors that it can't
fix, you bring that server down and let the backup server run.
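P.S. On the user-level CRC thing, here's roughly what I picture
(made-up struct and function names, a plain bitwise CRC-32, and
obviously not bitkeeper's actual code): the application stores a
checksum next to its own data and re-checks it on every read, so silent
corruption anywhere underneath -- FS, block layer, cable, disk -- turns
into a detected error it can react to.

#include <stdio.h>
#include <string.h>

struct record {
	unsigned long crc;	/* checksum stored alongside the payload */
	char payload[64];
};

/* Plain bitwise CRC-32 (IEEE polynomial); slow, but fine for a sketch. */
static unsigned long crc32_calc(const unsigned char *buf, size_t len)
{
	unsigned long crc = 0xFFFFFFFFUL;
	size_t i;
	int j;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (j = 0; j < 8; j++)
			crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320UL : crc >> 1;
	}
	return crc ^ 0xFFFFFFFFUL;
}

static void record_write(struct record *r, const char *data)
{
	memset(r->payload, 0, sizeof(r->payload));
	strncpy(r->payload, data, sizeof(r->payload) - 1);
	r->crc = crc32_calc((const unsigned char *)r->payload,
			    sizeof(r->payload));
}

/* Returns 0 if the payload still matches its stored CRC, -1 if not. */
static int record_verify(const struct record *r)
{
	unsigned long crc = crc32_calc((const unsigned char *)r->payload,
				       sizeof(r->payload));
	return crc == r->crc ? 0 : -1;
}

int main(void)
{
	struct record r;

	record_write(&r, "account 12345: $1000.00");

	/* Simulate a bit flipping somewhere between write and read. */
	r.payload[18] ^= 0x01;

	if (record_verify(&r) != 0)
		printf("corruption detected -- restore from the redundant copy\n");
	else
		printf("record ok: %s\n", r.payload);
	return 0;
}

Once the mismatch is caught, the app is back in the situation I'm
arguing is acceptable: it knows something went wrong and can fall back
to a redundant copy or a backup, instead of quietly showing $1k less on
random accounts.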