* ext2/3: document conditions when reliable operation is possible @ 2009-03-12 9:21 Pavel Machek 2009-03-12 11:40 ` Jochen Voß ` (2 more replies) 0 siblings, 3 replies; 309+ messages in thread From: Pavel Machek @ 2009-03-12 9:21 UTC (permalink / raw) To: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc Cc: linux-ext4 Not all block devices are suitable for all filesystems. In fact, some block devices are so broken that reliable operation is pretty much impossible. Document stuff ext2/ext3 needs for reliable operation. Signed-off-by: Pavel Machek <pavel@ucw.cz> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt new file mode 100644 index 0000000..9c3d729 --- /dev/null +++ b/Documentation/filesystems/expectations.txt @@ -0,0 +1,47 @@ +Linux block-backed filesystems can only work correctly when several +conditions are met in the block layer and below (disks, flash +cards). Some of them are obvious ("data on media should not change +randomly"), some are less so. + +Write errors not allowed (NO-WRITE-ERRORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writes to media never fail. Even if disk returns error condition +during write, filesystems can't handle that correctly, because success +on fsync was already returned when data hit the journal. + + Fortunately writes failing are very uncommon on traditional + spinning disks, as they have spare sectors they use when write + fails. + +Sector writes are atomic (ATOMIC-SECTORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either whole sector is correctly written or nothing is written during +powerfail. + + Unfortuantely, none of the cheap USB/SD flash cards I seen do + behave like this, and are unsuitable for all linux filesystems + I know. + + An inherent problem with using flash as a normal block + device is that the flash erase size is bigger than + most filesystem sector sizes. 
So when you request a + write, it may erase and rewrite the next 64k, 128k, or + even a couple megabytes on the really _big_ ones. + + If you lose power in the middle of that, filesystem + won't notice that data in the "sectors" _around_ the + one your were trying to write to got trashed. + + Because RAM tends to fail faster than rest of system during + powerfail, special hw killing DMA transfers may be neccessary; + otherwise, disks may write garbage during powerfail. + Not sure how common that problem is on generic PC machines. + + Note that atomic write is very hard to guarantee for RAID-4/5/6, + because it needs to write both changed data, and parity, to + different disks. + + + diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index 4333e83..b09aa4c 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they have to be 8 character filenames, even then we are fairly close to running out of unique filenames. +Requirements +============ + +Ext3 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed + +* sector writes are atomic + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* write caching is disabled. ext2 does not know how to issue barriers + as of 2.6.28. hdparm -W0 disables it on SATA disks. + Journaling ----------- - -A journaling extension to the ext2 code has been developed by Stephen -Tweedie. It avoids the risks of metadata corruption and the need to -wait for e2fsck to complete after a crash, without requiring a change -to the on-disk ext2 layout. 
In a nutshell, the journal is a regular -file which stores whole metadata (and optionally data) blocks that have -been modified, prior to writing them into the filesystem. This means -it is possible to add a journal to an existing ext2 filesystem without -the need for data conversion. - -When changes to the filesystem (e.g. a file is renamed) they are stored in -a transaction in the journal and can either be complete or incomplete at -the time of a crash. If a transaction is complete at the time of a crash -(or in the normal case where the system does not crash), then any blocks -in that transaction are guaranteed to represent a valid filesystem state, -and are copied into the filesystem. If a transaction is incomplete at -the time of the crash, then there is no guarantee of consistency for -the blocks in that transaction so they are discarded (which means any -filesystem changes they represent are also lost). +========== Check Documentation/filesystems/ext3.txt if you want to read more about ext3 and journaling. diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 9dd2a3b..02a9bd5 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -188,6 +200,27 @@ mke2fs: create a ext3 partition with the -j flag. debugfs: ext2 and ext3 file system debugger. ext2online: online (mounted) ext2 and ext3 filesystem resizer +Requirements +============ + +Ext3 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed + +* sector writes are atomic + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* either write caching is disabled, or hw can do barriers and they are enabled. + + (Note that barriers are disabled by default, use "barrier=1" + mount option after making sure hw can support them). 
+ + hdparm -I reports disk features; "Native + Command Queueing" is the feature you are looking for. References ========== -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 309+ messages in thread
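The erase-block hazard the patch describes can be sketched as a toy model (illustrative only; real flash controllers also remap and wear-level, and every name below is made up for the sketch): a single "sector" write is serviced by erasing and reprogramming a whole erase block, so power loss mid-operation destroys neighbouring sectors that were never written to.

```python
# Toy model of a flash device whose erase block spans many 512-byte
# "sectors". Purely illustrative -- real flash translation layers are
# far more complex (remapping, wear leveling).

SECTOR = 512
ERASE_BLOCK = 128 * 1024              # 128k erase block, as in the text
SECTORS_PER_BLOCK = ERASE_BLOCK // SECTOR

def write_sector(media, sector_no, data, power_fails=False):
    """Write one sector by erasing and rewriting its whole erase block.

    If power fails between erase and reprogram, the entire block reads
    back erased: sectors *around* the target are lost, not just the
    target itself.
    """
    assert len(data) == SECTOR
    block = sector_no // SECTORS_PER_BLOCK
    start = block * SECTORS_PER_BLOCK
    # Read-modify-write of the whole erase block:
    new_block = [media[start + i] for i in range(SECTORS_PER_BLOCK)]
    new_block[sector_no - start] = data
    for i in range(SECTORS_PER_BLOCK):
        media[start + i] = b"\xff" * SECTOR        # erase phase
    if power_fails:
        return                                     # block left erased
    for i in range(SECTORS_PER_BLOCK):
        media[start + i] = new_block[i]            # program phase

media = {n: bytes([n % 256]) * SECTOR for n in range(2 * SECTORS_PER_BLOCK)}
old_neighbour = media[1]
write_sector(media, 0, b"x" * SECTOR, power_fails=True)
# Sector 1 was never written to, yet its data is gone:
print(media[1] == old_neighbour)   # False
```

The filesystem only asked to write sector 0, so after the crash it has no reason to re-check sector 1; that is why journaling does not help here.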
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-12 9:21 ext2/3: document conditions when reliable operation is possible Pavel Machek @ 2009-03-12 11:40 ` Jochen Voß 2009-03-21 11:24 ` Pavel Machek 2009-03-12 19:13 ` Rob Landley 2009-03-16 19:45 ` Greg Freemyer 2 siblings, 1 reply; 309+ messages in thread From: Jochen Voß @ 2009-03-12 11:40 UTC (permalink / raw) To: Pavel Machek Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 Hi, 2009/3/12 Pavel Machek <pavel@ucw.cz>: > diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt > index 4333e83..b09aa4c 100644 > --- a/Documentation/filesystems/ext2.txt > +++ b/Documentation/filesystems/ext2.txt > @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they > have to be 8 character filenames, even then we are fairly close to > running out of unique filenames. > > +Requirements > +============ > + > +Ext3 expects disk/storage subsystem to behave sanely. On sanely ^^^^ Shouldn't this be "Ext2"? All the best, Jochen -- http://seehuhn.de/ ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-12 11:40 ` Jochen Voß @ 2009-03-21 11:24 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-03-21 11:24 UTC (permalink / raw) To: Jochen Voß Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Thu 2009-03-12 11:40:52, Jochen Voß wrote: > Hi, > > 2009/3/12 Pavel Machek <pavel@ucw.cz>: > > diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt > > index 4333e83..b09aa4c 100644 > > --- a/Documentation/filesystems/ext2.txt > > +++ b/Documentation/filesystems/ext2.txt > > @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they > > have to be 8 character filenames, even then we are fairly close to > > running out of unique filenames. > > > > +Requirements > > +============ > > + > > +Ext3 expects disk/storage subsystem to behave sanely. On sanely > ^^^^ > Shouldn't this be "Ext2"? Thanks, fixed. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-12 9:21 ext2/3: document conditions when reliable operation is possible Pavel Machek 2009-03-12 11:40 ` Jochen Voß @ 2009-03-12 19:13 ` Rob Landley 2009-03-16 12:28 ` Pavel Machek ` (2 more replies) 2009-03-16 19:45 ` Greg Freemyer 2 siblings, 3 replies; 309+ messages in thread From: Rob Landley @ 2009-03-12 19:13 UTC (permalink / raw) To: Pavel Machek Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Thursday 12 March 2009 04:21:14 Pavel Machek wrote: > Not all block devices are suitable for all filesystems. In fact, some > block devices are so broken that reliable operation is pretty much > impossible. Document stuff ext2/ext3 needs for reliable operation. > > Signed-off-by: Pavel Machek <pavel@ucw.cz> > > diff --git a/Documentation/filesystems/expectations.txt > b/Documentation/filesystems/expectations.txt new file mode 100644 > index 0000000..9c3d729 > --- /dev/null > +++ b/Documentation/filesystems/expectations.txt > @@ -0,0 +1,47 @@ > +Linux block-backed filesystems can only work correctly when several > +conditions are met in the block layer and below (disks, flash > +cards). Some of them are obvious ("data on media should not change > +randomly"), some are less so. > + > +Write errors not allowed (NO-WRITE-ERRORS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Writes to media never fail. Even if disk returns error condition > +during write, filesystems can't handle that correctly, because success > +on fsync was already returned when data hit the journal. > + > + Fortunately writes failing are very uncommon on traditional > + spinning disks, as they have spare sectors they use when write > + fails. I vaguely recall that the behavior of when a write error _does_ occur is to remount the filesystem read only? (Is this VFS or per-fs?) Is there any kind of hotplug event associated with this? 
I'm aware write errors shouldn't happen, and by the time they do it's too late to gracefully handle them, and all we can do is fail. So how do we fail? > +Sector writes are atomic (ATOMIC-SECTORS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Either whole sector is correctly written or nothing is written during > +powerfail. > + > + Unfortuantely, none of the cheap USB/SD flash cards I seen do I've seen > + behave like this, and are unsuitable for all linux filesystems "are thus unsuitable", perhaps? (Too pretentious? :) > + I know. > + > + An inherent problem with using flash as a normal block > + device is that the flash erase size is bigger than > + most filesystem sector sizes. So when you request a > + write, it may erase and rewrite the next 64k, 128k, or > + even a couple megabytes on the really _big_ ones. Somebody corrected me, it's not "the next" it's "the surrounding". (Writes aren't always cleanly at the start of an erase block, so critical data _before_ what you touch is endangered too.) > + If you lose power in the middle of that, filesystem > + won't notice that data in the "sectors" _around_ the > + one your were trying to write to got trashed. > + > + Because RAM tends to fail faster than rest of system during > + powerfail, special hw killing DMA transfers may be neccessary; Necessary > + otherwise, disks may write garbage during powerfail. > + Not sure how common that problem is on generic PC machines. > + > + Note that atomic write is very hard to guarantee for RAID-4/5/6, > + because it needs to write both changed data, and parity, to > + different disks. These days instead of "atomic" it's better to think in terms of "barriers". Requesting a flush blocks until all the data written _before_ that point has made it to disk. 
This wait may be arbitrarily long on a busy system with lots of disk transactions happening in parallel (perhaps because Firefox decided to garbage collect and is spending the next 30 seconds swapping itself back in to do so). > + > + > diff --git a/Documentation/filesystems/ext2.txt > b/Documentation/filesystems/ext2.txt index 4333e83..b09aa4c 100644 > --- a/Documentation/filesystems/ext2.txt > +++ b/Documentation/filesystems/ext2.txt > @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory > entries, so they have to be 8 character filenames, even then we are fairly > close to running out of unique filenames. > > +Requirements > +============ > + > +Ext3 expects disk/storage subsystem to behave sanely. On sanely > +behaving disk subsystem, data that have been successfully synced will > +stay on the disk. Sane means: This paragraph talks about ext3... > +* write errors not allowed > + > +* sector writes are atomic > + > +(see expectations.txt; note that most/all linux block-based > +filesystems have similar expectations) > + > +* write caching is disabled. ext2 does not know how to issue barriers > + as of 2.6.28. hdparm -W0 disables it on SATA disks. And here we're talking about ext2. Does neither one know about write barriers, or does this just apply to ext2? (What about ext4?) Also I remember a historical problem that not all disks honor write barriers, because actual data integrity makes for horrible benchmark numbers. Dunno how current that is with SATA, Alan Cox would probably know. Rob ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-12 19:13 ` Rob Landley @ 2009-03-16 12:28 ` Pavel Machek 2009-03-16 19:26 ` Rob Landley 2009-03-16 12:30 ` Pavel Machek 2009-08-29 1:33 ` Robert Hancock 2 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-03-16 12:28 UTC (permalink / raw) To: Rob Landley Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 Hi! > > +Write errors not allowed (NO-WRITE-ERRORS) > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > + > > +Writes to media never fail. Even if disk returns error condition > > +during write, filesystems can't handle that correctly, because success > > +on fsync was already returned when data hit the journal. > > + > > + Fortunately writes failing are very uncommon on traditional > > + spinning disks, as they have spare sectors they use when write > > + fails. > > I vaguely recall that the behavior of when a write error _does_ occur is to > remount the filesystem read only? (Is this VFS or per-fs?) Per-fs. > Is there any kind of hotplug event associated with this? I don't think so. > I'm aware write errors shouldn't happen, and by the time they do it's too late > to gracefully handle them, and all we can do is fail. So how do we > fail? Well, even remount-ro may be too late, IIRC. > > + Unfortuantely, none of the cheap USB/SD flash cards I seen do > > I've seen > > > + behave like this, and are unsuitable for all linux filesystems > > "are thus unsuitable", perhaps? (Too pretentious? :) ACK, thanks. > > + I know. > > + > > + An inherent problem with using flash as a normal block > > + device is that the flash erase size is bigger than > > + most filesystem sector sizes. So when you request a > > + write, it may erase and rewrite the next 64k, 128k, or > > + even a couple megabytes on the really _big_ ones. > > Somebody corrected me, it's not "the next" it's "the surrounding". Its "some" ... due to wear leveling logic. 
> (Writes aren't always cleanly at the start of an erase block, so critical data > _before_ what you touch is endangered too.) Well, flashes do remap, so it is actually "random blocks". > > + otherwise, disks may write garbage during powerfail. > > + Not sure how common that problem is on generic PC machines. > > + > > + Note that atomic write is very hard to guarantee for RAID-4/5/6, > > + because it needs to write both changed data, and parity, to > > + different disks. > > These days instead of "atomic" it's better to think in terms of > "barriers". This is not about barriers (that should be different topic). Atomic write means that either whole sector is written, or nothing at all is written. Because raid5 needs to update both master data and parity at the same time, I don't think it can guarantee this during powerfail. > > +Requirements > > +* write errors not allowed > > + > > +* sector writes are atomic > > + > > +(see expectations.txt; note that most/all linux block-based > > +filesystems have similar expectations) > > + > > +* write caching is disabled. ext2 does not know how to issue barriers > > + as of 2.6.28. hdparm -W0 disables it on SATA disks. > > And here we're talking about ext2. Does neither one know about write > barriers, or does this just apply to ext2? (What about ext4?) This document is about ext2. Ext3 can support barriers in 2.6.28. Someone else needs to write ext4 docs :-). > Also I remember a historical problem that not all disks honor write barriers, > because actual data integrity makes for horrible benchmark numbers. Dunno how > current that is with SATA, Alan Cox would probably know. Sounds like broken disk, then. We should blacklist those. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-16 12:28 ` Pavel Machek @ 2009-03-16 19:26 ` Rob Landley 2009-03-23 10:45 ` Pavel Machek 0 siblings, 1 reply; 309+ messages in thread From: Rob Landley @ 2009-03-16 19:26 UTC (permalink / raw) To: Pavel Machek Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Monday 16 March 2009 07:28:47 Pavel Machek wrote: > Hi! > > > + Fortunately writes failing are very uncommon on traditional > > > + spinning disks, as they have spare sectors they use when write > > > + fails. > > > > I vaguely recall that the behavior of when a write error _does_ occur is > > to remount the filesystem read only? (Is this VFS or per-fs?) > > Per-fs. Might be nice to note that in the doc. > > Is there any kind of hotplug event associated with this? > > I don't think so. There probably should be, but that's a separate issue. > > I'm aware write errors shouldn't happen, and by the time they do it's too > > late to gracefully handle them, and all we can do is fail. So how do we > > fail? > > Well, even remount-ro may be too late, IIRC. Care to elaborate? (When a filesystem is mounted RO, I'm not sure what happens to the pages that have already been dirtied...) > > (Writes aren't always cleanly at the start of an erase block, so critical > > data _before_ what you touch is endangered too.) > > Well, flashes do remap, so it is actually "random blocks". Fun. When "please do not turn of your playstation until game save completes" honestly seems like the best solution for making the technology reliable, something is wrong with the technology. I think I'll stick with rotating disks for now, thanks. > > > + otherwise, disks may write garbage during powerfail. > > > + Not sure how common that problem is on generic PC machines. 
> > > + > > > + Note that atomic write is very hard to guarantee for RAID-4/5/6, > > > + because it needs to write both changed data, and parity, to > > > + different disks. > > > > These days instead of "atomic" it's better to think in terms of > > "barriers". > > This is not about barriers (that should be different topic). Atomic > write means that either whole sector is written, or nothing at all is > written. Because raid5 needs to update both master data and parity at > the same time, I don't think it can guarantee this during powerfail. Good point, but I thought that's what journaling was for? I'm aware that any flash filesystem _must_ be journaled in order to work sanely, and must be able to view the underlying erase granularity down to the bare metal, through any remapping the hardware's doing. Possibly what's really needed is a "flash is weird" section, since flash filesystems can't be mounted on arbitrary block devices. Although an "-O erase_size=128" option so they _could_ would be nice. There's "mtdram" which seems to be the only remaining use for ram disks, but why there isn't an "mtdwrap" that works with arbitrary underlying block devices, I have no idea. (Layering it on top of a loopback device would be most useful.) > > > +Requirements > > > +* write errors not allowed > > > + > > > +* sector writes are atomic > > > + > > > +(see expectations.txt; note that most/all linux block-based > > > +filesystems have similar expectations) > > > + > > > +* write caching is disabled. ext2 does not know how to issue barriers > > > + as of 2.6.28. hdparm -W0 disables it on SATA disks. > > > > And here we're talking about ext2. Does neither one know about write > > barriers, or does this just apply to ext2? (What about ext4?) > > This document is about ext2. Ext3 can support barriers in > 2.6.28. Someone else needs to write ext4 docs :-). 
> > > Also I remember a historical problem that not all disks honor write > > barriers, because actual data integrity makes for horrible benchmark > > numbers. Dunno how current that is with SATA, Alan Cox would probably > > know. > > Sounds like broken disk, then. We should blacklist those. It wasn't just one brand of disk cheating like that, and you'd have to ask him (or maybe Jens Axboe or somebody) whether the problem is still current. I've been off in embedded-land for a few years now... Rob ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-16 19:26 ` Rob Landley @ 2009-03-23 10:45 ` Pavel Machek 2009-03-30 15:06 ` Goswin von Brederlow 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-03-23 10:45 UTC (permalink / raw) To: Rob Landley Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Mon 2009-03-16 14:26:23, Rob Landley wrote: > On Monday 16 March 2009 07:28:47 Pavel Machek wrote: > > Hi! > > > > + Fortunately writes failing are very uncommon on traditional > > > > + spinning disks, as they have spare sectors they use when write > > > > + fails. > > > > > > I vaguely recall that the behavior of when a write error _does_ occur is > > > to remount the filesystem read only? (Is this VFS or per-fs?) > > > > Per-fs. > > Might be nice to note that in the doc. Ok, can you suggest a patch? I believe remount-ro is already documented ... somewhere :-). > > > I'm aware write errors shouldn't happen, and by the time they do it's too > > > late to gracefully handle them, and all we can do is fail. So how do we > > > fail? > > > > Well, even remount-ro may be too late, IIRC. > > Care to elaborate? (When a filesystem is mounted RO, I'm not sure what > happens to the pages that have already been dirtied...) Well, fsync() error reporting does not really work properly, but I guess it will save you for the remount-ro case. So the data will be in the journal, but it will be impossible to replay it... > > > (Writes aren't always cleanly at the start of an erase block, so critical > > > data _before_ what you touch is endangered too.) > > > > Well, flashes do remap, so it is actually "random blocks". > > Fun. Yes. > > > > + otherwise, disks may write garbage during powerfail. > > > > + Not sure how common that problem is on generic PC machines. 
> > > > + > > > > + Note that atomic write is very hard to guarantee for RAID-4/5/6, > > > > + because it needs to write both changed data, and parity, to > > > > + different disks. > > > > > > These days instead of "atomic" it's better to think in terms of > > > "barriers". > > > > This is not about barriers (that should be different topic). Atomic > > write means that either whole sector is written, or nothing at all is > > written. Because raid5 needs to update both master data and parity at > > the same time, I don't think it can guarantee this during powerfail. > > Good point, but I thought that's what journaling was for? I believe journaling operates on assumption that "either whole sector is written, or nothing at all is written". > I'm aware that any flash filesystem _must_ be journaled in order to work > sanely, and must be able to view the underlying erase granularity down to the > bare metal, through any remapping the hardware's doing. Possibly what's > really needed is a "flash is weird" section, since flash filesystems can't be > mounted on arbitrary block devices. > Although an "-O erase_size=128" option so they _could_ would be nice. There's > "mtdram" which seems to be the only remaining use for ram disks, but why there > isn't an "mtdwrap" that works with arbitrary underlying block devices, I have > no idea. (Layering it on top of a loopback device would be most > useful.) I don't think that works. Compactflash (etc) cards basically randomly remap the data, so you can't really run flash filesystem over compactflash/usb/SD card -- you don't know the details of remapping. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-23 10:45 ` Pavel Machek @ 2009-03-30 15:06 ` Goswin von Brederlow 2009-08-24 9:26 ` Pavel Machek 2009-08-24 9:31 ` [patch] " Pavel Machek 0 siblings, 2 replies; 309+ messages in thread From: Goswin von Brederlow @ 2009-03-30 15:06 UTC (permalink / raw) To: Pavel Machek Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 Pavel Machek <pavel@ucw.cz> writes: > On Mon 2009-03-16 14:26:23, Rob Landley wrote: >> On Monday 16 March 2009 07:28:47 Pavel Machek wrote: >> > > > + otherwise, disks may write garbage during powerfail. >> > > > + Not sure how common that problem is on generic PC machines. >> > > > + >> > > > + Note that atomic write is very hard to guarantee for RAID-4/5/6, >> > > > + because it needs to write both changed data, and parity, to >> > > > + different disks. >> > > >> > > These days instead of "atomic" it's better to think in terms of >> > > "barriers". Would be nice to have barriers in md and dm. >> > This is not about barriers (that should be different topic). Atomic >> > write means that either whole sector is written, or nothing at all is >> > written. Because raid5 needs to update both master data and parity at >> > the same time, I don't think it can guarantee this during powerfail. Actually raid5 should have no problem with a power failure during normal operations of the raid. The parity block should get marked out of sync, then the new data block should be written, then the new parity block and then the parity block should be flagged in sync. >> Good point, but I thought that's what journaling was for? > > I believe journaling operates on assumption that "either whole sector > is written, or nothing at all is written". The real problem comes in degraded mode. In that case the data block (if present) and parity block must be written at the same time atomically.
If the system crashes after writing one but before writing the other then the data block on the missing drive changes its contents. And for example with a chunk size of 1MB and 16 disks that could be 15MB away from the block you actually do change. And you cannot recover that after a crash as you need both the original and changed contents of the block. So writing one sector has the risk of corrupting another (for the FS) totally unconnected sector. No amount of journaling will help there. The raid5 would need to do journaling or use battery backed cache. MfG Goswin
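Goswin's degraded-mode failure can be sketched numerically with a toy 3-disk RAID-5 using byte-wise XOR parity (a simplified model, not md's actual on-disk layout):

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Toy degraded 3-disk RAID-5: the disk holding d1 has failed, so d1
# only exists implicitly as d0 XOR parity.
d0 = bytes([1, 2, 3, 4])
d1 = bytes([9, 9, 9, 9])        # content of the block on the failed disk
parity = xor(d0, d1)

assert xor(d0, parity) == d1    # degraded read of d1 still works

# Updating d0 requires writing new data and new parity to *different*
# disks atomically.  Simulate a crash after the data write but before
# the parity write:
d0 = bytes([7, 7, 7, 7])        # new data hits the disk...
# ...power fails here: parity is never updated.

reconstructed_d1 = xor(d0, parity)
print(reconstructed_d1 == d1)   # False: d1 is corrupted even though
                                # nothing ever wrote to it
```

Both the old and new contents of d0 would be needed to repair parity after the crash, and neither journaling in the filesystem above nor the array itself has them, which is why a RAID journal or battery-backed cache is required.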
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-30 15:06 ` Goswin von Brederlow @ 2009-08-24 9:26 ` Pavel Machek 2009-08-24 9:31 ` [patch] " Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 9:26 UTC (permalink / raw) To: Goswin von Brederlow Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 Hi! > >> > This is not about barriers (that should be different topic). Atomic > >> > write means that either whole sector is written, or nothing at all is > >> > written. Because raid5 needs to update both master data and parity at > >> > the same time, I don't think it can guarantee this during powerfail. > > Actualy raid5 should have no problem with a power failure during > normal operations of the raid. The parity block should get marked out > of sync, then the new data block should be written, then the new > parity block and then the parity block should be flaged in sync. > > >> Good point, but I thought that's what journaling was for? > > > > I believe journaling operates on assumption that "either whole sector > > is written, or nothing at all is written". > > The real problem comes in degraded mode. In that case the data block > (if present) and parity block must be written at the same time > atomically. If the system crashes after writing one but before writing > the other then the data block on the missng drive changes its > contents. And for example with a chunk size of 1MB and 16 disks that > could be 15MB away from the block you actualy do change. And you can > not recover that after a crash as you need both the original and > changed contents of the block. > > So writing one sector has the risk of corrupting another (for the FS) > totally unconnected sector. No amount of journaling will help > there. The raid5 would need to do journaling or use battery backed > cache. Thanks, I updated my notes. 
Pavel
* [patch] ext2/3: document conditions when reliable operation is possible 2009-03-30 15:06 ` Goswin von Brederlow 2009-08-24 9:26 ` Pavel Machek @ 2009-08-24 9:31 ` Pavel Machek 2009-08-24 11:19 ` Florian Weimer ` (2 more replies) 1 sibling, 3 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 9:31 UTC (permalink / raw) To: Goswin von Brederlow Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 Running journaling filesystem such as ext3 over flashdisk or degraded RAID array is a bad idea: journaling guarantees no longer apply and you will get data corruption on powerfail. We can't solve it easily, but we should certainly warn the users. I actually lost data because I did not understand these limitations... Signed-off-by: Pavel Machek <pavel@ucw.cz> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt new file mode 100644 index 0000000..80fa886 --- /dev/null +++ b/Documentation/filesystems/expectations.txt @@ -0,0 +1,52 @@ +Linux block-backed filesystems can only work correctly when several +conditions are met in the block layer and below (disks, flash +cards). Some of them are obvious ("data on media should not change +randomly"), some are less so. + +Write errors not allowed (NO-WRITE-ERRORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writes to media never fail. Even if disk returns error condition +during write, filesystems can't handle that correctly. + + Fortunately writes failing are very uncommon on traditional + spinning disks, as they have spare sectors they use when write + fails. + +Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Unfortunately, cheap USB/SD flash cards I've seen do have this bug, +and are thus unsuitable for all filesystems I know. 
+ + An inherent problem with using flash as a normal block device + is that the flash erase size is bigger than most filesystem + sector sizes. So when you request a write, it may erase and + rewrite some 64k, 128k, or even a couple megabytes on the + really _big_ ones. + + If you lose power in the middle of that, the filesystem won't + notice that data in the "sectors" _around_ the one you were + trying to write to got trashed. + + RAID-4/5/6 in degraded mode has the same problem. + + +Don't damage the old data on a failed write (ATOMIC-WRITES) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either the whole sector is correctly written or nothing is written during +powerfail. + + Because RAM tends to fail faster than the rest of the system during + powerfail, special hw killing DMA transfers may be necessary; + otherwise, disks may write garbage during powerfail. + This may be quite common on generic PC machines. + + Note that atomic write is very hard to guarantee for RAID-4/5/6, + because it needs to write both changed data, and parity, to + different disks. (But it will only really show up in degraded mode.) + A UPS for the RAID array should help. + + + diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index 67639f9..0a9b87f 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they have to be 8 character filenames, even then we are fairly close to running out of unique filenames. +Requirements +============ + +Ext2 expects the disk/storage subsystem to behave sanely. On a sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk.
Sane means: + +* write errors not allowed (NO-WRITE-ERRORS) + +* don't damage the old data on a failed write (ATOMIC-WRITES) + +and obviously: + +* don't cause collateral damage to adjacent sectors on a failed write + (NO-COLLATERALS) + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* write caching is disabled. ext2 does not know how to issue barriers + as of 2.6.28. hdparm -W0 disables it on SATA disks. + Journaling ----------- - -A journaling extension to the ext2 code has been developed by Stephen -Tweedie. It avoids the risks of metadata corruption and the need to -wait for e2fsck to complete after a crash, without requiring a change -to the on-disk ext2 layout. In a nutshell, the journal is a regular -file which stores whole metadata (and optionally data) blocks that have -been modified, prior to writing them into the filesystem. This means -it is possible to add a journal to an existing ext2 filesystem without -the need for data conversion. - -When changes to the filesystem (e.g. a file is renamed) they are stored in -a transaction in the journal and can either be complete or incomplete at -the time of a crash. If a transaction is complete at the time of a crash -(or in the normal case where the system does not crash), then any blocks -in that transaction are guaranteed to represent a valid filesystem state, -and are copied into the filesystem. If a transaction is incomplete at -the time of the crash, then there is no guarantee of consistency for -the blocks in that transaction so they are discarded (which means any -filesystem changes they represent are also lost). +========== Check Documentation/filesystems/ext3.txt if you want to read more about ext3 and journaling. 
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 570f9bd..2ce82a3 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -199,6 +202,47 @@ debugfs: ext2 and ext3 file system debugger. ext2online: online (mounted) ext2 and ext3 filesystem resizer +Requirements +============ + +Ext3 expects the disk/storage subsystem to behave sanely. On a sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed (NO-WRITE-ERRORS) + +* don't damage the old data on a failed write (ATOMIC-WRITES) + + (Trash may get written into sectors during powerfail. Ext3 + handles this surprisingly well, at least in the catastrophic + case of garbage getting written into the inode table, since + the journal replay often will "repair" the garbage that was + written into the filesystem metadata blocks. It won't do a + bit of good for the data blocks, of course (unless you are + using data=journal mode). But this means that ext3 is in fact + more resistant to the first problem (powerfail while writing + can damage old data on a failed write); fortunately, hard + drives generally don't cause collateral damage on a failed + write.) + +and obviously: + +* don't cause collateral damage to adjacent sectors on a failed write + (NO-COLLATERALS) + + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* either write caching is disabled, or hw can do barriers and they are enabled. + + (Note that barriers are disabled by default; use the "barrier=1" + mount option after making sure hw can support them.) + + hdparm -I reports disk features; "Native Command Queueing" is + the feature you are looking for.
+ + References ========== -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 309+ messages in thread
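The NO-COLLATERALS failure mode the patch describes can be sketched as a toy simulation. This is a deliberately simplified, hypothetical model of a flash translation layer (tiny erase blocks, garbage modeled as erased 0xFF sectors); real cards behave in more complicated ways, but the shape of the failure is the same:

```python
# Toy model of the flash erase-block problem described in
# expectations.txt. Purely illustrative; real FTLs are more complex.
ERASE_BLOCK = 8          # sectors per erase block (real devices: far more)
SECTOR = 512             # bytes per sector

def write_sector(device, lba, data, power_fails_after=None):
    """Write one sector by erasing and rewriting its whole erase block.

    If power_fails_after is given, power is lost after that many
    sectors of the rewrite, leaving the rest of the block erased.
    """
    base = (lba // ERASE_BLOCK) * ERASE_BLOCK
    new_block = [bytes(device[base + i]) for i in range(ERASE_BLOCK)]
    new_block[lba - base] = data
    # Erase: the whole block goes to 0xFF first.
    for i in range(ERASE_BLOCK):
        device[base + i] = b"\xff" * SECTOR
    # Rewrite, possibly interrupted by power failure.
    for i in range(ERASE_BLOCK):
        if power_fails_after is not None and i >= power_fails_after:
            return  # power gone: remaining sectors stay erased
        device[base + i] = new_block[i]

device = [bytes([n]) * SECTOR for n in range(16)]
# Power fails after only 2 of the 8 sectors in block 0 were rewritten.
write_sector(device, lba=5, data=b"Z" * SECTOR, power_fails_after=2)

# The sector we wrote (5) AND untouched neighbors (2, 3, 4, 6, 7)
# are now gone, even though we only asked to write sector 5.
trashed = [i for i in range(8) if device[i] == b"\xff" * SECTOR]
print(trashed)  # sectors 2..7 lost
```

Running this prints the trashed sectors: everything in the erase block from the interruption point onward is lost, not just the sector the filesystem asked to write, which is exactly why journaling alone cannot protect the surrounding "sectors".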
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 9:31 ` [patch] " Pavel Machek @ 2009-08-24 11:19 ` Florian Weimer 2009-08-24 13:01 ` Theodore Tso ` (2 more replies) 2009-08-24 13:21 ` Greg Freemyer 2009-08-24 21:11 ` Rob Landley 2 siblings, 3 replies; 309+ messages in thread From: Florian Weimer @ 2009-08-24 11:19 UTC (permalink / raw) To: Pavel Machek Cc: Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 * Pavel Machek: > +Linux block-backed filesystems can only work correctly when several > +conditions are met in the block layer and below (disks, flash > +cards). Some of them are obvious ("data on media should not change > +randomly"), some are less so. You should make clear that the file lists per-file-system rules and that some file systems can recover from some of the error conditions. > +* don't damage the old data on a failed write (ATOMIC-WRITES) > + > + (Thrash may get written into sectors during powerfail. And > + ext3 handles this surprisingly well at least in the > + catastrophic case of garbage getting written into the inode > + table, since the journal replay often will "repair" the > + garbage that was written into the filesystem metadata blocks. Isn't this by design? In other words, if the metadata doesn't survive non-atomic writes, wouldn't it be an ext3 bug? -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 11:19 ` Florian Weimer @ 2009-08-24 13:01 ` Theodore Tso 2009-08-24 14:55 ` Artem Bityutskiy ` (2 more replies) 2009-08-24 13:50 ` Theodore Tso 2009-08-24 18:39 ` Pavel Machek 2 siblings, 3 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-24 13:01 UTC (permalink / raw) To: Florian Weimer Cc: Pavel Machek, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Mon, Aug 24, 2009 at 11:19:01AM +0000, Florian Weimer wrote: > * Pavel Machek: > > > +Linux block-backed filesystems can only work correctly when several > > +conditions are met in the block layer and below (disks, flash > > +cards). Some of them are obvious ("data on media should not change > > +randomly"), some are less so. > > You should make clear that the file lists per-file-system rules and > that some file sytems can recover from some of the error conditions. The only one that falls into that category is the one about not being able to handle failed writes, and the way most failures take place, they generally fail the ATOMIC-WRITES criterion in any case. That is, when a write fails, an attempt to read from that sector will generally result in either (a) an error, or (b) data other than what was there before the write was attempted. > > +* don't damage the old data on a failed write (ATOMIC-WRITES) > > + > > + (Thrash may get written into sectors during powerfail. And > > + ext3 handles this surprisingly well at least in the > > + catastrophic case of garbage getting written into the inode > > + table, since the journal replay often will "repair" the > > + garbage that was written into the filesystem metadata blocks. > > Isn't this by design? In other words, if the metadata doesn't survive > non-atomic writes, wouldn't it be an ext3 bug? Part of the problem here is that "atomic-writes" is confusing; it doesn't mean what many people think it means. 
The assumption which many naive filesystem designers make is that writes succeed or they don't. If they don't succeed, they don't change the previously existing data in any way. So in the case of journalling, the assumption which gets made is that when the power fails, the disk either writes a particular disk block, or it doesn't. The problem here is that, as with humans and animals, death is not an event, it is a process. When the power fails, the system doesn't just stop functioning; the power on the +5 and +12 volt rails starts dropping to zero, and different components fail at different times. Specifically, DRAM, being the most voltage sensitive, tends to fail before the DMA subsystem, the PCI bus, and the hard drive fail. So as a result, garbage can get written out to disk as part of the failure. That's just the way hardware works. Now consider a file system which does logical journalling. It has written to the journal, using a compact encoding, "the i_blocks field is now 25, and i_size is 13000", and the journal transaction has committed. So now, it's time to update the inode on disk; but at that precise moment, the power fails, and garbage is written to the inode table. Oops! The entire sector containing the inode is trashed. But the only thing which is recorded in the journal is the new value of i_blocks and i_size. So a journal replay won't help file systems that do logical block journalling. Is that a file system "bug"? Well, it's better to call that a mismatch between the assumptions made of physical devices, and of the file system code. On Irix, SGI hardware had a powerfail interrupt, and the power supply had extra-big capacitors, so that when a powerfail interrupt came in, Irix would run around frantically shutting down pending DMA transfers to prevent this failure mode from causing problems. PC class hardware (according to Ted's law) is cr*p, and doesn't have a powerfail interrupt, so it's not something that we have.
Ext3, ext4, and ocfs2 do physical block journalling, so as long as journal truncate hasn't taken place right before the failure, the replay of the physical block journal tends to repair most (but not necessarily all) cases of "garbage is written right before power failure". People who care about this should really use a UPS, and wire up the USB and/or serial cable from the UPS to the system, so that the OS can do a controlled shutdown if the UPS is close to shutting down due to an extended power failure. There is another kind of non-atomic write that nearly all file systems are subject to, however, and to give an example of this, consider what happens if a laptop is subjected to a sudden shock while it is writing a sector, and the hard drive doesn't have an accelerometer which tries to anticipate such shocks. (nb, these things aren't fool-proof; even if a HDD has one of these sensors, they only work if they can detect the transition to free-fall, and the hard drive has time to retract the heads before the actual shock hits; if you have a sudden shock, the g-shock sensors won't have time to react and save the hard drive). Depending on how severe the shock happens to be, the head could end up impacting the platter, destroying the medium (which used to be iron-oxide; hence the term "spinning rust platters") at that spot. This will obviously cause a write failure, and the previous contents of the sector will be lost. This is also considered a failure of the ATOMIC-WRITE property, and no, ext3 doesn't handle this case gracefully. Very few file systems do. (It is possible for an OS that doesn't have fixed metadata to immediately write the inode table to a different location on the disk, and then update the pointers to the inode table to point to the new location on disk; but very few filesystems do this, and even those that do usually rely on the superblock being available at a fixed location on disk.
It's much simpler to assume that hard drives usually behave sanely, and that writes very rarely fail.) It's for this reason that I've never been completely sure how useful Pavel's proposed treatise about file system expectations really is --- because all storage subsystems *usually* provide these guarantees, but it is the very rare storage system that *always* provides these guarantees. We could just as easily have several kilobytes of explanation in Documentation/* explaining how we assume that DRAM always returns the same value that was stored in it previously --- and yet most PC class hardware still does not use ECC memory, and cosmic rays are a reality. That means that most Linux systems run on systems that are vulnerable to this kind of failure --- and the world hasn't ended. As I recall, the main problem which Pavel had was when he was using ext3 on a *really* trashy flash drive, on a *really* trashy laptop where the flash card stuck out slightly, and any jostling of the netbook would cause the flash card to become disconnected from the laptop, and cause write errors, very easily and very frequently. In those circumstances, it's highly unlikely that ***any*** file system would have been able to survive such an unreliable storage system. One of the problems I have with the breakdown which Pavel has used is that it doesn't break things down according to probability; the chance of a storage subsystem scribbling garbage on its last write during a power failure is very different from the chance that the hard drive fails due to a shock, or due to some spilled printer toner near the disk drive which somehow manages to find its way inside the enclosure containing the spinning platters, versus the other forms of random failures that lead to write failures.
All of these fall into the category of a failure of the property he has named "ATOMIC-WRITE", but in fact the ways in which the filesystem might try to protect itself are varied, and it isn't necessarily all or nothing. One can imagine a file system which can handle write failures for data blocks, but not for metadata blocks; given that data blocks outnumber metadata blocks by hundreds to one, and that write failures are relatively rare (unless you have said trashy laptop with a trashy flash card), a file system that can gracefully deal with data block failures would be a useful advancement. But these things are never absolute, mainly because people aren't willing to pay either the cost of superior hardware (consider the cost of ECC memory, which isn't *that* much more expensive; and yet most PC class systems don't use it) or the cost in software overhead (historically many file system designers have eschewed the use of physical block journalling because it really hurts on meta-data intensive benchmarks); so talking about absolute requirements for ATOMIC-WRITE isn't all that useful --- because nearly all hardware doesn't provide these guarantees, and nearly all filesystems require them. So to call out ext2 and ext3 for requiring them, without making clear that pretty much *all* file systems require them, ends up causing people to switch over to some other file system that, ironically enough, might end up being *more* vulnerable, but which didn't earn Pavel's displeasure because he didn't try using, say, XFS on his flashcard on his trashy laptop. - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
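The logical-vs-physical journalling distinction Ted describes can be sketched in a few lines. This is a toy model with invented dict-based "sectors" and field names, not ext3's actual on-disk format; it only shows why replaying a full block image can repair a trashed sector while replaying logged fields cannot:

```python
# Sketch of logical vs physical block journalling replay after a
# powerfail writes garbage over an inode sector. Made-up structures.

def replay_logical(sector, journal_entry):
    """Logical journal: only the changed fields were logged."""
    sector = dict(sector)          # the on-disk sector may be garbage!
    sector.update(journal_entry)   # patches i_blocks/i_size only
    return sector

def replay_physical(sector, journal_block):
    """Physical journal: the full new block image was logged."""
    return dict(journal_block)     # garbage is simply overwritten

good_inode = {"i_blocks": 25, "i_size": 13000, "i_mode": 0o100644}
garbage = {"i_blocks": 0xDEAD, "i_size": -1, "i_mode": 0xBEEF}

# Logical replay patches the logged fields but leaves the trashed
# i_mode in place: the filesystem still contains garbage.
after_logical = replay_logical(garbage, {"i_blocks": 25, "i_size": 13000})

# Physical replay restores the entire committed block image.
after_physical = replay_physical(garbage, good_inode)

print(after_logical["i_mode"] == good_inode["i_mode"])   # False
print(after_physical == good_inode)                      # True
```

The design trade-off Ted mentions is visible even here: the physical journal must log the whole block image (more journal traffic), which is the metadata-benchmark cost that made some designers prefer logical journalling.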
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 13:01 ` Theodore Tso @ 2009-08-24 14:55 ` Artem Bityutskiy 2009-08-24 22:30 ` Rob Landley 2009-08-24 19:52 ` Pavel Machek 2009-08-25 14:43 ` Florian Weimer 2 siblings, 1 reply; 309+ messages in thread From: Artem Bityutskiy @ 2009-08-24 14:55 UTC (permalink / raw) To: Theodore Tso Cc: Florian Weimer, Pavel Machek, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Hi Theodore, thanks for the insightful writing. On 08/24/2009 04:01 PM, Theodore Tso wrote: ...snip ... > It's for this reason that I've never been completely sure how useful > Pavel's proposed treatise about file systems expectations really are > --- because all storage subsystems *usually* provide these guarantees, > but it is the very rare storage system that *always* provides these > guarantees. There is a thing called eMMC (embedded MMC) in the embedded world. You may consider it as a non-removable MMC. This thing is a block device from the Linux POV, and you may mount ext3 on top of it. And people do this. The device seems to have a decent FTL, and does not look bad. However, there are subtle things which mortals never think about. In the case of eMMC, power cuts may make some sectors unreadable: eMMC returns ECC errors on reads. Namely, the sectors which were being written at the very moment when the power cut happened may become unreadable. And this makes ext3 refuse to mount the file-system, and makes fsck.ext3 refuse the file-system. Although this should be fixable in SW, we did not find time to do this so far. Anyway, my point is that documenting subtle things like this is a very good thing to do, just because nowadays we are trying to use existing software with flash-based storage devices, which may violate these subtle assumptions, or introduce other ones.
Probably, Pavel did too good a job in generalizing things, and it could be better to make a doc about HDD vs SSD or HDD vs Flash-based-storage. Not sure. But the idea to document subtle FS assumptions is good, IMO. -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 14:55 ` Artem Bityutskiy @ 2009-08-24 22:30 ` Rob Landley 0 siblings, 0 replies; 309+ messages in thread From: Rob Landley @ 2009-08-24 22:30 UTC (permalink / raw) To: Artem Bityutskiy Cc: Theodore Tso, Florian Weimer, Pavel Machek, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Monday 24 August 2009 09:55:53 Artem Bityutskiy wrote: > Probably, Pavel did too good job in generalizing things, and it could be > better to make a doc about HDD vs SSD or HDD vs Flash-based-storage. > Not sure. But the idea to document subtle FS assumption is good, IMO. The standard procedure for this seems to be to cc: Jonathan Corbet on the discussion, make puppy eyes at him, and subscribe to Linux Weekly News. Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 13:01 ` Theodore Tso @ 2009-08-24 19:52 ` Pavel Machek 2009-08-24 19:52 ` Pavel Machek 2009-08-25 14:43 ` Florian Weimer 2 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 19:52 UTC (permalink / raw) To: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Hi! > > Isn't this by design? In other words, if the metadata doesn't survive > > non-atomic writes, wouldn't it be an ext3 bug? > > Part of the problem here is that "atomic-writes" is confusing; it > doesn't mean what many people think it means. The assumption which > many naive filesystem designers make is that writes succeed or they > don't. If they don't succeed, they don't change the previously > existing data in any way. > > So in the case of journalling, the assumption which gets made is that > when the power fails, the disk either writes a particular disk block, > or it doesn't. The problem here is as with humans and animals, death > is not an event, it is a process. When the power fails, the system > just doesn't stop functioning; the power on the +5 and +12 volt rails > start dropping to zero, and different components fail at different > times. Specifically, DRAM, being the most voltage sensitve, tends to > fail before the DMA subsystem, the PCI bus, and the hard drive fails. > So as a result, garbage can get written out to disk as part of the > failure. That's just the way hardware works. Yep, and at that point you lost data. You had "silent data corruption" from the fs point of view, and that's bad. It will probably be very bad on XFS, probably okay on Ext3, and certainly okay on Ext2: you do a filesystem check, and you should be able to repair any damage. So yes, physical journaling is good, but fsck is better. > Is that a file system "bug"?
Well, it's better to call that a > mismatch between the assumptions made of physical devices, and of the > file system code. On Irix, SGI hardware had a powerfail interrupt, If those filesystem assumptions were not documented, I'd call it a filesystem bug. So better document them ;-). > There is another kind of non-atomic write that nearly all file systems > are subject to, however, and to give an example of this, consider what > happens if you a laptop is subjected to a sudden shock while it is > writing a sector, and the hard drive doesn't an accelerometer which ... > Depending on how severe the shock happens to be, the head could end up > impacting the platter, destroying the medium (which used to be > iron-oxide; hence the term "spinning rust platters") at that spot. > This will obviously cause a write failure, and the previous contents > of the sector will be lost. This is also considered a failure of the > ATOMIC-WRITE property, and no, ext3 doesn't handle this case > gracefully. Very few file systems do. (It is possible for an OS > that Actually, ext2 should be able to survive that, no? Error writing -> remount ro -> fsck on next boot -> drive relocates the sectors. > It's for this reason that I've never been completely sure how useful > Pavel's proposed treatise about file systems expectations really are > --- because all storage subsystems *usually* provide these guarantees, > but it is the very rare storage system that *always* provides these > guarantees. Well... there's a very big difference between harddrives and flash memory. Harddrives usually work, and flash memory never does. > We could just as easily have several kilobytes of explanation in > Documentation/* explaining how we assume that DRAM always returns the > same value that was stored in it previously --- and yet most PC class > hardware still does not use ECC memory, and cosmic rays are a reality.
> That means that most Linux systems run on systems that are vulnerable > to this kind of failure --- and the world hasn't ended. There's a difference. In case of cosmic rays, hardware is clearly buggy. I have one machine with bad DRAM (about 1 errors in 2 days), and I still use it. I will not complain if ext3 trashes that. In case of degraded raid-5, even with perfect hardware, and with ext3 on top of that, you'll get silent data corruption. Nice, eh? Clearly, Linux is buggy there. It could be argued it is raid-5's fault, or maybe it is ext3's fault, but... linux is still buggy. > As I recall, the main problem which Pavel had was when he was using > ext3 on a *really* trashy flash drive, on a *really* trashing laptop > where the flash card stuck out slightly, and any jostling of the > netbook would cause the flash card to become disconnected from the > laptop, and cause write errors, very easily and very frequently. In > those circumstnaces, it's highly unlikely that ***any*** file system > would have been able to survive such an unreliable storage system. Well well well. Before I pulled that flash card, I assumed that doing so is safe, because flashcard is presented as block device and ext3 should cope with sudden disk disconnects. And I was wrong wrong wrong. (Noone told me at the university. I guess I should want my money back). Plus note that it is not only my trashy laptop and one trashy MMC card; every USB thumb drive I seen is affected. (OTOH USB disks should be safe AFAICT). Ext3 is unsuitable for flash cards and RAID arrays, plain and simple. It is not documented anywhere :-(. [ext2 should work better -- at least you'll not get silent data corruption.] > One of the problems I have with the break down which Pavel has used is > that it doesn't break things down according to probability; the chance > of a storage subsystem scribbling garbage on its last write during a Can you suggest better patch? I'm not saying we should redesign ext3, but... 
someone should have told me that ext3+USB thumb drive=problems. > But these things are never absolute, mainly because people aren't > willing to pay for either the cost of superior hardware (consider the > cost of ECC memory, which isn't *that* much more expensive; and yet > most PC class systems don't use it) or in terms of software overhead > (historically many file system designers have eschewed the use of > physical block journalling because it really hurts on meta-data > intensive benchmarks), talking about absolute requirements for > ATOMIC-WRITE isn't all that useful --- because nearly all hardware > doesn't provide these guarantees, and nearly all filesystems require > them. So to call out ext2 and ext3 for requiring them, without > making ext3+raid5 will fail even if you have perfect hardware. > clear that pretty much *all* file systems require them, ends up > causing people to switch over to some other file system that > ironically enough, might end up being *more* vulernable, but which > didn't earn Pavel's displeasure because he didn't try using, say, XFS > on his flashcard on his trashy laptop. I hold ext2/ext3 to higher standards than other filesystems in the tree. I'd not use XFS/VFAT etc. I would not want people to migrate towards XFS/VFAT, and yes, I believe XFS's/VFAT's/... requirements should be documented, too. (But I know too little about those filesystems.) If you can suggest better wording, please help me. But... those requirements are non-trivial, commonly not met, and the result is data loss. It has to be documented somehow. Make it as innocent-looking as you can... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
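The "ext3+raid5 will fail even if you have perfect hardware" claim can be shown concretely with a toy XOR-parity model (single-byte "blocks", one parity disk; not md's real layout). If power fails between the data write and the matching parity write on a degraded array, a block on the *missing* disk reads back as garbage even though it was never written:

```python
# Toy model of why a non-atomic data+parity update on degraded RAID-5
# corrupts data the filesystem never wrote. Simplified to 3 data
# disks plus one XOR parity disk, with single-byte "blocks".
from functools import reduce

def parity(blocks):
    return reduce(lambda a, b: a ^ b, blocks)

data = [0x11, 0x22, 0x33]            # blocks on data disks 0..2
par = parity(data)                   # block on the parity disk

# Disk 1 dies: its block must now be reconstructed from the others.
def reconstruct_disk1(d0, d2, par):
    return d0 ^ d2 ^ par

assert reconstruct_disk1(data[0], data[2], par) == 0x22  # still readable

# The filesystem overwrites disk 0's block. Power fails after the
# data write but before the matching parity write hits the disk.
data[0] = 0x99                       # new data written
# par is NOT updated: powerfail

# Disk 1's block now reads back as garbage, though it was never
# written; this is the silent corruption Pavel is describing.
print(hex(reconstruct_disk1(data[0], data[2], par)))  # not 0x22
```

With all disks present the stale parity is merely inconsistent and can be rewritten by a resync; in degraded mode it is the only source for the missing disk's data, which is why the journal offers no protection here.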
* Re: [patch] ext2/3: document conditions when reliable operation is possible @ 2009-08-24 19:52 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 19:52 UTC (permalink / raw) To: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, An Hi! > > Isn't this by design? In other words, if the metadata doesn't survive > > non-atomic writes, wouldn't it be an ext3 bug? > > Part of the problem here is that "atomic-writes" is confusing; it > doesn't mean what many people think it means. The assumption which > many naive filesystem designers make is that writes succeed or they > don't. If they don't succeed, they don't change the previously > existing data in any way. > > So in the case of journalling, the assumption which gets made is that > when the power fails, the disk either writes a particular disk block, > or it doesn't. The problem here is as with humans and animals, death > is not an event, it is a process. When the power fails, the system > just doesn't stop functioning; the power on the +5 and +12 volt rails > start dropping to zero, and different components fail at different > times. Specifically, DRAM, being the most voltage sensitve, tends to > fail before the DMA subsystem, the PCI bus, and the hard drive fails. > So as a result, garbage can get written out to disk as part of the > failure. That's just the way hardware works. Yep, and at that point you lost data. You had "silent data corruption" from fs point of view, and that's bad. It will be probably very bad on XFS, probably okay on Ext3, and certainly okay on Ext2: you do filesystem check, and you should be able to repair any damage. So yes, physical journaling is good, but fsck is better. > Is that a file system "bug"? Well, it's better to call that a > mismatch between the assumptions made of physical devices, and of the > file system code. 
On Irix, SGI hardware had a powerfail interrupt, If those filesystem assumptions were not documented, I'd call it filesystem bug. So better document them ;-). > There is another kind of non-atomic write that nearly all file systems > are subject to, however, and to give an example of this, consider what > happens if you a laptop is subjected to a sudden shock while it is > writing a sector, and the hard drive doesn't an accelerometer which ... > Depending on how severe the shock happens to be, the head could end up > impacting the platter, destroying the medium (which used to be > iron-oxide; hence the term "spinning rust platters") at that spot. > This will obviously cause a write failure, and the previous contents > of the sector will be lost. This is also considered a failure of the > ATOMIC-WRITE property, and no, ext3 doesn't handle this case > gracefully. Very few file systems do. (It is possible for an OS > that Actually, ext2 should be able to survive that, no? Error writing -> remount ro -> fsck on next boot -> drive relocates the sectors. > It's for this reason that I've never been completely sure how useful > Pavel's proposed treatise about file systems expectations really are > --- because all storage subsystems *usually* provide these guarantees, > but it is the very rare storage system that *always* provides these > guarantees. Well... there's very big difference between harddrives and flash memory. Harddrives usually work, and flash memory never does. > We could just as easily have several kilobytes of explanation in > Documentation/* explaining how we assume that DRAM always returns the > same value that was stored in it previously --- and yet most PC class > hardware still does not use ECC memory, and cosmic rays are a reality. > That means that most Linux systems run on systems that are vulnerable > to this kind of failure --- and the world hasn't ended. There's a difference. In case of cosmic rays, hardware is clearly buggy. 
I have one machine with bad DRAM (about 1 errors in 2 days), and I still use it. I will not complain if ext3 trashes that. In case of degraded raid-5, even with perfect hardware, and with ext3 on top of that, you'll get silent data corruption. Nice, eh? Clearly, Linux is buggy there. It could be argued it is raid-5's fault, or maybe it is ext3's fault, but... linux is still buggy. > As I recall, the main problem which Pavel had was when he was using > ext3 on a *really* trashy flash drive, on a *really* trashing laptop > where the flash card stuck out slightly, and any jostling of the > netbook would cause the flash card to become disconnected from the > laptop, and cause write errors, very easily and very frequently. In > those circumstnaces, it's highly unlikely that ***any*** file system > would have been able to survive such an unreliable storage system. Well well well. Before I pulled that flash card, I assumed that doing so is safe, because flashcard is presented as block device and ext3 should cope with sudden disk disconnects. And I was wrong wrong wrong. (Noone told me at the university. I guess I should want my money back). Plus note that it is not only my trashy laptop and one trashy MMC card; every USB thumb drive I seen is affected. (OTOH USB disks should be safe AFAICT). Ext3 is unsuitable for flash cards and RAID arrays, plain and simple. It is not documented anywhere :-(. [ext2 should work better -- at least you'll not get silent data corruption.] > One of the problems I have with the break down which Pavel has used is > that it doesn't break things down according to probability; the chance > of a storage subsystem scribbling garbage on its last write during a Can you suggest better patch? I'm not saying we should redesign ext3, but... someone should have told me that ext3+USB thumb drive=problems. 
> But these things are never absolute, mainly because people aren't
> willing to pay for either the cost of superior hardware (consider the
> cost of ECC memory, which isn't *that* much more expensive; and yet
> most PC class systems don't use it) or in terms of software overhead
> (historically many file system designers have eschewed the use of
> physical block journalling because it really hurts on meta-data
> intensive benchmarks), talking about absolute requirements for
> ATOMIC-WRITE isn't all that useful --- because nearly all hardware
> doesn't provide these guarantees, and nearly all filesystems require
> them. So to call out ext2 and ext3 for requiring them, without
> making

ext3+raid5 will fail even if you have perfect hardware.

> clear that pretty much *all* file systems require them, ends up
> causing people to switch over to some other file system that,
> ironically enough, might end up being *more* vulnerable, but which
> didn't earn Pavel's displeasure because he didn't try using, say, XFS
> on his flash card on his trashy laptop.

I hold ext2/ext3 to higher standards than other filesystems in the
tree. I'd not use XFS/VFAT etc.

I would not want people to migrate towards XFS/VFAT, and yes, I believe
XFS's/VFAT's/... requirements should be documented, too. (But I know too
little about those filesystems.)

If you can suggest better wording, please help me. But... those
requirements are non-trivial, commonly not met, and the result is data
loss. It has to be documented somehow. Make it as innocent-looking as
you can...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 19:52 ` Pavel Machek (?) @ 2009-08-24 20:24 ` Ric Wheeler 2009-08-24 20:52 ` Pavel Machek 2009-08-25 18:52 ` Rob Landley -1 siblings, 2 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-24 20:24 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Pavel Machek wrote: > Hi! > > >>> Isn't this by design? In other words, if the metadata doesn't survive >>> non-atomic writes, wouldn't it be an ext3 bug? >>> >> Part of the problem here is that "atomic-writes" is confusing; it >> doesn't mean what many people think it means. The assumption which >> many naive filesystem designers make is that writes succeed or they >> don't. If they don't succeed, they don't change the previously >> existing data in any way. >> >> So in the case of journalling, the assumption which gets made is that >> when the power fails, the disk either writes a particular disk block, >> or it doesn't. The problem here is as with humans and animals, death >> is not an event, it is a process. When the power fails, the system >> just doesn't stop functioning; the power on the +5 and +12 volt rails >> start dropping to zero, and different components fail at different >> times. Specifically, DRAM, being the most voltage sensitve, tends to >> fail before the DMA subsystem, the PCI bus, and the hard drive fails. >> So as a result, garbage can get written out to disk as part of the >> failure. That's just the way hardware works. >> > > Yep, and at that point you lost data. You had "silent data corruption" > from fs point of view, and that's bad. > > It will be probably very bad on XFS, probably okay on Ext3, and > certainly okay on Ext2: you do filesystem check, and you should be > able to repair any damage. So yes, physical journaling is good, but > fsck is better. 
> I don't see why you think that. In general, fsck (for any fs) only checks metadata. If you have silent data corruption that corrupts things that are fixable by fsck, you most likely have silent corruption hitting things users care about like their data blocks inside of files. Fsck will not fix (or notice) any of that, that is where things like full data checksums can help. Also note (from first hand experience), unless you check and validate your data, you can have data corruptions that will not get flagged as IO errors so data signing or scrubbing is a critical part of data integrity. > >> Is that a file system "bug"? Well, it's better to call that a >> mismatch between the assumptions made of physical devices, and of the >> file system code. On Irix, SGI hardware had a powerfail interrupt, >> > > If those filesystem assumptions were not documented, I'd call it > filesystem bug. So better document them ;-). > > I think that we need to help people understand the full spectrum of data concerns, starting with reasonable best practices that will help most people suffer *less* (not no) data loss. And make very sure that they are not falsely assured that by following any specific script that they can skip backups, remote backups, etc :-) Nothing in our code in any part of the kernel deals well with every disaster or odd event. >> There is another kind of non-atomic write that nearly all file systems >> are subject to, however, and to give an example of this, consider what >> happens if you a laptop is subjected to a sudden shock while it is >> writing a sector, and the hard drive doesn't an accelerometer which >> > ... > >> Depending on how severe the shock happens to be, the head could end up >> impacting the platter, destroying the medium (which used to be >> iron-oxide; hence the term "spinning rust platters") at that spot. >> This will obviously cause a write failure, and the previous contents >> of the sector will be lost. 
This is also considered a failure of the >> ATOMIC-WRITE property, and no, ext3 doesn't handle this case >> gracefully. Very few file systems do. (It is possible for an OS >> that >> > > Actually, ext2 should be able to survive that, no? Error writing -> > remount ro -> fsck on next boot -> drive relocates the sectors. > I think that the example and the response are both off base. If your head ever touches the platter, you won't be reading from a huge part of your drive ever again (usually, you have 2 heads per platter, 3-4 platters, impact would kill one head and a corresponding percentage of your data). No file system will recover that data although you might be able to scrape out some remaining useful bits and bytes. More common causes of silent corruption would be bad DRAM in things like the drive write cache, hot spots (that cause adjacent track data errors), etc. Note in this last case, your most recently written data is fine, just the data you wrote months/years ago is toast! > >> It's for this reason that I've never been completely sure how useful >> Pavel's proposed treatise about file systems expectations really are >> --- because all storage subsystems *usually* provide these guarantees, >> but it is the very rare storage system that *always* provides these >> guarantees. >> > > Well... there's very big difference between harddrives and flash > memory. Harddrives usually work, and flash memory never does. > It is hard for anyone to see the real data without looking in detail at large numbers of parts. Back at EMC, we looked at failures for lots of parts so we got a clear grasp on trends. I do agree that flash/SSD parts are still very young so we will have interesting and unexpected failure modes to learn to deal with.... 
> >> We could just as easily have several kilobytes of explanation in >> Documentation/* explaining how we assume that DRAM always returns the >> same value that was stored in it previously --- and yet most PC class >> hardware still does not use ECC memory, and cosmic rays are a reality. >> That means that most Linux systems run on systems that are vulnerable >> to this kind of failure --- and the world hasn't ended. >> > > There's a difference. In case of cosmic rays, hardware is clearly > buggy. I have one machine with bad DRAM (about 1 errors in 2 days), > and I still use it. I will not complain if ext3 trashes that. > > In case of degraded raid-5, even with perfect hardware, and with > ext3 on top of that, you'll get silent data corruption. Nice, eh? > > Clearly, Linux is buggy there. It could be argued it is raid-5's > fault, or maybe it is ext3's fault, but... linux is still buggy. > Nothing is perfect. It is still a trade off between storage utilization (how much storage we give users for say 5 2TB drives), performance and costs (throw away any disks over 2 years old?). > >> As I recall, the main problem which Pavel had was when he was using >> ext3 on a *really* trashy flash drive, on a *really* trashing laptop >> where the flash card stuck out slightly, and any jostling of the >> netbook would cause the flash card to become disconnected from the >> laptop, and cause write errors, very easily and very frequently. In >> those circumstnaces, it's highly unlikely that ***any*** file system >> would have been able to survive such an unreliable storage system. >> > > Well well well. Before I pulled that flash card, I assumed that doing > so is safe, because flashcard is presented as block device and ext3 > should cope with sudden disk disconnects. > > And I was wrong wrong wrong. (Noone told me at the university. I guess > I should want my money back). 
> > Plus note that it is not only my trashy laptop and one trashy MMC > card; every USB thumb drive I seen is affected. (OTOH USB disks should > be safe AFAICT). > > Ext3 is unsuitable for flash cards and RAID arrays, plain and > simple. It is not documented anywhere :-(. [ext2 should work better -- > at least you'll not get silent data corruption.] > ext3 is used on lots of raid arrays without any issue. > >> One of the problems I have with the break down which Pavel has used is >> that it doesn't break things down according to probability; the chance >> of a storage subsystem scribbling garbage on its last write during a >> > > Can you suggest better patch? I'm not saying we should redesign ext3, > but... someone should have told me that ext3+USB thumb drive=problems. > > >> But these things are never absolute, mainly because people aren't >> willing to pay for either the cost of superior hardware (consider the >> cost of ECC memory, which isn't *that* much more expensive; and yet >> most PC class systems don't use it) or in terms of software overhead >> (historically many file system designers have eschewed the use of >> physical block journalling because it really hurts on meta-data >> intensive benchmarks), talking about absolute requirements for >> ATOMIC-WRITE isn't all that useful --- because nearly all hardware >> doesn't provide these guarantees, and nearly all filesystems require >> them. So to call out ext2 and ext3 for requiring them, without >> making >> > > ext3+raid5 will fail even if you have perfect hardware. > > >> clear that pretty much *all* file systems require them, ends up >> causing people to switch over to some other file system that >> ironically enough, might end up being *more* vulernable, but which >> didn't earn Pavel's displeasure because he didn't try using, say, XFS >> on his flashcard on his trashy laptop. >> > > I hold ext2/ext3 to higher standards than other filesystem in > tree. I'd not use XFS/VFAT etc. 
>
> I would not want people to migrate towards XFS/VFAT, and yes I believe
> XFSs/VFATs/... requirements should be documented, too. (But I know too
> little about those filesystems.)
>
> If you can suggest better wording, please help me. But... those
> requirements are non-trivial, commonly not met and the result is data
> loss. It has to be documented somehow. Make it as innocent-looking as
> you can...
>
> Pavel

I think that you really need to step back and look harder at real
failures - not just your personal experience - but a larger set of
real-world failures. Many papers have been published recently about that
(the Google paper, the Bianca paper from FAST, NetApp, etc.).

Regards,

Ric
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 20:24 ` Ric Wheeler @ 2009-08-24 20:52 ` Pavel Machek 2009-08-24 21:08 ` Ric Wheeler 2009-08-24 21:11 ` Greg Freemyer 2009-08-25 18:52 ` Rob Landley 1 sibling, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 20:52 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Hi! >> Yep, and at that point you lost data. You had "silent data corruption" >> from fs point of view, and that's bad. >> >> It will be probably very bad on XFS, probably okay on Ext3, and >> certainly okay on Ext2: you do filesystem check, and you should be >> able to repair any damage. So yes, physical journaling is good, but >> fsck is better. > > I don't see why you think that. In general, fsck (for any fs) only > checks metadata. If you have silent data corruption that corrupts things > that are fixable by fsck, you most likely have silent corruption hitting > things users care about like their data blocks inside of files. Fsck > will not fix (or notice) any of that, that is where things like full > data checksums can help. Ok, but in case of data corruption, at least your filesystem does not degrade further. >> If those filesystem assumptions were not documented, I'd call it >> filesystem bug. So better document them ;-). >> > I think that we need to help people understand the full spectrum of data > concerns, starting with reasonable best practices that will help most > people suffer *less* (not no) data loss. And make very sure that they > are not falsely assured that by following any specific script that they > can skip backups, remote backups, etc :-) > > Nothing in our code in any part of the kernel deals well with every > disaster or odd event. I can reproduce data loss with ext3 on flashcard in about 40 seconds. I'd not call that "odd event". 
It would be nice to handle that, but that is hard. So ... can we at least get that documented please? >> Actually, ext2 should be able to survive that, no? Error writing -> >> remount ro -> fsck on next boot -> drive relocates the sectors. >> > > I think that the example and the response are both off base. If your > head ever touches the platter, you won't be reading from a huge part of > your drive ever again (usually, you have 2 heads per platter, 3-4 > platters, impact would kill one head and a corresponding percentage of > your data). Ok, that's obviously game over. >>> It's for this reason that I've never been completely sure how useful >>> Pavel's proposed treatise about file systems expectations really are >>> --- because all storage subsystems *usually* provide these guarantees, >>> but it is the very rare storage system that *always* provides these >>> guarantees. >> >> Well... there's very big difference between harddrives and flash >> memory. Harddrives usually work, and flash memory never does. > > It is hard for anyone to see the real data without looking in detail at > large numbers of parts. Back at EMC, we looked at failures for lots of > parts so we got a clear grasp on trends. I do agree that flash/SSD > parts are still very young so we will have interesting and unexpected > failure modes to learn to deal with.... _Maybe_ SSDs, being HDD replacements are better. I don't know. _All_ flash cards (MMC, USB, SD) had the problems. You don't need to get clear grasp on trends. Those cards just don't meet ext3 expectations, and if you pull them, you get data loss. >>> We could just as easily have several kilobytes of explanation in >>> Documentation/* explaining how we assume that DRAM always returns the >>> same value that was stored in it previously --- and yet most PC class >>> hardware still does not use ECC memory, and cosmic rays are a reality. 
>>> That means that most Linux systems run on systems that are vulnerable >>> to this kind of failure --- and the world hasn't ended. >> There's a difference. In case of cosmic rays, hardware is clearly >> buggy. I have one machine with bad DRAM (about 1 errors in 2 days), >> and I still use it. I will not complain if ext3 trashes that. >> >> In case of degraded raid-5, even with perfect hardware, and with >> ext3 on top of that, you'll get silent data corruption. Nice, eh? >> >> Clearly, Linux is buggy there. It could be argued it is raid-5's >> fault, or maybe it is ext3's fault, but... linux is still buggy. > > Nothing is perfect. It is still a trade off between storage utilization > (how much storage we give users for say 5 2TB drives), performance and > costs (throw away any disks over 2 years old?). "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I believe that should be at least documented. (And understand why ZFS is interesting thing). >> Ext3 is unsuitable for flash cards and RAID arrays, plain and >> simple. It is not documented anywhere :-(. [ext2 should work better -- >> at least you'll not get silent data corruption.] > > ext3 is used on lots of raid arrays without any issue. And I still use my zaurus with crappy DRAM. I would not trust raid5 array with my data, for multiple reasons. The fact that degraded raid5 breaks ext3 assumptions should really be documented. >> I hold ext2/ext3 to higher standards than other filesystem in >> tree. I'd not use XFS/VFAT etc. >> >> I would not want people to migrate towards XFS/VFAT, and yes I believe >> XFSs/VFATs/... requirements should be documented, too. (But I know too >> little about those filesystems). >> >> If you can suggest better wording, please help me. But... those >> requirements are non-trivial, commonly not met and the result is data >> loss. It has to be documented somehow. Make it as innocent-looking as >> you can... 
>
> I think that you really need to step back and look harder at real
> failures - not just your personal experience - but a larger set of real
> world failures. Many papers have been published recently about that (the
> google paper, the Bianca paper from FAST, Netapp, etc).

The papers show failures in the "once a year" range. I have a "twice a
minute" failure scenario with flashdisks.

Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
but I bet it would be on a "once a day" scale.

We should document those.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 20:52 ` Pavel Machek @ 2009-08-24 21:08 ` Ric Wheeler 2009-08-24 21:25 ` Pavel Machek 2009-08-24 21:11 ` Greg Freemyer 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-24 21:08 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Pavel Machek wrote: > Hi! > > >>> Yep, and at that point you lost data. You had "silent data corruption" >>> from fs point of view, and that's bad. >>> >>> It will be probably very bad on XFS, probably okay on Ext3, and >>> certainly okay on Ext2: you do filesystem check, and you should be >>> able to repair any damage. So yes, physical journaling is good, but >>> fsck is better. >>> >> I don't see why you think that. In general, fsck (for any fs) only >> checks metadata. If you have silent data corruption that corrupts things >> that are fixable by fsck, you most likely have silent corruption hitting >> things users care about like their data blocks inside of files. Fsck >> will not fix (or notice) any of that, that is where things like full >> data checksums can help. >> > > Ok, but in case of data corruption, at least your filesystem does not > degrade further. > > Even worse, your data is potentially gone and you have not noticed it... This is why array vendors and archival storage products do periodic scans of all stored data (read all the bytes, compared to a digital signature, etc). >>> If those filesystem assumptions were not documented, I'd call it >>> filesystem bug. So better document them ;-). >>> >>> >> I think that we need to help people understand the full spectrum of data >> concerns, starting with reasonable best practices that will help most >> people suffer *less* (not no) data loss. 
And make very sure that they >> are not falsely assured that by following any specific script that they >> can skip backups, remote backups, etc :-) >> >> Nothing in our code in any part of the kernel deals well with every >> disaster or odd event. >> > > I can reproduce data loss with ext3 on flashcard in about 40 > seconds. I'd not call that "odd event". It would be nice to handle > that, but that is hard. So ... can we at least get that documented > please? > Part of documenting best practices is to put down very specific things that do/don't work. What I worry about is producing too much detail to be of use for real end users. I have to admit that I have not paid enough attention to this specifics of your ext3 + flash card issue - is it the ftl stuff doing out of order IO's? > > >>> Actually, ext2 should be able to survive that, no? Error writing -> >>> remount ro -> fsck on next boot -> drive relocates the sectors. >>> >>> >> I think that the example and the response are both off base. If your >> head ever touches the platter, you won't be reading from a huge part of >> your drive ever again (usually, you have 2 heads per platter, 3-4 >> platters, impact would kill one head and a corresponding percentage of >> your data). >> > > Ok, that's obviously game over. > This is when you start seeing lots of READ and WRITE errors :-) > >>>> It's for this reason that I've never been completely sure how useful >>>> Pavel's proposed treatise about file systems expectations really are >>>> --- because all storage subsystems *usually* provide these guarantees, >>>> but it is the very rare storage system that *always* provides these >>>> guarantees. >>>> >>> Well... there's very big difference between harddrives and flash >>> memory. Harddrives usually work, and flash memory never does. >>> >> It is hard for anyone to see the real data without looking in detail at >> large numbers of parts. 
Back at EMC, we looked at failures for lots of >> parts so we got a clear grasp on trends. I do agree that flash/SSD >> parts are still very young so we will have interesting and unexpected >> failure modes to learn to deal with.... >> > > _Maybe_ SSDs, being HDD replacements are better. I don't know. > > _All_ flash cards (MMC, USB, SD) had the problems. You don't need to > get clear grasp on trends. Those cards just don't meet ext3 > expectations, and if you pull them, you get data loss. > > Pull them even after an unmount, or pull them hot? >>>> We could just as easily have several kilobytes of explanation in >>>> Documentation/* explaining how we assume that DRAM always returns the >>>> same value that was stored in it previously --- and yet most PC class >>>> hardware still does not use ECC memory, and cosmic rays are a reality. >>>> That means that most Linux systems run on systems that are vulnerable >>>> to this kind of failure --- and the world hasn't ended. >>>> > > >>> There's a difference. In case of cosmic rays, hardware is clearly >>> buggy. I have one machine with bad DRAM (about 1 errors in 2 days), >>> and I still use it. I will not complain if ext3 trashes that. >>> >>> In case of degraded raid-5, even with perfect hardware, and with >>> ext3 on top of that, you'll get silent data corruption. Nice, eh? >>> >>> Clearly, Linux is buggy there. It could be argued it is raid-5's >>> fault, or maybe it is ext3's fault, but... linux is still buggy. >>> >> Nothing is perfect. It is still a trade off between storage utilization >> (how much storage we give users for say 5 2TB drives), performance and >> costs (throw away any disks over 2 years old?). >> > > "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I > believe that should be at least documented. (And understand why ZFS is > interesting thing). > > Your statement is overly broad - ext3 on a commercial RAID array that does RAID5 or RAID6, etc has no issues that I know of. 
Do you know first hand that ZFS works on flash cards?

>>> Ext3 is unsuitable for flash cards and RAID arrays, plain and
>>> simple. It is not documented anywhere :-(. [ext2 should work better --
>>> at least you'll not get silent data corruption.]
>>>
>> ext3 is used on lots of raid arrays without any issue.
>>
> And I still use my zaurus with crappy DRAM.
>
> I would not trust a raid5 array with my data, for multiple
> reasons. The fact that degraded raid5 breaks ext3 assumptions should
> really be documented.

Again, you say RAID5 without enough specifics. Are you pointing just at
MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5
vendor?

>>> I hold ext2/ext3 to higher standards than other filesystems in the
>>> tree. I'd not use XFS/VFAT etc.
>>>
>>> I would not want people to migrate towards XFS/VFAT, and yes I believe
>>> XFSs/VFATs/... requirements should be documented, too. (But I know too
>>> little about those filesystems.)
>>>
>>> If you can suggest better wording, please help me. But... those
>>> requirements are non-trivial, commonly not met and the result is data
>>> loss. It has to be documented somehow. Make it as innocent-looking as
>>> you can...
>>>
>> I think that you really need to step back and look harder at real
>> failures - not just your personal experience - but a larger set of real
>> world failures. Many papers have been published recently about that (the
>> google paper, the Bianca paper from FAST, Netapp, etc).
>
> The papers show failures in "once a year" range. I have "twice a
> minute" failure scenario with flashdisks.
>
> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
> but I bet it would be on "once a day" scale.
>
> We should document those.
> Pavel

Documentation is fine with sufficient, hard data....

Ric
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 21:08 ` Ric Wheeler @ 2009-08-24 21:25 ` Pavel Machek 2009-08-24 22:05 ` Ric Wheeler 2009-08-24 22:39 ` Theodore Tso 0 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 21:25 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Hi! >> I can reproduce data loss with ext3 on flashcard in about 40 >> seconds. I'd not call that "odd event". It would be nice to handle >> that, but that is hard. So ... can we at least get that documented >> please? >> > > Part of documenting best practices is to put down very specific things > that do/don't work. What I worry about is producing too much detail to > be of use for real end users. Well, I was trying to write for kernel audience. Someone can turn that into nice end-user manual. > I have to admit that I have not paid enough attention to this specifics > of your ext3 + flash card issue - is it the ftl stuff doing out of order > IO's? The problem is that flash cards destroy whole erase block on unplug, and ext3 can't cope with that. >> _All_ flash cards (MMC, USB, SD) had the problems. You don't need to >> get clear grasp on trends. Those cards just don't meet ext3 >> expectations, and if you pull them, you get data loss. >> > Pull them even after an unmount, or pull them hot? Pull them hot. [Some people try -osync to avoid data loss on flash cards... that will not do the trick. Flashcard will still kill the eraseblock.] >>> Nothing is perfect. It is still a trade off between storage >>> utilization (how much storage we give users for say 5 2TB drives), >>> performance and costs (throw away any disks over 2 years old?). >>> >> >> "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I >> believe that should be at least documented. 
>> (And understand why ZFS is an interesting thing.)
>
> Your statement is overly broad - ext3 on a commercial RAID array that
> does RAID5 or RAID6, etc. has no issues that I know of.

If your commercial RAID array is battery backed, maybe. But I was
talking Linux MD here.

>> And I still use my zaurus with crappy DRAM.
>>
>> I would not trust a raid5 array with my data, for multiple
>> reasons. The fact that degraded raid5 breaks ext3 assumptions should
>> really be documented.
>
> Again, you say RAID5 without enough specifics. Are you pointing just at
> MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5
> vendor?

Degraded MD RAID5 on anything, including SATA, and including a
hypothetical "perfect disk".

>> The papers show failures in "once a year" range. I have "twice a
>> minute" failure scenario with flashdisks.
>>
>> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
>> but I bet it would be on "once a day" scale.
>>
>> We should document those.
>
> Documentation is fine with sufficient, hard data....

Degraded MD RAID5 does not work by design; the whole stripe will be
damaged on powerfail or reset or kernel bug, and ext3 cannot cope with
that kind of damage. [I don't see why statistics should be necessary for
that; the same way we don't need statistics to see that ext2 needs fsck
after powerfail.]

Pavel
--
(english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
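The erase-block failure mode Pavel keeps describing can be modeled in a few lines. The sketch below is illustrative only: the sizes and the deliberately naive flash translation layer (no copy-on-write, no wear leveling) are simplified assumptions, not a description of any real card's firmware. The point it demonstrates is that a single 512-byte sector write can destroy *neighboring* sectors the filesystem believed were long since stable:

```python
# Toy model of a cheap flash card: a sector write is serviced by
# erasing and reprogramming the whole erase block containing it, so
# power loss mid-operation wipes sectors that were never written to.
SECTOR = 512
ERASE_BLOCK = 64 * 1024                      # 128 sectors per erase block
SECTORS_PER_BLOCK = ERASE_BLOCK // SECTOR

class NaiveFlashCard:
    def __init__(self, num_blocks=4):
        self.media = bytearray(num_blocks * ERASE_BLOCK)

    def write_sector(self, lba, data, power_fails=False):
        block = lba // SECTORS_PER_BLOCK
        start = block * ERASE_BLOCK
        # Read-modify-write of the *entire* erase block:
        buf = bytearray(self.media[start:start + ERASE_BLOCK])
        off = (lba % SECTORS_PER_BLOCK) * SECTOR
        buf[off:off + SECTOR] = data
        self.media[start:start + ERASE_BLOCK] = b"\xff" * ERASE_BLOCK  # erase
        if power_fails:
            return  # reprogramming never happens: 128 sectors are gone
        self.media[start:start + ERASE_BLOCK] = buf

    def read_sector(self, lba):
        pos = lba * SECTOR
        return bytes(self.media[pos:pos + SECTOR])

card = NaiveFlashCard()
card.write_sector(0, b"a" * SECTOR)          # old, fsync'd, "stable" data
card.write_sector(1, b"b" * SECTOR, power_fails=True)
# The filesystem only ever wrote sector 1, yet sector 0 is erased too:
assert card.read_sector(0) != b"a" * SECTOR
```

This is also why mounting with -osync does not help, as noted above: the damage is not about write ordering or caching, but about collateral destruction inside the erase block at the moment power is lost.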
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 21:25 ` Pavel Machek @ 2009-08-24 22:05 ` Ric Wheeler 2009-08-24 22:22 ` Zan Lynx 2009-08-24 22:41 ` Pavel Machek 2009-08-24 22:39 ` Theodore Tso 1 sibling, 2 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-24 22:05 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Pavel Machek wrote: > Hi! > > >>> I can reproduce data loss with ext3 on flashcard in about 40 >>> seconds. I'd not call that "odd event". It would be nice to handle >>> that, but that is hard. So ... can we at least get that documented >>> please? >>> >>> >> Part of documenting best practices is to put down very specific things >> that do/don't work. What I worry about is producing too much detail to >> be of use for real end users. >> > > Well, I was trying to write for kernel audience. Someone can turn that > into nice end-user manual. > Kernel people who don't do storage or file systems will still need a summary - making very specific proposals based on real data and analysis is useful. > >> I have to admit that I have not paid enough attention to this specifics >> of your ext3 + flash card issue - is it the ftl stuff doing out of order >> IO's? >> > > The problem is that flash cards destroy whole erase block on unplug, > and ext3 can't cope with that. > > Even if you unmount the file system? Why isn't this an issue with ext2? Sounds like you want to suggest very specifically that journalled file systems are not appropriate for low end flash cards (which seems quite reasonable). >>> _All_ flash cards (MMC, USB, SD) had the problems. You don't need to >>> get clear grasp on trends. Those cards just don't meet ext3 >>> expectations, and if you pull them, you get data loss. >>> >>> >> Pull them even after an unmount, or pull them hot? >> > > Pull them hot. 
>
> [Some people try -osync to avoid data loss on flash cards... that will
> not do the trick. Flashcard will still kill the eraseblock.]

Hot-pulling any device will cause data loss for recent writes; even with
ext2 you will have data in the page cache, right?

>>>> Nothing is perfect. It is still a trade off between storage
>>>> utilization (how much storage we give users for say 5 2TB drives),
>>>> performance and costs (throw away any disks over 2 years old?).
>>>>
>>> "Nothing is perfect"?! That's a design decision/problem in raid5/ext3.
>>> I believe that should be at least documented. (And understand why ZFS
>>> is an interesting thing.)
>>>
>> Your statement is overly broad - ext3 on a commercial RAID array that
>> does RAID5 or RAID6, etc has no issues that I know of.
>
> If your commercial RAID array is battery backed, maybe. But I was
> talking Linux MD here.

Many people in the real world who use RAID5 (for better or worse) use
external raid cards or raid arrays, so you need to be very specific.

>>> And I still use my zaurus with crappy DRAM.
>>>
>>> I would not trust a raid5 array with my data, for multiple
>>> reasons. The fact that degraded raid5 breaks ext3 assumptions should
>>> really be documented.
>>
>> Again, you say RAID5 without enough specifics. Are you pointing just at
>> MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5
>> vendor?
>
> Degraded MD RAID5 on anything, including SATA, and including a
> hypothetical "perfect disk".

Degraded is one faulted drive while MD is doing a rebuild? And then you
hot unplug it or power cycle? I think that would certainly cause failure
for ext2 as well (again, you would lose any data in the page cache).

>>> The papers show failures in "once a year" range. I have "twice a
>>> minute" failure scenario with flashdisks.
>>>
>>> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
>>> >>> We should document those. >>> >> Documentation is fine with sufficient, hard data.... >> > > Degraded MD RAID5 does not work by design; whole stripe will be > damaged on powerfail or reset or kernel bug, and ext3 can not cope > with that kind of damage. [I don't see why statistics should be > neccessary for that; the same way we don't need statistics to see that > ext2 needs fsck after powerfail.] > Pavel > What you are describing is a double failure and RAID5 is not double failure tolerant regardless of the file system type.... I don't want to be overly negative since getting good documentation is certainly very useful. We just need to be document things correctly based on real data. Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 22:05 ` Ric Wheeler @ 2009-08-24 22:22 ` Zan Lynx 2009-08-24 22:44 ` Pavel Machek 2009-08-24 23:42 ` david 2009-08-24 22:41 ` Pavel Machek 1 sibling, 2 replies; 309+ messages in thread From: Zan Lynx @ 2009-08-24 22:22 UTC (permalink / raw) To: Ric Wheeler Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Ric Wheeler wrote: > Pavel Machek wrote: >> Degraded MD RAID5 does not work by design; whole stripe will be >> damaged on powerfail or reset or kernel bug, and ext3 can not cope >> with that kind of damage. [I don't see why statistics should be >> neccessary for that; the same way we don't need statistics to see that >> ext2 needs fsck after powerfail.] >> Pavel >> > What you are describing is a double failure and RAID5 is not double > failure tolerant regardless of the file system type.... Are you sure he isn't talking about how RAID must write all the data chunks to make a complete stripe and if there is a power-loss, some of the chunks may be written and some may not? As I read Pavel's point he is saying that the incomplete write can be detected by the incorrect parity chunk, but degraded RAID-5 has no working parity chunk so the incomplete write would go undetected. I know this is a RAID failure mode. However, I actually thought this was a problem even for an intact RAID-5. AFAIK, RAID-5 does not generally read the complete stripe and perform verification unless that is requested, because doing so would hurt performance and lose the entire point of the RAID-5 rotating parity blocks. -- Zan Lynx zlynx@acm.org "Knowledge is Power. Power Corrupts. Study Hard. Be Evil." ^ permalink raw reply [flat|nested] 309+ messages in thread
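Zan's parity argument can be sketched with a toy model. This is hypothetical Python (byte-wise XOR over equal-sized chunks), not MD's actual code: an intact array can notice a torn stripe update because data and parity no longer agree, while a degraded array must trust the parity to reconstruct the missing chunk, so a torn update silently corrupts a chunk that was never even written to.

```python
from functools import reduce

def xor_parity(chunks):
    """Byte-wise XOR of equal-sized chunks -- the RAID-5 parity rule."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

# Intact 3-data-disk stripe: a torn update (data written, parity not)
# is detectable, because recomputed parity no longer matches the stored one.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)
data[0] = b"XXXX"                       # powerfail before the parity write
assert xor_parity(data) != parity       # mismatch => detectable

# Degraded stripe: disk 0 is gone; its contents are only implied by parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)
lost = data[0]                          # disk 0 fails
data[1] = b"XXXX"                       # torn update to a *different* disk
reconstructed = xor_parity([data[1], data[2], parity])
assert reconstructed != lost            # never-written chunk comes back as garbage
```

The second half is the failure mode under discussion: the corrupted chunk belongs to the failed disk, which no write ever touched, so no filesystem journal can know to repair it.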
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 22:22 ` Zan Lynx @ 2009-08-24 22:44 ` Pavel Machek 2009-08-25 0:34 ` Ric Wheeler 2009-08-24 23:42 ` david 1 sibling, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-24 22:44 UTC (permalink / raw) To: Zan Lynx Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Mon 2009-08-24 16:22:22, Zan Lynx wrote: > Ric Wheeler wrote: >> Pavel Machek wrote: >>> Degraded MD RAID5 does not work by design; whole stripe will be >>> damaged on powerfail or reset or kernel bug, and ext3 can not cope >>> with that kind of damage. [I don't see why statistics should be >>> neccessary for that; the same way we don't need statistics to see that >>> ext2 needs fsck after powerfail.] >>> Pavel >>> >> What you are describing is a double failure and RAID5 is not double >> failure tolerant regardless of the file system type.... > > Are you sure he isn't talking about how RAID must write all the data > chunks to make a complete stripe and if there is a power-loss, some of > the chunks may be written and some may not? > > As I read Pavel's point he is saying that the incomplete write can be > detected by the incorrect parity chunk, but degraded RAID-5 has no > working parity chunk so the incomplete write would go undetected. Yep. > I know this is a RAID failure mode. However, I actually thought this was > a problem even for a intact RAID-5. AFAIK, RAID-5 does not generally > read the complete stripe and perform verification unless that is > requested, because doing so would hurt performance and lose the entire > point of the RAID-5 rotating parity blocks. Not sure; is not RAID expected to verify the array after unclean shutdown? 
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 22:44 ` Pavel Machek @ 2009-08-25 0:34 ` Ric Wheeler 0 siblings, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-25 0:34 UTC (permalink / raw) To: Pavel Machek Cc: Zan Lynx, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Pavel Machek wrote: > On Mon 2009-08-24 16:22:22, Zan Lynx wrote: > >> Ric Wheeler wrote: >> >>> Pavel Machek wrote: >>> >>>> Degraded MD RAID5 does not work by design; whole stripe will be >>>> damaged on powerfail or reset or kernel bug, and ext3 can not cope >>>> with that kind of damage. [I don't see why statistics should be >>>> neccessary for that; the same way we don't need statistics to see that >>>> ext2 needs fsck after powerfail.] >>>> Pavel >>>> >>>> >>> What you are describing is a double failure and RAID5 is not double >>> failure tolerant regardless of the file system type.... >>> >> Are you sure he isn't talking about how RAID must write all the data >> chunks to make a complete stripe and if there is a power-loss, some of >> the chunks may be written and some may not? >> >> As I read Pavel's point he is saying that the incomplete write can be >> detected by the incorrect parity chunk, but degraded RAID-5 has no >> working parity chunk so the incomplete write would go undetected. >> > > Yep. > > >> I know this is a RAID failure mode. However, I actually thought this was >> a problem even for a intact RAID-5. AFAIK, RAID-5 does not generally >> read the complete stripe and perform verification unless that is >> requested, because doing so would hurt performance and lose the entire >> point of the RAID-5 rotating parity blocks. >> > > Not sure; is not RAID expected to verify the array after unclean > shutdown? 
> > Pavel > Not usually - that would take multiple hours of verification, roughly equivalent to doing a RAID rebuild since you have to read each sector of every drive (although you would do this at full speed if the array was offline, not throttled like we do with rebuilds). That is part of the thing that scrubbing can do. Note that once you find a bad bit of data, it is really useful to be able to map that back into a humanly understandable object/repair action. For example, map the bad data range back to metadata which would translate into a fsck run or a list of impacted files or directories.... Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
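Ric's scrub description boils down to a full-array parity verification pass plus a way to map mismatches back to something repairable. A minimal sketch of the verification half (illustrative Python; the names are mine, not MD's interface):

```python
from functools import reduce

def xor_parity(chunks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def scrub(stripes):
    """Read every stripe and recompute parity; return the stripe numbers
    whose stored parity disagrees with the data. These are the repair
    candidates that would then need mapping back to files or metadata."""
    return [n for n, (chunks, parity) in enumerate(stripes)
            if xor_parity(chunks) != parity]

stripes = [
    ([b"\x01", b"\x02"], b"\x03"),   # consistent: 0x01 ^ 0x02 == 0x03
    ([b"\x01", b"\x02"], b"\x07"),   # inconsistent: torn or corrupted
]
assert scrub(stripes) == [1]
```

Reading every chunk of every stripe is why a full scrub takes roughly as long as a rebuild, as Ric notes.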
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 22:22 ` Zan Lynx 2009-08-24 22:44 ` Pavel Machek @ 2009-08-24 23:42 ` david 1 sibling, 0 replies; 309+ messages in thread From: david @ 2009-08-24 23:42 UTC (permalink / raw) To: Zan Lynx Cc: Ric Wheeler, Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Mon, 24 Aug 2009, Zan Lynx wrote: > Ric Wheeler wrote: >> Pavel Machek wrote: >>> Degraded MD RAID5 does not work by design; whole stripe will be >>> damaged on powerfail or reset or kernel bug, and ext3 can not cope >>> with that kind of damage. [I don't see why statistics should be >>> neccessary for that; the same way we don't need statistics to see that >>> ext2 needs fsck after powerfail.] >>> Pavel >>> >> What you are describing is a double failure and RAID5 is not double failure >> tolerant regardless of the file system type.... > > Are you sure he isn't talking about how RAID must write all the data chunks > to make a complete stripe and if there is a power-loss, some of the chunks > may be written and some may not? a write to raid 5 doesn't need to write to all drives, but it does need to write to two drives (the drive you are modifying and the parity drive) if you are not degraded and only succeed on one write you will detect the corruption later when you try to verify the data. if you are degraded and only succeed on one write, then the entire stripe gets corrupted. but this is a double failure (one drive + unclean shutdown) if you have battery-backed cache you will finish the writes when you reboot. if you don't have battery-backed cache (or are using software raid and crashed in the middle of sending the writes to the drive) you lose, but unless you disable write buffers and do sync writes (which nobody is going to do because of the performance problems) you will lose data in an unclean shutdown anyway.
David Lang > As I read Pavel's point he is saying that the incomplete write can be > detected by the incorrect parity chunk, but degraded RAID-5 has no working > parity chunk so the incomplete write would go undetected. > > I know this is a RAID failure mode. However, I actually thought this was a > problem even for a intact RAID-5. AFAIK, RAID-5 does not generally read the > complete stripe and perform verification unless that is requested, because > doing so would hurt performance and lose the entire point of the RAID-5 > rotating parity blocks. > > ^ permalink raw reply [flat|nested] 309+ messages in thread
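David's two-device write is RAID-5's small-write ("read-modify-write") path. A sketch of the parity arithmetic (hypothetical Python; assumes byte-wise XOR and a single-chunk update):

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def rmw_update(old_data, new_data, old_parity):
    """Small-write path: only the target disk and the parity disk are
    written; new parity = old parity XOR old data XOR new data."""
    return new_data, xor(xor(old_parity, old_data), new_data)

# Ground truth for a 3-data-disk stripe.
d0, d1, d2 = b"\x10\x20", b"\x0f\x0f", b"\x01\x02"
parity = xor(xor(d0, d1), d2)

# Update d1 via read-modify-write; the result stays consistent with a
# from-scratch recomputation, so only two devices had to be written.
d1, parity = rmw_update(d1, b"\xaa\xbb", parity)
assert parity == xor(xor(d0, d1), d2)
```

The window David describes is exactly the gap between those two device writes: if power fails after one lands but not the other, data and parity disagree until something (scrub, resync, or a write-intent log) reconciles them.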
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 22:05 ` Ric Wheeler 2009-08-24 22:22 ` Zan Lynx @ 2009-08-24 22:41 ` Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 22:41 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 >>> I have to admit that I have not paid enough attention to this >>> specifics of your ext3 + flash card issue - is it the ftl stuff >>> doing out of order IO's? >> >> The problem is that flash cards destroy whole erase block on unplug, >> and ext3 can't cope with that. > > Even if you unmount the file system? Why isn't this an issue with > ext2? No, I'm talking hot unplug here. It is the issue with ext2, but ext2 will run fsck on next mount, making it less severe. >>> Pull them even after an unmount, or pull them hot? >>> >> >> Pull them hot. >> >> [Some people try -osync to avoid data loss on flash cards... that will >> not do the trick. Flashcard will still kill the eraseblock.] > > Pulling hot any device will cause data loss for recent data loss, even > with ext2 you will have data in the page cache, right? Right. But in ext3 case you basically lose the whole filesystem, because fs is inconsistent and you did not run fsck. >>> Again, you say RAID5 without enough specifics. Are you pointing just >>> at MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial >>> RAID5 vendor? >>> >> >> Degraded MD RAID5 on anything, including SATA, and including >> hypothetical "perfect disk". > > Degraded is one faulted drive while MD is doing a rebuild? And then you > hot unplug it or power cycle? I think that would certainly cause failure > for ext2 as well (again, you would lose any data in the page cache). Losing data in page cache is expected. Losing fs consistency is not.
>> Degraded MD RAID5 does not work by design; whole stripe will be >> damaged on powerfail or reset or kernel bug, and ext3 can not cope >> with that kind of damage. [I don't see why statistics should be >> neccessary for that; the same way we don't need statistics to see that >> ext2 needs fsck after powerfail.] > What you are describing is a double failure and RAID5 is not double > failure tolerant regardless of the file system type.... You get single disk failure then powerfail (or reset or kernel panic). I would not call that double failure. I agree that it will mean problems for most filesystems. Anyway, even if that can be called a double failure, this limitation should be clearly documented somewhere. ...and that's exactly what I'm trying to fix. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 21:25 ` Pavel Machek 2009-08-24 22:05 ` Ric Wheeler @ 2009-08-24 22:39 ` Theodore Tso 2009-08-24 23:00 ` Pavel Machek ` (3 more replies) 1 sibling, 4 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-24 22:39 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote: > > I have to admit that I have not paid enough attention to this specifics > > of your ext3 + flash card issue - is it the ftl stuff doing out of order > > IO's? > > The problem is that flash cards destroy whole erase block on unplug, > and ext3 can't cope with that. Sure --- but name **any** filesystem that can deal with the fact that 128k or 256k worth of data might disappear when you pull out the flash card while it is writing a single sector? > > Your statement is overly broad - ext3 on a commercial RAID array that > > does RAID5 or RAID6, etc has no issues that I know of. > > If your commercial RAID array is battery backed, maybe. But I was > talking Linux MD here. It's not just high end RAID arrays that have battery backups; I happen to use a mid-range hardware RAID card that comes with a battery backup. It's just a matter of choosing your hardware carefully. If your concern is that with Linux MD, you could potentially lose an entire stripe in RAID 5 mode, then you should say that explicitly; but again, this isn't a filesystem-specific claim; it's true for all filesystems. I don't know of any file system that can survive having a RAID stripe-shaped-hole blown into the middle of it due to a power failure. 
I'll note, BTW, that AIX uses a journal to protect against these sorts of problems with software raid; this also means that with AIX, you also don't have to rebuild a RAID 1 device after an unclean shutdown, like you have to do with Linux MD. This was on the EVMS team's development list to implement for Linux, but it got canned after LVM won out, lo those many years ago. C'est la vie; but it's a problem which is solvable at the RAID layer, and which is traditionally and historically solved in competent RAID implementations. - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
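The RAID-layer journal Ted mentions can be sketched as a write-intent log: persist the intended stripe update before touching the array, retire the entry afterwards, and replay un-retired entries on recovery. This is an illustrative Python toy (the names and structure are mine, not AIX's or EVMS's actual design):

```python
journal = []            # stands in for a small dedicated journal region

def journaled_stripe_write(array, stripe_no, data, parity):
    journal.append((stripe_no, data, parity))   # 1. log intent + payload
    array[stripe_no] = (data, parity)           # 2. write data and parity
    journal.pop()                               # 3. retire the entry

def replay(array):
    """After an unclean shutdown, reapply any un-retired updates so the
    data+parity pair becomes atomic from the array's point of view."""
    while journal:
        stripe_no, data, parity = journal.pop(0)
        array[stripe_no] = (data, parity)

# Simulate a crash: the intent was logged but only the data half landed.
array = {0: (b"old-data", b"old-parity")}
journal.append((0, b"new-data", b"new-parity"))
array[0] = (b"new-data", b"old-parity")         # torn write at powerfail
replay(array)
assert array[0] == (b"new-data", b"new-parity") # consistent again
```

The same idea is why a journaled RAID-1 needn't do a full resync after a crash: only the regions with logged intents can be out of sync, so only those need replaying.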
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 22:39 ` Theodore Tso @ 2009-08-24 23:00 ` Pavel Machek 2009-08-25 0:02 ` david ` (2 more replies) 2009-08-24 23:00 ` Pavel Machek ` (2 subsequent siblings) 3 siblings, 3 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 23:00 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Cc: corbet On Mon 2009-08-24 18:39:15, Theodore Tso wrote: > On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote: > > > I have to admit that I have not paid enough attention to this specifics > > > of your ext3 + flash card issue - is it the ftl stuff doing out of order > > > IO's? > > > > The problem is that flash cards destroy whole erase block on unplug, > > and ext3 can't cope with that. > > Sure --- but name **any** filesystem that can deal with the fact that > 128k or 256k worth of data might disappear when you pull out the flash > card while it is writing a single sector? First... I consider myself quite competent in the os level, yet I did not realize what flash does and what that means for data integrity. That means we need some documentation, or maybe we should refuse to mount those devices r/w or something. Then to answer your question... ext2. You expect to run fsck after unclean shutdown, and you expect to have to solve some problems with it. So the way ext2 deals with the flash media actually matches what the user expects. (*) OTOH in ext3 case you expect consistent filesystem after unplug; and you don't get that. > > > Your statement is overly broad - ext3 on a commercial RAID array that > > > does RAID5 or RAID6, etc has no issues that I know of. > > > > If your commercial RAID array is battery backed, maybe. But I was > > talking Linux MD here. ... 
> If your concern is that with Linux MD, you could potentially lose an > entire stripe in RAID 5 mode, then you should say that explicitly; but > again, this isn't a filesystem specific cliam; it's true for all > filesystems. I don't know of any file system that can survive having > a RAID stripe-shaped-hole blown into the middle of it due to a power > failure. Again, ext2 handles that in a way the user expects. At least I was taught "ext2 needs fsck after powerfail; ext3 can handle powerfails just ok". > I'll note, BTW, that AIX uses a journal to protect against these sorts > of problems with software raid; this also means that with AIX, you > also don't have to rebuild a RAID 1 device after an unclean shutdown, > like you have do with Linux MD. This was on the EVMS's team > development list to implement for Linux, but it got canned after LVM > won out, lo those many years ago. Ce la vie; but it's a problem which > is solvable at the RAID layer, and which is traditionally and > historically solved in competent RAID implementations. Yep, we should add journal to RAID; or at least write "Linux MD *needs* an UPS" in big and bold letters. I'm trying to do the second part. (Attached is current version of the patch). [If you'd prefer patch saying that MMC/USB flash/Linux MD arrays are generally unsafe to use without UPS/reliable connection/no kernel bugs... then I may try to push that. I was not sure... maybe some filesystem _can_ handle this kind of issue?] Pavel (*) Ok, now... user expects to run fsck, but very advanced users may not expect old data to be damaged. Certainly I was not an advanced enough user a few months ago. 
diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt new file mode 100644 index 0000000..d1ef4d0 --- /dev/null +++ b/Documentation/filesystems/expectations.txt @@ -0,0 +1,57 @@ +Linux block-backed filesystems can only work correctly when several +conditions are met in the block layer and below (disks, flash +cards). Some of them are obvious ("data on media should not change +randomly"), some are less so. Not all filesystems require all of these +to be satisfied for safe operation. + +Write errors not allowed (NO-WRITE-ERRORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writes to media never fail. Even if disk returns error condition +during write, filesystems can't handle that correctly. + + Fortunately writes failing are very uncommon on traditional + spinning disks, as they have spare sectors they use when write + fails. + +Don't cause collateral damage on a failed write (NO-COLLATERALS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On some storage systems, failed write (for example due to power +failure) kills data in adjacent (or maybe unrelated) sectors. + +Unfortunately, cheap USB/SD flash cards I've seen do have this bug, +and are thus unsuitable for all filesystems I know. + + An inherent problem with using flash as a normal block device + is that the flash erase size is bigger than most filesystem + sector sizes. So when you request a write, it may erase and + rewrite some 64k, 128k, or even a couple megabytes on the + really _big_ ones. + + If you lose power in the middle of that, filesystem won't + notice that data in the "sectors" _around_ the one you were + trying to write to got trashed. + + MD RAID-4/5/6 in degraded mode has a similar problem: stripes + behave similarly to eraseblocks. 
+ + +Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either whole sector is correctly written or nothing is written during +powerfail. + + Because RAM tends to fail faster than rest of system during + powerfail, special hw killing DMA transfers may be necessary; + otherwise, disks may write garbage during powerfail. + This may be quite common on generic PC machines. + + Note that atomic write is very hard to guarantee for MD RAID-4/5/6, + because it needs to write both changed data, and parity, to + different disks. (But it will only really show up in degraded mode). + UPS for RAID array should help. + + + diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index 67639f9..ef9ff0f 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they have to be 8 character filenames, even then we are fairly close to running out of unique filenames. +Requirements +============ + +Ext2 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed (NO-WRITE-ERRORS) + +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL) + +and obviously: + +* don't cause collateral damage to adjacent sectors on a failed write + (NO-COLLATERALS) + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* write caching is disabled. ext2 does not know how to issue barriers + as of 2.6.28. hdparm -W0 disables it on SATA disks. + Journaling ----------- - -A journaling extension to the ext2 code has been developed by Stephen -Tweedie. 
It avoids the risks of metadata corruption and the need to -wait for e2fsck to complete after a crash, without requiring a change -to the on-disk ext2 layout. In a nutshell, the journal is a regular -file which stores whole metadata (and optionally data) blocks that have -been modified, prior to writing them into the filesystem. This means -it is possible to add a journal to an existing ext2 filesystem without -the need for data conversion. - -When changes to the filesystem (e.g. a file is renamed) they are stored in -a transaction in the journal and can either be complete or incomplete at -the time of a crash. If a transaction is complete at the time of a crash -(or in the normal case where the system does not crash), then any blocks -in that transaction are guaranteed to represent a valid filesystem state, -and are copied into the filesystem. If a transaction is incomplete at -the time of the crash, then there is no guarantee of consistency for -the blocks in that transaction so they are discarded (which means any -filesystem changes they represent are also lost). +========== Check Documentation/filesystems/ext3.txt if you want to read more about ext3 and journaling. diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 570f9bd..752f4b4 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -199,6 +202,43 @@ debugfs: ext2 and ext3 file system debugger. ext2online: online (mounted) ext2 and ext3 filesystem resizer +Requirements +============ + +Ext3 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed (NO-WRITE-ERRORS) + +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL) + + Ext3 handles trash getting written into sectors during powerfail + surprisingly well. It's not foolproof, but it is resilient. 
+ Incomplete journal entries are ignored, and journal replay of + complete entries will often "repair" garbage written into the inode + table. The data=journal option extends this behavior to file and + directory data blocks as well. + + +and obviously: + +* don't cause collateral damage to adjacent sectors on a failed write + (NO-COLLATERALS) + + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* either write caching is disabled, or hw can do barriers and they are enabled. + + (Note that barriers are disabled by default; use the "barrier=1" + mount option after making sure hw can support them). + + hdparm -I reports disk features. "Native + Command Queueing" is the feature you are looking for. + + References ========== -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 309+ messages in thread
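The collateral-damage failure mode the patch describes (one sector write putting a whole eraseblock at risk) can be simulated. This is hypothetical Python assuming a simplistic erase-then-rewrite FTL and a 128 KiB eraseblock; real FTLs differ in detail, but the power-loss window is the same idea:

```python
SECTOR = 512
ERASE_BLOCK = 128 * 1024
SPEB = ERASE_BLOCK // SECTOR         # sectors per eraseblock (256)

def naive_ftl_write(media, sector_no, payload, fail_after=None):
    """Write one sector by erasing its whole eraseblock and reprogramming
    every sector in it; `fail_after` models power loss after that many
    sector programs."""
    base = (sector_no // SPEB) * SPEB
    staged = media[base:base + SPEB]             # old contents held in RAM
    staged[sector_no - base] = payload
    media[base:base + SPEB] = [b"\xff" * SECTOR] * SPEB   # erase (all 1s)
    for i, s in enumerate(staged):
        if fail_after is not None and i >= fail_after:
            return                               # power gone mid-rewrite
        media[base + i] = s

media = [bytes([n % 255]) * SECTOR for n in range(SPEB)]
before = list(media)
naive_ftl_write(media, sector_no=10, payload=b"Z" * SECTOR, fail_after=3)

# One sector was requested; hundreds of untouched neighbours got trashed.
damaged = sum(old != new for old, new in zip(before, media))
assert damaged > 250
```

This is why the patch argues ext2's "expect to fsck after a crash" model degrades more gracefully here than ext3's "journal makes me consistent" model: no journal can repair sectors it never knew were being rewritten.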
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 23:00 ` Pavel Machek @ 2009-08-25 0:02 ` david 2009-08-25 9:32 ` Pavel Machek 2009-08-25 0:06 ` Ric Wheeler 2009-08-25 0:08 ` Theodore Tso 2 siblings, 1 reply; 309+ messages in thread From: david @ 2009-08-25 0:02 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue, 25 Aug 2009, Pavel Machek wrote: > On Mon 2009-08-24 18:39:15, Theodore Tso wrote: >> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote: >>>> I have to admit that I have not paid enough attention to this specifics >>>> of your ext3 + flash card issue - is it the ftl stuff doing out of order >>>> IO's? >>> >>> The problem is that flash cards destroy whole erase block on unplug, >>> and ext3 can't cope with that. >> >> Sure --- but name **any** filesystem that can deal with the fact that >> 128k or 256k worth of data might disappear when you pull out the flash >> card while it is writing a single sector? > > First... I consider myself quite competent in the os level, yet I did > not realize what flash does and what that means for data > integrity. That means we need some documentation, or maybe we should > refuse to mount those devices r/w or something. > > Then to answer your question... ext2. You expect to run fsck after > unclean shutdown, and you expect to have to solve some problems with > it. So the way ext2 deals with the flash media actually matches what > the user expects. (*) you lose data in ext2 > OTOH in ext3 case you expect consistent filesystem after unplug; and > you don't get that. the problem is that people have been preaching that journaling filesystems eliminate all data loss for no cost (or at worst for minimal cost). they don't, they never did. 
they address one specific problem (metadata inconsistency), but they do not address data loss, and never did (and for the most part the filesystem developers never claimed to) depending on how much data gets lost, you may or may not be able to recover enough to continue to use the filesystem, and when your block device takes actions in larger chunks than the filesystem asked it to, it's very possible for seemingly unrelated data to be lost as well. this is true for every single filesystem, nothing special about ext3 people somehow have the expectation that ext3 does the data equivalent of solving world hunger, it doesn't, it never did, and it never claimed to. bashing it because it doesn't isn't fair. bashing XFS because it doesn't also isn't fair. personally I don't consider the two filesystems to be significantly different in terms of the data loss potential. I think people are more aware of the potentials with XFS than with ext3, but I believe that the risk of loss is really about the same (and pretty much for the same reasons) >>>> Your statement is overly broad - ext3 on a commercial RAID array that >>>> does RAID5 or RAID6, etc has no issues that I know of. >>> >>> If your commercial RAID array is battery backed, maybe. But I was >>> talking Linux MD here. > ... >> If your concern is that with Linux MD, you could potentially lose an >> entire stripe in RAID 5 mode, then you should say that explicitly; but >> again, this isn't a filesystem specific cliam; it's true for all >> filesystems. I don't know of any file system that can survive having >> a RAID stripe-shaped-hole blown into the middle of it due to a power >> failure. > > Again, ext2 handles that in a way user expects it. > > At least I was teached "ext2 needs fsck after powerfail; ext3 can > handle powerfails just ok". you were taught wrong. the people making these claims for ext3 didn't understand what ext3 does and doesn't do. 
David Lang >> I'll note, BTW, that AIX uses a journal to protect against these sorts >> of problems with software raid; this also means that with AIX, you >> also don't have to rebuild a RAID 1 device after an unclean shutdown, >> like you have do with Linux MD. This was on the EVMS's team >> development list to implement for Linux, but it got canned after LVM >> won out, lo those many years ago. Ce la vie; but it's a problem which >> is solvable at the RAID layer, and which is traditionally and >> historically solved in competent RAID implementations. > > Yep, we should add journal to RAID; or at least write "Linux MD > *needs* an UPS" in big and bold letters. I'm trying to do the second > part. > > (Attached is current version of the patch). > > [If you'd prefer patch saying that MMC/USB flash/Linux MD arrays are > generaly unsafe to use without UPS/reliable connection/no kernel > bugs... then I may try to push that. I was not sure... maybe some > filesystem _can_ handle this kind of issues?] > > Pavel > > (*) Ok, now... user expects to run fsck, but very advanced users may > not expect old data to be damaged. Certainly I was not advanced enough > user few months ago. > > diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt > new file mode 100644 > index 0000000..d1ef4d0 > --- /dev/null > +++ b/Documentation/filesystems/expectations.txt > @@ -0,0 +1,57 @@ > +Linux block-backed filesystems can only work correctly when several > +conditions are met in the block layer and below (disks, flash > +cards). Some of them are obvious ("data on media should not change > +randomly"), some are less so. Not all filesystems require all of these > +to be satisfied for safe operation. > + > +Write errors not allowed (NO-WRITE-ERRORS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Writes to media never fail. Even if disk returns error condition > +during write, filesystems can't handle that correctly. 
> + > + Fortunately writes failing are very uncommon on traditional > + spinning disks, as they have spare sectors they use when write > + fails. > + > +Don't cause collateral damage on a failed write (NO-COLLATERALS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +On some storage systems, failed write (for example due to power > +failure) kills data in adjacent (or maybe unrelated) sectors. > + > +Unfortunately, cheap USB/SD flash cards I've seen do have this bug, > +and are thus unsuitable for all filesystems I know. > + > + An inherent problem with using flash as a normal block device > + is that the flash erase size is bigger than most filesystem > + sector sizes. So when you request a write, it may erase and > + rewrite some 64k, 128k, or even a couple megabytes on the > + really _big_ ones. > + > + If you lose power in the middle of that, filesystem won't > + notice that data in the "sectors" _around_ the one your were > + trying to write to got trashed. > + > + MD RAID-4/5/6 in degraded mode has similar problem, stripes > + behave similary to eraseblocks. > + > + > +Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Either whole sector is correctly written or nothing is written during > +powerfail. > + > + Because RAM tends to fail faster than rest of system during > + powerfail, special hw killing DMA transfers may be necessary; > + otherwise, disks may write garbage during powerfail. > + This may be quite common on generic PC machines. > + > + Note that atomic write is very hard to guarantee for MD RAID-4/5/6, > + because it needs to write both changed data, and parity, to > + different disks. (But it will only really show up in degraded mode). > + UPS for RAID array should help. 
> + > + > + > diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt > index 67639f9..ef9ff0f 100644 > --- a/Documentation/filesystems/ext2.txt > +++ b/Documentation/filesystems/ext2.txt > @@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they > have to be 8 character filenames, even then we are fairly close to > running out of unique filenames. > > +Requirements > +============ > + > +Ext2 expects disk/storage subsystem to behave sanely. On sanely > +behaving disk subsystem, data that have been successfully synced will > +stay on the disk. Sane means: > + > +* write errors not allowed (NO-WRITE-ERRORS) > + > +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL) > + > +and obviously: > + > +* don't cause collateral damage to adjacent sectors on a failed write > + (NO-COLLATERALS) > + > +(see expectations.txt; note that most/all linux block-based > +filesystems have similar expectations) > + > +* write caching is disabled. ext2 does not know how to issue barriers > + as of 2.6.28. hdparm -W0 disables it on SATA disks. > + > Journaling > ----------- > - > -A journaling extension to the ext2 code has been developed by Stephen > -Tweedie. It avoids the risks of metadata corruption and the need to > -wait for e2fsck to complete after a crash, without requiring a change > -to the on-disk ext2 layout. In a nutshell, the journal is a regular > -file which stores whole metadata (and optionally data) blocks that have > -been modified, prior to writing them into the filesystem. This means > -it is possible to add a journal to an existing ext2 filesystem without > -the need for data conversion. > - > -When changes to the filesystem (e.g. a file is renamed) they are stored in > -a transaction in the journal and can either be complete or incomplete at > -the time of a crash. 
If a transaction is complete at the time of a crash > -(or in the normal case where the system does not crash), then any blocks > -in that transaction are guaranteed to represent a valid filesystem state, > -and are copied into the filesystem. If a transaction is incomplete at > -the time of the crash, then there is no guarantee of consistency for > -the blocks in that transaction so they are discarded (which means any > -filesystem changes they represent are also lost). > +========== > Check Documentation/filesystems/ext3.txt if you want to read more about > ext3 and journaling. > > diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt > index 570f9bd..752f4b4 100644 > --- a/Documentation/filesystems/ext3.txt > +++ b/Documentation/filesystems/ext3.txt > @@ -199,6 +202,43 @@ debugfs: ext2 and ext3 file system debugger. > ext2online: online (mounted) ext2 and ext3 filesystem resizer > > > +Requirements > +============ > + > +Ext3 expects disk/storage subsystem to behave sanely. On sanely > +behaving disk subsystem, data that have been successfully synced will > +stay on the disk. Sane means: > + > +* write errors not allowed (NO-WRITE-ERRORS) > + > +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL) > + > + Ext3 handles trash getting written into sectors during powerfail > + surprisingly well. It's not foolproof, but it is resilient. > + Incomplete journal entries are ignored, and journal replay of > + complete entries will often "repair" garbage written into the inode > + table. The data=journal option extends this behavior to file and > + directory data blocks as well. > + > + > +and obviously: > + > +* don't cause collateral damage to adjacent sectors on a failed write > + (NO-COLLATERALS) > + > + > +(see expectations.txt; note that most/all linux block-based > +filesystems have similar expectations) > + > +* either write caching is disabled, or hw can do barriers and they are enabled. 
> + > + (Note that barriers are disabled by default; use the "barrier=1" > + mount option after making sure hw can support them). > + > + hdparm -I reports disk features; "Native > + Command Queueing" is the feature you are looking for. > + > + > References > ========== > > > ^ permalink raw reply [flat|nested] 309+ messages in thread
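[Editorial aside: the eraseblock arithmetic described in the patch above can be sketched as a quick shell calculation. The 128 KiB eraseblock and 512-byte sector sizes below are illustrative assumptions, not properties of any particular card; real cards vary from 64 KiB up to several megabytes, as the patch says.]

```shell
# Which neighboring sectors are at risk when cheap flash erases and
# rewrites a whole eraseblock to service a single-sector write?
ERASEBLOCK=131072      # assumed eraseblock size in bytes (128 KiB)
SECTOR=512             # traditional sector size in bytes
per_block=$(( ERASEBLOCK / SECTOR ))   # sectors per eraseblock

target=1000            # the one sector the filesystem asked to write
first=$(( target / per_block * per_block ))
last=$(( first + per_block - 1 ))
echo "writing sector $target may trash sectors $first-$last on powerfail"
```

With these assumed sizes, a write to sector 1000 puts sectors 768-1023 at risk, which is exactly the collateral damage the NO-COLLATERALS section describes.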
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 0:02 ` david @ 2009-08-25 9:32 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-25 9:32 UTC (permalink / raw) To: david Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! >>> Sure --- but name **any** filesystem that can deal with the fact that >>> 128k or 256k worth of data might disappear when you pull out the flash >>> card while it is writing a single sector? >> >> First... I consider myself quite competent at the OS level, yet I did >> not realize what flash does and what that means for data >> integrity. That means we need some documentation, or maybe we should >> refuse to mount those devices r/w or something. >> >> Then to answer your question... ext2. You expect to run fsck after an >> unclean shutdown, and you expect to have to solve some problems with >> it. So the way ext2 deals with the flash media actually matches what >> the user expects. (*) > > you lose data in ext2 Yes. >> OTOH in the ext3 case you expect a consistent filesystem after unplug; and >> you don't get that. > > the problem is that people have been preaching that journaling > filesystems eliminate all data loss for no cost (or at worst for minimal > cost). > > they don't, they never did. > > they address one specific problem (metadata inconsistency), but they do > not address data loss, and never did (and for the most part the > filesystem developers never claimed to) Well, in the case of flash cards and degraded MD Raid5, ext3 does _not_ address the metadata inconsistency problem. And that's why I'm trying to fix the documentation. Current ext3 documentation says: #Journaling Block Device layer #----------------------------- #The Journaling Block Device layer (JBD) isn't ext3 specific. It was #designed #to add journaling capabilities to a block device.
The ext3 filesystem #code #will inform the JBD of modifications it is performing (called a #transaction). #The journal supports the transactions start and stop, and in case of a #crash, #the journal can replay the transactions to quickly put the partition #back into #a consistent state. There's no mention that this does not work on flash cards and degraded MD Raid5 arrays. > people somehow have the expectation that ext3 does the data equivalent of > solving world hunger, it doesn't, it never did, and it never claimed > to. It claims so, above. > personally I don't consider the two filesystems to be significantly > different in terms of the data loss potential. I think people are more > aware of the potentials with XFS than with ext3, but I believe that the > risk of loss is really about the same (and pretty much for the same > reasons) Ack here. >> Again, ext2 handles that in a way the user expects. >> >> At least I was taught "ext2 needs fsck after powerfail; ext3 can >> handle powerfails just ok". > > you were taught wrong. the people making these claims for ext3 didn't > understand what ext3 does and doesn't do. Cool. So... can we fix the documentation? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 23:00 ` Pavel Machek 2009-08-25 0:02 ` david @ 2009-08-25 0:06 ` Ric Wheeler 2009-08-25 9:34 ` Pavel Machek 2009-08-25 0:08 ` Theodore Tso 2 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-25 0:06 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Pavel Machek wrote: > On Mon 2009-08-24 18:39:15, Theodore Tso wrote: > >> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote: >> >>>> I have to admit that I have not paid enough attention to this specifics >>>> of your ext3 + flash card issue - is it the ftl stuff doing out of order >>>> IO's? >>>> >>> The problem is that flash cards destroy whole erase block on unplug, >>> and ext3 can't cope with that. >>> >> Sure --- but name **any** filesystem that can deal with the fact that >> 128k or 256k worth of data might disappear when you pull out the flash >> card while it is writing a single sector? >> > > First... I consider myself quite competent in the os level, yet I did > not realize what flash does and what that means for data > integrity. That means we need some documentation, or maybe we should > refuse to mount those devices r/w or something. > > Then to answer your question... ext2. You expect to run fsck after > unclean shutdown, and you expect to have to solve some problems with > it. So the way ext2 deals with the flash media actually matches what > the user expects. (*) > > OTOH in ext3 case you expect consistent filesystem after unplug; and > you don't get that. > > >>>> Your statement is overly broad - ext3 on a commercial RAID array that >>>> does RAID5 or RAID6, etc has no issues that I know of. >>>> >>> If your commercial RAID array is battery backed, maybe. But I was >>> talking Linux MD here. >>> > ... 
> >> If your concern is that with Linux MD, you could potentially lose an >> entire stripe in RAID 5 mode, then you should say that explicitly; but >> again, this isn't a filesystem specific claim; it's true for all >> filesystems. I don't know of any file system that can survive having >> a RAID stripe-shaped-hole blown into the middle of it due to a power >> failure. >> > > Again, ext2 handles that in a way the user expects. > > At least I was taught "ext2 needs fsck after powerfail; ext3 can > handle powerfails just ok". > > So, would you be happy if ext3 fsck was always run on reboot (at least for flash devices)? ric >> I'll note, BTW, that AIX uses a journal to protect against these sorts >> of problems with software raid; this also means that with AIX, you >> also don't have to rebuild a RAID 1 device after an unclean shutdown, >> like you have to do with Linux MD. This was on the EVMS team's >> development list to implement for Linux, but it got canned after LVM >> won out, lo those many years ago. C'est la vie; but it's a problem which >> is solvable at the RAID layer, and which is traditionally and >> historically solved in competent RAID implementations. >> > > Yep, we should add a journal to RAID; or at least write "Linux MD > *needs* a UPS" in big and bold letters. I'm trying to do the second > part. > > (Attached is the current version of the patch). > > [If you'd prefer a patch saying that MMC/USB flash/Linux MD arrays are > generally unsafe to use without UPS/reliable connection/no kernel > bugs... then I may try to push that. I was not sure... maybe some > filesystem _can_ handle this kind of issue?] > > Pavel > > (*) Ok, now... the user expects to run fsck, but very advanced users may > not expect old data to be damaged. Certainly I was not an advanced enough > user a few months ago.
> > diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt > new file mode 100644 > index 0000000..d1ef4d0 > --- /dev/null > +++ b/Documentation/filesystems/expectations.txt > @@ -0,0 +1,57 @@ > +Linux block-backed filesystems can only work correctly when several > +conditions are met in the block layer and below (disks, flash > +cards). Some of them are obvious ("data on media should not change > +randomly"), some are less so. Not all filesystems require all of these > +to be satisfied for safe operation. > + > +Write errors not allowed (NO-WRITE-ERRORS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Writes to media never fail. Even if the disk returns an error condition > +during a write, filesystems can't handle that correctly. > + > + Fortunately writes failing are very uncommon on traditional > + spinning disks, as they have spare sectors they use when a write > + fails. > + > +Don't cause collateral damage on a failed write (NO-COLLATERALS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +On some storage systems, a failed write (for example due to power > +failure) kills data in adjacent (or maybe unrelated) sectors. > + > +Unfortunately, the cheap USB/SD flash cards I've seen do have this bug, > +and are thus unsuitable for all filesystems I know. > + > + An inherent problem with using flash as a normal block device > + is that the flash erase size is bigger than most filesystem > + sector sizes. So when you request a write, it may erase and > + rewrite some 64k, 128k, or even a couple megabytes on the > + really _big_ ones. > + > + If you lose power in the middle of that, the filesystem won't > + notice that data in the "sectors" _around_ the one you were > + trying to write to got trashed. > + > + MD RAID-4/5/6 in degraded mode has a similar problem; stripes > + behave similarly to eraseblocks.
> + > + > +Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Either whole sector is correctly written or nothing is written during > +powerfail. > + > + Because RAM tends to fail faster than rest of system during > + powerfail, special hw killing DMA transfers may be necessary; > + otherwise, disks may write garbage during powerfail. > + This may be quite common on generic PC machines. > + > + Note that atomic write is very hard to guarantee for MD RAID-4/5/6, > + because it needs to write both changed data, and parity, to > + different disks. (But it will only really show up in degraded mode). > + UPS for RAID array should help. > + > + > + > diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt > index 67639f9..ef9ff0f 100644 > --- a/Documentation/filesystems/ext2.txt > +++ b/Documentation/filesystems/ext2.txt > @@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they > have to be 8 character filenames, even then we are fairly close to > running out of unique filenames. > > +Requirements > +============ > + > +Ext2 expects disk/storage subsystem to behave sanely. On sanely > +behaving disk subsystem, data that have been successfully synced will > +stay on the disk. Sane means: > + > +* write errors not allowed (NO-WRITE-ERRORS) > + > +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL) > + > +and obviously: > + > +* don't cause collateral damage to adjacent sectors on a failed write > + (NO-COLLATERALS) > + > +(see expectations.txt; note that most/all linux block-based > +filesystems have similar expectations) > + > +* write caching is disabled. ext2 does not know how to issue barriers > + as of 2.6.28. hdparm -W0 disables it on SATA disks. > + > Journaling > ----------- > - > -A journaling extension to the ext2 code has been developed by Stephen > -Tweedie. 
It avoids the risks of metadata corruption and the need to > -wait for e2fsck to complete after a crash, without requiring a change > -to the on-disk ext2 layout. In a nutshell, the journal is a regular > -file which stores whole metadata (and optionally data) blocks that have > -been modified, prior to writing them into the filesystem. This means > -it is possible to add a journal to an existing ext2 filesystem without > -the need for data conversion. > - > -When changes to the filesystem (e.g. a file is renamed) they are stored in > -a transaction in the journal and can either be complete or incomplete at > -the time of a crash. If a transaction is complete at the time of a crash > -(or in the normal case where the system does not crash), then any blocks > -in that transaction are guaranteed to represent a valid filesystem state, > -and are copied into the filesystem. If a transaction is incomplete at > -the time of the crash, then there is no guarantee of consistency for > -the blocks in that transaction so they are discarded (which means any > -filesystem changes they represent are also lost). > +========== > Check Documentation/filesystems/ext3.txt if you want to read more about > ext3 and journaling. > > diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt > index 570f9bd..752f4b4 100644 > --- a/Documentation/filesystems/ext3.txt > +++ b/Documentation/filesystems/ext3.txt > @@ -199,6 +202,43 @@ debugfs: ext2 and ext3 file system debugger. > ext2online: online (mounted) ext2 and ext3 filesystem resizer > > > +Requirements > +============ > + > +Ext3 expects disk/storage subsystem to behave sanely. On sanely > +behaving disk subsystem, data that have been successfully synced will > +stay on the disk. 
Sane means: > + > +* write errors not allowed (NO-WRITE-ERRORS) > + > +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL) > + > + Ext3 handles trash getting written into sectors during powerfail > + surprisingly well. It's not foolproof, but it is resilient. > + Incomplete journal entries are ignored, and journal replay of > + complete entries will often "repair" garbage written into the inode > + table. The data=journal option extends this behavior to file and > + directory data blocks as well. > + > + > +and obviously: > + > +* don't cause collateral damage to adjacent sectors on a failed write > + (NO-COLLATERALS) > + > + > +(see expectations.txt; note that most/all linux block-based > +filesystems have similar expectations) > + > +* either write caching is disabled, or hw can do barriers and they are enabled. > + > + (Note that barriers are disabled by default; use the "barrier=1" > + mount option after making sure hw can support them). > + > + hdparm -I reports disk features; "Native > + Command Queueing" is the feature you are looking for. > + > + > References > ========== > > > ^ permalink raw reply [flat|nested] 309+ messages in thread
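[Editorial aside: as a concrete illustration of the write-cache and barrier advice in the quoted patch, the usual commands look like this. Device names are placeholders, and this is a sketch of the 2.6.28-era situation the patch describes, not a recommendation for current kernels.]

```shell
# Inspect the disk's feature list; look for "Native Command Queueing"
# and "Write cache" in the Commands/features section of the output.
hdparm -I /dev/sda | grep -i -e 'command queu' -e 'write cache'

# ext2 (no barrier support as of 2.6.28): disable the write cache.
hdparm -W0 /dev/sda

# ext3 on barrier-capable hardware: enable barriers at mount time
# instead (they were off by default in that era).
mount -o remount,barrier=1 /dev/sda1 /mnt
```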
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 0:06 ` Ric Wheeler @ 2009-08-25 9:34 ` Pavel Machek 2009-08-25 15:34 ` david 2009-08-26 3:32 ` Rik van Riel 0 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-25 9:34 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! >>> If your concern is that with Linux MD, you could potentially lose an >>> entire stripe in RAID 5 mode, then you should say that explicitly; but >>> again, this isn't a filesystem specific claim; it's true for all >>> filesystems. I don't know of any file system that can survive having >>> a RAID stripe-shaped-hole blown into the middle of it due to a power >>> failure. >>> >> >> Again, ext2 handles that in a way the user expects. >> >> At least I was taught "ext2 needs fsck after powerfail; ext3 can >> handle powerfails just ok". > > So, would you be happy if ext3 fsck was always run on reboot (at least > for flash devices)? For flash devices, MD Raid 5 and anything else that needs it; yes that would make me happy ;-). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 9:34 ` Pavel Machek @ 2009-08-25 15:34 ` david 2009-08-26 3:32 ` Rik van Riel 1 sibling, 0 replies; 309+ messages in thread From: david @ 2009-08-25 15:34 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue, 25 Aug 2009, Pavel Machek wrote: > Hi! > >>>> If your concern is that with Linux MD, you could potentially lose an >>>> entire stripe in RAID 5 mode, then you should say that explicitly; but >>>> again, this isn't a filesystem specific claim; it's true for all >>>> filesystems. I don't know of any file system that can survive having >>>> a RAID stripe-shaped-hole blown into the middle of it due to a power >>>> failure. >>>> >>> >>> Again, ext2 handles that in a way the user expects. >>> >>> At least I was taught "ext2 needs fsck after powerfail; ext3 can >>> handle powerfails just ok". >> >> So, would you be happy if ext3 fsck was always run on reboot (at least >> for flash devices)? > > For flash devices, MD Raid 5 and anything else that needs it; yes that > would make me happy ;-). the thing is that fsck would not fix the problem. it may (if the data lost was metadata) detect the problem and tell you how many files you have lost, but if the data lost was all in a data file you would not detect it with a fsck. the only way you would detect the missing data is to read all the files on the filesystem and detect that the data you are reading is wrong. but how can you tell if the data you are reading is wrong? on a flash drive, your read can return garbage, but how do you know that garbage isn't the contents of the file?
on a degraded raid5 array you have no way to test data integrity, so when the missing drive is replaced, the rebuild algorithm will calculate the appropriate data to make the parity calculations work out and write garbage to that drive. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
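[Editorial aside: David's point about degraded-RAID-5 rebuilds can be shown with a toy XOR-parity calculation. This is a minimal sketch with made-up byte values, not real MD code; a real stripe has many blocks per disk, but the arithmetic is the same.]

```shell
# Toy 3-"disk" RAID-5 stripe: two data blocks and one XOR parity block,
# each modelled as a single byte.
d0=170                   # data byte on disk 0 (0xAA)
d1=85                    # data byte on disk 1 (0x55)
p=$(( d0 ^ d1 ))         # parity byte on disk 2 (0xFF)

# Healthy degraded read: disk 1 is missing, but d1 = d0 XOR p.
echo "reconstructed d1: $(( d0 ^ p ))"

# Powerfail in degraded mode: d0 was half-rewritten and now holds
# garbage, while the parity was never updated. The reconstruction
# still "works out" arithmetically -- and silently yields garbage.
d0=204                   # garbage byte (0xCC)
echo "reconstructed d1 after powerfail: $(( d0 ^ p ))"
```

The second reconstruction produces a value with no relation to the original data, and nothing in the parity arithmetic can flag it as wrong; that is why a rebuild after an unclean shutdown writes plausible-looking garbage to the replacement drive.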
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 9:34 ` Pavel Machek 2009-08-25 15:34 ` david @ 2009-08-26 3:32 ` Rik van Riel 2009-08-26 11:17 ` Pavel Machek 2009-08-27 5:27 ` Rob Landley 1 sibling, 2 replies; 309+ messages in thread From: Rik van Riel @ 2009-08-26 3:32 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Pavel Machek wrote: >> So, would you be happy if ext3 fsck was always run on reboot (at least >> for flash devices)? > > For flash devices, MD Raid 5 and anything else that needs it; yes that > would make me happy ;-). Sorry, but that just shows your naivete. Metadata takes up such a small part of the disk that fscking it and finding it to be OK is absolutely no guarantee that the data on the filesystem has not been horribly mangled. Personally, what I care about is my data. The metadata is just a way to get to my data, while the data is actually important. -- All rights reversed. ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 3:32 ` Rik van Riel @ 2009-08-26 11:17 ` Pavel Machek 2009-08-26 11:29 ` david 2009-08-26 12:28 ` Theodore Tso 1 sibling, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-26 11:17 UTC (permalink / raw) To: Rik van Riel Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue 2009-08-25 23:32:47, Rik van Riel wrote: > Pavel Machek wrote: > >>> So, would you be happy if ext3 fsck was always run on reboot (at >>> least for flash devices)? >> >> For flash devices, MD Raid 5 and anything else that needs it; yes that >> would make me happy ;-). > > Sorry, but that just shows your naivete. > > Metadata takes up such a small part of the disk that fscking > it and finding it to be OK is absolutely no guarantee that > the data on the filesystem has not been horribly mangled. > > Personally, what I care about is my data. > > The metadata is just a way to get to my data, while the data > is actually important. Personally, I care about metadata consistency, and the ext3 documentation suggests that the journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there. How you protect your data is another question, but the ext3 documentation does not claim that the journal protects it, so that's up to the user, I guess. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 11:17 ` Pavel Machek @ 2009-08-26 11:29 ` david 2009-08-26 13:10 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: david @ 2009-08-26 11:29 UTC (permalink / raw) To: Pavel Machek Cc: Rik van Riel, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Pavel Machek wrote: > On Tue 2009-08-25 23:32:47, Rik van Riel wrote: >> Pavel Machek wrote: >> >>>> So, would you be happy if ext3 fsck was always run on reboot (at >>>> least for flash devices)? >>> >>> For flash devices, MD Raid 5 and anything else that needs it; yes that >>> would make me happy ;-). >> >> Sorry, but that just shows your naivete. >> >> Metadata takes up such a small part of the disk that fscking >> it and finding it to be OK is absolutely no guarantee that >> the data on the filesystem has not been horribly mangled. >> >> Personally, what I care about is my data. >> >> The metadata is just a way to get to my data, while the data >> is actually important. > > Personally, I care about metadata consistency, and the ext3 documentation > suggests that the journal protects its integrity. Except that it does not > on broken storage devices, and you still need to run fsck there. as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway. what the journal gives you is a reasonable chance of skipping it when the system crashes and you want to get it back up ASAP. David Lang > How you protect your data is another question, but the ext3 > documentation does not claim that the journal protects it, so that's up to > the user, I guess. > Pavel > ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 11:29 ` david @ 2009-08-26 13:10 ` Pavel Machek 2009-08-26 13:43 ` david 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-26 13:10 UTC (permalink / raw) To: david Cc: Rik van Riel, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack >>> The metadata is just a way to get to my data, while the data >>> is actually important. >> >> Personally, I care about metadata consistency, and the ext3 documentation >> suggests that the journal protects its integrity. Except that it does not >> on broken storage devices, and you still need to run fsck there. > > as the ext3 authors have stated many times over the years, you still need > to run fsck periodically anyway. Where is that documented? I very much agree with that, but when suse10 switched periodic fsck off, I could not find any docs to show that it is a bad idea. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 13:10 ` Pavel Machek @ 2009-08-26 13:43 ` david 2009-08-26 18:02 ` Theodore Tso 2009-08-30 7:03 ` Pavel Machek 0 siblings, 2 replies; 309+ messages in thread From: david @ 2009-08-26 13:43 UTC (permalink / raw) To: Pavel Machek Cc: Rik van Riel, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack On Wed, 26 Aug 2009, Pavel Machek wrote: >>>> The metadata is just a way to get to my data, while the data >>>> is actually important. >>> >>> Personally, I care about metadata consistency, and the ext3 documentation >>> suggests that the journal protects its integrity. Except that it does not >>> on broken storage devices, and you still need to run fsck there. >> >> as the ext3 authors have stated many times over the years, you still need >> to run fsck periodically anyway. > > Where is that documented? linux-kernel mailing list archives. David Lang > I very much agree with that, but when suse10 > switched periodic fsck off, I could not find any docs to show that it > is a bad idea. > Pavel > ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 13:43 ` david @ 2009-08-26 18:02 ` Theodore Tso 2009-08-27 6:28 ` Eric Sandeen ` (2 more replies) 2009-08-30 7:03 ` Pavel Machek 1 sibling, 3 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-26 18:02 UTC (permalink / raw) To: david Cc: Pavel Machek, Rik van Riel, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack On Wed, Aug 26, 2009 at 06:43:24AM -0700, david@lang.hm wrote: >>> >>> as the ext3 authors have stated many times over the years, you still need >>> to run fsck periodically anyway. >> >> Where is that documented? > > linux-kernel mailing list archives. Probably from some 6-8 years ago, in e-mail postings that I made. My argument has always been that PC-class hardware is crap, and it's a Really Good Idea to periodically check the metadata because corruption there can end up causing massive data loss. The main problem is that doing it at reboot time really hurt system availability, and "after 20 reboots (plus or minus)" resulted in fsck checks at wildly varying intervals depending on how often people reboot. What I've been recommending for some time is that people use LVM, and run fsck on a snapshot every week or two, at some convenient time when the system load is at a minimum. There is an e2croncheck script in the e2fsprogs sources, in the contrib directory; it's short enough that I'll attach it here. Is it *necessary*? In a world where hardware is perfect, no. In a world where people don't bother buying ECC memory because it's 10% more expensive, and PC builders use the cheapest possible parts --- I think it's a really good idea. - Ted P.S. Patches so that this shell script takes a config file, and/or parses /etc/fstab to automatically figure out which filesystems should be checked, are greatly appreciated.
Getting distros to start including this in their e2fsprogs packaging scripts would also be greatly appreciated.

#!/bin/sh
#
# e2croncheck -- run e2fsck automatically out of /etc/cron.weekly
#
# This script is intended to be run by the system administrator
# periodically from the command line, or to be run once a week
# or so by the cron daemon to check a mounted filesystem (normally
# the root filesystem, but it could be used to check other filesystems
# that are always mounted when the system is booted).
#
# Make sure you customize "VG" so it is your LVM volume group name,
# "VOLUME" so it is the name of the filesystem's logical volume,
# and "EMAIL" to be your e-mail address
#
# Written by Theodore Ts'o, Copyright 2007, 2008, 2009.
#
# This file may be redistributed under the terms of the
# GNU Public License, version 2.
#
VG=ssd
VOLUME=root
SNAPSIZE=100m
EMAIL=sysadmin@example.com

TMPFILE=`mktemp -t e2fsck.log.XXXXXXXXXX`

OPTS="-Fttv -C0"
#OPTS="-Fttv -E fragcheck"

set -e
START="$(date +'%Y%m%d%H%M%S')"
lvcreate -s -L ${SNAPSIZE} -n "${VOLUME}-snap" "${VG}/${VOLUME}"
if nice logsave -as $TMPFILE e2fsck -p $OPTS "/dev/${VG}/${VOLUME}-snap" && \
   nice logsave -as $TMPFILE e2fsck -fy $OPTS "/dev/${VG}/${VOLUME}-snap" ; then
  echo 'Background scrubbing succeeded!'
  tune2fs -C 0 -T "${START}" "/dev/${VG}/${VOLUME}"
else
  echo 'Background scrubbing failed! Reboot to fsck soon!'
  tune2fs -C 16000 -T "19000101" "/dev/${VG}/${VOLUME}"
  if test -n "$EMAIL"; then
    mail -s "E2fsck of /dev/${VG}/${VOLUME} failed!" $EMAIL < $TMPFILE
  fi
fi
lvremove -f "${VG}/${VOLUME}-snap"
rm $TMPFILE

^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 18:02 ` Theodore Tso @ 2009-08-27 6:28 ` Eric Sandeen 2009-11-09 8:53 ` periodic fsck was " Pavel Machek 2009-11-09 8:53 ` Pavel Machek 2 siblings, 0 replies; 309+ messages in thread From: Eric Sandeen @ 2009-08-27 6:28 UTC (permalink / raw) To: Theodore Tso, david, Pavel Machek, Rik van Riel, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack Theodore Tso wrote: > On Wed, Aug 26, 2009 at 06:43:24AM -0700, david@lang.hm wrote: >>>> as the ext3 authors have stated many times over the years, you still need >>>> to run fsck periodicly anyway. >>> Where is that documented? >> linux-kernel mailing list archives. > > Probably from some 6-8 years ago, in e-mail postings that I made. My > argument has always been that PC-class hardware is crap, and it's a > Really Good Idea to periodically check the metadata because corruption > there can end up causing massive data loss. The main problem is that > doing it at reboot time really hurt system availability, and "after 20 > reboots (plus or minus)" resulted in fsck checks at wildly varying > intervals depending on how often people reboot. Aside ... can we default mkfs.ext3 to not set a mandatory fsck interval then? :) -Eric > What I've been recommending for some time is that people use LVM, and > run fsck on a snapshot every week or two, at some convenient time when > the system load is at a minimum. There is an e2croncheck script in > the e2fsprogs sources, in the contrib directory; it's short enough > that I'll attach here here. > > Is it *necessary*? In a world where hardware is perfect, no. In a > world where people don't bother buying ECC memory because it's 10% > more expensive, and PC builders use the cheapest possible parts --- I > think it's a really good idea. > > - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
* periodic fsck was Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 18:02 ` Theodore Tso
  2009-08-27  6:28   ` Eric Sandeen
  2009-11-09  8:53   ` periodic fsck was " Pavel Machek
@ 2009-11-09  8:53   ` Pavel Machek
  2009-11-09 14:05     ` Theodore Tso
  2 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-11-09 8:53 UTC (permalink / raw)
To: Theodore Tso, david, Rik van Riel, Ric Wheeler, Florian Weimer,
    Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
    mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack

On Wed 2009-08-26 14:02:48, Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 06:43:24AM -0700, david@lang.hm wrote:
> >>>
> >>> as the ext3 authors have stated many times over the years, you still need
> >>> to run fsck periodicly anyway.
> >>
> >> Where is that documented?
> >
> > linux-kernel mailing list archives.
>
> Probably from some 6-8 years ago, in e-mail postings that I made.  My
> argument has always been that PC-class hardware is crap, and it's a

Well, in SUSE11-or-so, the distro stopped periodic fscks, silently :-(.
I believed that it was a really bad idea at that point, but because I
could not find a piece of documentation recommending them, I lost the
argument.

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread
* Re: periodic fsck was Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-11-09  8:53 ` Pavel Machek
@ 2009-11-09 14:05   ` Theodore Tso
  2009-11-09 15:58     ` Andreas Dilger
  0 siblings, 1 reply; 309+ messages in thread
From: Theodore Tso @ 2009-11-09 14:05 UTC (permalink / raw)
To: Pavel Machek
Cc: david, Rik van Riel, Ric Wheeler, Florian Weimer,
    Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
    mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack

On Mon, Nov 09, 2009 at 09:53:18AM +0100, Pavel Machek wrote:
>
> Well, in SUSE11-or-so, distro stopped period fscks, silently :-(. I
> believed that it was really bad idea at that point, but because I
> could not find piece of documentation recommending them, I lost the
> argument.

It's an engineering trade-off.  If you have perfect memory that never
has cosmic-ray hiccups, and hard drives that never write data to the
wrong place, etc., then you don't need periodic fsck's.  If you do
have imperfect hardware, the question then is how imperfect your
hardware is, and how frequently it introduces errors.

If you check too frequently, though, users get upset, especially when
it happens at the most inconvenient time (when you're trying to
recover from unscheduled downtime by rebooting); if you check too
infrequently then it doesn't help you much, since too much data gets
damaged before fsck notices.

So these days, what I strongly recommend is that people use LVM
snapshots, and schedule weekly checks during some low usage period
(i.e., 3am on Saturdays), using something like the e2croncheck shell
script.

							- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread
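[Editor's note: Ted's recommendation maps onto a one-line cron entry; the
install path below is hypothetical, and Saturday is field value 6 in the
crontab day-of-week column.]

```
# /etc/crontab fragment (illustrative): run e2croncheck at 3:00am
# every Saturday, a presumably low-usage period.
# m  h  dom mon dow  user  command
  0  3  *   *   6    root  /usr/local/sbin/e2croncheck
```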
* Re: periodic fsck was Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-11-09 14:05 ` Theodore Tso
@ 2009-11-09 15:58   ` Andreas Dilger
  0 siblings, 0 replies; 309+ messages in thread
From: Andreas Dilger @ 2009-11-09 15:58 UTC (permalink / raw)
To: Theodore Tso
Cc: Pavel Machek, david, Rik van Riel, Ric Wheeler, Florian Weimer,
    Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
    mtk.manpages, rdunlap, linux-doc, ext4 development, corbet,
    Jan Kara, Bryan Kadzban, Karel Zak, LVM Mailing List

On 2009-11-09, at 07:05, Theodore Tso wrote:
> So these days, what I strongly recommend is that people use LVM
> snapshots, and schedule weekly checks during some low usage period
> (i.e., 3am on Saturdays), using something like the e2croncheck shell
> script.

There was another script written to do this that handled e2fsck,
reiserfsck, and xfs_check, detecting all volume groups automatically,
along with e.g. validating that the snapshot volume doesn't exist
before starting the check (which may indicate that the previous
e2fsck is still running), and not running while on AC power.  The
last version was in the thread "forced fsck (again?)" dated
2008-01-28.  Would it be better to use that one?

In that thread we discussed not clobbering the last-checked time as
e2croncheck does, so the admin can see how long it has been since the
filesystem was last checked.

Maybe it makes more sense to get the lvcheck script included into
util-linux-ng or lvm2 packages, and have it added automatically to
the cron.weekly directory?  Then the distros could disable the
at-boot checking safely, while still being able to detect corruption
caused by cables/RAM/drives/software.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 13:43 ` david 2009-08-26 18:02 ` Theodore Tso @ 2009-08-30 7:03 ` Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-30 7:03 UTC (permalink / raw) To: david Cc: Rik van Riel, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack On Wed 2009-08-26 06:43:24, david@lang.hm wrote: > On Wed, 26 Aug 2009, Pavel Machek wrote: > >>>>> The metadata is just a way to get to my data, while the data >>>>> is actually important. >>>> >>>> Personally, I care about metadata consistency, and ext3 documentation >>>> suggests that journal protects its integrity. Except that it does not >>>> on broken storage devices, and you still need to run fsck there. >>> >>> as the ext3 authors have stated many times over the years, you still need >>> to run fsck periodicly anyway. >> >> Where is that documented? > > linux-kernel mailing list archives. That's not where fs documentation belongs :-(. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 11:17 ` Pavel Machek
  2009-08-26 11:29   ` david
@ 2009-08-26 12:28   ` Theodore Tso
  2009-08-27  6:06     ` Rob Landley
  1 sibling, 1 reply; 309+ messages in thread
From: Theodore Tso @ 2009-08-26 12:28 UTC (permalink / raw)
To: Pavel Machek
Cc: Rik van Riel, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
    Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
    linux-doc, linux-ext4, corbet

On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote:
> > Metadata takes up such a small part of the disk that fscking
> > it and finding it to be OK is absolutely no guarantee that
> > the data on the filesystem has not been horribly mangled.
> >
> > Personally, what I care about is my data.
> >
> > The metadata is just a way to get to my data, while the data
> > is actually important.
>
> Personally, I care about metadata consistency, and ext3 documentation
> suggests that journal protects its integrity. Except that it does not
> on broken storage devices, and you still need to run fsck there.

Caring about metadata consistency and not data is just weird, I'm
sorry.  I can't imagine anyone who actually *cares* about what they
have stored, whether it's digital photographs of a child taking a
first step, or their thesis research, caring more about the metadata
than the data.  Giving advice that pretends that most users have that
priority is Just Wrong.

That's why what we should document is that people should avoid broken
storage devices, and advice on how to use RAID properly.

At the end of the day, getting people to switch from ext2 to ext3 on
some misguided notion that this way, they'll know when their metadata
is safe (at least in the power failure case; but not the system hangs
and you have to reboot case), and getting them to ignore the question
of why they are using a broken storage device in the first place, is
Documentation malpractice.
- Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 12:28 ` Theodore Tso @ 2009-08-27 6:06 ` Rob Landley 2009-08-27 6:54 ` david 0 siblings, 1 reply; 309+ messages in thread From: Rob Landley @ 2009-08-27 6:06 UTC (permalink / raw) To: Theodore Tso Cc: Pavel Machek, Rik van Riel, Ric Wheeler, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wednesday 26 August 2009 07:28:13 Theodore Tso wrote: > On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote: > > > Metadata takes up such a small part of the disk that fscking > > > it and finding it to be OK is absolutely no guarantee that > > > the data on the filesystem has not been horribly mangled. > > > > > > Personally, what I care about is my data. > > > > > > The metadata is just a way to get to my data, while the data > > > is actually important. > > > > Personally, I care about metadata consistency, and ext3 documentation > > suggests that journal protects its integrity. Except that it does not > > on broken storage devices, and you still need to run fsck there. > > Caring about metadata consistency and not data is just weird, I'm > sorry. I can't imagine anyone who actually *cares* about what they > have stored, whether it's digital photographs of child taking a first > step, or their thesis research, caring about more about the metadata > than the data. Giving advice that pretends that most users have that > priority is Just Wrong. I thought the reason for that was that if your metadata is horked, further writes to the disk can trash unrelated existing data because it's lost track of what's allocated and what isn't. So back when the assumption was "what's written stays written", then keeping the metadata sane was still darn important to prevent normal operation from overwriting unrelated existing data. 
Then Pavel notified us of a situation where interrupted writes to the disk can trash unrelated existing data _anyway_, because the flash block size on the 16 gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks it's 4k or smaller. It seems like what _broke_ was the assumption that the filesystem block size >= the disk block size, and nobody noticed for a while. (Except the people making jffs2 and friends, anyway.) Today we have cheap plentiful USB keys that act like hard drives, except that their write block size isn't remotely the same as hard drives', but they pretend it is, and then the block wear levelling algorithms fuzz things further. (Gee, a drive controller lying about drive geometry, the scsi crowd should feel right at home.) Now Pavel's coming back with a second situation where RAID stripes (under certain circumstances) seem to have similar granularity issues, again breaking what seems to be the same assumption. Big media use big chunks for data, and media is getting bigger. It doesn't seem like this problem is going to diminish in future. I agree that it seems like a good idea to have BIG RED WARNING SIGNS about those kind of media and how _any_ journaling filesystem doesn't really help here. So specifically documenting "These kinds of media lose unrelated random data if writes to them are interrupted, journaling filesystems can't help with this and may actually hide the problem, and even an fsck will only find corrupted metadata not lost file contents" seems kind of useful. That said, ext3's assumption that filesystem block size always >= disk update block size _is_ a fundamental part of this problem, and one that isn't shared by things like jffs2, and which things like btrfs might be able to address if they try, by adding awareness of the real media update granularity to their node layout algorithms. (Heck, ext2 has a stripe size parameter already. Does setting that appropriately for your raid make this suck less? 
I haven't heard anybody comment on that one yet...) Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-27 6:06 ` Rob Landley @ 2009-08-27 6:54 ` david 2009-08-27 7:34 ` Rob Landley 2009-08-30 7:19 ` Pavel Machek 0 siblings, 2 replies; 309+ messages in thread From: david @ 2009-08-27 6:54 UTC (permalink / raw) To: Rob Landley Cc: Theodore Tso, Pavel Machek, Rik van Riel, Ric Wheeler, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thu, 27 Aug 2009, Rob Landley wrote: > On Wednesday 26 August 2009 07:28:13 Theodore Tso wrote: >> On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote: >>>> Metadata takes up such a small part of the disk that fscking >>>> it and finding it to be OK is absolutely no guarantee that >>>> the data on the filesystem has not been horribly mangled. >>>> >>>> Personally, what I care about is my data. >>>> >>>> The metadata is just a way to get to my data, while the data >>>> is actually important. >>> >>> Personally, I care about metadata consistency, and ext3 documentation >>> suggests that journal protects its integrity. Except that it does not >>> on broken storage devices, and you still need to run fsck there. >> >> Caring about metadata consistency and not data is just weird, I'm >> sorry. I can't imagine anyone who actually *cares* about what they >> have stored, whether it's digital photographs of child taking a first >> step, or their thesis research, caring about more about the metadata >> than the data. Giving advice that pretends that most users have that >> priority is Just Wrong. > > I thought the reason for that was that if your metadata is horked, further > writes to the disk can trash unrelated existing data because it's lost track > of what's allocated and what isn't. So back when the assumption was "what's > written stays written", then keeping the metadata sane was still darn > important to prevent normal operation from overwriting unrelated existing > data. 
>
> Then Pavel notified us of a situation where interrupted writes to the disk can
> trash unrelated existing data _anyway_, because the flash block size on the 16
> gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks
> it's 4k or smaller.  It seems like what _broke_ was the assumption that the
> filesystem block size >= the disk block size, and nobody noticed for a while.
> (Except the people making jffs2 and friends, anyway.)
>
> Today we have cheap plentiful USB keys that act like hard drives, except that
> their write block size isn't remotely the same as hard drives', but they
> pretend it is, and then the block wear levelling algorithms fuzz things
> further.  (Gee, a drive controller lying about drive geometry, the scsi crowd
> should feel right at home.)

actually, you don't know if your USB key works that way or not.  Pavel
has some that do, that doesn't mean that all flash drives do.

when you do a write to a flash drive you have to do the following items

1. allocate an empty eraseblock to put the data on

2. read the old eraseblock

3. merge the incoming write to the eraseblock

4. write the updated data to the flash

5. update the flash translation layer to point reads at the new
   location instead of the old location.

now if the flash drive does things in this order you will not lose any
previously written data.

if the flash drive does step 5 before it does step 4, then you have a
window where a crash can lose data (and no, btrfs won't survive having
a large chunk of data just disappear any better)

it's possible that some super-cheap flash drives skip having a flash
translation layer entirely, on those the process would be

1. read the old data into ram

2. merge the new write into the data in ram

3. erase the old data

4. write the new data

this obviously has a significant data loss window.

but if the device doesn't have a flash translation layer, then
repeated writes to any one sector will kill the drive fairly quickly.
(updates to the FAT would kill the sectors the FAT, journal, root directory, or superblock lives in due to the fact that every change to the disk requires an update to this file for example) > Now Pavel's coming back with a second situation where RAID stripes (under > certain circumstances) seem to have similar granularity issues, again breaking > what seems to be the same assumption. Big media use big chunks for data, and > media is getting bigger. It doesn't seem like this problem is going to > diminish in future. > > I agree that it seems like a good idea to have BIG RED WARNING SIGNS about > those kind of media and how _any_ journaling filesystem doesn't really help > here. So specifically documenting "These kinds of media lose unrelated random > data if writes to them are interrupted, journaling filesystems can't help with > this and may actually hide the problem, and even an fsck will only find > corrupted metadata not lost file contents" seems kind of useful. I think an update to the documentation is a good thing (especially after learning that a raid 6 array that has lost a single disk can still be corrupted during a powerfail situation), but I also agree that Pavel's wording is not detailed enough > That said, ext3's assumption that filesystem block size always >= disk update > block size _is_ a fundamental part of this problem, and one that isn't shared > by things like jffs2, and which things like btrfs might be able to address if > they try, by adding awareness of the real media update granularity to their > node layout algorithms. (Heck, ext2 has a stripe size parameter already. > Does setting that appropriately for your raid make this suck less? I haven't > heard anybody comment on that one yet...) I thought that that assumption was in the VFS layer, not in any particular filesystem David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
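[Editor's note: the ordering argument in the five steps above can be made
concrete with a toy model. This is an illustrative sketch, not any vendor's
actual firmware; the function names and the dict-based "drive" are
hypothetical.]

```python
# Toy flash-translation-layer model: a "drive" is a dict of eraseblocks
# plus a mapping from logical block to physical eraseblock.  We simulate
# a power cut between the last two steps under both orderings.

def write_safe(blocks, mapping, lba, data, crash_before_map_update=False):
    """Steps 1-4 first, step 5 (mapping update) last."""
    new = max(blocks) + 1                # 1. allocate an empty eraseblock
    merged = dict(blocks[mapping[lba]])  # 2. read the old eraseblock
    merged.update(data)                  # 3. merge the incoming write
    blocks[new] = merged                 # 4. write the updated data
    if crash_before_map_update:
        return                           # power fails: old mapping still valid
    mapping[lba] = new                   # 5. point reads at the new location

def write_unsafe(blocks, mapping, lba, data, crash_before_data_write=False):
    """Step 5 done before step 4: the data-loss window described above."""
    old = mapping[lba]
    new = max(blocks) + 1
    mapping[lba] = new                   # mapping updated first...
    if crash_before_data_write:
        return                           # ...power fails before data lands
    merged = dict(blocks[old])
    merged.update(data)
    blocks[new] = merged

blocks, mapping = {0: {"a": "old"}}, {0: 0}
write_safe(blocks, mapping, 0, {"b": "new"}, crash_before_map_update=True)
print(blocks[mapping[0]])      # → {'a': 'old'}  (old data intact)

blocks, mapping = {0: {"a": "old"}}, {0: 0}
write_unsafe(blocks, mapping, 0, {"b": "new"}, crash_before_data_write=True)
print(blocks.get(mapping[0]))  # → None  (mapping points at unwritten block)
```

With the safe ordering an interrupted write simply loses the new data; with
the unsafe ordering it also loses the previously written data.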
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-27 6:54 ` david @ 2009-08-27 7:34 ` Rob Landley 2009-08-28 14:37 ` david 2009-08-30 7:19 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: Rob Landley @ 2009-08-27 7:34 UTC (permalink / raw) To: david Cc: Theodore Tso, Pavel Machek, Rik van Riel, Ric Wheeler, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thursday 27 August 2009 01:54:30 david@lang.hm wrote: > On Thu, 27 Aug 2009, Rob Landley wrote: > > On Wednesday 26 August 2009 07:28:13 Theodore Tso wrote: > >> On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote: > >>>> Metadata takes up such a small part of the disk that fscking > >>>> it and finding it to be OK is absolutely no guarantee that > >>>> the data on the filesystem has not been horribly mangled. > >>>> > >>>> Personally, what I care about is my data. > >>>> > >>>> The metadata is just a way to get to my data, while the data > >>>> is actually important. > >>> > >>> Personally, I care about metadata consistency, and ext3 documentation > >>> suggests that journal protects its integrity. Except that it does not > >>> on broken storage devices, and you still need to run fsck there. > >> > >> Caring about metadata consistency and not data is just weird, I'm > >> sorry. I can't imagine anyone who actually *cares* about what they > >> have stored, whether it's digital photographs of child taking a first > >> step, or their thesis research, caring about more about the metadata > >> than the data. Giving advice that pretends that most users have that > >> priority is Just Wrong. > > > > I thought the reason for that was that if your metadata is horked, > > further writes to the disk can trash unrelated existing data because it's > > lost track of what's allocated and what isn't. 
So back when the > > assumption was "what's written stays written", then keeping the metadata > > sane was still darn important to prevent normal operation from > > overwriting unrelated existing data. > > > > Then Pavel notified us of a situation where interrupted writes to the > > disk can trash unrelated existing data _anyway_, because the flash block > > size on the 16 gig flash key I bought retail at Fry's is 2 megabytes, and > > the filesystem thinks it's 4k or smaller. It seems like what _broke_ was > > the assumption that the filesystem block size >= the disk block size, and > > nobody noticed for a while. (Except the people making jffs2 and friends, > > anyway.) > > > > Today we have cheap plentiful USB keys that act like hard drives, except > > that their write block size isn't remotely the same as hard drives', but > > they pretend it is, and then the block wear levelling algorithms fuzz > > things further. (Gee, a drive controller lying about drive geometry, the > > scsi crowd should feel right at home.) > > actually, you don't know if your USB key works that way or not. Um, yes, I think I do. > Pavel has ssome that do, that doesn't mean that all flash drives do Pretty much all the ones that present a USB disk interface to the outside world and then thus have to do hardware levelling. Here's Valerie Aurora on the topic: http://valhenson.livejournal.com/25228.html >Let's start with hardware wear-leveling. Basically, nearly all practical > implementations of it suck. You'd imagine that it would spread out writes > over all the blocks in the drive, only rewriting any particular block after > every other block has been written. But I've heard from experts several > times that hardware wear-leveling can be as dumb as a ring buffer of 12 > blocks; each time you write a block, it pulls something out of the queue > and sticks the old block in. If you only write one block over and over, > this means that writes will be spread out over a staggering 12 blocks! 
My > direct experience working with corrupted flash with built-in wear-leveling > is that corruption was centered around frequently written blocks (with > interesting patterns resulting from the interleaving of blocks from > different erase blocks). As a file systems person, I know what it takes to > do high-quality wear-leveling: it's called a log-structured file system and > they are non-trivial pieces of software. Your average consumer SSD is not > going to have sufficient hardware to implement even a half-assed > log-structured file system, so clearly it's going to be a lot stupider than > that. Back to you: > when you do a write to a flash drive you have to do the following items > > 1. allocate an empty eraseblock to put the data on > > 2. read the old eraseblock > > 3. merge the incoming write to the eraseblock > > 4. write the updated data to the flash > > 5. update the flash trnslation layer to point reads at the new location > instead of the old location. > > now if the flash drive does things in this order you will not loose any > previously written data. That's what something like jffs2 will do, sure. (And note that mounting those suckers is slow while it reads the whole disk to figure out what order to put the chunks in.) However, your average consumer level device A) isn't very smart, B) is judged almost entirely by price/capacity ratio and thus usually won't even hide capacity for bad block remapping. You expect them to have significant hidden capacity to do safer updates with when customers aren't demanding it yet? > if the flash drive does step 5 before it does step 4, then you have a > window where a crash can loose data (and no btrfs won't survive any better > to have a large chunk of data just disappear) > > it's possible that some super-cheap flash drives I've never seen one that presented a USB disk interface that _didn't_ do this. (Not that this observation means much.) Neither the windows nor the Macintosh world is calling for this yet. 
Even the Linux guys barely know about it.  And these are the same
kinds of manufacturers that NOPed out the flush commands to make their
benchmarks look better...

> but if the device doesn't have a flash translation layer, then repeated
> writes to any one sector will kill the drive fairly quickly. (updates to
> the FAT would kill the sectors the FAT, journal, root directory, or
> superblock lives in due to the fact that every change to the disk requires
> an update to this file for example)

Yup.  It's got enough of one to get past the warranty, but beyond that
they're intended for archiving and sneakernet, not for running
compiles on.

> > That said, ext3's assumption that filesystem block size always >= disk
> > update block size _is_ a fundamental part of this problem, and one that
> > isn't shared by things like jffs2, and which things like btrfs might be
> > able to address if they try, by adding awareness of the real media update
> > granularity to their node layout algorithms.  (Heck, ext2 has a stripe
> > size parameter already.  Does setting that appropriately for your raid
> > make this suck less?  I haven't heard anybody comment on that one yet...)
>
> I thought that that assumption was in the VFS layer, not in any particular
> filesystem

The VFS layer cares about how to talk to the backing store?  I thought
that was the filesystem driver's job...  I wonder how jffs2 gets
around it, then?  (Or for that matter, squashfs...)

> David Lang

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread
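[Editor's note: the "ring buffer of 12 blocks" wear-leveling that Valerie
Aurora describes in the quoted post can be illustrated with a small
simulation. This is a toy model, not a description of any real device;
`simulate` and its parameters are hypothetical.]

```python
# Repeated writes to one logical sector through a dumb 12-slot reuse
# ring concentrate wear on 12 physical blocks, versus spreading it
# across the whole device under ideal wear-leveling.
from collections import Counter

def simulate(writes, pool_size, device_blocks):
    erases = Counter()
    for i in range(writes):
        phys = i % pool_size          # dumb round-robin over a tiny pool
        erases[phys] += 1
    worst_ring = max(erases.values()) # erases on the hottest block
    ideal = -(-writes // device_blocks)  # ceil: perfect device-wide spread
    return worst_ring, ideal

worst, ideal = simulate(writes=120_000, pool_size=12, device_blocks=8192)
print(worst, ideal)   # → 10000 15
```

With 100k-cycle flash, the ring-buffer scheme burns out its hottest block
three orders of magnitude sooner than device-wide leveling would.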
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-27 7:34 ` Rob Landley @ 2009-08-28 14:37 ` david 0 siblings, 0 replies; 309+ messages in thread From: david @ 2009-08-28 14:37 UTC (permalink / raw) To: Rob Landley Cc: Theodore Tso, Pavel Machek, Rik van Riel, Ric Wheeler, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thu, 27 Aug 2009, Rob Landley wrote: > On Thursday 27 August 2009 01:54:30 david@lang.hm wrote: >> On Thu, 27 Aug 2009, Rob Landley wrote: >>> >>> Today we have cheap plentiful USB keys that act like hard drives, except >>> that their write block size isn't remotely the same as hard drives', but >>> they pretend it is, and then the block wear levelling algorithms fuzz >>> things further. (Gee, a drive controller lying about drive geometry, the >>> scsi crowd should feel right at home.) >> >> actually, you don't know if your USB key works that way or not. > > Um, yes, I think I do. > >> Pavel has ssome that do, that doesn't mean that all flash drives do > > Pretty much all the ones that present a USB disk interface to the outside > world and then thus have to do hardware levelling. Here's Valerie Aurora on > the topic: > > http://valhenson.livejournal.com/25228.html > >> Let's start with hardware wear-leveling. Basically, nearly all practical >> implementations of it suck. You'd imagine that it would spread out writes >> over all the blocks in the drive, only rewriting any particular block after >> every other block has been written. But I've heard from experts several >> times that hardware wear-leveling can be as dumb as a ring buffer of 12 >> blocks; each time you write a block, it pulls something out of the queue >> and sticks the old block in. If you only write one block over and over, >> this means that writes will be spread out over a staggering 12 blocks! 
>> My direct experience working with corrupted flash with built-in wear-leveling
>> is that corruption was centered around frequently written blocks (with
>> interesting patterns resulting from the interleaving of blocks from
>> different erase blocks). As a file systems person, I know what it takes to
>> do high-quality wear-leveling: it's called a log-structured file system and
>> they are non-trivial pieces of software. Your average consumer SSD is not
>> going to have sufficient hardware to implement even a half-assed
>> log-structured file system, so clearly it's going to be a lot stupider than
>> that.
>
> Back to you:

I am not saying that all devices get this right (not by any means), but I _am_ saying that devices with wear-leveling _can_ avoid this problem entirely; you do not need to do a log-structured filesystem. all you need to do is always write to a new block rather than re-writing a block in place.

even if the disk only does a 12-block rotation for its wear leveling, that is enough for it to not lose other data when you write. to lose data you have to be updating a block in place by erasing the old one first. _anything_ that writes the data to a new location before it erases the old location will prevent you from losing other data.

I'm all for documenting that this problem can and does exist, but I'm not in agreement with documentation that states that _all_ flash drives have this problem, because (with wear-leveling in a flash translation layer on the device) it's not inherent to the technology. so even if all existing flash devices had this problem, there could be one released tomorrow that didn't.

this is like the problem that flash SSDs had last year that could cause them to stall for up to a second on write-heavy workloads. it went from a problem that almost every drive for sale had (and something that was generally accepted as being a characteristic of SSDs), to being extinct in about one product cycle after the problem was identified.
I think this problem will also disappear rapidly once it's publicised. so what's needed is for someone to come up with a way to test this, let people test the various devices, find out how broad the problem is, and publicise the results. personally, I expect that the better disk-replacements will not have a problem with this. I would also be surprised if the larger thumb drives had this problem.

if a flash eraseblock can be used 100k times, then if you use FAT on a 16G drive and write 1M files and update the FAT after each file (like you would with a camera), the block the FAT is on will die after filling the device _6_ times. if it does a 12-block rotation it would die after 72 times, but if it can move the blocks around the entire device it would take 50k times of filling the device. for a 2G device the numbers would be 50 times with no wear-leveling and 600 times with 12-block rotation.

so I could see them getting away with this sort of thing for the smaller devices, but as the thumb drives get larger, I expect that they will start to gain the wear-leveling capabilities that the SSDs have.

>> when you do a write to a flash drive you have to do the following items
>>
>> 1. allocate an empty eraseblock to put the data on
>>
>> 2. read the old eraseblock
>>
>> 3. merge the incoming write to the eraseblock
>>
>> 4. write the updated data to the flash
>>
>> 5. update the flash translation layer to point reads at the new location
>> instead of the old location.
>>
>> now if the flash drive does things in this order you will not lose any
>> previously written data.
>
> That's what something like jffs2 will do, sure. (And note that mounting those
> suckers is slow while it reads the whole disk to figure out what order to put
> the chunks in.)
>
> However, your average consumer level device A) isn't very smart, B) is judged
> almost entirely by price/capacity ratio and thus usually won't even hide
> capacity for bad block remapping.
> You expect them to have significant hidden
> capacity to do safer updates with when customers aren't demanding it yet?

this doesn't require filesystem smarts, but it does require a device with enough smarts to do bad-block remapping. (if it does wear leveling, all that bad-block remapping amounts to is not writing to a bad eraseblock, which doesn't even require maintaining a map of such blocks: all it would have to do is check whether what is on the flash is what it intended to write; if it is, use it, if it isn't, try again.)

>> if the flash drive does step 5 before it does step 4, then you have a
>> window where a crash can lose data (and no, btrfs won't survive any better
>> having a large chunk of data just disappear)
>>
>> it's possible that some super-cheap flash drives
>
> I've never seen one that presented a USB disk interface that _didn't_ do this.
> (Not that this observation means much.) Neither the windows nor the Macintosh
> world is calling for this yet. Even the Linux guys barely know about it. And
> these are the same kinds of manufacturers that NOPed out the flush commands to
> make their benchmarks look better...

the nature of the FAT filesystem calls for it. I've heard people talk about devices that try to be smart enough to take extra care of the blocks that the FAT is on.

>> but if the device doesn't have a flash translation layer, then repeated
>> writes to any one sector will kill the drive fairly quickly. (updates to
>> the FAT would kill the sectors the FAT, journal, root directory, or
>> superblock lives in due to the fact that every change to the disk requires
>> an update to this file for example)
>
> Yup. It's got enough of one to get past the warranty, but beyond that they're
> intended for archiving and sneakernet, not for running compiles on.
it doesn't take them being used for compiles; using them in a camera, media player, or phone with a FAT filesystem will exercise the FAT blocks enough to cause problems.

>>> That said, ext3's assumption that filesystem block size always >= disk
>>> update block size _is_ a fundamental part of this problem, and one that
>>> isn't shared by things like jffs2, and which things like btrfs might be
>>> able to address if they try, by adding awareness of the real media update
>>> granularity to their node layout algorithms. (Heck, ext2 has a stripe
>>> size parameter already. Does setting that appropriately for your raid
>>> make this suck less? I haven't heard anybody comment on that one yet...)
>>
>> I thought that that assumption was in the VFS layer, not in any particular
>> filesystem
>
> The VFS layer cares about how to talk to the backing store? I thought that
> was the filesystem driver's job...

I could be mistaken, but I have run into cases with filesystems where the filesystem was designed to be able to use large blocks, but they could only be used on specific architectures because the disk block size had to be smaller than the page size.

> I wonder how jffs2 gets around it, then? (Or for that matter, squashfs...)

if you know where the eraseblock boundaries are, all you need to do is submit your writes in groups of blocks corresponding to those boundaries. there is no need to make the blocks themselves the size of the eraseblocks. any filesystem that is doing compressed storage is going to end up dealing with logical changes that span many different disk blocks.

I thought that squashfs was read-only (you create a filesystem image, burn it to flash, then use it).

as I say, I could be completely misunderstanding this interaction.

David Lang

^ permalink raw reply [flat|nested] 309+ messages in thread
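David Lang's endurance arithmetic above can be checked with a few lines of Python. The 100k erase-cycle figure and the one-FAT-update-per-1M-file model are his; the 2MB eraseblock size is an assumption, and note his "72" for the 12-block case is a rough 6 × 12 (exact integer math gives 75):

```python
ERASE_CYCLES = 100_000     # assumed per-eraseblock endurance
ERASEBLOCK_MB = 2          # assumed erase-unit size

def fills_until_fat_dies(drive_mb, file_mb=1, leveling_span=1):
    """How many times the device can be filled with file_mb-sized files
    before the eraseblock(s) holding the FAT wear out.  Each file costs
    one FAT update; leveling_span is how many eraseblocks the controller
    rotates those updates across (1 = no wear leveling)."""
    fat_updates_per_fill = drive_mb // file_mb
    return leveling_span * ERASE_CYCLES // fat_updates_per_fill

print(fills_until_fat_dies(16_000))                    # 16G, no leveling: 6
print(fills_until_fat_dies(16_000, leveling_span=12))  # 12-block ring: 75 (~the "72" above)
print(fills_until_fat_dies(16_000,                     # leveling across the
      leveling_span=16_000 // ERASEBLOCK_MB))          # whole device: 50000
print(fills_until_fat_dies(2_000))                     # 2G, no leveling: 50
print(fills_until_fat_dies(2_000, leveling_span=12))   # 2G, 12-block ring: 600
```

The takeaway matches the message: without leveling the FAT's eraseblock dies after single-digit device fills, and only whole-device wear leveling pushes the figure into harmless territory.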
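The "submit your writes in groups of blocks corresponding to those boundaries" idea above can be sketched as a small helper; the 128k erase-unit and 4k filesystem-block sizes are illustrative assumptions, not properties of any particular device:

```python
ERASEBLOCK = 128 * 1024   # assumed erase-unit size in bytes
FS_BLOCK = 4096           # assumed filesystem block size in bytes

def group_by_eraseblock(fs_blocks):
    """Bucket filesystem-block numbers by the erase unit they fall in,
    so each erase unit is rewritten at most once per batch of writes,
    without making filesystem blocks as large as eraseblocks."""
    per_unit = ERASEBLOCK // FS_BLOCK      # 32 fs blocks per erase unit
    groups = {}
    for b in sorted(fs_blocks):
        groups.setdefault(b // per_unit, []).append(b)
    return groups

print(group_by_eraseblock([0, 5, 31, 32, 100]))
# {0: [0, 5, 31], 1: [32], 3: [100]}
```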
* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27 6:54 ` david
  2009-08-27 7:34 ` Rob Landley
@ 2009-08-30 7:19 ` Pavel Machek
  2009-08-30 12:48 ` david
  1 sibling, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-30 7:19 UTC (permalink / raw)
To: david
Cc: Rob Landley, Theodore Tso, Rik van Riel, Ric Wheeler, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Hi!

>> I thought the reason for that was that if your metadata is horked, further
>> writes to the disk can trash unrelated existing data because it's lost track
>> of what's allocated and what isn't. So back when the assumption was "what's
>> written stays written", then keeping the metadata sane was still darn
>> important to prevent normal operation from overwriting unrelated existing
>> data.
>>
>> Then Pavel notified us of a situation where interrupted writes to the disk can
>> trash unrelated existing data _anyway_, because the flash block size on the 16
>> gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks
>> it's 4k or smaller. It seems like what _broke_ was the assumption that the
>> filesystem block size >= the disk block size, and nobody noticed for a while.
>> (Except the people making jffs2 and friends, anyway.)
>>
>> Today we have cheap plentiful USB keys that act like hard drives, except that
>> their write block size isn't remotely the same as hard drives', but they
>> pretend it is, and then the block wear levelling algorithms fuzz things
>> further. (Gee, a drive controller lying about drive geometry, the scsi crowd
>> should feel right at home.)
>
> actually, you don't know if your USB key works that way or not. Pavel has
> some that do, that doesn't mean that all flash drives do
>
> when you do a write to a flash drive you have to do the following items
>
> 1. allocate an empty eraseblock to put the data on
>
> 2. read the old eraseblock
>
> 3.
> merge the incoming write to the eraseblock
>
> 4. write the updated data to the flash
>
> 5. update the flash translation layer to point reads at the new location
> instead of the old location.

That would need two erases per single sector written, no? Erase is in the millisecond range, so the performance would be just way too bad :-(.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-30 7:19 ` Pavel Machek
@ 2009-08-30 12:48 ` david
  0 siblings, 0 replies; 309+ messages in thread
From: david @ 2009-08-30 12:48 UTC (permalink / raw)
To: Pavel Machek
Cc: Rob Landley, Theodore Tso, Rik van Riel, Ric Wheeler, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Sun, 30 Aug 2009, Pavel Machek wrote:

>>> I thought the reason for that was that if your metadata is horked, further
>>> writes to the disk can trash unrelated existing data because it's lost track
>>> of what's allocated and what isn't. So back when the assumption was "what's
>>> written stays written", then keeping the metadata sane was still darn
>>> important to prevent normal operation from overwriting unrelated existing
>>> data.
>>>
>>> Then Pavel notified us of a situation where interrupted writes to the disk can
>>> trash unrelated existing data _anyway_, because the flash block size on the 16
>>> gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks
>>> it's 4k or smaller. It seems like what _broke_ was the assumption that the
>>> filesystem block size >= the disk block size, and nobody noticed for a while.
>>> (Except the people making jffs2 and friends, anyway.)
>>>
>>> Today we have cheap plentiful USB keys that act like hard drives, except that
>>> their write block size isn't remotely the same as hard drives', but they
>>> pretend it is, and then the block wear levelling algorithms fuzz things
>>> further. (Gee, a drive controller lying about drive geometry, the scsi crowd
>>> should feel right at home.)
>>
>> actually, you don't know if your USB key works that way or not. Pavel has
>> some that do, that doesn't mean that all flash drives do
>>
>> when you do a write to a flash drive you have to do the following items
>>
>> 1. allocate an empty eraseblock to put the data on
>>
>> 2.
>> read the old eraseblock
>>
>> 3. merge the incoming write to the eraseblock
>>
>> 4. write the updated data to the flash
>>
>> 5. update the flash translation layer to point reads at the new location
>> instead of the old location.
>
> That would need two erases per single sector written, no? Erase is in
> millisecond range, so the performance would be just way too bad :-(.

no, it only needs one erase.

if you don't have a pool of pre-erased blocks, then you need to do an erase of the new block you are allocating (before step 4).

if you do have a pool of pre-erased blocks, then you don't have to do any erase of the data blocks until after step 5, and you do the erase when you add the old data block to the pool of pre-erased blocks later.

in either case the requirements of wear leveling require that the flash translation layer update its records to show that an additional write took place.

what appears to be happening on some cheap devices is that they do the following instead:

1. allocate an empty eraseblock to put the data on

2. read the old eraseblock

3. merge the incoming write to the eraseblock

4. erase the old eraseblock

5. write the updated data to the flash

I don't know where in (or after) this process they update the wear-leveling/flash translation layer info.

with this algorithm, if the device loses power between step 4 and step 5 you lose all the data on the eraseblock.

with deferred erasing of blocks, the safer algorithm is actually the faster one (up until you run out of your pool of available eraseblocks, at which time it slows down to the same speed as the unreliable one).

most flash drives are fairly slow to write to in any case. even the Intel X25-M drives are in the same ballpark as rotating media for writes. as far as I know only the X25-E SSD drives are faster to write to than rotating media, and most of them are _far_ slower.

David Lang

^ permalink raw reply [flat|nested] 309+ messages in thread
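The difference between the two orderings debated above can be illustrated with a toy model (all names hypothetical; logical sectors map to physical eraseblocks through a translation table). Cutting power between the erase and the rewrite in the cheap ordering trashes unrelated sectors sharing the eraseblock, while the write-then-remap ordering never exposes more than the sector being updated:

```python
def cheap_update(blocks, ftl, sector, data, power_fails=False):
    """Ordering described for some cheap devices: erase the old
    eraseblock *before* the merged copy is written back."""
    phys = ftl[sector]
    merged = dict(blocks[phys])
    merged[sector] = data            # merge incoming write
    blocks[phys] = {}                # erase the old eraseblock
    if power_fails:
        return                       # power lost here -> whole block gone
    blocks[phys] = merged            # write the updated data

def safe_update(blocks, ftl, sector, data, power_fails=False):
    """Write the merged copy to a fresh eraseblock first, then flip the
    translation entry; a crash leaves either the old or the new data."""
    phys = ftl[sector]
    new_phys = max(blocks) + 1       # allocate an empty eraseblock
    merged = dict(blocks[phys])
    merged[sector] = data            # merge incoming write
    blocks[new_phys] = merged        # write the updated data to flash
    if power_fails:
        return                       # old mapping still points at old data
    ftl[sector] = new_phys           # update the translation layer

# Sectors A and B share eraseblock 0; power fails mid-update of A.
blocks, ftl = {0: {'A': 'a0', 'B': 'b0'}}, {'A': 0, 'B': 0}
cheap_update(blocks, ftl, 'A', 'a1', power_fails=True)
print(blocks[ftl['B']].get('B'))     # None -- unrelated sector B trashed

blocks, ftl = {0: {'A': 'a0', 'B': 'b0'}}, {'A': 0, 'B': 0}
safe_update(blocks, ftl, 'A', 'a1', power_fails=True)
print(blocks[ftl['B']]['B'])         # 'b0' -- B survives, A keeps old value
```

This is only a sketch of the ordering argument, not of any real controller; real FTLs also journal the mapping update itself, which is glossed over here.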
* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 3:32 ` Rik van Riel
  2009-08-26 11:17 ` Pavel Machek
@ 2009-08-27 5:27 ` Rob Landley
  1 sibling, 0 replies; 309+ messages in thread
From: Rob Landley @ 2009-08-27 5:27 UTC (permalink / raw)
To: Rik van Riel
Cc: Pavel Machek, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Tuesday 25 August 2009 22:32:47 Rik van Riel wrote:
> Pavel Machek wrote:
> >> So, would you be happy if ext3 fsck was always run on reboot (at least
> >> for flash devices)?
> >
> > For flash devices, MD Raid 5 and anything else that needs it; yes that
> > would make me happy ;-).
>
> Sorry, but that just shows your naivete.

Hence wanting documentation properly explaining the situation, yes. Often the people writing the documentation aren't the people who know the most about the situation, but the people who found out they NEED said documentation, and post errors until they get sufficient corrections. In which case "you're wrong, it's actually _this_" is helpful, and "you're wrong, go away and stop bothering us grown-ups" isn't.

> Metadata takes up such a small part of the disk that fscking
> it and finding it to be OK is absolutely no guarantee that
> the data on the filesystem has not been horribly mangled.
>
> Personally, what I care about is my data.
>
> The metadata is just a way to get to my data, while the data
> is actually important.

Are you saying ext3 should default to data=journal then? It seems that the default journaling only handles the metadata, and people seem to think that journaled filesystems exist for a reason.

There seems to be a lot of "the guarantees you think a journal provides aren't worth anything, so the fact there are circumstances under which it doesn't provide them isn't worth telling anybody about" in this thread. So we shouldn't bother with journaled filesystems?
I'm not sure what the intended argument is here... I have no clue what the finished documentation on this issue should look like either. But I want to read it. Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
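For reference, the full data journaling Rob asks about (as opposed to ext3's metadata-only default, ordered mode) is selected at mount time; the device names below are examples only, and `hdparm -W0` is the write-cache workaround from the patch at the top of the thread:

```shell
# Mount an ext3 filesystem with full data journaling instead of the
# default ordered mode (which journals metadata only):
mount -t ext3 -o data=journal /dev/sdb1 /mnt

# Or store that as the filesystem's default mount option in the
# superblock:
tune2fs -o journal_data /dev/sdb1

# Disable the drive's write cache when barriers aren't in use
# (SATA/IDE disks):
hdparm -W0 /dev/sdb
```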
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 23:00 ` Pavel Machek 2009-08-25 0:02 ` david 2009-08-25 0:06 ` Ric Wheeler @ 2009-08-25 0:08 ` Theodore Tso 2009-08-25 9:42 ` Pavel Machek ` (3 more replies) 2 siblings, 4 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-25 0:08 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue, Aug 25, 2009 at 01:00:36AM +0200, Pavel Machek wrote: > Then to answer your question... ext2. You expect to run fsck after > unclean shutdown, and you expect to have to solve some problems with > it. So the way ext2 deals with the flash media actually matches what > the user expects. (*) But if the 256k hole is in data blocks, fsck won't find a problem, even with ext2. And if the 256k hole is the inode table, you will *still* suffer massive data loss. Fsck will tell you how badly screwed you are, but it doesn't "fix" the disk; most users don't consider questions of the form "directory entry <precious-thesis-data> points to trashed inode, may I delete directory entry?" as being terribly helpful. :-/ > OTOH in ext3 case you expect consistent filesystem after unplug; and > you don't get that. You don't get a consistent filesystem with ext2, either. And if your claim is that several hundred lines of fsck output detailing the filesystem's destruction somehow makes things all better, I suspect most users would disagree with you. In any case, depending on where the flash was writing at the time of the unplug, the data corruption could be silent anyway. Maybe this came as a surprise to you, but anyone who has used a compact flash in a digital camera knows that you ***have*** to wait until the led has gone out before trying to eject the flash card. 
I remember seeing all sorts of horror stories from professional photographers about how they lost an important wedding day's worth of pictures, with the attendant commercial loss, on various digital photography forums. It tends to be the sort of mistake that digital photographers only make once.

(It's worse with people using Digital SLR's shooting in raw mode, since it can take upwards of 30 seconds or more to write out a 12-30MB raw image, and if you eject at the wrong time, you can trash the contents of the entire CF card; in the worst case, the Flash Translation Layer data can get corrupted, and the card is completely ruined; you can't even reformat it at the filesystem level, but have to get a special Windows program from the CF manufacturer to --maybe-- reset the FTL layer. Early CF cards were especially vulnerable to this; more recent CF cards are better, but it's a known failure mode of CF cards.)

- Ted

^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 0:08 ` Theodore Tso
@ 2009-08-25 9:42 ` Pavel Machek
  2009-08-25 9:42 ` Pavel Machek
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25 9:42 UTC (permalink / raw)
To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel

On Mon 2009-08-24 20:08:42, Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 01:00:36AM +0200, Pavel Machek wrote:
> > Then to answer your question... ext2. You expect to run fsck after
> > unclean shutdown, and you expect to have to solve some problems with
> > it. So the way ext2 deals with the flash media actually matches what
> > the user expects. (*)
>
> But if the 256k hole is in data blocks, fsck won't find a problem,
> even with ext2.

True.

> And if the 256k hole is the inode table, you will *still* suffer
> massive data loss. Fsck will tell you how badly screwed you are, but
> it doesn't "fix" the disk; most users don't consider questions of the
> form "directory entry <precious-thesis-data> points to trashed inode,
> may I delete directory entry?" as being terribly helpful. :-/

Well, it will fix the disk in the end. And no, "directory entry <precious-thesis-data> points to trashed inode, may I delete directory entry?" is not _terribly_ helpful, but it is slightly helpful, and people actually expect that from ext2.

> Maybe this came as a surprise to you, but anyone who has used a
> compact flash in a digital camera knows that you ***have*** to wait
> until the led has gone out before trying to eject the flash card. I
> remember seeing all sorts of horror stories from professional
> photographers about how they lost an important wedding's day worth of
> pictures with the attendant commercial loss, on various digital
> photography forums. It tends to be the sort of mistake that digital
> photographers only make once.

It actually comes as a surprise to me. Actually yes and no. I know that digital cameras use VFAT, so pulling a CF card out of one may do bad things, unless I run fsck.vfat afterwards. If the digital camera was using ext3, I'd expect it to be safely pullable at any time.

Will an IBM microdrive make any difference there?

Anyway, it was not known to me. Rather than claiming "everyone knows" (when clearly very few people really understand all the details), can we simply document that?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 9:42 ` Pavel Machek @ 2009-08-25 13:37 ` Ric Wheeler 2009-08-25 13:42 ` Alan Cox 2009-08-25 21:15 ` Pavel Machek 2009-08-25 16:11 ` Theodore Tso 1 sibling, 2 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-25 13:37 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 05:42 AM, Pavel Machek wrote: > On Mon 2009-08-24 20:08:42, Theodore Tso wrote: >> On Tue, Aug 25, 2009 at 01:00:36AM +0200, Pavel Machek wrote: >>> Then to answer your question... ext2. You expect to run fsck after >>> unclean shutdown, and you expect to have to solve some problems with >>> it. So the way ext2 deals with the flash media actually matches what >>> the user expects. (*) >> >> But if the 256k hole is in data blocks, fsck won't find a problem, >> even with ext2. > > True. > >> And if the 256k hole is the inode table, you will *still* suffer >> massive data loss. Fsck will tell you how badly screwed you are, but >> it doesn't "fix" the disk; most users don't consider questions of the >> form "directory entry<precious-thesis-data> points to trashed inode, >> may I delete directory entry?" as being terribly helpful. :-/ > > Well it will fix the disk in the end. And no, "directory entry > <precious-thesis-data> points to trashed inode, may I delete directory > entry?" is not _terribly_ helpful, but it is slightly helpful and > people actually expect that from ext2. > >> Maybe this came as a surprise to you, but anyone who has used a >> compact flash in a digital camera knows that you ***have*** to wait >> until the led has gone out before trying to eject the flash card. 
>> I remember seeing all sorts of horror stories from professional
>> photographers about how they lost an important wedding's day worth of
>> pictures with the attendant commercial loss, on various digital
>> photography forums. It tends to be the sort of mistake that digital
>> photographers only make once.
>
> It actually comes as surprise to me. Actually yes and no. I know that
> digital cameras use VFAT, so pulling CF card out of it may do bad
> thing, unless I run fsck.vfat afterwards. If digital camera was using
> ext3, I'd expect it to be safely pullable at any time.
>
> Will IBM microdrive do any difference there?
>
> Anyway, it was not known to me. Rather than claiming "everyone knows"
> (when clearly very few people really understand all the details), can
> we simply document that?
> Pavel

I really think that all OS's (windows, mac, even your ipod) teach you not to hot unplug a device with any file system. Users have an "eject" or "safe unload" in windows, your iPod tells you not to power off or disconnect, etc.

I don't object to making that general statement - "Don't hot unplug a device with an active file system or actively used raw device" - but would object to the overly general statement about ext3 not working on flash, RAID5 not working, etc...

ric

^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 13:37 ` Ric Wheeler @ 2009-08-25 13:42 ` Alan Cox 2009-08-27 3:16 ` Rob Landley 2009-08-25 21:15 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: Alan Cox @ 2009-08-25 13:42 UTC (permalink / raw) To: Ric Wheeler Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue, 25 Aug 2009 09:37:12 -0400 Ric Wheeler <rwheeler@redhat.com> wrote: > I really think that the expectation that all OS's (windows, mac, even your ipod) > all teach you not to hot unplug a device with any file system. Users have an > "eject" or "safe unload" in windows, your iPod tells you not to power off or > disconnect, etc. Agreed > I don't object to making that general statement - "Don't hot unplug a device > with an active file system or actively used raw device" - but would object to > the overly general statement about ext3 not working on flash, RAID5 not working, > etc... The overall general statement for all media and all OS's should be "Do you have a backup, have you tested it recently" ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 13:42 ` Alan Cox @ 2009-08-27 3:16 ` Rob Landley 0 siblings, 0 replies; 309+ messages in thread From: Rob Landley @ 2009-08-27 3:16 UTC (permalink / raw) To: Alan Cox Cc: Ric Wheeler, Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tuesday 25 August 2009 08:42:10 Alan Cox wrote: > On Tue, 25 Aug 2009 09:37:12 -0400 > > Ric Wheeler <rwheeler@redhat.com> wrote: > > I really think that the expectation that all OS's (windows, mac, even > > your ipod) all teach you not to hot unplug a device with any file system. > > Users have an "eject" or "safe unload" in windows, your iPod tells you > > not to power off or disconnect, etc. > > Agreed Ok, I'll bite: What are journaling filesystems _for_? > > I don't object to making that general statement - "Don't hot unplug a > > device with an active file system or actively used raw device" - but > > would object to the overly general statement about ext3 not working on > > flash, RAID5 not working, etc... > > The overall general statement for all media and all OS's should be > > "Do you have a backup, have you tested it recently" It might be nice to know when you _needed_ said backup, and when you shouldn't re-backup bad data over it, because your data corruption actually got detected before then. And maybe a pony. Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 13:37 ` Ric Wheeler 2009-08-25 13:42 ` Alan Cox @ 2009-08-25 21:15 ` Pavel Machek 2009-08-25 22:42 ` Ric Wheeler 2009-08-25 23:08 ` Neil Brown 1 sibling, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-25 21:15 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet >>> Maybe this came as a surprise to you, but anyone who has used a >>> compact flash in a digital camera knows that you ***have*** to wait >>> until the led has gone out before trying to eject the flash card. I >>> remember seeing all sorts of horror stories from professional >>> photographers about how they lost an important wedding's day worth of >>> pictures with the attendant commercial loss, on various digital >>> photography forums. It tends to be the sort of mistake that digital >>> photographers only make once. >> >> It actually comes as surprise to me. Actually yes and no. I know that >> digital cameras use VFAT, so pulling CF card out of it may do bad >> thing, unless I run fsck.vfat afterwards. If digital camera was using >> ext3, I'd expect it to be safely pullable at any time. >> >> Will IBM microdrive do any difference there? >> >> Anyway, it was not known to me. Rather than claiming "everyone knows" >> (when clearly very few people really understand all the details), can >> we simply document that? > > I really think that the expectation that all OS's (windows, mac, even > your ipod) all teach you not to hot unplug a device with any file system. > Users have an "eject" or "safe unload" in windows, your iPod tells you > not to power off or disconnect, etc. That was before journaling filesystems... 
> I don't object to making that general statement - "Don't hot unplug a > device with an active file system or actively used raw device" - but > would object to the overly general statement about ext3 not working on > flash, RAID5 not working, etc... You can object any way you want, but running ext3 on flash or MD RAID5 is stupid: * ext2 would be faster * ext2 would provide better protection against powerfail. "ext3 works on flash and MD RAID5, as long as you do not have powerfail" seems to be the accurate statement, and if you don't need to protect against powerfails, you can just use ext2. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 21:15 ` Pavel Machek @ 2009-08-25 22:42 ` Ric Wheeler 2009-08-25 22:51 ` Pavel Machek 2009-08-25 23:08 ` Neil Brown 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-25 22:42 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 05:15 PM, Pavel Machek wrote: > >>>> Maybe this came as a surprise to you, but anyone who has used a >>>> compact flash in a digital camera knows that you ***have*** to wait >>>> until the led has gone out before trying to eject the flash card. I >>>> remember seeing all sorts of horror stories from professional >>>> photographers about how they lost an important wedding's day worth of >>>> pictures with the attendant commercial loss, on various digital >>>> photography forums. It tends to be the sort of mistake that digital >>>> photographers only make once. >>> >>> It actually comes as surprise to me. Actually yes and no. I know that >>> digital cameras use VFAT, so pulling CF card out of it may do bad >>> thing, unless I run fsck.vfat afterwards. If digital camera was using >>> ext3, I'd expect it to be safely pullable at any time. >>> >>> Will IBM microdrive do any difference there? >>> >>> Anyway, it was not known to me. Rather than claiming "everyone knows" >>> (when clearly very few people really understand all the details), can >>> we simply document that? >> >> I really think that the expectation that all OS's (windows, mac, even >> your ipod) all teach you not to hot unplug a device with any file system. >> Users have an "eject" or "safe unload" in windows, your iPod tells you >> not to power off or disconnect, etc. > > That was before journaling filesystems... Not true - that is true today with or without journals as we have discussed in great detail. 
Including specifically ext2. Basically, any file system (Linux, windows, OSX, etc) that writes into the page cache will lose data when you hot unplug its storage. End of story, don't do it! > >> I don't object to making that general statement - "Don't hot unplug a >> device with an active file system or actively used raw device" - but >> would object to the overly general statement about ext3 not working on >> flash, RAID5 not working, etc... > > You can object any way you want, but running ext3 on flash or MD RAID5 > is stupid: > > * ext2 would be faster > > * ext2 would provide better protection against powerfail. Not true in the slightest, you continue to ignore the ext2/3/4 developers telling you that it will lose data. > > "ext3 works on flash and MD RAID5, as long as you do not have > powerfail" seems to be the accurate statement, and if you don't need > to protect against powerfails, you can just use ext2. > Pavel Strange how your personal preference is totally out of sync with the entire enterprise class user base. ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 22:42 ` Ric Wheeler @ 2009-08-25 22:51 ` Pavel Machek 2009-08-25 23:03 ` david 2009-08-25 23:03 ` Ric Wheeler 0 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-25 22:51 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet >>> I really think that the expectation that all OS's (windows, mac, even >>> your ipod) all teach you not to hot unplug a device with any file system. >>> Users have an "eject" or "safe unload" in windows, your iPod tells you >>> not to power off or disconnect, etc. >> >> That was before journaling filesystems... > > Not true - that is true today with or without journals as we have > discussed in great detail. Including specifically ext2. > > Basically, any file system (Linux, windows, OSX, etc) that writes into > the page cache will lose data when you hot unplug its storage. End of > story, don't do it! No, not ext3 on SATA disk with barriers on and proper use of fsync(). I actually tested that. Yes, I should be able to hotunplug SATA drives and expect the data that was fsync-ed to be there. >>> I don't object to making that general statement - "Don't hot unplug a >>> device with an active file system or actively used raw device" - but >>> would object to the overly general statement about ext3 not working on >>> flash, RAID5 not working, etc... >> >> You can object any way you want, but running ext3 on flash or MD RAID5 >> is stupid: >> >> * ext2 would be faster >> >> * ext2 would provide better protection against powerfail. > > Not true in the slightest, you continue to ignore the ext2/3/4 developers > telling you that it will lose data. I know I will lose data. Both ext2 and ext3 will lose data on flashdisk. (That's what I'm trying to document). But... 
what is the benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least protects you against kernel panic. MD RAID5 is in software, so... that additional protection is just not there). >> "ext3 works on flash and MD RAID5, as long as you do not have >> powerfail" seems to be the accurate statement, and if you don't need >> to protect against powerfails, you can just use ext2. > > Strange how your personal preference is totally out of sync with the > entire enterprise class user base. Perhaps no one told them MD RAID5 is dangerous? You see, that's exactly what I'm trying to document here. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
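For readers following along, the "barriers on and proper use of fsync()" pattern Pavel refers to can be sketched as follows (a minimal illustration using POSIX calls through Python's os module; the function name and path are made up for the example):

```python
import os

def durable_write(path, data):
    """Write data so that, barring hardware that lies, it survives a power cut.

    os.fsync() asks the kernel to push the file's blocks to the device;
    with write barriers enabled (or the drive's volatile write cache
    disabled, e.g. via hdparm -W0), the data should be on stable media
    once fsync() returns.  Only then may the application claim durability.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # raises OSError if the kernel reports a write failure
    finally:
        os.close(fd)
```

The point of the argument above is that on a single SATA disk this pattern is designed to hold, while on a degraded MD RAID5 or a cheap flash card the device underneath can break the guarantee regardless of what the filesystem does.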
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 22:51 ` Pavel Machek @ 2009-08-25 23:03 ` david 2009-08-25 23:29 ` Pavel Machek 2009-08-25 23:03 ` Ric Wheeler 1 sibling, 1 reply; 309+ messages in thread From: david @ 2009-08-25 23:03 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Pavel Machek wrote: >>>> I don't object to making that general statement - "Don't hot unplug a >>>> device with an active file system or actively used raw device" - but >>>> would object to the overly general statement about ext3 not working on >>>> flash, RAID5 not working, etc... >>> >>> You can object any way you want, but running ext3 on flash or MD RAID5 >>> is stupid: >>> >>> * ext2 would be faster >>> >>> * ext2 would provide better protection against powerfail. >> >> Not true in the slightest, you continue to ignore the ext2/3/4 developers >> telling you that it will lose data. > > I know I will lose data. Both ext2 and ext3 will lose data on > flashdisk. (That's what I'm trying to document). But... what is the > benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least > protects you against kernel panic. MD RAID5 is in software, so... that > additional protection is just not there). the block device can lose data; it has absolutely nothing to do with the filesystem >>> "ext3 works on flash and MD RAID5, as long as you do not have >>> powerfail" seems to be the accurate statement, and if you don't need >>> to protect against powerfails, you can just use ext2. >> >> Strange how your personal preference is totally out of sync with the >> entire enterprise class user base. > > Perhaps no one told them MD RAID5 is dangerous? You see, that's exactly > what I'm trying to document here.
a MD raid array that's degraded to the point where there is no redundancy is dangerous, but I don't think that any of the enterprise users would be surprised. I think they will be surprised that it's possible that a prior failed write that hasn't been scrubbed can cause data loss when the array later degrades. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
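The failure mode David describes (a stale parity block turning a later disk failure into corruption of data that was never written) can be illustrated with a toy XOR-parity model. This is a sketch of the arithmetic only, not of MD's actual implementation:

```python
def xor(a, b):
    """Byte-wise XOR, the parity operation RAID-5 uses across a stripe."""
    return bytes(x ^ y for x, y in zip(a, b))

# A 3-disk stripe: two data chunks and their XOR parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)

# An update to d0 reaches its disk, but power fails before the matching
# parity write -- the parity is now stale (the "write hole").
d0 = b"CCCC"

# Later the disk holding d1 dies.  Reconstruction XORs the surviving
# chunks, and yields garbage for d1 even though d1 was never written to.
recovered = xor(d0, parity)
print(recovered)  # b'@@@@' -- not the original b'BBBB'
```

This is exactly a "prior failed write that hasn't been scrubbed": a scrub pass before the disk failure would have recomputed the parity and closed the window.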
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:03 ` david @ 2009-08-25 23:29 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-25 23:29 UTC (permalink / raw) To: david Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet >>>> "ext3 works on flash and MD RAID5, as long as you do not have >>>> powerfail" seems to be the accurate statement, and if you don't need >>>> to protect against powerfails, you can just use ext2. >>> >>> Strange how your personal preference is totally out of sync with the >>> entire enterprise class user base. >> >> Perhaps no one told them MD RAID5 is dangerous? You see, that's exactly >> what I'm trying to document here. > > a MD raid array that's degraded to the point where there is no redundancy > is dangerous, but I don't think that any of the enterprise users would be > surprised. > > I think they will be surprised that it's possible that a prior failed > write that hasn't been scrubbed can cause data loss when the array later > degrades. Cool, so Ted's "raid5 has highly undesirable properties" is actually pretty accurate. Some raid person should write a more detailed README, I'd say... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 22:51 ` Pavel Machek 2009-08-25 23:03 ` david @ 2009-08-25 23:03 ` Ric Wheeler 2009-08-25 23:26 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-25 23:03 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 06:51 PM, Pavel Machek wrote: > > >>>> I really think that the expectation that all OS's (windows, mac, even >>>> your ipod) all teach you not to hot unplug a device with any file system. >>>> Users have an "eject" or "safe unload" in windows, your iPod tells you >>>> not to power off or disconnect, etc. >>> >>> That was before journaling filesystems... >> >> Not true - that is true today with or without journals as we have >> discussed in great detail. Including specifically ext2. >> >> Basically, any file system (Linux, windows, OSX, etc) that writes into >> the page cache will lose data when you hot unplug its storage. End of >> story, don't do it! > > No, not ext3 on SATA disk with barriers on and proper use of > fsync(). I actually tested that. > > Yes, I should be able to hotunplug SATA drives and expect the data > that was fsync-ed to be there. You can and will lose data (even after fsync) with any type of storage at some rate. What you are missing here is that data loss needs to be measured in hard numbers - say percentage of installed boxes that have config X that lose data. Strangely enough, this is what high end storage companies do for a living, configure, deploy and then measure results. A long winded way of saying that just because you can induce data failure by recreating an event that happens almost never (power loss while rebuilding a RAID5 group specifically) does not mean that this makes RAID5 with ext3 unreliable. 
What does happen all of the time is single bad sector IO's and (less often, but more than your scenario) complete drive failures. In both cases, MD RAID5 will repair that damage before a second failure (including a power failure) happens 99.99% of the time. I can promise you that hot unplugging and replugging a S-ATA drive will also lose you data if you are actively writing to it (ext2, 3, whatever). Your micro data-loss benchmark is not a valid reflection of the wider experience and I fear that you will cause people to lose more data, not less, by moving them away from ext3 and MD RAID5. > >>>> I don't object to making that general statement - "Don't hot unplug a >>>> device with an active file system or actively used raw device" - but >>>> would object to the overly general statement about ext3 not working on >>>> flash, RAID5 not working, etc... >>> >>> You can object any way you want, but running ext3 on flash or MD RAID5 >>> is stupid: >>> >>> * ext2 would be faster >>> >>> * ext2 would provide better protection against powerfail. >> >> Not true in the slightest, you continue to ignore the ext2/3/4 developers >> telling you that it will lose data. > > I know I will lose data. Both ext2 and ext3 will lose data on > flashdisk. (That's what I'm trying to document). But... what is the > benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least > protects you against kernel panic. MD RAID5 is in software, so... that > additional protection is just not there). Faster recovery time on any normal kernel crash or power outage. Data loss would be equivalent with or without the journal. > >>> "ext3 works on flash and MD RAID5, as long as you do not have >>> powerfail" seems to be the accurate statement, and if you don't need >>> to protect against powerfails, you can just use ext2. >> >> Strange how your personal preference is totally out of sync with the >> entire enterprise class user base. > > Perhaps no one told them MD RAID5 is dangerous?
You see, that's exactly > what I'm trying to document here. > Pavel Using MD RAID5 will save more people from commonly occurring errors (sector and disk failures) than it will lose because of your "rebuild interrupted by a power failure" worry. What you are trying to do is to document a belief you have that is not borne out by real data across actual user boxes running real workloads. Unfortunately, getting that data is hard work and one of the things that we as a community do especially poorly. All of the data (secret data from my past and published data by NetApp, Google, etc) that I have seen would directly contradict your assertions and you will cause harm to our users with this. Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:03 ` Ric Wheeler @ 2009-08-25 23:26 ` Pavel Machek 2009-08-25 23:40 ` Ric Wheeler 2009-08-25 23:46 ` [patch] ext2/3: document conditions when reliable operation is possible david 0 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-25 23:26 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet >>> Basically, any file system (Linux, windows, OSX, etc) that writes into >>> the page cache will lose data when you hot unplug its storage. End of >>> story, don't do it! >> >> No, not ext3 on SATA disk with barriers on and proper use of >> fsync(). I actually tested that. >> >> Yes, I should be able to hotunplug SATA drives and expect the data >> that was fsync-ed to be there. > > You can and will lose data (even after fsync) with any type of storage at > some rate. What you are missing here is that data loss needs to be > measured in hard numbers - say percentage of installed boxes that have > config X that lose data. I'm talking "by design" here. I will lose data even on SATA drive that is properly powered on if I wait 5 years. > I can promise you that hot unplugging and replugging a S-ATA drive will > also lose you data if you are actively writing to it (ext2, 3, whatever). I can promise you that running S-ATA drive will also lose you data, even if you are not actively writing to it. Just wait 10 years; so what is your point? But ext3 is _designed_ to preserve fsynced data on SATA drive, while it is _not_ designed to preserve fsynced data on MD RAID5. Do you really think that's not a difference? 
>>>>> I don't object to making that general statement - "Don't hot unplug a >>>>> device with an active file system or actively used raw device" - but >>>>> would object to the overly general statement about ext3 not working on >>>>> flash, RAID5 not working, etc... >>>> >>>> You can object any way you want, but running ext3 on flash or MD RAID5 >>>> is stupid: >>>> >>>> * ext2 would be faster >>>> >>>> * ext2 would provide better protection against powerfail. >>> >>> Not true in the slightest, you continue to ignore the ext2/3/4 developers >>> telling you that it will lose data. >> >> I know I will lose data. Both ext2 and ext3 will lose data on >> flashdisk. (That's what I'm trying to document). But... what is the >> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least >> protects you against kernel panic. MD RAID5 is in software, so... that >> additional protection is just not there). > > Faster recovery time on any normal kernel crash or power outage. Data > loss would be equivalent with or without the journal. No, because you'll actually repair the ext2 with fsck after the kernel crash or power outage. Data loss will not be equivalent; in particular you'll not lose data written to ext2 _after_ the power outage. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:26 ` Pavel Machek @ 2009-08-25 23:40 ` Ric Wheeler 2009-08-25 23:48 ` david ` (2 more replies) 2009-08-25 23:46 ` [patch] ext2/3: document conditions when reliable operation is possible david 1 sibling, 3 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-25 23:40 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 07:26 PM, Pavel Machek wrote: > >>>> Basically, any file system (Linux, windows, OSX, etc) that writes into >>>> the page cache will lose data when you hot unplug its storage. End of >>>> story, don't do it! >>> >>> No, not ext3 on SATA disk with barriers on and proper use of >>> fsync(). I actually tested that. >>> >>> Yes, I should be able to hotunplug SATA drives and expect the data >>> that was fsync-ed to be there. >> >> You can and will lose data (even after fsync) with any type of storage at >> some rate. What you are missing here is that data loss needs to be >> measured in hard numbers - say percentage of installed boxes that have >> config X that lose data. > > I'm talking "by design" here. > > I will lose data even on SATA drive that is properly powered on if I > wait 5 years. > You are dead wrong. For RAID5 arrays, you assume that you have a hard failure and a power outage before you can rebuild the RAID (order of hours at full tilt). The failure rate of S-ATA drives is a few percent of the installed base per year. Some drives will fail faster than that (bad parts, bad environmental conditions, etc). Why don't you hold all of your most precious data on that single S-ATA drive for five years on one box and put a second copy on a small RAID5 with ext3 for the same period?
Repeat experiment until you get up to something like google scale or the other papers on failures in national labs in the US and then we can have an informed discussion. >> I can promise you that hot unplugging and replugging a S-ATA drive will >> also lose you data if you are actively writing to it (ext2, 3, whatever). > > I can promise you that running S-ATA drive will also lose you data, > even if you are not actively writing to it. Just wait 10 years; so > what is your point? I lost a s-ata drive 24 hours after installing it in a new box. If I had MD RAID5, I would not have lost any. My point is that you fail to take into account the rate of failures of a given configuration and the probability of data loss given those rates. > > But ext3 is _designed_ to preserve fsynced data on SATA drive, while > it is _not_ designed to preserve fsynced data on MD RAID5. Of course it will when you properly configure your MD RAID5. > > Do you really think that's not a difference? I think that you are simply wrong. > >>>>>> I don't object to making that general statement - "Don't hot unplug a >>>>>> device with an active file system or actively used raw device" - but >>>>>> would object to the overly general statement about ext3 not working on >>>>>> flash, RAID5 not working, etc... >>>>> >>>>> You can object any way you want, but running ext3 on flash or MD RAID5 >>>>> is stupid: >>>>> >>>>> * ext2 would be faster >>>>> >>>>> * ext2 would provide better protection against powerfail. >>>> >>>> Not true in the slightest, you continue to ignore the ext2/3/4 developers >>>> telling you that it will lose data. >>> >>> I know I will lose data. Both ext2 and ext3 will lose data on >>> flashdisk. (That's what I'm trying to document). But... what is the >>> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least >>> protects you against kernel panic. MD RAID5 is in software, so... that >>> additional protection is just not there).
>> >> Faster recovery time on any normal kernel crash or power outage. Data >> loss would be equivalent with or without the journal. > > No, because you'll actually repair the ext2 with fsck after the kernel > crash or power outage. Data loss will not be equivalent; in particular > you'll not lose data writen _after_ power outage to ext2. > Pavel As Ted (who wrote fsck for ext*) said, you will lose data in both. Your argument is not based on fact. You need to actually prove your point, not just state it as fact. ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:40 ` Ric Wheeler @ 2009-08-25 23:48 ` david 2009-08-25 23:53 ` Pavel Machek 2009-08-27 3:53 ` Rob Landley 2 siblings, 0 replies; 309+ messages in thread From: david @ 2009-08-25 23:48 UTC (permalink / raw) To: Ric Wheeler Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue, 25 Aug 2009, Ric Wheeler wrote: > On 08/25/2009 07:26 PM, Pavel Machek wrote: >> >>>>> Basically, any file system (Linux, windows, OSX, etc) that writes into >>>>> the page cache will lose data when you hot unplug its storage. End of >>>>> story, don't do it! >>>> >>>> No, not ext3 on SATA disk with barriers on and proper use of >>>> fsync(). I actually tested that. >>>> >>>> Yes, I should be able to hotunplug SATA drives and expect the data >>>> that was fsync-ed to be there. >>> >>> You can and will lose data (even after fsync) with any type of storage at >>> some rate. What you are missing here is that data loss needs to be >>> measured in hard numbers - say percentage of installed boxes that have >>> config X that lose data. >> >> I'm talking "by design" here. >> >> I will lose data even on SATA drive that is properly powered on if I >> wait 5 years. >> > > You are dead wrong. > > For RAID5 arrays, you assume that you have a hard failure and a power outage > before you can rebuild the RAID (order of hours at full tilt). and that the power outage causes a corrupted write. >>> I can promise you that hot unplugging and replugging a S-ATA drive will >>> also lose you data if you are actively writing to it (ext2, 3, whatever). >> >> I can promise you that running S-ATA drive will also lose you data, >> even if you are not actively writing to it. Just wait 10 years; so >> what is your point? > > I lost a s-ata drive 24 hours after installing it in a new box. 
If I had MD > RAID5, I would not have lost any. Me too, in fact just after I copied data from a raid array to it so that I could rebuild the raid array differently :-( David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:40 ` Ric Wheeler 2009-08-25 23:48 ` david @ 2009-08-25 23:53 ` Pavel Machek 2009-08-26 0:11 ` Ric Wheeler 2009-08-26 3:50 ` Rik van Riel 2009-08-27 3:53 ` Rob Landley 2 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-25 23:53 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet > Why don't you hold all of your most precious data on that single S-ATA > drive for five year on one box and put a second copy on a small RAID5 > with ext3 for the same period? > > Repeat experiment until you get up to something like google scale or the > other papers on failures in national labs in the US and then we can have > an informed discussion. I'm not interested in discussing statistics with you. I'd rather discuss fsync() and storage design issues. ext3 is designed to work on single SATA disks, and it is not designed to work on flash cards/degraded MD RAID5s, as Ted acknowledged. Because that fact is non-obvious to the users, I'd like to see it documented, and we now have a nice short writeup from Ted. If you want to argue that an ext3/MD RAID5/no UPS combination is still less likely to fail than a single SATA disk given part fail probabilities, go ahead and present nice statistics. It's just that I'm not interested in them. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:53 ` Pavel Machek @ 2009-08-26 0:11 ` Ric Wheeler 2009-08-26 0:16 ` Pavel Machek 2009-08-26 3:50 ` Rik van Riel 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 0:11 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 07:53 PM, Pavel Machek wrote: >> Why don't you hold all of your most precious data on that single S-ATA >> drive for five year on one box and put a second copy on a small RAID5 >> with ext3 for the same period? >> >> Repeat experiment until you get up to something like google scale or the >> other papers on failures in national labs in the US and then we can have >> an informed discussion. > > I'm not interested in discussing statistics with you. I'd rather discuss > fsync() and storage design issues. > > ext3 is designed to work on single SATA disks, and it is not designed > to work on flash cards/degraded MD RAID5s, as Ted acknowledged. You are simply incorrect, Ted did not say that ext3 does not work with MD raid5. > > Because that fact is non obvious to the users, I'd like to see it > documented, and now have nice short writeup from Ted. > > If you want to argue that ext3/MD RAID5/no UPS combination is still > less likely to fail than single SATA disk given part fail > probabilities, go ahead and present nice statistics. Its just that I'm > not interested in them. > Pavel > That is a proven fact and a well published one. If you choose to ignore published work (and common sense) that RAID makes you lose data less than non-RAID, why should anyone care what you write? Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 0:11 ` Ric Wheeler @ 2009-08-26 0:16 ` Pavel Machek 2009-08-26 0:31 ` Ric Wheeler 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-26 0:16 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue 2009-08-25 20:11:21, Ric Wheeler wrote: > On 08/25/2009 07:53 PM, Pavel Machek wrote: >>> Why don't you hold all of your most precious data on that single S-ATA >>> drive for five year on one box and put a second copy on a small RAID5 >>> with ext3 for the same period? >>> >>> Repeat experiment until you get up to something like google scale or the >>> other papers on failures in national labs in the US and then we can have >>> an informed discussion. >> >> I'm not interested in discussing statistics with you. I'd rather discuss >> fsync() and storage design issues. >> >> ext3 is designed to work on single SATA disks, and it is not designed >> to work on flash cards/degraded MD RAID5s, as Ted acknowledged. > > You are simply incorrect, Ted did not say that ext3 does not work > with MD raid5. http://lkml.org/lkml/2009/8/25/312 Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 0:16 ` Pavel Machek @ 2009-08-26 0:31 ` Ric Wheeler 2009-08-26 1:00 ` Theodore Tso 0 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 0:31 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 08:16 PM, Pavel Machek wrote: > On Tue 2009-08-25 20:11:21, Ric Wheeler wrote: >> On 08/25/2009 07:53 PM, Pavel Machek wrote: >>>> Why don't you hold all of your most precious data on that single S-ATA >>>> drive for five year on one box and put a second copy on a small RAID5 >>>> with ext3 for the same period? >>>> >>>> Repeat experiment until you get up to something like google scale or the >>>> other papers on failures in national labs in the US and then we can have >>>> an informed discussion. >>> >>> I'm not interested in discussing statistics with you. I'd rather discuss >>> fsync() and storage design issues. >>> >>> ext3 is designed to work on single SATA disks, and it is not designed >>> to work on flash cards/degraded MD RAID5s, as Ted acknowledged. >> >> You are simply incorrect, Ted did not say that ext3 does not work >> with MD raid5. > > http://lkml.org/lkml/2009/8/25/312 > Pavel I will let Ted clarify his text on his own, but the quoted text says "... have potential...". Why not ask Neil if he designed MD to not work properly with ext3? Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 0:31 ` Ric Wheeler @ 2009-08-26 1:00 ` Theodore Tso 2009-08-26 1:15 ` Ric Wheeler ` (6 more replies) 0 siblings, 7 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-26 1:00 UTC (permalink / raw) To: Ric Wheeler Cc: Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue, Aug 25, 2009 at 08:31:21PM -0400, Ric Wheeler wrote: >>> You are simply incorrect, Ted did not say that ext3 does not work >>> with MD raid5. >> >> http://lkml.org/lkml/2009/8/25/312 >> Pavel > > I will let Ted clarify his text on his own, but the quoted text says "... > have potential...". > > Why not ask Neil if he designed MD to not work properly with ext3? So let me clarify by saying the following things. 1) Filesystems are designed to expect that storage devices have certain properties. These include returning the same data that you wrote, and that an error when writing a sector, or a power failure when writing a sector, should not be amplified to cause collateral damage with previously successfully written sectors. 2) Degraded RAID 5/6 arrays do not meet these properties. Neither do cheap flash drives. This increases the chances you can lose, bigtime. 3) Does that mean that you shouldn't use ext3 on RAID drives? Of course not! First of all, Ext3 still saves you against kernel panics and hangs caused by device driver bugs or other kernel hangs. You will lose less data, and avoid needing to run a long and painful fsck after a forced reboot, compared to if you used ext2. You are making an assumption that the only time running the journal takes place is after a power failure.
But if the system hangs, and you need to hit the Big Red Switch, or if you are using the system in a Linux High Availability setup and the ethernet card fails, so the STONITH ("shoot the other node in the head") system forces a hard reset of the system, or you get a kernel panic which forces a reboot, in all of these cases ext3 will save you from a long fsck, and it will do so safely.

Secondly, what's the probability of a failure causing the RAID array to become degraded, followed by a power failure, versus a power failure while the RAID array is not running in degraded mode? Hopefully you are running with the RAID array in full, proper running order a much larger percentage of the time than running with the RAID array in degraded mode. If not, the bug is with the system administrator!

If you are someone who tends to run for long periods of time in degraded mode --- then you had better get a UPS. And certainly if you want to avoid the chances of failure, periodically scrubbing the disks so you detect hard drive failures early, instead of waiting until a disk fails before letting the rebuild find the dreaded "second failure" which causes data loss, is a d*mned good idea.

Maybe a random OS engineer doesn't know these things --- but trust me when I say a competent system administrator had better be familiar with these concepts. And someone who wants their data to be reliably stored needs to do some basic storage engineering if they want to have long-term data reliability. (That, or maybe they should outsource their long-term reliable storage to some service such as Amazon S3 --- see Jeremy Zawodny's analysis about how it can be cheaper, here: http://jeremy.zawodny.com/blog/archives/007624.html)

But we *do* need to be careful that we don't write documentation which ends up giving users the wrong impression. The bottom line is that you're better off using ext3 over ext2, even on a RAID array, for the reasons listed above.
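Ted's relative-risk argument is easy to put in rough numbers. A back-of-envelope sketch in Python, where both rates are assumptions picked purely for illustration, not measurements of any real deployment:

```python
# The torn-stripe hazard needs a power cut to land *while* the array is
# degraded, so its rate scales with the fraction of time spent degraded.
powerfail_per_year = 2.0          # assumed outage rate (illustrative)
hours_degraded_per_year = 24.0    # assumed rebuild exposure (illustrative)

frac_degraded = hours_degraded_per_year / (365 * 24)
coincident_per_year = powerfail_per_year * frac_degraded
print(round(coincident_per_year, 4))   # 0.0055
```

With these made-up numbers the coincidence shows up roughly once every two centuries; leave the array degraded half the time instead, and it becomes a yearly event, which is Ted's point about administration.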
Are you better off using ext3 over ext2 on a crappy flash drive? Maybe --- if you are also using crappy proprietary video drivers, such as Ubuntu ships, where every single time you exit a 3d game the system crashes (and Ubuntu users accept this as normal?!?), then ext3 might be a better choice since you'll reduce the chance of data loss when the system locks up or crashes thanks to the aforementioned crappy proprietary video drivers from Nvidia. On the other hand, crappy flash drives *do* have really bad write amplification effects, where a 4K write can cause 128k or more worth of flash to be rewritten, such that using ext3 could seriously degrade the lifetime of said crappy flash drive; furthermore, the crappy flash drives have such terrible write performance that using ext3 can be a performance nightmare. This, of course, doesn't apply to well-implemented SSD's, such as Intel's X25-M and X18-M. So here your mileage may vary. Still, if you are using crappy proprietary drivers which cause system hangs and crashes at a far greater rate than power fail-induced unclean shutdowns, ext3 *still* might be the better choice, even with crappy flash drives.

The best thing to do, of course, is to improve your storage stack; use competently implemented SSD's instead of crap flash cards. If your hardware RAID card supports a battery option, *get* the battery. Add a UPS to your system. Provision your RAID array with hot spares, and regularly scrub (read-test) your array so that failed drives can be detected early. Make sure you configure your MD setup so that you get e-mail when a hard drive fails and the array starts running in degraded mode, so you can replace the failed drive ASAP.

At the end of the day, filesystems are not magic. They can't compensate for crap hardware, or incompetently administered machines.

- Ted

^ permalink raw reply [flat|nested] 309+ messages in thread
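The write-amplification failure mode Ted sketches can be made concrete with a toy model. This is a hypothetical naive FTL, not any real device's firmware: every 4 KiB logical write forces an erase-and-rewrite of its entire 128 KiB erase block, so losing power between the erase and the rewrite destroys neighboring sectors that were never written.

```python
ERASE_BLOCK = 128 * 1024   # flash erase granularity
SECTOR = 4 * 1024          # logical sector size; 32 sectors ride along per write

def write_sector(flash, lba, data, powerfail_after_erase=False):
    """Naive FTL: read-modify-erase-rewrite the whole erase block in place."""
    start = (lba * SECTOR // ERASE_BLOCK) * ERASE_BLOCK
    merged = bytearray(flash[start:start + ERASE_BLOCK])
    off = lba * SECTOR - start
    merged[off:off + SECTOR] = data
    flash[start:start + ERASE_BLOCK] = b'\xff' * ERASE_BLOCK   # erase first...
    if powerfail_after_erase:
        return                          # ...power lost before the rewrite
    flash[start:start + ERASE_BLOCK] = merged                  # ...then rewrite

flash = bytearray(b'A' * ERASE_BLOCK)
write_sector(flash, 1, b'B' * SECTOR)              # clean write: neighbors survive
assert flash[0:SECTOR] == b'A' * SECTOR
write_sector(flash, 2, b'C' * SECTOR, powerfail_after_erase=True)
assert flash[0:SECTOR] == b'\xff' * SECTOR         # sector 0 trashed, never written
```

A 4K write rewriting 128K is the 32x amplification in Ted's numbers; a well-implemented SSD avoids both problems by writing the new copy elsewhere before retiring the old block.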
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 1:00 ` Theodore Tso @ 2009-08-26 1:15 ` Ric Wheeler 2009-08-26 2:58 ` Theodore Tso 2009-08-26 1:15 ` Ric Wheeler ` (5 subsequent siblings) 6 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 1:15 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 09:00 PM, Theodore Tso wrote: > On Tue, Aug 25, 2009 at 08:31:21PM -0400, Ric Wheeler wrote: > >>>> You are simply incorrect, Ted did not say that ext3 does not work >>>> with MD raid5. >>>> >>> http://lkml.org/lkml/2009/8/25/312 >>> Pavel >>> >> I will let Ted clarify his text on his own, but the quoted text says "... >> have potential...". >> >> Why not ask Neil if he designed MD to not work properly with ext3? >> > So let me clarify by saying the following things. > > 1) Filesystems are designed to expect that storage devices have > certain properties. These include returning the same data that you > wrote, and that an error when writing a sector, or a power failure > when writing sector, should not be amplified to cause collateral > damage with previously succfessfully written sectors. > > 2) Degraded RAID 5/6 filesystems do not meet these properties. > Neither to cheap flash drives. This increases the chances you can > lose, bigtime. > > I agree with the whole write up outside of the above - degraded RAID does meet this requirement unless you have a second (or third, counting the split write) failure during the rebuild. Note that the window of exposure during a RAID rebuild is linear with the size of your disk and how much you detune the rebuild... ric > 3) Does that mean that you shouldn't use ext3 on RAID drives? Of > course not! 
First of all, Ext3 still saves you against kernel panics > and hangs caused by device driver bugs or other kernel hangs. You > will lose less data, and avoid needing to run a long and painful fsck > after a forced reboot, compared to if you used ext2. You are making > an assumption that the only time running the journal takes place is > after a power failure. But if the system hangs, and you need to hit > the Big Red Switch, or if you using the system in a Linux High > Availability setup and the ethernet card fails, so the STONITH ("shoot > the other node in the head") system forces a hard reset of the system, > or you get a kernel panic which forces a reboot, in all of these cases > ext3 will save you from a long fsck, and it will do so safely. > > Secondly, what's the probability of a failure causes the RAID array to > become degraded, followed by a power failure, versus a power failure > while the RAID array is not running in degraded mode? Hopefully you > are running with the RAID array in full, proper running order a much > larger percentage of the time than running with the RAID array in > degraded mode. If not, the bug is with the system administrator! > > If you are someone who tends to run for long periods of time in > degraded mode --- then better get a UPS. And certainly if you want to > avoid the chances of failure, periodically scrubbing the disks so you > detect hard drive failures early, instead of waiting until a disk > fails before letting the rebuild find the dreaded "second failure" > which causes data loss, is a d*mned good idea. > > Maybe a random OS engineer doesn't know these things --- but trust me > when I say a competent system administrator had better be familiar > with these concepts. And someone who wants their data to be reliably > stored needs to do some basic storage engineering if they want to have > long-term data reliability. 
(That, or maybe they should outsource > their long-term reliable storage some service such as Amazon S3 --- > see Jeremy Zawodny's analysis about how it can be cheaper, here: > http://jeremy.zawodny.com/blog/archives/007624.html) > > But we *do* need to be careful that we don't write documentation which > is ends up giving users the wrong impression. The bottom line is that > you're better off using ext3 over ext2, even on a RAID array, for the > reasons listed above. > > Are you better off using ext3 over ext2 on a crappy flash drive? > Maybe --- if you are also using crappy proprietary video drivers, such > as Ubuntu ships, where every single time you exit a 3d game the system > crashes (and Ubuntu users accept this as normal?!?), then ext3 might > be a better choice since you'll reduce the chance of data loss when > the system locks up or crashes thanks to the aforemention crappy > proprietary video drivers from Nvidia. On the other hand, crappy > flash drives *do* have really bad write amplification effects, where a > 4K write can cause 128k or more worth of flash to be rewritten, such > that using ext3 could seriously degrade the lifetime of said crappy > flash drive; furthermore, the crappy flash drives have such terribly > write performance that using ext3 can be a performance nightmare. > This of course, doesn't apply to well-implemented SSD's, such as the > Intel's X25-M and X18-M. So here your mileage may vary. Still, if > you are using crappy proprietary drivers which cause system hangs and > crashes at a far greater rate than power fail-induced unclean > shutdowns, ext3 *still* might be the better choice, even with crappy > flash drives. > > The best thing to do, of course, is to improve your storage stack; use > competently implemented SSD's instead of crap flash cards. If your > hardware RAID card supports a battery option, *get* the battery. Add > a UPS to your system. 
Provision your RAID array with hot spares, and > regularly scrub (read-test) your array so that failed drives can be > detected early. Make sure you configure your MD setup so that you get > e-mail when a hard drive fails and the array starts running in > degraded mode, so you can replace the failed drive ASAP. > > At the end of the day, filesystems are not magic. They can't > compensate for crap hardware, or incompetently administered machines. > > - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 1:15 ` Ric Wheeler @ 2009-08-26 2:58 ` Theodore Tso 2009-08-26 10:39 ` Ric Wheeler ` (2 more replies) 0 siblings, 3 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-26 2:58 UTC (permalink / raw) To: Ric Wheeler Cc: Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote: > > I agree with the whole write up outside of the above - degraded RAID > does meet this requirement unless you have a second (or third, counting > the split write) failure during the rebuild.

The argument is that if the degraded RAID array is running in this state for a long time, and the power fails while the software RAID is in the middle of writing out a stripe, such that the stripe isn't completely written out, we could lose all of the data in that stripe. In other words, a power failure in the middle of writing out a stripe in a degraded RAID array counts as a second failure.

To me, this isn't a particularly interesting or newsworthy point, since a competent system administrator who cares about his data and/or his hardware will (a) have a UPS, and (b) be running with a hot spare and/or will immediately replace a failed drive in a RAID array.

- Ted

^ permalink raw reply [flat|nested] 309+ messages in thread
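The "counts as a second failure" argument can be shown in a few lines of XOR arithmetic. A toy sketch, not MD's actual implementation: three data blocks plus a parity block, with one disk already dead, and a power failure that lands the data update but not the matching parity update.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# One stripe across a 4-disk RAID5: three data blocks and their parity.
d0, d1, d2 = b'\x11' * 4, b'\x22' * 4, b'\x33' * 4
parity = xor(xor(d0, d1), d2)

# Disk 2 dies. Degraded but consistent: d2 is recoverable from the rest.
assert xor(xor(d0, d1), parity) == d2

# Torn stripe write: new d0 reaches its disk, then power fails before
# the parity update. The stripe is now internally inconsistent.
d0 = b'\x44' * 4
reconstructed = xor(xor(d0, d1), parity)
assert reconstructed != d2   # the block nobody was writing is now garbage
```

Journaling doesn't help here because d2 was never written by the filesystem at all; from ext3's point of view an old, committed block silently changed underneath it.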
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 2:58 ` Theodore Tso 2009-08-26 10:39 ` Ric Wheeler @ 2009-08-26 10:39 ` Ric Wheeler 2009-08-26 11:12 ` Pavel Machek 2009-08-27 5:19 ` Rob Landley 2 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 10:39 UTC (permalink / raw) To: Theodore Tso, Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 10:58 PM, Theodore Tso wrote: > On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote: > >> I agree with the whole write up outside of the above - degraded RAID >> does meet this requirement unless you have a second (or third, counting >> the split write) failure during the rebuild. >> > The argument is that if the degraded RAID array is running in this > state for a long time, and the power fails while the software RAID is > in the middle of writing out a stripe, such that the stripe isn't > completely written out, we could lose all of the data in that stripe. > > In other words, a power failure in the middle of writing out a stripe > in a degraded RAID array counts as a second failure. > > To me, this isn't a particularly interesting or newsworthy point, > since a competent system administrator who cares about his data and/or > his hardware will (a) have a UPS, and (b) be running with a hot spare > and/or will imediately replace a failed drive in a RAID array. > > - Ted > I agree that this is not an interesting (or likely) scenario, certainly when compared to the much more frequent failures that RAID will protect against which is why I object to the document as Pavel suggested. It will steer people away from using RAID and directly increase their chances of losing their data if they use just a single disk. Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 10:39 ` Ric Wheeler @ 2009-08-26 11:12 ` Pavel Machek 2009-08-26 11:28 ` david ` (2 more replies) 0 siblings, 3 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-26 11:12 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed 2009-08-26 06:39:14, Ric Wheeler wrote: > On 08/25/2009 10:58 PM, Theodore Tso wrote: >> On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote: >> >>> I agree with the whole write up outside of the above - degraded RAID >>> does meet this requirement unless you have a second (or third, counting >>> the split write) failure during the rebuild. >>> >> The argument is that if the degraded RAID array is running in this >> state for a long time, and the power fails while the software RAID is >> in the middle of writing out a stripe, such that the stripe isn't >> completely written out, we could lose all of the data in that stripe. >> >> In other words, a power failure in the middle of writing out a stripe >> in a degraded RAID array counts as a second failure. >> To me, this isn't a particularly interesting or newsworthy point, >> since a competent system administrator who cares about his data and/or >> his hardware will (a) have a UPS, and (b) be running with a hot spare >> and/or will imediately replace a failed drive in a RAID array. > > I agree that this is not an interesting (or likely) scenario, certainly > when compared to the much more frequent failures that RAID will protect > against which is why I object to the document as Pavel suggested. It > will steer people away from using RAID and directly increase their > chances of losing their data if they use just a single disk. 
So instead of fixing, or at least documenting, a known software deficiency in the Linux MD stack, you'll try to suppress that information so that people use more RAID5 setups?

Perhaps the better documentation will push them to RAID1, or maybe make them buy a UPS?

Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 11:12 ` Pavel Machek @ 2009-08-26 11:28 ` david 2009-08-29 9:49 ` [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek 2009-08-26 12:01 ` [patch] ext2/3: document conditions when reliable operation is possible Ric Wheeler 2009-08-26 12:23 ` Theodore Tso 2 siblings, 1 reply; 309+ messages in thread From: david @ 2009-08-26 11:28 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Pavel Machek wrote: > On Wed 2009-08-26 06:39:14, Ric Wheeler wrote: >> On 08/25/2009 10:58 PM, Theodore Tso wrote: >>> On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote: >>> >>>> I agree with the whole write up outside of the above - degraded RAID >>>> does meet this requirement unless you have a second (or third, counting >>>> the split write) failure during the rebuild. >>>> >>> The argument is that if the degraded RAID array is running in this >>> state for a long time, and the power fails while the software RAID is >>> in the middle of writing out a stripe, such that the stripe isn't >>> completely written out, we could lose all of the data in that stripe. >>> >>> In other words, a power failure in the middle of writing out a stripe >>> in a degraded RAID array counts as a second failure. >>> To me, this isn't a particularly interesting or newsworthy point, >>> since a competent system administrator who cares about his data and/or >>> his hardware will (a) have a UPS, and (b) be running with a hot spare >>> and/or will imediately replace a failed drive in a RAID array. 
>> >> I agree that this is not an interesting (or likely) scenario, certainly >> when compared to the much more frequent failures that RAID will protect >> against which is why I object to the document as Pavel suggested. It >> will steer people away from using RAID and directly increase their >> chances of losing their data if they use just a single disk. > > So instead of fixing or at least documenting known software deficiency > in Linux MD stack, you'll try to surpress that information so that > people use more of raid5 setups? > > Perhaps the better documentation will push them to RAID1, or maybe > make them buy an UPS?

people aren't objecting to better documentation, they are objecting to misleading documentation.

for flash drives the danger is very straightforward (although even then you have to note that it depends heavily on the firmware of the device, some will lose lots of data, some won't lose any). a good thing to do here would be for someone to devise a test to show this problem, and then gather the results of lots of people performing this test to see what the commonalities are.

you are generalizing that since you have lost data on flash drives, all flash drives are dangerous. what if it turns out that only one manufacturer is doing things wrong? you will have discouraged people from using flash drives for no reason (potentially causing them to lose data because they are scared away from using flash drives and don't implement anything better).

to be safe, all that a flash drive needs to do is to not change the FTL pointers until the data has fully been recorded in its new location. this is probably a trivial firmware change.

for raid arrays, we are still learning the nuances of what actually can happen. the comment that Rik made a few hours ago when he pointed out that with raid 5 you won't trash the entire stripe (which is what I thought happened from prior comments), but instead run the risk of losing two relatively definable chunks of data: 1.
the block you are writing (which you can lose anyway) 2. the block that would live on the disk that is missing. that drastically lessens the impact of the problem.

I would like to see someone explain what would happen on raid 6, and I think that the possibilities that Neil talked about where he said that it was possible to try the various combinations and see which ones agree with each other would be a good thing to implement if he can do so.

but the super simplified statement you keep trying to make is significantly overstating and oversimplifying the problem.

David Lang

^ permalink raw reply [flat|nested] 309+ messages in thread
* [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-26 11:28 ` david @ 2009-08-29 9:49 ` Pavel Machek 2009-08-29 11:28 ` Ric Wheeler 2009-08-29 16:35 ` david 0 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-29 9:49 UTC (permalink / raw) To: david Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

[-- Attachment #1: Type: text/plain, Size: 1488 bytes --]

>> So instead of fixing or at least documenting known software deficiency >> in Linux MD stack, you'll try to surpress that information so that >> people use more of raid5 setups? >> >> Perhaps the better documentation will push them to RAID1, or maybe >> make them buy an UPS? > > people aren't objecting to better documentation, they are objecting to > misleading documentation.

Actually Ric is. He's trying hard to make RAID5 look better than it really is.

> for flash drives the danger is very straightforward (although even then > you have to note that it depends heavily on the firmware of the device, > some will loose lots of data, some won't loose any)

I have not seen one that works :-(.

> you are generalizing that since you have lost data on flash drives, all > flash drives are dangerous.

Do the flash manufacturers claim they do not cause collateral damage during powerfail? If not, they probably are dangerous.

Anyway, you wanted a test, and one is attached. It normally takes like 4 unplugs to uncover problems.

> but the super simplified statement you keep trying to make is > significantly overstating and oversimplifying the problem.

Offer better docs? You are right that it does not lose the whole stripe; it merely loses a random block on the same stripe, but the result for a journaling filesystem is similar.
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: fstest --]
[-- Type: text/plain, Size: 923 bytes --]

#!/bin/bash
#
# Copyright 2008 Pavel Machek <pavel@suse.cz>, GPLv2
#
# vfat is broken with filesize=0
#
#
if [ .$MOUNTOPTS = . ]; then
	# ext3 is needed, or you need to disable caches using hdparm.
	# odirsync is needed, else modify fstest.worker to fsync the directory.
	MOUNTOPTS="-o dirsync"
fi
if [ .$BDEV = . ]; then
#	BDEV=/dev/sdb3
	BDEV=/dev/nd0
fi

export FILESIZE=4000
export NUMFILES=4000

waitforcard() {
	umount /mnt
	echo Waiting for card:
	while ! mount $BDEV $MOUNTOPTS /mnt 2> /dev/null; do
		echo -n .
		sleep 1
	done
#	hdparm -W0 $BDEV
	echo
}

mkdir delme.fstest
cd delme.fstest
waitforcard
rm tmp.* final.* /mnt/tmp.* /mnt/final.*
while true; do
	../fstest.work
	echo
	waitforcard
	echo Testing: fsck....
	umount /mnt
	fsck -fy $BDEV
	echo Testing....
	waitforcard
	for A in final.*; do
		echo -n $A " "
		cmp $A /mnt/$A || exit
	done
	echo
done

[-- Attachment #3: fstest.work --]
[-- Type: text/plain, Size: 409 bytes --]

#!/bin/bash
#
# Copyright 2008 Pavel Machek <pavel@suse.cz>, GPLv2
#
echo "Writing test files: "
for A in `seq $NUMFILES`; do
	echo -n $A " "
	rm final.$A
	cat /dev/urandom | head -c $FILESIZE > tmp.$A
	dd conv=fsync if=tmp.$A of=/mnt/final.$A 2> /dev/zero || exit
#	cat /mnt/final.$A > /dev/null || exit
#	sync should not be needed, as dd asks for fsync
#	sync
	mv tmp.$A final.$A
done

^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-29 9:49 ` [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek @ 2009-08-29 11:28 ` Ric Wheeler 2009-09-02 20:12 ` Pavel Machek 2009-08-29 16:35 ` david 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-29 11:28 UTC (permalink / raw) To: Pavel Machek Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/29/2009 05:49 AM, Pavel Machek wrote: > >>> So instead of fixing or at least documenting known software deficiency >>> in Linux MD stack, you'll try to surpress that information so that >>> people use more of raid5 setups? >>> >>> Perhaps the better documentation will push them to RAID1, or maybe >>> make them buy an UPS? >>> >> people aren't objecting to better documentation, they are objecting to >> misleading documentation. >> > Actually Ric is. He's trying hard to make RAID5 look better than it > really is. > > > I object to misleading and dangerous documentation that you have proposed. I spend a lot of time working in data integrity, talking and writing about it so I care deeply that we don't misinform people. In this thread, I put out a draft that is accurate several times and you have failed to respond to it. The big picture that you don't agree with is: (1) RAID (specifically MD RAID) will dramatically improve data integrity for real users. This is not a statement of opinion, this is a statement of fact that has been shown to be true in large scale deployments with commodity hardware. (2) RAID5 protects you against a single failure and your test case purposely injects a double failure. 
(3) How to configure MD reliably should be documented in MD documentation, not in each possible FS or raw device application (4) Data loss occurs in non-journalling file systems and journalling file systems when you suffer double failures or hot unplug storage, especially inexpensive FLASH parts. ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-29 11:28 ` Ric Wheeler @ 2009-09-02 20:12 ` Pavel Machek 2009-09-02 20:42 ` Ric Wheeler ` (2 more replies) 0 siblings, 3 replies; 309+ messages in thread From: Pavel Machek @ 2009-09-02 20:12 UTC (permalink / raw) To: Ric Wheeler Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

>>> people aren't objecting to better documentation, they are objecting to >>> misleading documentation. >>> >> Actually Ric is. He's trying hard to make RAID5 look better than it >> really is. > > I object to misleading and dangerous documentation that you have > proposed. I spend a lot of time working in data integrity, talking and > writing about it so I care deeply that we don't misinform people.

Yes, truth is dangerous. To vendors selling crap products.

> In this thread, I put out a draft that is accurate several times and you > have failed to respond to it.

Accurate as in 'has 0 information content' :-(.

> The big picture that you don't agree with is: > > (1) RAID (specifically MD RAID) will dramatically improve data integrity > for real users. This is not a statement of opinion, this is a statement > of fact that has been shown to be true in large scale deployments with > commodity hardware.

It is also completely irrelevant.

> (2) RAID5 protects you against a single failure and your test case > purposely injects a double failure.

Most people would be surprised that a press of the reset button is a 'failure' in this context.

> (4) Data loss occurs in non-journalling file systems and journalling > file systems when you suffer double failures or hot unplug storage, > especially inexpensive FLASH parts.

It does not happen on inexpensive DISK parts, so people do not expect that, and it is worth pointing out.
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-02 20:12 ` Pavel Machek @ 2009-09-02 20:42 ` Ric Wheeler 2009-09-02 23:00 ` Rob Landley 2009-09-02 22:45 ` Rob Landley 2009-09-02 22:49 ` [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case Rob Landley 2 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-09-02 20:42 UTC (permalink / raw) To: Pavel Machek Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 09/02/2009 04:12 PM, Pavel Machek wrote: > >>>> people aren't objecting to better documentation, they are objecting to >>>> misleading documentation. >>>> >>> Actually Ric is. He's trying hard to make RAID5 look better than it >>> really is. >> >> I object to misleading and dangerous documentation that you have >> proposed. I spend a lot of time working in data integrity, talking and >> writing about it so I care deeply that we don't misinform people. > > Yes, truth is dangerous. To vendors selling crap products. Pavel, you have no information and an attitude of not wanting to listen to anyone who has real experience or facts. Not just me, but also Ted and others. Totally pointless to reply to you further. Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-02 20:42 ` Ric Wheeler @ 2009-09-02 23:00 ` Rob Landley 2009-09-02 23:09 ` david 2009-09-03 0:36 ` jim owens 0 siblings, 2 replies; 309+ messages in thread From: Rob Landley @ 2009-09-02 23:00 UTC (permalink / raw) To: Ric Wheeler Cc: Pavel Machek, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wednesday 02 September 2009 15:42:19 Ric Wheeler wrote: > On 09/02/2009 04:12 PM, Pavel Machek wrote: > >>>> people aren't objecting to better documentation, they are objecting to > >>>> misleading documentation. > >>> > >>> Actually Ric is. He's trying hard to make RAID5 look better than it > >>> really is. > >> > >> I object to misleading and dangerous documentation that you have > >> proposed. I spend a lot of time working in data integrity, talking and > >> writing about it so I care deeply that we don't misinform people. > > > > Yes, truth is dangerous. To vendors selling crap products. > > Pavel, you have no information and an attitude of not wanting to listen to > anyone who has real experience or facts. Not just me, but also Ted and > others. > > Totally pointless to reply to you further. For the record, I've been able to follow Pavel's arguments, and I've been able to follow Ted's arguments. But as far as I can tell, you're arguing about a different topic than the rest of us. There's a difference between: A) This filesystem was corrupted because the underlying hardware is permanently damaged, no longer functioning as it did when it was new, and never will again. B) We had a transient glitch that ate the filesystem. The underlying hardware is as good as new, but our data is gone. You can argue about whether or not "new" was ever any good, but Linux has run on PC-class hardware from day 1. 
Sure PC-class hardware remains crap in many different ways, but this is not a _new_ problem. Refusing to work around what people actually _have_ and insisting we get a better class of user instead _is_ a new problem, kind of a disturbing one. USB keys are the modern successor to floppy drives, and even now Documentation/blockdev/floppy.txt is still full of some of the torturous workarounds implemented for that over the past 2 decades. The hardware existed, and instead of turning up their nose at it they made it work as best they could. Perhaps what's needed for the flash thing is a userspace package, the way mdutils made floppies a lot more usable than the kernel managed at the time. For the flash problem perhaps some FUSE thing a bit like mtdblock might be nice, a translation layer remapping an arbitrary underlying block device into larger granularity chunks and being sure to do the "write the new one before you erase the old one" trick that so many hardware-only flash devices _don't_, and then maybe even use Pavel's crash tool to figure out the write granularity of various sticks and ship it with a whitelist people can email updates to so we don't have to guess large. (Pressure on the USB vendors to give us a "raw view" extension bypassing the "pretend to be a hard drive, with remapping" hardware in future devices would be nice too, but won't help any of the hardware out in the field. I'm not sure that block remapping wouldn't screw up _this_ approach either, but it's an example of something that could be _tried_.) However, thinking about how to _fix_ a problem is predicated on acknowledging that there actually _is_ a problem. "The hardware is not physically damaged but your data was lost" sounds to me like a software problem, and thus something software could at least _attempt_ to address. "There's millions of 'em, Linux can't cope" doesn't seem like a useful approach. I already addressed the software raid thing last post.
Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
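[Editorial note: the "write the new one before you erase the old one" remapping trick Rob describes above can be sketched in a few lines. This is an illustration, not code from the thread; the `RemapLayer` class and all names are hypothetical, and a real design would additionally need the mapping update itself to be atomic and persistent across power loss.]

```python
# Sketch of a copy-on-write remapping layer over large-granularity chunks.
# A crash between steps leaves either the old or the new copy of a chunk
# intact on the "device" -- never a half-erased one.

class RemapLayer:
    def __init__(self, nchunks):
        self.phys = {}                    # physical chunk id -> chunk data
        self.map = {}                     # logical chunk id -> physical id
        self.free = list(range(nchunks))  # unallocated physical chunks

    def write(self, logical, data):
        new = self.free.pop()             # pick a fresh physical chunk
        self.phys[new] = data             # 1. write the new copy first
        old = self.map.get(logical)
        self.map[logical] = new           # 2. then flip the mapping
        if old is not None:
            self.phys.pop(old)            # 3. only now reclaim the old copy
            self.free.append(old)

    def read(self, logical):
        return self.phys[self.map[logical]]

dev = RemapLayer(8)
dev.write(0, b"AAAA")
dev.write(0, b"BBBB")     # the old "AAAA" copy survives until step 3
assert dev.read(0) == b"BBBB"
```

The ordering is the whole point: interrupting between steps 1 and 2 leaves the old data reachable, and interrupting between steps 2 and 3 leaves the new data reachable with only free space lost.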
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-02 23:00 ` Rob Landley @ 2009-09-02 23:09 ` david 2009-09-03 8:55 ` Pavel Machek 2009-09-03 0:36 ` jim owens 1 sibling, 1 reply; 309+ messages in thread From: david @ 2009-09-02 23:09 UTC (permalink / raw) To: Rob Landley Cc: Ric Wheeler, Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 2 Sep 2009, Rob Landley wrote: > USB keys are the modern successor to floppy drives, and even now > Documentation/blockdev/floppy.txt is still full of some of the torturous > workarounds implemented for that over the past 2 decades. The hardware > existed, and instead of turning up their nose at it they made it work as best > they could. > > Perhaps what's needed for the flash thing is a userspace package, the way > mdutils made floppies a lot more usable than the kernel managed at the time. > For the flash problem perhaps some FUSE thing a bit like mtdblock might be > nice, a translation layer remapping an arbitrary underlying block device into > larger granularity chunks and being sure to do the "write the new one before > you erase the old one" trick that so many hardware-only flash devices _don't_, > and then maybe even use Pavel's crash tool to figure out the write granularity > of various sticks and ship it with a whitelist people can email updates to so > we don't have to guess large. (Pressure on the USB vendors to give us a "raw > view" extension bypassing the "pretend to be a hard drive, with remapping" > hardware in future devices would be nice too, but won't help any of the > hardware out in the field. I'm not sure that block remapping wouldn't screw up > _this_ approach either, but it's an example of something that culd be > _tried_.) 
> > However, thinking about how to _fix_ a problem is predicated on acknowledging > that there actually _is_ a problem. "The hardware is not physically damaged > but your data was lost" sounds to me like a software problem, and thus > something software could at least _attempt_ to address. "There's millions of > 'em, Linux can't cope" doesn't seem like a useful approach. no other OS avoids this problem either. I actually don't see how you can do this from userspace, because when you write to the device you have _no_ idea where on the device your data will actually land. writing in larger chunks may or may not help, (if you do a 128K write, and the device is emulating 512b blocks on top of 128K eraseblocks, depending on the current state of the flash translation layer, you could end up writing to many different eraseblocks, up to the theoretical max of 256) David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
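[Editorial note: David's worst-case arithmetic checks out, and can be sketched quickly. This is an illustration, not code from the thread; the shuffled mapping is just a stand-in for an unknown flash translation layer state.]

```python
import random

SECTOR = 512
ERASEBLOCK = 128 * 1024

# One 128K write is 256 logical sectors; in the worst case the FTL has
# placed every one of them in a different eraseblock.
sectors_per_write = (128 * 1024) // SECTOR
sectors_per_eb = ERASEBLOCK // SECTOR
print(sectors_per_write)   # 256 -- the "theoretical max" above

# Simulate an arbitrary mapping over a 64 MiB device and count how many
# distinct eraseblocks one contiguous 128K logical write actually lands in.
device_sectors = (64 * 1024 * 1024) // SECTOR
mapping = list(range(device_sectors))
random.shuffle(mapping)    # stand-in for accumulated FTL remapping

touched = {mapping[lsn] // sectors_per_eb for lsn in range(sectors_per_write)}
print(len(touched))        # typically around 200 of the 512 eraseblocks here
```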
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-02 23:09 ` david @ 2009-09-03 8:55 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-09-03 8:55 UTC (permalink / raw) To: david Cc: Rob Landley, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! >> However, thinking about how to _fix_ a problem is predicated on acknowledging >> that there actually _is_ a problem. "The hardware is not physically damaged >> but your data was lost" sounds to me like a software problem, and thus >> something software could at least _attempt_ to address. "There's millions of >> 'em, Linux can't cope" doesn't seem like a useful approach. > > no other OS avoids this problem either. > > I actually don't see how you can do this from userspace, because when you > write to the device you have _no_ idea where on the device your data will > actually land. It certainly is not easy. Self-correcting codes could probably be used, but that would be very special, very slow, and very non-standard. (Basically... we could design a filesystem so that it would survive damage of an arbitrary 512K on disk -- using self-correcting codes in a CD-like manner). I'm not sure if it would be practical. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-02 23:00 ` Rob Landley 2009-09-02 23:09 ` david @ 2009-09-03 0:36 ` jim owens 2009-09-03 2:41 ` Rob Landley 1 sibling, 1 reply; 309+ messages in thread From: jim owens @ 2009-09-03 0:36 UTC (permalink / raw) To: Rob Landley Cc: Ric Wheeler, Pavel Machek, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Rob Landley wrote: > On Wednesday 02 September 2009 15:42:19 Ric Wheeler wrote: >> >> Totally pointless to reply to you further. > > For the record, I've been able to follow Pavel's arguments, and I've been able > to follow Ted's arguments. But as far as I can tell, you're arguing about a > different topic than the rest of us. I had no trouble following what Ric was arguing about. Ric never said "use only the best devices and you won't have problems". Ric was arguing the exact opposite - ALL devices are crap if you define crap as "can lose data". What he is saying is you need to UNDERSTAND your devices and their behavior and you must act accordingly. PAVEL DID NOT ACT ACCORDING TO HIS DEVICE LIMITATIONS. We understand he was clueless, but user error is still user error! And Ric said do not stigmatize whole classes of A) devices, B) raid, and C) filesystems with "Pavel says...". > However, thinking about how to _fix_ a problem is predicated on acknowledging > that there actually _is_ a problem. "The hardware is not physically damaged > but your data was lost" sounds to me like a software problem, and thus > something software could at least _attempt_ to address. "There's millions of > 'em, Linux can't cope" doesn't seem like a useful approach. We have been trying forever to deal with device problems and as Ric kept trying to explain we do understand them. The problem is not "can we be better" it is "at what cost". 
As they keep saying "fast", "cheap", "safe"... pick any 2. Adding software solutions to solve it will always turn "fast" to "slow". Most people will choose some risk they can manage (such as don't pull the flash card you idiot), instead of snail slow. > I already addressed the software raid thing last post. Saw it. I am not an MD guy so I will not say anything bad about it except all the "journal" crud. It really is only pandering to Pavel because ALL filesystems can be screwed and that is what they really need to know. The journal stuff distracts those who are not running a journaling filesystem, even if your description is correct except that as we fs people keep saying, fsck is meaningless and again will only give you a false sense of security that your data is OK. jim ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-03 0:36 ` jim owens @ 2009-09-03 2:41 ` Rob Landley 2009-09-03 14:14 ` jim owens 0 siblings, 1 reply; 309+ messages in thread From: Rob Landley @ 2009-09-03 2:41 UTC (permalink / raw) To: jim owens Cc: Ric Wheeler, Pavel Machek, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wednesday 02 September 2009 19:36:10 jim owens wrote: > Rob Landley wrote: > > On Wednesday 02 September 2009 15:42:19 Ric Wheeler wrote: > >> Totally pointless to reply to you further. > > > > For the record, I've been able to follow Pavel's arguments, and I've been > > able to follow Ted's arguments. But as far as I can tell, you're arguing > > about a different topic than the rest of us. > > I had no trouble following what Ric was arguing about. > > Ric never said "use only the best devices and you won't have problems". > > Ric was arguing the exact opposite - ALL devices are crap if you define > crap as "can loose data". And if you include meteor strike and flooding in your operating criteria you can come up with quite a straw man argument. It still doesn't mean "X is highly likely to cause data loss" can never come as news to people. > What he is saying is you need to UNDERSTAND > your devices and their behavior and you must act accordingly. > > PAVEL DID NOT ACT ACCORDING TO HIS DEVICE LIMITATIONS. Where was this limitation documented? (Before he documented it, I mean?) > We understand he was clueless, but user error is still user error! I think he understands he was clueless too, that's why he investigated the failure and wrote it up for posterity. > And Ric said do not stigmatize whole classes of A) devices, B) raid, > and C) filesystems with "Pavel says...". I don't care what "Pavel says", so you can leave the ad hominem at the door, thanks. 
The kernel presents abstractions, such as block device nodes. Sometimes implementation details bubble through those abstractions. Presumably, we agree on that so far. I was once asked to write what became Documentation/rbtree.txt, which got merged. I've also read maybe half of Documentation/RCU. Neither technique is specific to Linux, but this doesn't seem to have been an objection at the time. The technique, "journaling", is widely perceived as eliminating the need for fsck (and thus the potential for filesystem corruption) in the case of unclean shutdowns. But there are easily reproducible cases where the technique, "journaling", does not do this. Thus journaling, as a concept, has limitations which are _not_ widely understood by the majority of people who purchase and use USB flash keys. The kernel doesn't currently have any documentation on journaling theory where mention of journaling's limitations could go. It does have a section on its internal Journaling API in Documentation/DocBook/filesystems.tmpl which links to two papers (both about ext3, even though reiserfs was merged first and IBM's JFS was implemented before either) from 1998 and 2000 respectively. The 2000 paper brushes against disk granularity answering a question starting at 72m, 21s, and brushes against software raid and write ordering starting at the 72m 32s mark. But it never directly addresses either issue... Sigh, I'm well into tl;dr territory here, aren't I? Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-03 2:41 ` Rob Landley @ 2009-09-03 14:14 ` jim owens 2009-09-04 7:44 ` Rob Landley 0 siblings, 1 reply; 309+ messages in thread From: jim owens @ 2009-09-03 14:14 UTC (permalink / raw) To: Rob Landley Cc: Ric Wheeler, Pavel Machek, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Rob Landley wrote: > I think he understands he was clueless too, that's why he investigated the > failure and wrote it up for posterity. > >> And Ric said do not stigmatize whole classes of A) devices, B) raid, >> and C) filesystems with "Pavel says...". > > I don't care what "Pavel says", so you can leave the ad hominem at the door, > thanks. See, this is exactly the problem we have with all the proposed documentation. The reader (you) did not get what the writer (me) was trying to say. That does not say either of us was wrong in what we thought was meant, simply that we did not communicate. What I meant was we did not want to accept Pavel's incorrect documentation and post it in kernel docs. > The kernel presents abstractions, such as block device nodes. Sometimes > implementation details bubble through those abstractions. Presumably, we > agree on that so far. We don't have any problem with documenting abstractions. But they must be written as abstracts and accurate, not as IMO blogs. It is not "he means well, so we will just accept it". The rule for kernel docs should be the same as for code. If it is not correct in all cases or causes problems, we don't accept it. jim ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-03 14:14 ` jim owens @ 2009-09-04 7:44 ` Rob Landley 2009-09-04 11:49 ` Ric Wheeler 0 siblings, 1 reply; 309+ messages in thread From: Rob Landley @ 2009-09-04 7:44 UTC (permalink / raw) To: jim owens Cc: Ric Wheeler, Pavel Machek, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thursday 03 September 2009 09:14:43 jim owens wrote: > Rob Landley wrote: > > I think he understands he was clueless too, that's why he investigated > > the failure and wrote it up for posterity. > > > >> And Ric said do not stigmatize whole classes of A) devices, B) raid, > >> and C) filesystems with "Pavel says...". > > > > I don't care what "Pavel says", so you can leave the ad hominem at the > > door, thanks. > > See, this is exactly the problem we have with all the proposed > documentation. The reader (you) did not get what the writer (me) > was trying to say. That does not say either of us was wrong in > what we thought was meant, simply that we did not communicate. That's why I've mostly stopped bothering with this thread. I could respond to Ric Wheeler's latest (what does write barriers have to do with whether or not a multi-sector stripe is guaranteed to be atomically updated during a panic or power failure?) but there's just no point. The LWN article on the topic is out, and incomplete as it is I expect it's the best documentation anybody will actually _read_. Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-04 7:44 ` Rob Landley @ 2009-09-04 11:49 ` Ric Wheeler 2009-09-05 10:28 ` Pavel Machek 0 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-09-04 11:49 UTC (permalink / raw) To: Rob Landley Cc: jim owens, Ric Wheeler, Pavel Machek, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 09/04/2009 03:44 AM, Rob Landley wrote: > On Thursday 03 September 2009 09:14:43 jim owens wrote: > >> Rob Landley wrote: >> >>> I think he understands he was clueless too, that's why he investigated >>> the failure and wrote it up for posterity. >>> >>> >>>> And Ric said do not stigmatize whole classes of A) devices, B) raid, >>>> and C) filesystems with "Pavel says...". >>>> >>> I don't care what "Pavel says", so you can leave the ad hominem at the >>> door, thanks. >>> >> See, this is exactly the problem we have with all the proposed >> documentation. The reader (you) did not get what the writer (me) >> was trying to say. That does not say either of us was wrong in >> what we thought was meant, simply that we did not communicate. >> > That's why I've mostly stopped bothering with this thread. I could respond to > Ric Wheeler's latest (what does write barriers have to do with whether or not > a multi-sector stripe is guaranteed to be atomically updated during a panic or > power failure?) but there's just no point. > The point of that post was that the failure that you and Pavel both attribute to RAID and journalled fs happens whenever the storage cannot promise to do atomic writes of a logical FS block (prevent torn pages/split writes/etc). I gave a specific example of why this happens even with simple, single disk systems. 
Further, if you have the write cache enabled on your local S-ATA/SAS drives and do not have working barriers (as is the case with MD RAID5/6), you have a hard promise of data loss on power outage and these split writes are not going to be the cause of your issues. You can verify this by testing. Or, try to find people that do storage and file systems that you would listen to and ask. > The LWN article on the topic is out, and incomplete as it is I expect it's the > best documentation anybody will actually _read_. > > Rob > ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-04 11:49 ` Ric Wheeler @ 2009-09-05 10:28 ` Pavel Machek 2009-09-05 12:20 ` Ric Wheeler 2009-09-05 13:54 ` Jonathan Corbet 0 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-09-05 10:28 UTC (permalink / raw) To: Ric Wheeler Cc: Rob Landley, jim owens, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Fri 2009-09-04 07:49:34, Ric Wheeler wrote: > On 09/04/2009 03:44 AM, Rob Landley wrote: >> On Thursday 03 September 2009 09:14:43 jim owens wrote: >> >>> Rob Landley wrote: >>> >>>> I think he understands he was clueless too, that's why he investigated >>>> the failure and wrote it up for posterity. >>>> >>>> >>>>> And Ric said do not stigmatize whole classes of A) devices, B) raid, >>>>> and C) filesystems with "Pavel says...". >>>>> >>>> I don't care what "Pavel says", so you can leave the ad hominem at the >>>> door, thanks. >>>> >>> See, this is exactly the problem we have with all the proposed >>> documentation. The reader (you) did not get what the writer (me) >>> was trying to say. That does not say either of us was wrong in >>> what we thought was meant, simply that we did not communicate. >>> >> That's why I've mostly stopped bothering with this thread. I could respond to >> Ric Wheeler's latest (what does write barriers have to do with whether or not >> a multi-sector stripe is guaranteed to be atomically updated during a panic or >> power failure?) but there's just no point. >> > > The point of that post was that the failure that you and Pavel both > attribute to RAID and journalled fs happens whenever the storage cannot > promise to do atomic writes of a logical FS block (prevent torn > pages/split writes/etc). I gave a specific example of why this happens > even with simple, single disk systems. 
ext3 does not expect atomic write of 4K block, according to Ted. So no, it is not broken on single disk. >> The LWN article on the topic is out, and incomplete as it is I expect it's the >> best documentation anybody will actually _read_. Would anyone (probably privately?) share the lwn link? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-05 10:28 ` Pavel Machek @ 2009-09-05 12:20 ` Ric Wheeler 2009-09-05 13:54 ` Jonathan Corbet 1 sibling, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-09-05 12:20 UTC (permalink / raw) To: Pavel Machek Cc: Rob Landley, jim owens, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 09/05/2009 06:28 AM, Pavel Machek wrote: > On Fri 2009-09-04 07:49:34, Ric Wheeler wrote: > >> On 09/04/2009 03:44 AM, Rob Landley wrote: >> >>> On Thursday 03 September 2009 09:14:43 jim owens wrote: >>> >>> >>>> Rob Landley wrote: >>>> >>>> >>>>> I think he understands he was clueless too, that's why he investigated >>>>> the failure and wrote it up for posterity. >>>>> >>>>> >>>>> >>>>>> And Ric said do not stigmatize whole classes of A) devices, B) raid, >>>>>> and C) filesystems with "Pavel says...". >>>>>> >>>>>> >>>>> I don't care what "Pavel says", so you can leave the ad hominem at the >>>>> door, thanks. >>>>> >>>>> >>>> See, this is exactly the problem we have with all the proposed >>>> documentation. The reader (you) did not get what the writer (me) >>>> was trying to say. That does not say either of us was wrong in >>>> what we thought was meant, simply that we did not communicate. >>>> >>>> >>> That's why I've mostly stopped bothering with this thread. I could respond to >>> Ric Wheeler's latest (what does write barriers have to do with whether or not >>> a multi-sector stripe is guaranteed to be atomically updated during a panic or >>> power failure?) but there's just no point. >>> >>> >> The point of that post was that the failure that you and Pavel both >> attribute to RAID and journalled fs happens whenever the storage cannot >> promise to do atomic writes of a logical FS block (prevent torn >> pages/split writes/etc). 
I gave a specific example of why this happens >> even with simple, single disk systems. >> > ext3 does not expect atomic write of 4K block, according to Ted. So > no, it is not broken on single disk. > I am not sure what you mean by "expect." ext3 (and other file systems) certainly expect that acknowledged writes will still be there after a crash. With your disk write cache on (and no working barriers or non-volatile write cache), this will always require a repair via fsck or leave you with corrupted data or metadata. ext4, btrfs and zfs all do checksumming of writes, but this is a detection mechanism. Repair of the partial write is done on detection (if you have another copy in btrfs or xfs) or by repair (ext4's fsck). For what it's worth, this is the same story with databases (DB2, Oracle, etc). They spend a lot of energy trying to detect partial writes from the application level's point of view and their granularity is often multiple fs blocks.... > > >>> The LWN article on the topic is out, and incomplete as it is I expect it's the >>> best documentation anybody will actually _read_. >>> > Would anyone (probably privately?) share the lwn link? > Pavel > ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-05 10:28 ` Pavel Machek 2009-09-05 12:20 ` Ric Wheeler @ 2009-09-05 13:54 ` Jonathan Corbet 2009-09-05 21:27 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: Jonathan Corbet @ 2009-09-05 13:54 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Rob Landley, jim owens, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Sat, 5 Sep 2009 12:28:10 +0200 Pavel Machek <pavel@ucw.cz> wrote: > >> The LWN article on the topic is out, and incomplete as it is I expect it's the > >> best documentation anybody will actually _read_. > > Would anyone (probably privately?) share the lwn link? http://lwn.net/SubscriberLink/349970/9875eff987190551/ assuming you've not already gotten one from elsewhere. jon ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-05 13:54 ` Jonathan Corbet @ 2009-09-05 21:27 ` Pavel Machek 2009-09-05 21:56 ` Theodore Tso 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-09-05 21:27 UTC (permalink / raw) To: Jonathan Corbet Cc: Ric Wheeler, Rob Landley, jim owens, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Sat 2009-09-05 07:54:24, Jonathan Corbet wrote: > On Sat, 5 Sep 2009 12:28:10 +0200 > Pavel Machek <pavel@ucw.cz> wrote: > > > >> The LWN article on the topic is out, and incomplete as it is I expect it's the > > >> best documentation anybody will actually _read_. > > > > Would anyone (probably privately?) share the lwn link? > > http://lwn.net/SubscriberLink/349970/9875eff987190551/ > > assuming you've not already gotten one from elsewhere. Thanks, and thanks for nice article! Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-05 21:27 ` Pavel Machek @ 2009-09-05 21:56 ` Theodore Tso 0 siblings, 0 replies; 309+ messages in thread From: Theodore Tso @ 2009-09-05 21:56 UTC (permalink / raw) To: Pavel Machek Cc: Jonathan Corbet, Ric Wheeler, Rob Landley, jim owens, david, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Sat, Sep 05, 2009 at 11:27:32PM +0200, Pavel Machek wrote: > > Thanks, and thanks for nice article! I agree; it's very nicely written, balanced, and doesn't scare users unduly. - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-02 20:12 ` Pavel Machek 2009-09-02 20:42 ` Ric Wheeler @ 2009-09-02 22:45 ` Rob Landley 2009-09-02 22:49 ` [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case Rob Landley 2 siblings, 0 replies; 309+ messages in thread From: Rob Landley @ 2009-09-02 22:45 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wednesday 02 September 2009 15:12:10 Pavel Machek wrote: > > (2) RAID5 protects you against a single failure and your test case > > purposely injects a double failure. > > Most people would be surprised that press of reset button is 'failure' > in this context. Apparently because most people haven't read Documentation/md.txt: Boot time assembly of degraded/dirty arrays ------------------------------------------- If a raid5 or raid6 array is both dirty and degraded, it could have undetectable data corruption. This is because the fact that it is 'dirty' means that the parity cannot be trusted, and the fact that it is degraded means that some datablocks are missing and cannot reliably be reconstructed (due to no parity). And so on for several more paragraphs. Perhaps the documentation needs to be extended to note that "journaling will not help here, because the lost data blocks render entire stripes unreconstructable"... Hmmm, I'll take a stab at it. (I'm not addressing the raid 0 issues brought up elsewhere in this thread because I don't comfortably understand the current state of play...) Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case. 2009-09-02 20:12 ` Pavel Machek 2009-09-02 20:42 ` Ric Wheeler 2009-09-02 22:45 ` Rob Landley @ 2009-09-02 22:49 ` Rob Landley 2009-09-03 9:08 ` Pavel Machek 2009-09-03 12:05 ` Ric Wheeler 2 siblings, 2 replies; 309+ messages in thread From: Rob Landley @ 2009-09-02 22:49 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet From: Rob Landley <rob@landley.net> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section, explaining that using a journaling filesystem can't overcome this problem. Signed-off-by: Rob Landley <rob@landley.net> --- Documentation/md.txt | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/Documentation/md.txt b/Documentation/md.txt index 4edd39e..52b8450 100644 --- a/Documentation/md.txt +++ b/Documentation/md.txt @@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use md-mod.start_dirty_degraded=1 +Note that journaling filesystems do not effectively protect data in this +case, because the update granularity of the RAID is larger than the journal +was designed to expect. Reconstructing data via parity information involves +matching together corresponding stripes, and updating only some of these +stripes renders the corresponding data in all the unmatched stripes +meaningless. Thus seemingly unrelated data in other parts of the filesystem +(stored in the unmatched stripes) can become unreadable after a partial +update, but the journal is only aware of the parts it modified, not the +"collateral damage" elsewhere in the filesystem which was affected by those +changes. + +Thus successful journal replay proves nothing in this context, and even a +full fsck only shows whether or not the filesystem's metadata was affected. 
+(A proper solution to this problem would involve adding journaling to the RAID +itself, at least during degraded writes. In the meantime, try not to allow +a system to shut down uncleanly with its RAID both dirty and degraded, it +can handle one but not both.) Superblock formats ------------------ -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply related [flat|nested] 309+ messages in thread
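[Editorial note: the failure mode the patch describes reduces to a few lines of XOR arithmetic. A sketch for illustration only -- two tiny data chunks plus parity stand in for a real stripe.]

```python
# RAID5 stores parity = XOR of the data chunks in a stripe. If a stripe
# update is interrupted (array dirty) and a disk then goes missing (array
# degraded), rebuilding the missing chunk from the stale parity produces
# garbage -- even for data the interrupted write never touched.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1 = b"\x11" * 4, b"\x22" * 4   # two data chunks in one stripe
parity = xor(d0, d1)                # parity consistent with d0, d1

# Interrupted update: d0 is rewritten, but the matching parity write
# never happens, so parity is now stale.
d0_new = b"\x33" * 4

# Degraded: the disk holding d1 fails; reconstruct d1 from d0_new + parity.
d1_rebuilt = xor(d0_new, parity)

assert d1_rebuilt != d1             # d1 came back silently corrupted
print(d1.hex(), d1_rebuilt.hex())   # 22222222 00000000
```

Nothing a journal on the filesystem above can do fixes this: the corrupted chunk (d1) was never part of the interrupted transaction, which is the "collateral damage" the patch text is describing.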
* Re: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case. 2009-09-02 22:49 ` [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case Rob Landley @ 2009-09-03 9:08 ` Pavel Machek 2009-09-03 12:05 ` Ric Wheeler 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-09-03 9:08 UTC (permalink / raw) To: Rob Landley Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed 2009-09-02 17:49:46, Rob Landley wrote: > From: Rob Landley <rob@landley.net> > > Add more warnings to the "Boot time assembly of degraded/dirty arrays" section, > explaining that using a journaling filesystem can't overcome this problem. > > Signed-off-by: Rob Landley <rob@landley.net> I like it! Not sure if I know enough about MD to add ack, but... Acked-by: Pavel Machek <pavel@ucw.cz> Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case. 2009-09-02 22:49 ` [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case Rob Landley 2009-09-03 9:08 ` Pavel Machek @ 2009-09-03 12:05 ` Ric Wheeler 2009-09-03 12:31 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-09-03 12:05 UTC (permalink / raw) To: Rob Landley Cc: Pavel Machek, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 09/02/2009 06:49 PM, Rob Landley wrote: > From: Rob Landley<rob@landley.net> > > Add more warnings to the "Boot time assembly of degraded/dirty arrays" section, > explaining that using a journaling filesystem can't overcome this problem. > > Signed-off-by: Rob Landley<rob@landley.net> > --- > > Documentation/md.txt | 17 +++++++++++++++++ > 1 file changed, 17 insertions(+) > > diff --git a/Documentation/md.txt b/Documentation/md.txt > index 4edd39e..52b8450 100644 > --- a/Documentation/md.txt > +++ b/Documentation/md.txt > @@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use > > md-mod.start_dirty_degraded=1 > > +Note that Journaling filesystems do not effectively protect data in this > +case, because the update granularity of the RAID is larger than the journal > +was designed to expect. Reconstructing data via partity information involes > +matching together corresponding stripes, and updating only some of these > +stripes renders the corresponding data in all the unmatched stripes > +meaningless. Thus seemingly unrelated data in other parts of the filesystem > +(stored in the unmatched stripes) can become unreadable after a partial > +update, but the journal is only aware of the parts it modified, not the > +"collateral damage" elsewhere in the filesystem which was affected by those > +changes. 
> + > +Thus successful journal replay proves nothing in this context, and even a > +full fsck only shows whether or not the filesystem's metadata was affected. > +(A proper solution to this problem would involve adding journaling to the RAID > +itself, at least during degraded writes. In the meantime, try not to allow > +a system to shut down uncleanly with its RAID both dirty and degraded, it > +can handle one but not both.) > > Superblock formats > ------------------ > > NACK. Now you have moved the inaccurate documentation about journalling file systems into the MD documentation. Repeat after me: (1) partial writes to a RAID stripe (with or without file systems, with or without journals) create an invalid stripe (2) partial writes can be prevented in most cases by running with write cache disabled or working barriers (3) fsck can (for journalling fs or non journalling fs) detect and fix your file system. It won't give you back the data in that stripe, but you will get the rest of your metadata and data back and usable. You don't need MD in the picture to test this - take fsfuzzer or just dd and zero out a RAID stripe width of data from a file system. If you hit data blocks, your fsck (for ext2) or mount (for any journalling fs) will not see an error. If metadata, fsck in both cases when run will try to fix it as best as it can. Also note that partial writes (similar to torn writes) can happen for multiple reasons on non-RAID systems and leave the same kind of damage. Side note, proposing a half sketched out "fix" for partial stripe writes in documentation is not productive. Much better to submit a fully thought out proposal or actual patches to demonstrate the issue. Rob, you should really try to take a few disks, build a working MD RAID5 group and test your ideas. Try it with and without the write cache enabled. Measure and report, say after 20 power losses, how files integrity and fsck repairs were impacted. Try the same with ext2 and ext3. 
Regards, Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
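[Editorial note: Ric's dd experiment can be approximated without MD or a real filesystem, using a plain file as a stand-in "device"; the file names, sizes, and the 4 KiB "stripe" here are arbitrary choices for illustration.]

```shell
# Build a 64 KiB "device" image, keep a reference copy, then zero a
# stripe-sized region in place -- simulating a torn/partial stripe write.
dd if=/dev/urandom of=img bs=1024 count=64 2>/dev/null
cp img img.orig
dd if=/dev/zero of=img bs=1024 seek=16 count=4 conv=notrunc 2>/dev/null

# The damage is silent: nothing fails until the data is actually read back.
cmp -s img img.orig || echo "image diverged: a stripe-width of data is gone"
[ "$(stat -c %s img)" -eq 65536 ] && echo "file size unchanged"
```

If the zeroed region held data blocks, neither fsck nor journal replay would report anything; only metadata damage is visible to them, which is the point both sides of this thread agree on.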
* Re: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case. 2009-09-03 12:05 ` Ric Wheeler @ 2009-09-03 12:31 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-09-03 12:31 UTC (permalink / raw) To: Ric Wheeler Cc: Rob Landley, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thu 2009-09-03 08:05:31, Ric Wheeler wrote: > On 09/02/2009 06:49 PM, Rob Landley wrote: >> From: Rob Landley<rob@landley.net> >> >> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section, >> explaining that using a journaling filesystem can't overcome this problem. >> >> Signed-off-by: Rob Landley<rob@landley.net> >> --- >> >> Documentation/md.txt | 17 +++++++++++++++++ >> 1 file changed, 17 insertions(+) >> >> diff --git a/Documentation/md.txt b/Documentation/md.txt >> index 4edd39e..52b8450 100644 >> --- a/Documentation/md.txt >> +++ b/Documentation/md.txt >> @@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use >> >> md-mod.start_dirty_degraded=1 >> >> +Note that Journaling filesystems do not effectively protect data in this >> +case, because the update granularity of the RAID is larger than the journal >> +was designed to expect. Reconstructing data via partity information involes >> +matching together corresponding stripes, and updating only some of these >> +stripes renders the corresponding data in all the unmatched stripes >> +meaningless. Thus seemingly unrelated data in other parts of the filesystem >> +(stored in the unmatched stripes) can become unreadable after a partial >> +update, but the journal is only aware of the parts it modified, not the >> +"collateral damage" elsewhere in the filesystem which was affected by those >> +changes. 
>> + >> +Thus successful journal replay proves nothing in this context, and even a >> +full fsck only shows whether or not the filesystem's metadata was affected. >> +(A proper solution to this problem would involve adding journaling to the RAID >> +itself, at least during degraded writes. In the meantime, try not to allow >> +a system to shut down uncleanly with its RAID both dirty and degraded, it >> +can handle one but not both.) >> >> Superblock formats >> ------------------ >> >> > > NACK. > > Now you have moved the inaccurate documentation about journalling file > systems into the MD documentation. What is inaccurate about it? > Repeat after me: > (1) partial writes to a RAID stripe (with or without file systems, with > or without journals) create an invalid stripe That's what he's documenting. > (2) partial writes can be prevented in most cases by running with write > cache disabled or working barriers Given how long experience with storage you claim, you should know that MD RAID5 does not support barriers by now... > Rob, you should really try to take a few disks, build a working MD RAID5 > group and test your ideas. Try it with and without the write cache > enabled. ....and understand by now that statistics are irrelevant for design problems. Ouch and trying to silence people by telling them to fix the problem instead of documenting it is not nice either. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-29 9:49 ` [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek @ 2009-08-29 16:35 ` david 2009-08-29 16:35 ` david 1 sibling, 0 replies; 309+ messages in thread From: david @ 2009-08-29 16:35 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet [-- Attachment #1: Type: TEXT/PLAIN, Size: 1331 bytes --] On Sat, 29 Aug 2009, Pavel Machek wrote: >> for flash drives the danger is very straightforward (although even then >> you have to note that it depends heavily on the firmware of the device, >> some will loose lots of data, some won't loose any) > > I have not seen one that works :-(. so let's get broader testing (including testing the SSDs as well as the thumb drives) >> you are generalizing that since you have lost data on flash drives, all >> flash drives are dangerous. > > Do the flash manufacturers claim they do not cause collateral damage > during powerfail? If not, they probably are dangerous. I think that every single one of them will tell you to not unplug the drive while writing to it. in fact, I'll bet they all tell you to not unplug the drive without unmounting ('ejecting') it at the OS level. > Anyway, you wanted a test, and one is attached. It normally takes like > 4 unplugs to uncover problems. Ok, help me understand this. I copy these two files to a system, change them to point at the correct device, run them and unplug the drive while it's running. when I plug the device back in, how do I tell if it lost something unexpected? since you are writing from urandom I have no idea what data _should_ be on the drive, so how can I detect that a data block has been corrupted? 
David Lang

[-- Attachment #2: Type: TEXT/PLAIN, Size: 923 bytes --]

#!/bin/bash
#
# Copyright 2008 Pavel Machek <pavel@suse.cz>, GPLv2
#
# vfat is broken with filesize=0
#
#
if [ .$MOUNTOPTS = . ]; then
	# ext3 is needed, or you need to disable caches using hdparm.
	# odirsync is needed, else modify fstest.worker to fsync the directory.
	MOUNTOPTS="-o dirsync"
fi
if [ .$BDEV = . ]; then
#	BDEV=/dev/sdb3
	BDEV=/dev/nd0
fi
export FILESIZE=4000
export NUMFILES=4000

waitforcard() {
	umount /mnt
	echo Waiting for card:
	while ! mount $BDEV $MOUNTOPTS /mnt 2> /dev/null; do
		echo -n .
		sleep 1
	done
#	hdparm -W0 $BDEV
	echo
}

mkdir delme.fstest
cd delme.fstest
waitforcard
rm tmp.* final.* /mnt/tmp.* /mnt/final.*
while true; do
	../fstest.work
	echo
	waitforcard
	echo Testing: fsck....
	umount /mnt
	fsck -fy $BDEV
	echo Testing....
	waitforcard
	for A in final.*; do
		echo -n $A " "
		cmp $A /mnt/$A || exit
	done
	echo
done

[-- Attachment #3: Type: TEXT/PLAIN, Size: 409 bytes --]

#!/bin/bash
#
# Copyright 2008 Pavel Machek <pavel@suse.cz>, GPLv2
#

echo "Writing test files: "
for A in `seq $NUMFILES`; do
	echo -n $A " "
	rm final.$A
	cat /dev/urandom | head -c $FILESIZE > tmp.$A
	dd conv=fsync if=tmp.$A of=/mnt/final.$A 2> /dev/zero || exit
#	cat /mnt/final.$A > /dev/null || exit
	# sync should not be needed, as dd asks for fsync
#	sync
	mv tmp.$A final.$A
done

^ permalink raw reply [flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-29 16:35 ` david (?) @ 2009-08-30 7:07 ` Pavel Machek -1 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-30 7:07 UTC (permalink / raw) To: david Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! >>> for flash drives the danger is very straightforward (although even then >>> you have to note that it depends heavily on the firmware of the device, >>> some will loose lots of data, some won't loose any) >> >> I have not seen one that works :-(. > > so let's get broader testing (including testing the SSDs as well as the > thumb drives) If someone can do ssd test -- yes that would be interesting. >> Anyway, you wanted a test, and one is attached. It normally takes like >> 4 unplugs to uncover problems. > > Ok, help me understand this. > > I copy these two files to a system, change them to point at the correct > device, run them and unplug the drive while it's running. Yep. > when I plug the device back in, how do I tell if it lost something > unexpected? since you are writing from urandom I have no idea what data > _should_ be on the drive, so how can I detect that a data block has been > corrupted? I have mirror on disk you are not unplugging. See cmp || exit lines. The test continues until it detects corruption. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 11:12 ` Pavel Machek 2009-08-26 11:28 ` david @ 2009-08-26 12:01 ` Ric Wheeler 2009-08-26 12:23 ` Theodore Tso 2 siblings, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 12:01 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/26/2009 07:12 AM, Pavel Machek wrote: > On Wed 2009-08-26 06:39:14, Ric Wheeler wrote: > >> On 08/25/2009 10:58 PM, Theodore Tso wrote: >> >>> On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote: >>> >>> >>>> I agree with the whole write up outside of the above - degraded RAID >>>> does meet this requirement unless you have a second (or third, counting >>>> the split write) failure during the rebuild. >>>> >>>> >>> The argument is that if the degraded RAID array is running in this >>> state for a long time, and the power fails while the software RAID is >>> in the middle of writing out a stripe, such that the stripe isn't >>> completely written out, we could lose all of the data in that stripe. >>> >>> In other words, a power failure in the middle of writing out a stripe >>> in a degraded RAID array counts as a second failure. >>> To me, this isn't a particularly interesting or newsworthy point, >>> since a competent system administrator who cares about his data and/or >>> his hardware will (a) have a UPS, and (b) be running with a hot spare >>> and/or will imediately replace a failed drive in a RAID array. >>> >> I agree that this is not an interesting (or likely) scenario, certainly >> when compared to the much more frequent failures that RAID will protect >> against which is why I object to the document as Pavel suggested. It >> will steer people away from using RAID and directly increase their >> chances of losing their data if they use just a single disk. 
>> > So instead of fixing or at least documenting known software deficiency > in Linux MD stack, you'll try to surpress that information so that > people use more of raid5 setups? > > Perhaps the better documentation will push them to RAID1, or maybe > make them buy an UPS? > Pavel > I am against documenting unlikely scenarios out of context that will lead people to do the wrong thing. ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 11:12 ` Pavel Machek 2009-08-26 11:28 ` david 2009-08-26 12:01 ` [patch] ext2/3: document conditions when reliable operation is possible Ric Wheeler @ 2009-08-26 12:23 ` Theodore Tso 2009-08-30 7:01 ` Pavel Machek 2009-08-30 7:01 ` Pavel Machek 2 siblings, 2 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-26 12:23 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, Aug 26, 2009 at 01:12:08PM +0200, Pavel Machek wrote: > > I agree that this is not an interesting (or likely) scenario, certainly > > when compared to the much more frequent failures that RAID will protect > > against which is why I object to the document as Pavel suggested. It > > will steer people away from using RAID and directly increase their > > chances of losing their data if they use just a single disk. > > So instead of fixing or at least documenting known software deficiency > in Linux MD stack, you'll try to surpress that information so that > people use more of raid5 setups? First of all, it's not a "known software deficiency"; you can't do anything about a degraded RAID array, other than to replace the failed disk. Secondly, what we should document is things like "don't use crappy flash devices", "don't let the RAID array run in degraded mode for a long time" and "if you must (which is a bad idea), better have a UPS or a battery-backed hardware RAID". 
What we should *not* document is "ext3 is worthless for RAID 5 arrays" (simply wrong) and "ext2 is better than ext3 because it forces you to run a long, slow fsck after each boot, and that helps you to catch filesystem corruptions when the storage devices goes bad" (Second part of the statement is true, but it's still bad general advice, and it's horribly misleading) and "ext2 and ext3 have this surprising dependency that disks act like disks". (alarmist) - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 12:23 ` Theodore Tso @ 2009-08-30 7:01 ` Pavel Machek 2009-08-30 7:01 ` Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-30 7:01 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel On Wed 2009-08-26 08:23:11, Theodore Tso wrote: > On Wed, Aug 26, 2009 at 01:12:08PM +0200, Pavel Machek wrote: > > > I agree that this is not an interesting (or likely) scenario, certainly > > > when compared to the much more frequent failures that RAID will protect > > > against which is why I object to the document as Pavel suggested. It > > > will steer people away from using RAID and directly increase their > > > chances of losing their data if they use just a single disk. > > > > So instead of fixing or at least documenting known software deficiency > > in Linux MD stack, you'll try to surpress that information so that > > people use more of raid5 setups? > > First of all, it's not a "known software deficiency"; you can't do > anything about a degraded RAID array, other than to replace the failed > disk. You could add journal to raid5. > "ext2 and ext3 have this surprising dependency that disks act like > disks". (alarmist) AFAICT, you mount block device, not disk. Many block devices fail the test. And since users (and block device developers) do not know in detail how disks behave, it is hard to blame them... ("you may corrupt sector you are writing to and ext3 handles that ok" was surprise for me, for example). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 12:23 ` Theodore Tso 2009-08-30 7:01 ` Pavel Machek @ 2009-08-30 7:01 ` Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-30 7:01 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed 2009-08-26 08:23:11, Theodore Tso wrote: > On Wed, Aug 26, 2009 at 01:12:08PM +0200, Pavel Machek wrote: > > > I agree that this is not an interesting (or likely) scenario, certainly > > > when compared to the much more frequent failures that RAID will protect > > > against which is why I object to the document as Pavel suggested. It > > > will steer people away from using RAID and directly increase their > > > chances of losing their data if they use just a single disk. > > > > So instead of fixing or at least documenting known software deficiency > > in Linux MD stack, you'll try to surpress that information so that > > people use more of raid5 setups? > > First of all, it's not a "known software deficiency"; you can't do > anything about a degraded RAID array, other than to replace the failed > disk. You could add journal to raid5. > "ext2 and ext3 have this surprising dependency that disks act like > disks". (alarmist) AFAICT, you mount block device, not disk. Many block devices fail the test. And since users (and block device developers) do not know in detail how disks behave, it is hard to blame them... ("you may corrupt sector you are writing to and ext3 handles that ok" was surprise for me, for example). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 2:58 ` Theodore Tso 2009-08-26 10:39 ` Ric Wheeler 2009-08-26 10:39 ` Ric Wheeler @ 2009-08-27 5:19 ` Rob Landley 2009-08-27 12:24 ` Theodore Tso 2 siblings, 1 reply; 309+ messages in thread From: Rob Landley @ 2009-08-27 5:19 UTC (permalink / raw) To: Theodore Tso Cc: Ric Wheeler, Pavel Machek, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tuesday 25 August 2009 21:58:49 Theodore Tso wrote: > On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote: > > I agree with the whole write up outside of the above - degraded RAID > > does meet this requirement unless you have a second (or third, counting > > the split write) failure during the rebuild. > > The argument is that if the degraded RAID array is running in this > state for a long time, and the power fails while the software RAID is > in the middle of writing out a stripe, such that the stripe isn't > completely written out, we could lose all of the data in that stripe. > > In other words, a power failure in the middle of writing out a stripe > in a degraded RAID array counts as a second failure. Or panic, hang, the drive failed because the system is overheating because the air conditioner suddenly died and the server room is now an oven. (Yup, worked at that company too.) > To me, this isn't a particularly interesting or newsworthy point, > since a competent system administrator I'm a bit concerned by the argument that we don't need to document serious pitfalls because every Linux system has a sufficiently competent administrator they already know stuff that didn't even come up until the second or third day it was discussed on lkml. "You're documenting it wrong" != "you shouldn't document it". 
> who cares about his data and/or > his hardware will (a) have a UPS, I worked at a company that retested their UPSes a year after installing them and found that _none_ of them supplied more than 15 seconds charge, and when they dismantled them the batteries had physically bloated inside their little plastic cases. (Same company as the dead air conditioner, possibly overheating was involved but the little _lights_ said everything was ok.) That was by no means the first UPS I'd seen die, the suckers have a higher failure rate than hard drives in my experience. This is a device where the batteries get constantly charged and almost never tested because if it _does_ fail you just rebooted your production server, so a lot of smaller companies think they have one but actually don't. > , and (b) be running with a hot spare > and/or will imediately replace a failed drive in a RAID array. Here's hoping they shut the system down properly to install the new drive in the raid then, eh? Not accidentally pull the plug before it's finished running the ~7 minutes of shutdown scripts in the last Red Hat Enterprise I messed with... Does this situation apply during the rebuild? I.E. once a hot spare has been supplied, is the copy to the new drive linear, or will it write dirty pages to the new drive out of order, even before the reconstruction's gotten that far, _and_ do so in an order that doesn't open this race window of the data being unable to be reconstructed? If "degraded array" just means "don't have a replacement disk yet", then it sounds like what Pavel wants to document is "don't write to a degraded array at all, because power failures can cost you data due to write granularity being larger than filesystem block size". (Which still comes as news to some of us, and you need a way to remount mount the degraded array read only until the sysadmin can fix it.) 
But if "degraded array" means "hasn't finished rebuilding the new disk yet", that could easily be several hours' window and not writing to it is less of an option. (I realize a competent system administrator would obviously already know this, but I don't.) > - Ted Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-27 5:19 ` Rob Landley @ 2009-08-27 12:24 ` Theodore Tso 2009-08-27 13:10 ` Ric Wheeler ` (3 more replies) 0 siblings, 4 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-27 12:24 UTC (permalink / raw) To: Rob Landley Cc: Ric Wheeler, Pavel Machek, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thu, Aug 27, 2009 at 12:19:02AM -0500, Rob Landley wrote: > > To me, this isn't a particularly interesting or newsworthy point, > > since a competent system administrator > > I'm a bit concerned by the argument that we don't need to document > serious pitfalls because every Linux system has a sufficiently > competent administrator they already know stuff that didn't even > come up until the second or third day it was discussed on lkml. I'm not convinced that information which needs to be known by System Administrators is best documented in the kernel Documentation directory. Should there be a HOWTO document on stuff like that? Sure, if someone wants to put something like that together, having free documentation about ways to set up your storage stack in a sane way is not a bad thing. It should be noted that these sorts of issues are discussed in various books targetted at System Administrators, and in Usenix's System Administration tutorials. The computer industry is highly specialized, and so just because an OS kernel hacker might not be familiar with these issues, doesn't mean that professionals whose job it is to run data centers don't know about these things! Similarly, you could be a whiz at Linux's networking stack, but you might not know about certain pitfalls in configuring a Cisco router using IOS; does that mean we should have an IOS tutorial in the kernel documentation directory? I'm not so sure about that! > "You're documenting it wrong" != "you shouldn't document it". 
Sure, but the fact that we don't currently say much about storage stacks doesn't mean we should accept a patch that might actively mislead people. I'm NACK'ing the patch on that basis. > > who cares about his data and/or > > his hardware will (a) have a UPS, > > I worked at a company that retested their UPSes a year after > installing them and found that _none_ of them supplied more than 15 > seconds charge, and when they dismantled them the batteries had > physically bloated inside their little plastic cases. (Same company > as the dead air conditioner, possibly overheating was involved but > the little _lights_ said everything was ok.) > > That was by no means the first UPS I'd seen die, the suckers have a > higher failure rate than hard drives in my experience. This is a > device where the batteries get constantly charged and almost never > tested because if it _does_ fail you just rebooted your production > server, so a lot of smaller companies think they have one but > actually don't. Sounds like they were using really cheap UPS's; certainly not the kind I would expect to find in a data center. And if company's system administrator is using the cheapest possible consumer-grade UPS's, then yes, they might have a problem. Even an educational institution like MIT, where I was an network administrator some 15 years ago, had proper UPS's, *and* we had a diesel generator which kicked in after 15 seconds --- and we tested the diesel generator every Friday morning, to make sure it worked properly. > > , and (b) be running with a hot spare > > and/or will imediately replace a failed drive in a RAID array. > > Here's hoping they shut the system down properly to install the new > drive in the raid then, eh? Not accidentally pull the plug before > it's finished running the ~7 minutes of shutdown scripts in the last > Red Hat Enterprise I messed with... Even my home RAID array uses hot-plug SATA disks, so I can replace a failed disk without shutting down my system. 
(And yes, I have a backup battery for the hardware RAID, and the firmware runs periodic tests on it; the hardware RAID card also will send me e-mail if a RAID array drive fails and it needs to use my hot-spare. At that point, I order a new hard drive, secure in the knowledge that the system can still suffer another drive failure before falling into degraded mode. And no, this isn't some expensive enterprise RAID setup; this is just a mid-range Areca RAID card.) > If "degraded array" just means "don't have a replacement disk yet", > then it sounds like what Pavel wants to document is "don't write to > a degraded array at all, because power failures can cost you data > due to write granularity being larger than filesystem block size". > (Which still comes as news to some of us, and you need a way to > remount the degraded array read only until the sysadmin can > fix it.) If you want to document that as a property of RAID arrays, sure. But it's not something that should live in Documentation/filesystems/ext2.txt and Documentation/filesystems/ext3.txt. The MD RAID howto might be a better place, since it's far more likely more users will read it. How many system administrators read what's in the kernel's Documentation directory, after all, and this is basic information about how RAID works; it's not necessarily something that someone would *expect* to be in kernel documentation, nor would necessarily go looking for it there. And the reality is that it's not like most people go reading Documentation/* for pleasure. :-) BTW, the RAID write atomicity issue and the possibility of failures causing data loss *is* documented in the Wikipedia article on RAID. It's not written as direct practical advice to a system administrator (you'd have to go to a book that is really targeted at system administrators to find that sort of thing). - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
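The degraded-array risk Rob quotes above — a power failure whose damage is wider than the block you were writing — can be made concrete with a toy RAID-5 model (an illustrative sketch with invented names, not MD's actual implementation): parity is the XOR of the data blocks in a stripe, so a power cut that lands after the data write but before the matching parity write means a degraded array reconstructs garbage for a block that was never written at all.

```python
# Toy RAID-5 "write hole" model (illustrative only, not the MD implementation).

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# 3-"disk" stripe: d0 and d1 hold data, p holds parity = d0 XOR d1.
d0, d1 = b"\x11", b"\x22"
p = xor(d0, d1)

# Disk 1 fails; while data and parity are consistent, its contents
# can be reconstructed as d0 XOR p:
assert xor(d0, p) == d1

# Degraded-mode write to d0: power fails after the data block lands
# but before the parity update does -- the two are on different disks.
d0 = b"\x55"   # new data written; p is now stale

# After reboot, reading the failed disk reconstructs from stale parity:
reconstructed = xor(d0, p)
assert reconstructed != b"\x22"   # a block we never touched is now wrong
```

The point of the model is the collateral damage: the corrupted block belongs to whatever file happened to share the stripe, which may have been "successfully synced" long ago.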
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-27 12:24 ` Theodore Tso @ 2009-08-27 13:10 ` Ric Wheeler 2009-08-27 13:10 ` Ric Wheeler ` (2 subsequent siblings) 3 siblings, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-27 13:10 UTC (permalink / raw) To: Theodore Tso, Rob Landley, Pavel Machek, Florian Weimer, Goswin von Brederlow, kernel list On 08/27/2009 08:24 AM, Theodore Tso wrote: > On Thu, Aug 27, 2009 at 12:19:02AM -0500, Rob Landley wrote: >>> To me, this isn't a particularly interesting or newsworthy point, >>> since a competent system administrator >> >> I'm a bit concerned by the argument that we don't need to document >> serious pitfalls because every Linux system has a sufficiently >> competent administrator they already know stuff that didn't even >> come up until the second or third day it was discussed on lkml. > > I'm not convinced that information which needs to be known by System > Administrators is best documented in the kernel Documentation > directory. Should there be a HOWTO document on stuff like that? > Sure, if someone wants to put something like that together, having > free documentation about ways to set up your storage stack in a sane > way is not a bad thing. > > It should be noted that these sorts of issues are discussed in various > books targetted at System Administrators, and in Usenix's System > Administration tutorials. The computer industry is highly > specialized, and so just because an OS kernel hacker might not be > familiar with these issues, doesn't mean that professionals whose job > it is to run data centers don't know about these things! Similarly, > you could be a whiz at Linux's networking stack, but you might not > know about certain pitfalls in configuring a Cisco router using IOS; > does that mean we should have an IOS tutorial in the kernel > documentation directory? I'm not so sure about that! > >> "You're documenting it wrong" != "you shouldn't document it". 
> > Sure, but the fact that we don't currently say much about storage > stacks doesn't mean we should accept a patch that might actively > mislead people. I'm NACK'ing the patch on that basis. > >>> who cares about his data and/or >>> his hardware will (a) have a UPS, >> >> I worked at a company that retested their UPSes a year after >> installing them and found that _none_ of them supplied more than 15 >> seconds charge, and when they dismantled them the batteries had >> physically bloated inside their little plastic cases. (Same company >> as the dead air conditioner, possibly overheating was involved but >> the little _lights_ said everything was ok.) >> >> That was by no means the first UPS I'd seen die, the suckers have a >> higher failure rate than hard drives in my experience. This is a >> device where the batteries get constantly charged and almost never >> tested because if it _does_ fail you just rebooted your production >> server, so a lot of smaller companies think they have one but >> actually don't. > > Sounds like they were using really cheap UPS's; certainly not the kind > I would expect to find in a data center. And if company's system > administrator is using the cheapest possible consumer-grade UPS's, > then yes, they might have a problem. Even an educational institution > like MIT, where I was an network administrator some 15 years ago, had > proper UPS's, *and* we had a diesel generator which kicked in after 15 > seconds --- and we tested the diesel generator every Friday morning, > to make sure it worked properly. > >>> , and (b) be running with a hot spare >>> and/or will imediately replace a failed drive in a RAID array. >> >> Here's hoping they shut the system down properly to install the new >> drive in the raid then, eh? Not accidentally pull the plug before >> it's finished running the ~7 minutes of shutdown scripts in the last >> Red Hat Enterprise I messed with... 
> > Even my home RAID array uses hot-plug SATA disks, so I can replace a > failed disk without shutting down my system. (And yes, I have a > backup battery for the hardware RAID, and the firmware runs periodic > tests on it; the hardware RAID card also will send me e-mail if a RAID > array drive fails and it needs to use my hot-spare. At that point, I > order a new hard drive, secure in the knowledge that the system can > still suffer another drive failure before falling into degraded mode. > And no, this isn't some expensive enterprise RAID setup; this is just > a mid-range Areca RAID card.) > >> If "degraded array" just means "don't have a replacement disk yet", >> then it sounds like what Pavel wants to document is "don't write to >> a degraded array at all, because power failures can cost you data >> due to write granularity being larger than filesystem block size". >> (Which still comes as news to some of us, and you need a way to >> remount mount the degraded array read only until the sysadmin can >> fix it.) > > If you want to document that as a property of RAID arrays, sure. But > it's not something that should live in Documentation/filesystems/ext2.txt > and Documentation/filesystems/ext3.txt. The MD RAID howto might be a > better place, since it's far more likely more users will read it. How > many system administrators read what's in the kernel's Documentation > directory, after all, and this is basic information about how RAID > works; it's not necessarily something that someone would *expect* to > be in kernel documentation, nor would necessarily go looking for it > there. And the reality is that it's not like most people go reading > Documentation/* for pleasure. :-) > > BTW, the RAID write atomicity issue and the possibility of failures > cause data loss *is* documented in the Wikipedia article on RAID. 
> It's not as written as direct practical advice to a system > administrator (you'd have to go to a book that is really targetted at > system administrators to find that sort of thing). > > - Ted One thing that does need fixing for some MD configurations is to stress again that we need to make sure that barrier operations are properly supported or users will need to disable the write cache on devices with volatile write caches. Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-27 13:10 ` Ric Wheeler @ 2009-08-27 16:54 ` Jeff Garzik 2009-08-27 18:09 ` Alasdair G Kergon 2009-09-01 14:01 ` Pavel Machek 0 siblings, 2 replies; 309+ messages in thread From: Jeff Garzik @ 2009-08-27 16:54 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Rob Landley, Pavel Machek, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/27/2009 09:10 AM, Ric Wheeler wrote: > One thing that does need fixing for some MD configurations is to stress > again that we need to make sure that barrier operations are properly > supported or users will need to disable the write cache on devices with > volatile write caches. Agreed; chime in on Christoph's linux-vfs thread if people have input. I quickly glanced at MD and DM. Currently, upstream, we see a lot of

	if (unlikely(bio_barrier(bio))) {
		bio_endio(bio, -EOPNOTSUPP);
		return 0;
	}

in DM and MD make_request functions. Only md/raid1 supports barriers at present, it seems. None of the other MD drivers support barriers. DM has some barrier code... but the above code was pasted from DM's make_request function, so I am guessing that DM's barrier stuff is incomplete and disabled at present. I've been mentioning this issue for years... glad some people finally noticed :) Jeff ^ permalink raw reply [flat|nested] 309+ messages in thread
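What the pasted make_request pattern means for a filesystem can be sketched with a toy model (names invented for illustration; the real interface is bio-based kernel code): the device rejects the barrier request with -EOPNOTSUPP, and the filesystem's only recourse is to reissue the write without the barrier, silently losing the ordering guarantee.

```python
# Toy model of the -EOPNOTSUPP barrier rejection (illustrative only).

EOPNOTSUPP = 95

class NoBarrierDevice:
    """Mimics the quoted make_request pattern: barrier bios are rejected."""
    def __init__(self):
        self.log = []

    def submit(self, data, barrier=False):
        if barrier:
            return -EOPNOTSUPP      # i.e. bio_endio(bio, -EOPNOTSUPP)
        self.log.append(data)       # ordinary write: no ordering promise
        return 0

dev = NoBarrierDevice()
ret = dev.submit("journal commit", barrier=True)
assert ret == -EOPNOTSUPP

# The filesystem falls back to a plain write -- it "works", but the
# drive's volatile cache is now free to reorder it against earlier I/O:
assert dev.submit("journal commit", barrier=False) == 0
```

This is why the thread keeps returning to "fix barriers or disable the write cache": the fallback path succeeds at the API level while quietly dropping the property the journal commit depended on.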
* Re: MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-27 16:54 ` MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Jeff Garzik @ 2009-08-27 18:09 ` Alasdair G Kergon 2009-09-01 14:01 ` Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Alasdair G Kergon @ 2009-08-27 18:09 UTC (permalink / raw) To: Jeff Garzik Cc: Ric Wheeler, Theodore Tso, Rob Landley, Pavel Machek, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, Mikulas Patocka On Thu, Aug 27, 2009 at 12:54:05PM -0400, Jeff Garzik wrote: > DM has some barrier code... but the above code was pasted from DM's > make_request function, so I am guessing that DM's barrier stuff is > incomplete and disabled at present. That code is from the new request-based multipath implementation in 2.6.31, which doesn't support them yet. But bio-based dm does support barriers now. (Just missing some patches to complete the dm-raid1 support that are still under review IIRC.) Alasdair ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-27 16:54 ` MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Jeff Garzik 2009-08-27 18:09 ` Alasdair G Kergon @ 2009-09-01 14:01 ` Pavel Machek 2009-09-02 16:17 ` Michael Tokarev 1 sibling, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-09-01 14:01 UTC (permalink / raw) To: Jeff Garzik Cc: Ric Wheeler, Theodore Tso, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thu 2009-08-27 12:54:05, Jeff Garzik wrote: > On 08/27/2009 09:10 AM, Ric Wheeler wrote: >> One thing that does need fixing for some MD configurations is to stress >> again that we need to make sure that barrier operations are properly >> supported or users will need to disable the write cache on devices with >> volatile write caches. > > Agreed; chime in on Christoph's linux-vfs thread if people have input. > > I quickly glanced at MD and DM. Currently, upstream, we see a lot of > > if (unlikely(bio_barrier(bio))) { > bio_endio(bio, -EOPNOTSUPP); > return 0; > } > > in DM and MD make_request functions. > > Only md/raid1 supports barriers at present, it seems. None of the other > MD drivers support barriers. Not even md/raid0? Ouch :-(. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-01 14:01 ` Pavel Machek @ 2009-09-02 16:17 ` Michael Tokarev 0 siblings, 0 replies; 309+ messages in thread From: Michael Tokarev @ 2009-09-02 16:17 UTC (permalink / raw) To: Pavel Machek Cc: Jeff Garzik, Ric Wheeler, Theodore Tso, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Pavel Machek wrote: > On Thu 2009-08-27 12:54:05, Jeff Garzik wrote: [] >> Only md/raid1 supports barriers at present, it seems. None of the other >> MD drivers support barriers. > > Not even md/raid0? Ouch :-(. Only for raid1 is there no requirement for inter-drive ordering. Hence only raid1 supports barriers (and gained that support very recently, in 1 or 2 kernel releases). For the rest, including raid0 and linear, inter-drive ordering is necessary to implement barriers. Or md would need its own queue (flushing) mechanisms. /mjt ^ permalink raw reply [flat|nested] 309+ messages in thread
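Michael's point about inter-drive ordering can be sketched with the same kind of toy model (invented names, not md's request path): once one filesystem's blocks interleave across member drives, a barrier is only honored if *every* member is flushed, whereas raid1 escapes because each member holds a complete, independently ordered copy.

```python
# Toy model of barriers on a striped (raid0-like) device (illustrative only).

class Member:
    def __init__(self):
        self.cache, self.media = [], []   # volatile cache vs durable media

    def write(self, block):
        self.cache.append(block)

    def flush(self):
        self.media += self.cache
        self.cache = []

class Stripe:
    """Blocks alternate across members, so write ordering spans drives."""
    def __init__(self, n=2):
        self.members = [Member() for _ in range(n)]

    def write(self, lba, data):
        self.members[lba % len(self.members)].write(data)

    def barrier(self):
        # Meaningful only if EVERY member drive is flushed; flushing just
        # one would let the commit on the other drive float in its cache.
        for m in self.members:
            m.flush()

s = Stripe()
s.write(0, "data")      # lands on member 0
s.write(1, "commit")    # lands on member 1
s.barrier()
assert all(m.cache == [] for m in s.members)   # nothing volatile anywhere
```

This is the "md should have its own queue (flushing) mechanisms" option in miniature: without such a flush-all step, raid0/linear cannot honestly complete a barrier, which is why they returned -EOPNOTSUPP at the time.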
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-27 12:24 ` Theodore Tso 2009-08-27 13:10 ` Ric Wheeler 2009-08-27 13:10 ` Ric Wheeler @ 2009-08-29 10:02 ` Pavel Machek 2009-08-29 10:02 ` Pavel Machek 3 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-29 10:02 UTC (permalink / raw) To: Theodore Tso, Rob Landley, Ric Wheeler, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thu 2009-08-27 08:24:23, Theodore Tso wrote: > On Thu, Aug 27, 2009 at 12:19:02AM -0500, Rob Landley wrote: > > > To me, this isn't a particularly interesting or newsworthy point, > > > since a competent system administrator > > > > I'm a bit concerned by the argument that we don't need to document > > serious pitfalls because every Linux system has a sufficiently > > competent administrator they already know stuff that didn't even > > come up until the second or third day it was discussed on lkml. > > I'm not convinced that information which needs to be known by System > Administrators is best documented in the kernel Documentation > directory. Should there be a HOWTO document on stuff like that? It is not only for system administrators; I was trying to find out if the kernel is buggy, and that should be in the kernel tree. > > If "degraded array" just means "don't have a replacement disk yet", > > then it sounds like what Pavel wants to document is "don't write to > > a degraded array at all, because power failures can cost you data > > due to write granularity being larger than filesystem block size". > > (Which still comes as news to some of us, and you need a way to > > remount mount the degraded array read only until the sysadmin can > > fix it.) > > If you want to document that as a property of RAID arrays, sure. But > it's not something that should live in Documentation/filesystems/ext2.txt > and Documentation/filesystems/ext3.txt. 
> The MD RAID howto might be a better place, since it's far more likely more users will read it. ext3 documentation states that the journal protects fs integrity on powerfail. If you don't want to talk about storage stacks, perhaps that should be removed? Now... You mocked me for 'ext3 expects disks to behave like disks (alarmist)'. I actually believe that should be written somewhere. ext3 depends on fairly subtle storage disk characteristics, and many common configs just do not meet the expectations (missing barriers is most common, followed by collateral damage). Maybe not documenting that was okay 10 years ago, but with all the USB sticks and raid arrays around, it's just sloppy. Because those characteristics are not documented, storage stack authors do not know what they have to guarantee, and the result is bad. See for example nbd -- it does not propagate barriers and is therefore unsafe. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 1:00 ` Theodore Tso 2009-08-26 1:15 ` Ric Wheeler @ 2009-08-26 1:15 ` Ric Wheeler 2009-08-26 1:16 ` Pavel Machek ` (4 subsequent siblings) 6 siblings, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 1:15 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley On 08/25/2009 09:00 PM, Theodore Tso wrote: > On Tue, Aug 25, 2009 at 08:31:21PM -0400, Ric Wheeler wrote: > >>>> You are simply incorrect, Ted did not say that ext3 does not work >>>> with MD raid5. >>>> >>> http://lkml.org/lkml/2009/8/25/312 >>> Pavel >>> >> I will let Ted clarify his text on his own, but the quoted text says "... >> have potential...". >> >> Why not ask Neil if he designed MD to not work properly with ext3? >> > So let me clarify by saying the following things. > > 1) Filesystems are designed to expect that storage devices have > certain properties. These include returning the same data that you > wrote, and that an error when writing a sector, or a power failure > when writing a sector, should not be amplified to cause collateral > damage with previously successfully written sectors. > > 2) Degraded RAID 5/6 filesystems do not meet these properties. > Neither do cheap flash drives. This increases the chances you can > lose, bigtime. > > I agree with the whole write up outside of the above - degraded RAID does meet this requirement unless you have a second (or third, counting the split write) failure during the rebuild. Note that the window of exposure during a RAID rebuild is linear with the size of your disk and how much you detune the rebuild... ric > 3) Does that mean that you shouldn't use ext3 on RAID drives? Of > course not! First of all, Ext3 still saves you against kernel panics > and hangs caused by device driver bugs or other kernel hangs. 
> You > will lose less data, and avoid needing to run a long and painful fsck > after a forced reboot, compared to if you used ext2. You are making > an assumption that the only time running the journal takes place is > after a power failure. But if the system hangs, and you need to hit > the Big Red Switch, or if you're using the system in a Linux High > Availability setup and the ethernet card fails, so the STONITH ("shoot > the other node in the head") system forces a hard reset of the system, > or you get a kernel panic which forces a reboot, in all of these cases > ext3 will save you from a long fsck, and it will do so safely. > > Secondly, what's the probability of a failure causing the RAID array to > become degraded, followed by a power failure, versus a power failure > while the RAID array is not running in degraded mode? Hopefully you > are running with the RAID array in full, proper running order a much > larger percentage of the time than running with the RAID array in > degraded mode. If not, the bug is with the system administrator! > > If you are someone who tends to run for long periods of time in > degraded mode --- then better get a UPS. And certainly if you want to > avoid the chances of failure, periodically scrubbing the disks so you > detect hard drive failures early, instead of waiting until a disk > fails before letting the rebuild find the dreaded "second failure" > which causes data loss, is a d*mned good idea. > > Maybe a random OS engineer doesn't know these things --- but trust me > when I say a competent system administrator had better be familiar > with these concepts. And someone who wants their data to be reliably > stored needs to do some basic storage engineering if they want to have > long-term data reliability. 
> (That, or maybe they should outsource > their long-term reliable storage to some service such as Amazon S3 --- > see Jeremy Zawodny's analysis about how it can be cheaper, here: > http://jeremy.zawodny.com/blog/archives/007624.html) > > But we *do* need to be careful that we don't write documentation which > ends up giving users the wrong impression. The bottom line is that > you're better off using ext3 over ext2, even on a RAID array, for the reasons listed above. > > Are you better off using ext3 over ext2 on a crappy flash drive? > Maybe --- if you are also using crappy proprietary video drivers, such > as Ubuntu ships, where every single time you exit a 3d game the system > crashes (and Ubuntu users accept this as normal?!?), then ext3 might > be a better choice since you'll reduce the chance of data loss when > the system locks up or crashes thanks to the aforementioned crappy > proprietary video drivers from Nvidia. On the other hand, crappy > flash drives *do* have really bad write amplification effects, where a > 4K write can cause 128k or more worth of flash to be rewritten, such > that using ext3 could seriously degrade the lifetime of said crappy > flash drive; furthermore, the crappy flash drives have such terrible > write performance that using ext3 can be a performance nightmare. > This, of course, doesn't apply to well-implemented SSD's, such as the > Intel's X25-M and X18-M. So here your mileage may vary. Still, if > you are using crappy proprietary drivers which cause system hangs and > crashes at a far greater rate than power fail-induced unclean > shutdowns, ext3 *still* might be the better choice, even with crappy > flash drives. > > The best thing to do, of course, is to improve your storage stack; use > competently implemented SSD's instead of crap flash cards. If your > hardware RAID card supports a battery option, *get* the battery. Add > a UPS to your system. 
> Provision your RAID array with hot spares, and > regularly scrub (read-test) your array so that failed drives can be > detected early. Make sure you configure your MD setup so that you get > e-mail when a hard drive fails and the array starts running in > degraded mode, so you can replace the failed drive ASAP. > > At the end of the day, filesystems are not magic. They can't > compensate for crap hardware, or incompetently administered machines. > > - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 1:00 ` Theodore Tso 2009-08-26 1:15 ` Ric Wheeler 2009-08-26 1:15 ` Ric Wheeler @ 2009-08-26 1:16 ` Pavel Machek 2009-08-26 1:16 ` Pavel Machek ` (3 subsequent siblings) 6 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-26 1:16 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel Hi! > 3) Does that mean that you shouldn't use ext3 on RAID drives? Of > course not! First of all, Ext3 still saves you against kernel panics > and hangs caused by device driver bugs or other kernel hangs. You > will lose less data, and avoid needing to run a long and painful fsck > after a forced reboot, compared to if you used ext2. You are making Actually... ext3 + MD RAID5 will still have a problem on kernel panic. MD RAID5 is implemented in software, so if kernel panics, you can still get inconsistent data in your array. I mostly agree with the rest. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 1:16 ` Pavel Machek @ 2009-08-26 2:55 ` Theodore Tso 2009-08-26 13:37 ` Ric Wheeler 2009-08-26 13:37 ` Ric Wheeler 0 siblings, 2 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-26 2:55 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, Aug 26, 2009 at 03:16:06AM +0200, Pavel Machek wrote: > Hi! > > > 3) Does that mean that you shouldn't use ext3 on RAID drives? Of > > course not! First of all, Ext3 still saves you against kernel panics > > and hangs caused by device driver bugs or other kernel hangs. You > > will lose less data, and avoid needing to run a long and painful fsck > > after a forced reboot, compared to if you used ext2. You are making > > Actually... ext3 + MD RAID5 will still have a problem on kernel > panic. MD RAID5 is implemented in software, so if kernel panics, you > can still get inconsistent data in your array. Only if the MD RAID array is running in degraded mode (and again, if the system is in this state for a long time, the bug is in the system administrator). And even then, it depends on how the kernel dies. If the system hangs due to some deadlock, or we get an OOPS that kills a process while still holding some locks, and that leads to a deadlock, it's likely the low-level MD driver can still complete the stripe write, and no data will be lost. If the kernel ties itself in knots due to running out of memory, and the OOM handler is invoked, someone hitting the reset button to force a reboot will also be fine. If the RAID array is degraded, and we get an oops in interrupt handler, such that the system is immediately halted --- then yes, data could get lost. But there are many system crashes where the software RAID's ability to complete a stripe write would not be compromised. 
- Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
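The failure mode being debated here (the RAID-5 "write hole") is easy to simulate. A toy sketch, not MD's actual code: a stripe keeps XOR parity across its data chunks, and if a crash lands between the data write and the matching parity write, a later reconstruction of a *different*, failed chunk from the stale parity silently returns garbage:

```python
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def parity(chunks):
    return reduce(xor, chunks)

stripe = [b"AAAA", b"BBBB", b"CCCC"]   # data chunks on three disks
p = parity(stripe)                     # parity chunk on a fourth disk

# Crash window: chunk 0 is rewritten on its disk, but the updated
# parity never reaches the platter. The stripe is now inconsistent.
stripe[0] = b"XXXX"

# Later, disk 2 fails. Its chunk is reconstructed from the surviving
# data chunks and the (stale) parity:
reconstructed = xor(xor(stripe[0], stripe[1]), p)
print(reconstructed != b"CCCC")   # True: old data on disk 2 is silently lost
```

Note that the chunk that gets corrupted was never written by the filesystem at all, which is why journaling cannot detect the damage.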
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 2:55 ` Theodore Tso @ 2009-08-26 13:37 ` Ric Wheeler 2009-08-26 13:37 ` Ric Wheeler 1 sibling, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 13:37 UTC (permalink / raw) To: Theodore Tso, Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list On 08/25/2009 10:55 PM, Theodore Tso wrote: > On Wed, Aug 26, 2009 at 03:16:06AM +0200, Pavel Machek wrote: >> Hi! >> >>> 3) Does that mean that you shouldn't use ext3 on RAID drives? Of >>> course not! First of all, Ext3 still saves you against kernel panics >>> and hangs caused by device driver bugs or other kernel hangs. You >>> will lose less data, and avoid needing to run a long and painful fsck >>> after a forced reboot, compared to if you used ext2. You are making >> >> Actually... ext3 + MD RAID5 will still have a problem on kernel >> panic. MD RAID5 is implemented in software, so if kernel panics, you >> can still get inconsistent data in your array. > > Only if the MD RAID array is running in degraded mode (and again, if > the system is in this state for a long time, the bug is in the system > administrator). And even then, it depends on how the kernel dies. If > the system hangs due to some deadlock, or we get an OOPS that kills a > process while still holding some locks, and that leads to a deadlock, > it's likely the low-level MD driver can still complete the stripe > write, and no data will be lost. If the kernel ties itself in knots > due to running out of memory, and the OOM handler is invoked, someone > hitting the reset button to force a reboot will also be fine. > > If the RAID array is degraded, and we get an oops in interrupt > handler, such that the system is immediately halted --- then yes, data > could get lost. But there are many system crashes where the software > RAID's ability to complete a stripe write would not be compromised. 
> > - Ted Just to add some real world data, Bianca Schroeder published a really good paper that looks at failures in national labs which has actual measured disk failures: http://www.cs.cmu.edu/~bianca/fast07.pdf Her numbers showed various rates of failures, but depending on the box, drive type, etc, they lost between 1-6% of the installed drives each year. There is also a good paper from Google: http://labs.google.com/papers/disk_failures.html Both of the above are largely linux boxes. And several other FAST papers on failures in commercial RAID boxes, most notably by NetApp. If reading papers is not at the top of your list of things to do, just skim through and look for the tables on disk failures, etc. which have great measurements of what really failed in these systems... Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
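The per-drive rates cited above translate directly into array-level risk. A back-of-the-envelope sketch (assuming independent failures, which the FAST papers show is optimistic, since real failures correlate within a batch or a box):

```python
def p_any_failure(n_drives, annual_rate):
    """Probability that at least one of n independent drives fails in a year."""
    return 1 - (1 - annual_rate) ** n_drives

# At the 1-6%/year rates measured in the Schroeder paper, an 8-drive
# array sees at least one failure per year with noticeable probability:
for afr in (0.01, 0.06):
    print(f"AFR {afr:.0%}: p(>=1 failure in 8-drive array) = "
          f"{p_any_failure(8, afr):.1%}/year")
```

This is why the window of degraded-mode operation after a failure, before the hot spare finishes rebuilding, matters so much in the argument above.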
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 1:00 ` Theodore Tso ` (3 preceding siblings ...) 2009-08-26 1:16 ` Pavel Machek @ 2009-08-26 2:53 ` Henrique de Moraes Holschuh 2009-08-26 2:53 ` Henrique de Moraes Holschuh 2009-09-03 9:47 ` Pavel Machek 6 siblings, 0 replies; 309+ messages in thread From: Henrique de Moraes Holschuh @ 2009-08-26 2:53 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley On Tue, 25 Aug 2009, Theodore Tso wrote: > a UPS to your system. Provision your RAID array with hot spares, and > regularly scrub (read-test) your array so that failed drives can be Can we get a proper scrub function (full rewrite of all component disks), please? Not every disk out there will stop a streaming read to rewrite weak sectors it happens to come across. > detected early. Make sure you configure your MD setup so that you get > e-mail when a hard drive fails and the array starts running in > degraded mode, so you can replace the failed drive ASAP. Debian got this right :-) -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 1:00 ` Theodore Tso @ 2009-09-03 9:47 ` Pavel Machek 2009-08-26 1:15 ` Ric Wheeler ` (5 subsequent siblings) 6 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-09-03 9:47 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue 2009-08-25 21:00:18, Theodore Tso wrote: > On Tue, Aug 25, 2009 at 08:31:21PM -0400, Ric Wheeler wrote: > >>> You are simply incorrect, Ted did not say that ext3 does not work > >>> with MD raid5. > >> > >> http://lkml.org/lkml/2009/8/25/312 > > > > I will let Ted clarify his text on his own, but the quoted text says "... > > have potential...". > > > > Why not ask Neil if he designed MD to not work properly with ext3? > > So let me clarify by saying the following things. > > 1) Filesystems are designed to expect that storage devices have > certain properties. These include returning the same data that you > wrote, and that an error when writing a sector, or a power failure > when writing a sector, should not be amplified to cause collateral > damage with previously successfully written sectors. Yes. Unfortunately, different filesystems expect different properties from block devices. ext3 will work with write cache enabled/barriers enabled, while ext2 needs write cache disabled. The requirements are also quite surprising; AFAICT ext3 can handle a disk writing garbage to a single sector during powerfail, while xfs can not handle that. Now, how do you expect users to know these subtle details when they are not documented anywhere? And why are you fighting against documenting these subtleties? > Secondly, what's the probability of a failure causing the RAID array to > become degraded, followed by a power failure, versus a power failure > while the RAID array is not running in degraded mode? 
Hopefully you > are running with the RAID array in full, proper running order a much > larger percentage of the time than running with the RAID array in > degraded mode. If not, the bug is with the system administrator! As was uncovered, MD RAID does not properly support barriers, so... you don't actually need a drive failure. > Maybe a random OS engineer doesn't know these things --- but trust me > when I say a competent system administrator had better be familiar > with these concepts. And someone who wants their data to be > reliably Trust me, 99% of sysadmins are not competent by your definition. So this should be documented. > At the end of the day, filesystems are not magic. They can't > compensate for crap hardware, or incompetently administered machines. ext3 greatly contributes to administrator incompetency: # The journal supports the transactions start and stop, and in case of a # crash, the journal can replay the transactions to quickly put the # partition back into a consistent state. ...it does not mention that (non-default!) barrier=1 is needed to make this reliable, nor does it mention that there are certain requirements for this to work. It just says that the journal will magically help you. And you wonder why people expect magic from your filesystem? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
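The ordering requirement behind `barrier=1` can be made concrete. A toy model of a reordering write-back cache, not ext3's actual jbd code: the journal commit record must not reach the platter before the journal data it commits, and a barrier forces the cache to drain first. The worst-case reordering modeled here (the newest queued write lands first) is an assumption for illustration:

```python
class CachingDisk:
    """Write-back cache: on power loss, model the worst-case reordering
    where the most recently queued write reached the platter first."""
    def __init__(self):
        self.media = {}     # durably on the platter
        self.cache = []     # queued (block, data) writes

    def write(self, block, data):
        self.cache.append((block, data))

    def barrier(self):
        # A write barrier drains the cache before later writes are queued.
        for block, data in self.cache:
            self.media[block] = data
        self.cache = []

    def power_fail(self):
        if self.cache:
            block, data = self.cache[-1]   # reordered: newest write won
            self.media[block] = data
        self.cache = []

def replay_is_safe(use_barrier):
    disk = CachingDisk()
    disk.write("journal-data", "new-metadata")
    if use_barrier:
        disk.barrier()                     # what mounting with barrier=1 buys
    disk.write("commit-record", "valid")
    disk.power_fail()
    # A valid commit record without its journal data means journal
    # replay writes garbage metadata into the filesystem.
    committed = disk.media.get("commit-record") == "valid"
    data_present = "journal-data" in disk.media
    return data_present or not committed

print(replay_is_safe(use_barrier=True))    # True: replay sees consistent journal
print(replay_is_safe(use_barrier=False))   # False: commit record without data
```

With write caching disabled (`hdparm -W0`) every write is durable in order, which is why that is the documented alternative for filesystems that never issue barriers.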
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:53 ` Pavel Machek 2009-08-26 0:11 ` Ric Wheeler @ 2009-08-26 3:50 ` Rik van Riel 1 sibling, 0 replies; 309+ messages in thread From: Rik van Riel @ 2009-08-26 3:50 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Pavel Machek wrote: > If you want to argue that ext3/MD RAID5/no UPS combination is still > less likely to fail than single SATA disk given part fail > probabilities, go ahead and present nice statistics. Its just that I'm > not interested in them. The reality in your document does not match up with the reality out there in the world. That sounds like a good reason not to have your (incorrect) document out there, confusing people. -- All rights reversed. ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:40 ` Ric Wheeler 2009-08-25 23:48 ` david 2009-08-25 23:53 ` Pavel Machek @ 2009-08-27 3:53 ` Rob Landley 2009-08-27 11:43 ` Ric Wheeler 2 siblings, 1 reply; 309+ messages in thread From: Rob Landley @ 2009-08-27 3:53 UTC (permalink / raw) To: Ric Wheeler Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote: > Repeat experiment until you get up to something like google scale or the > other papers on failures in national labs in the US and then we can have an > informed discussion. On google scale anvil lightning can fry your machine out of a clear sky. However, there are still a few non-enterprise users out there, and knowing that specific usage patterns don't behave like they expect might be useful to them. > >> I can promise you that hot unplugging and replugging a S-ATA drive will > >> also lose you data if you are actively writing to it (ext2, 3, > >> whatever). > > > > I can promise you that running S-ATA drive will also lose you data, > > even if you are not actively writing to it. Just wait 10 years; so > > what is your point? > > I lost a s-ata drive 24 hours after installing it in a new box. If I had > MD RAID5, I would not have lost any. > > My point is that you fail to take into account the rate of failures of a > given configuration and the probability of data loss given those rates. Actually, that's _exactly_ what he's talking about. When writing to a degraded raid or a flash disk, journaling is essentially useless. If you get a power failure, kernel panic, somebody tripping over a USB cable, and so on, your filesystem will not be protected by journaling. Your data won't be trashed _every_ time, but the likelihood is much greater than experience with journaling in other contexts would suggest. 
Worse, the journaling may be counterproductive by _hiding_ many errors that fsck would promptly detect, so when the error is detected it may not be associated with the event that caused it. It also may not be noticed until good backups of the data have been overwritten or otherwise cycled out. You seem to be arguing that Linux is no longer used anywhere but the enterprise, so issues affecting USB flash keys or cheap software-only RAID aren't worth documenting? Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
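The granularity mismatch Rob describes can also be simulated. A sketch assuming a naive FTL with a 128K eraseblock holding 32 4K filesystem blocks (the numbers are illustrative): an interrupted read-erase-reprogram cycle trashes *neighbouring* blocks the filesystem never asked to write, which is exactly the damage journaling cannot see:

```python
ERASE_BLOCK = 32   # filesystem blocks per flash eraseblock (128K / 4K)

class CheapFlash:
    def __init__(self, n_blocks):
        self.blocks = [b"old-data"] * n_blocks

    def write_block(self, i, data, power_fails=False):
        """Rewriting one 4K block means erasing and reprogramming the
        whole eraseblock that contains it (no wear-leveled FTL here)."""
        start = (i // ERASE_BLOCK) * ERASE_BLOCK
        group = self.blocks[start:start + ERASE_BLOCK]
        group[i - start] = data
        # Erase phase: the whole eraseblock goes blank first.
        for j in range(start, start + ERASE_BLOCK):
            self.blocks[j] = b"erased"
        if power_fails:
            return                # reprogram phase never happens
        self.blocks[start:start + ERASE_BLOCK] = group

flash = CheapFlash(64)
flash.write_block(5, b"journal-commit", power_fails=True)

# Block 5 was the only block the filesystem wrote, but its 31 innocent
# neighbours in the same eraseblock were destroyed with it:
lost = sum(1 for b in flash.blocks[:ERASE_BLOCK] if b == b"erased")
print(lost)   # prints 32
```

Since the trashed neighbours were not part of any in-flight transaction, journal replay reports success while the damage goes unnoticed until fsck or a read hits those blocks.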
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-27 3:53 ` Rob Landley @ 2009-08-27 11:43 ` Ric Wheeler 2009-08-27 20:51 ` Rob Landley 2009-08-27 22:13 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek 0 siblings, 2 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-27 11:43 UTC (permalink / raw) To: Rob Landley Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/26/2009 11:53 PM, Rob Landley wrote: > On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote: > >> Repeat experiment until you get up to something like google scale or the >> other papers on failures in national labs in the US and then we can have an >> informed discussion. >> > On google scale anvil lightning can fry your machine out of a clear sky. > > However, there are still a few non-enterprise users out there, and knowing > that specific usage patterns don't behave like they expect might be useful to > them. > > You are missing the broader point of both papers. They (and people like me when back at EMC) look at large numbers of machines and try to fix what actually breaks when run in the real world and causes data loss. The motherboards, S-ATA controllers, disk types are the same class of parts that I have in my desktop box today. The advantage of google, national labs, etc is that they have large numbers of systems and can draw conclusions that are meaningful to our broad user base. Specifically, in using S-ATA drives (just like ours, maybe slightly more reliable) they see up to 7% of those drives fail each year. All users have "soft" drive failures like single remapped sectors. These errors happen extremely commonly and are what RAID deals with well. 
What does not happen commonly is that during the RAID rebuild (kicked off only after a drive is kicked out), you push the power button or have a second failure (power outage). We will have more users lose data if they decide to use ext2 instead of ext3 and use only single disk storage. We have real numbers that show that is true. Injecting double faults into a system that handles single faults is frankly not that interesting. You can get better protection from these double faults if you move to "cloud" like storage configs where each box is fault tolerant, but you also spread your data over multiple boxes in multiple locations. Regards, Ric >>>> I can promise you that hot unplugging and replugging a S-ATA drive will >>>> also lose you data if you are actively writing to it (ext2, 3, >>>> whatever). >>>> >>> I can promise you that running S-ATA drive will also lose you data, >>> even if you are not actively writing to it. Just wait 10 years; so >>> what is your point? >>> >> I lost a s-ata drive 24 hours after installing it in a new box. If I had >> MD RAID5, I would not have lost any. >> >> My point is that you fail to take into account the rate of failures of a >> given configuration and the probability of data loss given those rates. >> > Actually, that's _exactly_ what he's talking about. > > When writing to a degraded raid or a flash disk, journaling is essentially > useless. If you get a power failure, kernel panic, somebody tripping over a > USB cable, and so on, your filesystem will not be protected by journaling. > Your data won't be trashed _every_ time, but the likelihood is much greater > than experience with journaling in other contexts would suggest. > > Worse, the journaling may be counterproductive by _hiding_ many errors that > fsck would promptly detect, so when the error is detected it may not be > associated with the event that caused it. It also may not be noticed until > good backups of the data have been overwritten or otherwise cycled out. 
> > You seem to be arguing that Linux is no longer used anywhere but the > enterprise, so issues affecting USB flash keys or cheap software-only RAID > aren't worth documenting? > > Rob > ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-27 11:43 ` Ric Wheeler @ 2009-08-27 20:51 ` Rob Landley 2009-08-27 22:00 ` Ric Wheeler 2009-08-28 14:49 ` david 2009-08-27 22:13 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek 1 sibling, 2 replies; 309+ messages in thread From: Rob Landley @ 2009-08-27 20:51 UTC (permalink / raw) To: Ric Wheeler Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thursday 27 August 2009 06:43:49 Ric Wheeler wrote: > On 08/26/2009 11:53 PM, Rob Landley wrote: > > On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote: > >> Repeat experiment until you get up to something like google scale or the > >> other papers on failures in national labs in the US and then we can have > >> an informed discussion. > > > > On google scale anvil lightning can fry your machine out of a clear sky. > > > > However, there are still a few non-enterprise users out there, and > > knowing that specific usage patterns don't behave like they expect might > > be useful to them. > > You are missing the broader point of both papers. No, I'm dismissing the papers (some of which I read when they first came out and got slashdotted) as irrelevant to the topic at hand. Pavel has two failure modes which he can trivially reproduce. The USB stick one is reproducible on a laptop by jostling said stick. I myself used to have a literal USB keychain, and the weight of keys dangling from it pulled it out of the USB socket fairly easily if I wasn't careful. At the time nobody had told me a journaling filesystem was not a reasonable safeguard here. 
Presumably the degraded raid one can be reproduced under an emulator, with no hardware directly involved at all, so talking about hardware failure rates ignores the fact that he's actually discussing a _software_ problem. It may happen in _response_ to hardware failures, but the damage he's attempting to document happens entirely in software. These failure modes can cause data loss which journaling can't help, but which journaling might (or might not) conceivably hide so you don't immediately notice it. They share a common underlying assumption that the storage device's update granularity is less than or equal to the filesystem's block size, which is not actually true of all modern storage devices. The fact he's only _found_ two instances where this assumption bites doesn't mean there aren't more waiting to be found, especially as more new storage media types get introduced. Pavel's response was to attempt to document this. Not that journaling is _bad_, but that it doesn't protect against this class of problem. Your response is to talk about google clusters, cloud storage, and cite academic papers of statistical hardware failure rates. As I understand the discussion, that's not actually the issue Pavel's talking about, merely one potential trigger for it. Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-27 20:51 ` Rob Landley @ 2009-08-27 22:00 ` Ric Wheeler 2009-08-28 14:49 ` david 1 sibling, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-27 22:00 UTC (permalink / raw) To: Rob Landley Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/27/2009 04:51 PM, Rob Landley wrote: > On Thursday 27 August 2009 06:43:49 Ric Wheeler wrote: > >> On 08/26/2009 11:53 PM, Rob Landley wrote: >> >>> On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote: >>> >>>> Repeat experiment until you get up to something like google scale or the >>>> other papers on failures in national labs in the US and then we can have >>>> an informed discussion. >>>> >>> On google scale anvil lightning can fry your machine out of a clear sky. >>> >>> However, there are still a few non-enterprise users out there, and >>> knowing that specific usage patterns don't behave like they expect might >>> be useful to them. >>> >> You are missing the broader point of both papers. >> > No, I'm dismissing the papers (some of which I read when they first came out > and got slashdotted) as irrelevant to the topic at hand. > I guess I have to dismiss your dismissing then. > Pavel has two failure modes which he can trivially reproduce. The USB stick > one is reproducible on a laptop by jostling said stick. I myself used to have > a literal USB keychain, and the weight of keys dangling from it pulled it out > of the USB socket fairly easily if I wasn't careful. At the time nobody had > told me a journaling filesystem was not a reasonable safeguard here. > > Presumably the degraded raid one can be reproduced under an emulator, with no > hardware directly involved at all, so talking about hardware failure rates > ignores the fact that he's actually discussing a _software_ problem. 
It may > happen in _response_ to hardware failures, but the damage he's attempting to > document happens entirely in software. > > These failure modes can cause data loss which journaling can't help, but which > journaling might (or might not) conceivably hide so you don't immediately > notice it. They share a common underlying assumption that the storage > device's update granularity is less than or equal to the filesystem's block > size, which is not actually true of all modern storage devices. The fact he's > only _found_ two instances where this assumption bites doesn't mean there > aren't more waiting to be found, especially as more new storage media types > get introduced. > > Pavel's response was to attempt to document this. Not that journaling is > _bad_, but that it doesn't protect against this class of problem. > > Your response is to talk about google clusters, cloud storage, and cite > academic papers of statistical hardware failure rates. As I understand the > discussion, that's not actually the issue Pavel's talking about, merely one > potential trigger for it. > > Rob > ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-27 20:51 ` Rob Landley 2009-08-27 22:00 ` Ric Wheeler @ 2009-08-28 14:49 ` david 2009-08-29 10:05 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: david @ 2009-08-28 14:49 UTC (permalink / raw) To: Rob Landley Cc: Ric Wheeler, Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thu, 27 Aug 2009, Rob Landley wrote: > Pavel's response was to attempt to document this. Not that journaling is > _bad_, but that it doesn't protect against this class of problem. I don't think anyone is disagreeing with the statement that journaling doesn't protect against this class of problems, but Pavel's statements didn't say that. he stated that ext3 is more dangerous than ext2. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-28 14:49 ` david @ 2009-08-29 10:05 ` Pavel Machek 2009-08-29 20:22 ` Rob Landley 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-29 10:05 UTC (permalink / raw) To: david Cc: Rob Landley, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Fri 2009-08-28 07:49:38, david@lang.hm wrote: > On Thu, 27 Aug 2009, Rob Landley wrote: > >> Pavel's response was to attempt to document this. Not that journaling is >> _bad_, but that it doesn't protect against this class of problem. > > I don't think anyone is disagreeing with the statement that journaling > doesn't protect against this class of problems, but Pavel's statements > didn't say that. he stated that ext3 is more dangerous than ext2. Well, if you use 'common' fsck policy, ext3 _is_ more dangerous. But I'm not pushing that to documentation, I'm trying to push info everyone agrees with. (check the patches). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-29 10:05 ` Pavel Machek @ 2009-08-29 20:22 ` Rob Landley 2009-08-29 21:34 ` Pavel Machek 2009-09-03 16:56 ` what fsck can (and can't) do was " david 0 siblings, 2 replies; 309+ messages in thread From: Rob Landley @ 2009-08-29 20:22 UTC (permalink / raw) To: Pavel Machek Cc: david, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Saturday 29 August 2009 05:05:58 Pavel Machek wrote: > On Fri 2009-08-28 07:49:38, david@lang.hm wrote: > > On Thu, 27 Aug 2009, Rob Landley wrote: > >> Pavel's response was to attempt to document this. Not that journaling > >> is _bad_, but that it doesn't protect against this class of problem. > > > > I don't think anyone is disagreeing with the statement that journaling > > doesn't protect against this class of problems, but Pavel's statements > > didn't say that. he stated that ext3 is more dangerous than ext2. > > Well, if you use 'common' fsck policy, ext3 _is_ more dangerous. The filesystem itself isn't more dangerous, but it may provide a false sense of security when used on storage devices it wasn't designed for. Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-29 20:22 ` Rob Landley @ 2009-08-29 21:34 ` Pavel Machek 2009-09-03 16:56 ` what fsck can (and can't) do was " david 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-29 21:34 UTC (permalink / raw) To: Rob Landley Cc: david, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sat 2009-08-29 15:22:06, Rob Landley wrote: > On Saturday 29 August 2009 05:05:58 Pavel Machek wrote: > > On Fri 2009-08-28 07:49:38, david@lang.hm wrote: > > > On Thu, 27 Aug 2009, Rob Landley wrote: > > >> Pavel's response was to attempt to document this. Not that journaling > > >> is _bad_, but that it doesn't protect against this class of problem. > > > > > > I don't think anyone is disagreeing with the statement that journaling > > > doesn't protect against this class of problems, but Pavel's statements > > > didn't say that. he stated that ext3 is more dangerous than ext2. > > > > Well, if you use 'common' fsck policy, ext3 _is_ more dangerous. > > The filesystem itself isn't more dangerous, but it may provide a false sense of > security when used on storage devices it wasn't designed for. Agreed. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* what fsck can (and can't) do was Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-29 20:22 ` Rob Landley 2009-08-29 21:34 ` Pavel Machek @ 2009-09-03 16:56 ` david 2009-09-03 19:27 ` Theodore Tso 1 sibling, 1 reply; 309+ messages in thread From: david @ 2009-09-03 16:56 UTC (permalink / raw) To: Rob Landley Cc: Pavel Machek, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sat, 29 Aug 2009, Rob Landley wrote: > On Saturday 29 August 2009 05:05:58 Pavel Machek wrote: >> On Fri 2009-08-28 07:49:38, david@lang.hm wrote: >>> On Thu, 27 Aug 2009, Rob Landley wrote: >>>> Pavel's response was to attempt to document this. Not that journaling >>>> is _bad_, but that it doesn't protect against this class of problem. >>> >>> I don't think anyone is disagreeing with the statement that journaling >>> doesn't protect against this class of problems, but Pavel's statements >>> didn't say that. he stated that ext3 is more dangerous than ext2. >> >> Well, if you use 'common' fsck policy, ext3 _is_ more dangerous. > > The filesystem itself isn't more dangerous, but it may provide a false sense of > security when used on storage devices it wasn't designed for. from this discussion (and the similar discussion on lwn.net) there appears to be confusion/disagreement over what fsck does and what the results of not running it are. it has been stated here that fsck cannot fix broken data, all it tries to do is to clean up metadata, but it would probably help to get a clear statement of what exactly that means.
I know that it:

* finds entries that don't actually have data and deletes them

* finds entries where multiple files share data blocks and duplicates the (bad for one file) data to separate them

* finds blocks that have been orphaned (allocated, but no directory pointer to them) and creates entries in lost+found

but if a fsck does not get run on a filesystem that has been damaged, what additional damage can be done?

can it overwrite data that could have been saved?

can it cause new files that are created (or new data written to existing, but uncorrupted files) to be lost?

or is it just a matter of not knowing about existing corruption?

David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: what fsck can (and can't) do was Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-09-03 16:56 ` what fsck can (and can't) do was " david @ 2009-09-03 19:27 ` Theodore Tso 0 siblings, 0 replies; 309+ messages in thread From: Theodore Tso @ 2009-09-03 19:27 UTC (permalink / raw) To: david Cc: Rob Landley, Pavel Machek, Ric Wheeler, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thu, Sep 03, 2009 at 09:56:48AM -0700, david@lang.hm wrote: > from this discussion (and the similar discussion on lwn.net) there appears > to be confusion/disagreement over what fsck does and what the results of > not running it are. > > it has been stated here that fsck cannot fix broken data, all it tries to > do is to clean up metadata, but it would probably help to get a clear > statement of what exactly that means. Let me give you my formulation of fsck which may be helpful. Fsck can not fix broken data; and (particularly in fsck -y mode) may not even recover the maximal amount of lost data caused by metadata corruption. (This is why sometimes an expert using debugfs can recover more data than fsck -y, and if you have some really precious data, like ten years' worth of Ph.D. research that you've never bothered to back up[1], the first thing you should do is buy a new hard drive and make a sector-by-sector copy of the disk and *then* run fsck. A new terabyte hard drive costs $100; how much is your data worth to you?) [1] This isn't hypothetical; while I was at MIT this sort of thing actually happened more than once --- which brings up the philosophical question of whether someone who is that stupid about not doing backups on critical data *deserves* to get a Ph.D. degree.
:-) Fsck's primary job is to make sure that further writes to the filesystem, whether you are creating new files or removing directory hierarchies, etc., will not cause *additional* data loss due to meta data corruption in the file system. Its secondary goals are to preserve as much data as possible, and to make sure that file system metadata is valid (i.e., so that a block pointer contains a valid block address, so that an attempt to read a file won't cause an I/O error when the filesystem attempts to seek to a non-existent sector on disk). For some filesystems, invalid, corrupt metadata can actually cause a system panic or oops message, so it's not necessarily safe to mount a filesystem with corrupt metadata read-only without risking the need to reboot the machine in question. More recently, there are folks who have been filing security bugs when they detect such cases, so there are fewer examples of such cases, but historically it was a good idea to run fsck because otherwise it's possible the kernel might oops or panic when it tripped over some particularly nasty metadata corruption. > but if a fsck does not get run on a filesystem that has been damaged, > what additional damage can be done? Consider the case where there are data blocks in use by inodes, containing precious data, but which are marked free in a filesystem's allocation data structures (e.g., ext3's block bitmaps, but this applies to pretty much any filesystem, whether it's xfs, reiserfs, btrfs, etc.). When you create a new file on that filesystem, there's a chance that blocks that really contain data belonging to other inodes (perhaps the aforementioned ten years' of unbacked-up Ph.D. thesis research) will get overwritten by the newly created file. Another example is an inode which has multiple hard links, but the hard link count is wrong by being too low.
Now when you delete one of the hard links, the inode will be released, and the inode and its data blocks returned to the free pool, despite the fact that it is still accessible via another directory entry in the filesystem, and despite the fact that the file contents should be saved. In the case where you have a block which is claimed by more than one file, if that file is rewritten in place, it's possible that the newly written file could have its data corrupted, so it's not just a matter of potential corruption to existing files; the newly created files are at risk as well. > can it overwrite data that could have been saved? > > can it cause new files that are created (or new data written to existing, > but uncorrupted files) to be lost? > > or is it just a matter of not knowing about existing corruption? So it's yes to all of the above; yes, you can overwrite existing data files; yes it can cause data blocks belonging to newly created files to be lost; and no you won't know about data loss caused by metadata corruption. (Again, you won't know about data loss caused by corruption to the data blocks.) - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
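[Editor's note: Ted's first scenario — a data block still in use by an inode but marked free in the allocation bitmap — can be sketched with a toy allocator. This is purely illustrative Python, not e2fsprogs or kernel code; the block numbers and contents are invented:]

```python
# Toy model of the metadata corruption Ted describes: block 3 holds live
# data, but the (corrupt) allocation bitmap claims it is free. The next
# file created on the filesystem reuses block 3 and silently destroys
# the old contents -- exactly the damage running fsck first would prevent.

blocks = {n: None for n in range(8)}   # block number -> contents
bitmap = [True] * 8                    # True = in use, False = free

blocks[3] = "ten years of Ph.D. research"
bitmap[3] = False                      # the corruption: a live block marked free

def alloc_block(data):
    """First-fit allocator that trusts the bitmap, as any filesystem must."""
    for n, used in enumerate(bitmap):
        if not used:
            bitmap[n] = True
            blocks[n] = data           # overwrites whatever was there
            return n
    raise OSError("no free blocks")

victim = alloc_block("freshly created file")
assert victim == 3                             # the new file landed on the live block
assert blocks[3] == "freshly created file"     # the old data is gone, silently
```

Rebuilding the bitmap from the inode tables before allowing writes is the "further writes will not cause *additional* data loss" guarantee in Ted's formulation of fsck's primary job.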
* raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-27 11:43 ` Ric Wheeler 2009-08-27 20:51 ` Rob Landley @ 2009-08-27 22:13 ` Pavel Machek 2009-08-28 1:32 ` Ric Wheeler 1 sibling, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-27 22:13 UTC (permalink / raw) To: Ric Wheeler Cc: Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet >>> Repeat experiment until you get up to something like google scale or the >>> other papers on failures in national labs in the US and then we can have an >>> informed discussion. >>> >> On google scale anvil lightning can fry your machine out of a clear sky. >> >> However, there are still a few non-enterprise users out there, and knowing >> that specific usage patterns don't behave like they expect might be useful to >> them. > > You are missing the broader point of both papers. They (and people like > me when back at EMC) look at large numbers of machines and try to fix > what actually breaks when run in the real world and causes data loss. > The motherboards, S-ATA controllers, disk types are the same class of > parts that I have in my desktop box today. ... > These errors happen extremely commonly and are what RAID deals with well. > > What does not happen commonly is that during the RAID rebuild (kicked > off only after a drive is kicked out), you push the power button or have > a second failure (power outage). > > We will have more users loose data if they decide to use ext2 instead of > ext3 and use only single disk storage. So your argument basically is 'our abs brakes are broken, but lets not tell anyone; our car is still safer than a horse'. and 'while we know our abs brakes are broken, they are not major factor in accidents, so lets not tell anyone'. Sorry, but I'd expect slightly higher moral standards. 
If we can document it in a way that's non-scary, and does not push people to single disks (horses), please go ahead; but you have to mention that md raid breaks journalling assumptions (our abs brakes really are broken). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
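[Editor's note: the journalling assumption Pavel says md raid breaks is the classic RAID-5 "write hole". A minimal XOR sketch — a toy three-disk, one-stripe model, not MD code — shows how a non-atomic data+parity update plus a later disk failure corrupts a chunk that was never being written:]

```python
# Toy RAID-5 stripe: two data chunks and one parity chunk, P = D0 ^ D1.
# Power fails between the data write and the parity write; the stripe is
# now internally inconsistent. Once the array degrades, reconstruction
# from stale parity corrupts an entirely unrelated chunk.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)            # consistent stripe

# Non-atomic update: the new D0 reaches its disk, then power fails
# before the matching parity write completes.
d0 = b"CCCC"                    # parity is now stale

# Healthy array: D1 is read directly from its disk, so nothing looks wrong.
assert d1 == b"BBBB"

# Degraded array (the disk holding D1 has failed): D1 must be rebuilt
# from D0 and parity -- and comes back as garbage.
rebuilt_d1 = xor(d0, parity)
assert rebuilt_d1 != b"BBBB"    # collateral damage to a chunk nobody wrote
```

This is also why the journal does not help here: the filesystem never issued a write to D1, so no journal replay will notice or repair it.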
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-27 22:13 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek @ 2009-08-28 1:32 ` Ric Wheeler 2009-08-28 6:44 ` Pavel Machek ` (2 more replies) 0 siblings, 3 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-28 1:32 UTC (permalink / raw) To: Pavel Machek Cc: Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/27/2009 06:13 PM, Pavel Machek wrote: > >>>> Repeat experiment until you get up to something like google scale or the >>>> other papers on failures in national labs in the US and then we can have an >>>> informed discussion. >>>> >>> On google scale anvil lightning can fry your machine out of a clear sky. >>> >>> However, there are still a few non-enterprise users out there, and knowing >>> that specific usage patterns don't behave like they expect might be useful to >>> them. >> >> You are missing the broader point of both papers. They (and people like >> me when back at EMC) look at large numbers of machines and try to fix >> what actually breaks when run in the real world and causes data loss. >> The motherboards, S-ATA controllers, disk types are the same class of >> parts that I have in my desktop box today. > ... >> These errors happen extremely commonly and are what RAID deals with well. >> >> What does not happen commonly is that during the RAID rebuild (kicked >> off only after a drive is kicked out), you push the power button or have >> a second failure (power outage). >> >> We will have more users loose data if they decide to use ext2 instead of >> ext3 and use only single disk storage. > > So your argument basically is > > 'our abs brakes are broken, but lets not tell anyone; our car is still > safer than a horse'. 
> > and > > 'while we know our abs brakes are broken, they are not major factor in > accidents, so lets not tell anyone'. > > Sorry, but I'd expect slightly higher moral standards. If we can > document it in a way that's non-scary, and does not push people to > single disks (horses), please go ahead; but you have to mention that > md raid breaks journalling assumptions (our abs brakes really are > broken). > Pavel > You continue to ignore the technical facts that everyone (both MD and ext3) people put in front of you. If you have a specific bug in MD code, please propose a patch. Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-28 1:32 ` Ric Wheeler @ 2009-08-28 6:44 ` Pavel Machek 2009-08-28 7:31 ` NeilBrown 2009-08-28 11:16 ` Ric Wheeler 2009-08-28 7:11 ` raid is dangerous but that's secret Florian Weimer 2009-08-28 12:08 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Theodore Tso 2 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-28 6:44 UTC (permalink / raw) To: Ric Wheeler Cc: Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thu 2009-08-27 21:32:49, Ric Wheeler wrote: > On 08/27/2009 06:13 PM, Pavel Machek wrote: >> >>>>> Repeat experiment until you get up to something like google scale or the >>>>> other papers on failures in national labs in the US and then we can have an >>>>> informed discussion. >>>>> >>>> On google scale anvil lightning can fry your machine out of a clear sky. >>>> >>>> However, there are still a few non-enterprise users out there, and knowing >>>> that specific usage patterns don't behave like they expect might be useful to >>>> them. >>> >>> You are missing the broader point of both papers. They (and people like >>> me when back at EMC) look at large numbers of machines and try to fix >>> what actually breaks when run in the real world and causes data loss. >>> The motherboards, S-ATA controllers, disk types are the same class of >>> parts that I have in my desktop box today. >> ... >>> These errors happen extremely commonly and are what RAID deals with well. >>> >>> What does not happen commonly is that during the RAID rebuild (kicked >>> off only after a drive is kicked out), you push the power button or have >>> a second failure (power outage). 
>>> >>> We will have more users loose data if they decide to use ext2 instead of >>> ext3 and use only single disk storage. >> >> So your argument basically is >> >> 'our abs brakes are broken, but lets not tell anyone; our car is still >> safer than a horse'. >> >> and >> >> 'while we know our abs brakes are broken, they are not major factor in >> accidents, so lets not tell anyone'. >> >> Sorry, but I'd expect slightly higher moral standards. If we can >> document it in a way that's non-scary, and does not push people to >> single disks (horses), please go ahead; but you have to mention that >> md raid breaks journalling assumptions (our abs brakes really are >> broken). > > You continue to ignore the technical facts that everyone (both MD and > ext3) people put in front of you. > > If you have a specific bug in MD code, please propose a patch. Interesting. So, what's technically wrong with the patch below? Pavel --- From: Theodore Tso <tytso@mit.edu> Document that many devices are too broken for filesystems to protect data in case of powerfail. Signed-of-by: Pavel Machek <pavel@ucw.cz> diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt new file mode 100644 index 0000000..2f3eec1 --- /dev/null +++ b/Documentation/filesystems/dangers.txt @@ -0,0 +1,21 @@ +There are storage devices that high highly undesirable properties when +they are disconnected or suffer power failures while writes are in +progress; such devices include flash devices and DM/MD RAID 4/5/6 (*) +arrays. These devices have the property of potentially corrupting +blocks being written at the time of the power failure, and worse yet, +amplifying the region where blocks are corrupted such that additional +sectors are also damaged during the power failure. 
+ +Users who use such storage devices are well advised take +countermeasures, such as the use of Uninterruptible Power Supplies, +and making sure the flash device is not hot-unplugged while the device +is being used. Regular backups when using these devices is also a +Very Good Idea. + +Otherwise, file systems placed on these devices can suffer silent data +and file system corruption. An forced use of fsck may detect metadata +corruption resulting in file system corruption, but will not suffice +to detect data corruption. + +(*) Degraded array or single disk failure "near" the powerfail is +neccessary for this property of RAID arrays to bite. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 309+ messages in thread
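[Editor's note: the "amplifying the region" claim in the patch is easiest to see with numbers. A sketch under assumed flash geometry — 512-byte sectors and 128 KiB erase blocks; real parts vary, as the thread's opening notes (64k, 128k, "even a couple megabytes"):]

```python
# Toy flash geometry: the erase unit is much larger than a filesystem
# sector, so rewriting one sector can mean erasing and reprogramming the
# whole erase block around it. A power cut mid-operation can therefore
# trash hundreds of sectors the filesystem never asked to touch.

SECTOR = 512
ERASE_BLOCK = 128 * 1024                 # assumed; varies by device
SECTORS_PER_EB = ERASE_BLOCK // SECTOR   # 256 sectors share one erase block

def sectors_at_risk(sector):
    """All sectors sharing `sector`'s erase block during a rewrite."""
    first = (sector // SECTORS_PER_EB) * SECTORS_PER_EB
    return range(first, first + SECTORS_PER_EB)

# A single-sector write to sector 1000 endangers sectors 768..1023:
# 255 innocent neighbours whose corruption no journal will notice.
risk = sectors_at_risk(1000)
assert 1000 in risk
assert len(risk) == 256
assert (min(risk), max(risk)) == (768, 1023)
```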
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-28 6:44 ` Pavel Machek @ 2009-08-28 7:31 ` NeilBrown 2009-08-28 11:16 ` Ric Wheeler 1 sibling, 0 replies; 309+ messages in thread From: NeilBrown @ 2009-08-28 7:31 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Fri, August 28, 2009 4:44 pm, Pavel Machek wrote: > On Thu 2009-08-27 21:32:49, Ric Wheeler wrote: >>> >> If you have a specific bug in MD code, please propose a patch. > > Interesting. So, what's technically wrong with the patch below? > You mean apart from ".... that high highly undesirable ...." ?? ^^^^^^^^^^^ And the phrase "Regular backups when using these devices ...." should be "Regular backups when using any devices .....". ^^^ If you have a device failure near a power fail on a raid5 you might lose some blocks of data. If you have a device failure near (or not near) a power failure on raid0 or jbod etc you will certainly lose lots of blocks of data. I think it would be better to say: ".... and degraded DM/MD RAID 4/5/6(*) arrays..." ^^^^^^^^ with (*) If device failure causes the array to become degraded during or immediately after the power failure, the same problem can result. And "necessary" only have the one 'c' :-) NeilBrown > Pavel > --- > > From: Theodore Tso <tytso@mit.edu> > > Document that many devices are too broken for filesystems to protect > data in case of powerfail. 
> > Signed-of-by: Pavel Machek <pavel@ucw.cz> > > diff --git a/Documentation/filesystems/dangers.txt > b/Documentation/filesystems/dangers.txt > new file mode 100644 > index 0000000..2f3eec1 > --- /dev/null > +++ b/Documentation/filesystems/dangers.txt > @@ -0,0 +1,21 @@ > +There are storage devices that high highly undesirable properties when > +they are disconnected or suffer power failures while writes are in > +progress; such devices include flash devices and DM/MD RAID 4/5/6 (*) > +arrays. These devices have the property of potentially corrupting > +blocks being written at the time of the power failure, and worse yet, > +amplifying the region where blocks are corrupted such that additional > +sectors are also damaged during the power failure. > + > +Users who use such storage devices are well advised take > +countermeasures, such as the use of Uninterruptible Power Supplies, > +and making sure the flash device is not hot-unplugged while the device > +is being used. Regular backups when using these devices is also a > +Very Good Idea. > + > +Otherwise, file systems placed on these devices can suffer silent data > +and file system corruption. An forced use of fsck may detect metadata > +corruption resulting in file system corruption, but will not suffice > +to detect data corruption. > + > +(*) Degraded array or single disk failure "near" the powerfail is > +neccessary for this property of RAID arrays to bite. > > ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-28 7:31 ` NeilBrown @ 2009-11-09 10:50 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-11-09 10:50 UTC (permalink / raw) To: NeilBrown Cc: Ric Wheeler, Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! > >> If you have a specific bug in MD code, please propose a patch. > > > > Interesting. So, what's technically wrong with the patch below? > > > > You mean apart from ".... that high highly undesirable ...." ?? > ^^^^^^^^^^^ > Ok, I still believe kernel documentation should be ... well... in kernel, not in LWN article, so I fixed the patch according to your comments. Signed-off-by: Pavel Machek <pavel@ucw.cz> diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt new file mode 100644 index 0000000..14d0324 --- /dev/null +++ b/Documentation/filesystems/dangers.txt @@ -0,0 +1,21 @@ +There are storage devices that have highly undesirable properties when +they are disconnected or suffer power failures while writes are in +progress; such devices include flash devices and degraded DM/MD RAID +4/5/6 (*) arrays. These devices have the property of potentially +corrupting blocks being written at the time of the power failure, and +worse yet, amplifying the region where blocks are corrupted such that +additional sectors are also damaged during the power failure. + +Users who use such storage devices are well advised take +countermeasures, such as the use of Uninterruptible Power Supplies, +and making sure the flash device is not hot-unplugged while the device +is being used. Regular backups when using any devices, and these +devices in particular is also a Very Good Idea. 
+ +Otherwise, file systems placed on these devices can suffer silent data +and file system corruption. An forced use of fsck may detect metadata +corruption resulting in file system corruption, but will not suffice +to detect data corruption. + +(*) If device failure causes the array to become degraded during or +immediately after the power failure, the same problem can result. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-28 6:44 ` Pavel Machek 2009-08-28 7:31 ` NeilBrown @ 2009-08-28 11:16 ` Ric Wheeler 2009-09-01 13:58 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-28 11:16 UTC (permalink / raw) To: Pavel Machek Cc: Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/28/2009 02:44 AM, Pavel Machek wrote: > On Thu 2009-08-27 21:32:49, Ric Wheeler wrote: >> On 08/27/2009 06:13 PM, Pavel Machek wrote: >>> >>>>>> Repeat experiment until you get up to something like google scale or the >>>>>> other papers on failures in national labs in the US and then we can have an >>>>>> informed discussion. >>>>>> >>>>> On google scale anvil lightning can fry your machine out of a clear sky. >>>>> >>>>> However, there are still a few non-enterprise users out there, and knowing >>>>> that specific usage patterns don't behave like they expect might be useful to >>>>> them. >>>> >>>> You are missing the broader point of both papers. They (and people like >>>> me when back at EMC) look at large numbers of machines and try to fix >>>> what actually breaks when run in the real world and causes data loss. >>>> The motherboards, S-ATA controllers, disk types are the same class of >>>> parts that I have in my desktop box today. >>> ... >>>> These errors happen extremely commonly and are what RAID deals with well. >>>> >>>> What does not happen commonly is that during the RAID rebuild (kicked >>>> off only after a drive is kicked out), you push the power button or have >>>> a second failure (power outage). >>>> >>>> We will have more users loose data if they decide to use ext2 instead of >>>> ext3 and use only single disk storage. 
>>> >>> So your argument basically is >>> >>> 'our abs brakes are broken, but let's not tell anyone; our car is still >>> safer than a horse'. >>> >>> and >>> >>> 'while we know our abs brakes are broken, they are not a major factor in >>> accidents, so let's not tell anyone'. >>> >>> Sorry, but I'd expect slightly higher moral standards. If we can >>> document it in a way that's non-scary, and does not push people to >>> single disks (horses), please go ahead; but you have to mention that >>> md raid breaks journalling assumptions (our abs brakes really are >>> broken). >> >> You continue to ignore the technical facts that everyone (both MD and >> ext3 people) put in front of you. >> >> If you have a specific bug in MD code, please propose a patch. > > Interesting. So, what's technically wrong with the patch below? > > Pavel My suggestion was that you stop trying to document your assertion of an issue and actually suggest fixes in code or implementation. I really don't think that you have properly diagnosed your specific failure or done sufficient analysis. However, if you put a full analysis and suggested code out to the MD devel lists, we can debate technical implementation as we normally do. As Ted quite clearly stated, documentation on how RAID works, how to configure it, etc, is best put in RAID documentation. What you claim as a key issue is an issue for all file systems (including ext2). The only note that I would put in ext3/4 etc documentation would be: "Reliable storage is important for any file system. Single disks (or FLASH or SSD) do fail on a regular basis. To reduce your risk of data loss, it is advisable to use RAID, which can overcome these common issues. If using MD software RAID, see the RAID documentation on how best to configure your storage. With or without RAID, it is always important to back up your data to an external device and keep copies of that backup off site."
ric > --- > > From: Theodore Tso<tytso@mit.edu> > > Document that many devices are too broken for filesystems to protect > data in case of powerfail. > > Signed-off-by: Pavel Machek<pavel@ucw.cz> > > diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt > new file mode 100644 > index 0000000..2f3eec1 > --- /dev/null > +++ b/Documentation/filesystems/dangers.txt > @@ -0,0 +1,21 @@ > +There are storage devices that have highly undesirable properties when > +they are disconnected or suffer power failures while writes are in > +progress; such devices include flash devices and DM/MD RAID 4/5/6 (*) > +arrays. These devices have the property of potentially corrupting > +blocks being written at the time of the power failure, and worse yet, > +amplifying the region where blocks are corrupted such that additional > +sectors are also damaged during the power failure. > + > +Users who use such storage devices are well advised to take > +countermeasures, such as the use of Uninterruptible Power Supplies, > +and making sure the flash device is not hot-unplugged while the device > +is being used. Regular backups when using these devices are also a > +Very Good Idea. > + > +Otherwise, file systems placed on these devices can suffer silent data > +and file system corruption. A forced use of fsck may detect metadata > +corruption, but will not suffice > +to detect data corruption. > + > +(*) Degraded array or single disk failure "near" the powerfail is > +necessary for this property of RAID arrays to bite. > > ^ permalink raw reply [flat|nested] 309+ messages in thread
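[Editorial note: the RAID 4/5/6 corruption amplification the quoted patch describes (often called the "write hole") can be sketched with a toy XOR-parity model. This is purely illustrative Python, not how md actually lays out or rebuilds stripes:]

```python
from functools import reduce

def parity(blocks):
    """XOR parity across equal-length byte blocks (RAID-5 style)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# One RAID-5 stripe: three data blocks plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
par = parity(data)

# Power fails mid-update: the new data block reaches disk 0, but the
# matching parity update never makes it out -- the stripe is now torn.
data[0] = b"XXXX"          # new data hit disk 0
stale_par = par            # parity still describes the old stripe

# Later, disk 2 dies and the array runs degraded: block 2 must be
# reconstructed from the surviving blocks plus the (stale) parity.
rebuilt_block2 = parity([data[0], data[1], stale_par])

# The reconstructed block is garbage, even though block 2 was never
# written -- the damage "amplified" into an unrelated sector.
print(rebuilt_block2 == b"CCCC")  # False
```

With an up-to-date parity block, the same reconstruction would return b"CCCC" exactly; the loss only bites when a torn stripe is combined with a degraded array or a disk failure "near" the powerfail, which is what the patch's (*) footnote is saying.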
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-28 11:16 ` Ric Wheeler @ 2009-09-01 13:58 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-09-01 13:58 UTC (permalink / raw) To: Ric Wheeler Cc: Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet >> Interesting. So, what's technically wrong with the patch below? > > My suggestion was that you stop trying to document your assertion of an > issue and actually suggest fixes in code or implementation. I really > don't think that you have properly diagnosed your specific failure or > done sufficient analysis. However, if you put a full analysis and suggested code > out to the MD devel lists, we can debate technical implementation as we > normally do. I don't think I should be required to rewrite the Linux md layer in order to fix documentation. > The only note that I would put in ext3/4 etc documentation would be: > > "Reliable storage is important for any file system. Single disks (or > FLASH or SSD) do fail on a regular basis. Uh, how clever: instead of documenting that our md raid code does not always work as expected, you document that components fail. Newspeak 101? You even failed to mention the little design problem with flash and eraseblock size... and the fact that you don't need flash to fail to get data loss. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret 2009-08-28 1:32 ` Ric Wheeler 2009-08-28 6:44 ` Pavel Machek @ 2009-08-28 7:11 ` Florian Weimer 2009-08-28 7:23 ` NeilBrown 2009-08-28 12:08 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Theodore Tso 2 siblings, 1 reply; 309+ messages in thread From: Florian Weimer @ 2009-08-28 7:11 UTC (permalink / raw) To: Ric Wheeler Cc: Pavel Machek, Rob Landley, Theodore Tso, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet * Ric Wheeler: > You continue to ignore the technical facts that everyone (both MD and > ext3) people put in front of you. > > If you have a specific bug in MD code, please propose a patch. In RAID 1 mode, it should read both copies and error out on mismatch. 8-) -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret 2009-08-28 7:11 ` raid is dangerous but that's secret Florian Weimer @ 2009-08-28 7:23 ` NeilBrown 0 siblings, 0 replies; 309+ messages in thread From: NeilBrown @ 2009-08-28 7:23 UTC (permalink / raw) To: Florian Weimer Cc: Ric Wheeler, Pavel Machek, Rob Landley, Theodore Tso, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Fri, August 28, 2009 5:11 pm, Florian Weimer wrote: > * Ric Wheeler: > >> You continue to ignore the technical facts that everyone (both MD and >> ext3) people put in front of you. >> >> If you have a specific bug in MD code, please propose a patch. > > In RAID 1 mode, it should read both copies and error out on > mismatch. 8-) Despite your smiley: no it shouldn't, and no one is making any claims about raid1 being unsafe, only raid4/5/6. NeilBrown ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-28 1:32 ` Ric Wheeler 2009-08-28 6:44 ` Pavel Machek 2009-08-28 7:11 ` raid is dangerous but that's secret Florian Weimer @ 2009-08-28 12:08 ` Theodore Tso 2009-08-30 7:51 ` Pavel Machek 2009-08-30 7:51 ` Pavel Machek 2 siblings, 2 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-28 12:08 UTC (permalink / raw) To: Pavel Machek, NeilBrown Cc: Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Fri, Aug 28, 2009 at 08:44:49AM +0200, Pavel Machek wrote: > From: Theodore Tso <tytso@mit.edu> > > Document that many devices are too broken for filesystems to protect > data in case of powerfail. > > Signed-off-by: Pavel Machek <pavel@ucw.cz> NACK. I didn't write this patch, and it's disingenuous for you to try to claim that I authored it. You took text I wrote from the *middle* of an e-mail discussion and you ignored multiple corrections to typos that I made --- typos that I would have corrected if I had ultimately decided to post this as a patch, which I did NOT. While Neil Brown's corrections are minimally necessary so the text is at least technically *correct*, it's still not the right advice to give system administrators. It's better than the fear-mongering patches you had proposed earlier, but what would be better *still* is telling people why running with degraded RAID arrays is bad, and to give them further tips about how to use RAID arrays safely. To use your ABS brakes analogy, just because it's not safe to rely on ABS brakes if the "check brakes" light is on, that doesn't justify writing something alarmist which claims that ABS brakes don't work 100% of the time, don't use ABS brakes, they're broken!!!! The first part of it is true, since ABS brakes can suffer mechanical failure.
But what we should be telling drivers is, "if the 'check brakes' light comes on, don't keep driving with it, go to a garage and get it fixed!!!". Similarly, if you get a notice that your RAID is running in degraded mode, you've already suffered one failure; you won't survive another failure, so fix that issue ASAP! If you're really paranoid, you could decide to "pull over to the side of the road"; that is, you could stop writing to the RAID array as soon as possible, and then get the RAID array rebuilt before proceeding. That can reduce the chances of a second failure. But in the real world, there are costs associated with taking a production server off-line, and the prudent system administrator has to do a risk-reward tradeoff. A better approach might be to have the array configured with a hot spare, to regularly scrub the array, and to configure the RAID array with either a battery backup or a UPS. And hot-swap drives might not be a bad idea, too. But in any case, just because ABS brakes and RAID arrays can suffer failures, that doesn't mean you should run around telling people not to use RAID arrays or that RAID arrays are broken. People are better off using RAID than single-disk storage solutions, just as people are better off using ABS brakes than not. Your argument basically boils down to, "if you drive like a maniac when the roads are wet and slippery, ABS brakes might not save your life. Since ABS brakes might cause you to have a false sense of security, it's better to tell users that ABS brakes are broken." That's just silly. What we should be telling people instead is (a) pay attention to the check brakes light (just as you should pay attention to the RAID array is degraded warning), and (b) while ABS brakes will get you out of some situations with life and limb intact, they do not repeal the laws of physics (do regular full and incremental backups; practice disk scrubbing; use UPS's or battery backups).
- Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-28 12:08 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Theodore Tso @ 2009-08-30 7:51 ` Pavel Machek 2009-08-30 9:01 ` Christian Kujau ` (2 more replies) 2009-08-30 7:51 ` Pavel Machek 1 sibling, 3 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-30 7:51 UTC (permalink / raw) To: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! > > From: Theodore Tso <tytso@mit.edu> > > > > Document that many devices are too broken for filesystems to protect > > data in case of powerfail. > > > > Signed-off-by: Pavel Machek <pavel@ucw.cz> > > NACK. I didn't write this patch, and it's disingenuous for you to try > to claim that I authored it. Well, you did write the original text, so I wanted to give you credit. Sorry. > While Neil Brown's corrections are minimally necessary so the text is > at least technically *correct*, it's still not the right advice to > give system administrators. It's better than the fear-mongering > patches you had proposed earlier, but what would be better *still* is > telling people why running with degraded RAID arrays is bad, and to > give them further tips about how to use RAID arrays safely. Maybe this belongs in Doc*/filesystems, and a more detailed RAID description should go in the md description? > To use your ABS brakes analogy, just because it's not safe to rely on > ABS brakes if the "check brakes" light is on, that doesn't justify > writing something alarmist which claims that ABS brakes don't work > 100% of the time, don't use ABS brakes, they're broken!!!! If only it were this simple. We don't have a 'check brakes' (aka 'journalling ineffective') warning light. If we had that, I would not have a problem.
It is rather that your ABS brakes are ineffective if 'check engine' (RAID degraded) is lit. And yes, running with 'check engine' for extended periods may be a bad idea, but I know people that do that... and I still hope their brakes work (and believe they should have won a suit for damages should their ABS brakes fail). > That's just silly. What we should be telling people instead is (a) > pay attention to the check brakes light (just as you should pay > attention to the RAID array is degraded warning), and (b) while ABS 'your RAID array is degraded' is a very counterintuitive way to say '...and btw your journalling is no longer effective, either'. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 7:51 ` Pavel Machek @ 2009-08-30 9:01 ` Christian Kujau 2009-09-02 20:55 ` Pavel Machek 2009-08-30 12:55 ` david 2009-08-30 15:20 ` Theodore Tso 2 siblings, 1 reply; 309+ messages in thread From: Christian Kujau @ 2009-08-30 9:01 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sun, 30 Aug 2009 at 09:51, Pavel Machek wrote: > > give system administrators. It's better than the fear-mongering > > patches you had proposed earlier, but what would be better *still* is > > telling people why running with degraded RAID arrays is bad, and to > > give them further tips about how to use RAID arrays safely. > > Maybe this belongs to Doc*/filesystems, and more detailed RAID > description should go to md description? Why should this be placed in *kernel* documentation anyway? The "dangers of RAID", the hints that "backups are a good idea" - isn't that something for howtos for sysadmins? No end-user will ever look into Documentation/ anyway. The sysadmins should know what they're doing and see the upsides and downsides of RAID and journalling filesystems. And they'll turn to howtos and tutorials to find out. And maybe seek *reference* documentation in Documentation/ - but I don't think Storage-101 should be covered in a mostly hidden place like Documentation/. Christian. -- BOFH excuse #212: Of course it doesn't work. We've performed a software upgrade. ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 9:01 ` Christian Kujau @ 2009-09-02 20:55 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-09-02 20:55 UTC (permalink / raw) To: Christian Kujau Cc: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sun 2009-08-30 02:01:10, Christian Kujau wrote: > On Sun, 30 Aug 2009 at 09:51, Pavel Machek wrote: > > > give system administrators. It's better than the fear-mongering > > > patches you had proposed earlier, but what would be better *still* is > > > telling people why running with degraded RAID arrays is bad, and to > > > give them further tips about how to use RAID arrays safely. > > > > Maybe this belongs to Doc*/filesystems, and more detailed RAID > > description should go to md description? > > Why should this be placed in *kernel* documentation anyway? The "dangers > of RAID", the hints that "backups are a good idea" - isn't that something > for howtos for sysadmins? No end-user will ever look into The fact that two kernel subsystems (MD RAID, journaling filesystems) do not work well together is surprising and should be documented near the source. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 7:51 ` Pavel Machek 2009-08-30 9:01 ` Christian Kujau @ 2009-08-30 12:55 ` david 2009-08-30 14:12 ` Ric Wheeler 2009-08-30 15:05 ` Pavel Machek 2009-08-30 15:20 ` Theodore Tso 2 siblings, 2 replies; 309+ messages in thread From: david @ 2009-08-30 12:55 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sun, 30 Aug 2009, Pavel Machek wrote: >>> From: Theodore Tso <tytso@mit.edu> >>> >> To use your ABS brakes analogy, just becase it's not safe to rely on >> ABS brakes if the "check brakes" light is on, that doesn't justify >> writing something alarmist which claims that ABS brakes don't work >> 100% of the time, don't use ABS brakes, they're broken!!!! > > If it only was this simple. We don't have 'check brakes' (aka > 'journalling ineffective') warning light. If we had that, I would not > have problem. > > It is rather that your ABS brakes are ineffective if 'check engine' > (RAID degraded) is lit. And yes, running with 'check engine' for > extended periods may be bad idea, but I know people that do > that... and I still hope their brakes work (and believe they should > have won suit for damages should their ABS brakes fail). the 'RAID degraded' warning says that _anything_ you put on that block device is at risk. it doesn't matter if you are using a filesystem with a journal, one without, or using the raw device directly. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 12:55 ` david @ 2009-08-30 14:12 ` Ric Wheeler 2009-08-30 14:44 ` Michael Tokarev 2009-08-30 15:05 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-30 14:12 UTC (permalink / raw) To: david Cc: Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/30/2009 08:55 AM, david@lang.hm wrote: > On Sun, 30 Aug 2009, Pavel Machek wrote: > >>>> From: Theodore Tso <tytso@mit.edu> >>>> >>> To use your ABS brakes analogy, just becase it's not safe to rely on >>> ABS brakes if the "check brakes" light is on, that doesn't justify >>> writing something alarmist which claims that ABS brakes don't work >>> 100% of the time, don't use ABS brakes, they're broken!!!! >> >> If it only was this simple. We don't have 'check brakes' (aka >> 'journalling ineffective') warning light. If we had that, I would not >> have problem. >> >> It is rather that your ABS brakes are ineffective if 'check engine' >> (RAID degraded) is lit. And yes, running with 'check engine' for >> extended periods may be bad idea, but I know people that do >> that... and I still hope their brakes work (and believe they should >> have won suit for damages should their ABS brakes fail). > > the 'RAID degraded' warning says that _anything_ you put on that block > device is at risk. it doesn't matter if you are using a filesystem > with a journal, one without, or using the raw device directly. > > David Lang The easiest way to lose your data in Linux - with RAID, without RAID, S-ATA or SAS - is to run with the write cache enabled. If you compare the size of even a large RAID stripe it will be measured in KB and as this thread has mentioned already, you stand to have damage to just one stripe (or even just a disk sector or two). 
If you lose power with the write caches enabled on that same 5 drive RAID set, you could lose as much as 5 * 32MB of freshly written data on a power loss (16-32MB write caches are common on s-ata disks these days). For MD5 (and MD6), you really must run with the write cache disabled until we get barriers to work for those configurations. It would be interesting for Pavel to retest with the write cache enabled/disabled on his power loss scenarios with multi-drive RAID. Regards, Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
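[Editorial note: the exposure Ric describes can be put in rough numbers. The figures below are the ones assumed in this thread (5 drives, 32 MB volatile caches, a 64 KB md chunk size), not measurements:]

```python
# Assumed figures from the thread: 5 drives with 32 MiB of volatile
# write cache each, and a 64 KiB chunk size (a common md default).
drives = 5
cache_per_disk = 32 * 1024 * 1024   # bytes of volatile cache per disk
chunk = 64 * 1024                   # bytes per data chunk in one stripe

cache_at_risk = drives * cache_per_disk   # cached writes lost on powerfail
torn_stripe = (drives - 1) * chunk        # data blocks in one RAID-5 stripe

print(cache_at_risk // 2**20)  # 160 (MiB at risk with write caches on)
print(torn_stripe // 2**10)    # 256 (KiB at risk from one torn stripe)
```

Under these assumptions the enabled write caches expose roughly 640 times more freshly written data than a single torn stripe does, which is the comparison being made above.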
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 14:12 ` Ric Wheeler @ 2009-08-30 14:44 ` Michael Tokarev 2009-08-30 16:10 ` Ric Wheeler 2009-08-30 16:35 ` Christoph Hellwig 0 siblings, 2 replies; 309+ messages in thread From: Michael Tokarev @ 2009-08-30 14:44 UTC (permalink / raw) To: Ric Wheeler Cc: david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Ric Wheeler wrote: [] > The easiest way to lose your data in Linux - with RAID, without RAID, > S-ATA or SAS - is to run with the write cache enabled. > > If you compare the size of even a large RAID stripe it will be measured > in KB and as this thread has mentioned already, you stand to have damage > to just one stripe (or even just a disk sector or two). > > If you lose power with the write caches enabled on that same 5 drive > RAID set, you could lose as much as 5 * 32MB of freshly written data on > a power loss (16-32MB write caches are common on s-ata disks these days). This is fundamentally wrong. Many filesystems today use either barriers or flushes (if barriers are not supported), and the times when disk drives were lying to the OS that the cache got flushed are long gone. > For MD5 (and MD6), you really must run with the write cache disabled > until we get barriers to work for those configurations. I highly doubt barriers will ever be supported on anything but simple raid1, because it's impossible to guarantee ordering across multiple drives. Well, it *is* possible to have write barriers with journalled (and/or with battery-backed-cache) raid[456]. Note that even if raid[456] does not support barriers, write cache flushes still works. /mjt ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 14:44 ` Michael Tokarev @ 2009-08-30 16:10 ` Ric Wheeler 2009-08-30 16:35 ` Christoph Hellwig 1 sibling, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-30 16:10 UTC (permalink / raw) To: Michael Tokarev Cc: david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/30/2009 10:44 AM, Michael Tokarev wrote: > Ric Wheeler wrote: > [] >> The easiest way to lose your data in Linux - with RAID, without RAID, >> S-ATA or SAS - is to run with the write cache enabled. >> >> If you compare the size of even a large RAID stripe it will be >> measured in KB and as this thread has mentioned already, you stand to >> have damage to just one stripe (or even just a disk sector or two). >> >> If you lose power with the write caches enabled on that same 5 drive >> RAID set, you could lose as much as 5 * 32MB of freshly written data >> on a power loss (16-32MB write caches are common on s-ata disks >> these days). > > This is fundamentally wrong. Many filesystems today use either barriers > or flushes (if barriers are not supported), and the times when disk > drives > were lying to the OS that the cache got flushed are long gone. Unfortunately not - if you mount a file system with write cache enabled and see "barriers disabled" messages in /var/log/messages, this is exactly what happens. File systems issue write barrier operations that in turn do cache flushes (ATA_FLUSH_EXT) commands or its SCSI equivalent. MD5 and MD6 do not pass these operations on currently and there is no other file system level mechanism that somehow bypasses the IO stack to invalidate or flush the cache. Note that some devices have non-volatile write caches (specifically arrays or battery backed RAID cards) where this is not an issue. 
> >> For MD5 (and MD6), you really must run with the write cache disabled >> until we get barriers to work for those configurations. > > I highly doubt barriers will ever be supported on anything but simple > raid1, because it's impossible to guarantee ordering across multiple > drives. Well, it *is* possible to have write barriers with journalled > (and/or with battery-backed-cache) raid[456]. > > Note that even if raid[456] does not support barriers, write cache > flushes still works. > > /mjt I think that you are confused - barriers are implemented using cache flushes. Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 14:44 ` Michael Tokarev 2009-08-30 16:10 ` Ric Wheeler @ 2009-08-30 16:35 ` Christoph Hellwig 2009-08-31 13:15 ` Ric Wheeler 1 sibling, 1 reply; 309+ messages in thread From: Christoph Hellwig @ 2009-08-30 16:35 UTC (permalink / raw) To: Michael Tokarev Cc: Ric Wheeler, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sun, Aug 30, 2009 at 06:44:04PM +0400, Michael Tokarev wrote: >> If you lose power with the write caches enabled on that same 5 drive >> RAID set, you could lose as much as 5 * 32MB of freshly written data on >> a power loss (16-32MB write caches are common on s-ata disks these >> days). > > This is fundamentally wrong. Many filesystems today use either barriers > or flushes (if barriers are not supported), and the times when disk drives > were lying to the OS that the cache got flushed are long gone. While most common filesystems do have barrier support, it is: - not actually enabled for the two most common filesystems - the support for write barriers and cache flushing tends to be buggy all over our software stack. >> For MD5 (and MD6), you really must run with the write cache disabled >> until we get barriers to work for those configurations. > > I highly doubt barriers will ever be supported on anything but simple > raid1, because it's impossible to guarantee ordering across multiple > drives. Well, it *is* possible to have write barriers with journalled > (and/or with battery-backed-cache) raid[456]. > > Note that even if raid[456] does not support barriers, write cache > flushes still works. All currently working barrier implementations on Linux are built upon queue drains and cache flushes, plus sometimes setting the FUA bit.
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 16:35 ` Christoph Hellwig @ 2009-08-31 13:15 ` Ric Wheeler 2009-08-31 13:16 ` Christoph Hellwig 0 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-31 13:15 UTC (permalink / raw) To: Christoph Hellwig Cc: Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/30/2009 12:35 PM, Christoph Hellwig wrote: > On Sun, Aug 30, 2009 at 06:44:04PM +0400, Michael Tokarev wrote: >>> If you lose power with the write caches enabled on that same 5 drive >>> RAID set, you could lose as much as 5 * 32MB of freshly written data on >>> a power loss (16-32MB write caches are common on s-ata disks these >>> days). >> >> This is fundamentally wrong. Many filesystems today use either barriers >> or flushes (if barriers are not supported), and the times when disk drives >> were lying to the OS that the cache got flushed are long gone. > > While most common filesystem do have barrier support it is: > > - not actually enabled for the two most common filesystems > - the support for write barriers an cache flushing tends to be buggy > all over our software stack, > Or just missing - I think that MD5/6 simply drop the requests at present. I wonder if it would be worth having MD probe for write cache enabled & warn if barriers are not supported? >>> For MD5 (and MD6), you really must run with the write cache disabled >>> until we get barriers to work for those configurations. >> >> I highly doubt barriers will ever be supported on anything but simple >> raid1, because it's impossible to guarantee ordering across multiple >> drives. Well, it *is* possible to have write barriers with journalled >> (and/or with battery-backed-cache) raid[456]. 
>> >> Note that even if raid[456] does not support barriers, write cache >> flushes still works. > > All currently working barrier implementations on Linux are built upon > queue drains and cache flushes, plus sometimes setting the FUA bit. > ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 13:15 ` Ric Wheeler @ 2009-08-31 13:16 ` Christoph Hellwig 2009-08-31 13:19 ` Mark Lord 2009-08-31 13:22 ` Ric Wheeler 0 siblings, 2 replies; 309+ messages in thread From: Christoph Hellwig @ 2009-08-31 13:16 UTC (permalink / raw) To: Ric Wheeler Cc: Christoph Hellwig, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote: >> While most common filesystems do have barrier support, it is: >> >> - not actually enabled for the two most common filesystems >> - the support for write barriers and cache flushing tends to be buggy >> all over our software stack. >> > > Or just missing - I think that MD5/6 simply drop the requests at present. > > I wonder if it would be worth having MD probe for write cache enabled & > warn if barriers are not supported? In my opinion even that is too weak. We know how to control the cache settings on all common disks (that is scsi and ata), so we should always disable the write cache unless we know that the whole stack (filesystem, raid, volume managers) supports barriers. And even then we should make sure the filesystems do actually use barriers everywhere that's needed, which we failed at for years. ^ permalink raw reply [flat|nested] 309+ messages in thread
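[Editorial note: the default Christoph argues for amounts to a simple whole-stack predicate. A hypothetical sketch follows; the Layer type and the example stack are invented for illustration, not kernel code:]

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    passes_barriers: bool

def safe_to_enable_write_cache(stack):
    """Sketch of the proposed default policy: leave volatile write
    caches on only when every layer -- filesystem, raid, volume
    manager -- passes barrier requests through to the disks."""
    return all(layer.passes_barriers for layer in stack)

# An ext3-on-md-raid5 stack as described in this thread, where md
# raid5 drops barrier requests:
stack = [Layer("ext3", True), Layer("md-raid5", False), Layer("sda", True)]
print(safe_to_enable_write_cache(stack))  # False -> default cache off
```

One weak link anywhere in the stack makes the cache-on default unsafe, which is why probing a single disk's cache setting, as suggested above, is not enough on its own.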
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 13:16 ` Christoph Hellwig @ 2009-08-31 13:19 ` Mark Lord 2009-08-31 13:21 ` Christoph Hellwig 2009-08-31 13:22 ` Ric Wheeler 1 sibling, 1 reply; 309+ messages in thread From: Mark Lord @ 2009-08-31 13:19 UTC (permalink / raw) To: Christoph Hellwig Cc: Ric Wheeler, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Christoph Hellwig wrote: > On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote: >>> While most common filesystem do have barrier support it is: >>> >>> - not actually enabled for the two most common filesystems >>> - the support for write barriers an cache flushing tends to be buggy >>> all over our software stack, >>> >> Or just missing - I think that MD5/6 simply drop the requests at present. >> >> I wonder if it would be worth having MD probe for write cache enabled & >> warn if barriers are not supported? > > In my opinion even that is too weak. We know how to control the cache > settings on all common disks (that is scsi and ata), so we should always > disable the write cache unless we know that the whole stack (filesystem, > raid, volume managers) supports barriers. And even then we should make > sure the filesystems does actually use barriers everywhere that's needed > which failed at for years. .. That stack does not know that my MD device has full battery backup, so it bloody well better NOT prevent me from enabling the write caches. In fact, MD should have nothing to do with that. I do like/prefer the way that XFS currently does it: disables barriers and logs the event, but otherwise doesn't try to enforce policy upon me from kernel space. Cheers ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 13:19 ` Mark Lord @ 2009-08-31 13:21 ` Christoph Hellwig 2009-08-31 15:14 ` jim owens 2009-09-03 1:59 ` Ric Wheeler 0 siblings, 2 replies; 309+ messages in thread From: Christoph Hellwig @ 2009-08-31 13:21 UTC (permalink / raw) To: Mark Lord Cc: Christoph Hellwig, Ric Wheeler, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Mon, Aug 31, 2009 at 09:19:58AM -0400, Mark Lord wrote: >> In my opinion even that is too weak. We know how to control the cache >> settings on all common disks (that is scsi and ata), so we should always >> disable the write cache unless we know that the whole stack (filesystem, >> raid, volume managers) supports barriers. And even then we should make >> sure the filesystems does actually use barriers everywhere that's needed >> which failed at for years. > .. > > That stack does not know that my MD device has full battery backup, > so it bloody well better NOT prevent me from enabling the write caches. No one is going to prevent you from doing it. The question is one of sane defaults, and "always safe, but slower if you have advanced equipment" is a much better default than unsafe by default on most of the install base. ^ permalink raw reply [flat|nested] 309+ messages in thread
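Christoph's proposed default can be stated as a tiny decision rule. The sketch below is purely illustrative — the layer descriptions and the `supports_barriers` flag are invented for the example, not real kernel or MD interfaces:

```python
# Sketch of the "safe by default" policy argued for above: enable the
# drive write cache only if every layer of the storage stack can pass
# barriers/flushes down to the device. The dict layout is hypothetical.

def write_cache_should_be_enabled(stack):
    """True only if the whole stack (filesystem, RAID, volume
    managers) supports barriers; otherwise the cache stays off."""
    return all(layer.get("supports_barriers", False) for layer in stack)

# A stack where MD RAID5 drops barrier requests: cache stays off.
stack = [
    {"name": "ext3", "supports_barriers": True},
    {"name": "md-raid5", "supports_barriers": False},
    {"name": "sata-disk", "supports_barriers": True},
]
print(write_cache_should_be_enabled(stack))  # False -> leave cache off
```

A sophisticated user with battery backup, as Mark describes, would simply override the default rather than be bound by it.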
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 13:21 ` Christoph Hellwig @ 2009-08-31 15:14 ` jim owens 2009-09-03 1:59 ` Ric Wheeler 1 sibling, 0 replies; 309+ messages in thread From: jim owens @ 2009-08-31 15:14 UTC (permalink / raw) To: Christoph Hellwig Cc: Mark Lord, Ric Wheeler, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Christoph Hellwig wrote: > On Mon, Aug 31, 2009 at 09:19:58AM -0400, Mark Lord wrote: >>> In my opinion even that is too weak. We know how to control the cache >>> settings on all common disks (that is scsi and ata), so we should always >>> disable the write cache unless we know that the whole stack (filesystem, >>> raid, volume managers) supports barriers. And even then we should make >>> sure the filesystems does actually use barriers everywhere that's needed >>> which failed at for years. >> .. >> >> That stack does not know that my MD device has full battery backup, >> so it bloody well better NOT prevent me from enabling the write caches. > > No one is going to prevent you from doing it. That question is one of > sane defaults. And always safe, but slower if you have advanced > equipment is a much better default than usafe by default on most of > the install base. I've always agreed with "be safe first" and have worked where we always shut write cache off unless we knew it had battery. But before we make disabling cache the default, this is the impact: - users will see it as a performance regression - trashy OS vendors who never disable cache will benchmark better than "out of the box" linux. Because as we all know, users don't read release notes. Been there, done that, felt the pain. jim ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 13:21 ` Christoph Hellwig 2009-08-31 15:14 ` jim owens @ 2009-09-03 1:59 ` Ric Wheeler 2009-09-03 11:12 ` Krzysztof Halasa 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-09-03 1:59 UTC (permalink / raw) To: Christoph Hellwig Cc: Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/31/2009 09:21 AM, Christoph Hellwig wrote: > On Mon, Aug 31, 2009 at 09:19:58AM -0400, Mark Lord wrote: >>> In my opinion even that is too weak. We know how to control the cache >>> settings on all common disks (that is scsi and ata), so we should always >>> disable the write cache unless we know that the whole stack (filesystem, >>> raid, volume managers) supports barriers. And even then we should make >>> sure the filesystems does actually use barriers everywhere that's needed >>> which failed at for years. >> .. >> >> That stack does not know that my MD device has full battery backup, >> so it bloody well better NOT prevent me from enabling the write caches. > > No one is going to prevent you from doing it. That question is one of > sane defaults. And always safe, but slower if you have advanced > equipment is a much better default than usafe by default on most of > the install base. > Just to add some support to this, all of the external RAID arrays that I know of normally run with write cache disabled on the component drives. In addition, many of them will disable their internal write cache if/when they detect that they have lost their UPS. I think that if we had done this kind of sane default earlier for MD levels that do not handle barriers, we would not have left some people worried about our software RAID. 
To be clear, if a sophisticated user wants to override this default, that should be supported. It is not (in my opinion) a safe default behaviour. Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-03 1:59 ` Ric Wheeler @ 2009-09-03 11:12 ` Krzysztof Halasa 2009-09-03 11:18 ` Ric Wheeler 0 siblings, 1 reply; 309+ messages in thread From: Krzysztof Halasa @ 2009-09-03 11:12 UTC (permalink / raw) To: Ric Wheeler Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Ric Wheeler <rwheeler@redhat.com> writes: > Just to add some support to this, all of the external RAID arrays that > I know of normally run with write cache disabled on the component > drives. Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones? -- Krzysztof Halasa ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-03 11:12 ` Krzysztof Halasa @ 2009-09-03 11:18 ` Ric Wheeler 2009-09-03 13:34 ` Krzysztof Halasa 0 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-09-03 11:18 UTC (permalink / raw) To: Krzysztof Halasa Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 09/03/2009 07:12 AM, Krzysztof Halasa wrote: > Ric Wheeler<rwheeler@redhat.com> writes: > >> Just to add some support to this, all of the external RAID arrays that >> I know of normally run with write cache disabled on the component >> drives. > > Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones? Which drives the various vendors ship changes with specific products. Usually, they ship drives that have carefully vetted firmware, etc., but they are close to the same drives you buy on the open market. Seagate has a huge slice of the market. ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-03 11:18 ` Ric Wheeler @ 2009-09-03 13:34 ` Krzysztof Halasa 2009-09-03 13:50 ` Ric Wheeler 2009-09-03 14:35 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) david 0 siblings, 2 replies; 309+ messages in thread From: Krzysztof Halasa @ 2009-09-03 13:34 UTC (permalink / raw) To: Ric Wheeler Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Ric Wheeler <rwheeler@redhat.com> writes: >>> Just to add some support to this, all of the external RAID arrays that >>> I know of normally run with write cache disabled on the component >>> drives. >> >> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones? > > Which drives various vendors ships changes with specific products. > Usually, they ship drives that have carefully vetted firmware, etc. > but they are close to the same drives you buy on the open market. But they aren't the same, are they? If they are not, the fact they can run well with the write-through cache doesn't mean the off-the-shelf ones can do as well. Are they SATA (or PATA) at all? SCSI etc. are usually different animals, though there are SCSI and SATA models which differ only in electronics. Do you have battery-backed write-back RAID cache (which acknowledges flushes before the data is written out to disks)? PC can't do that. -- Krzysztof Halasa ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-03 13:34 ` Krzysztof Halasa @ 2009-09-03 13:50 ` Ric Wheeler 2009-09-03 13:59 ` Krzysztof Halasa 2009-09-03 14:35 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) david 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-09-03 13:50 UTC (permalink / raw) To: Krzysztof Halasa Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 09/03/2009 09:34 AM, Krzysztof Halasa wrote: > Ric Wheeler<rwheeler@redhat.com> writes: > >>>> Just to add some support to this, all of the external RAID arrays that >>>> I know of normally run with write cache disabled on the component >>>> drives. >>> >>> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones? >> >> Which drives various vendors ships changes with specific products. >> Usually, they ship drives that have carefully vetted firmware, etc. >> but they are close to the same drives you buy on the open market. > > But they aren't the same, are they? If they are not, the fact they can > run well with the write-through cache doesn't mean the off-the-shelf > ones can do as well. Storage vendors have a wide range of options, but what you get today is a collection of s-ata (not much any more), sas or fc. Some times they will have different firmware, other times it is the same. > > Are they SATA (or PATA) at all? SCSI etc. are usually different > animals, though there are SCSI and SATA models which differ only in > electronics. > > Do you have battery-backed write-back RAID cache (which acknowledges > flushes before the data is written out to disks)? PC can't do that. We (red hat) have all kinds of different raid boxes... 
ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-03 13:50 ` Ric Wheeler @ 2009-09-03 13:59 ` Krzysztof Halasa 2009-09-03 14:15 ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler 0 siblings, 1 reply; 309+ messages in thread From: Krzysztof Halasa @ 2009-09-03 13:59 UTC (permalink / raw) To: Ric Wheeler Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Ric Wheeler <rwheeler@redhat.com> writes: > We (red hat) have all kinds of different raid boxes... I have no doubt about it, but are those you know equipped with battery-backed write-back cache? Are they using SATA disks? We can _at_best_ compare non-battery-backed RAID using SATA disks with what we typically have in a PC. -- Krzysztof Halasa ^ permalink raw reply [flat|nested] 309+ messages in thread
* wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-03 13:59 ` Krzysztof Halasa @ 2009-09-03 14:15 ` Ric Wheeler 2009-09-03 14:26 ` Florian Weimer ` (3 more replies) 0 siblings, 4 replies; 309+ messages in thread From: Ric Wheeler @ 2009-09-03 14:15 UTC (permalink / raw) To: Krzysztof Halasa Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 09/03/2009 09:59 AM, Krzysztof Halasa wrote: > Ric Wheeler<rwheeler@redhat.com> writes: > >> We (red hat) have all kinds of different raid boxes... > > A have no doubt about it, but are those you know equipped with > battery-backed write-back cache? Are they using SATA disks? > > We can _at_best_ compare non-battery-backed RAID using SATA disks with > what we typically have in a PC. The whole thread above is about software MD using commodity drives (S-ATA or SAS) without battery backed write cache. We have that (and I have it personally) and do test it. You must disable the write cache on these commodity drives *if* the MD RAID level does not support barriers properly. This will greatly reduce errors after a power loss (both in degraded state and non-degraded state), but it will not eliminate data loss entirely. You simply cannot do that with any storage device! Note that even without MD raid, the file system issues IO's in file system block size (4096 bytes normally) and most commodity storage devices use a 512 byte sector size which means that we have to update 8 512b sectors. Drives can (and do) have multiple platters and surfaces and it is perfectly normal to have contiguous logical ranges of sectors map to non-contiguous sectors physically. Imagine a 4KB write stripe that straddles two adjacent tracks on one platter (requiring a seek) or mapped across two surfaces (requiring a head switch). 
Also, a remapped sector can require more or less a full surface seek from wherever you are to the remapped sector area of the drive. These are all cases in which, after a power loss, even a local (non-MD) device can do a partial update of that 4KB write range of sectors. Note that unlike RAID/MD, local storage has no parity on the server to detect this partial write. This is why new file systems like btrfs and zfs do checksumming of data and metadata. This won't prevent partial updates during a write, but can at least detect them and try to do some kind of recovery. In other words, this is not just an MD issue, it is entirely possible even with non-MD devices. Also, when you enable the write cache (MD or not) you are buffering multiple MB's of data that can go away on power loss. Far greater (10x) the exposure that the partial RAID rewrite case worries about. ric ^ permalink raw reply [flat|nested] 309+ messages in thread
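The scenario Ric describes — a 4KB filesystem block spanning eight 512-byte sectors, with a power loss leaving only some of them updated, detectable only via a data checksum of the kind btrfs/zfs keep — can be simulated in a few lines. This is a toy model of the failure mode, not real disk behaviour:

```python
import hashlib

SECTOR = 512
BLOCK = 4096  # one filesystem block = 8 sectors

def write_block(disk, offset, data, fail_after_sectors=None):
    """Write a 4 KB block sector by sector; a power loss is modeled
    by stopping after `fail_after_sectors` sectors have hit media."""
    for i in range(BLOCK // SECTOR):
        if fail_after_sectors is not None and i >= fail_after_sectors:
            return  # power lost: remaining sectors keep old contents
        disk[offset + i * SECTOR:offset + (i + 1) * SECTOR] = \
            data[i * SECTOR:(i + 1) * SECTOR]

disk = bytearray(BLOCK)            # old contents: all zeroes
new = bytes([0xAB]) * BLOCK
csum = hashlib.sha256(new).hexdigest()  # checksum kept with metadata

write_block(disk, 0, new, fail_after_sectors=3)  # power fails mid-block
torn = hashlib.sha256(bytes(disk)).hexdigest() != csum
print(torn)  # True: the partial update is detected, btrfs/zfs-style
```

Without the stored checksum, nothing on a local (non-parity) device flags that sectors 3-7 still hold the old data.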
* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-03 14:15 ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler @ 2009-09-03 14:26 ` Florian Weimer 2009-09-03 15:09 ` Ric Wheeler 2009-09-03 23:50 ` Krzysztof Halasa ` (2 subsequent siblings) 3 siblings, 1 reply; 309+ messages in thread From: Florian Weimer @ 2009-09-03 14:26 UTC (permalink / raw) To: Ric Wheeler Cc: Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet * Ric Wheeler: > Note that even without MD raid, the file system issues IO's in file > system block size (4096 bytes normally) and most commodity storage > devices use a 512 byte sector size which means that we have to update > 8 512b sectors. Database software often attempts to deal with this phenomenon (sometimes called "torn page writes"). For example, you can make sure that the first time you write to a database page, you keep a full copy in your transaction log. If the machine crashes, the log is replayed, first completely overwriting the partially-written page. Only after that, you can perform logical/incremental logging. The log itself has to be protected with a different mechanism, so that you don't try to replay bad data. But you haven't committed to this data yet, so it is fine to skip bad records. Therefore, sub-page corruption is a fundamentally different issue from super-page corruption. BTW, older textbooks will tell you that mirroring requires that you read from two copies of the data and compare it (and have some sort of tie breaker if you need availability). And you also have to re-read data you've just written to disk, to make sure it's actually there and hit the expected sectors. We can't even do this anymore, thanks to disk caches.
And it doesn't seem to be necessary in most cases. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 ^ permalink raw reply [flat|nested] 309+ messages in thread
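The full-page-logging technique Florian describes — log a complete, checksummed copy of a page before its first write, then on recovery overwrite the possibly-torn on-disk page with the logged image — can be sketched as a toy model. All names here are illustrative, and `sum()` stands in for a real log-record checksum:

```python
# Minimal sketch of "full page writes" in a write-ahead log.
# The first write to a page logs a complete copy, so replay can
# repair a torn page wholesale. Not any real database's format.

class MiniDB:
    def __init__(self):
        self.pages = {}  # page_id -> bytes (the "disk")
        self.log = []    # (page_id, full_page_image, checksum)

    def write_page(self, pid, data):
        # 1. write-ahead: log a full, checksummed copy first
        self.log.append((pid, data, sum(data)))
        # 2. then write the page itself -- this write may be torn
        self.pages[pid] = data

    def crash_tearing(self, pid):
        # simulate a torn write: only half the page reached disk
        old = self.pages[pid]
        self.pages[pid] = old[:len(old) // 2] + b"\x00" * (len(old) // 2)

    def replay(self):
        # skip log records whose own checksum is bad (uncommitted
        # data, fine to drop); good records fully overwrite the page
        for pid, data, csum in self.log:
            if sum(data) == csum:
                self.pages[pid] = data

db = MiniDB()
db.write_page(1, b"\x01\x02\x03\x04")
db.crash_tearing(1)
db.replay()
print(db.pages[1])  # b'\x01\x02\x03\x04' restored from the log
```

Only after a page's first full image is logged can the engine switch to cheaper logical/incremental log records, exactly as the message says.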
* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-03 14:26 ` Florian Weimer @ 2009-09-03 15:09 ` Ric Wheeler 0 siblings, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-09-03 15:09 UTC (permalink / raw) To: Florian Weimer Cc: Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 09/03/2009 10:26 AM, Florian Weimer wrote: > * Ric Wheeler: > >> Note that even without MD raid, the file system issues IO's in file >> system block size (4096 bytes normally) and most commodity storage >> devices use a 512 byte sector size which means that we have to update >> 8 512b sectors. > > Database software often attempts to deal with this phenomenon > (sometimes called "torn page writes"). For example, you can make sure > that the first time you write to a database page, you keep a full copy > in your transaction log. If the machine crashes, the log is replayed, > first completely overwriting the partially-written page. Only after > that, you can perform logical/incremental logging. > > The log itself has to be protected with a different mechanism, so that > you don't try to replay bad data. But you haven't comitted to this > data yet, so it is fine to skip bad records. Yes - databases worry a lot about this. Another technique that they tend to use is to have state bits at the beginning and end of their logical pages. For example, the first byte and last byte toggle together from 1 to 0 to 1 to 0 as you update. If the bits don't match, that is a quick level indication of a torn write. Even with the above scheme, you can still have data loss of course - you just need an IO error in the log and in your db table that was recently updated. 
Not entirely unlikely, especially if you use write cache enabled storage and don't flush that cache :-) > > Therefore, sub-page corruption is a fundamentally different issue from > super-page corruption. We have to be careful to keep our terms clear since the DB pages are (usually) larger than the FS block size which in turn is larger than non-RAID storage sector size. At the FS level, we send down multiples of fs blocks (not blocked/aligned at RAID stripe levels, etc). In any case, we can get sub-FS block level "torn writes" even with a local S-ATA drive in edge conditions. > > BTW, older textbooks will tell you that mirroring requires that you > read from two copies of the data and compare it (and have some sort of > tie breaker if you need availability). And you also have to re-read > data you've just written to disk, to make sure it's actually there and > hit the expected sectors. We can't even do this anymore, thanks to > disk caches. And it doesn't seem to be necessary in most cases. > We can do something like this with the built in RAID in btrfs. If you detect an IO error (or bad checksum) on a read, btrfs knows how to request/grab another copy. Also note that the SCSI T10 DIF/DIX has baked in support for applications to layer on extra data integrity (look for MKP's slide decks). This is really neat since you can intercept bad IO's on the way down and prevent overwriting good data. ric ^ permalink raw reply [flat|nested] 309+ messages in thread
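The state-bit trick Ric mentions — the first and last byte of a logical page toggle together on every update, so a mismatch is a quick indication of a torn write — is easy to model. The page layout below is invented for the example:

```python
# Sketch of torn-write detection via matching state bytes at both
# ends of a logical page. Layout (1 state byte + payload + 1 state
# byte) is illustrative, not any particular database's format.

def make_page(payload, generation):
    bit = generation % 2  # toggles 0 -> 1 -> 0 on each update
    return bytes([bit]) + payload + bytes([bit])

def is_torn(page):
    # ends disagree => the rewrite did not complete
    return page[0] != page[-1]

p0 = make_page(b"ABCDEF", generation=0)
p1 = make_page(b"GHIJKL", generation=1)
# power loss mid-rewrite: head of the new page, tail of the old one
torn = p1[:4] + p0[4:]
print(is_torn(p0), is_torn(torn))  # False True
```

As the thread notes, this only detects the common torn-write shapes cheaply; an IO error in both the log and the table can still lose data.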
* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-03 14:15 ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler 2009-09-03 14:26 ` Florian Weimer @ 2009-09-03 23:50 ` Krzysztof Halasa 2009-09-04 0:39 ` Ric Wheeler 2009-09-04 21:21 ` Mark Lord 2009-09-07 11:45 ` Pavel Machek 3 siblings, 1 reply; 309+ messages in thread From: Krzysztof Halasa @ 2009-09-03 23:50 UTC (permalink / raw) To: Ric Wheeler Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Ric Wheeler <rwheeler@redhat.com> writes: > The whole thread above is about software MD using commodity drives > (S-ATA or SAS) without battery backed write cache. Yes. However, you mentioned external RAID arrays disable disk caches. That's why I asked if they are using SATA or SCSI/etc. disks, and if they have battery-backed cache. > Also, when you enable the write cache (MD or not) you are buffering > multiple MB's of data that can go away on power loss. Far greater > (10x) the exposure that the partial RAID rewrite case worries about. The cache is flushed with working barriers. I guess it should be superior to disabled WB cache, in both performance and expected disk lifetime. -- Krzysztof Halasa ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-03 23:50 ` Krzysztof Halasa @ 2009-09-04 0:39 ` Ric Wheeler 0 siblings, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-09-04 0:39 UTC (permalink / raw) To: Krzysztof Halasa Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 09/03/2009 07:50 PM, Krzysztof Halasa wrote: > Ric Wheeler<rwheeler@redhat.com> writes: > > >> The whole thread above is about software MD using commodity drives >> (S-ATA or SAS) without battery backed write cache. >> > Yes. However, you mentioned external RAID arrays disable disk caches. > That's why I asked if they are using SATA or SCSI/etc. disks, and if > they have battery-backed cache. > > Sorry for the confusion - they disable the write caches on the component drives normally, but have their own write cache which is not disabled in most cases. >> Also, when you enable the write cache (MD or not) you are buffering >> multiple MB's of data that can go away on power loss. Far greater >> (10x) the exposure that the partial RAID rewrite case worries about. >> > The cache is flushed with working barriers. I guess it should be > superior to disabled WB cache, in both performance and expected disk > lifetime. > True - barriers (especially on big, slow s-ata drives) usually give you an overall win. SAS drives it seems to make less of an impact, but then you always need to benchmark your workload on anything to get the only numbers that really matter :-) ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-03 14:15 ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler 2009-09-03 14:26 ` Florian Weimer 2009-09-03 23:50 ` Krzysztof Halasa @ 2009-09-04 21:21 ` Mark Lord 2009-09-04 21:29 ` Ric Wheeler 2009-09-07 11:45 ` Pavel Machek 3 siblings, 1 reply; 309+ messages in thread From: Mark Lord @ 2009-09-04 21:21 UTC (permalink / raw) To: Ric Wheeler Cc: Krzysztof Halasa, Christoph Hellwig, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Ric Wheeler wrote: .. > You must disable the write cache on these commodity drives *if* the MD > RAID level does not support barriers properly. .. Rather than further trying to cripple Linux on the notebook, (it's bad enough already).. How about instead, *fixing* the MD layer to properly support barriers? That would be far more useful, productive, and better for end-users. Cheers ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-04 21:21 ` Mark Lord @ 2009-09-04 21:29 ` Ric Wheeler 2009-09-05 12:57 ` Mark Lord 0 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-09-04 21:29 UTC (permalink / raw) To: Mark Lord Cc: Krzysztof Halasa, Christoph Hellwig, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 09/04/2009 05:21 PM, Mark Lord wrote: > Ric Wheeler wrote: > .. >> You must disable the write cache on these commodity drives *if* the >> MD RAID level does not support barriers properly. > .. > > Rather than further trying to cripple Linux on the notebook, > (it's bad enough already).. People using MD on notebooks (not sure there are that many using RAID5 MD) could leave their write cache enabled. > > How about instead, *fixing* the MD layer to properly support barriers? > That would be far more useful, productive, and better for end-users. > > Cheers Fixing MD would be great - not sure that it would end up still faster (look at md1 devices with working barriers compared to md1 with write cache disabled). In the meantime, if you are using MD to make your data more reliable, I would still strongly urge you to disable the write cache when you see "barriers disabled" messages spit out in /var/log/messages :-) ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-04 21:29 ` Ric Wheeler @ 2009-09-05 12:57 ` Mark Lord 2009-09-05 13:40 ` Ric Wheeler 2009-09-05 21:43 ` NeilBrown 0 siblings, 2 replies; 309+ messages in thread From: Mark Lord @ 2009-09-05 12:57 UTC (permalink / raw) To: Ric Wheeler Cc: Krzysztof Halasa, Christoph Hellwig, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Ric Wheeler wrote: > On 09/04/2009 05:21 PM, Mark Lord wrote: .. >> How about instead, *fixing* the MD layer to properly support barriers? >> That would be far more useful, productive, and better for end-users. .. > Fixing MD would be great - not sure that it would end up still faster > (look at md1 devices with working barriers with compared to md1 with > write cache disabled). .. There's no inherent reason for it to be slower, except possibly drives with b0rked FUA support. So the first step is to fix MD to pass barriers to the LLDs for most/all RAID types. Then, if it has performance issues, those can be addressed by more application of little grey cells. :) Cheers ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-05 12:57 ` Mark Lord @ 2009-09-05 13:40 ` Ric Wheeler 2009-09-05 21:43 ` NeilBrown 1 sibling, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-09-05 13:40 UTC (permalink / raw) To: Mark Lord Cc: Krzysztof Halasa, Christoph Hellwig, Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 09/05/2009 08:57 AM, Mark Lord wrote: > Ric Wheeler wrote: >> On 09/04/2009 05:21 PM, Mark Lord wrote: > .. >>> How about instead, *fixing* the MD layer to properly support barriers? >>> That would be far more useful, productive, and better for end-users. > .. >> Fixing MD would be great - not sure that it would end up still faster >> (look at md1 devices with working barriers with compared to md1 with >> write cache disabled). > .. > > There's no inherent reason for it to be slower, except possibly > drives with b0rked FUA support. > > So the first step is to fix MD to pass barriers to the LLDs > for most/all RAID types. > Then, if it has performance issues, those can be addressed > by more application of little grey cells. :) > > Cheers The performance issue with MD is that the "simple" answer is to not only pass on those downstream barrier ops, but also to block and wait until all of those dependent barrier ops complete before ack'ing the IO. When you do that implementation at least, you will see a very large performance impact and I am not sure that you would see any degradation vs just turning off the write caches. Sounds like we should actually do some testing and actually measure, I do think that it will vary with the class of device quite a lot just like we see with single disk barriers vs write cache disabled on SAS vs S-ATA, etc... ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-05 12:57 ` Mark Lord 2009-09-05 13:40 ` Ric Wheeler @ 2009-09-05 21:43 ` NeilBrown 1 sibling, 0 replies; 309+ messages in thread From: NeilBrown @ 2009-09-05 21:43 UTC (permalink / raw) To: Mark Lord Cc: Ric Wheeler, Krzysztof Halasa, Christoph Hellwig, Michael Tokarev, david, Pavel Machek, Theodore Tso, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sat, September 5, 2009 10:57 pm, Mark Lord wrote: > Ric Wheeler wrote: >> On 09/04/2009 05:21 PM, Mark Lord wrote: > .. >>> How about instead, *fixing* the MD layer to properly support barriers? >>> That would be far more useful, productive, and better for end-users. > .. >> Fixing MD would be great - not sure that it would end up still faster >> (look at md1 devices with working barriers with compared to md1 with >> write cache disabled). > .. > > There's no inherent reason for it to be slower, except possibly > drives with b0rked FUA support. > > So the first step is to fix MD to pass barriers to the LLDs > for most/all RAID types. Having MD "pass barriers" to LLDs isn't really very useful. The barrier needs to act with respect to all addresses of the device, and once you pass it down, it can only act with respect to addresses on that device. What any striping RAID level needs to do when it sees a barrier is:

  suspend all future writes
  drain and flush all queues
  submit the barrier write
  drain and flush all queues
  unsuspend writes

I guess "drain and flush all queues" can be done with an empty barrier, so maybe that is exactly what you meant. The double flush which (I think) is required by the barrier semantic is unfortunate. I wonder if it would actually make things slower than necessary. NeilBrown > > Then, if it has performance issues, those can be addressed > by more application of little grey cells.
:) > > Cheers > ^ permalink raw reply [flat|nested] 309+ messages in thread
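Neil's five-step sequence above can be sketched as a toy state machine. The function names below are invented for illustration only (real MD is C kernel code, and as he notes, each "drain and flush" leg might itself be implemented as an empty barrier):

```shell
#!/bin/sh
# Illustrative sketch of the striped-RAID barrier sequence described above.
# All function names are made up; this only demonstrates the ordering.
log=""
step() { log="$log$1;"; echo "step: $1"; }

suspend_writes()   { step suspend_writes; }
drain_and_flush()  { step drain_and_flush; }      # per-member flush leg
submit_barrier()   { step submit_barrier_write; }
unsuspend_writes() { step unsuspend_writes; }

md_handle_barrier() {
    suspend_writes
    drain_and_flush      # first flush: everything before the barrier is on media
    submit_barrier
    drain_and_flush      # second flush: the barrier write itself is on media
    unsuspend_writes
}

md_handle_barrier
echo "$log"
```

The point the sketch makes concrete is the double flush: the barrier write is bracketed by two full drain-and-flush cycles across every member device, which is why a naive implementation can end up slower than simply running with the write caches off.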
* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-03 14:15 ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler ` (2 preceding siblings ...) 2009-09-04 21:21 ` Mark Lord @ 2009-09-07 11:45 ` Pavel Machek 2009-09-07 13:10 ` Theodore Tso 3 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-09-07 11:45 UTC (permalink / raw) To: Ric Wheeler Cc: Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev, david, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! > Note that even without MD raid, the file system issues IO's in file > system block size (4096 bytes normally) and most commodity storage > devices use a 512 byte sector size which means that we have to update 8 > 512b sectors. > > Drives can (and do) have multiple platters and surfaces and it is > perfectly normal to have contiguous logical ranges of sectors map to > non-contiguous sectors physically. Imagine a 4KB write stripe that > straddles two adjacent tracks on one platter (requiring a seek) or mapped > across two surfaces (requiring a head switch). Also, a remapped sector > can require more or less a full surface seek from where ever you are to > the remapped sector area of the drive. Yes, but ext3 was designed to handle the partial write (according to tytso). > These are all examples that can after a power loss, even a local > (non-MD) device, do a partial update of that 4KB write range of > sectors. Yes, but ext3 journal protects metadata integrity in that case. > In other words, this is not just an MD issue, it is entirely possible > even with non-MD devices. > > Also, when you enable the write cache (MD or not) you are buffering > multiple MB's of data that can go away on power loss. Far greater (10x) > the exposure that the partial RAID rewrite case worries about. 
Yes, that's what barriers are for. Except that they are not there on MD0/MD5/MD6. They actually work on local sata drives... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage 2009-09-07 11:45 ` Pavel Machek @ 2009-09-07 13:10 ` Theodore Tso 2010-04-04 13:47 ` Pavel Machek 0 siblings, 1 reply; 309+ messages in thread From: Theodore Tso @ 2009-09-07 13:10 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev, david, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Mon, Sep 07, 2009 at 01:45:34PM +0200, Pavel Machek wrote: > > Yes, but ext3 was designed to handle the partial write (according to > tytso). I'm not sure what made you think that I said that. In practice things usually work out, as a consequence of the fact that ext3 uses physical block journaling, but it's not perfect, because... > > Also, when you enable the write cache (MD or not) you are buffering > > multiple MB's of data that can go away on power loss. Far greater (10x) > > the exposure that the partial RAID rewrite case worries about. > > Yes, that's what barriers are for. Except that they are not there on > MD0/MD5/MD6. They actually work on local sata drives... Yes, but ext3 does not enable barriers by default (the patch has been submitted but akpm has balked because he doesn't like the performance degradation and doesn't believe that Chris Mason's "workload of doom" is a common case). Note though that it is possible for dirty blocks to remain in the track buffer for *minutes* without being written to spinning rust platters without a barrier. See Chris Mason's report of this phenomenon here: http://lkml.org/lkml/2009/3/30/297 Here's Chris Mason's "barrier test", which will corrupt ext3 filesystems 50% of the time after a power drop if the filesystem is mounted with barriers disabled (which is the default; use the mount option barrier=1 to enable barriers): http://lkml.indiana.edu/hypermail/linux/kernel/0805.2/1518.html (Yes, ext4 has barriers enabled by default.) - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
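For reference, the `barrier=1` override Ted mentions is a per-mount option on ext3 of that era. A hedged sketch that only assembles the command (device and mountpoint are placeholders; whether the whole stack below the filesystem honours the barrier is a separate question, as the rest of the thread shows):

```shell
#!/bin/sh
# Build the mount invocation for ext3 with barriers explicitly enabled.
# DRY_RUN=1 (the default here) just prints the plan; device/mountpoint
# are placeholders, not real nodes.
dev=${DEV:-/dev/sdXN}
mnt=${MNT:-/mnt/data}
cmd="mount -t ext3 -o barrier=1 $dev $mnt"
if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "would run: $cmd"
else
    $cmd && grep " $mnt " /proc/mounts   # verify the option was accepted
fi
```

The same option can be applied to an already-mounted filesystem with `mount -o remount,barrier=1`.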
* fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage) 2009-09-07 13:10 ` Theodore Tso @ 2010-04-04 13:47 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2010-04-04 13:47 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev, david, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! > > Yes, but ext3 was designed to handle the partial write (according to > > tytso). > > I'm not sure what made you think that I said that. In practice things > usually work out, as a consequence of the fact that ext3 uses physical > block journaling, but it's not perfect, because... Ok; so the journalling actually is not reliable on many machines -- not even disk drive manufacturers guarantee full block writes AFAICT. Maybe there's time to revive the patch to increase mount count by >1 when journal is replayed, to do fsck more often when powerfails are present? > > > Also, when you enable the write cache (MD or not) you are buffering > > > multiple MB's of data that can go away on power loss. Far greater (10x) > > > the exposure that the partial RAID rewrite case worries about. > > > > Yes, that's what barriers are for. Except that they are not there on > > MD0/MD5/MD6. They actually work on local sata drives... > > Yes, but ext3 does not enable barriers by default (the patch has been > submitted but akpm has balked because he doesn't like the performance > degradation and doesn't believe that Chris Mason's "workload of doom" > is a common case). Note though that it is possible for dirty blocks > to remain in the track buffer for *minutes* without being written to > spinning rust platters without a barrier. So we do the wrong thing by default. Another reason to do fsck more often when powerfails are present?
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
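Pavel's idea (count a power-fail recovery as more than one "mount" so the periodic fsck fires sooner) can be approximated from userspace with tune2fs rather than a kernel patch. A hedged sketch; the device is a placeholder, the "+3 per power failure" weighting is arbitrary, and the parsed `tune2fs -l` line is canned so the sketch runs without a real disk:

```shell
#!/bin/sh
# Sketch: after a journal replay, artificially age the filesystem so
# the periodic fsck triggers sooner.
# parse_mount_count extracts the "Mount count" field from `tune2fs -l`.
parse_mount_count() {
    awk -F: '/^Mount count/ { gsub(/ /, "", $2); print $2 }'
}

dev=${1:-/dev/sdXN}                 # placeholder device node
if [ "${DRY_RUN:-1}" = 1 ]; then
    # canned superblock listing, so the sketch is runnable without a disk
    listing="Mount count:              17
Maximum mount count:      30"
else
    listing=$(tune2fs -l "$dev")
fi
count=$(printf '%s\n' "$listing" | parse_mount_count)
new=$((count + 3))                  # weight a power failure as 3 extra mounts
echo "tune2fs -C $new $dev"         # the command a boot script would run
```

A boot script could run this only when the init scripts detect that the journal was actually replayed, which is exactly the policy question being debated in the thread.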
* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage) 2010-04-04 13:47 ` Pavel Machek (?) @ 2010-04-04 17:39 ` tytso -1 siblings, 0 replies; 309+ messages in thread From: tytso @ 2010-04-04 17:39 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev, david, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sun, Apr 04, 2010 at 03:47:29PM +0200, Pavel Machek wrote: > > Yes, but ext3 does not enable barriers by default (the patch has been > > submitted but akpm has balked because he doesn't like the performance > > degradation and doesn't believe that Chris Mason's "workload of doom" > > is a common case). Note though that it is possible for dirty blocks > > to remain in the track buffer for *minutes* without being written to > > spinning rust platters without a barrier. > > So we do the wrong thing by default. Another reason to do fsck more often > when powerfails are present? Or migrate to ext4, which does use barriers by default, as well as journal-level checksumming. :-) As far as changing the default to enable barriers for ext3, you'll need to talk to akpm about that; he's the one who has been against it in the past. - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage) 2010-04-04 13:47 ` Pavel Machek (?) (?) @ 2010-04-04 17:59 ` Rob Landley 2010-04-04 18:45 ` Pavel Machek 2010-04-04 19:29 ` tytso -1 siblings, 2 replies; 309+ messages in thread From: Rob Landley @ 2010-04-04 17:59 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev, david, NeilBrown, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sunday 04 April 2010 08:47:29 Pavel Machek wrote: > Maybe there's time to reviwe the patch to increase mount count by >1 > when journal is replayed, to do fsck more often when powerfails are > present? Wow, you mean there are Linux users left who _don't_ rip that out? The auto-fsck stuff is an instance of "we the developers know what you the users need far more than you ever could, so let me ram this down your throat". I don't know of a server anywhere that can afford an unscheduled extra four hours of downtime due to the system deciding to fsck itself, and I don't know a Linux laptop user anywhere who would be happy to fire up their laptop and suddenly be told "oh, you can't do anything with it for two hours, and you can't power it down either". I keep my laptop backed up to an external terabyte USB drive and the volatile subset of it to a network drive (rsync is great for both), and when it dies, it dies. But I've never lost data due to an issue fsck would have fixed. I've lost data to disks overheating, disks wearing out, disks being run undervolt because the cat chewed on the power supply cord... I've copied floppy images to /dev/hda instead of /dev/fd0... I even ran over my laptop with my car once. (Amazingly enough, that hard drive survived.) 
But fsck has never once protected any data of mine, that I am aware of, since journaling was introduced. I'm all for btrfs coming along and being able to fsck itself behind my back where I don't have to care about it. (Although I want to tell it _not_ to do that when on battery power.) But the "fsck lottery" at powerup is just stupid. > > > > Also, when you enable the write cache (MD or not) you are buffering > > > > multiple MB's of data that can go away on power loss. Far greater > > > > (10x) the exposure that the partial RAID rewrite case worries about. > > > > > > Yes, that's what barriers are for. Except that they are not there on > > > MD0/MD5/MD6. They actually work on local sata drives... > > > > Yes, but ext3 does not enable barriers by default (the patch has been > > submitted but akpm has balked because he doesn't like the performance > > degradation and doesn't believe that Chris Mason's "workload of doom" > > is a common case). Note though that it is possible for dirty blocks > > to remain in the track buffer for *minutes* without being written to > > spinning rust platters without a barrier. > > So we do the wrong thing by default. Another reason to do fsck more often > when powerfails are present? My laptop power fails all the time, due to battery exhaustion. Back under KDE it was decent about suspending when it ran low on power, but ever since KDE 4 came out and I had to switch to XFCE, it's using the gnome infrastructure, which collects funky statistics and heuristics but can never quite save them to disk because suddenly running out of power when it thinks it's got 20 minutes left doesn't give it the opportunity to save its database. So it'll never auto-suspend, just suddenly die if I don't hit the button. As a result of one of these, two large media files in my "anime" subdirectory are not only crosslinked, but the common sector they share is bad. (It ran out of power in the act of writing that sector.
I left it copying large files to the drive and forgot to plug it in, and it did the loud click emergency park and power down thing when the hardware voltage regulator tripped.) This corruption has been there for a year now. Presumably if it overwrote that sector it might recover (perhaps by allocating one of the spares), but the drive firmware has proven unwilling to do so in response to _reading_ the bad sector, and I'm largely ignoring it because it's by no means the worst thing wrong with this laptop's hardware, and some glorious day I'll probably break down and buy a Macintosh. The stuff I have on it's backed up, and in the year since it hasn't developed a second bad sector and I haven't deleted those files. (Yes, I could replace the hard drive _again_ but this laptop's on its third hard drive already and it's just not worth the effort.) I'm much more comfortable living with this until I can get a new laptop than with the idea of running fsck on the system and letting it do who knows what in response to something that is not actually a problem.
* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage) 2010-04-04 17:59 ` Rob Landley @ 2010-04-04 18:45 ` Pavel Machek 2010-04-04 19:35 ` tytso 2010-04-04 19:29 ` tytso 1 sibling, 1 reply; 309+ messages in thread From: Pavel Machek @ 2010-04-04 18:45 UTC (permalink / raw) To: Rob Landley Cc: Theodore Tso, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev, david, NeilBrown, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sun 2010-04-04 12:59:16, Rob Landley wrote: > On Sunday 04 April 2010 08:47:29 Pavel Machek wrote: > > Maybe there's time to revive the patch to increase mount count by >1 > > when journal is replayed, to do fsck more often when powerfails are > > present? > > Wow, you mean there are Linux users left who _don't_ rip that out? Yes, there are. It actually helped pinpoint corruption here; 4 times it was major corruption. And yes, I'd like fsck more often when there are power failures and less often when the shutdowns are orderly... I'm not sure what the right intervals between checks are for you, but I'd say that fsck once a year or every 100 mounts or every 10 power failures is probably a good idea for everybody... > The auto-fsck stuff is an instance of "we the developers know what you the > users need far more than you ever could, so let me ram this down your throat". > I don't know of a server anywhere that can afford an unscheduled extra four > hours of downtime due to the system deciding to fsck itself, and I don't know > a Linux laptop user anywhere who would be happy to fire up their laptop and > suddenly be told "oh, you can't do anything with it for two hours, and you > can't power it down either". On a laptop the situation is easy. Pull the plug, hit reset, wait for fsck, plug AC back in. Done that, too :-). Yep, it would be nice if fsck had an "escape" button.
> I'm all for btrfs coming along and being able to fsck itself behind my back > where I don't have to care about it. (Although I want to tell it _not_ to do > that when on battery power.) But the "fsck lottery" at powerup is just > stupid. fsck lottery. :-). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage) 2010-04-04 18:45 ` Pavel Machek @ 2010-04-04 19:35 ` tytso 0 siblings, 0 replies; 309+ messages in thread From: tytso @ 2010-04-04 19:35 UTC (permalink / raw) To: Pavel Machek Cc: Rob Landley, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev, david, NeilBrown, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sun, Apr 04, 2010 at 08:45:46PM +0200, Pavel Machek wrote: > > I'm not sure what the right intervals between checks are for you, but > I'd say that fsck once a year or every 100 mounts or every 10 power > failures is probably a good idea for everybody... For people using e2croncheck, where you can check it when the system is idle and without needing to do a power cycle, I'd recommend once a week, actually. > > hours of downtime due to the system deciding to fsck itself, and I > > don't know a Linux laptop user anywhere who would be happy to fire > > up their laptop and suddenly be told "oh, you can't do anything > > with it for two hours, and you can't power it down either". > > On a laptop the situation is easy. Pull the plug, hit reset, wait for fsck, > plug AC back in. Done that, too :-). Some distributions will allow you to cancel an fsck; either by using ^C, or hitting escape. That's a matter for the boot scripts, which are distribution specific. Ubuntu has a way of doing this, for example, if I recall correctly --- although since I've started using e2croncheck, I've never had an issue with an e2fsck taking place on bootup. Also, with ext4, fscks are so much faster that even before I upgraded to using an SSD, it was never an issue for me. It's certainly not hours any more.... > Yep, it would be nice if fsck had an "escape" button.
:-) Or this is Linux and open source; fix it yourself, and submit the patches back to your distribution. If all you want to do is whine, then maybe Rob's choice is the best way, go switch to the velvet-lined closed system/jail which is the Macintosh. :-) (I created e2croncheck to solve my problem; if that isn't good enough for you, I encourage you to find/create your own fixes.) - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
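For reference, the e2croncheck approach Ted keeps pointing at boils down to fscking a read-only LVM snapshot while the real filesystem stays mounted. A condensed, illustrative sketch; the volume-group/LV names and snapshot size are placeholders, and the real script in e2fsprogs' contrib/ handles more (scheduling, mail on failure, resetting the last-checked stamp with tune2fs on success):

```shell
#!/bin/sh
# e2croncheck-style idea: check a snapshot, not the live filesystem.
# VG/LV names are placeholders. DRY_RUN=1 (default) only prints the plan.
VG=${VG:-vg0}; LV=${LV:-root}; SNAP="${LV}-check"
plan="lvcreate -s -L 512M -n $SNAP /dev/$VG/$LV
e2fsck -fn /dev/$VG/$SNAP
lvremove -f /dev/$VG/$SNAP"
if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$plan"
else
    lvcreate -s -L 512M -n "$SNAP" "/dev/$VG/$LV" || exit 1
    # -f forces a full check; -n guarantees the snapshot is never written
    e2fsck -fn "/dev/$VG/$SNAP"; rc=$?
    lvremove -f "/dev/$VG/$SNAP"
    exit $rc
fi
```

Because the snapshot is copy-on-write, the check runs against a frozen, consistent image and costs no downtime; only a clean result should be used to push back the boot-time check.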
* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage) 2010-04-04 17:59 ` Rob Landley 2010-04-04 18:45 ` Pavel Machek @ 2010-04-04 19:29 ` tytso 2010-04-04 23:58 ` Rob Landley 1 sibling, 1 reply; 309+ messages in thread From: tytso @ 2010-04-04 19:29 UTC (permalink / raw) To: Rob Landley Cc: Pavel Machek, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev, david, NeilBrown, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sun, Apr 04, 2010 at 12:59:16PM -0500, Rob Landley wrote: > I don't know of a server anywhere that can afford an unscheduled > extra four hours of downtime due to the system deciding to fsck > itself, and I don't know a Linux laptop user anywhere who would be > happy to fire up their laptop and suddenly be told "oh, you can't do > anything with it for two hours, and you can't power it down either". So what I recommend for server class machines is to either turn off the automatic fsck's (it's the default, but it's documented and there are supported ways of turning it off --- that's hardly developers "ramming" it down users' throats), or more preferably, to use LVM, and then take a snapshot and run fsck on the snapshot. > I'm all for btrfs coming along and being able to fsck itself behind > my back where I don't have to care about it. (Although I want to > tell it _not_ to do that when on battery power.) You can do this with ext3/ext4 today, now. Just take a look at e2croncheck in the contrib directory of e2fsprogs. Changing it to not do this when on battery power is a trivial exercise. > My laptop power fails all the time, due to battery exhaustion.
Back > under KDE it was decent about suspending when it was ran low on > power, but ever since KDE 4 came out and I had to switch to XFCE, > it's using the gnome infrastructure, which collects funky statistics > and heuristics but can never quite save them to disk because > suddenly running out of power when it thinks it's got 20 minutes > left doesn't give it the opportunity to save its database. So it'll > never auto-suspend, just suddenly die if I don't hit the button. Hmm, why are you running on battery so often? I make a point of running connected to the AC mains whenever possible, because a LiOn battery only has about 200 full-cycle charge/discharges in it, and given the cost of LiOn batteries, basically each charge/discharge cycle costs a dollar each. So I only run on batteries when I absolutely have to, and in practice it's rare that I dip below 30% or so. > As a result of one of these, two large media files in my "anime" > subdirectory are not only crosslinked, but the common sector they > share is bad. (It ran out of power in the act of writing that > sector. I left it copying large files to the drive and forgot to > plug it in, and it did the loud click emergency park and power down > thing when the hardware voltage regulator tripped.) So e2fsck would fix the cross-linking. We do need to have some better tools to do forced rewrite of sectors that have gone bad in a HDD. It can be done by using badblocks -n, but translating the sector number emitted by the device driver (which for some drivers is relative to the beginning of the partition, and for others is relative to the beginning of the disk). It is possible to run badblocks -w on the whole disk, of course, but it's better to just run it on the specific block in question. > I'm much more comfortable living with this until I can get a new laptop than > with the idea of running fsck on the system and letting it do who knows what > it response to something that is not actually a problem. 
Well, it actually is a problem. And there may be other problems hiding that you're not aware of. Running "badblocks -b 4096 -n" may discover other blocks that have failed, and you can then decide whether you want to let fsck fix things up. If you don't, though, it's probably not fair to blame ext3 or e2fsck for any future failures (not that it's likely to stop you :-). - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
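The sector-number translation Ted describes (LBA from the driver's error message to a filesystem block number relative to the partition) is plain integer arithmetic; a hedged sketch with made-up example numbers, so the resulting badblocks range covers exactly one 4 KiB block:

```shell
#!/bin/sh
# Translate an absolute 512-byte LBA (as many drivers print it) into a
# 4096-byte block number relative to the partition, then build the
# narrow badblocks invocation. All example numbers are made up.
lba=${1:-123456789}        # absolute sector from dmesg (512-byte units)
part_start=${2:-2048}      # partition's first sector (see fdisk -l)
bs=4096                    # filesystem block size
fs_block=$(( (lba - part_start) * 512 / bs ))
# badblocks takes: device last-block first-block, so a one-block range
# force-rewrites only the suspect block. -w is DESTRUCTIVE to that block;
# run e2fsck on the filesystem afterwards.
cmd="badblocks -b $bs -w /dev/sdXN $fs_block $fs_block"
echo "$cmd"
```

The caveat Ted raises still applies: whether the printed sector is absolute or partition-relative depends on the driver, so verify `part_start` against the partition table before letting `-w` touch anything.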
* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage) 2010-04-04 19:29 ` tytso @ 2010-04-04 23:58 ` Rob Landley 0 siblings, 0 replies; 309+ messages in thread From: Rob Landley @ 2010-04-04 23:58 UTC (permalink / raw) To: tytso Cc: Pavel Machek, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev, david, NeilBrown, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sunday 04 April 2010 14:29:12 tytso@mit.edu wrote: > On Sun, Apr 04, 2010 at 12:59:16PM -0500, Rob Landley wrote: > > I don't know of a server anywhere that can afford an unscheduled > > extra four hours of downtime due to the system deciding to fsck > > itself, and I don't know a Linux laptop user anywhere who would be > > happy to fire up their laptop and suddenly be told "oh, you can't do > > anything with it for two hours, and you can't power it down either". > > So what I recommend for server class machines is to either turn off > the automatic fsck's (it's the default, but it's documented and there > are supported ways of turning it off --- that's hardly developers > "ramming" it down user's throats), or more preferably, to use LVM, and > then use a snapshot and running fsck on the snapshot. Turning off the automatic fsck is what I see people do, yes. My point is that if you don't force the thing to run memtest86 overnight every 20 boots, forcing it to run fsck seems a bit silly. > > I'm all for btrfs coming along and being able to fsck itself behind > > my back where I don't have to care about it. (Although I want to > > tell it _not_ to do that when on battery power.) > > You can do this with ext3/ext4 today, now. Just take a look at > e2croncheck in the contrib directory of e2fsprogs. Changing it to not > do this when on battery power is a trivial exercise. 
> > > My laptop power fails all the time, due to battery exhaustion. Back > > under KDE it was decent about suspending when it was ran low on > > power, but ever since KDE 4 came out and I had to switch to XFCE, > > it's using the gnome infrastructure, which collects funky statistics > > and heuristics but can never quite save them to disk because > > suddenly running out of power when it thinks it's got 20 minutes > > left doesn't give it the opportunity to save its database. So it'll > > never auto-suspend, just suddenly die if I don't hit the button. > > Hmm, why are you running on battery so often? Personal working style? When I was in Pittsburgh, I used the laptop on the bus to and from work every day. Here in Austin, my laundromat has free wifi. It also gets usable free wifi from the coffee shop to the right, the japanese restaurant to the left, and the ice cream shop across the street. (And when I'm not in a wifi area, my cell phone can bluetooth associate to give me net access too.) I like coffee shops. (Of course the fact that if I try to work from home I have to fight off the affections of four cats might have something to do with it too...) > I make a point of > running connected to the AC mains whenever possible, because a LiOn > battery only has about 200 full-cycle charge/discharges in it, and > given the cost of LiOn batteries, basically each charge/discharge > cycle costs a dollar each. Actually the battery's about $50, so that would be 25 cents each. My laptop is on its third battery. It's also on its third hard drive. > So I only run on batteries when I > absolutely have to, and in practice it's rare that I dip below 30% or > so. Actually I find the suckers die just as quickly from simply being plugged in and kept hot by the electronics, and never used so they're pegged at 100% with slight trickle current beyond that constantly overcharging them. 
> > As a result of one of these, two large media files in my "anime" > > subdirectory are not only crosslinked, but the common sector they > > share is bad. (It ran out of power in the act of writing that > > sector. I left it copying large files to the drive and forgot to > > plug it in, and it did the loud click emergency park and power down > > thing when the hardware voltage regulator tripped.) > > So e2fsck would fix the cross-linking. We do need to have some better > tools to do forced rewrite of sectors that have gone bad in a HDD. It > can be done by using badblocks -n, but translating the sector number > emitted by the device driver (which for some drivers is relative to > the beginning of the partition, and for others is relative to the > beginning of the disk). It is possible to run badblocks -w on the > whole disk, of course, but it's better to just run it on the specific > block in question. The point I was trying to make is that running "preemptive" fsck is imposing a significant burden on users in an attempt to find purely theoretical problems, with the expectation that a given run will _not_ find them. I've had systems taken out by actual hardware issues often enough that keeping good backups and being prepared to lose the entire laptop at any time is just common sense. I knocked my laptop into the bathtub last month. Luckily there wasn't any water in the thing at the time, but it made a very loud bang when it hit, and it was on at the time. (Checked dmesg several times over the next few days and it didn't start spitting errors at me, so that's something...) > > I'm much more comfortable living with this until I can get a new laptop > > than with the idea of running fsck on the system and letting it do who > > knows what it response to something that is not actually a problem. > > Well, it actually is a problem. And there may be other problems > hiding that you're not aware of. 
> Running "badblocks -b 4096 -n" may > discover other blocks that have failed, and you can then decide > whether you want to let fsck fix things up. If you don't, though, > it's probably not fair to blame ext3 or e2fsck for any future > failures (not that it's likely to stop you :-). I'm not blaming ext2. I'm saying I've spilled sodas into my working machines on so many occasions over the years I've lost _track_. (The vast majority of 'em survived, actually.) Random example of current cascading badness: The latch sensor on my laptop is no longer debounced. That happened when I upgraded to Ubuntu 9.04 but I'm not sure how that _can_ screw that up, you'd think the bios would be in charge of that. So anyway, it now has a nasty habit of waking itself up in the nice insulated pocket in my backpack and then shutting itself down hard five minutes later when the thermal sensors trip (at the bios level I think, not in the OS). So I now regularly suspend to disk instead of to ram because that way it can't spuriously wake itself back up just because it got jostled slightly. Except that when it resumes from disk, the console it suspended in is totally misprogrammed (vertical lines on what it _thinks_ is text mode), and sometimes the chip is so horked I can hear the sucker making a screeching noise. The easy workaround is to ctrl-alt-F1 and suspend from a text console, then ctrl-alt-F7 gets me back to the desktop. But going back to that text console remembers the misprogramming, and I get vertical lines and an audible whine coming from something that isn't a speaker. (Luckily cursor-up and enter works to re-suspend, so I can just sacrifice one console to the suspend bug.) The _fun_ part is that on the last system I had where X11 regularly misprogrammed the video chip so badly I could _hear_ it, said video chip eventually overheated and melted bits of the motherboard. (That was a toshiba laptop. It took out the keyboard controller first, and I used it for a few months with an external keyboard until the whole thing just went one day. The display you get when your video chip finally goes can be pretty impressive. Way prettier than the time I was caught in a thunderstorm and my laptop got soaked and two vertical sections of the display were flickering white while the rest was displaying normally -- that system actually started working again when it dried out...) It just wouldn't be a Linux box to me if I didn't have workarounds for the side effects of my workarounds. Anyway, this is the perspective from which I say that an fsck looking for purely theoretical badness on my otherwise perfect system is not worth two hours of never finding anything wrong. If Ubuntu's little upgrade icon had a "recommend fsck" thing that lights up every 3 months which I could hit some weekend when I was going out anyway, that would be one thing. But "Ah, Ubuntu 9.04 moved DRM from X11 into the kernel and the Intel 945 3D driver is now psychotic and it froze your machine for the second time this week. Since you're rebooting anyway, you won't mind if I add an extra 3 hours to the process"...? That stopped really being a viable assumption some time before hard drives were regularly measured in terabytes. > - Ted Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-09-03 13:34 ` Krzysztof Halasa 2009-09-03 13:50 ` Ric Wheeler @ 2009-09-03 14:35 ` david 1 sibling, 0 replies; 309+ messages in thread From: david @ 2009-09-03 14:35 UTC (permalink / raw) To: Krzysztof Halasa Cc: Ric Wheeler, Christoph Hellwig, Mark Lord, Michael Tokarev, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thu, 3 Sep 2009, Krzysztof Halasa wrote: > Ric Wheeler <rwheeler@redhat.com> writes: > >>>> Just to add some support to this, all of the external RAID arrays that >>>> I know of normally run with write cache disabled on the component >>>> drives. >>> >>> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones? >> >> Which drives various vendors ships changes with specific products. >> Usually, they ship drives that have carefully vetted firmware, etc. >> but they are close to the same drives you buy on the open market. > > But they aren't the same, are they? If they are not, the fact they can > run well with the write-through cache doesn't mean the off-the-shelf > ones can do as well. frequently they are exactly the same drives, with exactly the same firmware. you disable the write caches on the drives themselves, but you add a large write cache (with battery backup) in the raid card/chassis > Are they SATA (or PATA) at all? SCSI etc. are usually different > animals, though there are SCSI and SATA models which differ only in > electronics. it depends on what raid array you use, some use SATA, some use SAS/SCSI David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
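What David describes the array vendors doing -- stock drives, on-drive cache off, battery-backed controller cache on top -- can be reproduced by hand with hdparm. A minimal sketch (the function name is ours; `hdparm -W` works on ATA/SATA drives only):

```shell
# disable_write_cache: turn off a drive's own volatile write cache,
# relying on the (battery-backed) RAID controller cache instead --
# effectively what external-array firmware does to its member drives.
disable_write_cache() {
    dev="$1"
    if [ ! -b "$dev" ]; then
        echo "disable_write_cache: not a block device: $dev" >&2
        return 1
    fi
    hdparm -W0 "$dev" &&   # -W0: switch the volatile write cache off
    hdparm -W "$dev"       # read the setting back to confirm
}

# Example (root, SATA/PATA only): disable_write_cache /dev/sda
```

Note the setting is not always persistent across power cycles, so it typically belongs in a boot script.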
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 13:16 ` Christoph Hellwig 2009-08-31 13:19 ` Mark Lord @ 2009-08-31 13:22 ` Ric Wheeler 2009-08-31 15:50 ` david 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-31 13:22 UTC (permalink / raw) To: Christoph Hellwig Cc: Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/31/2009 09:16 AM, Christoph Hellwig wrote: > On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote: >>> While most common filesystem do have barrier support it is: >>> >>> - not actually enabled for the two most common filesystems >>> - the support for write barriers an cache flushing tends to be buggy >>> all over our software stack, >>> >> >> Or just missing - I think that MD5/6 simply drop the requests at present. >> >> I wonder if it would be worth having MD probe for write cache enabled& >> warn if barriers are not supported? > > In my opinion even that is too weak. We know how to control the cache > settings on all common disks (that is scsi and ata), so we should always > disable the write cache unless we know that the whole stack (filesystem, > raid, volume managers) supports barriers. And even then we should make > sure the filesystems does actually use barriers everywhere that's needed > which failed at for years. > I was thinking about that as well. Having us disable the write cache when we know it is not supported (like in the MD5 case) would certainly be *much* safer for almost everyone. We would need to have a way to override the write cache disabling for people who either know that they have a non-volatile write cache (unlikely as it would probably be to put MD5 on top of a hardware RAID/external array, but some of the new SSD's claim to have non-volatile write cache). 
It would also be very useful to have all of our top tier file systems enable barriers by default, provide consistent barrier on/off mount options and log a nice warning when not enabled.... ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 13:22 ` Ric Wheeler @ 2009-08-31 15:50 ` david 2009-08-31 16:21 ` Ric Wheeler 2009-08-31 18:31 ` Christoph Hellwig 0 siblings, 2 replies; 309+ messages in thread From: david @ 2009-08-31 15:50 UTC (permalink / raw) To: Ric Wheeler Cc: Christoph Hellwig, Michael Tokarev, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Mon, 31 Aug 2009, Ric Wheeler wrote: > On 08/31/2009 09:16 AM, Christoph Hellwig wrote: >> On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote: >>>> While most common filesystem do have barrier support it is: >>>> >>>> - not actually enabled for the two most common filesystems >>>> - the support for write barriers an cache flushing tends to be buggy >>>> all over our software stack, >>>> >>> >>> Or just missing - I think that MD5/6 simply drop the requests at present. >>> >>> I wonder if it would be worth having MD probe for write cache enabled& >>> warn if barriers are not supported? >> >> In my opinion even that is too weak. We know how to control the cache >> settings on all common disks (that is scsi and ata), so we should always >> disable the write cache unless we know that the whole stack (filesystem, >> raid, volume managers) supports barriers. And even then we should make >> sure the filesystems does actually use barriers everywhere that's needed >> which failed at for years. >> > > I was thinking about that as well. Having us disable the write cache when we > know it is not supported (like in the MD5 case) would certainly be *much* > safer for almost everyone. 
> > We would need to have a way to override the write cache disabling for people > who either know that they have a non-volatile write cache (unlikely as it > would probably be to put MD5 on top of a hardware RAID/external array, but > some of the new SSD's claim to have non-volatile write cache). I've done this when the hardware raid only supported raid 5 but I wanted raid 6. I've also done it when I had enough disks to need more than one hardware raid card to talk to them all, but wanted one logical drive for the system. > It would also be very useful to have all of our top tier file systems enable > barriers by default, provide consistent barrier on/off mount options and log > a nice warning when not enabled.... most people are not willing to live with unbuffered write performance. they care about their data, but they also care about performance, and since performance is what they see on an ongoing basis, they tend to care more about performance. given that we don't even have barriers enabled by default on ext3 due to the performance hit, what makes you think that disabling buffers entirely is going to be acceptable to people? David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 15:50 ` david @ 2009-08-31 16:21 ` Ric Wheeler 2009-08-31 18:31 ` Christoph Hellwig 1 sibling, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-31 16:21 UTC (permalink / raw) To: david Cc: Christoph Hellwig, Michael Tokarev, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/31/2009 11:50 AM, david@lang.hm wrote: > On Mon, 31 Aug 2009, Ric Wheeler wrote: > >> On 08/31/2009 09:16 AM, Christoph Hellwig wrote: >>> On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote: >>>>> While most common filesystem do have barrier support it is: >>>>> >>>>> - not actually enabled for the two most common filesystems >>>>> - the support for write barriers an cache flushing tends to be buggy >>>>> all over our software stack, >>>>> >>>> >>>> Or just missing - I think that MD5/6 simply drop the requests at >>>> present. >>>> >>>> I wonder if it would be worth having MD probe for write cache enabled& >>>> warn if barriers are not supported? >>> >>> In my opinion even that is too weak. We know how to control the cache >>> settings on all common disks (that is scsi and ata), so we should always >>> disable the write cache unless we know that the whole stack (filesystem, >>> raid, volume managers) supports barriers. And even then we should make >>> sure the filesystems does actually use barriers everywhere that's needed >>> which failed at for years. >>> >> >> I was thinking about that as well. Having us disable the write cache >> when we know it is not supported (like in the MD5 case) would >> certainly be *much* safer for almost everyone. 
>> >> We would need to have a way to override the write cache disabling for >> people who either know that they have a non-volatile write cache >> (unlikely as it would probably be to put MD5 on top of a hardware >> RAID/external array, but some of the new SSD's claim to have >> non-volatile write cache). > > I've done this when the hardware raid only suppored raid 5 but I wanted > raid 6. I've also done it when I had enough disks to need more than one > hardware raid card to talk to them all, but wanted one logical drive for > the system. > >> It would also be very useful to have all of our top tier file systems >> enable barriers by default, provide consistent barrier on/off mount >> options and log a nice warning when not enabled.... > > most people are not willing to live with unbuffered write performance. > they care about their data, but they also care about performance, and > since performance is what they see on an ongong basis, they tend to care > more about performance. > > given that we don't even have barriers enabled by default on ext3 due to > the performance hit, what makes you think that disabling buffers > entirely is going to be acceptable to people? > > David Lang We do (and have for a number of years) enable barriers by default for XFS and reiserfs. In SLES, ext3 has default barriers as well. Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 15:50 ` david 2009-08-31 16:21 ` Ric Wheeler @ 2009-08-31 18:31 ` Christoph Hellwig 2009-08-31 19:11 ` david 1 sibling, 1 reply; 309+ messages in thread From: Christoph Hellwig @ 2009-08-31 18:31 UTC (permalink / raw) To: david Cc: Ric Wheeler, Christoph Hellwig, Michael Tokarev, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Mon, Aug 31, 2009 at 08:50:53AM -0700, david@lang.hm wrote: >> It would also be very useful to have all of our top tier file systems >> enable barriers by default, provide consistent barrier on/off mount >> options and log a nice warning when not enabled.... > > most people are not willing to live with unbuffered write performance. I'm not sure what you mean by unbuffered write performance; the only common use of that term is for userspace I/O using the read/write system calls directly, in comparison to buffered I/O which uses the stdio library. But rest assured that the use of barriers and cache flushes in fsync does not completely disable caching (or "buffering"); it just flushes the disk write cache when we either commit a log buffer that needs to be on disk, or perform an fsync where we really do want to have data on disk instead of lying to the application about the status of the I/O completion. Which btw could be interpreted as a violation of the POSIX rules. ^ permalink raw reply [flat|nested] 309+ messages in thread
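The distinction Christoph draws -- caching stays enabled for everything else; fsync just pushes one file's data through to stable storage -- can be seen from the shell. A sketch (the helper name is ours; GNU dd is assumed, whose `conv=fsync` flag triggers the fsync(2) call):

```shell
# durable_write: write data to a file and fsync it before returning.
# System-wide caching/buffering is untouched; only this file's data
# (and, with working barriers, the disk's write cache) is forced out
# before dd exits, so success means the data should survive a crash.
durable_write() {
    file="$1"; data="$2"
    printf '%s' "$data" | dd of="$file" conv=fsync status=none
}

durable_write /tmp/demo.txt "committed"  # returns only after fsync(2) completes
```

On a stack that drops barriers (e.g. md raid5 at the time of this thread), the fsync still returns success but the data may only have reached the drive's volatile cache -- which is exactly the problem under discussion.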
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 18:31 ` Christoph Hellwig @ 2009-08-31 19:11 ` david 0 siblings, 0 replies; 309+ messages in thread From: david @ 2009-08-31 19:11 UTC (permalink / raw) To: Christoph Hellwig Cc: Ric Wheeler, Michael Tokarev, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Mon, 31 Aug 2009, Christoph Hellwig wrote: > On Mon, Aug 31, 2009 at 08:50:53AM -0700, david@lang.hm wrote: >>> It would also be very useful to have all of our top tier file systems >>> enable barriers by default, provide consistent barrier on/off mount >>> options and log a nice warning when not enabled.... >> >> most people are not willing to live with unbuffered write performance. > > I'm not sure what you mean with unbuffered write support, the only > common use of that term is for userspace I/O using the read/write > sysctem calls directly in comparism to buffered I/O which uses > the stdio library. > > But be ensure that the use of barriers and cache flushes in fsync does not > completely disable caching (or "buffering"), it just does flush flushes > the disk write cache in case we either commit a log buffer than need to > be on disk, or performan an fsync where we really do want to have data > on disk instead of lying to the application about the status of the > I/O completion. Which btw could be interpreted as a violation of the > Posix rules. as I understood it, the proposal that I responded to was to change the kernel to detect if barriers are enabled for the entire stack or not, and if not disable the write caches on the drives. there are definitely times when that is the correct thing to do, but I am not sure that it is the correct thing to do by default. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 12:55 ` david 2009-08-30 14:12 ` Ric Wheeler @ 2009-08-30 15:05 ` Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-30 15:05 UTC (permalink / raw) To: david Cc: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sun 2009-08-30 05:55:01, david@lang.hm wrote: > On Sun, 30 Aug 2009, Pavel Machek wrote: > >>>> From: Theodore Tso <tytso@mit.edu> >>>> >>> To use your ABS brakes analogy, just becase it's not safe to rely on >>> ABS brakes if the "check brakes" light is on, that doesn't justify >>> writing something alarmist which claims that ABS brakes don't work >>> 100% of the time, don't use ABS brakes, they're broken!!!! >> >> If it only was this simple. We don't have 'check brakes' (aka >> 'journalling ineffective') warning light. If we had that, I would not >> have problem. >> >> It is rather that your ABS brakes are ineffective if 'check engine' >> (RAID degraded) is lit. And yes, running with 'check engine' for >> extended periods may be bad idea, but I know people that do >> that... and I still hope their brakes work (and believe they should >> have won suit for damages should their ABS brakes fail). > > the 'RAID degraded' warning says that _anything_ you put on that block > device is at risk. it doesn't matter if you are using a filesystem with a > journal, one without, or using the raw device directly. If you are using one with journal, you'll still need to run fsck at boot time, to make sure metadata is still consistent... Protection provided by journaling is not effective in this configuration. (You have the point that pretty much all users of the blockdevice will be affected by powerfail degraded mode.) 
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
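Pavel's point is that after a powerfail in degraded mode the journal replay can claim the filesystem is clean while metadata is in fact damaged, so only a forced full check helps. A sketch of that check (the helper name is ours; `e2fsck` is from e2fsprogs):

```shell
# force_check: run a full e2fsck even though the superblock and journal
# say "clean". -f forces the check despite a valid journal; -n opens
# the filesystem read-only and answers "no" to every repair prompt,
# so it is safe to run first just to see whether the degraded-mode
# powerfail left damage behind.
force_check() {
    dev="$1"
    if [ ! -b "$dev" ]; then
        echo "force_check: no such block device: $dev" >&2
        return 1
    fi
    e2fsck -f -n "$dev"
}

# Example (device should be unmounted): force_check /dev/md0
```

A nonzero exit status from e2fsck would confirm that journaling alone did not keep the metadata consistent.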
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 7:51 ` Pavel Machek 2009-08-30 9:01 ` Christian Kujau 2009-08-30 12:55 ` david @ 2009-08-30 15:20 ` Theodore Tso 2009-08-31 17:49 ` Jesse Brandeburg ` (3 more replies) 2 siblings, 4 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-30 15:20 UTC (permalink / raw) To: Pavel Machek Cc: NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sun, Aug 30, 2009 at 09:51:35AM +0200, Pavel Machek wrote: > > If it only was this simple. We don't have 'check brakes' (aka > 'journalling ineffective') warning light. If we had that, I would not > have problem. But we do; competently designed (and in the case of software RAID, competently packaged) RAID subsystems send notifications to the system administrator when there is a hard drive failure. Some hardware RAID systems will send a page to the system administrator. A mid-range Areca card has a separate ethernet port so it can send e-mail to the administrator, even if the OS is hosed for some reason. And it's not a matter of journalling ineffective; the much bigger deal is, "your data is at risk"; perhaps because the file system metadata may become subject to corruption, but more critically, because the file data may become subject to corruption. Metadata becoming subject to corruption is important primarily because it leads to data becoming corrupted; metadata is the tail; the user's data is the dog. So we *do* have the warning light; the problem is that just as some people may not realize that "check brakes" means, "YOU COULD DIE", some people may not realize that "hard drive failure; RAID array degraded" could mean, "YOU COULD LOSE DATA". 
Fortunately, for software RAID, this is easily solved; if you are so concerned, why don't you submit a patch to mdadm adjusting the e-mail sent to the system administrator when the array is in a degraded state, such that it states, "YOU COULD LOSE DATA". I would gently suggest to you this would be ***far*** more effective than a patch to kernel documentation. - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
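For software RAID, the "warning light" Ted describes is mdadm's monitor mode. A sketch of the wiring (the address and hook path are placeholders; `MAILADDR` and `PROGRAM` are the actual mdadm.conf directives the monitor consults):

```shell
# /etc/mdadm/mdadm.conf fragment -- where the degraded-array mail
# (the place to add "YOU COULD LOSE DATA") gets delivered:
#
#   MAILADDR admin@example.org              # placeholder address
#   PROGRAM  /usr/local/sbin/raid-alert     # optional per-event hook (placeholder path)
#
# Run the monitor as a daemon; it mails MAILADDR on DegradedArray,
# Fail and similar events:
#
#   mdadm --monitor --scan --daemonise
#
# One-shot check that the mail path actually works:
#
#   mdadm --monitor --scan --oneshot --test
```

Most distributions already start the monitor from an init script, so in practice only the MAILADDR line (and the wording of the generated mail, which is what Ted proposes patching) needs attention.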
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 15:20 ` Theodore Tso @ 2009-08-31 17:49 ` Jesse Brandeburg 2009-08-31 18:01 ` Ric Wheeler 2009-08-31 18:07 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) martin f krafft 2009-08-31 17:49 ` Jesse Brandeburg ` (2 subsequent siblings) 3 siblings, 2 replies; 309+ messages in thread From: Jesse Brandeburg @ 2009-08-31 17:49 UTC (permalink / raw) To: Theodore Tso, Pavel Machek, NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sun, Aug 30, 2009 at 8:20 AM, Theodore Tso<tytso@mit.edu> wrote: > So we *do* have the warning light; the problem is that just as some > people may not realize that "check brakes" means, "YOU COULD DIE", > some people may not realize that "hard drive failure; RAID array > degraded" could mean, "YOU COULD LOSE DATA". > > Fortunately, for software RAID, this is easily solved; if you are so > concerned, why don't you submit a patch to mdadm adjusting the e-mail > sent to the system administrator when the array is in a degraded > state, such that it states, "YOU COULD LOSE DATA". I would gently > suggest to you this would be ***far*** more effective that a patch to > kernel documentation. In the case of a degraded array, could the kernel be more proactive (or maybe even mdadm) and have the filesystem remount itself withOUT journalling enabled? This seems on the surface to be possible, but I don't know the internal particulars that might prevent/allow it. ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 17:49 ` Jesse Brandeburg @ 2009-08-31 18:01 ` Ric Wheeler 2009-08-31 21:01 ` MD5/6? (was Re: raid is dangerous but that's secret ...) Ron Johnson 2009-08-31 18:07 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) martin f krafft 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-31 18:01 UTC (permalink / raw) To: Jesse Brandeburg Cc: Theodore Tso, Pavel Machek, NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/31/2009 01:49 PM, Jesse Brandeburg wrote: > On Sun, Aug 30, 2009 at 8:20 AM, Theodore Tso<tytso@mit.edu> wrote: >> So we *do* have the warning light; the problem is that just as some >> people may not realize that "check brakes" means, "YOU COULD DIE", >> some people may not realize that "hard drive failure; RAID array >> degraded" could mean, "YOU COULD LOSE DATA". >> >> Fortunately, for software RAID, this is easily solved; if you are so >> concerned, why don't you submit a patch to mdadm adjusting the e-mail >> sent to the system administrator when the array is in a degraded >> state, such that it states, "YOU COULD LOSE DATA". I would gently >> suggest to you this would be ***far*** more effective that a patch to >> kernel documentation. > > In the case of a degraded array, could the kernel be more proactive > (or maybe even mdadm) and have the filesystem remount itself withOUT > journalling enabled? This seems on the surface to be possible, but I > don't know the internal particulars that might prevent/allow it. This is a misconception - with or without journalling, you are open to a second failure during a RAID rebuild. Also note that by default, ext3 does not mount with barriers turned on. 
Even if you mount with barriers, MD5 does not handle barriers, so you stand to lose a lot of data if you have a power outage. Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* MD5/6? (was Re: raid is dangerous but that's secret ...) 2009-08-31 18:01 ` Ric Wheeler @ 2009-08-31 21:01 ` Ron Johnson 0 siblings, 0 replies; 309+ messages in thread From: Ron Johnson @ 2009-08-31 21:01 UTC (permalink / raw) To: Linux-Ext4 On 2009-08-31 13:01, Ric Wheeler wrote: [snip] > > Even if you mount with barriers, MD5 does not handle barriers, so you > stand to lose a lot of data if you have a power outage. Pardon me for asking for such a seemingly obvious question, but what (besides "Message-Digest algorithm 5") is MD5? (I've always seen "multiple drive" written in the lower case "md".) -- Brawndo's got what plants crave. It's got electrolytes! ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 17:49 ` Jesse Brandeburg 2009-08-31 18:01 ` Ric Wheeler @ 2009-08-31 18:07 ` martin f krafft 2009-08-31 22:26 ` Jesse Brandeburg 1 sibling, 1 reply; 309+ messages in thread From: martin f krafft @ 2009-08-31 18:07 UTC (permalink / raw) To: Jesse Brandeburg Cc: Theodore Tso, Pavel Machek, NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet [-- Attachment #1: Type: text/plain, Size: 939 bytes --] also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.08.31.1949 +0200]: > In the case of a degraded array, could the kernel be more > proactive (or maybe even mdadm) and have the filesystem remount > itself withOUT journalling enabled? This seems on the surface to > be possible, but I don't know the internal particulars that might > prevent/allow it. Why would I want to disable the filesystem journal in that case? -- .''`. martin f. krafft <madduck@d.o> Related projects: : :' : proud Debian developer http://debiansystem.info `. `'` http://people.debian.org/~madduck http://vcs-pkg.org `- Debian - when you have better things to do than fixing systems "i can stand brute force, but brute reason is quite unbearable. there is something unfair about its use. it is hitting below the intellect." -- oscar wilde [-- Attachment #2: Digital signature (see http://martin-krafft.net/gpg/) --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 18:07 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) martin f krafft @ 2009-08-31 22:26 ` Jesse Brandeburg 0 siblings, 0 replies; 309+ messages in thread From: Jesse Brandeburg @ 2009-08-31 22:26 UTC (permalink / raw) To: Jesse Brandeburg, Theodore Tso, Pavel Machek, NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Mon, Aug 31, 2009 at 11:07 AM, martin f krafft<madduck@debian.org> wrote: > also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.08.31.1949 +0200]: >> In the case of a degraded array, could the kernel be more >> proactive (or maybe even mdadm) and have the filesystem remount >> itself withOUT journalling enabled? This seems on the surface to >> be possible, but I don't know the internal particulars that might >> prevent/allow it. > > Why would I want to disable the filesystem journal in that case? I misspoke w.r.t journalling, the idea I was trying to get across was to remount with -o sync while running on a degraded array, but given some of the other comments in this thread I'm not even sure that would help. the idea was to make writes as safe as possible (at the cost of speed) when running on a degraded array, and to have the transition be as hands-free as possible, just have the kernel (or mdadm) by default remount. ^ permalink raw reply [flat|nested] 309+ messages in thread
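Jesse's revised idea -- drop to synchronous writes while the array is degraded -- would look roughly like this as a helper (the function name is ours; in practice it would hang off mdadm's per-event PROGRAM hook, and as the follow-ups note it is not clear it actually helps):

```shell
# remount_sync: remount a filesystem with -o sync so writes are pushed
# out immediately, trading throughput for a smaller window of dirty
# data while the backing array is degraded.
remount_sync() {
    mnt="$1"
    if ! mountpoint -q "$mnt"; then
        echo "remount_sync: not a mountpoint: $mnt" >&2
        return 1
    fi
    mount -o remount,sync "$mnt"
}

# Example (root): remount_sync /home
```

As Martin points out in the follow-up, sync only controls when data leaves the page cache; it says nothing about the drive's own write cache, which is the layer the barrier discussion is about.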
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 22:26 ` Jesse Brandeburg (?) @ 2009-08-31 23:19 ` Ron Johnson -1 siblings, 0 replies; 309+ messages in thread From: Ron Johnson @ 2009-08-31 23:19 UTC (permalink / raw) To: Jesse Brandeburg; +Cc: Theodore Tso, Ric Wheeler, Linux-Ext4 On 2009-08-31 17:26, Jesse Brandeburg wrote: > On Mon, Aug 31, 2009 at 11:07 AM, martin f krafft<madduck@debian.org> wrote: >> also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.08.31.1949 +0200]: >>> In the case of a degraded array, could the kernel be more >>> proactive (or maybe even mdadm) and have the filesystem remount >>> itself withOUT journalling enabled? This seems on the surface to >>> be possible, but I don't know the internal particulars that might >>> prevent/allow it. >> Why would I want to disable the filesystem journal in that case? > > I misspoke w.r.t journalling, the idea I was trying to get across was > to remount with -o sync while running on a degraded array, but given > some of the other comments in this thread I'm not even sure that would > help. the idea was to make writes as safe as possible (at the cost of > speed) when running on a degraded array, and to have the transition be > as hands-free as possible, just have the kernel (or mdadm) by default > remount. Much better, I'd think, to "just" have it scream out DANGER!! WILL ROBINSON!! DANGER!! to syslog and to an email hook. -- Brawndo's got what plants crave. It's got electrolytes! ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-31 22:26 ` Jesse Brandeburg (?) (?) @ 2009-09-01 5:45 ` martin f krafft -1 siblings, 0 replies; 309+ messages in thread From: martin f krafft @ 2009-09-01 5:45 UTC (permalink / raw) To: Jesse Brandeburg Cc: Theodore Tso, Pavel Machek, NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet [-- Attachment #1: Type: text/plain, Size: 1242 bytes --] also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.09.01.0026 +0200]: > I misspoke w.r.t journalling, the idea I was trying to get across > was to remount with -o sync while running on a degraded array, but > given some of the other comments in this thread I'm not even sure > that would help. the idea was to make writes as safe as possible > (at the cost of speed) when running on a degraded array, and to > have the transition be as hands-free as possible, just have the > kernel (or mdadm) by default remount. I don't see how that is any more necessary with a degraded array than it is when you have a fully working array. Sync just ensures that the data are written and not cached, but that has absolutely nothing to do with the underlying storage. Or am I failing to see the link? -- .''`. martin f. krafft <madduck@d.o> Related projects: : :' : proud Debian developer http://debiansystem.info `. `'` http://people.debian.org/~madduck http://vcs-pkg.org `- Debian - when you have better things to do than fixing systems "how do you feel about women's rights?" "i like either side of them." -- groucho marx [-- Attachment #2: Digital signature (see http://martin-krafft.net/gpg/) --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-30 15:20 ` Theodore Tso 2009-08-31 17:49 ` Jesse Brandeburg 2009-08-31 17:49 ` Jesse Brandeburg @ 2009-09-05 10:34 ` Pavel Machek 2009-09-05 10:34 ` Pavel Machek 3 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-09-05 10:34 UTC (permalink / raw) To: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow Hi! > > If it only was this simple. We don't have 'check brakes' (aka > > 'journalling ineffective') warning light. If we had that, I would not > > have problem. > > But we do; competently designed (and in the case of software RAID, > competently packaged) RAID subsystems send notifications to the system > administrator when there is a hard drive failure. Some hardware RAID > systems will send a page to the system administrator. A mid-range > Areca card has a separate ethernet port so it can send e-mail to the > administrator, even if the OS is hosed for some reason. Well, my MMC/uSD cards do not have ethernet ports to remind me that they are unreliable :-(. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) 2009-08-28 12:08 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Theodore Tso 2009-08-30 7:51 ` Pavel Machek @ 2009-08-30 7:51 ` Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-30 7:51 UTC (permalink / raw) To: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow Hi! > > From: Theodore Tso <tytso@mit.edu> > > > > Document that many devices are too broken for filesystems to protect > > data in case of powerfail. > > > > Signed-of-by: Pavel Machek <pavel@ucw.cz> > > NACK. I didn't write this patch, and it's disingenuous for you to try > to claim that I authored it. Well, you did write original text, so I wanted to give you credit. Sorry. > While Neil Brown's corrections are minimally necessary so the text is > at least technically *correct*, it's still not the right advice to > give system administrators. It's better than the fear-mongering > patches you had proposed earlier, but what would be better *still* is > telling people why running with degraded RAID arrays is bad, and to > give them further tips about how to use RAID arrays safely. Maybe this belongs to Doc*/filesystems, and more detailed RAID description should go to md description? > To use your ABS brakes analogy, just becase it's not safe to rely on > ABS brakes if the "check brakes" light is on, that doesn't justify > writing something alarmist which claims that ABS brakes don't work > 100% of the time, don't use ABS brakes, they're broken!!!! If it only was this simple. We don't have 'check brakes' (aka 'journalling ineffective') warning light. If we had that, I would not have problem. It is rather that your ABS brakes are ineffective if 'check engine' (RAID degraded) is lit. 
And yes, running with 'check engine' for extended periods may be a bad idea, but I know people who do that... and I still hope their brakes work (and believe they should have won a suit for damages should their ABS brakes fail). > That's just silly. What we should be telling people instead is (a) > pay attention to the check brakes light (just as you should pay > attention to the RAID array is degraded warning), and (b) while ABS 'your RAID array is degraded' is a very counterintuitive way to say '...and btw your journalling is no longer effective, either'. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:26 ` Pavel Machek 2009-08-25 23:40 ` Ric Wheeler @ 2009-08-25 23:46 ` david 1 sibling, 0 replies; 309+ messages in thread From: david @ 2009-08-25 23:46 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Pavel Machek wrote: >>>> Basically, any file system (Linux, windows, OSX, etc) that writes into >>>> the page cache will lose data when you hot unplug its storage. End of >>>> story, don't do it! >>> >>> No, not ext3 on SATA disk with barriers on and proper use of >>> fsync(). I actually tested that. >>> >>> Yes, I should be able to hotunplug SATA drives and expect the data >>> that was fsync-ed to be there. >> >> You can and will lose data (even after fsync) with any type of storage at >> some rate. What you are missing here is that data loss needs to be >> measured in hard numbers - say percentage of installed boxes that have >> config X that lose data. > > I'm talking "by design" here. > > I will lose data even on SATA drive that is properly powered on if I > wait 5 years. > >> I can promise you that hot unplugging and replugging a S-ATA drive will >> also lose you data if you are actively writing to it (ext2, 3, whatever). > > I can promise you that running S-ATA drive will also lose you data, > even if you are not actively writing to it. Just wait 10 years; so > what is your point? > > But ext3 is _designed_ to preserve fsynced data on SATA drive, while > it is _not_ designed to preserve fsynced data on MD RAID5. substitute 'degraded MD RAID 5' for 'MD RAID 5' and you have a point here. although the language you are using is pretty harsh. you make it sound like this is a problem with ext3 when the filesystem has nothing to do with it. 
the problem is that a degraded raid 5 array can be corrupted by an additional failure. > Do you really think that's not a difference? > >>>>>> I don't object to making that general statement - "Don't hot unplug a >>>>>> device with an active file system or actively used raw device" - but >>>>>> would object to the overly general statement about ext3 not working on >>>>>> flash, RAID5 not working, etc... >>>>> >>>>> You can object any way you want, but running ext3 on flash or MD RAID5 >>>>> is stupid: >>>>> >>>>> * ext2 would be faster >>>>> >>>>> * ext2 would provide better protection against powerfail. >>>> >>>> Not true in the slightest, you continue to ignore the ext2/3/4 developers >>>> telling you that it will lose data. >>> >>> I know I will lose data. Both ext2 and ext3 will lose data on >>> flashdisk. (That's what I'm trying to document). But... what is the >>> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least >>> protects you against kernel panic. MD RAID5 is in software, so... that >>> additional protection is just not there). >> >> Faster recovery time on any normal kernel crash or power outage. Data >> loss would be equivalent with or without the journal. > > No, because you'll actually repair the ext2 with fsck after the kernel > crash or power outage. Data loss will not be equivalent; in particular > you'll not lose data written _after_ power outage to ext2. by the way, while you are thinking about failures that can happen from a failed write corrupting additional blocks, think about the nightmare that can happen if those blocks are in the journal. the 'repair' of ext2 by a fsck is actually much less than you are thinking that it is. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 21:15 ` Pavel Machek 2009-08-25 22:42 ` Ric Wheeler @ 2009-08-25 23:08 ` Neil Brown 2009-08-25 23:44 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: Neil Brown @ 2009-08-25 23:08 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tuesday August 25, pavel@ucw.cz wrote: > > You can object any way you want, but running ext3 on flash or MD RAID5 > is stupid: > > * ext2 would be faster > > * ext2 would provide better protection against powerfail. > > "ext3 works on flash and MD RAID5, as long as you do not have > powerfail" seems to be the accurate statement, and if you don't need > to protect against powerfails, you can just use ext2. > Pavel You are over generalising. MD/RAID5 is only less than perfect if it is degraded. If all devices are present before the power failure and after the power failure, then there is no risk. RAID5 only promises to protect against a single failure. Power loss plus device loss equals multiple failure. And then there is the comment Ted made about probabilities. While you can get data corruption if a RAID5 comes back degraded after a power fail, I believe it is a lot less likely than the metadata being inconsistent on an ext2 after a power fail. So ext3 is still a good choice (especially if you put your journal on a separate device). While I think it is, in principle, worth documenting this sort of thing, there are an awful lot of fine details and distinctions that would need to be considered. NeilBrown ^ permalink raw reply [flat|nested] 309+ messages in thread
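[Editor's illustration] Neil's point, that a power loss during a stripe update is itself one of the failures RAID5 can absorb only once, can be shown with a toy XOR-parity stripe (single-byte "blocks", purely illustrative):

```python
"""XOR-parity sketch of the point above: RAID5 tolerates exactly one
failure, and an interrupted stripe update already 'spends' it.  One
stripe of a 4+1 array, with single-byte blocks for illustration."""
from functools import reduce

def parity(blocks):
    """XOR of the blocks: how RAID5 derives parity and rebuilds data."""
    return reduce(lambda a, b: a ^ b, blocks)

data = [0x11, 0x22, 0x33, 0x44]   # one stripe across 4 data disks
p = parity(data)                  # parity disk

# One failure: disk 2 vanishes, and parity rebuilds it correctly.
assert parity([data[0], data[1], data[3], p]) == data[2]

# Power fails after the new data[0] hits disk 0 but before parity is
# rewritten: the array is now 'dirty' (parity describes the old stripe).
data[0] = 0x99
# A second failure on top of that -- disk 2 is missing -- and the rebuild
# silently returns garbage for a block nobody was even writing:
assert parity([data[0], data[1], data[3], p]) != 0x33
```

The corrupted block here (data[2]) was not being written at all, which is why neither the filesystem nor its journal gets any error to react to.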
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:08 ` Neil Brown @ 2009-08-25 23:44 ` Pavel Machek 2009-08-26 4:08 ` Rik van Riel 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-25 23:44 UTC (permalink / raw) To: Neil Brown Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet > While I think it is, in principle, worth documenting this sort of > thing, there are an awful lot of fine details and distinctions that > would need to be considered. Ok, can you help? Having a piece of MD documentation explaining the "powerfail nukes entire stripe" and how current filesystems do not deal with that would be nice, along with description when exactly that happens. It seems to need two events -- one failed disk and one powerfail. I knew that raid5 only protects against one failure, but I never realized that simple powerfail (or kernel crash) counts as a failure here, too. I guess it should go at the end of md.txt.... aha, it actually already talks about the issue a bit, in: #Boot time assembly of degraded/dirty arrays #------------------------------------------- # #If a raid5 or raid6 array is both dirty and degraded, it could have #undetectable data corruption. This is because the fact that it is #'dirty' means that the parity cannot be trusted, and the fact that it #is degraded means that some datablocks are missing and cannot reliably #be reconstructed (due to no parity). (Actually... that's possibly what happened to a friend of mine. One of the disks in the raid5 stopped responding and the whole system just hung. Oops, two failures in one...) Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
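[Editor's illustration] The degraded state md.txt warns about is visible in /proc/mdstat as a [total/working] count with working < total. A small illustrative parser; the sample output is typical of a degraded 5-disk RAID5, not any specific kernel version:

```python
"""Sketch: spot the degraded arrays md.txt warns about by parsing
/proc/mdstat.  SAMPLE is typical degraded-RAID5 output; the exact
formatting varies between kernel versions."""
import re

SAMPLE = """\
md0 : active raid5 sdb1[0] sdc1[1] sdd1[2] sde1[3]
      3907045376 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
"""

def degraded_arrays(mdstat_text):
    """Yield (array, working, total) for arrays missing members."""
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"(md\d+) : active (\S+)", line)
        if m:
            current = m.group(1)
        m = re.search(r"\[(\d+)/(\d+)\]", line)
        if m and current is not None:
            total, working = int(m.group(1)), int(m.group(2))
            if working < total:
                yield current, working, total

assert list(degraded_arrays(SAMPLE)) == [("md0", 4, 5)]
```

`mdadm --detail /dev/mdX` reports the same thing in its State: line; the point of the thread is that neither is surfaced as loudly as it deserves.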
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:44 ` Pavel Machek @ 2009-08-26 4:08 ` Rik van Riel 2009-08-26 11:15 ` Pavel Machek 0 siblings, 1 reply; 309+ messages in thread From: Rik van Riel @ 2009-08-26 4:08 UTC (permalink / raw) To: Pavel Machek Cc: Neil Brown, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Pavel Machek wrote: > Ok, can you help? Having a piece of MD documentation explaining the > "powerfail nukes entire stripe" and how current filesystems do not > deal with that would be nice, along with description when exactly that > happens. Except of course for the inconvenient detail that a power failure on a degraded RAID 5 array does *NOT* nuke the entire stripe. A 5-disk RAID 5 array will have 4 data blocks and 1 parity block in each stripe. A degraded array will have either 4 data blocks or 3 data blocks and 1 parity block in the stripe. If we are dealing with a parity-less stripe, we cannot lose any data due to RAID 5, because each of the 4 data blocks has a disk block available. We could still lose a data write due to a power failure, but this could also happen with the RAID 5 array still intact. If we are dealing with a 3-data, 1-parity stripe, then 3 of the 4 data blocks have an available disk block and will not be lost (if they make it to disk). The only block whose recovery depends on all 3 data blocks and the parity block being correct is the block that does not currently have a disk to be written to. In short, if a stripe is not written completely on a degraded RAID 5 array, you can lose: 1) the blocks that were not written (duh) 2) the block that doesn't have a disk The first part of this loss is also true in a non-degraded RAID 5 array. The fact that the array is degraded really does not add much additional data loss here and you certainly will not lose the entire stripe like you suggest. 
-- All rights reversed. ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 4:08 ` Rik van Riel @ 2009-08-26 11:15 ` Pavel Machek 2009-08-27 3:29 ` Rik van Riel 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-26 11:15 UTC (permalink / raw) To: Rik van Riel Cc: Neil Brown, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! >> Ok, can you help? Having a piece of MD documentation explaining the >> "powerfail nukes entire stripe" and how current filesystems do not >> deal with that would be nice, along with description when exactly that >> happens. > > Except of course for the inconvenient detail that a power > failure on a degraded RAID 5 array does *NOT* nuke the > entire stripe. Ok, you are right. It will nuke unrelated sector somewhere on the stripe (one that is "old" and was not recently written) -- which is still something ext3 can not reliably handle. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-26 11:15 ` Pavel Machek @ 2009-08-27 3:29 ` Rik van Riel 0 siblings, 0 replies; 309+ messages in thread From: Rik van Riel @ 2009-08-27 3:29 UTC (permalink / raw) To: Pavel Machek Cc: Neil Brown, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Pavel Machek wrote: > Hi! > >>> Ok, can you help? Having a piece of MD documentation explaining the >>> "powerfail nukes entire stripe" and how current filesystems do not >>> deal with that would be nice, along with description when exactly that >>> happens. >> Except of course for the inconvenient detail that a power >> failure on a degraded RAID 5 array does *NOT* nuke the >> entire stripe. > > Ok, you are right. It will nuke unrelated sector somewhere on the > stripe (one that is "old" and was not recently written) -- which is > still something ext3 can not reliably handle. Not quite unrelated. The "nuked" sector will be the one that used to live on the disk that is broken and no longer a part of the RAID 5 array. I wouldn't qualify a missing hard disk as a software issue... -- All rights reversed. ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 9:42 ` Pavel Machek 2009-08-25 13:37 ` Ric Wheeler @ 2009-08-25 16:11 ` Theodore Tso 2009-08-25 22:21 ` Pavel Machek ` (2 more replies) 1 sibling, 3 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-25 16:11 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet It seems that you are really hung up on whether or not the filesystem metadata is consistent after a power failure, when I'd argue that storage devices that don't have good powerfail properties have much bigger problems (such as the potential for silent data corruption, or even if fsck will fix a trashed inode table with ext2, massive data loss). So instead of your suggested patch, it might be better simply to have a file in Documentation/filesystems that states something along the lines of: "There are storage devices that have highly undesirable properties when they are disconnected or suffer power failures while writes are in progress; such devices include flash devices and software RAID 5/6 arrays without journals, as well as hardware RAID 5/6 devices without battery backups. These devices have the property of potentially corrupting blocks being written at the time of the power failure, and worse yet, amplifying the region where blocks are corrupted such that adjacent sectors are also damaged during the power failure. Users who use such storage devices are well advised to take countermeasures, such as the use of Uninterruptible Power Supplies, and making sure the flash device is not hot-unplugged while the device is being used. Regular backups when using these devices is also a Very Good Idea. Otherwise, file systems placed on these devices can suffer silent data and file system corruption. 
A forced use of fsck may detect metadata corruption resulting in file system corruption, but will not suffice to detect data corruption." My big complaint is that you seem to think that ext3 somehow let you down, but I'd argue that the real issue is that the storage device let you down. Any journaling filesystem will have the properties that you seem to be complaining about, so the fact that your patch only documents this as assumptions made by ext2 and ext3 is unfair; it also applies to xfs, jfs, reiserfs, reiser4, etc. Furthermore, most users are even more concerned about the possibility of massive data loss and/or silent data corruption. So if your complaint is that we don't have documentation warning users about the potential pitfalls of using storage devices with undesirable power fail properties, let's document that as a shortcoming in those storage devices. - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
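[Editor's illustration] The "amplifying the region" behaviour Ted describes follows from flash geometry: changing one sector means erasing and re-programming a whole erase block. A toy model; the 512-byte-sector / 128 KiB-erase-block geometry is an assumption for illustration, not any particular device:

```python
"""Toy model of corruption amplification on flash: to change one sector,
a flash translation layer must erase and re-program a whole erase block,
so power loss mid-rewrite also destroys sectors the filesystem never
asked to write.  Geometry is illustrative only."""
SECTOR = 512
ERASE_BLOCK = 128 * 1024
SECTORS_PER_EB = ERASE_BLOCK // SECTOR   # 256 sectors share one erase block

def rewrite_sector(old_block, index, payload, fail_after=None):
    """Rewrite one sector via erase + program; fail_after simulates power
    loss after that many sectors have been re-programmed."""
    new = [b"\xff" * SECTOR] * SECTORS_PER_EB   # erased flash reads all-0xFF
    for i in range(SECTORS_PER_EB):
        if fail_after is not None and i >= fail_after:
            return new                           # power gone; rest stays erased
        new[i] = payload if i == index else old_block[i]
    return new

eb = [bytes([i % 256]) * SECTOR for i in range(SECTORS_PER_EB)]

# A completed rewrite of sector 10 changes exactly one sector:
ok = rewrite_sector(eb, 10, b"\x42" * SECTOR)
assert sum(a != b for a, b in zip(ok, eb)) == 1

# Power loss after 11 sectors are re-programmed: sectors 11..255 are lost
# too, even though only sector 10 was being written.
bad = rewrite_sector(eb, 10, b"\x42" * SECTOR, fail_after=11)
assert sum(s == b"\xff" * SECTOR for s in bad) == SECTORS_PER_EB - 11
```

With wear-levelling in the FTL the victim sectors need not even be logically adjacent, as Pavel points out in his reply.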
* [patch] document flash/RAID dangers 2009-08-25 16:11 ` Theodore Tso @ 2009-08-25 22:21 ` Pavel Machek 2009-08-25 22:27 ` [patch] document that ext2 can't handle barriers Pavel Machek 2009-08-25 22:27 ` Pavel Machek 2 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-25 22:21 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! > It seems that you are really hung up on whether or not the filesystem > metadata is consistent after a power failure, when I'd argue that > storage devices that don't have good powerfail properties have much > bigger problems (such as the potential for silent > data corruption, or even if fsck will fix a trashed inode table with > ext2, massive data loss). So instead of your suggested patch, it > might be better simply to have a file in Documentation/filesystems > that states something along the lines of: > > "There are storage devices that have highly undesirable properties > when they are disconnected or suffer power failures while writes are > in progress; such devices include flash devices and software RAID 5/6 > arrays without journals, as well as hardware RAID 5/6 devices without > battery backups. These devices have the property of potentially > corrupting blocks being written at the time of the power failure, and > worse yet, amplifying the region where blocks are corrupted such that > adjacent sectors are also damaged during the power failure. In the FTL case, damaged sectors are not necessarily adjacent. Otherwise this looks okay and fair to me. > Users who use such storage devices are well advised to take > countermeasures, such as the use of Uninterruptible Power Supplies, > and making sure the flash device is not hot-unplugged while the device > is being used. Regular backups when using these devices is also a > Very Good Idea. 
> > Otherwise, file systems placed on these devices can suffer silent data > and file system corruption. A forced use of fsck may detect metadata > corruption resulting in file system corruption, but will not suffice > to detect data corruption." Ok, would you be against adding: "Running a non-journalled filesystem on these may be desirable, as journalling can not provide meaningful protection, anyway." > My big complaint is that you seem to think that ext3 somehow let you > down, but I'd argue that the real issue is that the storage device let > you down. Any journaling filesystem will have the properties that you > seem to be complaining about, so the fact that your patch only > documents this as assumptions made by ext2 and ext3 is unfair; it also > applies to xfs, jfs, reiserfs, reiser4, etc. Furthermore, most > users Yes, it applies to all journalling filesystems; it is just that I was clever/paranoid enough to avoid anything non-ext3. The ext3 docs still say: # The journal supports the transactions start and stop, and in case of a # crash, the journal can replay the transactions to quickly put the # partition back into a consistent state. > are even more concerned about the possibility of massive data loss and/or > silent data corruption. So if your complaint is that we don't have > documentation warning users about the potential pitfalls of using > storage devices with undesirable power fail properties, let's document > that as a shortcoming in those storage devices. Ok, works for me. --- From: Theodore Tso <tytso@mit.edu> Document that many devices are too broken for filesystems to protect data in case of powerfail. 
Signed-off-by: Pavel Machek <pavel@ucw.cz> diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt new file mode 100644 index 0000000..e1a46dd --- /dev/null +++ b/Documentation/filesystems/dangers.txt @@ -0,0 +1,19 @@ +There are storage devices that have highly undesirable properties +when they are disconnected or suffer power failures while writes are +in progress; such devices include flash devices and software RAID 5/6 +arrays without journals, as well as hardware RAID 5/6 devices without +battery backups. These devices have the property of potentially +corrupting blocks being written at the time of the power failure, and +worse yet, amplifying the region where blocks are corrupted such that +additional sectors are also damaged during the power failure. + +Users who use such storage devices are well advised to take +countermeasures, such as the use of Uninterruptible Power Supplies, +and making sure the flash device is not hot-unplugged while the device +is being used. Regular backups when using these devices is also a +Very Good Idea. + +Otherwise, file systems placed on these devices can suffer silent data +and file system corruption. A forced use of fsck may detect metadata +corruption resulting in file system corruption, but will not suffice +to detect data corruption. \ No newline at end of file -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-25 22:21 ` Pavel Machek (?) @ 2009-08-25 22:33 ` david 2009-08-25 22:40 ` Pavel Machek -1 siblings, 1 reply; 309+ messages in thread From: david @ 2009-08-25 22:33 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Pavel Machek wrote: >> It seems that you are really hung up on whether or not the filesystem >> metadata is consistent after a power failure, when I'd argue that >> storage devices that don't have good powerfail properties have much >> bigger problems (such as the potential for silent >> data corruption, or even if fsck will fix a trashed inode table with >> ext2, massive data loss). So instead of your suggested patch, it >> might be better simply to have a file in Documentation/filesystems >> that states something along the lines of: >> >> "There are storage devices that have highly undesirable properties >> when they are disconnected or suffer power failures while writes are >> in progress; such devices include flash devices and software RAID 5/6 >> arrays without journals, is it under all conditions, or only when you have already lost redundancy? prior discussions make me think this was only if the redundancy is already lost. also, the talk about software RAID 5/6 arrays without journals will be confusing (after all, if you are using ext3/XFS/etc you are using a journal, aren't you?) you then go on to talk about hardware raid 5/6 without battery backup. I think you are being too specific here. any array without battery backup can lead to 'interesting' situations when you lose power. 
in addition, even with a single drive you will lose some data on power loss (unless you do sync mounts with disabled write caches); full data journaling can help protect you from this, but the default journaling just protects the metadata. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
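The single-drive case David raises comes down to the volatile write cache: a write is acknowledged before it reaches the media, and only a flush (barrier) makes it durable. A toy model of that behaviour (the class and names here are illustrative, not a real block-layer API):

```python
# Toy disk with a volatile write cache: a write can report "success"
# while the data only sit in cache, so a power cut loses them unless a
# flush (barrier) was issued first. Hypothetical model, not a driver.

class CachedDisk:
    def __init__(self):
        self.media = {}          # survives power loss
        self.cache = {}          # volatile write cache

    def write(self, sector, data):
        self.cache[sector] = data    # "success" reported immediately

    def flush(self):                 # what a cache-flush barrier does
        self.media.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        self.cache.clear()           # cached writes simply vanish

disk = CachedDisk()
disk.write(0, b"journal commit")
disk.flush()                         # made durable by the flush
disk.write(1, b"file data")          # acknowledged, but cached only
disk.power_loss()
print(sorted(disk.media))            # [0] -- sector 1 never reached media
```

This is why disabling the cache (hdparm -W0) or full data journaling narrows the window: both force acknowledged data onto the media before anything depends on it.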
* Re: [patch] document flash/RAID dangers 2009-08-25 22:33 ` david @ 2009-08-25 22:40 ` Pavel Machek 2009-08-25 22:59 ` david 2009-08-26 4:20 ` Rik van Riel 0 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-25 22:40 UTC (permalink / raw) To: david Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue 2009-08-25 15:33:08, david@lang.hm wrote: > On Wed, 26 Aug 2009, Pavel Machek wrote: > >>> It seems that you are really hung up on whether or not the filesystem >>> metadata is consistent after a power failure, when I'd argue that the >>> problem with using storage devices that don't have good powerfail >>> properties have much bigger problems (such as the potential for silent >>> data corruption, or even if fsck will fix a trashed inode table with >>> ext2, massive data loss). So instead of your suggested patch, it >>> might be better simply to have a file in Documentation/filesystems >>> that states something along the lines of: >>> >>> "There are storage devices that high highly undesirable properties >>> when they are disconnected or suffer power failures while writes are >>> in progress; such devices include flash devices and software RAID 5/6 >>> arrays without journals, > > is it under all conditions, or only when you have already lost redundancy? I'd prefer not to specify. > prior discussions make me think this was only if the redundancy is > already lost. I'm not so sure now. Lets say you are writing to the (healthy) RAID5 and have a powerfail. So now data blocks do not correspond to the parity block. You don't yet have the corruption, but you already have a problem. If you get a disk failing at this point, you'll get corruption. > also, the talk about software RAID 5/6 arrays without journals will be > confusing (after all, if you are using ext3/XFS/etc you are using a > journal, aren't you?) Slightly confusing, yes. 
Should I just say "MD RAID 5" and avoid talking about hardware RAID arrays, where that's really manufacturer-specific? > in addition, even with a single drive you will loose some data on power > loss (unless you do sync mounts with disabled write caches), full data > journaling can help protect you from this, but the default journaling > just protects the metadata. "Data loss" here means "damaging data that were already fsynced". That will not happen on single disk (with barriers on etc), but will happen on RAID5 and flash. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
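The healthy-array failure mode Pavel describes above (power cut between the data write and the parity write, followed by a later disk loss) can be shown with a toy XOR model; the names are illustrative, this is not MD code:

```python
# Toy model of the RAID-5 "write hole": a power cut between the data
# write and the parity write leaves a stripe inconsistent, and a later
# disk failure then reconstructs garbage for a block that was never
# being written. Illustrative sketch only.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Healthy 3-disk stripe: two data blocks and their parity.
d0 = b"old-data-block-0"
d1 = b"old-data-block-1"
parity = xor(d0, d1)

# Power fails mid-update: d0 reaches the platter, the parity write is lost.
d0 = b"new-data-block-0"
# parity = xor(d0, d1)   # <-- this write never happened

# Later the disk holding d1 dies; RAID rebuilds d1 as d0 XOR parity.
reconstructed_d1 = xor(d0, parity)

# d1 was not being written at all, yet its reconstruction is wrong:
print(reconstructed_d1 == b"old-data-block-1")   # False
```

This is also why a post-crash resync or scrub matters: recomputing parity from the surviving data closes the window before the next disk failure can trigger a bad reconstruction.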
* Re: [patch] document flash/RAID dangers 2009-08-25 22:40 ` Pavel Machek @ 2009-08-25 22:59 ` david 2009-08-25 23:37 ` Pavel Machek 2009-08-26 4:20 ` Rik van Riel 1 sibling, 1 reply; 309+ messages in thread From: david @ 2009-08-25 22:59 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Pavel Machek wrote: > On Tue 2009-08-25 15:33:08, david@lang.hm wrote: >> On Wed, 26 Aug 2009, Pavel Machek wrote: >> >>>> It seems that you are really hung up on whether or not the filesystem >>>> metadata is consistent after a power failure, when I'd argue that the >>>> problem with using storage devices that don't have good powerfail >>>> properties have much bigger problems (such as the potential for silent >>>> data corruption, or even if fsck will fix a trashed inode table with >>>> ext2, massive data loss). So instead of your suggested patch, it >>>> might be better simply to have a file in Documentation/filesystems >>>> that states something along the lines of: >>>> >>>> "There are storage devices that high highly undesirable properties >>>> when they are disconnected or suffer power failures while writes are >>>> in progress; such devices include flash devices and software RAID 5/6 >>>> arrays without journals, >> >> is it under all conditions, or only when you have already lost redundancy? > > I'd prefer not to specify. you need to, otherwise you are claiming that all linux software raid implementations will loose data on powerfail, which I don't think is the case. >> prior discussions make me think this was only if the redundancy is >> already lost. > > I'm not so sure now. > > Lets say you are writing to the (healthy) RAID5 and have a powerfail. > > So now data blocks do not correspond to the parity block. You don't > yet have the corruption, but you already have a problem. 
> > If you get a disk failing at this point, you'll get corruption. it's the same combination of problems (non-redundant array and write lost to powerfail/reboot), just in a different order. recommending a scrub of the raid after an unclean shutdown would make sense, along with a warning that if you lose all redundancy before the scrub is completed and there was a write failure in the unscrubbed portion it could corrupt things. >> also, the talk about software RAID 5/6 arrays without journals will be >> confusing (after all, if you are using ext3/XFS/etc you are using a >> journal, aren't you?) > > Slightly confusing, yes. Should I just say "MD RAID 5" and avoid > talking about hardware RAID arrays, where that's really > manufacturer-specific? what about dm raid? I don't think you should talk about hardware raid cards. >> in addition, even with a single drive you will lose some data on power >> loss (unless you do sync mounts with disabled write caches), full data >> journaling can help protect you from this, but the default journaling >> just protects the metadata. > > "Data loss" here means "damaging data that were already fsynced". That > will not happen on single disk (with barriers on etc), but will happen > on RAID5 and flash. this definition of data loss wasn't clear prior to this. you need to define this, and state that the reason that flash and raid arrays can suffer from this is that both of them deal with blocks of storage larger than the data block (eraseblock or raid stripe) and there are conditions that can cause the loss of the entire eraseblock or raid stripe which can affect data that was previously safe on disk (and if power had been lost before the latest write, the prior data would still be safe) note that this doesn't necessarily affect all flash disks. if the disk doesn't replace the old block in the FTL until the data has all been successfully copied to the new eraseblock you don't have this problem. 
some (possibly all) cheap thumb drives don't do this, but I would expect the expensive SATA SSDs to do things in the right order. do this right and you are properly documenting a failure mode that most people don't understand, but go too far and you are crying wolf. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
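The FTL distinction David draws, erase-in-place versus copy-then-remap, determines whether neighbouring sectors are at risk. A hypothetical model of the two firmware strategies (not real FTL code):

```python
# Sketch of eraseblock amplification: a flash eraseblock spans many
# filesystem sectors, so the FTL's update strategy decides whether a
# power cut can take out neighbouring sectors too. Hypothetical model
# of two firmware strategies, not real firmware.

SECTORS_PER_ERASEBLOCK = 4

def in_place_update(block, idx, data, power_fails):
    """Read block to RAM, erase, write back: risky for the neighbours."""
    saved = list(block)                      # copy held only in firmware RAM
    block = [None] * SECTORS_PER_ERASEBLOCK  # erase cycle wipes the media
    if power_fails:
        return block                         # RAM copy lost: whole block gone
    saved[idx] = data
    return saved                             # written back successfully

def copy_then_remap(block, idx, data, power_fails):
    """Write a fresh eraseblock first, then atomically switch the mapping."""
    fresh = list(block)
    fresh[idx] = data                        # old block is still intact
    if power_fails:
        return block                         # mapping unchanged: no loss
    return fresh                             # remap flips to the new block

old = ["s0", "s1", "s2", "s3"]
print(in_place_update(old, 1, "new", power_fails=True))   # all four sectors lost
print(copy_then_remap(old, 1, "new", power_fails=True))   # old data survives
```

In the first strategy a power cut trashes sectors s0, s2, and s3 even though only s1 was being written; in the second, the old mapping stays valid until the new eraseblock is complete, so the worst case is losing only the in-flight write.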
* Re: [patch] document flash/RAID dangers 2009-08-25 22:59 ` david @ 2009-08-25 23:37 ` Pavel Machek 2009-08-25 23:48 ` Ric Wheeler 2009-08-25 23:56 ` david 0 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-25 23:37 UTC (permalink / raw) To: david Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! >>> is it under all conditions, or only when you have already lost redundancy? >> >> I'd prefer not to specify. > > you need to, otherwise you are claiming that all linux software raid > implementations will loose data on powerfail, which I don't think is the > case. Well, I'm not saying it loses data on _every_ powerfail ;-). >>> also, the talk about software RAID 5/6 arrays without journals will be >>> confusing (after all, if you are using ext3/XFS/etc you are using a >>> journal, aren't you?) >> >> Slightly confusing, yes. Should I just say "MD RAID 5" and avoid >> talking about hardware RAID arrays, where that's really >> manufacturer-specific? > > what about dm raid? > > I don't think you should talk about hardware raid cards. Ok, fixed. >>> in addition, even with a single drive you will loose some data on power >>> loss (unless you do sync mounts with disabled write caches), full data >>> journaling can help protect you from this, but the default journaling >>> just protects the metadata. >> >> "Data loss" here means "damaging data that were already fsynced". That >> will not happen on single disk (with barriers on etc), but will happen >> on RAID5 and flash. > > this definition of data loss wasn't clear prior to this. you need to I actually think it was. write() syscall does not guarantee anything, fsync() does. 
> define this, and state that the reason that flash and raid arrays can > suffer from this is that both of them deal with blocks of storage larger > than the data block (eraseblock or raid stripe) and there are conditions > that can cause the loss of the entire eraseblock or raid stripe which can > affect data that was previously safe on disk (and if power had been lost > before the latest write, the prior data would still be safe) I actually believe Ted's writeup is good. > note that this doesn't necessarily affect all flash disks. if the disk > doesn't replace the old block in the FTL until the data has all been > successfully copied to the new eraseblock you don't have this problem. > > some (possibly all) cheap thumb drives don't do this, but I would expect > the expensive SATA SSDs to do things in the right order. I'd expect SATA SSDs to have that solved, yes. Again, Ted does not say it affects _all_ such devices, and it certainly did affect all that I have seen. > do this right and you are properly documenting a failure mode that most > people don't understand, but go too far and you are crying wolf. Ok, latest version is below, can you suggest improvements? (And yes, details when exactly RAID-5 misbehaves should be noted somewhere. I don't know enough about RAID arrays, can someone help?) Pavel --- There are storage devices that have highly undesirable properties when they are disconnected or suffer power failures while writes are in progress; such devices include flash devices and MD RAID 4/5/6 arrays. These devices have the property of potentially corrupting blocks being written at the time of the power failure, and worse yet, amplifying the region where blocks are corrupted such that additional sectors are also damaged during the power failure. Users who use such storage devices are well advised to take countermeasures, such as the use of Uninterruptible Power Supplies, and making sure the flash device is not hot-unplugged while the device is being used. 
Regular backups when using these devices are also a Very Good Idea. Otherwise, file systems placed on these devices can suffer silent data and file system corruption. A forced use of fsck may detect metadata corruption resulting in file system corruption, but will not suffice to detect data corruption. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-25 23:37 ` Pavel Machek @ 2009-08-25 23:48 ` Ric Wheeler 2009-08-26 0:06 ` Pavel Machek 2009-08-25 23:56 ` david 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-25 23:48 UTC (permalink / raw) To: Pavel Machek Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet > --- > There are storage devices that high highly undesirable properties > when they are disconnected or suffer power failures while writes are > in progress; such devices include flash devices and MD RAID 4/5/6 > arrays. These devices have the property of potentially > corrupting blocks being written at the time of the power failure, and > worse yet, amplifying the region where blocks are corrupted such that > additional sectors are also damaged during the power failure. I would strike the entire mention of MD devices since it is your assertion, not a proven fact. You will cause more data loss from common events (single sector errors, complete drive failure) by steering people away from more reliable storage configurations because of a really rare edge case (power failure during split write to two raid members while doing a RAID rebuild). > > Users who use such storage devices are well advised take > countermeasures, such as the use of Uninterruptible Power Supplies, > and making sure the flash device is not hot-unplugged while the device > is being used. Regular backups when using these devices is also a > Very Good Idea. All users who care about data integrity - including those who do not use MD5 but just regular single S-ATA disks - will get better reliability from a UPS. > > Otherwise, file systems placed on these devices can suffer silent data > and file system corruption. An forced use of fsck may detect metadata > corruption resulting in file system corruption, but will not suffice > to detect data corruption. 
> This is very misleading. All storage "can" have silent data loss, you are making a statement without specifics about frequency. FSCK can repair the file system metadata, but will not detect any data loss or corruption in the data blocks allocated to user files. To detect data loss properly, you need to checksum (or digitally sign) all objects stored in a file system and verify them on a regular basis. Also helps to keep a separate list of those objects on another device so that when the metadata does take a hit, you can enumerate your objects and verify that you have not lost anything. ric ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-25 23:48 ` Ric Wheeler @ 2009-08-26 0:06 ` Pavel Machek 2009-08-26 0:12 ` Ric Wheeler 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-26 0:06 UTC (permalink / raw) To: Ric Wheeler Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue 2009-08-25 19:48:09, Ric Wheeler wrote: > >> --- >> There are storage devices that high highly undesirable properties >> when they are disconnected or suffer power failures while writes are >> in progress; such devices include flash devices and MD RAID 4/5/6 >> arrays. These devices have the property of potentially >> corrupting blocks being written at the time of the power failure, and >> worse yet, amplifying the region where blocks are corrupted such that >> additional sectors are also damaged during the power failure. > > I would strike the entire mention of MD devices since it is your > assertion, not a proven fact. You will cause more data loss from common That actually is a fact. That's how MD RAID 5 is designed. And btw those are originaly Ted's words. > events (single sector errors, complete drive failure) by steering people > away from more reliable storage configurations because of a really rare > edge case (power failure during split write to two raid members while > doing a RAID rebuild). I'm not sure what's rare about power failures. Unlike single sector errors, my machine actually has a button that produces exactly that event. Running degraded raid5 arrays for extended periods may be slightly unusual configuration, but I suspect people should just do that for testing. (And from the discussion, people seem to think that degraded raid5 is equivalent to raid0). >> Otherwise, file systems placed on these devices can suffer silent data >> and file system corruption. 
A forced use of fsck may detect metadata >> corruption resulting in file system corruption, but will not suffice >> to detect data corruption. >> > > This is very misleading. All storage "can" have silent data loss, you are > making a statement without specifics about frequency. substitute with "can (by design)"? Now, can you suggest a useful version of that document meeting your criteria? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:06 ` Pavel Machek @ 2009-08-26 0:12 ` Ric Wheeler 2009-08-26 0:20 ` Pavel Machek 0 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 0:12 UTC (permalink / raw) To: Pavel Machek Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 08:06 PM, Pavel Machek wrote: > On Tue 2009-08-25 19:48:09, Ric Wheeler wrote: >> >>> --- >>> There are storage devices that high highly undesirable properties >>> when they are disconnected or suffer power failures while writes are >>> in progress; such devices include flash devices and MD RAID 4/5/6 >>> arrays. These devices have the property of potentially >>> corrupting blocks being written at the time of the power failure, and >>> worse yet, amplifying the region where blocks are corrupted such that >>> additional sectors are also damaged during the power failure. >> >> I would strike the entire mention of MD devices since it is your >> assertion, not a proven fact. You will cause more data loss from common > > That actually is a fact. That's how MD RAID 5 is designed. And btw > those are originaly Ted's words. > Ted did not design MD RAID5. >> events (single sector errors, complete drive failure) by steering people >> away from more reliable storage configurations because of a really rare >> edge case (power failure during split write to two raid members while >> doing a RAID rebuild). > > I'm not sure what's rare about power failures. Unlike single sector > errors, my machine actually has a button that produces exactly that > event. Running degraded raid5 arrays for extended periods may be > slightly unusual configuration, but I suspect people should just do > that for testing. (And from the discussion, people seem to think that > degraded raid5 is equivalent to raid0). 
Power failures after a full drive failure with a split write during a rebuild? > >>> Otherwise, file systems placed on these devices can suffer silent data >>> and file system corruption. An forced use of fsck may detect metadata >>> corruption resulting in file system corruption, but will not suffice >>> to detect data corruption. >>> >> >> This is very misleading. All storage "can" have silent data loss, you are >> making a statement without specifics about frequency. > > substitute with "can (by design)"? By Pavel's unproven casual observation? > > Now, if you can suggest useful version of that document meeting your > criteria? > > Pavel ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:12 ` Ric Wheeler @ 2009-08-26 0:20 ` Pavel Machek 2009-08-26 0:26 ` david ` (2 more replies) 0 siblings, 3 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-26 0:20 UTC (permalink / raw) To: Ric Wheeler Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet >>>> --- >>>> There are storage devices that high highly undesirable properties >>>> when they are disconnected or suffer power failures while writes are >>>> in progress; such devices include flash devices and MD RAID 4/5/6 >>>> arrays. These devices have the property of potentially >>>> corrupting blocks being written at the time of the power failure, and >>>> worse yet, amplifying the region where blocks are corrupted such that >>>> additional sectors are also damaged during the power failure. >>> >>> I would strike the entire mention of MD devices since it is your >>> assertion, not a proven fact. You will cause more data loss from common >> >> That actually is a fact. That's how MD RAID 5 is designed. And btw >> those are originaly Ted's words. > > Ted did not design MD RAID5. So what? He clearly knows how it works. Instead of arguing he's wrong, will you simply label everything as unproven? >>> events (single sector errors, complete drive failure) by steering people >>> away from more reliable storage configurations because of a really rare >>> edge case (power failure during split write to two raid members while >>> doing a RAID rebuild). >> >> I'm not sure what's rare about power failures. Unlike single sector >> errors, my machine actually has a button that produces exactly that >> event. Running degraded raid5 arrays for extended periods may be >> slightly unusual configuration, but I suspect people should just do >> that for testing. (And from the discussion, people seem to think that >> degraded raid5 is equivalent to raid0). 
> > Power failures after a full drive failure with a split write during a rebuild? Look, I don't need full drive failure for this to happen. I can just remove one disk from array. I don't need power failure, I can just press the power button. I don't even need to rebuild anything, I can just write to degraded array. Given that all events are under my control, statistics make little sense here. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:20 ` Pavel Machek @ 2009-08-26 0:26 ` david 2009-08-26 0:28 ` Ric Wheeler 2009-08-26 4:24 ` Rik van Riel 2 siblings, 0 replies; 309+ messages in thread From: david @ 2009-08-26 0:26 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Pavel Machek wrote: >>>>> --- >>>>> There are storage devices that high highly undesirable properties >>>>> when they are disconnected or suffer power failures while writes are >>>>> in progress; such devices include flash devices and MD RAID 4/5/6 >>>>> arrays. These devices have the property of potentially >>>>> corrupting blocks being written at the time of the power failure, and >>>>> worse yet, amplifying the region where blocks are corrupted such that >>>>> additional sectors are also damaged during the power failure. >>>> >>>> I would strike the entire mention of MD devices since it is your >>>> assertion, not a proven fact. You will cause more data loss from common >>> >>> That actually is a fact. That's how MD RAID 5 is designed. And btw >>> those are originaly Ted's words. >> >> Ted did not design MD RAID5. > > So what? He clearly knows how it works. > > Instead of arguing he's wrong, will you simply label everything as > unproven? > >>>> events (single sector errors, complete drive failure) by steering people >>>> away from more reliable storage configurations because of a really rare >>>> edge case (power failure during split write to two raid members while >>>> doing a RAID rebuild). >>> >>> I'm not sure what's rare about power failures. Unlike single sector >>> errors, my machine actually has a button that produces exactly that >>> event. Running degraded raid5 arrays for extended periods may be >>> slightly unusual configuration, but I suspect people should just do >>> that for testing. 
(And from the discussion, people seem to think that >>> degraded raid5 is equivalent to raid0). >> >> Power failures after a full drive failure with a split write during a rebuild? > > Look, I don't need full drive failure for this to happen. I can just > remove one disk from array. I don't need power failure, I can just > press the power button. I don't even need to rebuild anything, I can > just write to degraded array. > > Given that all events are under my control, statistics make little > sense here. if you are intentionally causing several low-probability things to happen at once you increase the risk of corruption note that you also need a write to take place, and be interrupted in just the right way. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:20 ` Pavel Machek 2009-08-26 0:26 ` david @ 2009-08-26 0:28 ` Ric Wheeler 2009-08-26 0:38 ` Pavel Machek 2009-08-26 4:24 ` Rik van Riel 2 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 0:28 UTC (permalink / raw) To: Pavel Machek Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 08:20 PM, Pavel Machek wrote: >>>>> --- >>>>> There are storage devices that high highly undesirable properties >>>>> when they are disconnected or suffer power failures while writes are >>>>> in progress; such devices include flash devices and MD RAID 4/5/6 >>>>> arrays. These devices have the property of potentially >>>>> corrupting blocks being written at the time of the power failure, and >>>>> worse yet, amplifying the region where blocks are corrupted such that >>>>> additional sectors are also damaged during the power failure. >>>> >>>> I would strike the entire mention of MD devices since it is your >>>> assertion, not a proven fact. You will cause more data loss from common >>> >>> That actually is a fact. That's how MD RAID 5 is designed. And btw >>> those are originaly Ted's words. >> >> Ted did not design MD RAID5. > > So what? He clearly knows how it works. > > Instead of arguing he's wrong, will you simply label everything as > unproven? > >>>> events (single sector errors, complete drive failure) by steering people >>>> away from more reliable storage configurations because of a really rare >>>> edge case (power failure during split write to two raid members while >>>> doing a RAID rebuild). >>> >>> I'm not sure what's rare about power failures. Unlike single sector >>> errors, my machine actually has a button that produces exactly that >>> event. 
Running degraded raid5 arrays for extended periods may be >>> slightly unusual configuration, but I suspect people should just do >>> that for testing. (And from the discussion, people seem to think that >>> degraded raid5 is equivalent to raid0). >> >> Power failures after a full drive failure with a split write during a rebuild? > > Look, I don't need full drive failure for this to happen. I can just > remove one disk from array. I don't need power failure, I can just > press the power button. I don't even need to rebuild anything, I can > just write to degraded array. > > Given that all events are under my control, statistics make little > sense here. > Pavel > You are deliberately causing a double failure - pressing the power button after pulling a drive is exactly that scenario. Pull your single (non-MD5) disk out while writing (hot unplug from the S-ATA side, leaving power on) and run some tests to verify your assertions... ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:28 ` Ric Wheeler @ 2009-08-26 0:38 ` Pavel Machek 2009-08-26 0:45 ` Ric Wheeler 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-26 0:38 UTC (permalink / raw) To: Ric Wheeler Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet >>>> I'm not sure what's rare about power failures. Unlike single sector >>>> errors, my machine actually has a button that produces exactly that >>>> event. Running degraded raid5 arrays for extended periods may be >>>> slightly unusual configuration, but I suspect people should just do >>>> that for testing. (And from the discussion, people seem to think that >>>> degraded raid5 is equivalent to raid0). >>> >>> Power failures after a full drive failure with a split write during a rebuild? >> >> Look, I don't need full drive failure for this to happen. I can just >> remove one disk from array. I don't need power failure, I can just >> press the power button. I don't even need to rebuild anything, I can >> just write to degraded array. >> >> Given that all events are under my control, statistics make little >> sense here. > > You are deliberately causing a double failure - pressing the power button > after pulling a drive is exactly that scenario. Exactly. And now I'm trying to get that documented, so that people don't do it and still expect their fs to be consistent. > Pull your single (non-MD5) disk out while writing (hot unplug from the > S-ATA side, leaving power on) and run some tests to verify your > assertions... I actually did that some time ago with pulling SATA disk (I actually pulled both SATA *and* power -- that was the way hotplug envelope worked; that's more harsh test than what you suggest, so that should be ok). Write test was fsync heavy, with logging to separate drive, checking that all the data where fsync succeeded are indeed accessible. 
I uncovered a few bugs in ext* that jack fixed; I uncovered some libata weirdness that is not yet fixed AFAIK, but with all the patches applied I could not break that single SATA disk. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
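The kind of check Pavel's test performed can be sketched as follows; this is a simplified single-process illustration using only os.write/os.fsync (the real test logged to a separate drive and cut power or cabling externally):

```python
import os, tempfile

# Sketch of an fsync-verification test: a record only counts as "safe"
# once fsync() has returned, so the post-crash check may demand exactly
# the records logged as synced, no more. Simplified illustration.

def write_record(fd, seq):
    rec = f"record-{seq:08d}\n".encode()
    os.write(fd, rec)
    os.fsync(fd)          # durability point: only now log the record as safe
    return rec

tmp = tempfile.NamedTemporaryFile(delete=False)
synced = []               # stand-in for the log kept on a second drive
for seq in range(3):
    synced.append(write_record(tmp.fileno(), seq))
tmp.close()

# "After the crash": every record logged as synced must be on disk.
with open(tmp.name, "rb") as f:
    on_disk = f.read()
assert all(rec in on_disk for rec in synced)
print(len(synced), "fsynced records verified")
os.unlink(tmp.name)
```

The key property under test is one-sided: records written but not yet fsynced may legitimately be missing after the crash; records logged after a successful fsync must never be.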
* Re: [patch] document flash/RAID dangers 2009-08-26 0:38 ` Pavel Machek @ 2009-08-26 0:45 ` Ric Wheeler 2009-08-26 11:21 ` Pavel Machek 0 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 0:45 UTC (permalink / raw) To: Pavel Machek Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 08:38 PM, Pavel Machek wrote: >>>>> I'm not sure what's rare about power failures. Unlike single sector >>>>> errors, my machine actually has a button that produces exactly that >>>>> event. Running degraded raid5 arrays for extended periods may be >>>>> slightly unusual configuration, but I suspect people should just do >>>>> that for testing. (And from the discussion, people seem to think that >>>>> degraded raid5 is equivalent to raid0). >>>> >>>> Power failures after a full drive failure with a split write during a rebuild? >>> >>> Look, I don't need full drive failure for this to happen. I can just >>> remove one disk from array. I don't need power failure, I can just >>> press the power button. I don't even need to rebuild anything, I can >>> just write to degraded array. >>> >>> Given that all events are under my control, statistics make little >>> sense here. >> >> You are deliberately causing a double failure - pressing the power button >> after pulling a drive is exactly that scenario. > > Exactly. And now I'm trying to get that documented, so that people > don't do it and still expect their fs to be consistent. The problem I have is that the way you word it steers people away from RAID5 and better data integrity. Your intentions are good, but your text is going to do considerable harm. Most people don't intentionally drop power (or have a power failure) during RAID rebuilds.... > >> Pull your single (non-MD5) disk out while writing (hot unplug from the >> S-ATA side, leaving power on) and run some tests to verify your >> assertions... 
> > I actually did that some time ago with pulling SATA disk (I actually > pulled both SATA *and* power -- that was the way hotplug envelope > worked; that's a harsher test than what you suggest, so that should > be ok). Write test was fsync heavy, with logging to separate drive, > checking that all the data where fsync succeeded are indeed > accessible. I uncovered a few bugs in ext* that jack fixed, I uncovered > some libata weirdness that is not yet fixed AFAIK, but with all the > patches applied I could not break that single SATA disk. > Pavel Fsync-heavy workloads with working barriers will tend to keep the write cache pretty empty (two barrier flushes per fsync), so this is not too surprising. Drive behaviour depends on a lot of things though - how the firmware prioritizes writes over reads, etc. ric ^ permalink raw reply [flat|nested] 309+ messages in thread
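[Editorial note: the test protocol Pavel describes above can be sketched in a few lines of Python. This is a hypothetical reconstruction, not his actual harness; the file layout and key format are invented for illustration. The invariant under test: a key is logged as durable only after fsync() on the data file has returned, so after a power cut every key present in the log must still be readable from the data file.]

```python
import os

def write_record(data_path, log_path, key, payload):
    """Append a record, then log its key -- but only after the data is on disk."""
    with open(data_path, "ab") as f:
        f.write(key + b":" + payload + b"\n")
        f.flush()
        os.fsync(f.fileno())       # data must be durable before we claim success
    with open(log_path, "ab") as log:
        log.write(key + b"\n")
        log.flush()
        os.fsync(log.fileno())     # the log entry is the durability promise

def lost_records(data_path, log_path):
    """After a crash: return logged keys whose records are missing (should be empty)."""
    with open(data_path, "rb") as f:
        present = {line.split(b":", 1)[0] for line in f if b":" in line}
    with open(log_path, "rb") as log:
        promised = {line.strip() for line in log if line.strip()}
    return promised - present
```

In the real hardware test the verification step runs after power-cycling the machine; any non-empty result means the drive or filesystem broke the fsync contract.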
* Re: [patch] document flash/RAID dangers 2009-08-26 0:45 ` Ric Wheeler @ 2009-08-26 11:21 ` Pavel Machek 2009-08-26 11:58 ` Ric Wheeler 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-26 11:21 UTC (permalink / raw) To: Ric Wheeler Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue 2009-08-25 20:45:26, Ric Wheeler wrote: > On 08/25/2009 08:38 PM, Pavel Machek wrote: >>>>>> I'm not sure what's rare about power failures. Unlike single sector >>>>>> errors, my machine actually has a button that produces exactly that >>>>>> event. Running degraded raid5 arrays for extended periods may be >>>>>> slightly unusual configuration, but I suspect people should just do >>>>>> that for testing. (And from the discussion, people seem to think that >>>>>> degraded raid5 is equivalent to raid0). >>>>> >>>>> Power failures after a full drive failure with a split write during a rebuild? >>>> >>>> Look, I don't need full drive failure for this to happen. I can just >>>> remove one disk from array. I don't need power failure, I can just >>>> press the power button. I don't even need to rebuild anything, I can >>>> just write to degraded array. >>>> >>>> Given that all events are under my control, statistics make little >>>> sense here. >>> >>> You are deliberately causing a double failure - pressing the power button >>> after pulling a drive is exactly that scenario. >> >> Exactly. And now I'm trying to get that documented, so that people >> don't do it and still expect their fs to be consistent. > > The problem I have is that the way you word it steers people away from > RAID5 and better data integrity. Your intentions are good, but your text > is going to do considerable harm. > > Most people don't intentionally drop power (or have a power failure) > during RAID rebuilds.... 
An example I've seen went like this: a drive in a raid 5 failed; a hot spare was available (no idea about UPS). The system apparently locked up trying to talk to the failed drive, or maybe the admin just was not patient enough, so he powercycled the array. He lost the array. So while most people will not aggressively powercycle a RAID array, drive failure still provokes little-tested error paths, and getting an unclean shutdown is quite easy in such a case. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 11:21 ` Pavel Machek @ 2009-08-26 11:58 ` Ric Wheeler 2009-08-26 12:40 ` Theodore Tso 2009-08-29 9:38 ` Pavel Machek 0 siblings, 2 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 11:58 UTC (permalink / raw) To: Pavel Machek Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/26/2009 07:21 AM, Pavel Machek wrote: > On Tue 2009-08-25 20:45:26, Ric Wheeler wrote: > >> On 08/25/2009 08:38 PM, Pavel Machek wrote: >> >>>>>>> I'm not sure what's rare about power failures. Unlike single sector >>>>>>> errors, my machine actually has a button that produces exactly that >>>>>>> event. Running degraded raid5 arrays for extended periods may be >>>>>>> slightly unusual configuration, but I suspect people should just do >>>>>>> that for testing. (And from the discussion, people seem to think that >>>>>>> degraded raid5 is equivalent to raid0). >>>>>>> >>>>>> Power failures after a full drive failure with a split write during a rebuild? >>>>>> >>>>> Look, I don't need full drive failure for this to happen. I can just >>>>> remove one disk from array. I don't need power failure, I can just >>>>> press the power button. I don't even need to rebuild anything, I can >>>>> just write to degraded array. >>>>> >>>>> Given that all events are under my control, statistics make little >>>>> sense here. >>>>> >>>> You are deliberately causing a double failure - pressing the power button >>>> after pulling a drive is exactly that scenario. >>>> >>> Exactly. And now I'm trying to get that documented, so that people >>> don't do it and still expect their fs to be consistent. >>> >> The problem I have is that the way you word it steers people away from >> RAID5 and better data integrity. Your intentions are good, but your text >> is going to do considerable harm. 
>> >> Most people don't intentionally drop power (or have a power failure) >> during RAID rebuilds.... >> > Example I seen went like this: > > Drive in raid 5 failed; hot spare was available (no idea about > UPS). System apparently locked up trying to talk to the failed drive, > or maybe admin just was not patient enough, so he just powercycled the > array. He lost the array. > > So while most people will not agressively powercycle the RAID array, > drive failure still provokes little tested error paths, and getting > unclean shutdown is quite easy in such case. > Pavel > Then what we need to document is do not power cycle an array during a rebuild, right? If it wasn't the admin that timed out and the box really was hung (no drive activity lights, etc), you will need to power cycle/reboot but then you should not have this active rebuild issuing writes either... In the end, there are cascading failures that will defeat any data protection scheme, but that does not mean that the value of that scheme is zero. We need to get more people to use RAID (including MD5) and try to enhance it as we go. Just using a single disk is not a good thing... ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 11:58 ` Ric Wheeler @ 2009-08-26 12:40 ` Theodore Tso 2009-08-26 13:11 ` Ric Wheeler ` (2 more replies) 2009-08-29 9:38 ` Pavel Machek 1 sibling, 3 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-26 12:40 UTC (permalink / raw) To: Ric Wheeler Cc: Pavel Machek, david, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, Aug 26, 2009 at 07:58:40AM -0400, Ric Wheeler wrote: >> Drive in raid 5 failed; hot spare was available (no idea about >> UPS). System apparently locked up trying to talk to the failed drive, >> or maybe admin just was not patient enough, so he just powercycled the >> array. He lost the array. >> >> So while most people will not agressively powercycle the RAID array, >> drive failure still provokes little tested error paths, and getting >> unclean shutdown is quite easy in such case. > > Then what we need to document is do not power cycle an array during a > rebuild, right? Well, the software raid layer could be improved so that it implements scrubbing by default (i.e., have the md package install a cron job to implement a periodic scrub pass automatically). The MD code could also regularly check to make sure the hot spare is OK; the other possibility is that the hot spare, which hadn't been used in a long time, had silently failed. > In the end, there are cascading failures that will defeat any data > protection scheme, but that does not mean that the value of that scheme > is zero. We need to be get more people to use RAID (including MD5) and > try to enhance it as we go. Just using a single disk is not a good > thing... Yep; the solution is to improve the storage devices. It is *not* to encourage people to think RAID is not worth it, or that somehow ext2 is better than ext3 because it runs fsck's all the time at boot up. That's just crazy talk. 
- Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
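[Editorial note: the kernel-side hook for the cron-driven scrub Ted suggests already exists; the md driver exposes a sysfs trigger. A sketch, for reference -- the device name and schedule are illustrative, and this needs root:]

```shell
# Start a background consistency check of /dev/md0; md reads all
# members and counts parity/mirror mismatches without rewriting data.
echo check > /sys/block/md0/md/sync_action

# Progress appears in /proc/mdstat; the result afterwards in:
cat /sys/block/md0/md/mismatch_cnt

# Crontab entry to scrub weekly (Sunday 01:00):
# 0 1 * * 0  root  echo check > /sys/block/md0/md/sync_action
```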
* Re: [patch] document flash/RAID dangers 2009-08-26 12:40 ` Theodore Tso @ 2009-08-26 13:11 ` Ric Wheeler 2009-08-26 13:11 ` Ric Wheeler 2009-08-26 13:40 ` Chris Adams 2 siblings, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 13:11 UTC (permalink / raw) To: Theodore Tso, Pavel Machek, david, Florian Weimer, Goswin von Brederlow, Rob Landley, ke On 08/26/2009 08:40 AM, Theodore Tso wrote: > On Wed, Aug 26, 2009 at 07:58:40AM -0400, Ric Wheeler wrote: >>> Drive in raid 5 failed; hot spare was available (no idea about >>> UPS). System apparently locked up trying to talk to the failed drive, >>> or maybe admin just was not patient enough, so he just powercycled the >>> array. He lost the array. >>> >>> So while most people will not agressively powercycle the RAID array, >>> drive failure still provokes little tested error paths, and getting >>> unclean shutdown is quite easy in such case. >> >> Then what we need to document is do not power cycle an array during a >> rebuild, right? > > Well, the softwar raid layer could be improved so that it implements > scrubbing by default (i.e., have the md package install a cron job to > implement a periodict scrub pass automatically). The MD code could > also regularly check to make sure the hot spare is OK; the other > possibility is that hot spare, which hadn't been used in a long time, > had silently failed. Actually, MD does this scan already (not automatically, but you can set up a simple cron job to kick off a periodic "check"). It is a delicate balance to get the frequency of the scrubbing correct. On one hand, you want to make sure that you detect errors in a timely fashion, certainly detection of single sector errors before you might develop a second sector level error on another drive. On the other hand, running scans/scrubs continually impacts the performance of your real workload and can potentially impact your components' life span by subjecting them to a heavy workload. 
The rule of thumb from my experience is that most people settle in with a scan once a week or two (done at a throttled rate). > >> In the end, there are cascading failures that will defeat any data >> protection scheme, but that does not mean that the value of that scheme >> is zero. We need to be get more people to use RAID (including MD5) and >> try to enhance it as we go. Just using a single disk is not a good >> thing... > > Yep; the solution is to improve the storage devices. It is *not* to > encourage people to think RAID is not worth it, or that somehow ext2 > is better than ext3 because it runs fsck's all the time at boot up. > That's just crazy talk. > > - Ted Agreed.... ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 12:40 ` Theodore Tso 2009-08-26 13:11 ` Ric Wheeler @ 2009-08-26 13:11 ` Ric Wheeler 2009-08-26 13:44 ` david 2009-08-26 13:40 ` Chris Adams 2 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 13:11 UTC (permalink / raw) To: Theodore Tso, Pavel Machek, david, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/26/2009 08:40 AM, Theodore Tso wrote: > On Wed, Aug 26, 2009 at 07:58:40AM -0400, Ric Wheeler wrote: >>> Drive in raid 5 failed; hot spare was available (no idea about >>> UPS). System apparently locked up trying to talk to the failed drive, >>> or maybe admin just was not patient enough, so he just powercycled the >>> array. He lost the array. >>> >>> So while most people will not agressively powercycle the RAID array, >>> drive failure still provokes little tested error paths, and getting >>> unclean shutdown is quite easy in such case. >> >> Then what we need to document is do not power cycle an array during a >> rebuild, right? > > Well, the softwar raid layer could be improved so that it implements > scrubbing by default (i.e., have the md package install a cron job to > implement a periodict scrub pass automatically). The MD code could > also regularly check to make sure the hot spare is OK; the other > possibility is that hot spare, which hadn't been used in a long time, > had silently failed. Actually, MD does this scan already (not automatically, but you can set up a simple cron job to kick off a periodic "check"). It is a delicate balance to get the frequency of the scrubbing correct. On one hand, you want to make sure that you detect errors in a timely fashion, certainly detection of single sector errors before you might develop a second sector level error on another drive. 
On the other hand, running scans/scrubs continually impacts the performance of your real workload and can potentially impact your components' life span by subjecting them to a heavy workload. The rule of thumb from my experience is that most people settle in with a scan once a week or two (done at a throttled rate). > >> In the end, there are cascading failures that will defeat any data >> protection scheme, but that does not mean that the value of that scheme >> is zero. We need to be get more people to use RAID (including MD5) and >> try to enhance it as we go. Just using a single disk is not a good >> thing... > > Yep; the solution is to improve the storage devices. It is *not* to > encourage people to think RAID is not worth it, or that somehow ext2 > is better than ext3 because it runs fsck's all the time at boot up. > That's just crazy talk. > > - Ted Agreed.... ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 13:11 ` Ric Wheeler @ 2009-08-26 13:44 ` david 0 siblings, 0 replies; 309+ messages in thread From: david @ 2009-08-26 13:44 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Ric Wheeler wrote: > On 08/26/2009 08:40 AM, Theodore Tso wrote: >> On Wed, Aug 26, 2009 at 07:58:40AM -0400, Ric Wheeler wrote: >>>> Drive in raid 5 failed; hot spare was available (no idea about >>>> UPS). System apparently locked up trying to talk to the failed drive, >>>> or maybe admin just was not patient enough, so he just powercycled the >>>> array. He lost the array. >>>> >>>> So while most people will not agressively powercycle the RAID array, >>>> drive failure still provokes little tested error paths, and getting >>>> unclean shutdown is quite easy in such case. >>> >>> Then what we need to document is do not power cycle an array during a >>> rebuild, right? >> >> Well, the softwar raid layer could be improved so that it implements >> scrubbing by default (i.e., have the md package install a cron job to >> implement a periodict scrub pass automatically). The MD code could >> also regularly check to make sure the hot spare is OK; the other >> possibility is that hot spare, which hadn't been used in a long time, >> had silently failed. > > Actually, MD does this scan already (not automatically, but you can set up a > simple cron job to kick off a periodic "check"). It is a delicate balance to > get the frequency of the scrubbing correct. debian defaults to doing this once a month (first sunday of each month), on some of my systems this scrub takes almost a week to complete. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 12:40 ` Theodore Tso 2009-08-26 13:11 ` Ric Wheeler 2009-08-26 13:11 ` Ric Wheeler @ 2009-08-26 13:40 ` Chris Adams 2009-08-26 13:47 ` Alan Cox 2009-08-27 21:50 ` Pavel Machek 2 siblings, 2 replies; 309+ messages in thread From: Chris Adams @ 2009-08-26 13:40 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-kernel Once upon a time, Theodore Tso <tytso@mit.edu> said: >Well, the softwar raid layer could be improved so that it implements >scrubbing by default (i.e., have the md package install a cron job to >implement a periodict scrub pass automatically). Fedora 11 added a cron job to kick off a RAID check for each Linux MD RAID array every week. Combined with running mdmonitor, root will get an email on any failure. The other thing about this thread is that the only RAID implementation that is being discussed here is the MD RAID stack. There are a lot of RAID implementations that have the same issues: - motherboard (aka "fake") RAID - In Linux this is typically mapped with device mapper via dmraid; AFAIK there is not a tool to scrub (or even monitor the status of and notify on failure) a Linux DM RAID setup. - hardware RAID cards without battery backup - these have the exact same issues because they cannot guarantee all writes complete, nor can they keep track of incomplete writes across power failures - hardware RAID cards _with_ battery backup but that don't periodically test the battery and have a way to notify you of battery failure while Linux is running The issues being raised here are not specific to extX, MD RAID, or Linux at all; they are problems with non-"enterprise-class" RAID setups. There's a reason enterprise-class RAID costs a lot more money than the card you can pick up at Fry's. There's no reason to document the design issues of general RAID implementations in the Linux kernel. 
-- Chris Adams <cmadams@hiwaay.net> Systems and Network Administrator - HiWAAY Internet Services I don't speak for anybody but myself - that's enough trouble. ^ permalink raw reply [flat|nested] 309+ messages in thread
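[Editorial note: the mdmonitor setup Chris describes is plain mdadm and works on any distribution. A sketch -- the mail address is illustrative:]

```shell
# /etc/mdadm.conf can name the recipient for event mail:
#   MAILADDR root@example.com

# Monitor all configured arrays as a daemon; mdadm mails on Fail,
# DegradedArray and SparesMissing events.  --test sends one message
# per array at startup, confirming that mail delivery actually works.
mdadm --monitor --scan --daemonise --test --mail root@example.com
```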
* Re: [patch] document flash/RAID dangers 2009-08-26 13:40 ` Chris Adams @ 2009-08-26 13:47 ` Alan Cox 2009-08-26 14:11 ` Chris Adams 2009-08-27 21:50 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: Alan Cox @ 2009-08-26 13:47 UTC (permalink / raw) To: Chris Adams; +Cc: Theodore Tso, linux-kernel > The issues being raised here are not specific to extX, MD RAID, or Linux > at all; they are problems with non-"enterprise-class" RAID setups. > There's a reason enterprise-class RAID costs a lot more money than the > card you can pick up at Fry's. And you will still need backups ;) A long time ago I worked on a fault tolerant news server with dual alphaserver boxes and a shared disk array. A power system failure took out both the alpha boxes and the disk controllers and all the disks. Fortunately it was a news server so you just had to wait a week .. Alan ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 13:47 ` Alan Cox @ 2009-08-26 14:11 ` Chris Adams 0 siblings, 0 replies; 309+ messages in thread From: Chris Adams @ 2009-08-26 14:11 UTC (permalink / raw) To: Alan Cox; +Cc: Theodore Tso, linux-kernel Once upon a time, Alan Cox <alan@lxorguk.ukuu.org.uk> said: > > The issues being raised here are not specific to extX, MD RAID, or Linux > > at all; they are problems with non-"enterprise-class" RAID setups. > > There's a reason enterprise-class RAID costs a lot more money than the > > card you can pick up at Fry's. > > And you will still need backups ;) Yep. RAID (of any class) != fail safe. > A long time ago I worked on a fault tolerant news server with dual > alphaserver boxes and a shared disk array. A power system failure took > out both the alpha boxes and the disk controllers and all the disks. Hey, that's not funny! I'm typing this on a dual AlphaServer cluster with a shared disk array (with dual battery backup even), and we had a power failure at the NOC yesterday (that then tripped a breaker, although it was between the generator and the UPS, so nothing went down). No matter how redundant you make things, a "no single point of failure" setup still can fail, often in "interesting" ways that nobody anticipated. -- Chris Adams <cmadams@hiwaay.net> Systems and Network Administrator - HiWAAY Internet Services I don't speak for anybody but myself - that's enough trouble. ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 13:40 ` Chris Adams 2009-08-26 13:47 ` Alan Cox @ 2009-08-27 21:50 ` Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-27 21:50 UTC (permalink / raw) To: Chris Adams; +Cc: Theodore Tso, linux-kernel Hi! > The other thing about this thread is that the only RAID implementation > that is being discussed here is the MD RAID stack. There are a lot of > RAID implementations that have the same issues: > > - motherboard (aka "fake") RAID - In Linux this is typically mapped with > device mapper via dmraid; AFAIK there is not a tool to scrub (or even > monitor the status of and notify on failure) a Linux DM RAID setup. > > - hardware RAID cards without battery backup - these have the exact same > issues because they cannot guarantee all writes complete, nor can they > keep track of incomplete writes across power failures > > - hardware RAID cards _with_ battery backup but that don't periodically > test the battery and have a way to notify you of battery failure while > Linux is running > > The issues being raised here are not specific to extX, MD RAID, or Linux > at all; they are problems with non-"enterprise-class" RAID setups. > There's a reason enterprise-class RAID costs a lot more money than the > card you can pick up at Fry's. > > There's no reason to document the design issues of general RAID > implementations in the Linux kernel. Even when we carry one of those misdesigned implementations in-tree? (Note that fixed implementations do exist -- AIX? -- just add a journal). 'I won't tell you that this pony bites, because many ponies do bite'? WTF? I thought we had a higher moral standard than this. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 11:58 ` Ric Wheeler 2009-08-26 12:40 ` Theodore Tso @ 2009-08-29 9:38 ` Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-29 9:38 UTC (permalink / raw) To: Ric Wheeler Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet >> Example I seen went like this: >> >> Drive in raid 5 failed; hot spare was available (no idea about >> UPS). System apparently locked up trying to talk to the failed drive, >> or maybe admin just was not patient enough, so he just powercycled the >> array. He lost the array. >> >> So while most people will not agressively powercycle the RAID array, >> drive failure still provokes little tested error paths, and getting >> unclean shutdown is quite easy in such case. > > Then what we need to document is do not power cycle an array during a > rebuild, right? Yep, that and the fact that you should fsck if you do. > If it wasn't the admin that timed out and the box really was hung (no > drive activity lights, etc), you will need to power cycle/reboot but > then you should not have this active rebuild issuing writes either... Ok, I guess you are right here. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:20 ` Pavel Machek 2009-08-26 0:26 ` david 2009-08-26 0:28 ` Ric Wheeler @ 2009-08-26 4:24 ` Rik van Riel 2009-08-26 11:22 ` Pavel Machek 2 siblings, 1 reply; 309+ messages in thread From: Rik van Riel @ 2009-08-26 4:24 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Pavel Machek wrote: > Look, I don't need full drive failure for this to happen. I can just > remove one disk from array. I don't need power failure, I can just > press the power button. I don't even need to rebuild anything, I can > just write to degraded array. > > Given that all events are under my control, statistics make little > sense here. I recommend a sledgehammer. If you want to lose your data, you might as well have some fun. No need to bore yourself to tears by simulating events that are unlikely to happen simultaneously to careful system administrators. -- All rights reversed. ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 4:24 ` Rik van Riel @ 2009-08-26 11:22 ` Pavel Machek 2009-08-26 14:45 ` Rik van Riel 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-26 11:22 UTC (permalink / raw) To: Rik van Riel Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed 2009-08-26 00:24:30, Rik van Riel wrote: > Pavel Machek wrote: > >> Look, I don't need full drive failure for this to happen. I can just >> remove one disk from array. I don't need power failure, I can just >> press the power button. I don't even need to rebuild anything, I can >> just write to degraded array. >> >> Given that all events are under my control, statistics make little >> sense here. > > I recommend a sledgehammer. > > If you want to lose your data, you might as well have some fun. > > No need to bore yourself to tears by simulating events that are > unlikely to happen simultaneously to careful system administrators. A sledgehammer is a hardware problem; I'm demonstrating a software/documentation problem we have here. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 11:22 ` Pavel Machek @ 2009-08-26 14:45 ` Rik van Riel 2009-08-29 9:39 ` Pavel Machek 0 siblings, 1 reply; 309+ messages in thread From: Rik van Riel @ 2009-08-26 14:45 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Pavel Machek wrote: > Sledgehammer is hardware problem, and I'm demonstrating > software/documentation problem we have here. So your argument is that a sledgehammer is a hardware problem, while a broken hard disk and a power failure are software/documentation issues? I'd argue that the broken hard disk and power failure are hardware issues, too. -- All rights reversed. ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 14:45 ` Rik van Riel @ 2009-08-29 9:39 ` Pavel Machek 2009-08-29 11:47 ` Ron Johnson 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-29 9:39 UTC (permalink / raw) To: Rik van Riel Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed 2009-08-26 10:45:44, Rik van Riel wrote: > Pavel Machek wrote: > >> Sledgehammer is hardware problem, and I'm demonstrating >> software/documentation problem we have here. > > So your argument is that a sledgehammer is a hardware > problem, while a broken hard disk and a power failure > are software/documentation issues? > > I'd argue that the broken hard disk and power failure > are hardware issues, too. No one told me that degraded md raid5 is dangerous. That's documentation issue #1. Maybe I just pulled the disk for fun. ext3 docs told me that the journal protects me against fs corruption during power fails. It does not in this particular case. Seems like docs issue #2. Maybe I just hit the reset button because it was there. Randomly hitting the power button may be stupid, but it should not result in filesystem corruption on a reasonably working filesystem/storage stack. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-29 9:39 ` Pavel Machek @ 2009-08-29 11:47 ` Ron Johnson 2009-08-29 16:12 ` jim owens 0 siblings, 1 reply; 309+ messages in thread From: Ron Johnson @ 2009-08-29 11:47 UTC (permalink / raw) To: linux-ext4; +Cc: Rik van Riel, Ric Wheeler, Theodore Tso, corbet On 2009-08-29 04:39, Pavel Machek wrote: > On Wed 2009-08-26 10:45:44, Rik van Riel wrote: >> Pavel Machek wrote: >> >>> Sledgehammer is hardware problem, and I'm demonstrating >>> software/documentation problem we have here. >> So your argument is that a sledgehammer is a hardware >> problem, while a broken hard disk and a power failure >> are software/documentation issues? >> >> I'd argue that the broken hard disk and power failure >> are hardware issues, too. > > Noone told me that degraded md raid5 is dangerous. Thats documentation > issue #1. Maybe I just pulled the disk for fun. You're kidding, right? Or are you being too effectively sarcastic? -- Obsession with "preserving cultural heritage" is a racist impediment to moral, physical and intellectual progress. ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-29 11:47 ` Ron Johnson @ 2009-08-29 16:12 ` jim owens 0 siblings, 0 replies; 309+ messages in thread From: jim owens @ 2009-08-29 16:12 UTC (permalink / raw) To: Ron Johnson; +Cc: linux-ext4, Rik van Riel, Ric Wheeler, Theodore Tso, corbet Ron Johnson wrote: > On 2009-08-29 04:39, Pavel Machek wrote: >> Noone told me that degraded md raid5 is dangerous. Thats documentation >> issue #1. Maybe I just pulled the disk for fun. > > You're kidding, right? No he is not... and that is exactly why Ted and Ric have been fighting so hard against his scare-the-children documentation. In 20 years, I have not found a way to educate those who think "I know computers so it must work the way I want and expect." Tremendous amounts of information and recommendations are out there on the web, in books, classes, etc. But people don't research before using or understand before they have a problem. Pavel Machek wrote: > It is not only for system administrators; I was trying to find > out if kernel is buggy, and that should be in kernel tree. Pavel, *THE KERNEL IS NOT BUGGY* end of story! Everyone experienced in storage understands the "in the edge case that Pavel hit, you will lose your data", and we take our responsibility to tell people what works and does not work very seriously. And we try very hard to reduce the amount of edge case data losses. But as Ric and Ted and many others keep trying to explain: - There is no such thing as "never fails" data storage. - The goal of journal file systems is not what you think. - The goal of raid is not what you think. - We do not want the vast majority of computer users who are not kernel engineers to stop using the technology that in 99.99 percent of the use cases keeps their data as safe as we can reasonably make it, just because they read Pavel's 0.01 percent scary and inaccurate case. And the worst part is that this 0.01 percent case problem is really "I did not know what I was doing". 
jim ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-25 23:37 ` Pavel Machek 2009-08-25 23:48 ` Ric Wheeler @ 2009-08-25 23:56 ` david 2009-08-26 0:12 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: david @ 2009-08-25 23:56 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Pavel Machek wrote: > There are storage devices that high highly undesirable properties > when they are disconnected or suffer power failures while writes are > in progress; such devices include flash devices and MD RAID 4/5/6 > arrays. change this to say 'degraded MD RAID 4/5/6 arrays' also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly suspect that they do) then you need to add a note that if the array becomes degraded before a scrub cycle happens, previously hidden damage (that would have been repaired by the scrub) can surface. > These devices have the property of potentially corrupting blocks being > written at the time of the power failure, this is true of all devices > and worse yet, amplifying the region where blocks are corrupted such > that additional sectors are also damaged during the power failure. re-word this something like In addition to the standard risk of corrupting the blocks being written at the time of the power failure, additional blocks (in the same flash eraseblock or raid stripe) may also be corrupted. > Users who use such storage devices are well advised take > countermeasures, such as the use of Uninterruptible Power Supplies, > and making sure the flash device is not hot-unplugged while the device > is being used. Regular backups when using these devices is also a > Very Good Idea. > > Otherwise, file systems placed on these devices can suffer silent data > and file system corruption.
> An forced use of fsck may detect corruption resulting in file system corruption, but will not suffice > to detect data corruption. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
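The erase-block amplification described above can be sketched numerically. This is an illustrative model only: the helper names, the 512-byte sector assumption, and the erase-block sizes used below are assumptions for the example, not any real device's geometry.

```c
#include <stdint.h>

/* Illustrative sketch (not kernel code): given a 512-byte sector number
 * and a flash erase-block size, compute which sectors share the same
 * erase block and are therefore at risk if power fails while the device
 * internally erases and rewrites that block. */

#define SECTOR_SIZE 512u

/* First sector of the erase block containing 'sector'. */
static uint64_t eraseblock_first_sector(uint64_t sector, uint32_t erase_bytes)
{
    uint64_t sectors_per_block = erase_bytes / SECTOR_SIZE;
    return sector - (sector % sectors_per_block);
}

/* Number of sectors that may be trashed alongside the one written. */
static uint64_t eraseblock_sectors(uint32_t erase_bytes)
{
    return erase_bytes / SECTOR_SIZE;
}
```

With a 128 KiB erase block, a write to sector 1000 puts sectors 768..1023 at risk, which is exactly the "sectors around the one you were trying to write" failure the proposed documentation warns about.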
* Re: [patch] document flash/RAID dangers 2009-08-25 23:56 ` david @ 2009-08-26 0:12 ` Pavel Machek 2009-08-26 0:20 ` david 2009-08-26 0:26 ` Ric Wheeler 0 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-26 0:12 UTC (permalink / raw) To: david Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue 2009-08-25 16:56:40, david@lang.hm wrote: > On Wed, 26 Aug 2009, Pavel Machek wrote: > >> There are storage devices that high highly undesirable properties >> when they are disconnected or suffer power failures while writes are >> in progress; such devices include flash devices and MD RAID 4/5/6 >> arrays. > > change this to say 'degraded MD RAID 4/5/6 arrays' > > also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly > suspect that they do) I changed it to say MD/DM. > then you need to add a note that if the array becomes degraded before a > scrub cycle happens previously hidden damage (that would have been > repaired by the scrub) can surface. I'd prefer not to talk about scrubbing and such details here. Better leave warning here and point to MD documentation. >> THESE devices have the property of potentially corrupting blocks being >> written at the time of the power failure, > > this is true of all devices Actually I don't think so. I believe SATA disks do not corrupt even the sector they are writing to -- they just have big enough capacitors. And yes I believe ext3 depends on that. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:12 ` Pavel Machek @ 2009-08-26 0:20 ` david 2009-08-26 0:39 ` Pavel Machek 2009-08-26 0:26 ` Ric Wheeler 1 sibling, 1 reply; 309+ messages in thread From: david @ 2009-08-26 0:20 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Pavel Machek wrote: > On Tue 2009-08-25 16:56:40, david@lang.hm wrote: >> On Wed, 26 Aug 2009, Pavel Machek wrote: >> >>> There are storage devices that high highly undesirable properties >>> when they are disconnected or suffer power failures while writes are >>> in progress; such devices include flash devices and MD RAID 4/5/6 >>> arrays. >> >> change this to say 'degraded MD RAID 4/5/6 arrays' >> >> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly >> suspect that they do) > > I changed it to say MD/DM. > >> then you need to add a note that if the array becomes degraded before a >> scrub cycle happens previously hidden damage (that would have been >> repaired by the scrub) can surface. > > I'd prefer not to talk about scrubing and such details here. Better > leave warning here and point to MD documentation. I disagree with that, the way you are wording this makes it sound as if raid isn't worth it. if you are going to say that raid is risky you need to properly specify when it is risky >>> THESE devices have the property of potentially corrupting blocks being >>> written at the time of the power failure, >> >> this is true of all devices > > Actually I don't think so. I believe SATA disks do not corrupt even > the sector they are writing to -- they just have big enough > capacitors. And yes I believe ext3 depends on that. you are incorrect on this. 
ext3 (like every other filesystem) just accepts the risk (zfs makes some attempt to detect such corruption) David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:20 ` david @ 2009-08-26 0:39 ` Pavel Machek 2009-08-26 1:17 ` david 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-26 0:39 UTC (permalink / raw) To: david Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue 2009-08-25 17:20:13, david@lang.hm wrote: > On Wed, 26 Aug 2009, Pavel Machek wrote: > >> On Tue 2009-08-25 16:56:40, david@lang.hm wrote: >>> On Wed, 26 Aug 2009, Pavel Machek wrote: >>> >>>> There are storage devices that high highly undesirable properties >>>> when they are disconnected or suffer power failures while writes are >>>> in progress; such devices include flash devices and MD RAID 4/5/6 >>>> arrays. >>> >>> change this to say 'degraded MD RAID 4/5/6 arrays' >>> >>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly >>> suspect that they do) >> >> I changed it to say MD/DM. >> >>> then you need to add a note that if the array becomes degraded before a >>> scrub cycle happens previously hidden damage (that would have been >>> repaired by the scrub) can surface. >> >> I'd prefer not to talk about scrubing and such details here. Better >> leave warning here and point to MD documentation. > > I disagree with that, the way you are wording this makes it sound as if > raid isn't worth it. if you are going to say that raid is risky you need > to properly specify when it is risky Ok, would this help? I don't really want to go to scrubbing details. (*) Degraded array or single disk failure "near" the powerfail is necessary for this property of RAID arrays to bite. >>>> THESE devices have the property of potentially corrupting blocks being >>>> written at the time of the power failure, >>> >>> this is true of all devices >> >> Actually I don't think so.
I believe SATA disks do not corrupt even >> the sector they are writing to -- they just have big enough >> capacitors. And yes I believe ext3 depends on that. > > you are incorrect on this. > > ext3 (like every other filesystem) just accepts the risk (zfs makes some > attempt to detect such corruption) I'd like Ted to comment on this. He wrote the original document, and I'd prefer not to introduce mistakes. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:39 ` Pavel Machek @ 2009-08-26 1:17 ` david 0 siblings, 0 replies; 309+ messages in thread From: david @ 2009-08-26 1:17 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Pavel Machek wrote: > On Tue 2009-08-25 17:20:13, david@lang.hm wrote: >> On Wed, 26 Aug 2009, Pavel Machek wrote: >> >>> On Tue 2009-08-25 16:56:40, david@lang.hm wrote: >>>> On Wed, 26 Aug 2009, Pavel Machek wrote: >>>> >>>>> There are storage devices that high highly undesirable properties >>>>> when they are disconnected or suffer power failures while writes are >>>>> in progress; such devices include flash devices and MD RAID 4/5/6 >>>>> arrays. >>>> >>>> change this to say 'degraded MD RAID 4/5/6 arrays' >>>> >>>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly >>>> suspect that they do) >>> >>> I changed it to say MD/DM. >>> >>>> then you need to add a note that if the array becomes degraded before a >>>> scrub cycle happens previously hidden damage (that would have been >>>> repaired by the scrub) can surface. >>> >>> I'd prefer not to talk about scrubing and such details here. Better >>> leave warning here and point to MD documentation. >> >> I disagree with that, the way you are wording this makes it sound as if >> raid isn't worth it. if you are going to say that raid is risky you need >> to properly specify when it is risky > > Ok, would this help? I don't really want to go to scrubbing details. > > (*) Degraded array or single disk failure "near" the powerfail is > neccessary for this property of RAID arrays to bite. that sounds reasonable David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:12 ` Pavel Machek 2009-08-26 0:20 ` david @ 2009-08-26 0:26 ` Ric Wheeler 2009-08-26 0:44 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 0:26 UTC (permalink / raw) To: Pavel Machek Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 08:12 PM, Pavel Machek wrote: > On Tue 2009-08-25 16:56:40, david@lang.hm wrote: >> On Wed, 26 Aug 2009, Pavel Machek wrote: >> >>> There are storage devices that high highly undesirable properties >>> when they are disconnected or suffer power failures while writes are >>> in progress; such devices include flash devices and MD RAID 4/5/6 >>> arrays. >> >> change this to say 'degraded MD RAID 4/5/6 arrays' >> >> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly >> suspect that they do) > > I changed it to say MD/DM. > >> then you need to add a note that if the array becomes degraded before a >> scrub cycle happens previously hidden damage (that would have been >> repaired by the scrub) can surface. > > I'd prefer not to talk about scrubing and such details here. Better > leave warning here and point to MD documentation. Then you should punt the MD discussion to the MD documentation entirely. I would suggest: "Users of any file system on a single medium (SSD, flash or normal disk) can suffer from catastrophic and complete data loss if that single medium fails. To reduce your exposure to data loss after a single point of failure, consider using either hardware or properly configured software RAID. See the documentation on MD RAID for how to configure it. To ensure proper fsync() semantics, you will need to have a storage device that supports write barriers or have a non-volatile write cache. If not, best practices dictate disabling the write cache on the storage device."
> >>> THESE devices have the property of potentially corrupting blocks being >>> written at the time of the power failure, >> >> this is true of all devices > > Actually I don't think so. I believe SATA disks do not corrupt even > the sector they are writing to -- they just have big enough > capacitors. And yes I believe ext3 depends on that. > Pavel Pavel, no S-ATA drive has capacitors to hold up during a power failure (or even enough power to destage their write cache). I know this from direct, personal knowledge having built RAID boxes at EMC for years. In fact, almost all RAID boxes require that the write cache be hardwired to off when used in their arrays. Drives fail partially on a very common basis - look at your remapped sector count with smartctl. RAID (including MD RAID5) will protect you from this most common error as it will protect you from complete drive failure which is also an extremely common event. Your scenario is really, really rare - doing a full rebuild after a complete drive failure (takes a matter of hours, depends on the size of the disk) and having a power failure during that rebuild. Of course adding a UPS to any storage system (including MD RAID system) helps make it more reliable, specifically in your scenario. The more important point is that having any RAID (MD1, MD5 or MD6) will greatly reduce your chance of data loss if configured correctly. With ext3, ext2 or zfs. Ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:26 ` Ric Wheeler @ 2009-08-26 0:44 ` Pavel Machek 2009-08-26 0:50 ` Ric Wheeler 2009-08-26 1:19 ` david 0 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-26 0:44 UTC (permalink / raw) To: Ric Wheeler Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet >>>> THESE devices have the property of potentially corrupting blocks being >>>> written at the time of the power failure, >>> >>> this is true of all devices >> >> Actually I don't think so. I believe SATA disks do not corrupt even >> the sector they are writing to -- they just have big enough >> capacitors. And yes I believe ext3 depends on that. > > Pavel, no S-ATA drive has capacitors to hold up during a power failure > (or even enough power to destage their write cache). I know this from > direct, personal knowledge having built RAID boxes at EMC for years. In > fact, almost all RAID boxes require that the write cache be hardwired to > off when used in their arrays. I never claimed they have enough power to flush entire cache -- read the paragraph again. I do believe the disks have enough capacitors to finish writing single sector, and I do believe ext3 depends on that. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 0:44 ` Pavel Machek @ 2009-08-26 0:50 ` Ric Wheeler 2009-08-26 1:19 ` david 1 sibling, 0 replies; 309+ messages in thread From: Ric Wheeler @ 2009-08-26 0:50 UTC (permalink / raw) To: Pavel Machek Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On 08/25/2009 08:44 PM, Pavel Machek wrote: > >>>>> THESE devices have the property of potentially corrupting blocks being >>>>> written at the time of the power failure, >>>> >>>> this is true of all devices >>> >>> Actually I don't think so. I believe SATA disks do not corrupt even >>> the sector they are writing to -- they just have big enough >>> capacitors. And yes I believe ext3 depends on that. >> >> Pavel, no S-ATA drive has capacitors to hold up during a power failure >> (or even enough power to destage their write cache). I know this from >> direct, personal knowledge having built RAID boxes at EMC for years. In >> fact, almost all RAID boxes require that the write cache be hardwired to >> off when used in their arrays. > > I never claimed they have enough power to flush entire cache -- read > the paragraph again. I do believe the disks have enough capacitors to > finish writing single sector, and I do believe ext3 depends on that. > > Pavel Some scary terms that drive people mention (and measure): "high fly writes" "over powered seeks" "adjacent track erasure" If you do get a partial track written, the data integrity bits that the data is embedded in will flag it as invalid and give you an IO error on the next read. Note that the damage is not persistent, it will get repaired (in place) on the next write to that sector. Also it is worth noting that ext2/3/4 write file system "blocks", not single sectors. Each ext3 IO is 8 distinct disk sector writes and those can span tracks on a drive, which requires a seek, and all of that consumes power.
On power loss, a disk will immediately park the heads... ric ^ permalink raw reply [flat|nested] 309+ messages in thread
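Ric's point that a torn sector comes back as a read error rather than silent garbage can be sketched with a toy integrity check. Everything here is invented for illustration (a real drive uses ECC, not an 8-bit XOR, and the struct layout is not any on-disk format):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy model: the drive stores a checksum next to each sector's payload.
 * A complete write updates both together; a partially written sector
 * fails the check and is reported as an IO error on the next read. */

#define PAYLOAD 16          /* toy sector payload size */

struct sector {
    uint8_t data[PAYLOAD];
    uint8_t csum;
};

static uint8_t csum8(const uint8_t *p, size_t n)
{
    uint8_t c = 0;
    while (n--)
        c ^= *p++;
    return c;
}

/* A complete write updates payload and checksum atomically. */
static void sector_write(struct sector *s, const uint8_t *buf)
{
    memcpy(s->data, buf, PAYLOAD);
    s->csum = csum8(s->data, PAYLOAD);
}

/* Returns 0 on success, -1 (an "IO error") on checksum mismatch. */
static int sector_read(const struct sector *s, uint8_t *buf)
{
    if (csum8(s->data, PAYLOAD) != s->csum)
        return -1;
    memcpy(buf, s->data, PAYLOAD);
    return 0;
}
```

Corrupting the payload without refreshing the checksum, as a torn write would, makes `sector_read()` fail instead of returning the damaged bytes; the next complete `sector_write()` repairs the sector in place, matching Ric's description.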
* Re: [patch] document flash/RAID dangers 2009-08-26 0:44 ` Pavel Machek 2009-08-26 0:50 ` Ric Wheeler @ 2009-08-26 1:19 ` david 2009-08-26 11:25 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: david @ 2009-08-26 1:19 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, 26 Aug 2009, Pavel Machek wrote: >>>>> THESE devices have the property of potentially corrupting blocks being >>>>> written at the time of the power failure, >>>> >>>> this is true of all devices >>> >>> Actually I don't think so. I believe SATA disks do not corrupt even >>> the sector they are writing to -- they just have big enough >>> capacitors. And yes I believe ext3 depends on that. >> >> Pavel, no S-ATA drive has capacitors to hold up during a power failure >> (or even enough power to destage their write cache). I know this from >> direct, personal knowledge having built RAID boxes at EMC for years. In >> fact, almost all RAID boxes require that the write cache be hardwired to >> off when used in their arrays. > > I never claimed they have enough power to flush entire cache -- read > the paragraph again. I do believe the disks have enough capacitors to > finish writing single sector, and I do believe ext3 depends on that. keep in mind that in a powerfail situation the data being sent to the drive may be corrupt (the ram gets flaky while a DMA to the drive copies the bad data to the drive, which writes it before the power loss gets bad enough for the drive to decide there is a problem and shutdown) you just plain cannot count on writes that are in flight when a powerfail happens to do predictable things, let alone what you consider sane or proper. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 1:19 ` david @ 2009-08-26 11:25 ` Pavel Machek 2009-08-26 12:37 ` Theodore Tso 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-26 11:25 UTC (permalink / raw) To: david Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Tue 2009-08-25 18:19:40, david@lang.hm wrote: > On Wed, 26 Aug 2009, Pavel Machek wrote: > >>>>>> THESE devices have the property of potentially corrupting blocks being >>>>>> written at the time of the power failure, >>>>> >>>>> this is true of all devices >>>> >>>> Actually I don't think so. I believe SATA disks do not corrupt even >>>> the sector they are writing to -- they just have big enough >>>> capacitors. And yes I believe ext3 depends on that. >>> >>> Pavel, no S-ATA drive has capacitors to hold up during a power failure >>> (or even enough power to destage their write cache). I know this from >>> direct, personal knowledge having built RAID boxes at EMC for years. In >>> fact, almost all RAID boxes require that the write cache be hardwired to >>> off when used in their arrays. >> >> I never claimed they have enough power to flush entire cache -- read >> the paragraph again. I do believe the disks have enough capacitors to >> finish writing single sector, and I do believe ext3 depends on that. > > keep in mind that in a powerfail situation the data being sent to the > drive may be corrupt (the ram gets flaky while a DMA to the drive copies > the bad data to the drive, which writes it before the power loss gets bad > enough for the drive to decide there is a problem and shutdown) > > you just plain cannot count on writes that are in flight when a powerfail > happens to do predictable things, let alone what you consider sane or > proper. From what I see, this kind of failure is rather harder to reproduce than the software problems.
And at least SGI machines were designed to avoid this... Anyway, I'd like to hear from ext3 people... what happens on read errors in journal? That's what you'd expect to see in situation above. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-26 11:25 ` Pavel Machek @ 2009-08-26 12:37 ` Theodore Tso 2009-08-30 6:49 ` Pavel Machek 2009-08-30 6:49 ` Pavel Machek 0 siblings, 2 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-26 12:37 UTC (permalink / raw) To: Pavel Machek Cc: david, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Wed, Aug 26, 2009 at 01:25:36PM +0200, Pavel Machek wrote: > > you just plain cannot count on writes that are in flight when a powerfail > > happens to do predictable things, let alone what you consider sane or > > proper. > > From what I see, this kind of failure is rather harder to reproduce > than the software problems. And at least SGI machines were designed to > avoid this... > > Anyway, I'd like to hear from ext3 people... what happens on read > errors in journal? That's what you'd expect to see in situation above. On a power failure, what normally happens is that the random garbage gets written into the disk drive's last dying gasp, since the memory starts going insane and sends garbage to the disk. So the disk successfully completes the write, but the sector contains garbage. Since HDD's tend to be last thing to die, being less sensitive to voltage drops than the memory or DMA controller, my experience is that you don't get a read error after the system comes up, you just get garbage written into the journal. The ext3 journalling code waits until all of the journal blocks are written, and only then writes the commit block. On restart, we look for the last valid commit block. So if the power failure is before we write the commit block, we replay the journal up until the previous commit block. If the power failure is while we are writing the commit block, garbage will be written out instead of the commit block, and so it falls back to the previous case.
We do not allow any updates to the filesystem metadata to take place until the commit block has been written; therefore the filesystem stays consistent. If the journal *does* develop read errors, then the filesystem will require a manual fsck, and so the boot operation will get stopped so a system administrator can provide manual intervention. The best bet for the sysadmin is to replay as much of the journal as she can, and then let fsck fix any resulting filesystem inconsistencies. In practice, though, I've not experienced or seen any reports of this happening from a power failure; usually it happens if the laptop gets dropped or the hard drive suffers some other kind of hardware failure. - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
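The recovery rule Ted describes, replaying only up to the last valid commit block, can be sketched as a toy scan. The record layout and the `valid` flag here are invented for illustration; the real jbd layer uses magic numbers and transaction sequence IDs rather than a boolean:

```c
#include <stddef.h>

/* Toy model of journal recovery: scan the journal in order and allow
 * replay only of records up to the last commit block that was written
 * intact.  A garbled commit block ends replay at the previous commit,
 * exactly as described for the powerfail-during-commit case. */

enum rec_type { REC_DATA, REC_COMMIT };

struct jrec {
    enum rec_type type;
    int valid;              /* 0 if the block contains powerfail garbage */
};

/* Return how many leading records may be replayed: everything up to and
 * including the last valid commit block. */
static size_t replay_limit(const struct jrec *j, size_t n)
{
    size_t limit = 0;
    for (size_t i = 0; i < n; i++) {
        if (!j[i].valid)
            break;          /* garbage: stop scanning */
        if (j[i].type == REC_COMMIT)
            limit = i + 1;  /* this transaction fully committed */
    }
    return limit;
}
```

A journal whose final commit block was garbled by the power failure replays only through the previous commit, so half-written transactions never touch the filesystem metadata.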
* Re: [patch] document flash/RAID dangers 2009-08-26 12:37 ` Theodore Tso @ 2009-08-30 6:49 ` Pavel Machek 2009-08-30 6:49 ` Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-30 6:49 UTC (permalink / raw) To: Theodore Tso, david, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley On Wed 2009-08-26 08:37:09, Theodore Tso wrote: > On Wed, Aug 26, 2009 at 01:25:36PM +0200, Pavel Machek wrote: > > > you just plain cannot count on writes that are in flight when a powerfail > > > happens to do predictable things, let alone what you consider sane or > > > proper. > > > > From what I see, this kind of failure is rather harder to reproduce > > than the software problems. And at least SGI machines were designed to > > avoid this... > > > > Anyway, I'd like to hear from ext3 people... what happens on read > > errors in journal? That's what you'd expect to see in situation above. > > On a power failure, what normally happens is that the random garbage > gets written into the disk drive's last dying gasp, since the memory > starts going insane and sends garbage to the disk. So the disk > successfully completes the write, but the sector contains garbage. > Since HDD's tend to be last thing to die, being less sensitive to > voltage drops than the memory or DMA controller, my experience is that > you don't get a read error after the system comes up, you just get > garbage written into the journal. > > The ext3 journalling code waits until all of the journal code is > written, and only then writes the commit block. On restart, we look > for the last valid commit block. So if the power failure is before we > write the commit block, we replay the journal up until the previous > commit block. If the power failure is while we are writing the commit > block, garbage will be written out instead of the commit block, and so > it falls back to the previous case. 
> We do not allow any updates to the filesystem metadata to take place > until the commit block has been written; therefore the filesystem > stays consistent. Ok, cool. > If there the journal *does* develop read errors, then fsck will > require a manual fsck, and so the boot operation will get stopped so a > system administrator can provide manual intervention. The best bet > for the sysadmin is to replay as much of the journal she can, and then > let fsck fix any resulting filesystem inconsistencies. In practice, ...and that should result in consistent fs with no data loss, because read error is essentially the same as garbage given back, right? ...plus, this is significant difference from logical-logging filesystems, no? Should this go to Documentation/, somewhere? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] document flash/RAID dangers 2009-08-25 22:40 ` Pavel Machek 2009-08-25 22:59 ` david @ 2009-08-26 4:20 ` Rik van Riel 1 sibling, 0 replies; 309+ messages in thread From: Rik van Riel @ 2009-08-26 4:20 UTC (permalink / raw) To: Pavel Machek Cc: david, Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Pavel Machek wrote: > Lets say you are writing to the (healthy) RAID5 and have a powerfail. > > So now data blocks do not correspond to the parity block. You don't > yet have the corruption, but you already have a problem. > > If you get a disk failing at this point, you'll get corruption. Not necessarily. Say you wrote out the entire stripe in a 5 disk RAID 5 array, but only 3 data blocks and the parity block got written out before power failure. If the disk with the 4th (unwritten) data block were to fail and get taken out of the RAID 5 array, the degradation of the array could actually undo your data corruption. With RAID 5 and incomplete writes, you just don't know. This kind of thing could go wrong at any level in the system, with any kind of RAID 5 setup. Of course, on a single disk system without RAID you can still get incomplete writes, for the exact same reasons. RAID 5 does not make things worse. It will protect your data against certain failure modes, but not against others. With or without RAID, you still need to make backups. -- All rights reversed. ^ permalink raw reply [flat|nested] 309+ messages in thread
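The parity relation Rik describes can be illustrated with the XOR arithmetic RAID 5 uses. This is a one-byte-per-block toy, not MD code; it only shows that after an interrupted stripe write, reconstruction from a stale parity block yields data that no longer matches what was on the failed disk:

```c
#include <stdint.h>
#include <stddef.h>

/* RAID-5 sketch: parity is the XOR of the data blocks, and any one
 * missing block can be reconstructed by XOR-ing the survivors with the
 * parity.  If power fails mid-stripe, data and parity stop satisfying
 * this relation, so a later degraded-mode reconstruction returns
 * whatever the stale XOR happens to produce. */

/* Parity of n data blocks (one byte per block for simplicity). */
static uint8_t raid5_parity(const uint8_t *data, size_t n)
{
    uint8_t p = 0;
    for (size_t i = 0; i < n; i++)
        p ^= data[i];
    return p;
}

/* Reconstruct block 'missing' from the surviving blocks plus parity. */
static uint8_t raid5_reconstruct(const uint8_t *data, size_t n,
                                 size_t missing, uint8_t parity)
{
    uint8_t v = parity;
    for (size_t i = 0; i < n; i++)
        if (i != missing)
            v ^= data[i];
    return v;
}
```

With a consistent stripe, `raid5_reconstruct()` returns exactly the missing block; update one data block without updating parity (the interrupted-write case) and the reconstruction of a *different*, untouched block silently comes back wrong, which is the "damage surfaces only when the array degrades" failure discussed in this thread.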
* [patch] document that ext2 can't handle barriers 2009-08-25 16:11 ` Theodore Tso 2009-08-25 22:21 ` Pavel Machek @ 2009-08-25 22:27 ` Pavel Machek 2009-08-25 22:27 ` Pavel Machek 2 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-25 22:27 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Document things ext2 expects from the storage subsystem, and the fact that it can not handle barriers. Also remove the journaling description, as that's really ext3 material. Signed-off-by: Pavel Machek <pavel@ucw.cz> diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index 67639f9..e300ca8 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -338,27 +339,17 @@ enough 4-character names to make up unique directory entries, so they have to be 8 character filenames, even then we are fairly close to running out of unique filenames. +Requirements +============ + +Ext2 expects disk/storage subsystem not to return write errors. + +It also needs write caching to be disabled for reliable fsync +operation; ext2 does not know how to issue barriers as of +2.6.31. hdparm -W0 disables it on SATA disks. + Journaling ----------- - -A journaling extension to the ext2 code has been developed by Stephen -Tweedie. It avoids the risks of metadata corruption and the need to -wait for e2fsck to complete after a crash, without requiring a change -to the on-disk ext2 layout. In a nutshell, the journal is a regular -file which stores whole metadata (and optionally data) blocks that have -been modified, prior to writing them into the filesystem. This means -it is possible to add a journal to an existing ext2 filesystem without -the need for data conversion. - -When changes to the filesystem (e.g.
a file is renamed) they are stored in -a transaction in the journal and can either be complete or incomplete at -the time of a crash. If a transaction is complete at the time of a crash -(or in the normal case where the system does not crash), then any blocks -in that transaction are guaranteed to represent a valid filesystem state, -and are copied into the filesystem. If a transaction is incomplete at -the time of the crash, then there is no guarantee of consistency for -the blocks in that transaction so they are discarded (which means any -filesystem changes they represent are also lost). +========== Check Documentation/filesystems/ext3.txt if you want to read more about ext3 and journaling. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 0:08 ` Theodore Tso 2009-08-25 9:42 ` Pavel Machek 2009-08-25 9:42 ` Pavel Machek @ 2009-08-27 3:34 ` Rob Landley 2009-08-27 8:46 ` David Woodhouse 3 siblings, 0 replies; 309+ messages in thread From: Rob Landley @ 2009-08-27 3:34 UTC (permalink / raw) To: Theodore Tso Cc: Pavel Machek, Ric Wheeler, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Monday 24 August 2009 19:08:42 Theodore Tso wrote: > And if your > claim is that several hundred lines of fsck output detailing the > filesystem's destruction somehow makes things all better, I suspect > most users would disagree with you. Suppose a small office makes nightly backups to an offsite server via rsync. If a thunderstorm goes by causing their system to reboot twice in a 15 minute period, would they rather notice the filesystem corruption immediately upon reboot, or notice after the next rsync? > In any case, depending on where the flash was writing at the time of > the unplug, the data corruption could be silent anyway. Yup. Hopefully btrfs will cope less badly? They keep talking about checksumming extents... > Maybe this came as a surprise to you, but anyone who has used a > compact flash in a digital camera knows that you ***have*** to wait > until the led has gone out before trying to eject the flash card. I doubt the cupholder crowd is going to stop treating USB sticks as magical any time soon, but I also wonder how many of them even remember Linux _exists_ anymore. > I > remember seeing all sorts of horror stories from professional > photographers about how they lost an important wedding's day worth of > pictures with the attendant commercial loss, on various digital > photography forums. It tends to be the sort of mistake that digital > photographers only make once. 
Professionals have horror stories about this issue, therefore documenting it is _less_ important? Ok... Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 0:08 ` Theodore Tso ` (2 preceding siblings ...) 2009-08-27 3:34 ` [patch] ext2/3: document conditions when reliable operation is possible Rob Landley @ 2009-08-27 8:46 ` David Woodhouse 2009-08-28 14:46 ` david 3 siblings, 1 reply; 309+ messages in thread From: David Woodhouse @ 2009-08-27 8:46 UTC (permalink / raw) To: Theodore Tso Cc: Pavel Machek, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Mon, 2009-08-24 at 20:08 -0400, Theodore Tso wrote: > > (It's worse with people using Digital SLR's shooting in raw mode, > since it can take upwards of 30 seconds or more to write out a 12-30MB > raw image, and if you eject at the wrong time, you can trash the > contents of the entire CF card; in the worst case, the Flash > Translation Layer data can get corrupted, and the card is completely > ruined; you can't even reformat it at the filesystem level, but have > to get a special Windows program from the CF manufacturer to --maybe-- > reset the FTL layer. This just goes to show why having this "translation layer" done in firmware on the device itself is a _bad_ idea. We're much better off when we have full access to the underlying flash and the OS can actually see what's going on. That way, we can actually debug, fix and recover from such problems. > Early CF cards were especially vulnerable to > this; more recent CF cards are better, but it's a known failure mode > of CF cards.) It's a known failure mode of _everything_ that uses flash to pretend to be a block device. As I see it, there are no SSD devices which don't lose data; there are only SSD devices which haven't lost your data _yet_. There's no fundamental reason why it should be this way; it just is. 
(I'm kind of hoping that the shiny new expensive ones that everyone's talking about right now, that I shouldn't really be slagging off, are actually OK. But they're still new, and I'm certainly not trusting them with my own data _quite_ yet.) -- dwmw2 ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-27 8:46 ` David Woodhouse @ 2009-08-28 14:46 ` david 2009-08-29 10:09 ` Pavel Machek 0 siblings, 1 reply; 309+ messages in thread From: david @ 2009-08-28 14:46 UTC (permalink / raw) To: David Woodhouse Cc: Theodore Tso, Pavel Machek, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Thu, 27 Aug 2009, David Woodhouse wrote: > On Mon, 2009-08-24 at 20:08 -0400, Theodore Tso wrote: >> >> (It's worse with people using Digital SLR's shooting in raw mode, >> since it can take upwards of 30 seconds or more to write out a 12-30MB >> raw image, and if you eject at the wrong time, you can trash the >> contents of the entire CF card; in the worst case, the Flash >> Translation Layer data can get corrupted, and the card is completely >> ruined; you can't even reformat it at the filesystem level, but have >> to get a special Windows program from the CF manufacturer to --maybe-- >> reset the FTL layer. > > This just goes to show why having this "translation layer" done in > firmware on the device itself is a _bad_ idea. We're much better off > when we have full access to the underlying flash and the OS can actually > see what's going on. That way, we can actually debug, fix and recover > from such problems. > >> Early CF cards were especially vulnerable to >> this; more recent CF cards are better, but it's a known failure mode >> of CF cards.) > > It's a known failure mode of _everything_ that uses flash to pretend to > be a block device. As I see it, there are no SSD devices which don't > lose data; there are only SSD devices which haven't lost your data > _yet_. > > There's no fundamental reason why it should be this way; it just is. > > (I'm kind of hoping that the shiny new expensive ones that everyone's > talking about right now, that I shouldn't really be slagging off, are > actually OK. 
But they're still new, and I'm certainly not trusting them > with my own data _quite_ yet.) so what sort of test would be needed to identify if a device has this problem? people can do ad-hoc tests by pulling the devices in use and then checking the entire device, but something better should be available. it seems to me that there are two things needed to define the tests. 1. a predictable write load so that it's easy to detect data getting lost 2. some statistical analysis to decide how many device pulls are needed (under the write load defined in #1) to make the odds high that the problem will be revealed. with this we could have people test various devices and report if the test detects unrelated data being lost (or businesses could, and I think the tech hardware sites would jump into this given some sort of accepted test) for USB devices there may be a way to use the power management functions to cut power to the device without requiring it to physically be pulled; if this is the case (even if this only works on some specific chipsets), it would drastically speed up the testing David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
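Point 1 can be sketched directly: fill every sector with a self-describing, checksummed pattern, so that after the power cut each sector read back can be classified. A hypothetical layout in Python (the field layout and names are my own, not from any existing tool):

```python
import struct
import zlib

SECTOR = 512

def make_sector(lba, generation):
    """Fill a sector with a self-describing, checksummed pattern so any
    sector that gets trashed, misplaced, or rolled back is detectable."""
    payload = struct.pack("<QQ", lba, generation) * (SECTOR // 16)
    body = payload[:SECTOR - 4]
    return body + struct.pack("<I", zlib.crc32(body))

def check_sector(lba, data):
    """Classify a sector read back after the power cut."""
    body, (crc,) = data[:-4], struct.unpack("<I", data[-4:])
    if zlib.crc32(body) != crc:
        return "torn"                       # partial or garbage write
    got_lba, generation = struct.unpack("<QQ", body[:16])
    if got_lba != lba:
        return "misplaced"                  # written to the wrong place
    return f"generation {generation}"       # intact; shows how old it is

s = make_sector(42, 7)
assert check_sector(42, s) == "generation 7"
assert check_sector(41, s) == "misplaced"
```

A test harness would sweep the device writing increasing generation numbers, cut power, then scan every sector and report anything torn, misplaced, or older than the last acknowledged generation.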
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-28 14:46 ` david @ 2009-08-29 10:09 ` Pavel Machek 2009-08-29 16:27 ` david 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-29 10:09 UTC (permalink / raw) To: david Cc: David Woodhouse, Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Fri 2009-08-28 07:46:42, david@lang.hm wrote: > On Thu, 27 Aug 2009, David Woodhouse wrote: > >> On Mon, 2009-08-24 at 20:08 -0400, Theodore Tso wrote: >>> >>> (It's worse with people using Digital SLR's shooting in raw mode, >>> since it can take upwards of 30 seconds or more to write out a 12-30MB >>> raw image, and if you eject at the wrong time, you can trash the >>> contents of the entire CF card; in the worst case, the Flash >>> Translation Layer data can get corrupted, and the card is completely >>> ruined; you can't even reformat it at the filesystem level, but have >>> to get a special Windows program from the CF manufacturer to --maybe-- >>> reset the FTL layer. >> >> This just goes to show why having this "translation layer" done in >> firmware on the device itself is a _bad_ idea. We're much better off >> when we have full access to the underlying flash and the OS can actually >> see what's going on. That way, we can actually debug, fix and recover >> from such problems. >> >>> Early CF cards were especially vulnerable to >>> this; more recent CF cards are better, but it's a known failure mode >>> of CF cards.) >> >> It's a known failure mode of _everything_ that uses flash to pretend to >> be a block device. As I see it, there are no SSD devices which don't >> lose data; there are only SSD devices which haven't lost your data >> _yet_. >> >> There's no fundamental reason why it should be this way; it just is. 
>> >> (I'm kind of hoping that the shiny new expensive ones that everyone's >> talking about right now, that I shouldn't really be slagging off, are >> actually OK. But they're still new, and I'm certainly not trusting them >> with my own data _quite_ yet.) > > so what sort of test would be needed to identify if a device has this > problem? > > people can do ad-hoc tests by pulling the devices in use and then > checking the entire device, but something better should be available. > > it seems to me that there are two things needed to define the tests. > > 1. a predictable write load so that it's easy to detect data getting lost > > 2. some statistical analysis to decide how many device pulls are needed > (under the write load defined in #1) to make the odds high that the > problem will be revealed. It's simpler than that. It usually breaks after the third unplug or so. > for USB devices there may be a way to use the power management functions > to cut power to the device without requiring it to physically be pulled, > if this is the case (even if this only works on some specific chipsets), > it would drastically speed up the testing This is really so easy to reproduce that such a speedup is not necessary. Just try the scripts :-). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-29 10:09 ` Pavel Machek @ 2009-08-29 16:27 ` david 2009-08-29 21:33 ` Pavel Machek 0 siblings, 1 reply; 309+ messages in thread From: david @ 2009-08-29 16:27 UTC (permalink / raw) To: Pavel Machek Cc: David Woodhouse, Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet On Sat, 29 Aug 2009, Pavel Machek wrote: > On Fri 2009-08-28 07:46:42, david@lang.hm wrote: >> >> >> so what sort of test would be needed to identify if a device has this >> problem? >> >> people can do ad-hoc tests by pulling the devices in use and then >> checking the entire device, but something better should be available. >> >> it seems to me that there are two things needed to define the tests. >> >> 1. a predictable write load so that it's easy to detect data getting lost >> >> 2. some statistical analysis to decide how many device pulls are needed >> (under the write load defined in #1) to make the odds high that the >> problem will be revealed. > > It's simpler than that. It usually breaks after the third unplug or so. > >> for USB devices there may be a way to use the power management functions >> to cut power to the device without requiring it to physically be pulled, >> if this is the case (even if this only works on some specific chipsets), >> it would drastically speed up the testing > > This is really so easy to reproduce that such a speedup is not > necessary. Just try the scripts :-). so if it doesn't get corrupted after 5 unplugs does that mean that that particular device doesn't have a problem? or does it just mean you got lucky? would 10 successful unplugs mean that it's safe? what about 20?
we need to get this beyond anecdotal evidence mode, to something that (even if not perfect, as you can get 100 'heads' in a row with an honest coin) gives you pretty good assurances that a particular device is either good or bad. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
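The statistics David Lang is asking for are just repeated Bernoulli trials: if a bad device corrupts data with probability p per unplug, then n clean unplugs leave probability (1-p)^n that the problem was simply missed. A sketch (p is of course exactly the unknown being argued about, so any chosen value is an assumption):

```python
import math

def pulls_needed(p_fail, confidence):
    """Smallest number of independent unplug trials needed so that a
    device which corrupts data with probability p_fail per unplug is
    caught with at least the given confidence: 1 - (1-p)^n >= c."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_fail))

# If a bad device loses data on 1 in 3 unplugs (roughly Pavel's
# "usually breaks after the third unplug"), then about 6 clean pulls
# give 90% confidence and about 12 give 99%:
assert pulls_needed(1/3, 0.90) == 6
assert pulls_needed(1/3, 0.99) == 12
```

On this model, 20 clean unplugs of a device that fails a third of the time would be missed with probability (2/3)^20, about 0.03%; a device that fails only once in 50 unplugs would still slip through 20 trials two times out of three.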
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-29 16:27 ` david @ 2009-08-29 21:33 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-29 21:33 UTC (permalink / raw) To: david Cc: David Woodhouse, Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet Hi! >> This is really so easy to reproduce that such a speedup is not >> necessary. Just try the scripts :-). > > so if it doesn't get corrupted after 5 unplugs does that mean that that > particular device doesn't have a problem? or does it just mean you got > lucky? > > would 10 successful unplugs mean that it's safe? > > what about 20? I'd say 20 means it's safe. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 22:39 ` Theodore Tso 2009-08-24 23:00 ` Pavel Machek @ 2009-08-24 23:00 ` Pavel Machek 2009-08-25 13:57 ` Chris Adams 2009-08-25 22:58 ` Neil Brown 3 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 23:00 UTC (permalink / raw) To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel Cc: corbet On Mon 2009-08-24 18:39:15, Theodore Tso wrote: > On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote: > > > I have to admit that I have not paid enough attention to this specifics > > > of your ext3 + flash card issue - is it the ftl stuff doing out of order > > > IO's? > > > > The problem is that flash cards destroy whole erase block on unplug, > > and ext3 can't cope with that. > > Sure --- but name **any** filesystem that can deal with the fact that > 128k or 256k worth of data might disappear when you pull out the flash > card while it is writing a single sector? First... I consider myself quite competent at the OS level, yet I did not realize what flash does and what that means for data integrity. That means we need some documentation, or maybe we should refuse to mount those devices r/w or something. Then to answer your question... ext2. You expect to run fsck after an unclean shutdown, and you expect to have to solve some problems with it. So the way ext2 deals with the flash media actually matches what the user expects. (*) OTOH in the ext3 case you expect a consistent filesystem after unplug, and you don't get that. > > > Your statement is overly broad - ext3 on a commercial RAID array that > > > does RAID5 or RAID6, etc has no issues that I know of. > > > > If your commercial RAID array is battery backed, maybe. But I was > > talking Linux MD here. ...
> If your concern is that with Linux MD, you could potentially lose an > entire stripe in RAID 5 mode, then you should say that explicitly; but > again, this isn't a filesystem specific claim; it's true for all > filesystems. I don't know of any file system that can survive having > a RAID stripe-shaped-hole blown into the middle of it due to a power > failure. Again, ext2 handles that in a way the user expects. At least I was taught "ext2 needs fsck after powerfail; ext3 can handle powerfails just ok". > I'll note, BTW, that AIX uses a journal to protect against these sorts > of problems with software raid; this also means that with AIX, you > also don't have to rebuild a RAID 1 device after an unclean shutdown, > like you have to do with Linux MD. This was on the EVMS's team > development list to implement for Linux, but it got canned after LVM > won out, lo those many years ago. C'est la vie; but it's a problem which > is solvable at the RAID layer, and which is traditionally and > historically solved in competent RAID implementations. Yep, we should add a journal to RAID; or at least write "Linux MD *needs* a UPS" in big and bold letters. I'm trying to do the second part. (Attached is the current version of the patch). [If you'd prefer a patch saying that MMC/USB flash/Linux MD arrays are generally unsafe to use without UPS/reliable connection/no kernel bugs... then I may try to push that. I was not sure... maybe some filesystem _can_ handle this kind of issue?] Pavel (*) Ok, now... the user expects to run fsck, but very advanced users may not expect old data to be damaged. Certainly I was not an advanced enough user a few months ago.
diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt new file mode 100644 index 0000000..d1ef4d0 --- /dev/null +++ b/Documentation/filesystems/expectations.txt @@ -0,0 +1,57 @@ +Linux block-backed filesystems can only work correctly when several +conditions are met in the block layer and below (disks, flash +cards). Some of them are obvious ("data on media should not change +randomly"), some are less so. Not all filesystems require all of these +to be satisfied for safe operation. + +Write errors not allowed (NO-WRITE-ERRORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writes to media never fail. Even if disk returns error condition +during write, filesystems can't handle that correctly. + + Fortunately writes failing are very uncommon on traditional + spinning disks, as they have spare sectors they use when write + fails. + +Don't cause collateral damage on a failed write (NO-COLLATERALS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On some storage systems, failed write (for example due to power +failure) kills data in adjacent (or maybe unrelated) sectors. + +Unfortunately, cheap USB/SD flash cards I've seen do have this bug, +and are thus unsuitable for all filesystems I know. + + An inherent problem with using flash as a normal block device + is that the flash erase size is bigger than most filesystem + sector sizes. So when you request a write, it may erase and + rewrite some 64k, 128k, or even a couple megabytes on the + really _big_ ones. + + If you lose power in the middle of that, filesystem won't + notice that data in the "sectors" _around_ the one you were + trying to write to got trashed. + + MD RAID-4/5/6 in degraded mode has a similar problem, stripes + behave similarly to eraseblocks.
+ + +Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either whole sector is correctly written or nothing is written during +powerfail. + + Because RAM tends to fail faster than rest of system during + powerfail, special hw killing DMA transfers may be necessary; + otherwise, disks may write garbage during powerfail. + This may be quite common on generic PC machines. + + Note that atomic write is very hard to guarantee for MD RAID-4/5/6, + because it needs to write both changed data, and parity, to + different disks. (But it will only really show up in degraded mode). + UPS for RAID array should help. + + + diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index 67639f9..ef9ff0f 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they have to be 8 character filenames, even then we are fairly close to running out of unique filenames. +Requirements +============ + +Ext2 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed (NO-WRITE-ERRORS) + +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL) + +and obviously: + +* don't cause collateral damage to adjacent sectors on a failed write + (NO-COLLATERALS) + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* write caching is disabled. ext2 does not know how to issue barriers + as of 2.6.28. hdparm -W0 disables it on SATA disks. + Journaling ----------- - -A journaling extension to the ext2 code has been developed by Stephen -Tweedie. 
It avoids the risks of metadata corruption and the need to -wait for e2fsck to complete after a crash, without requiring a change -to the on-disk ext2 layout. In a nutshell, the journal is a regular -file which stores whole metadata (and optionally data) blocks that have -been modified, prior to writing them into the filesystem. This means -it is possible to add a journal to an existing ext2 filesystem without -the need for data conversion. - -When changes to the filesystem (e.g. a file is renamed) they are stored in -a transaction in the journal and can either be complete or incomplete at -the time of a crash. If a transaction is complete at the time of a crash -(or in the normal case where the system does not crash), then any blocks -in that transaction are guaranteed to represent a valid filesystem state, -and are copied into the filesystem. If a transaction is incomplete at -the time of the crash, then there is no guarantee of consistency for -the blocks in that transaction so they are discarded (which means any -filesystem changes they represent are also lost). +========== Check Documentation/filesystems/ext3.txt if you want to read more about ext3 and journaling. diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 570f9bd..752f4b4 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -199,6 +202,43 @@ debugfs: ext2 and ext3 file system debugger. ext2online: online (mounted) ext2 and ext3 filesystem resizer +Requirements +============ + +Ext3 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed (NO-WRITE-ERRORS) + +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL) + + Ext3 handles trash getting written into sectors during powerfail + surprisingly well. It's not foolproof, but it is resilient. 
+ Incomplete journal entries are ignored, and journal replay of + complete entries will often "repair" garbage written into the inode + table. The data=journal option extends this behavior to file and + directory data blocks as well. + + +and obviously: + +* don't cause collateral damage to adjacent sectors on a failed write + (NO-COLLATERALS) + + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* either write caching is disabled, or hw can do barriers and they are enabled. + + (Note that barriers are disabled by default, use "barrier=1" + mount option after making sure hw can support them). + + hdparm -I reports disk features; "Native + Command Queueing" is the feature you are looking for. + + References ========== -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 309+ messages in thread
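The NO-COLLATERALS failure mode described in expectations.txt above is easy to model: a naive FTL services a one-sector write by erasing and rewriting the whole eraseblock, so power loss in the window between erase and rewrite destroys every neighbouring sector too. A toy Python model (the sizes are typical values, not those of any specific card):

```python
ERASE_BLOCK = 128 * 1024   # a typical flash eraseblock
SECTOR = 512               # 256 sectors per eraseblock

def write_sector_with_powerfail(flash, lba, data, fail_after_erase=True):
    """Toy model of a naive FTL: to update one 512-byte sector it erases
    and rewrites the whole eraseblock containing it. Power failure
    between erase and rewrite trashes every neighbouring sector."""
    start = (lba * SECTOR // ERASE_BLOCK) * ERASE_BLOCK
    block = bytearray(flash[start:start + ERASE_BLOCK])
    off = lba * SECTOR - start
    block[off:off + SECTOR] = data
    flash[start:start + ERASE_BLOCK] = b"\xff" * ERASE_BLOCK  # erase
    if fail_after_erase:
        return                                 # power lost mid-update
    flash[start:start + ERASE_BLOCK] = block   # rewrite

flash = bytearray(b"A" * (2 * ERASE_BLOCK))
write_sector_with_powerfail(flash, lba=3, data=b"B" * SECTOR)

# Sector 3 is gone -- but so are the 255 other sectors in its
# eraseblock, which the filesystem never asked to write:
assert flash[0:SECTOR] == b"\xff" * SECTOR
# The neighbouring eraseblock is untouched:
assert flash[ERASE_BLOCK:ERASE_BLOCK + SECTOR] == b"A" * SECTOR
```

Real FTLs usually write the new copy before garbage-collecting the old one, but the cheap cards discussed in this thread evidently do not, which is why unrelated sectors vanish on unplug.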
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 22:39 ` Theodore Tso 2009-08-24 23:00 ` Pavel Machek 2009-08-24 23:00 ` Pavel Machek @ 2009-08-25 13:57 ` Chris Adams 2009-08-25 22:58 ` Neil Brown 3 siblings, 0 replies; 309+ messages in thread From: Chris Adams @ 2009-08-25 13:57 UTC (permalink / raw) To: linux-kernel Once upon a time, Theodore Tso <tytso@mit.edu> said: >I'll note, BTW, that AIX uses a journal to protect against these sorts >of problems with software raid; this also means that with AIX, you >also don't have to rebuild a RAID 1 device after an unclean shutdown, >like you have to do with Linux MD. This was on the EVMS's team >development list to implement for Linux, but it got canned after LVM >won out, lo those many years ago. See mdadm(8) and look for "--bitmap". It has a few issues (can't reshape an array with a bitmap for example; you have to remove the bitmap, reshape, and re-add the bitmap), but it is available. -- Chris Adams <cmadams@hiwaay.net> Systems and Network Administrator - HiWAAY Internet Services I don't speak for anybody but myself - that's enough trouble. ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 22:39 ` Theodore Tso ` (2 preceding siblings ...) 2009-08-25 13:57 ` Chris Adams @ 2009-08-25 22:58 ` Neil Brown 2009-08-25 23:10 ` Ric Wheeler 3 siblings, 1 reply; 309+ messages in thread From: Neil Brown @ 2009-08-25 22:58 UTC (permalink / raw) To: Theodore Tso Cc: Pavel Machek, Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Monday August 24, tytso@mit.edu wrote: > On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote: > > > I have to admit that I have not paid enough attention to this specifics > > > of your ext3 + flash card issue - is it the ftl stuff doing out of order > > > IO's? > > > > The problem is that flash cards destroy whole erase block on unplug, > > and ext3 can't cope with that. > > Sure --- but name **any** filesystem that can deal with the fact that > 128k or 256k worth of data might disappear when you pull out the flash > card while it is writing a single sector? A Log structured filesystem could certainly be written to deal with such a situation, providing by 'deal with' you mean 'only loses data that has not yet been acknowledged to the application'. Of course the filesystem would need clear visibility into exactly how these blocks are positioned. I've been playing with just such a filesystem for some time (never really finding enough time) with the goal of making it work over RAID5 with no data risk due to power loss. One day it will be functional enough for others to try.... It is entirely possible that NILFS could be made to meet that requirement, but I haven't made time to explore NILFS so I cannot be sure. NeilBrown ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 22:58 ` Neil Brown @ 2009-08-25 23:10 ` Ric Wheeler 2009-08-25 23:32 ` NeilBrown 0 siblings, 1 reply; 309+ messages in thread From: Ric Wheeler @ 2009-08-25 23:10 UTC (permalink / raw) To: Neil Brown Cc: Theodore Tso, Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On 08/25/2009 06:58 PM, Neil Brown wrote: > On Monday August 24, tytso@mit.edu wrote: >> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote: >>>> I have to admit that I have not paid enough attention to this specifics >>>> of your ext3 + flash card issue - is it the ftl stuff doing out of order >>>> IO's? >>> >>> The problem is that flash cards destroy whole erase block on unplug, >>> and ext3 can't cope with that. >> >> Sure --- but name **any** filesystem that can deal with the fact that >> 128k or 256k worth of data might disappear when you pull out the flash >> card while it is writing a single sector? > > A Log structured filesystem could certainly be written to deal with > such a situation, providing by 'deal with' you mean 'only loses data > that has not yet been acknowledged to the application'. Of course the > filesystem would need clear visibility into exactly how these blocks > are positioned. > > I've been playing with just such a filesystem for some time (never > really finding enough time) with the goal of making it work over RAID5 > with no data risk due to power loss. One day it will be functional > enough for others to try.... > > It is entirely possible that NILFS could be made to meet that > requirement, but I haven't made time to explore NILFS so I cannot be > sure. > > NeilBrown > I am not sure that log structure will protect you from this scenario since once you clean the log, the non-logged data is assumed to be correct. 
If your cheap flash storage device can nuke random regions of that clean storage, you will lose data.... ric ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:10 ` Ric Wheeler @ 2009-08-25 23:32 ` NeilBrown 0 siblings, 0 replies; 309+ messages in thread From: NeilBrown @ 2009-08-25 23:32 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Wed, August 26, 2009 9:10 am, Ric Wheeler wrote: > On 08/25/2009 06:58 PM, Neil Brown wrote: >> On Monday August 24, tytso@mit.edu wrote: >>> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote: >>>>> I have to admit that I have not paid enough attention to this >>>>> specifics >>>>> of your ext3 + flash card issue - is it the ftl stuff doing out of >>>>> order >>>>> IO's? >>>> >>>> The problem is that flash cards destroy whole erase block on unplug, >>>> and ext3 can't cope with that. >>> >>> Sure --- but name **any** filesystem that can deal with the fact that >>> 128k or 256k worth of data might disappear when you pull out the flash >>> card while it is writing a single sector? >> >> A Log structured filesystem could certainly be written to deal with >> such a situation, providing by 'deal with' you mean 'only loses data >> that has not yet been acknowledged to the application'. Of course the >> filesystem would need clear visibility into exactly how these blocks >> are positioned. >> >> I've been playing with just such a filesystem for some time (never >> really finding enough time) with the goal of making it work over RAID5 >> with no data risk due to power loss. One day it will be functional >> enough for others to try.... >> >> It is entirely possible that NILFS could be made to meet that >> requirement, but I haven't made time to explore NILFS so I cannot be >> sure. >> >> NeilBrown >> > > I am not sure that log structure will protect you from this scenario since > once > you clean the log, the non-logged data is assumed to be correct. 
> > If your cheap flash storage device can nuke random regions of that clean > storage, you will lose data.... Hence my observation that "the filesystem would need clear visibility into exactly how these blocks are positioned". If there is an FTL in the way that randomly relocates blocks, and a power fail during write could corrupt data that appears to be megabytes away in some unpredictable location, then yes: a log structure won't help. However I would like to imagine that even a cheap flash device, if it only ever got writes that were exactly the size of the erase-block, would not break those writes over multiple erase blocks, so some degree of integrity and predictability could be preserved. Even more so, I would love to be able to disable the FTL, or at least have clear and correct documentation about how it works. So yes, not a panacea. But an avenue with real possibilities. NeilBrown ^ permalink raw reply [flat|nested] 309+ messages in thread
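Neil's premise can be sketched as follows (a hypothetical segment log, not NILFS and not his actual code): if the filesystem only ever issues aligned writes of exactly one erase block, a torn write can only hit the segment being appended, which by construction holds nothing that has yet been acknowledged to the application.

```python
# Sketch of a log-structured layout that only issues erase-block-sized,
# aligned writes. A power failure during commit loses only the pending
# segment; data is acknowledged (fsync returns) only after its segment
# has been committed, so acknowledged data is never at risk.

ERASE_BLOCK = 128 * 1024

class SegmentLog:
    def __init__(self):
        self.segments = []            # committed, acknowledged segments
        self.pending = bytearray()    # data not yet acknowledged

    def append(self, data):
        self.pending += data

    def commit(self, power_fails=False):
        # pad to exactly one erase block: a full, aligned write
        seg = bytes(self.pending.ljust(ERASE_BLOCK, b"\0"))
        assert len(seg) == ERASE_BLOCK
        if power_fails:
            return False              # only unacknowledged data is lost
        self.segments.append(seg)
        self.pending = bytearray()
        return True                   # ack to the application here

log = SegmentLog()
log.append(b"acked data")
log.commit()
log.append(b"not yet acked")
log.commit(power_fails=True)
print(len(log.segments))  # 1: the committed segment survives
```

Of course this only works under Neil's stated assumption that the device does not break an aligned, block-sized write across multiple erase blocks.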
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 20:52 ` Pavel Machek @ 2009-08-24 21:11 ` Greg Freemyer 2009-08-24 21:11 ` Greg Freemyer 1 sibling, 0 replies; 309+ messages in thread From: Greg Freemyer @ 2009-08-24 21:11 UTC (permalink / raw) To: Pavel Machek Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 > The papers show failures in "once a year" range. I have "twice a > minute" failure scenario with flashdisks. > > Not sure how often "degraded raid5 breaks ext3 atomicity" would bite, > but I bet it would be on "once a day" scale. > I agree it should be documented, but the ext3 atomicity issue is only an issue on unexpected shutdown while the array is degraded. I surely hope most people running raid5 are not seeing that level of unexpected shutdown, let alone in a degraded array. If they are, the atomicity issue pretty strongly says they should not be using raid5 in that environment. At least not for any filesystem I know. Having writes to LBA n corrupt LBA n+128 as an example is pretty hard to design around from a fs perspective. Greg ^ permalink raw reply [flat|nested] 309+ messages in thread
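The degraded-RAID-5 atomicity problem Greg is describing (the "write hole") comes down to a few lines of XOR arithmetic. The stripe below is a toy model: one parity block over three data blocks, with one disk failed.

```python
# Degraded RAID-5 write hole: a torn data+parity update corrupts the
# *reconstructed* contents of a block on the failed disk -- a different
# LBA than the one actually being written.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# One stripe: d0, d1 live, d2 on the failed disk, p = parity.
d0, d1, d2 = b"\x11" * 4, b"\x22" * 4, b"\x33" * 4
p = xor(xor(d0, d1), d2)

# With the d2 disk gone, d2 only exists as d0 ^ d1 ^ p:
assert xor(xor(d0, d1), p) == d2

# Overwrite d0. The data write completes, but power fails before the
# matching parity update -- the two writes are not atomic.
d0 = b"\x44" * 4
stale_p = p

# After reboot, reconstructing the untouched block d2 yields garbage:
reconstructed = xor(xor(d0, d1), stale_p)
print(reconstructed == d2)  # False: d2 was never written, yet is corrupt
```

That is exactly the "write to LBA n corrupts LBA n+128" shape: the damaged block lives at a different address (on a different member disk) than the one the filesystem wrote.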
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 21:11 ` Greg Freemyer (?) @ 2009-08-25 20:56 ` Rob Landley 2009-08-25 21:08 ` david -1 siblings, 1 reply; 309+ messages in thread From: Rob Landley @ 2009-08-25 20:56 UTC (permalink / raw) To: Greg Freemyer Cc: Pavel Machek, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Monday 24 August 2009 16:11:56 Greg Freemyer wrote: > > The papers show failures in "once a year" range. I have "twice a > > minute" failure scenario with flashdisks. > > > > Not sure how often "degraded raid5 breaks ext3 atomicity" would bite, > > but I bet it would be on "once a day" scale. > > I agree it should be documented, but the ext3 atomicity issue is only > an issue on unexpected shutdown while the array is degraded. I surely > hope most people running raid5 are not seeing that level of unexpected > shutdown, let along in a degraded array, > > If they are, the atomicity issue pretty strongly says they should not > be using raid5 in that environment. At least not for any filesystem I > know. Having writes to LBA n corrupt LBA n+128 as an example is > pretty hard to design around from a fs perspective. Right now, people think that a degraded raid 5 is equivalent to raid 0. As this thread demonstrates, in the power failure case it's _worse_, due to write granularity being larger than the filesystem sector size. (Just like flash.) Knowing that, some people might choose to suspend writes to their raid until it's finished recovery. Perhaps they'll set up a system where a degraded raid 5 gets remounted read only until recovery completes, and then writes go to a new blank hot spare disk using all that volume snapshotting or unionfs stuff people have been working on. (The big boys already have hot spare disks standing by on a lot of these systems, ready to power up and go without human intervention.
Needing two for actual reliability isn't that big a deal.) Or maybe the raid guys might want to tweak the recovery logic so it's not entirely linear, but instead prioritizes dirty pages over clean ones. So if somebody dirties a page halfway through a degraded raid 5, skip ahead to recover that chunk to the new disk first (yes leaving holes, it's not that hard to track), and _then_ let the write go through. But unless people know the issue exists, they won't even start thinking about ways to address it. > Greg Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
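Rob's proposed recovery policy could look roughly like this (a purely hypothetical sketch; md's actual rebuild logic works differently): a write that lands in a chunk not yet rebuilt onto the new disk triggers an out-of-order rebuild of that chunk before the write is allowed through, and a background sweep later fills in the holes.

```python
# Hypothetical "dirty chunks first" recovery policy for a degraded
# array rebuilding onto a spare. Writes never land in an unrecovered
# region, at the cost of leaving holes for the background sweep.

class PrioritizedRecovery:
    def __init__(self, n_chunks):
        self.n_chunks = n_chunks
        self.spare = {}                  # chunk -> data on the new disk

    def recover_chunk(self, c):
        # stand-in for reconstructing chunk c from the surviving disks
        self.spare.setdefault(c, b"reconstructed")

    def write(self, c, data):
        if c not in self.spare:
            self.recover_chunk(c)        # dirty chunk jumps the queue
        self.spare[c] = data             # only then does the write land

    def background_pass(self):
        for c in range(self.n_chunks):   # linear sweep fills the holes
            self.recover_chunk(c)

r = PrioritizedRecovery(n_chunks=8)
r.write(5, b"dirty")                     # chunk 5 rebuilt immediately
print(sorted(r.spare))                   # [5]: holes left, as Rob notes
r.background_pass()
print(len(r.spare))                      # 8
```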
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 20:56 ` Rob Landley @ 2009-08-25 21:08 ` david 0 siblings, 0 replies; 309+ messages in thread From: david @ 2009-08-25 21:08 UTC (permalink / raw) To: Rob Landley Cc: Greg Freemyer, Pavel Machek, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Tue, 25 Aug 2009, Rob Landley wrote: > On Monday 24 August 2009 16:11:56 Greg Freemyer wrote: >>> The papers show failures in "once a year" range. I have "twice a >>> minute" failure scenario with flashdisks. >>> >>> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite, >>> but I bet it would be on "once a day" scale. >> >> I agree it should be documented, but the ext3 atomicity issue is only >> an issue on unexpected shutdown while the array is degraded. I surely >> hope most people running raid5 are not seeing that level of unexpected >> shutdown, let along in a degraded array, >> >> If they are, the atomicity issue pretty strongly says they should not >> be using raid5 in that environment. At least not for any filesystem I >> know. Having writes to LBA n corrupt LBA n+128 as an example is >> pretty hard to design around from a fs perspective. > > Right now, people think that a degraded raid 5 is equivalent to raid 0. As > this thread demonstrates, in the power failure case it's _worse_, due to write > granularity being larger than the filesystem sector size. (Just like flash.) > > Knowing that, some people might choose to suspend writes to their raid until > it's finished recovery. Perhaps they'll set up a system where a degraded raid > 5 gets remounted read only until recovery completes, and then writes go to a > new blank hot spare disk using all that volume snapshoting or unionfs stuff > people have been working on. 
(The big boys already have hot spare disks > standing by on a lot of these systems, ready to power up and go without human > intervention. Needing two for actual reliability isn't that big a deal.) > > Or maybe the raid guys might want to tweak the recovery logic so it's not > entirely linear, but instead prioritizes dirty pages over clean ones. So if > somebody dirties a page halfway through a degraded raid 5, skip ahead to > recover that chunk first to the new disk first (yes leaving holes, it's not that > hard to track), and _then_ let the write go through. > > But unless people know the issue exists, they won't even start thinking about > ways to address it. if you've got the drives available you should be running raid 6 not raid 5 so that you have to lose two drives before you lose your error checking. in my opinion that's a far better use of a drive than a hot spare. David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 20:24 ` Ric Wheeler 2009-08-24 20:52 ` Pavel Machek @ 2009-08-25 18:52 ` Rob Landley 1 sibling, 0 replies; 309+ messages in thread From: Rob Landley @ 2009-08-25 18:52 UTC (permalink / raw) To: Ric Wheeler Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Monday 24 August 2009 15:24:28 Ric Wheeler wrote: > Pavel Machek wrote: > > Actually, ext2 should be able to survive that, no? Error writing -> > > remount ro -> fsck on next boot -> drive relocates the sectors. > > I think that the example and the response are both off base. If your > head ever touches the platter, you won't be reading from a huge part of > your drive ever again. It's not quite that simple anymore. These days, most modern drives add an "overcoat", which is a vapor deposition layer of carbon (i.e. diamond) on top of the magnetic media, and then add a nanolayer of some kind of nonmagnetic lubricant on top of that. That protects the magnetic layer from physical contact with the head; it takes a pretty solid whack to chip through diamond and actually gouge your disk: http://www.datarecoverylink.com/understanding_magnetic_media.html You can also do fun things with various nitrides (carbon nitride, silicon nitride, titanium nitride) which are pretty darn tough too, although I dunno about their suitability to hard drives: http://www.physical-vapor-deposition.com/ So while it _is_ possible to whack your drive and scratch the platter, merely "touching" won't do it. (Laptops wouldn't be feasible if they couldn't cope with a little jostling while running.) In the case of repeated small whacks, your heads may actually go first.
(I vaguely recall the little aerofoil wing thingy holding up the disk touches first, and can get ground down by repeated contact with the diamond layer (despite the lubricant, that just buys time) so it gets shorter and shorter and can't reliably keep the head above the disk rather than in contact with it. But I'm kind of stale myself here, not sure that's still current.) Here's a nice youtube video of a 2007 defcon talk from a hard drive recovery professional, "What's that Clicking Noise", series starts here: http://www.youtube.com/watch?v=vCapEFNZAJ0 And here's that guy's web page: http://www.myharddrivedied.com/presentations/index.html Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 13:01 ` Theodore Tso 2009-08-24 14:55 ` Artem Bityutskiy 2009-08-24 19:52 ` Pavel Machek @ 2009-08-25 14:43 ` Florian Weimer 2 siblings, 0 replies; 309+ messages in thread From: Florian Weimer @ 2009-08-25 14:43 UTC (permalink / raw) To: Theodore Tso Cc: Pavel Machek, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 * Theodore Tso: > The only one that falls into that category is the one about not being > able to handle failed writes, and the way most failures take place, Hmm. What does "not being able to handle failed writes" actually mean? AFAICS, there are two possible answers: "all bets are off", or "we'll tell you about the problem, and all bets are off". >> Isn't this by design? In other words, if the metadata doesn't survive >> non-atomic writes, wouldn't it be an ext3 bug? > > Part of the problem here is that "atomic-writes" is confusing; it > doesn't mean what many people think it means. The assumption which > many naive filesystem designers make is that writes succeed or they > don't. If they don't succeed, they don't change the previously > existing data in any way. Right. And a lot of database systems make the same assumption. Oracle Berkeley DB cannot deal with partial page writes at all, and PostgreSQL assumes that it's safe to flip a few bits in a sector without proper WAL (it doesn't care if the changes actually hit the disk, but the write shouldn't make the sector unreadable or put random bytes there). > Is that a file system "bug"? Well, it's better to call that a > mismatch between the assumptions made of physical devices, and of the > file system code. 
On Irix, SGI hardware had a powerfail interrupt, > and the power supply and extra-big capacitors, so that when a power > fail interrupt came in, the Irix would run around frantically shutting > down pending DMA transfers to prevent this failure mode from causing > problems. PC class hardware (according to Ted's law), is cr*p, and > doesn't have a powerfail interrupt, so it's not something that we > have. The DMA transaction should fail due to ECC errors, though. > Ext3, ext4, and ocfs2 does physical block journalling, so as long as > journal truncate hasn't taken place right before the failure, the > replay of the physical block journal tends to repair this most (but > not necessarily all) cases of "garbage is written right before power > failure". People who care about this should really use a UPS, and > wire up the USB and/or serial cable from the UPS to the system, so > that the OS can do a controlled shutdown if the UPS is close to > shutting down due to an extended power failure. I think the general idea is to protect valuable data with WAL. You overwrite pages on disk only after you've made a backup copy into WAL. After a power loss event, you replay the log and overwrite all garbage that might be there. For the WAL, you rely on checksum and sequence numbers. This still doesn't help against write failures where the system continues running (because the fsync() during checkpointing isn't guaranteed to report errors), but it should deal with the power failure case. But this assumes that the file system protects its own data structure in a similar way. Is this really too much to demand? Partial failures are extremely difficult to deal with because of their asynchronous nature. I've come to accept that, but it's still disappointing. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 ^ permalink raw reply [flat|nested] 309+ messages in thread
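The WAL scheme Florian describes can be sketched in a few lines (a heavily simplified model of full-page writes, not PostgreSQL's or Berkeley DB's actual implementation): log a checksummed copy of the page before overwriting it in place, and on recovery replay every log record whose checksum validates, overwriting whatever garbage a torn in-place write left behind.

```python
# Simplified torn-page protection via write-ahead logging: each WAL
# record carries a CRC and a sequence number; replay ignores torn log
# records and restores intact page images over any in-place garbage.

import zlib

def wal_record(seq, page_no, page):
    payload = seq.to_bytes(8, "big") + page_no.to_bytes(4, "big") + page
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def replay(log, pages):
    for rec in log:
        crc, payload = rec[:4], rec[4:]
        if zlib.crc32(payload).to_bytes(4, "big") != crc:
            continue                      # torn log record: ignore it
        page_no = int.from_bytes(payload[8:12], "big")
        pages[page_no] = payload[12:]     # restore the good page image

pages = {0: b"old page"}
log = [wal_record(seq=1, page_no=0, page=b"new page")]
# Power fails mid-overwrite: the in-place page is now garbage...
pages[0] = b"\xde\xad\xbe\xef"
# ...but recovery replays the intact WAL copy over it:
replay(log, pages)
print(pages[0])  # b'new page'
```

Note this only delivers Florian's guarantee if the underlying device honours the ordering of the WAL fsync before the in-place overwrite, which is exactly the block-layer expectation the thread is about.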
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 11:19 ` Florian Weimer 2009-08-24 13:01 ` Theodore Tso @ 2009-08-24 13:50 ` Theodore Tso 2009-08-24 18:48 ` Pavel Machek 2009-08-24 18:48 ` Pavel Machek 2009-08-24 18:39 ` Pavel Machek 2 siblings, 2 replies; 309+ messages in thread From: Theodore Tso @ 2009-08-24 13:50 UTC (permalink / raw) To: Florian Weimer Cc: Pavel Machek, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Mon, Aug 24, 2009 at 11:19:01AM +0000, Florian Weimer wrote: > > +* don't damage the old data on a failed write (ATOMIC-WRITES) > > + > > + (Thrash may get written into sectors during powerfail. And > > + ext3 handles this surprisingly well at least in the > > + catastrophic case of garbage getting written into the inode > > + table, since the journal replay often will "repair" the > > + garbage that was written into the filesystem metadata blocks. > > Isn't this by design? In other words, if the metadata doesn't survive > non-atomic writes, wouldn't it be an ext3 bug? So I got confused when I quoted your note, which I had assumed was exactly what Pavel had written in his documentation. In fact, what he had written was this: +Don't damage the old data on a failed write (ATOMIC-WRITES) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either whole sector is correctly written or nothing is written during +powerfail. + +.... So he had explicitly stated that he only cared about the whole sector being written (or not written) in the power fail case, and not any other. I'd suggest changing ATOMIC-WRITES to ATOMIC-WRITE-ON-POWERFAIL, since the one-line summary, "Don't damage the old data on a failed write", is also singularly misleading. - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 13:50 ` Theodore Tso @ 2009-08-24 18:48 ` Pavel Machek 2009-08-24 18:48 ` Pavel Machek 1 sibling, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 18:48 UTC (permalink / raw) To: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Hi! > > > +* don't damage the old data on a failed write (ATOMIC-WRITES) > > > + > > > + (Thrash may get written into sectors during powerfail. And > > > + ext3 handles this surprisingly well at least in the > > > + catastrophic case of garbage getting written into the inode > > > + table, since the journal replay often will "repair" the > > > + garbage that was written into the filesystem metadata blocks. > > > > Isn't this by design? In other words, if the metadata doesn't survive > > non-atomic writes, wouldn't it be an ext3 bug? > > So I got confused when I quoted your note, which I had assumed was > exactly what Pavel had written in his documentation. In fact, what he > had written was this: > > +Don't damage the old data on a failed write (ATOMIC-WRITES) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Either whole sector is correctly written or nothing is written during > +powerfail. > + > +.... > > So he had explicitly stated that he only cared about the whole sector > being written (or not written) in the power fail case, and not any > other. I'd suggest changing ATOMIC-WRITES to > ATOMIC-WRITE-ON-POWERFAIL, since the one-line summary, "Don't damage > the old data on a failed write", is also singularly misleading. Ok, something like this? Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Either whole sector is correctly written or nothing is written during powerfail. 
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 11:19 ` Florian Weimer 2009-08-24 13:01 ` Theodore Tso 2009-08-24 13:50 ` Theodore Tso @ 2009-08-24 18:39 ` Pavel Machek 2 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 18:39 UTC (permalink / raw) To: Florian Weimer Cc: Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 Hi! > > +Linux block-backed filesystems can only work correctly when several > > +conditions are met in the block layer and below (disks, flash > > +cards). Some of them are obvious ("data on media should not change > > +randomly"), some are less so. > > You should make clear that the file lists per-file-system rules and > that some file sytems can recover from some of the error conditions. Ok, I added "Not all filesystems require all of these to be satisfied for safe operation" sentence there. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 9:31 ` [patch] " Pavel Machek @ 2009-08-24 13:21 ` Greg Freemyer 2009-08-24 13:21 ` Greg Freemyer 2009-08-24 21:11 ` Rob Landley 2 siblings, 0 replies; 309+ messages in thread From: Greg Freemyer @ 2009-08-24 13:21 UTC (permalink / raw) To: Pavel Machek Cc: Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Mon, Aug 24, 2009 at 5:31 AM, Pavel Machek<pavel@ucw.cz> wrote: > > Running journaling filesystem such as ext3 over flashdisk or degraded > RAID array is a bad idea: journaling guarantees no longer apply and > you will get data corruption on powerfail. > > We can't solve it easily, but we should certainly warn the users. I > actually lost data because I did not understand these limitations... > > Signed-off-by: Pavel Machek <pavel@ucw.cz> > > diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt > new file mode 100644 > index 0000000..80fa886 > --- /dev/null > +++ b/Documentation/filesystems/expectations.txt > @@ -0,0 +1,52 @@ > +Linux block-backed filesystems can only work correctly when several > +conditions are met in the block layer and below (disks, flash > +cards). Some of them are obvious ("data on media should not change > +randomly"), some are less so. > + > +Write errors not allowed (NO-WRITE-ERRORS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Writes to media never fail. Even if disk returns error condition > +during write, filesystems can't handle that correctly. > + > + Fortunately writes failing are very uncommon on traditional > + spinning disks, as they have spare sectors they use when write > + fails. 
> + > +Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Unfortunately, cheap USB/SD flash cards I've seen do have this bug, > +and are thus unsuitable for all filesystems I know. > + > + An inherent problem with using flash as a normal block device > + is that the flash erase size is bigger than most filesystem > + sector sizes. So when you request a write, it may erase and > + rewrite some 64k, 128k, or even a couple megabytes on the > + really _big_ ones. > + > + If you lose power in the middle of that, filesystem won't > + notice that data in the "sectors" _around_ the one your were > + trying to write to got trashed. > + > + RAID-4/5/6 in degraded mode has same problem. > + > + > +Don't damage the old data on a failed write (ATOMIC-WRITES) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Either whole sector is correctly written or nothing is written during > +powerfail. > + > + Because RAM tends to fail faster than rest of system during > + powerfail, special hw killing DMA transfers may be necessary; > + otherwise, disks may write garbage during powerfail. > + This may be quite common on generic PC machines. > + > + Note that atomic write is very hard to guarantee for RAID-4/5/6, > + because it needs to write both changed data, and parity, to > + different disks. (But it will only really show up in degraded mode). > + UPS for RAID array should help. Can someone clarify if this is true in raid-6 with just a single disk failure? I don't see why it would be. And if not, can the above text be changed to reflect that raid 4/5 with a single disk failure and raid 6 with a double disk failure are the modes that have atomicity problems? Greg ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible @ 2009-08-24 13:21 ` Greg Freemyer 0 siblings, 0 replies; 309+ messages in thread From: Greg Freemyer @ 2009-08-24 13:21 UTC (permalink / raw) To: Pavel Machek Cc: Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Mon, Aug 24, 2009 at 5:31 AM, Pavel Machek<pavel@ucw.cz> wrote: > > Running journaling filesystem such as ext3 over flashdisk or degraded > RAID array is a bad idea: journaling guarantees no longer apply and > you will get data corruption on powerfail. > > We can't solve it easily, but we should certainly warn the users. I > actually lost data because I did not understand these limitations... > > Signed-off-by: Pavel Machek <pavel@ucw.cz> > > diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt > new file mode 100644 > index 0000000..80fa886 > --- /dev/null > +++ b/Documentation/filesystems/expectations.txt > @@ -0,0 +1,52 @@ > +Linux block-backed filesystems can only work correctly when several > +conditions are met in the block layer and below (disks, flash > +cards). Some of them are obvious ("data on media should not change > +randomly"), some are less so. > + > +Write errors not allowed (NO-WRITE-ERRORS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Writes to media never fail. Even if disk returns error condition > +during write, filesystems can't handle that correctly. > + > + Fortunately writes failing are very uncommon on traditional > + spinning disks, as they have spare sectors they use when write > + fails. > + > +Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Unfortunately, cheap USB/SD flash cards I've seen do have this bug, > +and are thus unsuitable for all filesystems I know. 
> + > + An inherent problem with using flash as a normal block device > + is that the flash erase size is bigger than most filesystem > + sector sizes. So when you request a write, it may erase and > + rewrite some 64k, 128k, or even a couple megabytes on the > + really _big_ ones. > + > + If you lose power in the middle of that, the filesystem won't > + notice that data in the "sectors" _around_ the one you were > + trying to write to got trashed. > + > + RAID-4/5/6 in degraded mode has the same problem. > + > + > +Don't damage the old data on a failed write (ATOMIC-WRITES) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Either the whole sector is correctly written or nothing is written during > +powerfail. > + > + Because RAM tends to fail faster than the rest of the system during > + powerfail, special hw killing DMA transfers may be necessary; > + otherwise, disks may write garbage during powerfail. > + This may be quite common on generic PC machines. > + > + Note that atomic write is very hard to guarantee for RAID-4/5/6, > + because it needs to write both the changed data and the parity to > + different disks. (But it will only really show up in degraded mode.) > + A UPS for the RAID array should help. Can someone clarify if this is true in raid-6 with just a single disk failure? I don't see why it would be. And if not, can the above text be changed to reflect that raid 4/5 with a single disk failure and raid 6 with a double disk failure are the modes that have atomicity problems? Greg -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 13:21 ` Greg Freemyer (?) @ 2009-08-24 18:44 ` Pavel Machek -1 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-08-24 18:44 UTC (permalink / raw) To: Greg Freemyer Cc: Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 > > +Either whole sector is correctly written or nothing is written during > > +powerfail. > > + > > + Because RAM tends to fail faster than rest of system during > > + powerfail, special hw killing DMA transfers may be necessary; > > + otherwise, disks may write garbage during powerfail. > > + This may be quite common on generic PC machines. > > + > > + Note that atomic write is very hard to guarantee for RAID-4/5/6, > > + because it needs to write both changed data, and parity, to > > + different disks. (But it will only really show up in degraded mode). > > + UPS for RAID array should help. > > Can someone clarify if this is true in raid-6 with just a single disk > failure? I don't see why it would be. > > And if not can the above text be changed to reflect raid 4/5 with a > single disk failure and raid 6 with a double disk failure are the > modes that have atomicity problems. I don't know enough about raid-6, but... I said "degraded mode" above, and you can read it as a double failure in the raid-6 case ;-). I'd prefer to avoid too many details here. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 13:21 ` Greg Freemyer @ 2009-08-25 23:28 ` Neil Brown -1 siblings, 0 replies; 309+ messages in thread From: Neil Brown @ 2009-08-25 23:28 UTC (permalink / raw) To: Greg Freemyer Cc: Pavel Machek, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Monday August 24, greg.freemyer@gmail.com wrote: > > +Don't damage the old data on a failed write (ATOMIC-WRITES) > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > + > > +Either whole sector is correctly written or nothing is written during > > +powerfail. > > + > > + Because RAM tends to fail faster than rest of system during > > + powerfail, special hw killing DMA transfers may be necessary; > > + otherwise, disks may write garbage during powerfail. > > + This may be quite common on generic PC machines. > > + > > + Note that atomic write is very hard to guarantee for RAID-4/5/6, > > + because it needs to write both changed data, and parity, to > > + different disks. (But it will only really show up in degraded mode). > > + UPS for RAID array should help. > > Can someone clarify if this is true in raid-6 with just a single disk > failure? I don't see why it would be. It does affect raid6 with a single drive missing. After an unclean shutdown you cannot trust any parity block, as it is possible that some of the blocks in the stripe have been updated, but others have not. So you must assume that all parity blocks are wrong and update them. If you have a missing disk you cannot do that. To take a more concrete example, imagine a 5-device RAID6 with 3 data blocks D0 D1 D2 as well as P and Q on some stripe. Suppose that we crashed while updating D0, which would have involved writing out D0, P and Q. On restart, suppose D2 is missing. It is possible that 0, 1, 2, or 3 of D0, P and Q have been updated and the others not.
We can try to recompute D2 from D0, D1 and P, from D0, P and Q, or from D1, P and Q. We could conceivably try each of those, and if they all produce the same result we might be confident of it. If two produced the same result and the other was different we could use a voting process to choose the 'best'. And in this particular case I think that would work. If 0 or 3 had been updated, all would be the same. If only 1 was updated, then the combinations that exclude it will match. If 2 were updated, then the combinations that exclude the non-updated block will match. But if both D0 and D1 were being updated I think there would be too many combinations and it would be very possible that all three computed values for D2 would be different. So yes: a singly degraded RAID6 cannot promise no data corruption after an unclean shutdown. That is why "mdadm" will not assemble such an array unless you use "--force" to acknowledge that there has been a problem. NeilBrown ^ permalink raw reply [flat|nested] 309+ messages in thread
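Neil's three-way reconstruction, and the way a partial stripe update poisons it, can be sketched numerically. The following is my own toy model, not md's actual code; it assumes the conventional RAID6 arithmetic (GF(2^8) with polynomial 0x11d, P as the plain XOR of the data, Q as the sum of g^i * Di with generator g = 2) and byte-sized "blocks". It recomputes a missing D2 from each of the three combinations listed above, then simulates a crash in which D0 reached the disk but P and Q did not:

```python
# Toy model of recovering a missing D2 in a 5-device RAID6 (D0, D1, D2, P, Q).
# GF(2^8) arithmetic with the usual RAID6 polynomial 0x11d, generator 2.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) mod x^8+x^4+x^3+x^2+1 (0x11d)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return p

def gf_inv(a):
    """a^254 == a^-1, since the multiplicative group has order 255."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def pq(d0, d1, d2):
    """P is plain XOR; Q = 1*D0 ^ 2*D1 ^ 4*D2 in GF(2^8)."""
    return d0 ^ d1 ^ d2, d0 ^ gf_mul(2, d1) ^ gf_mul(4, d2)

def d2_from(d0=None, d1=None, p=None, q=None):
    """Recompute D2 from whichever three of D0, D1, P, Q were passed in."""
    if q is None:                 # from D0, D1, P
        return p ^ d0 ^ d1
    if d1 is None:                # from D0, P, Q:  D2 = (Q^D0 ^ 2*(P^D0)) / 6
        return gf_mul(gf_inv(6), q ^ d0 ^ gf_mul(2, p ^ d0))
    # from D1, P, Q:              D2 = (Q^P ^ 3*D1) / 5
    return gf_mul(gf_inv(5), q ^ p ^ gf_mul(3, d1))

d0, d1, d2 = 0x37, 0xA2, 0x5C
p, q = pq(d0, d1, d2)

# Clean array: all three reconstructions agree with the real D2.
assert d2_from(d0=d0, d1=d1, p=p) == d2
assert d2_from(d0=d0, p=p, q=q) == d2
assert d2_from(d1=d1, p=p, q=q) == d2

# Crash while updating D0: the new D0 hit the disk, P and Q did not.
d0_new = 0xFF
assert d2_from(d0=d0_new, d1=d1, p=p) != d2   # combinations using D0 are wrong
assert d2_from(d0=d0_new, p=p, q=q) != d2
assert d2_from(d1=d1, p=p, q=q) == d2         # only the D0-free one survives
```

After the simulated crash, the only reconstruction that is still correct is the one that happens to exclude the partially-updated block, and nothing on disk tells the array which one that is.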
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-25 23:28 ` Neil Brown (?) @ 2009-08-26 1:34 ` david -1 siblings, 0 replies; 309+ messages in thread From: david @ 2009-08-26 1:34 UTC (permalink / raw) To: Neil Brown Cc: Greg Freemyer, Pavel Machek, Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Wed, 26 Aug 2009, Neil Brown wrote: > On Monday August 24, greg.freemyer@gmail.com wrote: >>> +Don't damage the old data on a failed write (ATOMIC-WRITES) >>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>> + >>> +Either whole sector is correctly written or nothing is written during >>> +powerfail. >>> + >>> + Because RAM tends to fail faster than rest of system during >>> + powerfail, special hw killing DMA transfers may be necessary; >>> + otherwise, disks may write garbage during powerfail. >>> + This may be quite common on generic PC machines. >>> + >>> + Note that atomic write is very hard to guarantee for RAID-4/5/6, >>> + because it needs to write both changed data, and parity, to >>> + different disks. (But it will only really show up in degraded mode). >>> + UPS for RAID array should help. >> >> Can someone clarify if this is true in raid-6 with just a single disk >> failure? I don't see why it would be. > > It does affect raid6 with a single drive missing. > > After an unclean shutdown you cannot trust any Parity block as it > is possible that some of the blocks in the stripe have been updated, > but others have not. So you must assume that all parity blocks are > wrong and update them. If you have a missing disk you cannot do that. > > To take a more concrete example, imagine a 5 device RAID6 with > 3 data blocks D0 D1 D2 as well as P and Q on some stripe. > Suppose that we crashed while updating D0, which would have involved > writing out D0, P and Q.
> On restart, suppose D2 is missing. It is possible that 0, 1, 2, or 3 > of D0, P and Q have been updated and the others not. > We can try to recompute D2 from D0 D1 and P, from > D0 P and Q or from D1, P and Q. > > We could conceivably try each of those and if they all produce the > same result we might be confident of it. > If two produced the same result and the other was different we could > use a voting process to choose the 'best'. And in this particular > case I think that would work. If 0 or 3 had been updated, all would > be the same. If only 1 was updated, then the combinations that > exclude it will match. If 2 were updated, then the combinations that > exclude the non-updated block will match. > > But if both D0 and D1 were being updated I think there would be too > many combinations and it would be very possible that all three > computed values for D2 would be different. > > So yes: a singly degraded RAID6 cannot promise no data corruption > after an unclean shutdown. That is why "mdadm" will not assemble such > an array unless you use "--force" to acknowledge that there has been a > problem. thanks for this detail, I would not have expected a partially degraded raid 6 array to be this sensitive to problems. assuming that the degradation happens prior to the power failure, what could be done to make this safer and more predictable? off the top of my head (and possibly an extreme performance hit, not necessarily suitable for everyone): is there something that could be done with ordering the writes to the various drives? David Lang ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 9:31 ` [patch] " Pavel Machek 2009-08-24 11:19 ` Florian Weimer 2009-08-24 13:21 ` Greg Freemyer @ 2009-08-24 21:11 ` Rob Landley 2009-08-24 21:33 ` Pavel Machek 2 siblings, 1 reply; 309+ messages in thread From: Rob Landley @ 2009-08-24 21:11 UTC (permalink / raw) To: Pavel Machek Cc: Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Monday 24 August 2009 04:31:43 Pavel Machek wrote: > Running a journaling filesystem such as ext3 over a flashdisk or degraded > RAID array is a bad idea: journaling guarantees no longer apply and > you will get data corruption on powerfail. > > We can't solve it easily, but we should certainly warn the users. I > actually lost data because I did not understand these limitations... > > Signed-off-by: Pavel Machek <pavel@ucw.cz> Acked-by: Rob Landley <rob@landley.net> With a couple comments: > +* write caching is disabled. ext2 does not know how to issue barriers > + as of 2.6.28. hdparm -W0 disables it on SATA disks. It's coming up on 2.6.31; has it learned anything since, or should that version number be bumped? > + (Trash may get written into sectors during powerfail. And > + ext3 handles this surprisingly well at least in the > + catastrophic case of garbage getting written into the inode > + table, since the journal replay often will "repair" the > + garbage that was written into the filesystem metadata blocks. > + It won't do a bit of good for the data blocks, of course > + (unless you are using data=journal mode). But this means that > + in fact, ext3 is more resistant to surviving failures to the > + first problem (powerfail while writing can damage old data on > + a failed write) but fortunately, hard drives generally don't > + cause collateral damage on a failed write. Possible rewording of this paragraph: Ext3 handles trash getting written into sectors during powerfail surprisingly well.
It's not foolproof, but it is resilient. Incomplete journal entries are ignored, and journal replay of complete entries will often "repair" garbage written into the inode table. The data=journal option extends this behavior to file and directory data blocks as well (without which your dentries can still be badly corrupted by a power fail during a write). (I'm not entirely sure about that last bit, but clarifying it one way or the other would be nice because I can't tell from reading it which it is. My _guess_ is that directories are just treated as files with an attitude and an extra caching layer...?) Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds ^ permalink raw reply [flat|nested] 309+ messages in thread
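The replay behavior described here -- incomplete journal transactions discarded, complete ones applied -- can be illustrated with a toy model. This is my own sketch, not the jbd code; "records" and "commit markers" stand in for jbd's descriptor, data and commit blocks:

```python
# Toy write-ahead journal: a transaction only takes effect if its
# commit record made it to the journal before the crash.
def replay(journal, disk):
    """journal: list of ('data', block, value) and ('commit',) records."""
    pending = {}
    for rec in journal:
        if rec[0] == 'data':
            _, block, value = rec
            pending[block] = value          # buffer until the commit record
        elif rec[0] == 'commit':
            disk.update(pending)            # transaction complete: apply it
            pending = {}
    # Records after the last commit are an incomplete transaction;
    # they are discarded, never applied.
    return disk

disk = {'inode_table': 'v1', 'bitmap': 'v1'}
journal = [
    ('data', 'inode_table', 'v2'), ('commit',),   # committed before the crash
    ('data', 'bitmap', 'v2'),                     # power failed mid-transaction
]
replay(journal, disk)
# inode_table advances to v2; the half-written bitmap update is dropped,
# so metadata stays at the last consistent committed state.
```

This is exactly why replay can "repair" a partially trashed metadata block (a committed copy exists in the journal) but does nothing for ordinary data blocks, which never pass through the journal outside data=journal mode.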
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 21:11 ` Rob Landley @ 2009-08-24 21:33 ` Pavel Machek 2009-08-25 18:45 ` Jan Kara 0 siblings, 1 reply; 309+ messages in thread From: Pavel Machek @ 2009-08-24 21:33 UTC (permalink / raw) To: Rob Landley, jack Cc: Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Mon 2009-08-24 16:11:08, Rob Landley wrote: > On Monday 24 August 2009 04:31:43 Pavel Machek wrote: > > Running a journaling filesystem such as ext3 over a flashdisk or degraded > > RAID array is a bad idea: journaling guarantees no longer apply and > > you will get data corruption on powerfail. > > > > We can't solve it easily, but we should certainly warn the users. I > > actually lost data because I did not understand these limitations... > > > > Signed-off-by: Pavel Machek <pavel@ucw.cz> > > Acked-by: Rob Landley <rob@landley.net> > > With a couple comments: > > > +* write caching is disabled. ext2 does not know how to issue barriers > > + as of 2.6.28. hdparm -W0 disables it on SATA disks. > > It's coming up on 2.6.31, has it learned anything since or should that version > number be bumped? Jan, did those "barrier for ext2" patches get merged? > > + (Trash may get written into sectors during powerfail. And > > + ext3 handles this surprisingly well at least in the > > + catastrophic case of garbage getting written into the inode > > + table, since the journal replay often will "repair" the > > + garbage that was written into the filesystem metadata blocks. > > + It won't do a bit of good for the data blocks, of course > > + (unless you are using data=journal mode). But this means that > > + in fact, ext3 is more resistant to surviving failures to the > > + first problem (powerfail while writing can damage old data on > > + a failed write) but fortunately, hard drives generally don't > > + cause collateral damage on a failed write.
> > Possible rewording of this paragraph: > > Ext3 handles trash getting written into sectors during powerfail > surprisingly well. It's not foolproof, but it is resilient. Incomplete > journal entries are ignored, and journal replay of complete entries will > often "repair" garbage written into the inode table. The data=journal > option extends this behavior to file and directory data blocks as well > (without which your dentries can still be badly corrupted by a power fail > during a write). > > (I'm not entirely sure about that last bit, but clarifying it one way or the > other would be nice because I can't tell from reading it which it is. My > _guess_ is that directories are just treated as files with an attitude and an > extra cacheing layer...?) Thanks, applied, it looks better than what I wrote. I removed the () part, as I'm not sure about it... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: [patch] ext2/3: document conditions when reliable operation is possible 2009-08-24 21:33 ` Pavel Machek @ 2009-08-25 18:45 ` Jan Kara 0 siblings, 0 replies; 309+ messages in thread From: Jan Kara @ 2009-08-25 18:45 UTC (permalink / raw) To: Pavel Machek Cc: Rob Landley, jack, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Mon 24-08-09 23:33:12, Pavel Machek wrote: > On Mon 2009-08-24 16:11:08, Rob Landley wrote: > > On Monday 24 August 2009 04:31:43 Pavel Machek wrote: > > > Running a journaling filesystem such as ext3 over a flashdisk or degraded > > > RAID array is a bad idea: journaling guarantees no longer apply and > > > you will get data corruption on powerfail. > > > > > > We can't solve it easily, but we should certainly warn the users. I > > > actually lost data because I did not understand these limitations... > > > > > > Signed-off-by: Pavel Machek <pavel@ucw.cz> > > > > Acked-by: Rob Landley <rob@landley.net> > > > > With a couple comments: > > > > > +* write caching is disabled. ext2 does not know how to issue barriers > > > + as of 2.6.28. hdparm -W0 disables it on SATA disks. > > > > It's coming up on 2.6.31, has it learned anything since or should that version > > number be bumped? > > Jan, did those "barrier for ext2" patches get merged? No, they did not. We were discussing how to be able to enable / disable sending barriers, someone said he'd implement it but it somehow never got beyond an initial attempt. Actually, after recent sync cleanups (and when my O_SYNC cleanups get merged) it should be pretty easy, because every filesystem now has ->fsync() and ->sync_fs() callbacks, so we just have to add sending barriers to these two functions and implement the possibility to set via sysfs that barriers on the block device should be ignored. I've put it on my todo list but if someone else has time for this, I certainly would not mind :). It would be a nice beginner project...
Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-12 19:13 ` Rob Landley 2009-03-16 12:28 ` Pavel Machek @ 2009-03-16 12:30 ` Pavel Machek 2009-03-16 19:03 ` Theodore Tso 2009-03-16 19:40 ` Sitsofe Wheeler 2009-08-29 1:33 ` Robert Hancock 2 siblings, 2 replies; 309+ messages in thread From: Pavel Machek @ 2009-03-16 12:30 UTC (permalink / raw) To: Rob Landley Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 Updated version here. On Thu 2009-03-12 14:13:03, Rob Landley wrote: > On Thursday 12 March 2009 04:21:14 Pavel Machek wrote: > > Not all block devices are suitable for all filesystems. In fact, some > > block devices are so broken that reliable operation is pretty much > > impossible. Document stuff ext2/ext3 needs for reliable operation. > > > > Signed-off-by: Pavel Machek <pavel@ucw.cz> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt new file mode 100644 index 0000000..710d119 --- /dev/null +++ b/Documentation/filesystems/expectations.txt @@ -0,0 +1,47 @@ +Linux block-backed filesystems can only work correctly when several +conditions are met in the block layer and below (disks, flash +cards). Some of them are obvious ("data on media should not change +randomly"), some are less so. + +Write errors not allowed (NO-WRITE-ERRORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writes to media never fail. Even if disk returns error condition +during write, filesystems can't handle that correctly, because success +on fsync was already returned when data hit the journal. + + Fortunately writes failing are very uncommon on traditional + spinning disks, as they have spare sectors they use when write + fails. + +Sector writes are atomic (ATOMIC-SECTORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either whole sector is correctly written or nothing is written during +powerfail. 
+ + Unfortunately, none of the cheap USB/SD flash cards I've seen + do behave like this, and are thus unsuitable for all Linux + filesystems I know. + + An inherent problem with using flash as a normal block + device is that the flash erase size is bigger than + most filesystem sector sizes. So when you request a + write, it may erase and rewrite some 64k, 128k, or + even a couple megabytes on the really _big_ ones. + + If you lose power in the middle of that, filesystem + won't notice that data in the "sectors" _around_ the + one you were trying to write to got trashed. + + Because RAM tends to fail faster than rest of system during + powerfail, special hw killing DMA transfers may be necessary; + otherwise, disks may write garbage during powerfail. + Not sure how common that problem is on generic PC machines. + + Note that atomic write is very hard to guarantee for RAID-4/5/6, + because it needs to write both changed data, and parity, to + different disks. UPS for RAID array should help. + + + diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index 4333e83..41fd2ec 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they have to be 8 character filenames, even then we are fairly close to running out of unique filenames. +Requirements +============ + +Ext2 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed + +* sector writes are atomic + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* write caching is disabled. ext2 does not know how to issue barriers + as of 2.6.28. hdparm -W0 disables it on SATA disks. + Journaling ----------- - -A journaling extension to the ext2 code has been developed by Stephen -Tweedie.
It avoids the risks of metadata corruption and the need to -wait for e2fsck to complete after a crash, without requiring a change -to the on-disk ext2 layout. In a nutshell, the journal is a regular -file which stores whole metadata (and optionally data) blocks that have -been modified, prior to writing them into the filesystem. This means -it is possible to add a journal to an existing ext2 filesystem without -the need for data conversion. - -When changes to the filesystem (e.g. a file is renamed) they are stored in -a transaction in the journal and can either be complete or incomplete at -the time of a crash. If a transaction is complete at the time of a crash -(or in the normal case where the system does not crash), then any blocks -in that transaction are guaranteed to represent a valid filesystem state, -and are copied into the filesystem. If a transaction is incomplete at -the time of the crash, then there is no guarantee of consistency for -the blocks in that transaction so they are discarded (which means any -filesystem changes they represent are also lost). +========== Check Documentation/filesystems/ext3.txt if you want to read more about ext3 and journaling. diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 9dd2a3b..02a9bd5 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -188,6 +200,27 @@ mke2fs: create a ext3 partition with the -j flag. debugfs: ext2 and ext3 file system debugger. ext2online: online (mounted) ext2 and ext3 filesystem resizer +Requirements +============ + +Ext3 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. 
Sane means: + +* write errors not allowed + +* sector writes are atomic + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* either write caching is disabled, or hw can do barriers and they are enabled. + + (Note that barriers are disabled by default, use "barrier=1" + mount option after making sure hw can support them). + + hdparm -I reports disk features. "Native Command + Queueing" is the feature you are looking for. References ========== -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 309+ messages in thread
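The erase-block hazard that expectations.txt describes can be made concrete with a toy model. This is my own illustration, not real FTL behavior; the sizes (4 "sectors" per erase block) and the erase-then-program sequence are simplified assumptions:

```python
# Toy flash device: filesystem-visible "sectors", but the hardware can
# only erase a whole multi-sector block. Writing one sector forces a
# read-modify-erase-program cycle over the entire erase block; if power
# is lost after the erase but before the program, the *neighboring*
# sectors in that erase block are trashed too.
SECTORS_PER_ERASE_BLOCK = 4

def write_sector(flash, sector, data, power_fails_mid_erase=False):
    base = (sector // SECTORS_PER_ERASE_BLOCK) * SECTORS_PER_ERASE_BLOCK
    block = flash[base:base + SECTORS_PER_ERASE_BLOCK]   # read the block
    block[sector - base] = data                          # modify one sector
    for i in range(SECTORS_PER_ERASE_BLOCK):             # erase the block
        flash[base + i] = None                           # (None = erased)
    if power_fails_mid_erase:
        return                                           # block never rewritten
    for i, s in enumerate(block):                        # program it back
        flash[base + i] = s

flash = ['s0', 's1', 's2', 's3', 's4', 's5', 's6', 's7']
write_sector(flash, 1, 'new', power_fails_mid_erase=True)
# Sectors 0-3 are all gone, not just the one being written:
# flash == [None, None, None, None, 's4', 's5', 's6', 's7']
```

A journal can only protect sectors it knows are being written; it has no record of the adjacent sectors the device erased on its own, which is why the collateral-damage case breaks the journaling guarantees.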
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-16 12:30 ` Pavel Machek @ 2009-03-16 19:03 ` Theodore Tso 2009-03-23 18:23 ` Pavel Machek 2009-03-16 19:40 ` Sitsofe Wheeler 1 sibling, 1 reply; 309+ messages in thread From: Theodore Tso @ 2009-03-16 19:03 UTC (permalink / raw) To: Pavel Machek Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote: > Updated version here. > > On Thu 2009-03-12 14:13:03, Rob Landley wrote: > > On Thursday 12 March 2009 04:21:14 Pavel Machek wrote: > > > Not all block devices are suitable for all filesystems. In fact, some > > > block devices are so broken that reliable operation is pretty much > > > impossible. Document stuff ext2/ext3 needs for reliable operation. Some of what is here are bugs, and some are legitimate long-term interfaces (for example, the question of losing I/O errors when two processes are writing to the same file, or to a directory entry, and errors aren't, or in some cases can't, be reflected back to userspace). I'm a little concerned that some of this reads a bit too much like a rant (and I know Pavel was very frustrated when he tried to use a flash card with a sucky flash card socket) and it will get used the wrong way by apologists, because it mixes areas where "we suck, we should do better", which are bug reports, and "Posix or the underlying block device layer makes it hard", and simply states them as fundamental design requirements, when that's probably not true. There's a lot of work that we could do to make I/O errors get better reflected to userspace by fsync(). So stating things as bald requirements goes, I think, a little too far IMHO. We can surely do better.
> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt > new file mode 100644 > index 0000000..710d119 > --- /dev/null > +++ b/Documentation/filesystems/expectations.txt > + > +Write errors not allowed (NO-WRITE-ERRORS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Writes to media never fail. Even if disk returns error condition > +during write, filesystems can't handle that correctly, because success > +on fsync was already returned when data hit the journal. The last half of this sentence, "because success on fsync was already returned when data hit the journal", obviously doesn't apply to all filesystems, since some filesystems, like ext2, don't journal data. Even for ext3, it only applies in the case of data=journal mode. There are other issues here, such as the fact that fsync() only reports an I/O problem to one caller, and in some cases I/O errors aren't propagated up the storage stack. The latter is clearly just a bug that should be fixed; the former is more of an interface limitation. But you don't talk about it in this section, and I think it would be good to have a more extended discussion about I/O errors when writing data blocks, and I/O errors writing metadata blocks, etc. > + > +Sector writes are atomic (ATOMIC-SECTORS) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Either whole sector is correctly written or nothing is written during > +powerfail. This requirement is not quite the same as what you discuss below. > + > + Unfortunately, none of the cheap USB/SD flash cards I've seen > + do behave like this, and are thus unsuitable for all Linux > + filesystems I know. > + > + An inherent problem with using flash as a normal block > + device is that the flash erase size is bigger than > + most filesystem sector sizes. So when you request a > + write, it may erase and rewrite some 64k, 128k, or > + even a couple megabytes on the really _big_ ones.
> + > + If you lose power in the middle of that, filesystem > + won't notice that data in the "sectors" _around_ the > + one you were trying to write to got trashed. The characteristic you describe here is not an issue about whether the whole sector is either written or nothing happens to the data --- but rather, or at least in addition to that, there is also the issue that when there is a flash card failure --- particularly one caused by a sucky flash card reader design causing the SD card to disconnect from the laptop in the middle of a write --- there may be "collateral damage"; that is, in addition to corrupting the sector being written, adjacent sectors might also end up getting lost as an unfortunate side effect. So there are actually two desirable properties for a storage system to have; one is "don't damage the old data on a failed write"; and the other is "don't cause collateral damage to adjacent sectors on a failed write". > + Because RAM tends to fail faster than rest of system during > + powerfail, special hw killing DMA transfers may be necessary; > + otherwise, disks may write garbage during powerfail. > + Not sure how common that problem is on generic PC machines. This problem is still relatively common, from what I can tell. And ext3 handles this surprisingly well, at least in the catastrophic case of garbage getting written into the inode table, since the journal replay often will "repair" the garbage that was written into the filesystem metadata blocks. It won't do a bit of good for the data blocks, of course (unless you are using data=journal mode). But this means that in fact, ext3 is more resistant to the first problem (powerfail while writing can damage old data on a failed write); but fortunately, hard drives generally don't cause collateral damage on a failed write.
Of course, there are some spectacular exceptions to this rule --- a physical shock which causes the head to slam into a surface moving at 7200rpm can throw a lot of debris into the hard drive enclosure, causing loss to adjacent sectors. - Ted ^ permalink raw reply [flat|nested] 309+ messages in thread
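The scale of that erase-block collateral damage is easy to sketch. Assuming, purely for illustration, a 128 KiB erase block and 512-byte logical sectors (real cards vary, and this is not code from any patch in this thread), one interrupted write endangers 255 neighbouring sectors:

```c
#include <stddef.h>

/* Illustrative numbers only: erase-block and sector sizes vary by card. */
#define SECTOR_SIZE 512u
#define ERASE_BLOCK (128u * 1024u)

/* Given a logical sector, report the range of sectors sharing its
 * erase block -- all of them are at risk if power fails mid-erase. */
void blast_radius(unsigned sector, unsigned *first, unsigned *last)
{
    unsigned per_block = ERASE_BLOCK / SECTOR_SIZE;  /* 256 sectors */
    *first = (sector / per_block) * per_block;
    *last = *first + per_block - 1;
}
```

So a filesystem that carefully journals a single 512-byte write can still lose unrelated data hundreds of sectors away, which is exactly the NO-COLLATERALS property discussed above.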
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-16 19:03 ` Theodore Tso @ 2009-03-23 18:23 ` Pavel Machek 0 siblings, 0 replies; 309+ messages in thread From: Pavel Machek @ 2009-03-23 18:23 UTC (permalink / raw) To: Theodore Tso, Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4 Hi! > > > > Not all block devices are suitable for all filesystems. In fact, some > > > > block devices are so broken that reliable operation is pretty much > > > > impossible. Document stuff ext2/ext3 needs for reliable operation. > > Some of what is here are bugs, and some are legitimate long-term > interfaces (for example, the question of losing I/O errors when two > processes are writing to the same file, or to a directory entry, and > errors aren't or in some cases, can't, be reflected back to > userspace). Well, I guess there's a thin line between error and "legitimate long-term interfaces". I still believe that fsync() is broken by design. > I'm a little concerned that some of this reads a bit too much like a > rant (and I know Pavel was very frustrated when he tried to use a > flash card with a sucky flash card socket) and it will get used the It started as a rant, obviously I'd like to get away from that and get it into suitable format for inclusion. (Not being a native speaker does not help here). But I do believe that we should get this documented; many common storage subsystems are broken, and can cause data loss. We should at least tell the users. > wrong way by apologists, because it mixes areas where "we suck, we > should do better", which are bug reports, and "Posix or the > underlying block device layer makes it hard", and simply states them > as fundamental design requirements, when that's probably not true. Well, I guess that can be refined later. Heck, I'm not able to tell which are simple bugs likely to be fixed soon, and which are fundamental issues that are unlikely to be fixed sooner than 2030.
I guess it is fair to document them ASAP, and then fix those that can be fixed... > There's a lot of work that we could do to make I/O errors get better > reflected to userspace by fsync(). So state things as bald > requirements I think goes a little too far IMHO. We can surely do > better. If the fsync() can be fixed... that would be great. But I'm not sure how easy that will be. > > +Write errors not allowed (NO-WRITE-ERRORS) > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > + > > +Writes to media never fail. Even if disk returns error condition > > +during write, filesystems can't handle that correctly, because success > > +on fsync was already returned when data hit the journal. > > The last half of this sentence "because success on fsync was already > returned when data hit the journal", obviously doesn't apply to all > filesystems, since some filesystems, like ext2, don't journal data. > Even for ext3, it only applies in the case of data=journal mode. Ok, I removed the explanation. > There are other issues here, such as fsync() only reports an I/O > problem to one caller, and in some cases I/O errors aren't propagated > up the storage stack. The latter is clearly just a bug that should be > fixed; the former is more of an interface limitation. But you don't > talk about in this section, and I think it would be good to have a > more extended discussion about I/O errors when writing data blocks, > and I/O errors writing metadata blocks, etc. Could you write a paragraph or two? > > + > > +Sector writes are atomic (ATOMIC-SECTORS) > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > + > > +Either whole sector is correctly written or nothing is written during > > +powerfail. > > This requirement is not quite the same as what you discuss below. Ok, you are right. Fixed. 
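Userspace's side of the fsync() limitation quoted above (errors reported only once, to whichever caller fsync()s first) is worth seeing concretely. Below is a hedged sketch; the helper name durable_write is made up for illustration and is not part of any patch in this thread. It shows the only pattern an application can use: check fsync()'s return value, knowing it may be the first and last place a media write error surfaces.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write data and force it to media.  fsync() is
 * often the first call where a write error becomes visible -- and the
 * error may be reported only to the first process that calls fsync()
 * after it occurred, so a second caller can see success. */
int durable_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    if (n < 0 || (size_t)n != len) {
        close(fd);
        return -1;
    }
    if (fsync(fd) != 0) {            /* e.g. EIO from the block layer */
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;                   /* possibly the only chance to see it */
    }
    return close(fd);
}
```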
> So there are actually two desirable properties for a storage system to > have; one is "don't damage the old data on a failed write"; and the > other is "don't cause collateral damage to adjacent sectors on a > failed write". Thanks, it's indeed clearer that way. I split those in two. > > + Because RAM tends to fail faster than rest of system during > > + powerfail, special hw killing DMA transfers may be necessary; > > + otherwise, disks may write garbage during powerfail. > > + Not sure how common that problem is on generic PC machines. > > This problem is still relatively common, from what I can tell. And > ext3 handles this surprisingly well at least in the catastrophic case > of garbage getting written into the inode table, since the journal > replay often will "repair" the garbage that was written into the ... Ok, added to ext3 specific section. New version is attached. Feel free to help here; my goal is to get this documented, I'm not particularly attached to wording etc... Signed-off-by: Pavel Machek <pavel@ucw.cz> Pavel diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt new file mode 100644 index 0000000..0de456d --- /dev/null +++ b/Documentation/filesystems/expectations.txt @@ -0,0 +1,49 @@ +Linux block-backed filesystems can only work correctly when several +conditions are met in the block layer and below (disks, flash +cards). Some of them are obvious ("data on media should not change +randomly"), some are less so. + +Write errors not allowed (NO-WRITE-ERRORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writes to media never fail. Even if disk returns error condition +during write, filesystems can't handle that correctly. + + Fortunately writes failing are very uncommon on traditional + spinning disks, as they have spare sectors they use when write + fails.
+ +Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Unfortunately, the cheap USB/SD flash cards I've seen do have this bug, +and are thus unsuitable for all filesystems I know. + + An inherent problem with using flash as a normal block device + is that the flash erase size is bigger than most filesystem + sector sizes. So when you request a write, it may erase and + rewrite some 64k, 128k, or even a couple megabytes on the + really _big_ ones. + + If you lose power in the middle of that, filesystem won't + notice that data in the "sectors" _around_ the one you were + trying to write to got trashed. + + +Don't damage the old data on a failed write (ATOMIC-WRITES) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either whole sector is correctly written or nothing is written during +powerfail. + + Because RAM tends to fail faster than rest of system during + powerfail, special hw killing DMA transfers may be necessary; + otherwise, disks may write garbage during powerfail. + This may be quite common on generic PC machines. + + Note that atomic write is very hard to guarantee for RAID-4/5/6, + because it needs to write both changed data, and parity, to + different disks. UPS for RAID array should help. + + + diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index 2344855..ee88467 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they have to be 8 character filenames, even then we are fairly close to running out of unique filenames. +Requirements +============ + +Ext2 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk.
Sane means: + +* write errors not allowed (NO-WRITE-ERRORS) + +* don't damage the old data on a failed write (ATOMIC-WRITES) + +and obviously: + +* don't cause collateral damage to adjacent sectors on a failed write + (NO-COLLATERALS) + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* write caching is disabled. ext2 does not know how to issue barriers + as of 2.6.28. hdparm -W0 disables it on SATA disks. + Journaling ----------- - -A journaling extension to the ext2 code has been developed by Stephen -Tweedie. It avoids the risks of metadata corruption and the need to -wait for e2fsck to complete after a crash, without requiring a change -to the on-disk ext2 layout. In a nutshell, the journal is a regular -file which stores whole metadata (and optionally data) blocks that have -been modified, prior to writing them into the filesystem. This means -it is possible to add a journal to an existing ext2 filesystem without -the need for data conversion. - -When changes to the filesystem (e.g. a file is renamed) they are stored in -a transaction in the journal and can either be complete or incomplete at -the time of a crash. If a transaction is complete at the time of a crash -(or in the normal case where the system does not crash), then any blocks -in that transaction are guaranteed to represent a valid filesystem state, -and are copied into the filesystem. If a transaction is incomplete at -the time of the crash, then there is no guarantee of consistency for -the blocks in that transaction so they are discarded (which means any -filesystem changes they represent are also lost). +========== Check Documentation/filesystems/ext3.txt if you want to read more about ext3 and journaling. 
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index e5f3833..6de8af4 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -188,6 +200,45 @@ mke2fs: create a ext3 partition with the -j flag. debugfs: ext2 and ext3 file system debugger. ext2online: online (mounted) ext2 and ext3 filesystem resizer +Requirements +============ + +Ext3 expects disk/storage subsystem to behave sanely. On sanely +behaving disk subsystem, data that have been successfully synced will +stay on the disk. Sane means: + +* write errors not allowed (NO-WRITE-ERRORS) + +* don't damage the old data on a failed write (ATOMIC-WRITES) + + (Trash may get written into sectors during powerfail. And + ext3 handles this surprisingly well at least in the + catastrophic case of garbage getting written into the inode + table, since the journal replay often will "repair" the + garbage that was written into the filesystem metadata blocks. + It won't do a bit of good for the data blocks, of course + (unless you are using data=journal mode). But this means that + in fact, ext3 is more resistant to the first problem + (powerfail while writing can damage old data on a failed + write); and fortunately, hard drives generally don't + cause collateral damage on a failed write.) + +and obviously: + +* don't cause collateral damage to adjacent sectors on a failed write + (NO-COLLATERALS) + + +(see expectations.txt; note that most/all linux block-based +filesystems have similar expectations) + +* either write caching is disabled, or hw can do barriers and they are enabled. + + (Note that barriers are disabled by default, use "barrier=1" + mount option after making sure hw can support them). + + hdparm -I reports disk features; "Native + Command Queueing" is the feature you are looking for.
References ========== -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 309+ messages in thread
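The RAID-4/5/6 caveat in the patch above ("needs to write both changed data, and parity, to different disks") is the classic parity-update problem. A toy XOR-parity model (illustrative only, nothing like real md code) shows why one logical write becomes two dependent device writes; power failure between them leaves the stripe inconsistent, which is why ATOMIC-WRITES is hard to guarantee without battery backup:

```c
#include <stddef.h>

/* Toy RAID-5-style parity: P = D0 ^ D1 ^ ... ^ Dn-1, with each D on
 * a different disk. */
unsigned char stripe_parity(const unsigned char *d, size_t n)
{
    unsigned char p = 0;
    for (size_t i = 0; i < n; i++)
        p ^= d[i];
    return p;
}

/* Incremental parity update for one changed data block:
 * new_P = old_P ^ old_D ^ new_D.  Note the data write and this
 * parity write go to two different disks -- the two cannot be made
 * atomic by the array itself. */
unsigned char parity_update(unsigned char old_p, unsigned char old_d,
                            unsigned char new_d)
{
    return old_p ^ old_d ^ new_d;
}
```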
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-16 12:30 ` Pavel Machek 2009-03-16 19:03 ` Theodore Tso @ 2009-03-16 19:40 ` Sitsofe Wheeler 2009-03-16 21:43 ` Rob Landley 2009-03-23 11:00 ` Pavel Machek 1 sibling, 2 replies; 309+ messages in thread From: Sitsofe Wheeler @ 2009-03-16 19:40 UTC (permalink / raw) To: Pavel Machek Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote: > + Unfortunately, none of the cheap USB/SD flash cards I've seen > + do behave like this, and are thus unsuitable for all Linux > + filesystems I know. When you say Linux filesystems do you mean "filesystems originally designed on Linux" or do you mean "filesystems that Linux supports"? Additionally whatever the answer, people are going to need help answering the "which is the least bad?" question and saying what's not good without offering alternatives is only half helpful... People need to put SOMETHING on these cheap (and not quite so cheap) devices... The last recommendation I heard was that until btrfs/logfs/nilfs arrive people are best off sticking with FAT - http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that should be mentioned? > +* either write caching is disabled, or hw can do barriers and they are enabled. > + > + (Note that barriers are disabled by default, use "barrier=1" > + mount option after making sure hw can support them). > + > + hdparm -I reports disk features. If you have "Native > + Command Queueing" is the feature you are looking for. The document makes it sound like nearly everything bar battery backed hardware RAIDed SCSI disks (with perfect firmware) is bad - is this the intent? -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-16 19:40 ` Sitsofe Wheeler @ 2009-03-16 21:43 ` Rob Landley 2009-03-17 4:55 ` Kyle Moffett 2009-03-23 11:00 ` Pavel Machek 1 sibling, 1 reply; 309+ messages in thread From: Rob Landley @ 2009-03-16 21:43 UTC (permalink / raw) To: Sitsofe Wheeler Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Monday 16 March 2009 14:40:57 Sitsofe Wheeler wrote: > On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote: > > + Unfortunately, none of the cheap USB/SD flash cards I've seen > > + do behave like this, and are thus unsuitable for all Linux > > + filesystems I know. > > When you say Linux filesystems do you mean "filesystems originally > designed on Linux" or do you mean "filesystems that Linux supports"? > Additionally whatever the answer, people are going to need help > answering the "which is the least bad?" question and saying what's not > good without offering alternatives is only half helpful... People need > to put SOMETHING on these cheap (and not quite so cheap) devices... The > last recommendation I heard was that until btrfs/logfs/nilfs arrive > people are best off sticking with FAT - > http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that > should be mentioned? Actually, the best filesystem for USB flash devices is probably UDF. (Yes, the DVD filesystem turns out to be writeable if you put it on a writeable media. The ISO spec requires write support, so any OS that supports DVDs also supports this.) The reasons for this are: A) It's the only filesystem other than FAT that's supported out of the box by windows, mac, _and_ Linux for hotpluggable media. B) It doesn't have the horrible limitations of FAT (such as a max filesize of 2 gigabytes). C) Microsoft doesn't claim to own it, and thus hasn't sued anybody over patents on it. 
However, when it comes to cutting the power on a mounted filesystem (either by yanking the device or powering off the machine) without losing your data, without warning, they all suck horribly. If you yank a USB flash disk in the middle of a write, and the device has decided to wipe a 2 megabyte erase sector that's behind a layer of wear levelling and thus consists of a series of random sectors scattered all over the disk, you're screwed no matter what filesystem you use. You know the vinyl "record scratch" sound? Imagine that, on a digital level. Bad Things Happen to the hardware, cannot compensate in software. > > +* either write caching is disabled, or hw can do barriers and they are > > enabled. + > > + (Note that barriers are disabled by default, use "barrier=1" > > + mount option after making sure hw can support them). > > + > > + hdparm -I reports disk features. If you have "Native > > + Command Queueing" is the feature you are looking for. > > The document makes it sound like nearly everything bar battery backed > hardware RAIDed SCSI disks (with perfect firmware) is bad - is this > the intent? SCSI disks? They still make those? Everything fails, it's just a question of how. Rotational media combined with journaling at least fails in fairly understandable ways, so ext3 on sata is reasonable. Flash gets into trouble when it presents the _interface_ of rotational media (a USB block device with normal 512 byte read/write sectors, which never wear out) which doesn't match what the hardware's actually doing (erase block sizes of up to several megabytes at a time, hidden behind a block remapping layer for wear leveling). For devices that have built in flash that DON'T pretend to be a conventional block device, but instead expose their flash erase granularity and let the OS do the wear levelling itself, we have special flash filesystems that can be reasonably reliable. It's just that ext3 isn't one of them, jffs2 and ubifs and logfs are. 
The problem with these flash filesystems is they ONLY work on flash, if you want to mount them on something other than flash you need something like a loopback interface to make a normal block device pretend to be flash. (We've got a ramdisk driver called "mtdram" that does this, but nobody's bothered to write a generic wrapper for a normal block device you can wrap over the loopback driver.) Unfortunately, when it comes to USB flash (the most common type), the USB standard defines a way for a USB device to provide a normal block disk interface as if it was rotational media. It does NOT provide a way to expose the flash erase granularity, or a way for the operating system to disable any built-in wear levelling (which is needed because windows doesn't _do_ wear levelling, and thus burns out the administrative sectors of the disk really fast while the rest of the disk is still fine unless the hardware wear-levels for it). So every USB flash disk pretends to be a normal disk, which it isn't, and Linux can't _disable_ this emulation. Which brings us back to UDF as the least sucky alternative. (Although the UDF tools kind of suck. If you reformat a FAT disk as UDF with mkudffs, it'll still be autodetected as FAT because it won't overwrite the FAT root directory. You have to blank the first 64k by hand with dd. Sad, isn't it?) Rob ^ permalink raw reply [flat|nested] 309+ messages in thread
* Re: ext2/3: document conditions when reliable operation is possible 2009-03-16 21:43 ` Rob Landley @ 2009-03-17 4:55 ` Kyle Moffett 0 siblings, 0 replies; 309+ messages in thread From: Kyle Moffett @ 2009-03-17 4:55 UTC (permalink / raw) To: Rob Landley Cc: Sitsofe Wheeler, Pavel Machek, kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4 On Mon, Mar 16, 2009 at 5:43 PM, Rob Landley <rob@landley.net> wrote: > Flash gets into trouble when it presents the _interface_ of rotational media > (a USB block device with normal 512 byte read/write sectors, which never wear > out) which doesn't match what the hardware's actually doing (erase block sizes > of up to several megabytes at a time, hidden behind a block remapping layer > for wear leveling). > > For devices that have built in flash that DON'T pretend to be a conventional > block device, but instead expose their flash erase granularity and let the OS > do the wear levelling itself, we have special flash filesystems that can be > reasonably reliable. It's just that ext3 isn't one of them, jffs2 and ubifs > and logfs are. The problem with these flash filesystems is they ONLY work on > flash, if you want to mount them on something other than flash you need > something like a loopback interface to make a normal block device pretend to > be flash. (We've got a ramdisk driver called "mtdram" that does this, but > nobody's bothered to write a generic wrapper for a normal block device you can > wrap over the loopback driver.) The really nice SSDs actually reserve ~15-30% of their internal block-level storage and actually run their own log-structured virtual disk in hardware. From what I understand the Intel SSDs are that way. Real-time garbage collection is tricky, but if you require (for example) a max of ~80% utilization then you can provide good latency and bandwidth guarantees. There's usually something like a log-structured virtual-to-physical sector map as well. 
If designed properly, with automatic hardware checksumming, such a system can
actually provide atomic writes and barriers with virtually no impact on
performance. With firmware-level hardware knowledge and the ability to
perform extremely efficient parallel reads of flash blocks, such a
log-structured virtual block device can be many times more efficient than a
general-purpose OS running a log-structured filesystem.

The result is that for an ordinary ext3-esque filesystem with 4k blocks, you
can treat the SSD as though it were an atomic-write, seek-less block device.

Now if only I had the spare cash to go out and buy one of the shiny Intel
ones for my laptop... :-)

Cheers,
Kyle Moffett
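The log-structured virtual-to-physical sector map Kyle describes can be
illustrated with a toy model. This is my sketch, not Intel's firmware: the
key property is that data goes to a fresh physical slot first and the map
entry flips only afterwards, so a write torn by powerfail leaves the old
sector contents visible.

```python
# Toy log-structured block remapper (illustration only, not real firmware).
# Writes are out-of-place; the virtual->physical map update is the single
# "commit point". Losing power before the commit leaves the old data intact.

class LogStructuredDev:
    def __init__(self, nphysical):
        self.phys = [None] * nphysical      # physical sector store
        self.v2p = {}                       # virtual -> physical map
        self.free = list(range(nphysical))  # simple free list

    def write(self, vsec, data, fail_before_commit=False):
        slot = self.free.pop(0)
        self.phys[slot] = data              # step 1: write out of place
        if fail_before_commit:
            return                          # powerfail: map never updated
        old = self.v2p.get(vsec)
        self.v2p[vsec] = slot               # step 2: atomic map update
        if old is not None:
            self.free.append(old)           # old slot can be reclaimed

    def read(self, vsec):
        slot = self.v2p.get(vsec)
        return self.phys[slot] if slot is not None else None

dev = LogStructuredDev(nphysical=16)
dev.write(0, b"old")
dev.write(0, b"new", fail_before_commit=True)  # simulated powerfail
assert dev.read(0) == b"old"                   # old data survives intact
```

A real device additionally has to make the map update itself durable and
checksummed, which is where the reserved 15-30% and the garbage collector
come in; the toy model only shows the atomicity argument.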
* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-16 19:40 ` Sitsofe Wheeler
  2009-03-16 21:43 ` Rob Landley
@ 2009-03-23 11:00 ` Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-03-23 11:00 UTC (permalink / raw)
To: Sitsofe Wheeler
Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso,
	rdunlap, linux-doc, linux-ext4

On Mon 2009-03-16 19:40:57, Sitsofe Wheeler wrote:
> On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> > +	Unfortunately, none of the cheap USB/SD flash cards I've seen
> > +	do behave like this, and are thus unsuitable for all Linux
> > +	filesystems I know.
>
> When you say Linux filesystems do you mean "filesystems originally
> designed on Linux" or do you mean "filesystems that Linux supports"?

"Linux filesystems I know" :-). No filesystem that Linux supports, AFAICT.

> Additionally, whatever the answer, people are going to need help
> answering the "which is the least bad?" question, and saying what's not
> good without offering alternatives is only half helpful... People need
> to put SOMETHING on these cheap (and not quite so cheap) devices...

As far as I'm concerned, people should just AVOID those devices. I don't
plan to point out the "least bad"; it's still bad.

> > +	hdparm -I reports disk features. "Native Command Queueing" is
> > +	the feature you are looking for.
>
> The document makes it sound like nearly everything bar battery-backed
> hardware-RAIDed SCSI disks (with perfect firmware) is bad - is this
> the intent?

Battery-backed RAID should be OK, as should a plain single SATA drive.

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-12 19:13 ` Rob Landley
  2009-03-16 12:28 ` Pavel Machek
  2009-03-16 12:30 ` Pavel Machek
@ 2009-08-29  1:33 ` Robert Hancock
  2009-08-29 13:04 ` Alan Cox
  2 siblings, 1 reply; 309+ messages in thread
From: Robert Hancock @ 2009-08-29 1:33 UTC (permalink / raw)
To: Rob Landley
Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, tytso,
	rdunlap, linux-doc, linux-ext4, Alan Cox

On 03/12/2009 01:13 PM, Rob Landley wrote:
>> +* write caching is disabled. ext2 does not know how to issue barriers
>> +  as of 2.6.28. hdparm -W0 disables it on SATA disks.
>
> And here we're talking about ext2.  Does neither one know about write
> barriers, or does this just apply to ext2?  (What about ext4?)
>
> Also I remember a historical problem that not all disks honor write
> barriers, because actual data integrity makes for horrible benchmark
> numbers.  Dunno how current that is with SATA; Alan Cox would probably
> know.

I've heard rumors of disks that claim to support cache flushes but really
just ignore them, but I have never heard any specifics of model numbers,
etc. which are known to do this, so it may just be legend. If we do have
such knowledge then we should really be blacklisting those drives and
warning the user that we can't ensure data integrity. (Even powering down
the system would be unsafe in this case.)
* Re: ext2/3: document conditions when reliable operation is possible
  2009-08-29  1:33 ` Robert Hancock
@ 2009-08-29 13:04 ` Alan Cox
  0 siblings, 0 replies; 309+ messages in thread
From: Alan Cox @ 2009-08-29 13:04 UTC (permalink / raw)
To: Robert Hancock
Cc: Rob Landley, Pavel Machek, kernel list, Andrew Morton,
	mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4

> I've heard rumors of disks that claim to support cache flushes but
> really just ignore them, but have never heard any specifics of model
> numbers, etc. which are known to do this, so it may just be legend. If
> we do have such knowledge then we should really be blacklisting those
> drives and warning the user that we can't ensure data integrity. (Even
> powering down the system would be unsafe in this case.)

This should not be the case for any vaguely modern drive. The standard
requires that the drive flush its cache when sent the command, and the size
of the caches on modern drives rather requires it.

Alan
* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-12  9:21 ext2/3: document conditions when reliable operation is possible Pavel Machek
@ 2009-03-16 19:45 ` Greg Freemyer
  2009-03-12 19:13 ` Rob Landley
  2009-03-16 19:45 ` Greg Freemyer
  2 siblings, 0 replies; 309+ messages in thread
From: Greg Freemyer @ 2009-03-16 19:45 UTC (permalink / raw)
To: Pavel Machek
Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

On Thu, Mar 12, 2009 at 5:21 AM, Pavel Machek <pavel@ucw.cz> wrote:
<snip>
> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +	Unfortuantely, none of the cheap USB/SD flash cards I seen do
> +	behave like this, and are unsuitable for all linux filesystems
> +	I know.
> +
> +		An inherent problem with using flash as a normal block
> +		device is that the flash erase size is bigger than
> +		most filesystem sector sizes.  So when you request a
> +		write, it may erase and rewrite the next 64k, 128k, or
> +		even a couple megabytes on the really _big_ ones.
> +
> +		If you lose power in the middle of that, filesystem
> +		won't notice that data in the "sectors" _around_ the
> +		one your were trying to write to got trashed.

I had *assumed* that SSDs worked like:

1) write request comes in
2) new unused erase block area marked to hold the new data
3) updated data written to the previously unused erase block
4) mapping updated to replace the old erase block with the new one

If it were done that way, a failure in the middle would just leave the
SSD with the old data in it. If it is not done that way, then I can see
your issue. (I love the potential performance of SSDs, but I'm beginning
to hate the implementations and spec writing.)

Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
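For contrast with the copy-on-write flow Greg assumes above, here is a toy
model (my construction, not from the thread) of the naive erase-in-place
behaviour Pavel's patch describes: updating one sector means erasing and
rewriting the whole erase block, so losing power mid-rewrite trashes sectors
the filesystem never asked to touch.

```python
# Toy model of an in-place erase/rewrite flash controller (illustration
# only). A 128 KiB erase block holds 256 sectors of 512 bytes; a flash
# erase sets every byte to 0xFF, and the rewrite then restores sectors
# one by one -- unless power fails partway through.

SECTOR = 512
SECTORS_PER_ERASE_BLOCK = 256  # 128 KiB erase block

def rewrite_in_place(block, idx, data, fail_after=None):
    """Update sector `idx` the naive way: erase whole block, rewrite all.
    fail_after=N simulates losing power after N sectors were rewritten."""
    new = list(block)
    new[idx] = data
    out = [b"\xff" * SECTOR] * len(block)   # whole block erased first
    for i, sec in enumerate(new):
        if fail_after is not None and i >= fail_after:
            break                           # power lost mid-rewrite
        out[i] = sec
    return out

block = [bytes([i]) * SECTOR for i in range(SECTORS_PER_ERASE_BLOCK)]
survived = rewrite_in_place(block, idx=10, data=b"Z" * SECTOR, fail_after=11)

assert survived[10] == b"Z" * SECTOR        # the requested write landed
assert survived[200] == b"\xff" * SECTOR    # an innocent bystander is gone
```

Under Greg's assumed flow (steps 1-4), the same powerfail would instead
leave the old erase block mapped in, and no bystander sectors would change.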
* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-16 19:45 ` Greg Freemyer
@ 2009-03-16 21:48 ` Pavel Machek
  0 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-03-16 21:48 UTC (permalink / raw)
To: Greg Freemyer
Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

On Mon 2009-03-16 15:45:36, Greg Freemyer wrote:
> On Thu, Mar 12, 2009 at 5:21 AM, Pavel Machek <pavel@ucw.cz> wrote:
> <snip>
> > +Sector writes are atomic (ATOMIC-SECTORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> > +
> > +	Unfortuantely, none of the cheap USB/SD flash cards I seen do
> > +	behave like this, and are unsuitable for all linux filesystems
> > +	I know.
> > +
> > +		An inherent problem with using flash as a normal block
> > +		device is that the flash erase size is bigger than
> > +		most filesystem sector sizes.  So when you request a
> > +		write, it may erase and rewrite the next 64k, 128k, or
> > +		even a couple megabytes on the really _big_ ones.
> > +
> > +		If you lose power in the middle of that, filesystem
> > +		won't notice that data in the "sectors" _around_ the
> > +		one your were trying to write to got trashed.
>
> I had *assumed* that SSDs worked like:
>
> 1) write request comes in
> 2) new unused erase block area marked to hold the new data
> 3) updated data written to the previously unused erase block
> 4) mapping updated to replace the old erase block with the new one
>
> If it were done that way, a failure in the middle would just leave the
> SSD with the old data in it.

The really expensive ones (Intel SSD) apparently work like that, but I have
never seen one of those. The USB sticks and SD cards I tried behave like I
described above.

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html