* ext2/3: document conditions when reliable operation is possible
@ 2009-03-12  9:21 Pavel Machek
  2009-03-12 11:40 ` Jochen Voß
                   ` (2 more replies)
  0 siblings, 3 replies; 309+ messages in thread
From: Pavel Machek @ 2009-03-12  9:21 UTC (permalink / raw)
  To: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc
  Cc: linux-ext4


Not all block devices are suitable for all filesystems. In fact, some
block devices are so broken that reliable operation is pretty much
impossible. Document stuff ext2/ext3 needs for reliable operation.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..9c3d729
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,47 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly, because success
+on fsync was already returned when data hit the journal.
+
+	Fortunately writes failing are very uncommon on traditional 
+	spinning disks, as they have spare sectors they use when write
+	fails.
+
+Sector writes are atomic (ATOMIC-SECTORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+	Unfortuantely, none of the cheap USB/SD flash cards I seen do 
+	behave like this, and are unsuitable for all linux filesystems 
+	I know. 
+
+		An inherent problem with using flash as a normal block
+		device is that the flash erase size is bigger than
+		most filesystem sector sizes.  So when you request a
+		write, it may erase and rewrite the next 64k, 128k, or
+		even a couple megabytes on the really _big_ ones.
+
+		If you lose power in the middle of that, filesystem
+		won't notice that data in the "sectors" _around_ the
+		one you were trying to write to got trashed.
+
+	Because RAM tends to fail faster than rest of system during 
+	powerfail, special hw killing DMA transfers may be neccessary;
+	otherwise, disks may write garbage during powerfail.
+	Not sure how common that problem is on generic PC machines.
+
+	Note that atomic write is very hard to guarantee for RAID-4/5/6,
+	because it needs to write both changed data, and parity, to 
+	different disks.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 4333e83..b09aa4c 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
 have to be 8 character filenames, even then we are fairly close to
 running out of unique filenames.
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+  as of 2.6.28. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie.  It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout.  In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem.  This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash.  If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem.  If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
 Check Documentation/filesystems/ext3.txt if you want to read more about
 ext3 and journaling.
 
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..02a9bd5 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +200,27 @@ mke2fs: 	create a ext3 partition with the -j flag.
 debugfs: 	ext2 and ext3 file system debugger.
 ext2online:	online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+	   (Note that barriers are disabled by default, use "barrier=1"
+	   mount option after making sure hw can support them). 
+
+	   hdparm -I reports disk features. If you have "Native
+	   Command Queueing", that is the feature you are looking for.
 
 References
 ==========

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-12  9:21 ext2/3: document conditions when reliable operation is possible Pavel Machek
@ 2009-03-12 11:40 ` Jochen Voß
  2009-03-21 11:24     ` Pavel Machek
  2009-03-12 19:13 ` Rob Landley
  2009-03-16 19:45   ` Greg Freemyer
  2 siblings, 1 reply; 309+ messages in thread
From: Jochen Voß @ 2009-03-12 11:40 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

Hi,

2009/3/12 Pavel Machek <pavel@ucw.cz>:
> diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> index 4333e83..b09aa4c 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
>  have to be 8 character filenames, even then we are fairly close to
>  running out of unique filenames.
>
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
   ^^^^
Shouldn't this be "Ext2"?

All the best,
Jochen
-- 
http://seehuhn.de/

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-12  9:21 ext2/3: document conditions when reliable operation is possible Pavel Machek
  2009-03-12 11:40 ` Jochen Voß
@ 2009-03-12 19:13 ` Rob Landley
  2009-03-16 12:28   ` Pavel Machek
                     ` (2 more replies)
  2009-03-16 19:45   ` Greg Freemyer
  2 siblings, 3 replies; 309+ messages in thread
From: Rob Landley @ 2009-03-12 19:13 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

On Thursday 12 March 2009 04:21:14 Pavel Machek wrote:
> Not all block devices are suitable for all filesystems. In fact, some
> block devices are so broken that reliable operation is pretty much
> impossible. Document stuff ext2/ext3 needs for reliable operation.
>
> Signed-off-by: Pavel Machek <pavel@ucw.cz>
>
> diff --git a/Documentation/filesystems/expectations.txt
> b/Documentation/filesystems/expectations.txt new file mode 100644
> index 0000000..9c3d729
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,47 @@
> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly, because success
> +on fsync was already returned when data hit the journal.
> +
> +	Fortunately writes failing are very uncommon on traditional
> +	spinning disks, as they have spare sectors they use when write
> +	fails.

I vaguely recall that the behavior of when a write error _does_ occur is to 
remount the filesystem read only?  (Is this VFS or per-fs?)

Is there any kind of hotplug event associated with this?

I'm aware write errors shouldn't happen, and by the time they do it's too late 
to gracefully handle them, and all we can do is fail.  So how do we fail?
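
For what it's worth, a minimal userspace sketch (hypothetical, plain POSIX, not part of the patch) of where such a failure first becomes visible to an application -- typically not at write(), but at fsync() or close():

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile";
	const char buf[] = "important data\n";
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, buf, sizeof(buf) - 1) < 0) {
		perror("write");	/* rarely fails: data only went to the page cache */
		return 1;
	}
	if (fsync(fd) < 0) {
		perror("fsync");	/* usually the first place a media error (EIO) shows up */
		return 1;
	}
	if (close(fd) < 0) {
		perror("close");	/* close() can surface deferred errors too */
		return 1;
	}
	puts("data reached stable storage, as far as the kernel can tell");
	return 0;
}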

> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +	Unfortuantely, none of the cheap USB/SD flash cards I seen do

I've seen

> +	behave like this, and are unsuitable for all linux filesystems

"are thus unsuitable", perhaps?  (Too pretentious? :)

> +	I know.
> +
> +		An inherent problem with using flash as a normal block
> +		device is that the flash erase size is bigger than
> +		most filesystem sector sizes.  So when you request a
> +		write, it may erase and rewrite the next 64k, 128k, or
> +		even a couple megabytes on the really _big_ ones.

Somebody corrected me, it's not "the next" it's "the surrounding".

(Writes aren't always cleanly at the start of an erase block, so critical data 
_before_ what you touch is endangered too.)

> +		If you lose power in the middle of that, filesystem
> +		won't notice that data in the "sectors" _around_ the
> +		one your were trying to write to got trashed.
> +
> +	Because RAM tends to fail faster than rest of system during
> +	powerfail, special hw killing DMA transfers may be neccessary;

Necessary

> +	otherwise, disks may write garbage during powerfail.
> +	Not sure how common that problem is on generic PC machines.
> +
> +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> +	because it needs to write both changed data, and parity, to
> +	different disks.

These days instead of "atomic" it's better to think in terms of "barriers".  
Requesting a flush blocks until all the data written _before_ that point has 
made it to disk.  This wait may be arbitrarily long on a busy system with lots 
of disk transactions happening in parallel (perhaps because Firefox decided to 
garbage collect and is spending the next 30 seconds swapping itself back in to 
do so).
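
A small sketch (hypothetical, plain POSIX) of that "flush waits for everything before it" behaviour as seen from userspace -- the same fsync() call is cheap or expensive depending on how much dirty data precedes it:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

static double seconds(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	size_t len = 64 << 20;			/* 64 MB of soon-to-be-dirty data */
	char *buf = malloc(len);
	int fd = open("flushtest.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	double t0, t1, t2;

	if (fd < 0 || !buf) {
		perror("setup");
		return 1;
	}
	memset(buf, 0xAB, len);
	if (write(fd, buf, len) < 0) {
		perror("write");
		return 1;
	}

	t0 = seconds();
	fsync(fd);	/* the "barrier": returns only after all of the above is on disk */
	t1 = seconds();
	fsync(fd);	/* nothing left to flush, returns almost immediately */
	t2 = seconds();

	printf("fsync with 64 MB dirty:   %.3f s\n", t1 - t0);
	printf("fsync with nothing dirty: %.3f s\n", t2 - t1);

	free(buf);
	close(fd);
	return 0;
}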

> +
> +
> diff --git a/Documentation/filesystems/ext2.txt
> b/Documentation/filesystems/ext2.txt index 4333e83..b09aa4c 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory
> entries, so they have to be 8 character filenames, even then we are fairly
> close to running out of unique filenames.
>
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:

This paragraph talks about ext3...

> +* write errors not allowed
> +
> +* sector writes are atomic
> +
> +(see expectations.txt; note that most/all linux block-based
> +filesystems have similar expectations)
> +
> +* write caching is disabled. ext2 does not know how to issue barriers
> +  as of 2.6.28. hdparm -W0 disables it on SATA disks.

And here we're talking about ext2.  Does neither one know about write 
barriers, or does this just apply to ext2?  (What about ext4?)

Also I remember a historical problem that not all disks honor write barriers, 
because actual data integrity makes for horrible benchmark numbers.  Dunno how 
current that is with SATA, Alan Cox would probably know.

Rob

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-12 19:13 ` Rob Landley
@ 2009-03-16 12:28   ` Pavel Machek
  2009-03-16 19:26     ` Rob Landley
  2009-03-16 12:30   ` Pavel Machek
  2009-08-29  1:33   ` Robert Hancock
  2 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-03-16 12:28 UTC (permalink / raw)
  To: Rob Landley
  Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

Hi!

> > +Write errors not allowed (NO-WRITE-ERRORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Writes to media never fail. Even if disk returns error condition
> > +during write, filesystems can't handle that correctly, because success
> > +on fsync was already returned when data hit the journal.
> > +
> > +	Fortunately writes failing are very uncommon on traditional
> > +	spinning disks, as they have spare sectors they use when write
> > +	fails.
> 
> I vaguely recall that the behavior of when a write error _does_ occur is to 
> remount the filesystem read only?  (Is this VFS or per-fs?)

Per-fs.

> Is there any kind of hotplug event associated with this?

I don't think so.

> I'm aware write errors shouldn't happen, and by the time they do it's too late 
> to gracefully handle them, and all we can do is fail.  So how do we
> fail?

Well, even remount-ro may be too late, IIRC.

> > +	Unfortuantely, none of the cheap USB/SD flash cards I seen do
> 
> I've seen
> 
> > +	behave like this, and are unsuitable for all linux filesystems
> 
> "are thus unsuitable", perhaps?  (Too pretentious? :)

ACK, thanks.

> > +	I know.
> > +
> > +		An inherent problem with using flash as a normal block
> > +		device is that the flash erase size is bigger than
> > +		most filesystem sector sizes.  So when you request a
> > +		write, it may erase and rewrite the next 64k, 128k, or
> > +		even a couple megabytes on the really _big_ ones.
> 
> Somebody corrected me, it's not "the next" it's "the surrounding".

It's "some" ... due to wear leveling logic.

> (Writes aren't always cleanly at the start of an erase block, so critical data 
> _before_ what you touch is endangered too.)

Well, flashes do remap, so it is actually "random blocks".

> > +	otherwise, disks may write garbage during powerfail.
> > +	Not sure how common that problem is on generic PC machines.
> > +
> > +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > +	because it needs to write both changed data, and parity, to
> > +	different disks.
> 
> These days instead of "atomic" it's better to think in terms of
> "barriers".  

This is not about barriers (that should be different topic). Atomic
write means that either whole sector is written, or nothing at all is
written. Because raid5 needs to update both master data and parity at
the same time, I don't think it can guarantee this during powerfail.
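
A toy worked example (hypothetical, not kernel code) of that problem -- often called the RAID-5 "write hole": the data write completes, power fails before the parity write, and a later reconstruction of another member returns garbage:

#include <stdio.h>

int main(void)
{
	unsigned char d0 = 0xAA, d1 = 0x55;	/* data blocks on disk 0 and disk 1 */
	unsigned char p  = d0 ^ d1;		/* parity block on disk 2 */
	unsigned char rebuilt_d1;

	d0 = 0x0F;				/* write #1: new data reaches disk 0 */
	/* p = d0 ^ d1;				   write #2: parity update never happens (powerfail) */

	/* Later disk 1 dies and is reconstructed from the surviving members. */
	rebuilt_d1 = d0 ^ p;

	printf("original d1:      0x55\n");
	printf("reconstructed d1: 0x%02X (%s)\n", rebuilt_d1,
	       rebuilt_d1 == 0x55 ? "ok" : "garbage: neither old nor new data");
	return 0;
}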


> > +Requirements
> > +* write errors not allowed
> > +
> > +* sector writes are atomic
> > +
> > +(see expectations.txt; note that most/all linux block-based
> > +filesystems have similar expectations)
> > +
> > +* write caching is disabled. ext2 does not know how to issue barriers
> > +  as of 2.6.28. hdparm -W0 disables it on SATA disks.
> 
> And here we're talking about ext2.  Does neither one know about write 
> barriers, or does this just apply to ext2?  (What about ext4?)

This document is about ext2. Ext3 can support barriers in
2.6.28. Someone else needs to write ext4 docs :-).

> Also I remember a historical problem that not all disks honor write barriers, 
> because actual data integrity makes for horrible benchmark numbers.  Dunno how 
> current that is with SATA, Alan Cox would probably know.

Sounds like broken disk, then. We should blacklist those.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-12 19:13 ` Rob Landley
  2009-03-16 12:28   ` Pavel Machek
@ 2009-03-16 12:30   ` Pavel Machek
  2009-03-16 19:03     ` Theodore Tso
  2009-03-16 19:40     ` Sitsofe Wheeler
  2009-08-29  1:33   ` Robert Hancock
  2 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-03-16 12:30 UTC (permalink / raw)
  To: Rob Landley
  Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

Updated version here.

On Thu 2009-03-12 14:13:03, Rob Landley wrote:
> On Thursday 12 March 2009 04:21:14 Pavel Machek wrote:
> > Not all block devices are suitable for all filesystems. In fact, some
> > block devices are so broken that reliable operation is pretty much
> > impossible. Document stuff ext2/ext3 needs for reliable operation.
> >
> > Signed-off-by: Pavel Machek <pavel@ucw.cz>


diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..710d119
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,47 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly, because success
+on fsync was already returned when data hit the journal.
+
+	Fortunately writes failing are very uncommon on traditional 
+	spinning disks, as they have spare sectors they use when write
+	fails.
+
+Sector writes are atomic (ATOMIC-SECTORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+	Unfortunately, none of the cheap USB/SD flash cards I've seen
+	do behave like this, and are thus unsuitable for all Linux
+	filesystems I know.
+
+		An inherent problem with using flash as a normal block
+		device is that the flash erase size is bigger than
+		most filesystem sector sizes.  So when you request a
+		write, it may erase and rewrite some 64k, 128k, or
+		even a couple megabytes on the really _big_ ones.
+
+		If you lose power in the middle of that, filesystem
+		won't notice that data in the "sectors" _around_ the
+		one you were trying to write to got trashed.
+
+	Because RAM tends to fail faster than rest of system during 
+	powerfail, special hw killing DMA transfers may be necessary;
+	otherwise, disks may write garbage during powerfail.
+	Not sure how common that problem is on generic PC machines.
+
+	Note that atomic write is very hard to guarantee for RAID-4/5/6,
+	because it needs to write both changed data, and parity, to 
+	different disks. UPS for RAID array should help.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 4333e83..41fd2ec 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
 have to be 8 character filenames, even then we are fairly close to
 running out of unique filenames.
 
+Requirements
+============
+
+Ext2 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+  as of 2.6.28. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie.  It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout.  In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem.  This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash.  If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem.  If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
 Check Documentation/filesystems/ext3.txt if you want to read more about
 ext3 and journaling.
 
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..02a9bd5 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +200,27 @@ mke2fs: 	create a ext3 partition with the -j flag.
 debugfs: 	ext2 and ext3 file system debugger.
 ext2online:	online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+	   (Note that barriers are disabled by default, use "barrier=1"
+	   mount option after making sure hw can support them). 
+
+	   hdparm -I reports disk features. If you have "Native
+	   Command Queueing", that is the feature you are looking for.
 
 References
 ==========


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-16 12:30   ` Pavel Machek
@ 2009-03-16 19:03     ` Theodore Tso
  2009-03-23 18:23         ` Pavel Machek
  2009-03-16 19:40     ` Sitsofe Wheeler
  1 sibling, 1 reply; 309+ messages in thread
From: Theodore Tso @ 2009-03-16 19:03 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4

On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> Updated version here.
> 
> On Thu 2009-03-12 14:13:03, Rob Landley wrote:
> > On Thursday 12 March 2009 04:21:14 Pavel Machek wrote:
> > > Not all block devices are suitable for all filesystems. In fact, some
> > > block devices are so broken that reliable operation is pretty much
> > > impossible. Document stuff ext2/ext3 needs for reliable operation.

Some of what is here are bugs, and some are legitimate long-term
interfaces (for example, the question of losing I/O errors when two
processes are writing to the same file, or to a directory entry, and
errors aren't, or in some cases can't, be reflected back to userspace).

I'm a little concerned that some of this reads a bit too much like a
rant (and I know Pavel was very frustrated when he tried to use a
flash card with a sucky flash card socket) and it will get used the
wrong way by apologists, because it mixes areas where "we suck, we
should do better", which are bug reports, and "Posix or the
underlying block device layer makes it hard", and simply states them
as fundamental design requirements, when that's probably not true.

There's a lot of work that we could do to make I/O errors get better
reflected to userspace by fsync().  So stating things as bald
requirements goes a little too far, IMHO.  We can surely do
better.


> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..710d119
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly, because success
> +on fsync was already returned when data hit the journal.

The last half of this sentence "because success on fsync was already
returned when data hit the journal", obviously doesn't apply to all
filesystems, since some filesystems, like ext2, don't journal data.
Even for ext3, it only applies in the case of data=journal mode.  

There are other issues here, such as the fact that fsync() only reports an I/O
problem to one caller, and in some cases I/O errors aren't propagated
up the storage stack.  The latter is clearly just a bug that should be
fixed; the former is more of an interface limitation.  But you don't
talk about that in this section, and I think it would be good to have a
more extended discussion about I/O errors when writing data blocks,
and I/O errors writing metadata blocks, etc.
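
A hedged sketch of the first limitation (hypothetical; on healthy hardware it just prints two zeros): two descriptors on the same file, but a failed writeback may be reported to only one of the fsync() callers:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd1 = open("shared.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	int fd2 = open("shared.dat", O_WRONLY);

	if (fd1 < 0 || fd2 < 0) {
		perror("open");
		return 1;
	}
	if (pwrite(fd1, "writer 1\n", 9, 0) < 0)
		perror("pwrite fd1");
	if (pwrite(fd2, "writer 2\n", 9, 4096) < 0)
		perror("pwrite fd2");

	/* If writeback of these pages fails, only one of the two calls below
	 * may return -1/EIO; the other can return 0 even though its data
	 * never reached the disk. */
	printf("fsync(fd1) = %d\n", fsync(fd1));
	printf("fsync(fd2) = %d\n", fsync(fd2));

	close(fd1);
	close(fd2);
	return 0;
}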


> +
> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.

This requirement is not quite the same as what you discuss below.

> +
> +	Unfortunately, none of the cheap USB/SD flash cards I've seen
> +	do behave like this, and are thus unsuitable for all Linux
> +	filesystems I know.
> +
> +		An inherent problem with using flash as a normal block
> +		device is that the flash erase size is bigger than
> +		most filesystem sector sizes.  So when you request a
> +		write, it may erase and rewrite some 64k, 128k, or
> +		even a couple megabytes on the really _big_ ones.
> +
> +		If you lose power in the middle of that, filesystem
> +		won't notice that data in the "sectors" _around_ the
> +		one your were trying to write to got trashed.

The characteristic you describe here is not an issue about whether
the whole sector is either written or nothing happens to the data ---
but rather, or at least in addition to that, there is also the issue
that when there is a flash card failure --- particularly one caused
by a sucky flash card reader design causing the SD card to disconnect
from the laptop in the middle of a write --- there may be "collateral
damage"; that is, in addition to corrupting the sector being written,
adjacent sectors might also end up getting lost as an unfortunate side
effect.

So there are actually two desirable properties for a storage system to
have; one is "don't damage the old data on a failed write"; and the
other is "don't cause collateral damage to adjacent sectors on a
failed write".

> +	Because RAM tends to fail faster than rest of system during 
> +	powerfail, special hw killing DMA transfers may be necessary;
> +	otherwise, disks may write garbage during powerfail.
> +	Not sure how common that problem is on generic PC machines.

This problem is still relatively common, from what I can tell.  And
ext3 handles this surprisingly well at least in the catastrophic case
of garbage getting written into the inode table, since the journal
replay often will "repair" the garbage that was written into the
filesystem metadata blocks.  It won't do a bit of good for the data
blocks, of course (unless you are using data=journal mode).  But this
means that, in fact, ext3 is fairly resistant to the first problem
(powerfail while writing can damage old data on a failed write); but
fortunately, hard drives generally don't cause collateral damage on a
failed write.  Of course, there are some spectacular exceptions to
this rule --- a physical shock which causes the head to
slam into a surface moving at 7200rpm can throw a lot of debris into
the hard drive enclosure, causing loss to adjacent sectors.

    	       		  	       	  - Ted

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-16 12:28   ` Pavel Machek
@ 2009-03-16 19:26     ` Rob Landley
  2009-03-23 10:45       ` Pavel Machek
  0 siblings, 1 reply; 309+ messages in thread
From: Rob Landley @ 2009-03-16 19:26 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
> Hi!
> > > +	Fortunately writes failing are very uncommon on traditional
> > > +	spinning disks, as they have spare sectors they use when write
> > > +	fails.
> >
> > I vaguely recall that the behavior of when a write error _does_ occur is
> > to remount the filesystem read only?  (Is this VFS or per-fs?)
>
> Per-fs.

Might be nice to note that in the doc.

> > Is there any kind of hotplug event associated with this?
>
> I don't think so.

There probably should be, but that's a separate issue.

> > I'm aware write errors shouldn't happen, and by the time they do it's too
> > late to gracefully handle them, and all we can do is fail.  So how do we
> > fail?
>
> Well, even remount-ro may be too late, IIRC.

Care to elaborate?  (When a filesystem is mounted RO, I'm not sure what 
happens to the pages that have already been dirtied...)

> > (Writes aren't always cleanly at the start of an erase block, so critical
> > data _before_ what you touch is endangered too.)
>
> Well, flashes do remap, so it is actually "random blocks".

Fun.

When "please do not turn off your playstation until game save completes" 
honestly seems like the best solution for making the technology reliable, 
something is wrong with the technology.

I think I'll stick with rotating disks for now, thanks.

> > > +	otherwise, disks may write garbage during powerfail.
> > > +	Not sure how common that problem is on generic PC machines.
> > > +
> > > +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > > +	because it needs to write both changed data, and parity, to
> > > +	different disks.
> >
> > These days instead of "atomic" it's better to think in terms of
> > "barriers".
>
> This is not about barriers (that should be different topic). Atomic
> write means that either whole sector is written, or nothing at all is
> written. Because raid5 needs to update both master data and parity at
> the same time, I don't think it can guarantee this during powerfail.

Good point, but I thought that's what journaling was for?

I'm aware that any flash filesystem _must_ be journaled in order to work 
sanely, and must be able to view the underlying erase granularity down to the 
bare metal, through any remapping the hardware's doing.  Possibly what's 
really needed is a "flash is weird" section, since flash filesystems can't be 
mounted on arbitrary block devices.

Although an "-O erase_size=128" option so they _could_ would be nice.  There's 
"mtdram" which seems to be the only remaining use for ram disks, but why there 
isn't an "mtdwrap" that works with arbitrary underlying block devices, I have 
no idea.  (Layering it on top of a loopback device would be most useful.)

> > > +Requirements
> > > +* write errors not allowed
> > > +
> > > +* sector writes are atomic
> > > +
> > > +(see expectations.txt; note that most/all linux block-based
> > > +filesystems have similar expectations)
> > > +
> > > +* write caching is disabled. ext2 does not know how to issue barriers
> > > +  as of 2.6.28. hdparm -W0 disables it on SATA disks.
> >
> > And here we're talking about ext2.  Does neither one know about write
> > barriers, or does this just apply to ext2?  (What about ext4?)
>
> This document is about ext2. Ext3 can support barriers in
> 2.6.28. Someone else needs to write ext4 docs :-).
>
> > Also I remember a historical problem that not all disks honor write
> > barriers, because actual data integrity makes for horrible benchmark
> > numbers.  Dunno how current that is with SATA, Alan Cox would probably
> > know.
>
> Sounds like broken disk, then. We should blacklist those.

It wasn't just one brand of disk cheating like that, and you'd have to ask him 
(or maybe Jens Axboe or somebody) whether the problem is still current.  I've 
been off in embedded-land for a few years now...

Rob

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-16 12:30   ` Pavel Machek
  2009-03-16 19:03     ` Theodore Tso
@ 2009-03-16 19:40     ` Sitsofe Wheeler
  2009-03-16 21:43       ` Rob Landley
  2009-03-23 11:00       ` Pavel Machek
  1 sibling, 2 replies; 309+ messages in thread
From: Sitsofe Wheeler @ 2009-03-16 19:40 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso,
	rdunlap, linux-doc, linux-ext4

On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> +	Unfortunately, none of the cheap USB/SD flash cards I've seen
> +	do behave like this, and are thus unsuitable for all Linux
> +	filesystems I know.

When you say Linux filesystems do you mean "filesystems originally
designed on Linux" or do you mean "filesystems that Linux supports"?
Additionally whatever the answer, people are going to need help
answering the "which is the least bad?" question and saying what's not
good without offering alternatives is only half helpful... People need
to put SOMETHING on these cheap (and not quite so cheap) devices... The
last recommendation I heard was that until btrfs/logfs/nilfs arrive
people are best off sticking with FAT -
http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that
should be mentioned?

> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> +	   (Note that barriers are disabled by default, use "barrier=1"
> +	   mount option after making sure hw can support them). 
> +
> +	   hdparm -I reports disk features. If you have "Native
> +	   Command Queueing" is the feature you are looking for.

The document makes it sound like nearly everything bar battery backed
hardware RAIDed SCSI disks (with perfect firmware) is bad  - is this
the intent?

-- 
Sitsofe | http://sucs.org/~sits/

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-12  9:21 ext2/3: document conditions when reliable operation is possible Pavel Machek
@ 2009-03-16 19:45   ` Greg Freemyer
  2009-03-12 19:13 ` Rob Landley
  2009-03-16 19:45   ` Greg Freemyer
  2 siblings, 0 replies; 309+ messages in thread
From: Greg Freemyer @ 2009-03-16 19:45 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

On Thu, Mar 12, 2009 at 5:21 AM, Pavel Machek <pavel@ucw.cz> wrote:
<snip>
> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +       Unfortuantely, none of the cheap USB/SD flash cards I seen do
> +       behave like this, and are unsuitable for all linux filesystems
> +       I know.
> +
> +               An inherent problem with using flash as a normal block
> +               device is that the flash erase size is bigger than
> +               most filesystem sector sizes.  So when you request a
> +               write, it may erase and rewrite the next 64k, 128k, or
> +               even a couple megabytes on the really _big_ ones.
> +
> +               If you lose power in the middle of that, filesystem
> +               won't notice that data in the "sectors" _around_ the
> +               one your were trying to write to got trashed.

I had *assumed* that SSDs worked like:

1) write request comes in
2) new unused erase block area marked to hold the new data
3) updated data written to the previously unused erase block
4) mapping updated to replace the old erase block with the new one

If it were done that way, a failure in the middle would just leave the
SSD with the old data in it.

If it is not done that way, then I can see your issue.  (I love the
potential performance of SSDs, but I'm beginning to hate the
implementations and spec writing.)

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-16 19:40     ` Sitsofe Wheeler
@ 2009-03-16 21:43       ` Rob Landley
  2009-03-17  4:55         ` Kyle Moffett
  2009-03-23 11:00       ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: Rob Landley @ 2009-03-16 21:43 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, tytso,
	rdunlap, linux-doc, linux-ext4

On Monday 16 March 2009 14:40:57 Sitsofe Wheeler wrote:
> On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> > +	Unfortunately, none of the cheap USB/SD flash cards I've seen
> > +	do behave like this, and are thus unsuitable for all Linux
> > +	filesystems I know.
>
> When you say Linux filesystems do you mean "filesystems originally
> designed on Linux" or do you mean "filesystems that Linux supports"?
> Additionally whatever the answer, people are going to need help
> answering the "which is the least bad?" question and saying what's not
> good without offering alternatives is only half helpful... People need
> to put SOMETHING on these cheap (and not quite so cheap) devices... The
> last recommendation I heard was that until btrfs/logfs/nilfs arrive
> people are best off sticking with FAT -
> http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that
> should be mentioned?

Actually, the best filesystem for USB flash devices is probably UDF.  (Yes, 
the DVD filesystem turns out to be writeable if you put it on writeable 
media.  The ISO spec requires write support, so any OS that supports DVDs also 
supports this.)

The reasons for this are:

A) It's the only filesystem other than FAT that's supported out of the box by 
windows, mac, _and_ Linux for hotpluggable media.

B) It doesn't have the horrible limitations of FAT (such as a max filesize of 
2 gigabytes).

C) Microsoft doesn't claim to own it, and thus hasn't sued anybody over 
patents on it.

However, when it comes to cutting the power on a mounted filesystem (either by 
yanking the device or powering off the machine) without losing your data, 
without warning, they all suck horribly.

If you yank a USB flash disk in the middle of a write, and the device has 
decided to wipe a 2 megabyte erase sector that's behind a layer of wear 
levelling and thus consists of a series of random sectors scattered all over 
the disk, you're screwed no matter what filesystem you use.  You know the 
vinyl "record scratch" sound?  Imagine that, on a digital level.  Bad Things 
Happen to the hardware, cannot compensate in software.

> > +* either write caching is disabled, or hw can do barriers and they are
> > enabled. +
> > +	   (Note that barriers are disabled by default, use "barrier=1"
> > +	   mount option after making sure hw can support them).
> > +
> > +	   hdparm -I reports disk features. If you have "Native
> > +	   Command Queueing" is the feature you are looking for.
>
> The document makes it sound like nearly everything bar battery backed
> hardware RAIDed SCSI disks (with perfect firmware) is bad  - is this
> the intent?

SCSI disks?  They still make those?

Everything fails, it's just a question of how.  Rotational media combined with 
journaling at least fails in fairly understandable ways, so ext3 on sata is 
reasonable.

Flash gets into trouble when it presents the _interface_ of rotational media 
(a USB block device with normal 512 byte read/write sectors, which never wear 
out) which doesn't match what the hardware's actually doing (erase block sizes 
of up to several megabytes at a time, hidden behind a block remapping layer 
for wear leveling).
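
A toy model (hypothetical) of the naive read-modify-erase-write cycle hiding behind that interface, showing how a power cut mid-erase trashes "sectors" nobody asked to write:

#include <stdio.h>
#include <string.h>

#define SECTOR		512
#define ERASE_BLOCK	(128 * 1024)	/* one 128k erase block, i.e. 256 "sectors" */

static unsigned char block[ERASE_BLOCK];

static void naive_sector_write(int sector, const unsigned char *data,
			       int powerfail_mid_erase)
{
	static unsigned char copy[ERASE_BLOCK];

	memcpy(copy, block, sizeof(copy));		/* read the whole erase block */
	memcpy(copy + sector * SECTOR, data, SECTOR);	/* modify one 512-byte sector */

	memset(block, 0xFF, sizeof(block));		/* erase the block in place */
	if (powerfail_mid_erase)
		return;		/* every "sector" in this erase block is now blank */

	memcpy(block, copy, sizeof(block));		/* rewrite the whole block */
}

int main(void)
{
	unsigned char newdata[SECTOR];

	memset(block, 0x42, sizeof(block));	/* pretend the block holds filesystem data */
	memset(newdata, 0x17, sizeof(newdata));

	naive_sector_write(5, newdata, 1);	/* power fails during the erase */

	/* Sector 200 was never written by anyone, yet its contents are gone. */
	printf("untouched sector 200 now reads 0x%02X (was 0x42)\n",
	       block[200 * SECTOR]);
	return 0;
}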

For devices that have built in flash that DON'T pretend to be a conventional 
block device, but instead expose their flash erase granularity and let the OS 
do the wear levelling itself, we have special flash filesystems that can be 
reasonably reliable.  It's just that ext3 isn't one of them, jffs2 and ubifs 
and logfs are.  The problem with these flash filesystems is they ONLY work on 
flash, if you want to mount them on something other than flash you need 
something like a loopback interface to make a normal block device pretend to 
be flash.  (We've got a ramdisk driver called "mtdram" that does this, but 
nobody's bothered to write a generic wrapper for a normal block device you can 
wrap over the loopback driver.)

Unfortunately, when it comes to USB flash (the most common type), the USB 
standard defines a way for a USB device to provide a normal block disk 
interface as if it was rotational media.  It does NOT provide a way to expose 
the flash erase granularity, or a way for the operating system to disable any 
built-in wear levelling (which is needed because windows doesn't _do_ wear 
levelling, and thus burns out the administrative sectors of the disk really 
fast while the rest of the disk is still fine unless the hardware wear-levels 
for it).

So every USB flash disk pretends to be a normal disk, which it isn't, and 
Linux can't _disable_ this emulation.  Which brings us back to UDF as the 
least sucky alternative.  (Although the UDF tools kind of suck.  If you 
reformat a FAT disk as UDF with mkudffs, it'll still be autodetected as FAT 
because it won't overwrite the FAT root directory.  You have to blank the 
first 64k by hand with dd.  Sad, isn't it?)

Rob

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-16 19:45   ` Greg Freemyer
  (?)
@ 2009-03-16 21:48   ` Pavel Machek
  -1 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-03-16 21:48 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

On Mon 2009-03-16 15:45:36, Greg Freemyer wrote:
> On Thu, Mar 12, 2009 at 5:21 AM, Pavel Machek <pavel@ucw.cz> wrote:
> <snip>
> > +Sector writes are atomic (ATOMIC-SECTORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> > +
> > +       Unfortuantely, none of the cheap USB/SD flash cards I seen do
> > +       behave like this, and are unsuitable for all linux filesystems
> > +       I know.
> > +
> > +               An inherent problem with using flash as a normal block
> > +               device is that the flash erase size is bigger than
> > +               most filesystem sector sizes.  So when you request a
> > +               write, it may erase and rewrite the next 64k, 128k, or
> > +               even a couple megabytes on the really _big_ ones.
> > +
> > +               If you lose power in the middle of that, filesystem
> > +               won't notice that data in the "sectors" _around_ the
> > +               one your were trying to write to got trashed.
> 
> I had *assumed* that SSDs worked like:
> 
> 1) write request comes in
> 2) new unused erase block area marked to hold the new data
> 3) updated data written to the previously unused erase block
> 4) mapping updated to replace the old erase block with the new one
> 
> If it were done that way, a failure in the middle would just leave the
> SSD with the old data in it.

The really expensive ones (Intel SSD) apparently work like that, but I've
never seen one of those. USB sticks and SD cards I tried behave like I
described above.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-16 21:43       ` Rob Landley
@ 2009-03-17  4:55         ` Kyle Moffett
  0 siblings, 0 replies; 309+ messages in thread
From: Kyle Moffett @ 2009-03-17  4:55 UTC (permalink / raw)
  To: Rob Landley
  Cc: Sitsofe Wheeler, Pavel Machek, kernel list, Andrew Morton,
	mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4

On Mon, Mar 16, 2009 at 5:43 PM, Rob Landley <rob@landley.net> wrote:
> Flash gets into trouble when it presents the _interface_ of rotational media
> (a USB block device with normal 512 byte read/write sectors, which never wear
> out) which doesn't match what the hardware's actually doing (erase block sizes
> of up to several megabytes at a time, hidden behind a block remapping layer
> for wear leveling).
>
> For devices that have built in flash that DON'T pretend to be a conventional
> block device, but instead expose their flash erase granularity and let the OS
> do the wear levelling itself, we have special flash filesystems that can be
> reasonably reliable.  It's just that ext3 isn't one of them, jffs2 and ubifs
> and logfs are.  The problem with these flash filesystems is they ONLY work on
> flash, if you want to mount them on something other than flash you need
> something like a loopback interface to make a normal block device pretend to
> be flash.  (We've got a ramdisk driver called "mtdram" that does this, but
> nobody's bothered to write a generic wrapper for a normal block device you can
> wrap over the loopback driver.)

The really nice SSDs actually reserve ~15-30% of their internal
block-level storage and actually run their own log-structured virtual
disk in hardware.  From what I understand the Intel SSDs are that way.
 Real-time garbage collection is tricky, but if you require (for
example) a max of ~80% utilization then you can provide good latency
and bandwidth guarantees.  There's usually something like a
log-structured virtual-to-physical sector map as well.  If designed
properly with automatic hardware checksumming, such a system can
actually provide atomic writes and barriers with virtually no impact
on performance.

With firmware-level hardware knowledge and the ability to perform
extremely efficient parallel reads of flash blocks, such a
log-structured virtual block device can be many times more efficient
than a general purpose OS running a log-structured filesystem.  The
result is that for an ordinary ext3-esque filesystem with 4k blocks
you can treat the SSD as though it is an atomic-write seek-less block
device.

Now if only I had the spare cash to go out and buy one of the shiny
Intel ones for my laptop... :-)

Cheers,
Kyle Moffett

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-12 11:40 ` Jochen Voß
@ 2009-03-21 11:24     ` Pavel Machek
  0 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-03-21 11:24 UTC (permalink / raw)
  To: Jochen Voß
  Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

On Thu 2009-03-12 11:40:52, Jochen Voß wrote:
> Hi,
> 
> 2009/3/12 Pavel Machek <pavel@ucw.cz>:
> > diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> > index 4333e83..b09aa4c 100644
> > --- a/Documentation/filesystems/ext2.txt
> > +++ b/Documentation/filesystems/ext2.txt
> > @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
> >  have to be 8 character filenames, even then we are fairly close to
> >  running out of unique filenames.
> >
> > +Requirements
> > +============
> > +
> > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
>    ^^^^
> Shouldn't this be "Ext2"?

Thanks, fixed.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-16 19:26     ` Rob Landley
@ 2009-03-23 10:45       ` Pavel Machek
  2009-03-30 15:06         ` Goswin von Brederlow
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-03-23 10:45 UTC (permalink / raw)
  To: Rob Landley
  Cc: kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

On Mon 2009-03-16 14:26:23, Rob Landley wrote:
> On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
> > Hi!
> > > > +	Fortunately writes failing are very uncommon on traditional
> > > > +	spinning disks, as they have spare sectors they use when write
> > > > +	fails.
> > >
> > > I vaguely recall that the behavior of when a write error _does_ occur is
> > > to remount the filesystem read only?  (Is this VFS or per-fs?)
> >
> > Per-fs.
> 
> Might be nice to note that in the doc.

Ok, can you suggest a patch? I believe remount-ro is already
documented ... somewhere :-).

> > > I'm aware write errors shouldn't happen, and by the time they do it's too
> > > late to gracefully handle them, and all we can do is fail.  So how do we
> > > fail?
> >
> > Well, even remount-ro may be too late, IIRC.
> 
> Care to elaborate?  (When a filesystem is mounted RO, I'm not sure what 
> happens to the pages that have already been dirtied...)

Well, fsync() error reporting does not really work properly, but I
guess it will save you for the remount-ro case. So the data will be in
the journal, but it will be impossible to replay it...

> > > (Writes aren't always cleanly at the start of an erase block, so critical
> > > data _before_ what you touch is endangered too.)
> >
> > Well, flashes do remap, so it is actually "random blocks".
> 
> Fun.

Yes.

> > > > +	otherwise, disks may write garbage during powerfail.
> > > > +	Not sure how common that problem is on generic PC machines.
> > > > +
> > > > +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > > > +	because it needs to write both changed data, and parity, to
> > > > +	different disks.
> > >
> > > These days instead of "atomic" it's better to think in terms of
> > > "barriers".
> >
> > This is not about barriers (that should be different topic). Atomic
> > write means that either whole sector is written, or nothing at all is
> > written. Because raid5 needs to update both master data and parity at
> > the same time, I don't think it can guarantee this during powerfail.
> 
> Good point, but I thought that's what journaling was for?

I believe journaling operates on assumption that "either whole sector
is written, or nothing at all is written".
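
A simplified sketch (hypothetical; not the real jbd on-disk format) of how replay leans on that assumption -- the "was this transaction committed?" decision is read from a single sector that must be all-or-nothing:

#include <stdio.h>
#include <stdint.h>

#define COMMIT_MAGIC 0xC0FFEE01u

struct commit_sector {
	uint32_t magic;		/* both fields must read back exactly as written ... */
	uint32_t sequence;	/* ... for the transaction to count as committed */
};

static int should_replay(const struct commit_sector *c, uint32_t expected_seq)
{
	return c->magic == COMMIT_MAGIC && c->sequence == expected_seq;
}

int main(void)
{
	struct commit_sector committed   = { COMMIT_MAGIC, 7 };	/* fully written */
	struct commit_sector not_written = { 0, 0 };			/* powerfail before commit */
	struct commit_sector torn        = { COMMIT_MAGIC, 3 };	/* half new, half stale */

	/* With atomic sector writes only the first two cases exist, and both are
	 * safe: replay the transaction, or skip it and keep the old contents.
	 * A non-atomic device adds the third case -- and, worse, can leave any
	 * sector it was writing at powerfail half old and half new, which the
	 * journal may have no way to repair. */
	printf("committed:   replay=%d\n", should_replay(&committed, 7));
	printf("not written: replay=%d\n", should_replay(&not_written, 7));
	printf("torn:        replay=%d\n", should_replay(&torn, 7));
	return 0;
}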

> I'm aware that any flash filesystem _must_ be journaled in order to work 
> sanely, and must be able to view the underlying erase granularity down to the 
> bare metal, through any remapping the hardware's doing.  Possibly what's 
> really needed is a "flash is weird" section, since flash filesystems can't be 
> mounted on arbitrary block devices.

> Although an "-O erase_size=128" option so they _could_ would be nice.  There's 
> "mtdram" which seems to be the only remaining use for ram disks, but why there 
> isn't an "mtdwrap" that works with arbitrary underlying block devices, I have 
> no idea.  (Layering it on top of a loopback device would be most
> useful.)

I don't think that works. Compactflash (etc) cards basically randomly
remap the data, so you can't really run flash filesystem over
compactflash/usb/SD card -- you don't know the details of remapping.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-16 19:40     ` Sitsofe Wheeler
  2009-03-16 21:43       ` Rob Landley
@ 2009-03-23 11:00       ` Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-03-23 11:00 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso,
	rdunlap, linux-doc, linux-ext4

On Mon 2009-03-16 19:40:57, Sitsofe Wheeler wrote:
> On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> > +	Unfortunately, none of the cheap USB/SD flash cards I've seen
> > +	do behave like this, and are thus unsuitable for all Linux
> > +	filesystems I know.
> 
> When you say Linux filesystems do you mean "filesystems originally
> designed on Linux" or do you mean "filesystems that Linux supports"?

"Linux filesystems I know" :-). No filesystem that Linux supports,
AFAICT.

> Additionally whatever the answer, people are going to need help
> answering the "which is the least bad?" question and saying what's not
> good without offering alternatives is only half helpful... People need
> to put SOMETHING on these cheap (and not quite so cheap)
> devices... The

In my opinion, people should just AVOID those devices. I don't plan
to point out the "least bad"; it's still bad.

> > +	   hdparm -I reports disk features; "Native Command Queueing"
> > +	   is the feature you are looking for.
> 
> The document makes it sound like nearly everything bar battery backed
> hardware RAIDed SCSI disks (with perfect firmware) is bad  - is this
> the intent?

Battery-backed RAID should be ok, as should a plain single SATA drive.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-16 19:03     ` Theodore Tso
@ 2009-03-23 18:23         ` Pavel Machek
  0 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-03-23 18:23 UTC (permalink / raw)
  To: Theodore Tso, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4

Hi!

> > > > Not all block devices are suitable for all filesystems. In fact, some
> > > > block devices are so broken that reliable operation is pretty much
> > > > impossible. Document stuff ext2/ext3 needs for reliable operation.
> 
> Some of what is here are bugs, and some are legitimate long-term
> interfaces (for example, the question of losing I/O errors when two
> processes are writing to the same file, or to a directory entry, and
> errors aren't or in some cases, can't, be reflected back to
> userspace).

Well, I guess there's a thin line between an error and a "legitimate
long-term interface". I still believe that fsync() is broken by
design.

> I'm a little concerned that some of this reads a bit too much like a
> rant (and I know Pavel was very frustrated when he tried to use a
> flash card with a sucky flash card socket) and it will get used the

It started as a rant; obviously I'd like to get away from that and get
it into a suitable format for inclusion. (Not being a native speaker
does not help here.)

But I do believe that we should get this documented; many common
storage subsystems are broken and can cause data loss. We should at
least tell the users.

> wrong way by apologists, because it mixes areas where "we suck, we
> should do better", which are bug reports, and "Posix or the
> underlying block device layer makes it hard", and simply states them
> as fundamental design requirements, when that's probably not true.

Well, I guess that can be refined later. Heck, I'm not able to tell
which are simple bugs likely to be fixed soon, and which are
fundamental issues that are unlikely to be fixed sooner than 2030. I
guess it is fair to document them ASAP, and then fix those that can be
fixed...

> There's a lot of work that we could do to make I/O errors get better
> reflected to userspace by fsync().  So stating things as bald
> requirements I think goes a little too far IMHO.  We can surely do
> better.

If fsync() can be fixed... that would be great. But I'm not sure
how easy that will be.

> > +Write errors not allowed (NO-WRITE-ERRORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Writes to media never fail. Even if disk returns error condition
> > +during write, filesystems can't handle that correctly, because success
> > +on fsync was already returned when data hit the journal.
> 
> The last half of this sentence "because success on fsync was already
> returned when data hit the journal", obviously doesn't apply to all
> filesystems, since some filesystems, like ext2, don't journal data.
> Even for ext3, it only applies in the case of data=journal mode.  

Ok, I removed the explanation.

> There are other issues here, such as fsync() only reports an I/O
> problem to one caller, and in some cases I/O errors aren't propagated
> up the storage stack.  The latter is clearly just a bug that should be
> fixed; the former is more of an interface limitation.  But you don't
> talk about it in this section, and I think it would be good to have a
> more extended discussion about I/O errors when writing data blocks,
> and I/O errors writing metadata blocks, etc.

Could you write a paragraph or two?

> > +
> > +Sector writes are atomic (ATOMIC-SECTORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> 
> This requirement is not quite the same as what you discuss below.

Ok, you are right. Fixed.

> So there are actually two desirable properties for a storage system to
> have; one is "don't damage the old data on a failed write"; and the
> other is "don't cause collateral damage to adjacent sectors on a
> failed write".

Thanks, it's indeed clearer that way. I split those in two.

> > +	Because RAM tends to fail faster than rest of system during 
> > +	powerfail, special hw killing DMA transfers may be necessary;
> > +	otherwise, disks may write garbage during powerfail.
> > +	Not sure how common that problem is on generic PC machines.
> 
> This problem is still relatively common, from what I can tell.  And
> ext3 handles this surprisingly well at least in the catastrophic case
> of garbage getting written into the inode table, since the journal
> replay often will "repair" the garbage that was written into the
...

Ok, added to the ext3-specific section. New version is attached. Feel free
to help here; my goal is to get this documented, I'm not particularly
attached to wording etc...

Signed-off-by: Pavel Machek <pavel@ucw.cz>
									Pavel

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..0de456d
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,49 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly.
+
+	Fortunately writes failing are very uncommon on traditional 
+	spinning disks, as they have spare sectors they use when write
+	fails.
+
+Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
+and are thus unsuitable for all filesystems I know.
+
+	An inherent problem with using flash as a normal block device
+	is that the flash erase size is bigger than most filesystem
+	sector sizes.  So when you request a write, it may erase and
+	rewrite some 64k, 128k, or even a couple megabytes on the
+	really _big_ ones.
+
+	If you lose power in the middle of that, the filesystem won't
+	notice that data in the "sectors" _around_ the one you were
+	trying to write to got trashed.
+
+
+Don't damage the old data on a failed write (ATOMIC-WRITES)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+	Because RAM tends to fail faster than rest of system during 
+	powerfail, special hw killing DMA transfers may be necessary;
+	otherwise, disks may write garbage during powerfail.
+	This may be quite common on generic PC machines.
+
+	Note that atomic write is very hard to guarantee for RAID-4/5/6,
+	because it needs to write both changed data, and parity, to 
+	different disks. UPS for RAID array should help.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 2344855..ee88467 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they
 have to be 8 character filenames, even then we are fairly close to
 running out of unique filenames.
 
+Requirements
+============
+
+Ext2 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+  (NO-COLLATERALS)
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+  as of 2.6.28. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie.  It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout.  In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem.  This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash.  If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem.  If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
 Check Documentation/filesystems/ext3.txt if you want to read more about
 ext3 and journaling.
 
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index e5f3833..6de8af4 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +200,45 @@ mke2fs: 	create a ext3 partition with the -j flag.
 debugfs: 	ext2 and ext3 file system debugger.
 ext2online:	online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+	(Trash may get written into sectors during powerfail.  And
+	ext3 handles this surprisingly well at least in the
+	catastrophic case of garbage getting written into the inode
+	table, since the journal replay often will "repair" the
+	garbage that was written into the filesystem metadata blocks.
+	It won't do a bit of good for the data blocks, of course
+	(unless you are using data=journal mode).  But this means that
+	ext3 is in fact fairly good at surviving the first problem
+	(powerfail while writing can damage old data on a failed
+	write); and fortunately, hard drives generally don't cause
+	collateral damage on a failed write.
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+  (NO-COLLATERALS)
+
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+	   (Note that barriers are disabled by default, use "barrier=1"
+	   mount option after making sure hw can support them). 
+
+	   hdparm -I reports disk features; "Native Command Queueing"
+	   is the feature you are looking for.
 
 References
 ==========

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related	[flat|nested] 309+ messages in thread

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-23 10:45       ` Pavel Machek
@ 2009-03-30 15:06         ` Goswin von Brederlow
  2009-08-24  9:26           ` Pavel Machek
  2009-08-24  9:31           ` [patch] " Pavel Machek
  0 siblings, 2 replies; 309+ messages in thread
From: Goswin von Brederlow @ 2009-03-30 15:06 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso,
	rdunlap, linux-doc, linux-ext4

Pavel Machek <pavel@ucw.cz> writes:

> On Mon 2009-03-16 14:26:23, Rob Landley wrote:
>> On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
>> > > > +	otherwise, disks may write garbage during powerfail.
>> > > > +	Not sure how common that problem is on generic PC machines.
>> > > > +
>> > > > +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
>> > > > +	because it needs to write both changed data, and parity, to
>> > > > +	different disks.
>> > >
>> > > These days instead of "atomic" it's better to think in terms of
>> > > "barriers".

Would be nice to have barriers in md and dm.

>> > This is not about barriers (that should be different topic). Atomic
>> > write means that either whole sector is written, or nothing at all is
>> > written. Because raid5 needs to update both master data and parity at
>> > the same time, I don't think it can guarantee this during powerfail.

Actually raid5 should have no problem with a power failure during
normal operation of the raid. The parity block should get marked out
of sync, then the new data block should be written, then the new
parity block should be written, and then the parity block should be
flagged in sync.
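
A toy sketch of that ordering (illustrative Python with made-up names,
not md's actual code), just to make the sequence concrete:

    class Stripe:
        # One stripe with two data blocks and one parity block.
        def __init__(self, d0, d1):
            self.data = [bytearray(d0), bytearray(d1)]
            self.parity = bytearray(a ^ b for a, b in zip(d0, d1))
            self.parity_in_sync = True

        def write(self, idx, new_data):
            self.parity_in_sync = False     # 1. mark parity out of sync
            self.data[idx][:] = new_data    # 2. write the new data block
            # 3. write the new parity block
            self.parity[:] = bytearray(a ^ b for a, b in zip(*self.data))
            self.parity_in_sync = True      # 4. flag parity in sync again

    s = Stripe(b"\x00" * 4, b"\xff" * 4)
    s.write(0, b"\x0f" * 4)
    assert s.parity == bytearray(a ^ b for a, b in zip(*s.data))

A crash between any two of those steps leaves enough information to
resync the stripe, as long as no disk is missing.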

>> Good point, but I thought that's what journaling was for?
>
> I believe journaling operates on assumption that "either whole sector
> is written, or nothing at all is written".

The real problem comes in degraded mode. In that case the data block
(if present) and the parity block must be written at the same time,
atomically. If the system crashes after writing one but before writing
the other, then the data block on the missing drive changes its
contents. For example, with a chunk size of 1MB and 16 disks that
could be 15MB away from the block you actually do change. And you
cannot recover that after a crash, as you need both the original and
changed contents of the block.
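
A small worked example (toy Python; assume a three-disk RAID-5 where
the disk holding d1 has failed) of how the reconstructed block silently
changes:

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    d0, d1 = b"AAAA", b"BBBB"      # data blocks; the disk holding d1 is gone
    parity = xor(d0, d1)           # parity block on a surviving disk

    def read_d1(d0_on_disk, parity_on_disk):
        # In degraded mode d1 can only be reconstructed from the others.
        return xor(d0_on_disk, parity_on_disk)

    assert read_d1(d0, parity) == d1       # before the crash: correct

    d0_new = b"CCCC"               # d0 rewritten, powerfail before the parity
    assert read_d1(d0_new, parity) != d1   # unrelated block d1 is now corrupt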

So writing one sector risks corrupting another sector that is, from
the FS's point of view, totally unconnected. No amount of journaling
will help there. The raid5 would need to do journaling itself, or use
a battery-backed cache.

MfG
        Goswin

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-30 15:06         ` Goswin von Brederlow
@ 2009-08-24  9:26           ` Pavel Machek
  2009-08-24  9:31           ` [patch] " Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-24  9:26 UTC (permalink / raw)
  To: Goswin von Brederlow
  Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso,
	rdunlap, linux-doc, linux-ext4

Hi!

> >> > This is not about barriers (that should be different topic). Atomic
> >> > write means that either whole sector is written, or nothing at all is
> >> > written. Because raid5 needs to update both master data and parity at
> >> > the same time, I don't think it can guarantee this during powerfail.
> 
> Actually raid5 should have no problem with a power failure during
> normal operations of the raid. The parity block should get marked out
> of sync, then the new data block should be written, then the new
> parity block and then the parity block should be flagged in sync.
> 
> >> Good point, but I thought that's what journaling was for?
> >
> > I believe journaling operates on assumption that "either whole sector
> > is written, or nothing at all is written".
> 
> The real problem comes in degraded mode. In that case the data block
> (if present) and parity block must be written at the same time
> atomically. If the system crashes after writing one but before writing
> the other then the data block on the missing drive changes its
> contents. And for example with a chunk size of 1MB and 16 disks that
> could be 15MB away from the block you actually do change. And you can
> not recover that after a crash as you need both the original and
> changed contents of the block.
> 
> So writing one sector has the risk of corrupting another (for the FS)
> totally unconnected sector. No amount of journaling will help
> there. The raid5 would need to do journaling or use battery backed
> cache.

Thanks, I updated my notes.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* [patch] ext2/3: document conditions when reliable operation is possible
  2009-03-30 15:06         ` Goswin von Brederlow
  2009-08-24  9:26           ` Pavel Machek
@ 2009-08-24  9:31           ` Pavel Machek
  2009-08-24 11:19             ` Florian Weimer
                               ` (2 more replies)
  1 sibling, 3 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-24  9:31 UTC (permalink / raw)
  To: Goswin von Brederlow
  Cc: Rob Landley, kernel list, Andrew Morton, mtk.manpages, tytso,
	rdunlap, linux-doc, linux-ext4


Running a journaling filesystem such as ext3 over a flashdisk or a
degraded RAID array is a bad idea: the journaling guarantees no longer
apply and you will get data corruption on powerfail.

We can't solve it easily, but we should certainly warn the users. I
actually lost data because I did not understand these limitations...

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..80fa886
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,52 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly.
+
+	Fortunately writes failing are very uncommon on traditional 
+	spinning disks, as they have spare sectors they use when write
+	fails.
+
+Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
+and are thus unsuitable for all filesystems I know.
+
+	An inherent problem with using flash as a normal block device
+	is that the flash erase size is bigger than most filesystem
+	sector sizes.  So when you request a write, it may erase and
+	rewrite some 64k, 128k, or even a couple megabytes on the
+	really _big_ ones.
+
+	If you lose power in the middle of that, the filesystem won't
+	notice that data in the "sectors" _around_ the one you were
+	trying to write to got trashed.
+
+	RAID-4/5/6 in degraded mode has same problem.
+
+
+Don't damage the old data on a failed write (ATOMIC-WRITES)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+	Because RAM tends to fail faster than rest of system during 
+	powerfail, special hw killing DMA transfers may be necessary;
+	otherwise, disks may write garbage during powerfail.
+	This may be quite common on generic PC machines.
+
+	Note that atomic write is very hard to guarantee for RAID-4/5/6,
+	because it needs to write both changed data, and parity, to 
+	different disks. (But it will only really show up in degraded mode).
+	UPS for RAID array should help.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 67639f9..0a9b87f 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they
 have to be 8 character filenames, even then we are fairly close to
 running out of unique filenames.
 
+Requirements
+============
+
+Ext2 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+  (NO-COLLATERALS)
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+  as of 2.6.28. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie.  It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout.  In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem.  This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash.  If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem.  If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
 Check Documentation/filesystems/ext3.txt if you want to read more about
 ext3 and journaling.
 
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 570f9bd..2ce82a3 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -199,6 +202,47 @@ debugfs: 	ext2 and ext3 file system debugger.
 ext2online:	online (mounted) ext2 and ext3 filesystem resizer
 
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+	(Trash may get written into sectors during powerfail.  And
+	ext3 handles this surprisingly well at least in the
+	catastrophic case of garbage getting written into the inode
+	table, since the journal replay often will "repair" the
+	garbage that was written into the filesystem metadata blocks.
+	It won't do a bit of good for the data blocks, of course
+	(unless you are using data=journal mode).  But this means that
+	ext3 is in fact fairly good at surviving the first problem
+	(powerfail while writing can damage old data on a failed
+	write); and fortunately, hard drives generally don't cause
+	collateral damage on a failed write.
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+  (NO-COLLATERALS)
+
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+	   (Note that barriers are disabled by default, use "barrier=1"
+	   mount option after making sure hw can support them). 
+
+	   hdparm -I reports disk features; "Native Command Queueing"
+	   is the feature you are looking for.
+
+
 References
 ==========
 

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24  9:31           ` [patch] " Pavel Machek
@ 2009-08-24 11:19             ` Florian Weimer
  2009-08-24 13:01               ` Theodore Tso
                                 ` (2 more replies)
  2009-08-24 13:21               ` Greg Freemyer
  2009-08-24 21:11             ` Rob Landley
  2 siblings, 3 replies; 309+ messages in thread
From: Florian Weimer @ 2009-08-24 11:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4

* Pavel Machek:

> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so.

You should make clear that the file lists per-file-system rules and
that some file systems can recover from some of the error conditions.

> +* don't damage the old data on a failed write (ATOMIC-WRITES)
> +
> +	(Trash may get written into sectors during powerfail.  And
> +	ext3 handles this surprisingly well at least in the
> +	catastrophic case of garbage getting written into the inode
> +	table, since the journal replay often will "repair" the
> +	garbage that was written into the filesystem metadata blocks.

Isn't this by design?  In other words, if the metadata doesn't survive
non-atomic writes, wouldn't it be an ext3 bug?

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 11:19             ` Florian Weimer
@ 2009-08-24 13:01               ` Theodore Tso
  2009-08-24 14:55                 ` Artem Bityutskiy
                                   ` (2 more replies)
  2009-08-24 13:50               ` Theodore Tso
  2009-08-24 18:39               ` Pavel Machek
  2 siblings, 3 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-24 13:01 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Pavel Machek, Goswin von Brederlow, Rob Landley, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4

On Mon, Aug 24, 2009 at 11:19:01AM +0000, Florian Weimer wrote:
> * Pavel Machek:
> 
> > +Linux block-backed filesystems can only work correctly when several
> > +conditions are met in the block layer and below (disks, flash
> > +cards). Some of them are obvious ("data on media should not change
> > +randomly"), some are less so.
> 
> You should make clear that the file lists per-file-system rules and
> that some file sytems can recover from some of the error conditions.

The only one that falls into that category is the one about not being
able to handle failed writes, and the way most failures take place,
they generally fail the ATOMIC-WRITES criterion in any case.  That is,
when a write fails, an attempt to read from that sector will generally
result in either (a) an error, or (b) data other than what was there
before the write was attempted.

> > +* don't damage the old data on a failed write (ATOMIC-WRITES)
> > +
> > +	(Trash may get written into sectors during powerfail.  And
> > +	ext3 handles this surprisingly well at least in the
> > +	catastrophic case of garbage getting written into the inode
> > +	table, since the journal replay often will "repair" the
> > +	garbage that was written into the filesystem metadata blocks.
> 
> Isn't this by design?  In other words, if the metadata doesn't survive
> non-atomic writes, wouldn't it be an ext3 bug?

Part of the problem here is that "atomic-writes" is confusing; it
doesn't mean what many people think it means.  The assumption which
many naive filesystem designers make is that writes succeed or they
don't.  If they don't succeed, they don't change the previously
existing data in any way.  

So in the case of journalling, the assumption which gets made is that
when the power fails, the disk either writes a particular disk block,
or it doesn't.  The problem here is that, as with humans and animals,
death is not an event, it is a process.  When the power fails, the
system doesn't just stop functioning; the power on the +5 and +12 volt
rails starts dropping to zero, and different components fail at
different times.  Specifically, DRAM, being the most voltage
sensitive, tends to fail before the DMA subsystem, the PCI bus, and
the hard drive do.
So as a result, garbage can get written out to disk as part of the
failure.  That's just the way hardware works.

Now consider a file system which does logical journalling.  It has
written to the journal, using a compact encoding, "the i_blocks field
is now 25, and i_size is 13000", and the journal transaction has
committed.  So now, it's time to update the inode on disk; but at that
precise moment, the power fails, and garbage is written to the
inode table.  Oops!  The entire sector containing the inode is
trashed.  But the only thing which is recorded in the journal is the
new value of i_blocks and i_size.  So a journal replay won't help file
systems that do logical journalling.

Is that a file system "bug"?  Well, it's better to call that a
mismatch between the assumptions made of physical devices, and of the
file system code.  On Irix, SGI hardware had a powerfail interrupt,
and the power supply and extra-big capacitors, so that when a power
fail interrupt came in, the Irix would run around frantically shutting
down pending DMA transfers to prevent this failure mode from causing
problems.  PC class hardware (according to Ted's law), is cr*p, and
doesn't have a powerfail interrupt, so it's not something that we
have.

Ext3, ext4, and ocfs2 do physical block journalling, so as long as
journal truncate hasn't taken place right before the failure, the
replay of the physical block journal tends to repair most (but
not necessarily all) cases of "garbage is written right before power
failure".  People who care about this should really use a UPS, and
wire up the USB and/or serial cable from the UPS to the system, so
that the OS can do a controlled shutdown if the UPS is close to
shutting down due to an extended power failure.
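
A toy sketch of the difference (illustrative Python, nothing like the
real jbd code), using the i_blocks/i_size example above:

    # The committed transaction says: inode1 now has i_blocks=25, i_size=13000.
    good = {"inode1": {"i_blocks": 25, "i_size": 13000},
            "inode2": {"i_blocks": 3,  "i_size": 900}}

    physical_journal = {"sector_image": good}        # whole block image logged
    logical_journal  = {"inode1": good["inode1"]}    # only the changed fields

    # Powerfail trashes the whole inode-table sector:
    sector = {"inode1": "garbage", "inode2": "garbage"}

    # Physical replay rewrites the complete sector image; inode2 is repaired too.
    after_physical = dict(physical_journal["sector_image"])

    # Logical replay only knows inode1's new fields; inode2 stays garbage.
    after_logical = dict(sector)
    after_logical["inode1"] = logical_journal["inode1"]

    assert after_physical["inode2"] != "garbage"
    assert after_logical["inode2"] == "garbage"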


There is another kind of non-atomic write that nearly all file systems
are subject to, however, and to give an example of this, consider what
happens if a laptop is subjected to a sudden shock while it is
writing a sector, and the hard drive doesn't have an accelerometer which
tries to anticipate such shocks.  (nb, these things aren't
fool-proof; even if a HDD has one of these sensors, they only work if
they can detect the transition to free-fall, and the hard drive has
time to retract the heads before the actual shock hits; if you have a
sudden shock, the g-shock sensors won't have time to react and save
the hard drive).

Depending on how severe the shock happens to be, the head could end up
impacting the platter, destroying the medium (which used to be
iron-oxide; hence the term "spinning rust platters") at that spot.
This will obviously cause a write failure, and the previous contents
of the sector will be lost.  This is also considered a failure of the
ATOMIC-WRITE property, and no, ext3 doesn't handle this case
gracefully.  Very few file systems do.  (It is possible for a filesystem that
doesn't have fixed metadata to immediately write the inode table to a
different location on the disk, and then update the pointers to the
inode table point to the new location on disk; but very few
filesystems do this, and even those that do usually rely on the
superblock being available on a fixed location on disk.  It's much
simpler to assume that hard drives usually behave sanely, and that
writes very rarely fail.)
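
As a toy illustration of that "write the new copy elsewhere, then flip
a pointer" idea (made-up Python, not any real filesystem):

    class Disk:
        def __init__(self):
            self.blocks = {}             # block number -> contents
            self.inode_table_ptr = None  # assume this update is atomic

    def update_inode_table(disk, new_table, free_block):
        disk.blocks[free_block] = new_table   # write the new copy first
        disk.inode_table_ptr = free_block     # old copy stays valid until here

    d = Disk()
    update_inode_table(d, {"inode1": {"i_size": 13000}}, 42)
    update_inode_table(d, {"inode1": {"i_size": 26000}}, 43)
    # A crash between the two steps just leaves the pointer at the old,
    # intact copy -- at the cost of relying on the pointer write itself.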

It's for this reason that I've never been completely sure how useful
Pavel's proposed treatise about file system expectations really is
--- because all storage subsystems *usually* provide these guarantees,
but it is the very rare storage system that *always* provides these
guarantees.

We could just as easily have several kilobytes of explanation in
Documentation/* explaining how we assume that DRAM always returns the
same value that was stored in it previously --- and yet most PC class
hardware still does not use ECC memory, and cosmic rays are a reality.
That means that most Linux systems run on systems that are vulnerable
to this kind of failure --- and the world hasn't ended.

As I recall, the main problem which Pavel had was when he was using
ext3 on a *really* trashy flash drive, on a *really* trashy laptop
where the flash card stuck out slightly, and any jostling of the
netbook would cause the flash card to become disconnected from the
laptop, and cause write errors, very easily and very frequently.  In
those circumstances, it's highly unlikely that ***any*** file system
would have been able to survive such an unreliable storage system.


One of the problems I have with the breakdown which Pavel has used is
that it doesn't break things down according to probability; the chance
of a storage subsystem scribbling garbage on its last write during a
power failure is very different from the chance that the hard drive
fails due to a shock, or due to some spilled printer toner near the
disk drive which somehow manages to find its way inside the enclosure
containing the spinning platters, versus the other forms of random
failures that lead to write failures.  All of these fall into the
category of a failure of the property he has named "ATOMIC-WRITE", but
in fact ways in which the filesystem might try to protect itself are
varied, and it isn't necessarily all or nothing.  One can imagine a
file system which can handle write failures for data blocks, but not
for metadata blocks; given that data blocks outnumber metadata blocks
by hundreds to one, and that write failures are relatively rare
(unless you have said trashy laptop with a trash flash card), a file
system that can gracefully deal with data block failures would be a
useful advancement.

But these things are never absolute, mainly because people aren't
willing to pay either the cost of superior hardware (consider the
cost of ECC memory, which isn't *that* much more expensive; and yet
most PC class systems don't use it) or the software overhead
(historically many file system designers have eschewed the use of
physical block journalling because it really hurts on meta-data
intensive benchmarks), so talking about absolute requirements for
ATOMIC-WRITE isn't all that useful --- because nearly all hardware
doesn't provide these guarantees, and nearly all filesystems require
them.  So to call out ext2 and ext3 for requiring them, without making
clear that pretty much *all* file systems require them, ends up
causing people to switch over to some other file system that
ironically enough, might end up being *more* vulnerable, but which
didn't earn Pavel's displeasure because he didn't try using, say, XFS
on his flashcard on his trashy laptop.

						- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is  possible
  2009-08-24  9:31           ` [patch] " Pavel Machek
@ 2009-08-24 13:21               ` Greg Freemyer
  2009-08-24 13:21               ` Greg Freemyer
  2009-08-24 21:11             ` Rob Landley
  2 siblings, 0 replies; 309+ messages in thread
From: Greg Freemyer @ 2009-08-24 13:21 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4

On Mon, Aug 24, 2009 at 5:31 AM, Pavel Machek<pavel@ucw.cz> wrote:
>
> Running journaling filesystem such as ext3 over flashdisk or degraded
> RAID array is a bad idea: journaling guarantees no longer apply and
> you will get data corruption on powerfail.
>
> We can't solve it easily, but we should certainly warn the users. I
> actually lost data because I did not understand these limitations...
>
> Signed-off-by: Pavel Machek <pavel@ucw.cz>
>
> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..80fa886
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,52 @@
> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly.
> +
> +       Fortunately writes failing are very uncommon on traditional
> +       spinning disks, as they have spare sectors they use when write
> +       fails.
> +
> +Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
> +and are thus unsuitable for all filesystems I know.
> +
> +       An inherent problem with using flash as a normal block device
> +       is that the flash erase size is bigger than most filesystem
> +       sector sizes.  So when you request a write, it may erase and
> +       rewrite some 64k, 128k, or even a couple megabytes on the
> +       really _big_ ones.
> +
> +       If you lose power in the middle of that, filesystem won't
> +       notice that data in the "sectors" _around_ the one you were
> +       trying to write to got trashed.
> +
> +       RAID-4/5/6 in degraded mode has same problem.
> +
> +
> +Don't damage the old data on a failed write (ATOMIC-WRITES)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +       Because RAM tends to fail faster than rest of system during
> +       powerfail, special hw killing DMA transfers may be necessary;
> +       otherwise, disks may write garbage during powerfail.
> +       This may be quite common on generic PC machines.
> +
> +       Note that atomic write is very hard to guarantee for RAID-4/5/6,
> +       because it needs to write both changed data, and parity, to
> +       different disks. (But it will only really show up in degraded mode).
> +       UPS for RAID array should help.

Can someone clarify if this is true in raid-6 with just a single disk
failure?  I don't see why it would be.

And if not, can the above text be changed to reflect that raid 4/5 with a
single disk failure and raid 6 with a double disk failure are the
modes that have atomicity problems?

Greg

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 11:19             ` Florian Weimer
  2009-08-24 13:01               ` Theodore Tso
@ 2009-08-24 13:50               ` Theodore Tso
  2009-08-24 18:48                 ` Pavel Machek
  2009-08-24 18:48                 ` Pavel Machek
  2009-08-24 18:39               ` Pavel Machek
  2 siblings, 2 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-24 13:50 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Pavel Machek, Goswin von Brederlow, Rob Landley, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4

On Mon, Aug 24, 2009 at 11:19:01AM +0000, Florian Weimer wrote:
> > +* don't damage the old data on a failed write (ATOMIC-WRITES)
> > +
> > +	(Thrash may get written into sectors during powerfail.  And
> > +	ext3 handles this surprisingly well at least in the
> > +	catastrophic case of garbage getting written into the inode
> > +	table, since the journal replay often will "repair" the
> > +	garbage that was written into the filesystem metadata blocks.
> 
> Isn't this by design?  In other words, if the metadata doesn't survive
> non-atomic writes, wouldn't it be an ext3 bug?

So I got confused when I quoted your note, which I had assumed was
exactly what Pavel had written in his documentation.  In fact, what he
had written was this:

+Don't damage the old data on a failed write (ATOMIC-WRITES)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+....

So he had explicitly stated that he only cared about the whole sector
being written (or not written) in the power fail case, and not any
other.  I'd suggest changing ATOMIC-WRITES to
ATOMIC-WRITE-ON-POWERFAIL, since the one-line summary, "Don't damage
the old data on a failed write", is also singularly misleading.

    	     	  	 	    	 	    - Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 13:01               ` Theodore Tso
@ 2009-08-24 14:55                 ` Artem Bityutskiy
  2009-08-24 22:30                   ` Rob Landley
  2009-08-24 19:52                   ` Pavel Machek
  2009-08-25 14:43                 ` Florian Weimer
  2 siblings, 1 reply; 309+ messages in thread
From: Artem Bityutskiy @ 2009-08-24 14:55 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Florian Weimer, Pavel Machek, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

Hi Theodore,

thanks for the insightful writing.

On 08/24/2009 04:01 PM, Theodore Tso wrote:

...snip ...

> It's for this reason that I've never been completely sure how useful
> Pavel's proposed treatise about file systems expectations really are
> --- because all storage subsystems *usually* provide these guarantees,
> but it is the very rare storage system that *always* provides these
> guarantees.

There is a thing called eMMC (embedded MMC) in the embedded world. You
may consider it a non-removable MMC. This thing is a block device from
the Linux POV, and you may mount ext3 on top of it. And people do this.

The device seems to have a decent FTL, and does not look bad.

However, there are subtle things which mortals never think about. In
the case of eMMC, power cuts may make some sectors unreadable - eMMC returns
ECC errors on reads. Namely, the sectors which were being written at
the very moment the power cut happened may become unreadable.
This makes ext3 refuse to mount the file-system, and makes
fsck.ext3 refuse to check it. This should be fixable in
SW, but we have not found time to do it so far.
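
[One possible shape of such a SW fix, purely hypothetical: scan for the
sectors that now return read errors and rewrite them, so the device
becomes fully readable again and fsck can do its job.  The device name,
sector size and the assumption that a rewrite clears the error are all
illustrative:

/* Hypothetical recovery pass: sectors interrupted by the power cut read
 * back as errors; rewriting them with zeros makes them readable again
 * (with zeroed contents), after which fsck can repair the filesystem. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR 512

int main(void)
{
	const char *dev = "/dev/mmcblk0";	/* made-up example device */
	unsigned char buf[SECTOR], zero[SECTOR];
	off_t off;
	int fd;

	memset(zero, 0, sizeof(zero));
	fd = open(dev, O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (off = 0; ; off += SECTOR) {
		ssize_t n = pread(fd, buf, SECTOR, off);

		if (n == 0)
			break;		/* end of device */
		if (n < 0) {
			fprintf(stderr, "rewriting unreadable sector at %lld\n",
				(long long)off);
			if (pwrite(fd, zero, SECTOR, off) != SECTOR)
				perror("pwrite");
		}
	}
	close(fd);
	return 0;
}

Whether a plain rewrite is really enough depends on the card.]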

Anyway, my point is that documenting subtle things like this is a very
good thing to do, just because nowadays we are trying to use existing
software with flash-based storage devices, which may violate these
subtle assumptions, or introduce other ones.

Probably Pavel did too good a job of generalizing things, and it could be
better to make a doc about HDD vs SSD, or HDD vs flash-based storage.
Not sure. But the idea of documenting subtle FS assumptions is good, IMO.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 11:19             ` Florian Weimer
  2009-08-24 13:01               ` Theodore Tso
  2009-08-24 13:50               ` Theodore Tso
@ 2009-08-24 18:39               ` Pavel Machek
  2 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-24 18:39 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4

Hi!

> > +Linux block-backed filesystems can only work correctly when several
> > +conditions are met in the block layer and below (disks, flash
> > +cards). Some of them are obvious ("data on media should not change
> > +randomly"), some are less so.
> 
> You should make clear that the file lists per-file-system rules and
> that some file systems can recover from some of the error conditions.

Ok, I added a "Not all filesystems require all of these
to be satisfied for safe operation" sentence there.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 13:21               ` Greg Freemyer
  (?)
@ 2009-08-24 18:44               ` Pavel Machek
  -1 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-24 18:44 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4


> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> > +
> > +       Because RAM tends to fail faster than rest of system during
> > +       powerfail, special hw killing DMA transfers may be necessary;
> > +       otherwise, disks may write garbage during powerfail.
> > +       This may be quite common on generic PC machines.
> > +
> > +       Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > +       because it needs to write both changed data, and parity, to
> > +       different disks. (But it will only really show up in degraded mode).
> > +       UPS for RAID array should help.
> 
> Can someone clarify if this is true in raid-6 with just a single disk
> failure?  I don't see why it would be.
> 
> And if not, can the above text be changed to reflect that raid 4/5 with a
> single disk failure and raid 6 with a double disk failure are the
> modes that have atomicity problems?

I don't know enough about raid-6, but... I said "degraded mode" above,
and you can read it as double failure in the raid-6 case ;-). I'd prefer
to avoid too many details here.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 13:50               ` Theodore Tso
  2009-08-24 18:48                 ` Pavel Machek
@ 2009-08-24 18:48                 ` Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-24 18:48 UTC (permalink / raw)
  To: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

Hi!

> > > +* don't damage the old data on a failed write (ATOMIC-WRITES)
> > > +
> > > +	(Thrash may get written into sectors during powerfail.  And
> > > +	ext3 handles this surprisingly well at least in the
> > > +	catastrophic case of garbage getting written into the inode
> > > +	table, since the journal replay often will "repair" the
> > > +	garbage that was written into the filesystem metadata blocks.
> > 
> > Isn't this by design?  In other words, if the metadata doesn't survive
> > non-atomic writes, wouldn't it be an ext3 bug?
> 
> So I got confused when I quoted your note, which I had assumed was
> exactly what Pavel had written in his documentation.  In fact, what he
> had written was this:
> 
> +Don't damage the old data on a failed write (ATOMIC-WRITES)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +....
> 
> So he had explicitly stated that he only cared about the whole sector
> being written (or not written) in the power fail case, and not any
> other.  I'd suggest changing ATOMIC-WRITES to
> ATOMIC-WRITE-ON-POWERFAIL, since the one-line summary, "Don't damage
> the old data on a failed write", is also singularly misleading.

Ok, something like this?

Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Either whole sector is correctly written or nothing is written during
powerfail.


									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 13:01               ` Theodore Tso
@ 2009-08-24 19:52                   ` Pavel Machek
  2009-08-24 19:52                   ` Pavel Machek
  2009-08-25 14:43                 ` Florian Weimer
  2 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-24 19:52 UTC (permalink / raw)
  To: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

Hi!

> > Isn't this by design?  In other words, if the metadata doesn't survive
> > non-atomic writes, wouldn't it be an ext3 bug?
> 
> Part of the problem here is that "atomic-writes" is confusing; it
> doesn't mean what many people think it means.  The assumption which
> many naive filesystem designers make is that writes succeed or they
> don't.  If they don't succeed, they don't change the previously
> existing data in any way.  
> 
> So in the case of journalling, the assumption which gets made is that
> when the power fails, the disk either writes a particular disk block,
> or it doesn't.  The problem here is as with humans and animals, death
> is not an event, it is a process.  When the power fails, the system
> just doesn't stop functioning; the power on the +5 and +12 volt rails
> start dropping to zero, and different components fail at different
> times.  Specifically, DRAM, being the most voltage sensitive, tends to
> fail before the DMA subsystem, the PCI bus, and the hard drive fails.
> So as a result, garbage can get written out to disk as part of the
> failure.  That's just the way hardware works.

Yep, and at that point you lost data. You had "silent data corruption"
from fs point of view, and that's bad. 

It will be probably very bad on XFS, probably okay on Ext3, and
certainly okay on Ext2: you do filesystem check, and you should be
able to repair any damage. So yes, physical journaling is good, but
fsck is better.

> Is that a file system "bug"?  Well, it's better to call that a
> mismatch between the assumptions made of physical devices, and of the
> file system code.  On Irix, SGI hardware had a powerfail interrupt,

If those filesystem assumptions were not documented, I'd call it
filesystem bug. So better document them ;-).

> There is another kind of non-atomic write that nearly all file systems
> are subject to, however, and to give an example of this, consider what
> happens if a laptop is subjected to a sudden shock while it is
> writing a sector, and the hard drive doesn't have an accelerometer which
...
> Depending on how severe the shock happens to be, the head could end up
> impacting the platter, destroying the medium (which used to be
> iron-oxide; hence the term "spinning rust platters") at that spot.
> This will obviously cause a write failure, and the previous contents
> of the sector will be lost.  This is also considered a failure of the
> ATOMIC-WRITE property, and no, ext3 doesn't handle this case
> gracefully.  Very few file systems do.  (It is possible for an OS
> that

Actually, ext2 should be able to survive that, no? Error writing ->
remount ro -> fsck on next boot -> drive relocates the sectors.

> It's for this reason that I've never been completely sure how useful
> Pavel's proposed treatise about file systems expectations really are
> --- because all storage subsystems *usually* provide these guarantees,
> but it is the very rare storage system that *always* provides these
> guarantees.

Well... there's very big difference between harddrives and flash
memory. Harddrives usually work, and flash memory never does.

> We could just as easily have several kilobytes of explanation in
> Documentation/* explaining how we assume that DRAM always returns the
> same value that was stored in it previously --- and yet most PC class
> hardware still does not use ECC memory, and cosmic rays are a reality.
> That means that most Linux systems run on systems that are vulnerable
> to this kind of failure --- and the world hasn't ended.

There's a difference. In case of cosmic rays, hardware is clearly
buggy. I have one machine with bad DRAM (about 1 errors in 2 days),
and I still use it. I will not complain if ext3 trashes that.

In case of degraded raid-5, even with perfect hardware, and with
ext3 on top of that, you'll get silent data corruption. Nice, eh?

Clearly, Linux is buggy there. It could be argued it is raid-5's
fault, or maybe it is ext3's fault, but... linux is still buggy.

> As I recall, the main problem which Pavel had was when he was using
> ext3 on a *really* trashy flash drive, on a *really* trashing laptop
> where the flash card stuck out slightly, and any jostling of the
> netbook would cause the flash card to become disconnected from the
> laptop, and cause write errors, very easily and very frequently.  In
> those circumstances, it's highly unlikely that ***any*** file system
> would have been able to survive such an unreliable storage system.

Well well well. Before I pulled that flash card, I assumed that doing
so was safe, because the flash card is presented as a block device and ext3
should cope with sudden disk disconnects.

And I was wrong wrong wrong. (No one told me at the university. I guess
I should want my money back).

Plus note that it is not only my trashy laptop and one trashy MMC
card; every USB thumb drive I have seen is affected. (OTOH USB disks should
be safe AFAICT).

Ext3 is unsuitable for flash cards and RAID arrays, plain and
simple. It is not documented anywhere :-(. [ext2 should work better --
at least you'll not get silent data corruption.]

> One of the problems I have with the break down which Pavel has used is
> that it doesn't break things down according to probability; the chance
> of a storage subsystem scribbling garbage on its last write during a

Can you suggest better patch? I'm not saying we should redesign ext3,
but... someone should have told me that ext3+USB thumb drive=problems.

> But these things are never absolute, mainly because people aren't
> willing to pay for either the cost of superior hardware (consider the
> cost of ECC memory, which isn't *that* much more expensive; and yet
> most PC class systems don't use it) or in terms of software overhead
> (historically many file system designers have eschewed the use of
> physical block journalling because it really hurts on meta-data
> intensive benchmarks), talking about absolute requirements for
> ATOMIC-WRITE isn't all that useful --- because nearly all hardware
> doesn't provide these guarantees, and nearly all filesystems require
> them.  So to call out ext2 and ext3 for requiring them, without
> making

ext3+raid5 will fail even if you have perfect hardware.

> clear that pretty much *all* file systems require them, ends up
> causing people to switch over to some other file system that
> ironically enough, might end up being *more* vulnerable, but which
> didn't earn Pavel's displeasure because he didn't try using, say, XFS
> on his flashcard on his trashy laptop.

I hold ext2/ext3 to higher standards than other filesystem in
tree. I'd not use XFS/VFAT etc. 

I would not want people to migrate towards XFS/VFAT, and yes I believe
XFSs/VFATs/... requirements should be documented, too. (But I know too
little about those filesystems).

If you can suggest better wording, please help me. But... those
requirements are non-trivial, commonly not met and the result is data
loss. It has to be documented somehow. Make it as innocent-looking as
you can...

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 19:52                   ` Pavel Machek
  (?)
@ 2009-08-24 20:24                   ` Ric Wheeler
  2009-08-24 20:52                     ` Pavel Machek
  2009-08-25 18:52                     ` Rob Landley
  -1 siblings, 2 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-24 20:24 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

Pavel Machek wrote:
> Hi!
>
>   
>>> Isn't this by design?  In other words, if the metadata doesn't survive
>>> non-atomic writes, wouldn't it be an ext3 bug?
>>>       
>> Part of the problem here is that "atomic-writes" is confusing; it
>> doesn't mean what many people think it means.  The assumption which
>> many naive filesystem designers make is that writes succeed or they
>> don't.  If they don't succeed, they don't change the previously
>> existing data in any way.  
>>
>> So in the case of journalling, the assumption which gets made is that
>> when the power fails, the disk either writes a particular disk block,
>> or it doesn't.  The problem here is as with humans and animals, death
>> is not an event, it is a process.  When the power fails, the system
>> just doesn't stop functioning; the power on the +5 and +12 volt rails
>> start dropping to zero, and different components fail at different
>> times.  Specifically, DRAM, being the most voltage sensitive, tends to
>> fail before the DMA subsystem, the PCI bus, and the hard drive fails.
>> So as a result, garbage can get written out to disk as part of the
>> failure.  That's just the way hardware works.
>>     
>
> Yep, and at that point you lost data. You had "silent data corruption"
> from fs point of view, and that's bad. 
>
> It will be probably very bad on XFS, probably okay on Ext3, and
> certainly okay on Ext2: you do filesystem check, and you should be
> able to repair any damage. So yes, physical journaling is good, but
> fsck is better.
>   

I don't see why you think that. In general, fsck (for any fs) only 
checks metadata. If you have silent data corruption that corrupts things 
that are fixable by fsck, you most likely have silent corruption hitting 
things users care about like their data blocks inside of files. Fsck 
will not fix (or notice) any of that, that is where things like full 
data checksums can help.

Also note (from first-hand experience), unless you check and validate
your data, you can have data corruption that will not get flagged as IO
errors, so data signing or scrubbing is a critical part of data integrity.
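
[A minimal sketch of what such a scrub looks like at the application
level, using a plain CRC; real products use stronger checksums and keep
them out of band, and the block size and values here are made up:

/* Keep a checksum per data block at write time and re-verify it later.
 * Silent corruption produces no IO error, so only the compare notices. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t crc32_simple(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;
	size_t i;
	int b;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (b = 0; b < 8; b++)
			crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
	}
	return ~crc;
}

int main(void)
{
	uint8_t block[4096];
	uint32_t stored;

	memset(block, 0xA5, sizeof(block));
	stored = crc32_simple(block, sizeof(block));	/* at write time */

	block[1234] ^= 0x40;	/* simulate a silently flipped bit */

	if (crc32_simple(block, sizeof(block)) != stored)
		printf("scrub: silent corruption detected\n");
	else
		printf("scrub: block ok\n");
	return 0;
}

Fsck never looks at these bytes, so a compare like this (or a checksumming
filesystem) is the only thing that will notice.]
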
>   
>> Is that a file system "bug"?  Well, it's better to call that a
>> mismatch between the assumptions made of physical devices, and of the
>> file system code.  On Irix, SGI hardware had a powerfail interrupt,
>>     
>
> If those filesystem assumptions were not documented, I'd call it
> filesystem bug. So better document them ;-).
>
>   
I think that we need to help people understand the full spectrum of data 
concerns, starting with reasonable best practices that will help most 
people suffer *less* (not no) data loss. And make very sure that they 
are not falsely assured that by following any specific script that they 
can skip backups, remote backups, etc :-)

Nothing in our code in any part of the kernel deals well with every 
disaster or odd event.

>> There is another kind of non-atomic write that nearly all file systems
>> are subject to, however, and to give an example of this, consider what
>> happens if a laptop is subjected to a sudden shock while it is
>> writing a sector, and the hard drive doesn't have an accelerometer which
>>     
> ...
>   
>> Depending on how severe the shock happens to be, the head could end up
>> impacting the platter, destroying the medium (which used to be
>> iron-oxide; hence the term "spinning rust platters") at that spot.
>> This will obviously cause a write failure, and the previous contents
>> of the sector will be lost.  This is also considered a failure of the
>> ATOMIC-WRITE property, and no, ext3 doesn't handle this case
>> gracefully.  Very few file systems do.  (It is possible for an OS
>> that
>>     
>
> Actually, ext2 should be able to survive that, no? Error writing ->
> remount ro -> fsck on next boot -> drive relocates the sectors.
>   

I think that the example and the response are both off base. If your 
head ever touches the platter, you won't be reading from a huge part of 
your drive ever again (usually, you have 2 heads per platter, 3-4 
platters, impact would kill one head and a corresponding percentage of 
your data).

No file system will recover that data although you might be able to 
scrape out some remaining useful bits and bytes.

More common causes of silent corruption would be bad DRAM in things like 
the drive write cache, hot spots (that cause adjacent track data 
errors), etc.  Note in this last case, your most recently written data 
is fine, just the data you wrote months/years ago is toast!
>   
>> It's for this reason that I've never been completely sure how useful
>> Pavel's proposed treatise about file systems expectations really are
>> --- because all storage subsystems *usually* provide these guarantees,
>> but it is the very rare storage system that *always* provides these
>> guarantees.
>>     
>
> Well... there's very big difference between harddrives and flash
> memory. Harddrives usually work, and flash memory never does.
>   

It is hard for anyone to see the real data without looking in detail at 
large numbers of parts. Back at EMC, we looked at failures for lots of 
parts so we got a clear grasp on trends.  I do agree that flash/SSD 
parts are still very young so we will have interesting and unexpected 
failure modes to learn to deal with....
>   
>> We could just as easily have several kilobytes of explanation in
>> Documentation/* explaining how we assume that DRAM always returns the
>> same value that was stored in it previously --- and yet most PC class
>> hardware still does not use ECC memory, and cosmic rays are a reality.
>> That means that most Linux systems run on systems that are vulnerable
>> to this kind of failure --- and the world hasn't ended.
>>     
>
> There's a difference. In case of cosmic rays, hardware is clearly
> buggy. I have one machine with bad DRAM (about 1 errors in 2 days),
> and I still use it. I will not complain if ext3 trashes that.
>
> In case of degraded raid-5, even with perfect hardware, and with
> ext3 on top of that, you'll get silent data corruption. Nice, eh?
>
> Clearly, Linux is buggy there. It could be argued it is raid-5's
> fault, or maybe it is ext3's fault, but... linux is still buggy.
>   

Nothing is perfect. It is still a trade off between storage utilization 
(how much storage we give users for say 5 2TB drives), performance and 
costs (throw away any disks over 2 years old?).
>   
>> As I recall, the main problem which Pavel had was when he was using
>> ext3 on a *really* trashy flash drive, on a *really* trashing laptop
>> where the flash card stuck out slightly, and any jostling of the
>> netbook would cause the flash card to become disconnected from the
>> laptop, and cause write errors, very easily and very frequently.  In
>> those circumstances, it's highly unlikely that ***any*** file system
>> would have been able to survive such an unreliable storage system.
>>     
>
> Well well well. Before I pulled that flash card, I assumed that doing
> so was safe, because the flash card is presented as a block device and ext3
> should cope with sudden disk disconnects.
>
> And I was wrong wrong wrong. (No one told me at the university. I guess
> I should want my money back).
>
> Plus note that it is not only my trashy laptop and one trashy MMC
> card; every USB thumb drive I have seen is affected. (OTOH USB disks should
> be safe AFAICT).
>
> Ext3 is unsuitable for flash cards and RAID arrays, plain and
> simple. It is not documented anywhere :-(. [ext2 should work better --
> at least you'll not get silent data corruption.]
>   

ext3 is used on lots of raid arrays without any issue.
>   
>> One of the problems I have with the break down which Pavel has used is
>> that it doesn't break things down according to probability; the chance
>> of a storage subsystem scribbling garbage on its last write during a
>>     
>
> Can you suggest better patch? I'm not saying we should redesign ext3,
> but... someone should have told me that ext3+USB thumb drive=problems.
>
>   
>> But these things are never absolute, mainly because people aren't
>> willing to pay for either the cost of superior hardware (consider the
>> cost of ECC memory, which isn't *that* much more expensive; and yet
>> most PC class systems don't use it) or in terms of software overhead
>> (historically many file system designers have eschewed the use of
>> physical block journalling because it really hurts on meta-data
>> intensive benchmarks), talking about absolute requirements for
>> ATOMIC-WRITE isn't all that useful --- because nearly all hardware
>> doesn't provide these guarantees, and nearly all filesystems require
>> them.  So to call out ext2 and ext3 for requiring them, without
>> making
>>     
>
> ext3+raid5 will fail even if you have perfect hardware.
>
>   
>> clear that pretty much *all* file systems require them, ends up
>> causing people to switch over to some other file system that
>> ironically enough, might end up being *more* vulnerable, but which
>> didn't earn Pavel's displeasure because he didn't try using, say, XFS
>> on his flashcard on his trashy laptop.
>>     
>
> I hold ext2/ext3 to higher standards than other filesystem in
> tree. I'd not use XFS/VFAT etc. 
>
> I would not want people to migrate towards XFS/VFAT, and yes I believe
> XFSs/VFATs/... requirements should be documented, too. (But I know too
> little about those filesystems).
>
> If you can suggest better wording, please help me. But... those
> requirements are non-trivial, commonly not met and the result is data
> loss. It has to be documented somehow. Make it as innocent-looking as
> you can...
>
> 								Pavel
>   

I think that you really need to step back and look harder at real 
failures - not just your personal experience - but a larger set of real 
world failures. Many papers have been published recently about that (the 
google paper, the Bianca paper from FAST, Netapp, etc).

Regards,

Ric



^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 20:24                   ` Ric Wheeler
@ 2009-08-24 20:52                     ` Pavel Machek
  2009-08-24 21:08                       ` Ric Wheeler
  2009-08-24 21:11                         ` Greg Freemyer
  2009-08-25 18:52                     ` Rob Landley
  1 sibling, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-24 20:52 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

Hi!

>> Yep, and at that point you lost data. You had "silent data corruption"
>> from fs point of view, and that's bad. 
>>
>> It will be probably very bad on XFS, probably okay on Ext3, and
>> certainly okay on Ext2: you do filesystem check, and you should be
>> able to repair any damage. So yes, physical journaling is good, but
>> fsck is better.
>
> I don't see why you think that. In general, fsck (for any fs) only  
> checks metadata. If you have silent data corruption that corrupts things  
> that are fixable by fsck, you most likely have silent corruption hitting  
> things users care about like their data blocks inside of files. Fsck  
> will not fix (or notice) any of that, that is where things like full  
> data checksums can help.

Ok, but in case of data corruption, at least your filesystem does not
degrade further.

>> If those filesystem assumptions were not documented, I'd call it
>> filesystem bug. So better document them ;-).
>>   
> I think that we need to help people understand the full spectrum of data  
> concerns, starting with reasonable best practices that will help most  
> people suffer *less* (not no) data loss. And make very sure that they  
> are not falsely assured that by following any specific script that they  
> can skip backups, remote backups, etc :-)
>
> Nothing in our code in any part of the kernel deals well with every  
> disaster or odd event.

I can reproduce data loss with ext3 on flashcard in about 40
seconds. I'd not call that "odd event". It would be nice to handle
that, but that is hard. So ... can we at least get that documented
please?


>> Actually, ext2 should be able to survive that, no? Error writing ->
>> remount ro -> fsck on next boot -> drive relocates the sectors.
>>   
>
> I think that the example and the response are both off base. If your  
> head ever touches the platter, you won't be reading from a huge part of  
> your drive ever again (usually, you have 2 heads per platter, 3-4  
> platters, impact would kill one head and a corresponding percentage of  
> your data).

Ok, that's obviously game over.

>>> It's for this reason that I've never been completely sure how useful
>>> Pavel's proposed treatise about file systems expectations really are
>>> --- because all storage subsystems *usually* provide these guarantees,
>>> but it is the very rare storage system that *always* provides these
>>> guarantees.
>>
>> Well... there's very big difference between harddrives and flash
>> memory. Harddrives usually work, and flash memory never does.
>
> It is hard for anyone to see the real data without looking in detail at  
> large numbers of parts. Back at EMC, we looked at failures for lots of  
> parts so we got a clear grasp on trends.  I do agree that flash/SSD  
> parts are still very young so we will have interesting and unexpected  
> failure modes to learn to deal with....

_Maybe_ SSDs, being HDD replacements are better. I don't know.

_All_ flash cards (MMC, USB, SD) had the problems. You don't need to
get clear grasp on trends. Those cards just don't meet ext3
expectations, and if you pull them, you get data loss.

>>> We could just as easily have several kilobytes of explanation in
>>> Documentation/* explaining how we assume that DRAM always returns the
>>> same value that was stored in it previously --- and yet most PC class
>>> hardware still does not use ECC memory, and cosmic rays are a reality.
>>> That means that most Linux systems run on systems that are vulnerable
>>> to this kind of failure --- and the world hasn't ended.

>> There's a difference. In case of cosmic rays, hardware is clearly
>> buggy. I have one machine with bad DRAM (about 1 errors in 2 days),
>> and I still use it. I will not complain if ext3 trashes that.
>>
>> In case of degraded raid-5, even with perfect hardware, and with
>> ext3 on top of that, you'll get silent data corruption. Nice, eh?
>>
>> Clearly, Linux is buggy there. It could be argued it is raid-5's
>> fault, or maybe it is ext3's fault, but... linux is still buggy.
>
> Nothing is perfect. It is still a trade off between storage utilization  
> (how much storage we give users for say 5 2TB drives), performance and  
> costs (throw away any disks over 2 years old?).

"Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
believe that should be at least documented. (And understand why ZFS is
interesting thing).

>> Ext3 is unsuitable for flash cards and RAID arrays, plain and
>> simple. It is not documented anywhere :-(. [ext2 should work better --
>> at least you'll not get silent data corruption.]
>
> ext3 is used on lots of raid arrays without any issue.

And I still use my zaurus with crappy DRAM.

I would not trust raid5 array with my data, for multiple
reasons. The fact that degraded raid5 breaks ext3 assumptions should
really be documented.

>> I hold ext2/ext3 to higher standards than other filesystem in
>> tree. I'd not use XFS/VFAT etc. 
>>
>> I would not want people to migrate towards XFS/VFAT, and yes I believe
>> XFSs/VFATs/... requirements should be documented, too. (But I know too
>> little about those filesystems).
>>
>> If you can suggest better wording, please help me. But... those
>> requirements are non-trivial, commonly not met and the result is data
>> loss. It has to be documented somehow. Make it as innocent-looking as
>> you can...

>
> I think that you really need to step back and look harder at real  
> failures - not just your personal experience - but a larger set of real  
> world failures. Many papers have been published recently about that (the  
> google paper, the Bianca paper from FAST, Netapp, etc).

The papers show failures in "once a year" range. I have "twice a
minute" failure scenario with flashdisks.

Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
but I bet it would be on "once a day" scale.

We should document those.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 20:52                     ` Pavel Machek
@ 2009-08-24 21:08                       ` Ric Wheeler
  2009-08-24 21:25                         ` Pavel Machek
  2009-08-24 21:11                         ` Greg Freemyer
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-24 21:08 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

Pavel Machek wrote:
> Hi!
>
>   
>>> Yep, and at that point you lost data. You had "silent data corruption"
>>> from fs point of view, and that's bad. 
>>>
>>> It will be probably very bad on XFS, probably okay on Ext3, and
>>> certainly okay on Ext2: you do filesystem check, and you should be
>>> able to repair any damage. So yes, physical journaling is good, but
>>> fsck is better.
>>>       
>> I don't see why you think that. In general, fsck (for any fs) only  
>> checks metadata. If you have silent data corruption that corrupts things  
>> that are fixable by fsck, you most likely have silent corruption hitting  
>> things users care about like their data blocks inside of files. Fsck  
>> will not fix (or notice) any of that, that is where things like full  
>> data checksums can help.
>>     
>
> Ok, but in case of data corruption, at least your filesystem does not
> degrade further.
>
>   
Even worse, your data is potentially gone and you have not noticed
it...  This is why array vendors and archival storage products do
periodic scans of all stored data (read all the bytes, compare them to a
digital signature, etc).
>>> If those filesystem assumptions were not documented, I'd call it
>>> filesystem bug. So better document them ;-).
>>>   
>>>       
>> I think that we need to help people understand the full spectrum of data  
>> concerns, starting with reasonable best practices that will help most  
>> people suffer *less* (not no) data loss. And make very sure that they  
>> are not falsely assured that by following any specific script that they  
>> can skip backups, remote backups, etc :-)
>>
>> Nothing in our code in any part of the kernel deals well with every  
>> disaster or odd event.
>>     
>
> I can reproduce data loss with ext3 on flashcard in about 40
> seconds. I'd not call that "odd event". It would be nice to handle
> that, but that is hard. So ... can we at least get that documented
> please?
>   

Part of documenting best practices is to put down very specific things 
that do/don't work. What I worry about is producing too much detail to 
be of use for real end users.

I have to admit that I have not paid enough attention to the specifics
of your ext3 + flash card issue - is it the ftl stuff doing out of order 
IO's? 
>
>   
>>> Actually, ext2 should be able to survive that, no? Error writing ->
>>> remount ro -> fsck on next boot -> drive relocates the sectors.
>>>   
>>>       
>> I think that the example and the response are both off base. If your  
>> head ever touches the platter, you won't be reading from a huge part of  
>> your drive ever again (usually, you have 2 heads per platter, 3-4  
>> platters, impact would kill one head and a corresponding percentage of  
>> your data).
>>     
>
> Ok, that's obviously game over.
>   

This is when you start seeing lots of READ and WRITE errors :-)
>   
>>>> It's for this reason that I've never been completely sure how useful
>>>> Pavel's proposed treatise about file systems expectations really are
>>>> --- because all storage subsystems *usually* provide these guarantees,
>>>> but it is the very rare storage system that *always* provides these
>>>> guarantees.
>>>>         
>>> Well... there's very big difference between harddrives and flash
>>> memory. Harddrives usually work, and flash memory never does.
>>>       
>> It is hard for anyone to see the real data without looking in detail at  
>> large numbers of parts. Back at EMC, we looked at failures for lots of  
>> parts so we got a clear grasp on trends.  I do agree that flash/SSD  
>> parts are still very young so we will have interesting and unexpected  
>> failure modes to learn to deal with....
>>     
>
> _Maybe_ SSDs, being HDD replacements are better. I don't know.
>
> _All_ flash cards (MMC, USB, SD) had the problems. You don't need to
> get clear grasp on trends. Those cards just don't meet ext3
> expectations, and if you pull them, you get data loss.
>
>   
Pull them even after an unmount, or pull them hot?
>>>> We could just as easily have several kilobytes of explanation in
>>>> Documentation/* explaining how we assume that DRAM always returns the
>>>> same value that was stored in it previously --- and yet most PC class
>>>> hardware still does not use ECC memory, and cosmic rays are a reality.
>>>> That means that most Linux systems run on systems that are vulnerable
>>>> to this kind of failure --- and the world hasn't ended.
>>>>         
>
>   
>>> There's a difference. In case of cosmic rays, hardware is clearly
>>> buggy. I have one machine with bad DRAM (about 1 errors in 2 days),
>>> and I still use it. I will not complain if ext3 trashes that.
>>>
>>> In case of degraded raid-5, even with perfect hardware, and with
>>> ext3 on top of that, you'll get silent data corruption. Nice, eh?
>>>
>>> Clearly, Linux is buggy there. It could be argued it is raid-5's
>>> fault, or maybe it is ext3's fault, but... linux is still buggy.
>>>       
>> Nothing is perfect. It is still a trade off between storage utilization  
>> (how much storage we give users for say 5 2TB drives), performance and  
>> costs (throw away any disks over 2 years old?).
>>     
>
> "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
> believe that should be at least documented. (And understand why ZFS is
> interesting thing).
>
>   
Your statement is overly broad - ext3 on a commercial RAID array that 
does RAID5 or RAID6, etc has no issues that I know of.

Do you know first hand that ZFS works on flash cards?
>>> Ext3 is unsuitable for flash cards and RAID arrays, plain and
>>> simple. It is not documented anywhere :-(. [ext2 should work better --
>>> at least you'll not get silent data corruption.]
>>>       
>> ext3 is used on lots of raid arrays without any issue.
>>     
>
> And I still use my zaurus with crappy DRAM.
>
> I would not trust raid5 array with my data, for multiple
> reasons. The fact that degraded raid5 breaks ext3 assumptions should
> really be documented.
>   

Again, you say RAID5 without enough specifics.  Are you pointing just at 
MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5 vendor?
>   
>>> I hold ext2/ext3 to higher standards than other filesystem in
>>> tree. I'd not use XFS/VFAT etc. 
>>>
>>> I would not want people to migrate towards XFS/VFAT, and yes I believe
>>> XFSs/VFATs/... requirements should be documented, too. (But I know too
>>> little about those filesystems).
>>>
>>> If you can suggest better wording, please help me. But... those
>>> requirements are non-trivial, commonly not met and the result is data
>>> loss. It has to be documented somehow. Make it as innocent-looking as
>>> you can...
>>>       
>
>   
>> I think that you really need to step back and look harder at real  
>> failures - not just your personal experience - but a larger set of real  
>> world failures. Many papers have been published recently about that (the  
>> google paper, the Bianca paper from FAST, Netapp, etc).
>>     
>
> The papers show failures in "once a year" range. I have "twice a
> minute" failure scenario with flashdisks.
>
> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
> but I bet it would be on "once a day" scale.
>
> We should document those.
> 								Pavel
>   

Documentation is fine with sufficient, hard data....

ric



^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24  9:31           ` [patch] " Pavel Machek
  2009-08-24 11:19             ` Florian Weimer
  2009-08-24 13:21               ` Greg Freemyer
@ 2009-08-24 21:11             ` Rob Landley
  2009-08-24 21:33               ` Pavel Machek
  2 siblings, 1 reply; 309+ messages in thread
From: Rob Landley @ 2009-08-24 21:11 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	tytso, rdunlap, linux-doc, linux-ext4

On Monday 24 August 2009 04:31:43 Pavel Machek wrote:
> Running journaling filesystem such as ext3 over flashdisk or degraded
> RAID array is a bad idea: journaling guarantees no longer apply and
> you will get data corruption on powerfail.
>
> We can't solve it easily, but we should certainly warn the users. I
> actually lost data because I did not understand these limitations...
>
> Signed-off-by: Pavel Machek <pavel@ucw.cz>

Acked-by: Rob Landley <rob@landley.net>

With a couple comments:

> +* write caching is disabled. ext2 does not know how to issue barriers
> +  as of 2.6.28. hdparm -W0 disables it on SATA disks.

It's coming up on 2.6.31; has it learned anything since, or should that version
number be bumped?

> +	(Thrash may get written into sectors during powerfail.  And
> +	ext3 handles this surprisingly well at least in the
> +	catastrophic case of garbage getting written into the inode
> +	table, since the journal replay often will "repair" the
> +	garbage that was written into the filesystem metadata blocks.
> +	It won't do a bit of good for the data blocks, of course
> +	(unless you are using data=journal mode).  But this means that
> +	in fact, ext3 is more resistant to surviving failures to the
> +	first problem (powerfail while writing can damage old data on
> +	a failed write) but fortunately, hard drives generally don't
> +	cause collateral damage on a failed write.

Possible rewording of this paragraph:

  Ext3 handles trash getting written into sectors during powerfail
  surprisingly well.  It's not foolproof, but it is resilient.  Incomplete
  journal entries are ignored, and journal replay of complete entries will
  often "repair" garbage written into the inode table.  The data=journal
  option extends this behavior to file and directory data blocks as well
  (without which your dentries can still be badly corrupted by a power fail
  during a write).

(I'm not entirely sure about that last bit, but clarifying it one way or the 
other would be nice because I can't tell from reading it which it is.  My 
_guess_ is that directories are just treated as files with an attitude and an 
extra caching layer...?)

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is  possible
  2009-08-24 20:52                     ` Pavel Machek
@ 2009-08-24 21:11                         ` Greg Freemyer
  2009-08-24 21:11                         ` Greg Freemyer
  1 sibling, 0 replies; 309+ messages in thread
From: Greg Freemyer @ 2009-08-24 21:11 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4

> The papers show failures in "once a year" range. I have "twice a
> minute" failure scenario with flashdisks.
>
> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
> but I bet it would be on "once a day" scale.
>

I agree it should be documented, but the ext3 atomicity issue is only
an issue on unexpected shutdown while the array is degraded.  I surely
hope most people running raid5 are not seeing that level of unexpected
shutdown, let alone in a degraded array.

If they are, the atomicity issue pretty strongly says they should not
be using raid5 in that environment.  At least not for any filesystem I
know.  Having writes to LBA n corrupt LBA n+128 as an example is
pretty hard to design around from a fs perspective.
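
[A toy model of that effect, 3 disks, one-byte "blocks", made-up values;
it only illustrates the stale-parity arithmetic, not the md implementation:

/* Degraded RAID-5: the disk holding D1 is already dead, so D1 only
 * exists as D0 ^ P.  Power fails between the data and parity writes. */
#include <stdio.h>

int main(void)
{
	unsigned char d0 = 0xAA, d1 = 0x55;
	unsigned char p  = d0 ^ d1;	/* parity consistent with the data */

	d0 = 0x11;	/* new data for D0 reaches its disk */
	/* p = d0 ^ d1;   <-- never happens: powerfail */

	/* After reboot, reading the block that lived on the dead disk
	 * reconstructs it from D0 and the now-stale parity: */
	unsigned char d1_rebuilt = d0 ^ p;

	printf("D1 was 0x%02x, reconstructed as 0x%02x\n", d1, d1_rebuilt);
	/* The block that was never written in that transaction (D1)
	 * comes back corrupted. */
	return 0;
}

The damage lands on a block the filesystem did not touch, which is exactly
why it is so hard to design around at the fs level.]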

Greg

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 21:08                       ` Ric Wheeler
@ 2009-08-24 21:25                         ` Pavel Machek
  2009-08-24 22:05                           ` Ric Wheeler
  2009-08-24 22:39                           ` Theodore Tso
  0 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-24 21:25 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

Hi!

>> I can reproduce data loss with ext3 on flashcard in about 40
>> seconds. I'd not call that "odd event". It would be nice to handle
>> that, but that is hard. So ... can we at least get that documented
>> please?
>>   
>
> Part of documenting best practices is to put down very specific things  
> that do/don't work. What I worry about is producing too much detail to  
> be of use for real end users.

Well, I was trying to write for kernel audience. Someone can turn that
into nice end-user manual.

> I have to admit that I have not paid enough attention to this specifics  
> of your ext3 + flash card issue - is it the ftl stuff doing out of order  
> IO's? 

The problem is that flash cards destroy whole erase block on unplug,
and ext3 can't cope with that.

>> _All_ flash cards (MMC, USB, SD) had the problems. You don't need to
>> get clear grasp on trends. Those cards just don't meet ext3
>> expectations, and if you pull them, you get data loss.
>>   
> Pull them even after an unmount, or pull them hot?

Pull them hot.

[Some people try -osync to avoid data loss on flash cards... that will
not do the trick. Flashcard will still kill the eraseblock.]

>>> Nothing is perfect. It is still a trade off between storage 
>>> utilization  (how much storage we give users for say 5 2TB drives), 
>>> performance and  costs (throw away any disks over 2 years old?).
>>>     
>>
>> "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
>> believe that should be at least documented. (And understand why ZFS is
>> interesting thing).
>>   
> Your statement is overly broad - ext3 on a commercial RAID array that  
> does RAID5 or RAID6, etc has no issues that I know of.

If your commercial RAID array is battery backed, maybe. But I was
talking Linux MD here.

>> And I still use my zaurus with crappy DRAM.
>>
>> I would not trust raid5 array with my data, for multiple
>> reasons. The fact that degraded raid5 breaks ext3 assumptions should
>> really be documented.
>
> Again, you say RAID5 without enough specifics.  Are you pointing just at  
> MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5 
> vendor?

Degraded MD RAID5 on anything, including SATA, and including
hypothetical "perfect disk".

>> The papers show failures in "once a year" range. I have "twice a
>> minute" failure scenario with flashdisks.
>>
>> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
>> but I bet it would be on "once a day" scale.
>>
>> We should document those.
>
> Documentation is fine with sufficient, hard data....

Degraded MD RAID5 does not work by design; whole stripe will be
damaged on powerfail or reset or kernel bug, and ext3 can not cope
with that kind of damage. [I don't see why statistics should be
neccessary for that; the same way we don't need statistics to see that
ext2 needs fsck after powerfail.]
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 21:11             ` Rob Landley
@ 2009-08-24 21:33               ` Pavel Machek
  2009-08-25 18:45                 ` Jan Kara
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-24 21:33 UTC (permalink / raw)
  To: Rob Landley, jack
  Cc: Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	tytso, rdunlap, linux-doc, linux-ext4

On Mon 2009-08-24 16:11:08, Rob Landley wrote:
> On Monday 24 August 2009 04:31:43 Pavel Machek wrote:
> > Running journaling filesystem such as ext3 over flashdisk or degraded
> > RAID array is a bad idea: journaling guarantees no longer apply and
> > you will get data corruption on powerfail.
> >
> > We can't solve it easily, but we should certainly warn the users. I
> > actually lost data because I did not understand these limitations...
> >
> > Signed-off-by: Pavel Machek <pavel@ucw.cz>
> 
> Acked-by: Rob Landley <rob@landley.net>
> 
> With a couple comments:
> 
> > +* write caching is disabled. ext2 does not know how to issue barriers
> > +  as of 2.6.28. hdparm -W0 disables it on SATA disks.
> 
> It's coming up on 2.6.31, has it learned anything since or should that version 
> number be bumped?

Jan, did those "barrier for ext2" patches get merged? 

> > +	(Thrash may get written into sectors during powerfail.  And
> > +	ext3 handles this surprisingly well at least in the
> > +	catastrophic case of garbage getting written into the inode
> > +	table, since the journal replay often will "repair" the
> > +	garbage that was written into the filesystem metadata blocks.
> > +	It won't do a bit of good for the data blocks, of course
> > +	(unless you are using data=journal mode).  But this means that
> > +	in fact, ext3 is more resistant to suriving failures to the
> > +	first problem (powerfail while writing can damage old data on
> > +	a failed write) but fortunately, hard drives generally don't
> > +	cause collateral damage on a failed write.
> 
> Possible rewording of this paragraph:
> 
>   Ext3 handles trash getting written into sectors during powerfail
>   surprisingly well.  It's not foolproof, but it is resilient.  Incomplete
>   journal entries are ignored, and journal replay of complete entries will
>   often "repair" garbage written into the inode table.  The data=journal
>   option extends this behavior to file and directory data blocks as well
>   (without which your dentries can still be badly corrupted by a power fail
>   during a write).
> 
> (I'm not entirely sure about that last bit, but clarifying it one way or the 
> other would be nice because I can't tell from reading it which it is.  My 
> _guess_ is that directories are just treated as files with an attitude and an 
> extra cacheing layer...?)

Thanks, applied, it looks better than what I wrote. I removed the ()
part, as I'm not sure about it...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 21:25                         ` Pavel Machek
@ 2009-08-24 22:05                           ` Ric Wheeler
  2009-08-24 22:22                             ` Zan Lynx
  2009-08-24 22:41                             ` Pavel Machek
  2009-08-24 22:39                           ` Theodore Tso
  1 sibling, 2 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-24 22:05 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

Pavel Machek wrote:
> Hi!
>
>   
>>> I can reproduce data loss with ext3 on flashcard in about 40
>>> seconds. I'd not call that "odd event". It would be nice to handle
>>> that, but that is hard. So ... can we at least get that documented
>>> please?
>>>   
>>>       
>> Part of documenting best practices is to put down very specific things  
>> that do/don't work. What I worry about is producing too much detail to  
>> be of use for real end users.
>>     
>
> Well, I was trying to write for kernel audience. Someone can turn that
> into nice end-user manual.
>   

Kernel people who don't do storage or file systems will still need a 
summary - making very specific proposals based on real data and analysis 
is useful.
>   
>> I have to admit that I have not paid enough attention to this specifics  
>> of your ext3 + flash card issue - is it the ftl stuff doing out of order  
>> IO's? 
>>     
>
> The problem is that flash cards destroy whole erase block on unplug,
> and ext3 can't cope with that.
>
>   

Even if you unmount the file system? Why isn't this an issue with ext2?

Sounds like you want to suggest very specifically that journalled file 
systems are not appropriate for low end flash cards (which seems quite 
reasonable).
>>> _All_ flash cards (MMC, USB, SD) had the problems. You don't need to
>>> get clear grasp on trends. Those cards just don't meet ext3
>>> expectations, and if you pull them, you get data loss.
>>>   
>>>       
>> Pull them even after an unmount, or pull them hot?
>>     
>
> Pull them hot.
>
> [Some people try -osync to avoid data loss on flash cards... that will
> not do the trick. Flashcard will still kill the eraseblock.]
>   

Pulling any device hot will cause loss of recent data; even
with ext2 you will have data in the page cache, right?
>   
>>>> Nothing is perfect. It is still a trade off between storage 
>>>> utilization  (how much storage we give users for say 5 2TB drives), 
>>>> performance and  costs (throw away any disks over 2 years old?).
>>>>     
>>>>         
>>> "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
>>> believe that should be at least documented. (And understand why ZFS is
>>> interesting thing).
>>>   
>>>       
>> Your statement is overly broad - ext3 on a commercial RAID array that  
>> does RAID5 or RAID6, etc has no issues that I know of.
>>     
>
> If your commercial RAID array is battery backed, maybe. But I was
> talking Linux MD here.
>   

Many people in the real world who use RAID5 (for better or worse) use 
external raid cards or raid arrays, so you need to be very specific.
>   
>>> And I still use my zaurus with crappy DRAM.
>>>
>>> I would not trust raid5 array with my data, for multiple
>>> reasons. The fact that degraded raid5 breaks ext3 assumptions should
>>> really be documented.
>>>       
>> Again, you say RAID5 without enough specifics.  Are you pointing just at  
>> MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5 
>> vendor?
>>     
>
> Degraded MD RAID5 on anything, including SATA, and including
> hypothetical "perfect disk".
>   

Degraded is one faulted drive while MD is doing a rebuild? And then you 
hot unplug it or power cycle? I think that would certainly cause failure 
for ext2 as well (again, you would lose any data in the page cache).
>   
>>> The papers show failures in "once a year" range. I have "twice a
>>> minute" failure scenario with flashdisks.
>>>
>>> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
>>> but I bet it would be on "once a day" scale.
>>>
>>> We should document those.
>>>       
>> Documentation is fine with sufficient, hard data....
>>     
>
> Degraded MD RAID5 does not work by design; whole stripe will be
> damaged on powerfail or reset or kernel bug, and ext3 can not cope
> with that kind of damage. [I don't see why statistics should be
> neccessary for that; the same way we don't need statistics to see that
> ext2 needs fsck after powerfail.]
> 									Pavel
>   
What you are describing is a double failure and RAID5 is not double 
failure tolerant regardless of the file system type....

I don't want to be overly negative since getting good documentation is 
certainly very useful. We just need to document things correctly
based on real data.

Ric



^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 22:05                           ` Ric Wheeler
@ 2009-08-24 22:22                             ` Zan Lynx
  2009-08-24 22:44                               ` Pavel Machek
  2009-08-24 23:42                               ` david
  2009-08-24 22:41                             ` Pavel Machek
  1 sibling, 2 replies; 309+ messages in thread
From: Zan Lynx @ 2009-08-24 22:22 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4

Ric Wheeler wrote:
> Pavel Machek wrote:
>> Degraded MD RAID5 does not work by design; whole stripe will be
>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>> with that kind of damage. [I don't see why statistics should be
>> neccessary for that; the same way we don't need statistics to see that
>> ext2 needs fsck after powerfail.]
>>                                     Pavel
>>   
> What you are describing is a double failure and RAID5 is not double 
> failure tolerant regardless of the file system type....

Are you sure he isn't talking about how RAID must write all the data 
chunks to make a complete stripe and if there is a power-loss, some of 
the chunks may be written and some may not?

As I read Pavel's point he is saying that the incomplete write can be 
detected by the incorrect parity chunk, but degraded RAID-5 has no 
working parity chunk so the incomplete write would go undetected.

I know this is a RAID failure mode. However, I actually thought this was
a problem even for an intact RAID-5. AFAIK, RAID-5 does not generally
read the complete stripe and perform verification unless that is 
requested, because doing so would hurt performance and lose the entire 
point of the RAID-5 rotating parity blocks.
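
A one-byte-per-chunk toy model of the detection side of this (not how md
actually does anything, just the XOR arithmetic; the chunk values and the
3 data + 1 parity layout are made up):

#include <stdio.h>

int main(void)
{
	unsigned char d0 = 0xAA, d1 = 0xBB, d2 = 0xCC;	/* data chunks */
	unsigned char p  = d0 ^ d1 ^ d2;	/* parity as written earlier */

	d1 = 0x11;	/* the new data chunk reaches its disk, but power
			 * is lost before the new parity is written */

	/* Intact array: reading the whole stripe back shows the tear. */
	printf("full-stripe check: %s\n",
	       (unsigned char)(d0 ^ d1 ^ d2) == p ? "consistent" : "MISMATCH");

	/* Degraded array: one data chunk is missing, so the parity is
	 * consumed just to reconstruct it; there is no redundancy left
	 * with which to notice the tear. */
	return 0;
}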

-- 
Zan Lynx
zlynx@acm.org

"Knowledge is Power.  Power Corrupts.  Study Hard.  Be Evil."

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 14:55                 ` Artem Bityutskiy
@ 2009-08-24 22:30                   ` Rob Landley
  0 siblings, 0 replies; 309+ messages in thread
From: Rob Landley @ 2009-08-24 22:30 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Theodore Tso, Florian Weimer, Pavel Machek, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

On Monday 24 August 2009 09:55:53 Artem Bityutskiy wrote:
> Probably, Pavel did too good job in generalizing things, and it could be
> better to make a doc about HDD vs SSD or HDD vs Flash-based-storage.
> Not sure. But the idea to document subtle FS assumption is good, IMO.

The standard procedure for this seems to be to cc: Jonathan Corbet on the 
discussion, make puppy eyes at him, and subscribe to Linux Weekly News.

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 21:25                         ` Pavel Machek
  2009-08-24 22:05                           ` Ric Wheeler
@ 2009-08-24 22:39                           ` Theodore Tso
  2009-08-24 23:00                             ` Pavel Machek
                                               ` (3 more replies)
  1 sibling, 4 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-24 22:39 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
> > I have to admit that I have not paid enough attention to this specifics  
> > of your ext3 + flash card issue - is it the ftl stuff doing out of order  
> > IO's? 
> 
> The problem is that flash cards destroy whole erase block on unplug,
> and ext3 can't cope with that.

Sure --- but name **any** filesystem that can deal with the fact that
128k or 256k worth of data might disappear when you pull out the flash
card while it is writing a single sector? 

> > Your statement is overly broad - ext3 on a commercial RAID array that  
> > does RAID5 or RAID6, etc has no issues that I know of.
> 
> If your commercial RAID array is battery backed, maybe. But I was
> talking Linux MD here.

It's not just high end RAID arrays that have battery backups; I happen
to use a mid-range hardware RAID card that comes with a battery
backup.   It's just a matter of choosing your hardware carefully.

If your concern is that with Linux MD, you could potentially lose an
entire stripe in RAID 5 mode, then you should say that explicitly; but
again, this isn't a filesystem-specific claim; it's true for all
filesystems.  I don't know of any file system that can survive having
a RAID stripe-shaped hole blown into the middle of it due to a power
failure.

I'll note, BTW, that AIX uses a journal to protect against these sorts
of problems with software raid; this also means that with AIX, you
also don't have to rebuild a RAID 1 device after an unclean shutdown,
like you have to do with Linux MD.  This was on the EVMS team's
development list to implement for Linux, but it got canned after LVM
won out, lo those many years ago.  C'est la vie; but it's a problem which
is solvable at the RAID layer, and which is traditionally and
historically solved in competent RAID implementations.

							- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 22:05                           ` Ric Wheeler
  2009-08-24 22:22                             ` Zan Lynx
@ 2009-08-24 22:41                             ` Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-24 22:41 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

>>> I have to admit that I have not paid enough attention to this 
>>> specifics  of your ext3 + flash card issue - is it the ftl stuff 
>>> doing out of order  IO's?     
>>
>> The problem is that flash cards destroy whole erase block on unplug,
>> and ext3 can't cope with that.
>
> Even if you unmount the file system? Why isn't this an issue with
> ext2?

No, I'm talking hot unplug here. It is an issue with ext2 too, but ext2
will run fsck on the next mount, making it less severe.


>>> Pull them even after an unmount, or pull them hot?
>>>     
>>
>> Pull them hot.
>>
>> [Some people try -osync to avoid data loss on flash cards... that will
>> not do the trick. Flashcard will still kill the eraseblock.]
>
> Pulling any device hot will cause loss of recent data; even
> with ext2 you will have data in the page cache, right?

Right. But in the ext3 case you basically lose the whole filesystem,
because the fs is inconsistent and you did not run fsck.

>>> Again, you say RAID5 without enough specifics.  Are you pointing just 
>>> at  MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial 
>>> RAID5 vendor?
>>>     
>>
>> Degraded MD RAID5 on anything, including SATA, and including
>> hypothetical "perfect disk".
>
> Degraded is one faulted drive while MD is doing a rebuild? And then you  
> hot unplug it or power cycle? I think that would certainly cause failure  
> for ext2 as well (again, you would lose any data in the page cache).

Losing data in page cache is expected. Losing fs consistency is not.

>> Degraded MD RAID5 does not work by design; whole stripe will be
>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>> with that kind of damage. [I don't see why statistics should be
>> neccessary for that; the same way we don't need statistics to see that
>> ext2 needs fsck after powerfail.]

> What you are describing is a double failure and RAID5 is not double  
> failure tolerant regardless of the file system type....

You get single disk failure then powerfail (or reset or kernel
panic). I would not call that double failure. I agree that it will
mean problems for most filesystems.

Anyway, even if that can be called a double failure, this limitation
should be clearly documented somewhere.

...and that's exactly what I'm trying to fix.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 22:22                             ` Zan Lynx
@ 2009-08-24 22:44                               ` Pavel Machek
  2009-08-25  0:34                                 ` Ric Wheeler
  2009-08-24 23:42                               ` david
  1 sibling, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-24 22:44 UTC (permalink / raw)
  To: Zan Lynx
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4

On Mon 2009-08-24 16:22:22, Zan Lynx wrote:
> Ric Wheeler wrote:
>> Pavel Machek wrote:
>>> Degraded MD RAID5 does not work by design; whole stripe will be
>>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>>> with that kind of damage. [I don't see why statistics should be
>>> neccessary for that; the same way we don't need statistics to see that
>>> ext2 needs fsck after powerfail.]
>>>                                     Pavel
>>>   
>> What you are describing is a double failure and RAID5 is not double  
>> failure tolerant regardless of the file system type....
>
> Are you sure he isn't talking about how RAID must write all the data  
> chunks to make a complete stripe and if there is a power-loss, some of  
> the chunks may be written and some may not?
>
> As I read Pavel's point he is saying that the incomplete write can be  
> detected by the incorrect parity chunk, but degraded RAID-5 has no  
> working parity chunk so the incomplete write would go undetected.

Yep.

> I know this is a RAID failure mode. However, I actually thought this was  
> a problem even for a intact RAID-5. AFAIK, RAID-5 does not generally  
> read the complete stripe and perform verification unless that is  
> requested, because doing so would hurt performance and lose the entire  
> point of the RAID-5 rotating parity blocks.

Not sure; isn't RAID expected to verify the array after an unclean
shutdown?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 22:39                           ` Theodore Tso
@ 2009-08-24 23:00                             ` Pavel Machek
  2009-08-25  0:02                               ` david
                                                 ` (2 more replies)
  2009-08-24 23:00                             ` Pavel Machek
                                               ` (2 subsequent siblings)
  3 siblings, 3 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-24 23:00 UTC (permalink / raw)
  To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4
  Cc: corbet

On Mon 2009-08-24 18:39:15, Theodore Tso wrote:
> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
> > > I have to admit that I have not paid enough attention to this specifics  
> > > of your ext3 + flash card issue - is it the ftl stuff doing out of order  
> > > IO's? 
> > 
> > The problem is that flash cards destroy whole erase block on unplug,
> > and ext3 can't cope with that.
> 
> Sure --- but name **any** filesystem that can deal with the fact that
> 128k or 256k worth of data might disappear when you pull out the flash
> card while it is writing a single sector? 

First... I consider myself quite competent at the OS level, yet I did
not realize what flash does and what that means for data
integrity. That means we need some documentation, or maybe we should
refuse to mount those devices r/w or something.

Then to answer your question... ext2. You expect to run fsck after
unclean shutdown, and you expect to have to solve some problems with
it. So the way ext2 deals with the flash media actually matches what
the user expects. (*)

OTOH in ext3 case you expect consistent filesystem after unplug; and
you don't get that.

> > > Your statement is overly broad - ext3 on a commercial RAID array that  
> > > does RAID5 or RAID6, etc has no issues that I know of.
> > 
> > If your commercial RAID array is battery backed, maybe. But I was
> > talking Linux MD here.
...
> If your concern is that with Linux MD, you could potentially lose an
> entire stripe in RAID 5 mode, then you should say that explicitly; but
> again, this isn't a filesystem specific cliam; it's true for all
> filesystems.  I don't know of any file system that can survive having
> a RAID stripe-shaped-hole blown into the middle of it due to a power
> failure.

Again, ext2 handles that in a way the user expects.

At least I was taught "ext2 needs fsck after powerfail; ext3 can
handle powerfails just ok".

> I'll note, BTW, that AIX uses a journal to protect against these sorts
> of problems with software raid; this also means that with AIX, you
> also don't have to rebuild a RAID 1 device after an unclean shutdown,
> like you have do with Linux MD.  This was on the EVMS's team
> development list to implement for Linux, but it got canned after LVM
> won out, lo those many years ago.  Ce la vie; but it's a problem which
> is solvable at the RAID layer, and which is traditionally and
> historically solved in competent RAID implementations.

Yep, we should add a journal to RAID; or at least write "Linux MD
*needs* a UPS" in big and bold letters. I'm trying to do the second
part.

(Attached is current version of the patch).

[If you'd prefer a patch saying that MMC/USB flash/Linux MD arrays are
generally unsafe to use without UPS/reliable connection/no kernel
bugs... then I may try to push that. I was not sure... maybe some
filesystem _can_ handle this kind of issue?]

								Pavel

(*) Ok, now... the user expects to run fsck, but very advanced users may
not expect old data to be damaged. Certainly I was not an advanced
enough user a few months ago.

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..d1ef4d0
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,57 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so. Not all filesystems require all of these
+to be satisfied for safe operation.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly.
+
+	Fortunately writes failing are very uncommon on traditional 
+	spinning disks, as they have spare sectors they use when write
+	fails.
+
+Don't cause collateral damage on a failed write (NO-COLLATERALS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On some storage systems, failed write (for example due to power
+failure) kills data in adjacent (or maybe unrelated) sectors.
+
+Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
+and are thus unsuitable for all filesystems I know.
+
+	An inherent problem with using flash as a normal block device
+	is that the flash erase size is bigger than most filesystem
+	sector sizes.  So when you request a write, it may erase and
+	rewrite some 64k, 128k, or even a couple megabytes on the
+	really _big_ ones.
+
+	If you lose power in the middle of that, the filesystem won't
+	notice that data in the "sectors" _around_ the one you were
+	trying to write to got trashed.
+
+	MD RAID-4/5/6 in degraded mode has a similar problem; stripes
+	behave similarly to eraseblocks.
+
+
+Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+	Because RAM tends to fail faster than the rest of the system during
+	powerfail, special hw killing DMA transfers may be necessary;
+	otherwise, disks may write garbage during powerfail.
+	This may be quite common on generic PC machines.
+
+	Note that atomic write is very hard to guarantee for MD RAID-4/5/6,
+	because it needs to write both changed data, and parity, to 
+	different disks. (But it will only really show up in degraded mode).
+	UPS for RAID array should help.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 67639f9..ef9ff0f 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they
 have to be 8 character filenames, even then we are fairly close to
 running out of unique filenames.
 
+Requirements
+============
+
+Ext2 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+  (NO-COLLATERALS)
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+  as of 2.6.28. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie.  It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout.  In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem.  This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash.  If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem.  If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
 Check Documentation/filesystems/ext3.txt if you want to read more about
 ext3 and journaling.
 
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 570f9bd..752f4b4 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -199,6 +202,43 @@ debugfs: 	ext2 and ext3 file system debugger.
 ext2online:	online (mounted) ext2 and ext3 filesystem resizer
 
 
+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
+
+  Ext3 handles trash getting written into sectors during powerfail
+  surprisingly well.  It's not foolproof, but it is resilient.
+  Incomplete journal entries are ignored, and journal replay of
+  complete entries will often "repair" garbage written into the inode
+  table.  The data=journal option extends this behavior to file and
+  directory data blocks as well.
+
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+  (NO-COLLATERALS)
+
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+	   (Note that barriers are disabled by default; use the "barrier=1"
+	   mount option after making sure hw can support them.)
+
+	   hdparm -I reports disk features; "Native Command Queueing"
+	   is the feature you are looking for.
+
+
 References
 ==========
 

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 22:22                             ` Zan Lynx
  2009-08-24 22:44                               ` Pavel Machek
@ 2009-08-24 23:42                               ` david
  1 sibling, 0 replies; 309+ messages in thread
From: david @ 2009-08-24 23:42 UTC (permalink / raw)
  To: Zan Lynx
  Cc: Ric Wheeler, Pavel Machek, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4

On Mon, 24 Aug 2009, Zan Lynx wrote:

> Ric Wheeler wrote:
>> Pavel Machek wrote:
>>> Degraded MD RAID5 does not work by design; whole stripe will be
>>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>>> with that kind of damage. [I don't see why statistics should be
>>> neccessary for that; the same way we don't need statistics to see that
>>> ext2 needs fsck after powerfail.]
>>>                                     Pavel
>>> 
>> What you are describing is a double failure and RAID5 is not double failure 
>> tolerant regardless of the file system type....
>
> Are you sure he isn't talking about how RAID must write all the data chunks 
> to make a complete stripe and if there is a power-loss, some of the chunks 
> may be written and some may not?

a write to raid 5 doesn't need to write to all drives, but it does need to
write to two drives (the drive you are modifying and the parity drive).

if you are not degraded and only succeed on one write you will detect the
corruption later when you try to verify the data.

if you are degraded and only succeed on one write, then the entire stripe
gets corrupted.

but this is a double failure (one drive + unclean shutdown)

if you have battery-backed cache you will finish the writes when you 
reboot.

if you don't have battery-backed cache (or are using software raid and
crashed in the middle of sending the writes to the drive) you lose, but
unless you disable write buffers and do sync writes (which nobody is going
to do because of the performance problems) you will lose data in an
unclean shutdown anyway.
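
To illustrate the degraded case with one byte standing in for each chunk
(purely a sketch; the 3+1 layout and the values are invented, and this is
not what md actually does internally):

#include <stdio.h>

int main(void)
{
	/* Stripe: d0..d2 on three data disks, parity p on a fourth.
	 * The disk holding d2 has already failed (degraded array). */
	unsigned char d0 = 0x10, d1 = 0x20, d2 = 0x30;
	unsigned char p  = d0 ^ d1 ^ d2;

	/* The filesystem rewrites only d0; power fails before the
	 * matching parity update reaches the parity disk. */
	d0 = 0x99;

	/* d2 can now only be served by reconstruction from d0, d1 and p,
	 * which yields garbage even though nothing ever wrote near d2: */
	printf("reconstructed d2 = 0x%02x (real data was 0x30)\n",
	       (unsigned char)(d0 ^ d1 ^ p));
	return 0;
}

that is the sense in which a single torn write in degraded mode damages
the whole stripe, including blocks the filesystem never touched.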

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 23:00                             ` Pavel Machek
@ 2009-08-25  0:02                               ` david
  2009-08-25  9:32                                 ` Pavel Machek
  2009-08-25  0:06                               ` Ric Wheeler
  2009-08-25  0:08                               ` Theodore Tso
  2 siblings, 1 reply; 309+ messages in thread
From: david @ 2009-08-25  0:02 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue, 25 Aug 2009, Pavel Machek wrote:

> On Mon 2009-08-24 18:39:15, Theodore Tso wrote:
>> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
>>>> I have to admit that I have not paid enough attention to this specifics
>>>> of your ext3 + flash card issue - is it the ftl stuff doing out of order
>>>> IO's?
>>>
>>> The problem is that flash cards destroy whole erase block on unplug,
>>> and ext3 can't cope with that.
>>
>> Sure --- but name **any** filesystem that can deal with the fact that
>> 128k or 256k worth of data might disappear when you pull out the flash
>> card while it is writing a single sector?
>
> First... I consider myself quite competent in the os level, yet I did
> not realize what flash does and what that means for data
> integrity. That means we need some documentation, or maybe we should
> refuse to mount those devices r/w or something.
>
> Then to answer your question... ext2. You expect to run fsck after
> unclean shutdown, and you expect to have to solve some problems with
> it. So the way ext2 deals with the flash media actually matches what
> the user expects. (*)

you lose data in ext2

> OTOH in ext3 case you expect consistent filesystem after unplug; and
> you don't get that.

the problem is that people have been preaching that journaling filesystems 
eliminate all data loss for no cost (or at worst for minimal cost).

they don't, they never did.

they address one specific problem (metadata inconsistency), but they do
not address data loss, and never did (and for the most part the filesystem
developers never claimed to)

depending on how much data gets lost, you may or may not be able to 
recover enough to continue to use the filesystem, and when your block 
device takes actions in larger chunks than the filesystem asked it to, 
it's very possible for seemingly unrelated data to be lost as well.

this is true for every single filesystem, nothing special about ext3

people somehow have the expectation that ext3 does the data equivalent of 
solving world hunger, it doesn't, it never did, and it never claimed to.

bashing it because it doesn't isn't fair. bashing XFS because it doesn't 
also isn't fair.

personally I don't consider the two filesystems to be significantly 
different in terms of the data loss potential. I think people are more 
aware of the potentials with XFS than with ext3, but I believe that the 
risk of loss is really about the same (and pretty much for the same 
reasons)


>>>> Your statement is overly broad - ext3 on a commercial RAID array that
>>>> does RAID5 or RAID6, etc has no issues that I know of.
>>>
>>> If your commercial RAID array is battery backed, maybe. But I was
>>> talking Linux MD here.
> ...
>> If your concern is that with Linux MD, you could potentially lose an
>> entire stripe in RAID 5 mode, then you should say that explicitly; but
>> again, this isn't a filesystem specific cliam; it's true for all
>> filesystems.  I don't know of any file system that can survive having
>> a RAID stripe-shaped-hole blown into the middle of it due to a power
>> failure.
>
> Again, ext2 handles that in a way user expects it.
>
> At least I was teached "ext2 needs fsck after powerfail; ext3 can
> handle powerfails just ok".

you were taught wrong. the people making these claims for ext3 didn't
understand what ext3 does and doesn't do.

David Lang

>> I'll note, BTW, that AIX uses a journal to protect against these sorts
>> of problems with software raid; this also means that with AIX, you
>> also don't have to rebuild a RAID 1 device after an unclean shutdown,
>> like you have do with Linux MD.  This was on the EVMS's team
>> development list to implement for Linux, but it got canned after LVM
>> won out, lo those many years ago.  Ce la vie; but it's a problem which
>> is solvable at the RAID layer, and which is traditionally and
>> historically solved in competent RAID implementations.
>
> Yep, we should add journal to RAID; or at least write "Linux MD
> *needs* an UPS" in big and bold letters. I'm trying to do the second
> part.
>
> (Attached is current version of the patch).
>
> [If you'd prefer patch saying that MMC/USB flash/Linux MD arrays are
> generaly unsafe to use without UPS/reliable connection/no kernel
> bugs... then I may try to push that. I was not sure... maybe some
> filesystem _can_ handle this kind of issues?]
>
> 								Pavel
>
> (*) Ok, now... user expects to run fsck, but very advanced users may
> not expect old data to be damaged. Certainly I was not advanced enough
> user few months ago.
>
> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..d1ef4d0
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,57 @@
> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so. Not all filesystems require all of these
> +to be satisfied for safe operation.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly.
> +
> +	Fortunately writes failing are very uncommon on traditional
> +	spinning disks, as they have spare sectors they use when write
> +	fails.
> +
> +Don't cause collateral damage on a failed write (NO-COLLATERALS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +On some storage systems, failed write (for example due to power
> +failure) kills data in adjacent (or maybe unrelated) sectors.
> +
> +Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
> +and are thus unsuitable for all filesystems I know.
> +
> +	An inherent problem with using flash as a normal block device
> +	is that the flash erase size is bigger than most filesystem
> +	sector sizes.  So when you request a write, it may erase and
> +	rewrite some 64k, 128k, or even a couple megabytes on the
> +	really _big_ ones.
> +
> +	If you lose power in the middle of that, filesystem won't
> +	notice that data in the "sectors" _around_ the one your were
> +	trying to write to got trashed.
> +
> +	MD RAID-4/5/6 in degraded mode has similar problem, stripes
> +	behave similary to eraseblocks.
> +
> +
> +Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +	Because RAM tends to fail faster than rest of system during
> +	powerfail, special hw killing DMA transfers may be necessary;
> +	otherwise, disks may write garbage during powerfail.
> +	This may be quite common on generic PC machines.
> +
> +	Note that atomic write is very hard to guarantee for MD RAID-4/5/6,
> +	because it needs to write both changed data, and parity, to
> +	different disks. (But it will only really show up in degraded mode).
> +	UPS for RAID array should help.
> +
> +
> +
> diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> index 67639f9..ef9ff0f 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they
> have to be 8 character filenames, even then we are fairly close to
> running out of unique filenames.
>
> +Requirements
> +============
> +
> +Ext2 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* write errors not allowed (NO-WRITE-ERRORS)
> +
> +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
> +
> +and obviously:
> +
> +* don't cause collateral damage to adjacent sectors on a failed write
> +  (NO-COLLATERALS)
> +
> +(see expectations.txt; note that most/all linux block-based
> +filesystems have similar expectations)
> +
> +* write caching is disabled. ext2 does not know how to issue barriers
> +  as of 2.6.28. hdparm -W0 disables it on SATA disks.
> +
> Journaling
> -----------
> -
> -A journaling extension to the ext2 code has been developed by Stephen
> -Tweedie.  It avoids the risks of metadata corruption and the need to
> -wait for e2fsck to complete after a crash, without requiring a change
> -to the on-disk ext2 layout.  In a nutshell, the journal is a regular
> -file which stores whole metadata (and optionally data) blocks that have
> -been modified, prior to writing them into the filesystem.  This means
> -it is possible to add a journal to an existing ext2 filesystem without
> -the need for data conversion.
> -
> -When changes to the filesystem (e.g. a file is renamed) they are stored in
> -a transaction in the journal and can either be complete or incomplete at
> -the time of a crash.  If a transaction is complete at the time of a crash
> -(or in the normal case where the system does not crash), then any blocks
> -in that transaction are guaranteed to represent a valid filesystem state,
> -and are copied into the filesystem.  If a transaction is incomplete at
> -the time of the crash, then there is no guarantee of consistency for
> -the blocks in that transaction so they are discarded (which means any
> -filesystem changes they represent are also lost).
> +==========
> Check Documentation/filesystems/ext3.txt if you want to read more about
> ext3 and journaling.
>
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
> index 570f9bd..752f4b4 100644
> --- a/Documentation/filesystems/ext3.txt
> +++ b/Documentation/filesystems/ext3.txt
> @@ -199,6 +202,43 @@ debugfs: 	ext2 and ext3 file system debugger.
> ext2online:	online (mounted) ext2 and ext3 filesystem resizer
>
>
> +Requirements
> +============
> +
> +Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* write errors not allowed (NO-WRITE-ERRORS)
> +
> +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
> +
> +  Ext3 handles trash getting written into sectors during powerfail
> +  surprisingly well.  It's not foolproof, but it is resilient.
> +  Incomplete journal entries are ignored, and journal replay of
> +  complete entries will often "repair" garbage written into the inode
> +  table.  The data=journal option extends this behavior to file and
> +  directory data blocks as well.
> +
> +
> +and obviously:
> +
> +* don't cause collateral damage to adjacent sectors on a failed write
> +  (NO-COLLATERALS)
> +
> +
> +(see expectations.txt; note that most/all linux block-based
> +filesystems have similar expectations)
> +
> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> +	   (Note that barriers are disabled by default; use the "barrier=1"
> +	   mount option after making sure the hw can support them.)
> +
> +	   hdparm -I reports disk features. If you have "Native
> +	   Command Queueing", that is the feature you are looking for.
> +
> +
> References
> ==========
>
>
>
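
For what it's worth, checking and changing the relevant drive settings
from userspace looks roughly like this (device name and mount point are
only examples):

	# does the drive have a volatile write cache, and is it enabled?
	hdparm -I /dev/sda | grep -i 'write cache'

	# ext2 case: turn the write cache off
	hdparm -W0 /dev/sda

	# ext3 case: keep the cache, but mount with barriers enabled
	mount -o barrier=1 /dev/sda1 /mnt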

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 23:00                             ` Pavel Machek
  2009-08-25  0:02                               ` david
@ 2009-08-25  0:06                               ` Ric Wheeler
  2009-08-25  9:34                                 ` Pavel Machek
  2009-08-25  0:08                               ` Theodore Tso
  2 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-25  0:06 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

Pavel Machek wrote:
> On Mon 2009-08-24 18:39:15, Theodore Tso wrote:
>   
>> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
>>     
>>>> I have to admit that I have not paid enough attention to this specifics  
>>>> of your ext3 + flash card issue - is it the ftl stuff doing out of order  
>>>> IO's? 
>>>>         
>>> The problem is that flash cards destroy whole erase block on unplug,
>>> and ext3 can't cope with that.
>>>       
>> Sure --- but name **any** filesystem that can deal with the fact that
>> 128k or 256k worth of data might disappear when you pull out the flash
>> card while it is writing a single sector? 
>>     
>
> First... I consider myself quite competent at the OS level, yet I did
> not realize what flash does and what that means for data
> integrity. That means we need some documentation, or maybe we should
> refuse to mount those devices r/w or something.
>
> Then to answer your question... ext2. You expect to run fsck after
> unclean shutdown, and you expect to have to solve some problems with
> it. So the way ext2 deals with the flash media actually matches what
> the user expects. (*)
>
> OTOH in ext3 case you expect consistent filesystem after unplug; and
> you don't get that.
>
>   
>>>> Your statement is overly broad - ext3 on a commercial RAID array that  
>>>> does RAID5 or RAID6, etc has no issues that I know of.
>>>>         
>>> If your commercial RAID array is battery backed, maybe. But I was
>>> talking Linux MD here.
>>>       
> ...
>   
>> If your concern is that with Linux MD, you could potentially lose an
>> entire stripe in RAID 5 mode, then you should say that explicitly; but
>> again, this isn't a filesystem-specific claim; it's true for all
>> filesystems.  I don't know of any file system that can survive having
>> a RAID stripe-shaped-hole blown into the middle of it due to a power
>> failure.
>>     
>
> Again, ext2 handles that in a way the user expects.
>
> At least I was taught "ext2 needs fsck after powerfail; ext3 can
> handle powerfails just ok".
>
>   

So, would you be happy if ext3 fsck was always run on reboot (at least 
for flash devices)?

ric

>> I'll note, BTW, that AIX uses a journal to protect against these sorts
>> of problems with software raid; this also means that with AIX, you
>> also don't have to rebuild a RAID 1 device after an unclean shutdown,
>> like you have to do with Linux MD.  This was on the EVMS's team
>> development list to implement for Linux, but it got canned after LVM
>> won out, lo those many years ago.  C'est la vie; but it's a problem which
>> is solvable at the RAID layer, and which is traditionally and
>> historically solved in competent RAID implementations.
>>     
>
> Yep, we should add a journal to RAID; or at least write "Linux MD
> *needs* a UPS" in big and bold letters. I'm trying to do the second
> part.
>
> (Attached is current version of the patch).
>
> [If you'd prefer a patch saying that MMC/USB flash/Linux MD arrays are
> generally unsafe to use without a UPS/reliable connection/no kernel
> bugs... then I may try to push that. I was not sure... maybe some
> filesystem _can_ handle these kinds of issues?]
>
> 								Pavel
>
> (*) Ok, now... the user expects to run fsck, but very advanced users may
> not expect old data to be damaged. Certainly I was not an advanced enough
> user a few months ago.
>
> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..d1ef4d0
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,57 @@
> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so. Not all filesystems require all of these
> +to be satisfied for safe operation.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if the disk returns an error condition
> +during a write, filesystems can't handle that correctly.
> +
> +	Fortunately, failed writes are very uncommon on traditional
> +	spinning disks, as they have spare sectors they use when a
> +	write fails.
> +
> +Don't cause collateral damage on a failed write (NO-COLLATERALS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +On some storage systems, a failed write (for example due to a power
> +failure) kills data in adjacent (or maybe unrelated) sectors.
> +
> +Unfortunately, the cheap USB/SD flash cards I've seen do have this bug,
> +and are thus unsuitable for all filesystems I know.
> +
> +	An inherent problem with using flash as a normal block device
> +	is that the flash erase size is bigger than most filesystem
> +	sector sizes.  So when you request a write, it may erase and
> +	rewrite some 64k, 128k, or even a couple megabytes on the
> +	really _big_ ones.
> +
> +	If you lose power in the middle of that, the filesystem won't
> +	notice that data in the "sectors" _around_ the one you were
> +	trying to write to got trashed.
> +
> +	MD RAID-4/5/6 in degraded mode has a similar problem; stripes
> +	behave similarly to eraseblocks.
> +
> +
> +Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either the whole sector is correctly written or nothing is written
> +during powerfail.
> +
> +	Because RAM tends to fail faster than the rest of the system
> +	during powerfail, special hw killing DMA transfers may be
> +	necessary; otherwise, disks may write garbage during powerfail.
> +	This may be quite common on generic PC machines.
> +
> +	Note that atomic writes are very hard to guarantee for MD RAID-4/5/6,
> +	because the changed data and the parity have to be written to
> +	different disks. (This will only really show up in degraded mode.)
> +	A UPS for the RAID array should help.
> +
> +
> +
> diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> index 67639f9..ef9ff0f 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they
>  have to be 8 character filenames, even then we are fairly close to
>  running out of unique filenames.
>  
> +Requirements
> +============
> +
> +Ext2 expects the disk/storage subsystem to behave sanely. On a sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* write errors not allowed (NO-WRITE-ERRORS)
> +
> +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
> +
> +and obviously:
> +
> +* don't cause collateral damage to adjacent sectors on a failed write
> +  (NO-COLLATERALS)
> +
> +(see expectations.txt; note that most/all linux block-based
> +filesystems have similar expectations)
> +
> +* write caching is disabled. ext2 does not know how to issue barriers
> +  as of 2.6.28. hdparm -W0 disables the write cache on SATA disks.
> +
>  Journaling
> -----------
> -
> -A journaling extension to the ext2 code has been developed by Stephen
> -Tweedie.  It avoids the risks of metadata corruption and the need to
> -wait for e2fsck to complete after a crash, without requiring a change
> -to the on-disk ext2 layout.  In a nutshell, the journal is a regular
> -file which stores whole metadata (and optionally data) blocks that have
> -been modified, prior to writing them into the filesystem.  This means
> -it is possible to add a journal to an existing ext2 filesystem without
> -the need for data conversion.
> -
> -When changes to the filesystem (e.g. a file is renamed) they are stored in
> -a transaction in the journal and can either be complete or incomplete at
> -the time of a crash.  If a transaction is complete at the time of a crash
> -(or in the normal case where the system does not crash), then any blocks
> -in that transaction are guaranteed to represent a valid filesystem state,
> -and are copied into the filesystem.  If a transaction is incomplete at
> -the time of the crash, then there is no guarantee of consistency for
> -the blocks in that transaction so they are discarded (which means any
> -filesystem changes they represent are also lost).
> +==========
>  Check Documentation/filesystems/ext3.txt if you want to read more about
>  ext3 and journaling.
>  
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
> index 570f9bd..752f4b4 100644
> --- a/Documentation/filesystems/ext3.txt
> +++ b/Documentation/filesystems/ext3.txt
> @@ -199,6 +202,43 @@ debugfs: 	ext2 and ext3 file system debugger.
>  ext2online:	online (mounted) ext2 and ext3 filesystem resizer
>  
>  
> +Requirements
> +============
> +
> +Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* write errors not allowed (NO-WRITE-ERRORS)
> +
> +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
> +
> +  Ext3 handles trash getting written into sectors during powerfail
> +  surprisingly well.  It's not foolproof, but it is resilient.
> +  Incomplete journal entries are ignored, and journal replay of
> +  complete entries will often "repair" garbage written into the inode
> +  table.  The data=journal option extends this behavior to file and
> +  directory data blocks as well.
> +
> +
> +and obviously:
> +
> +* don't cause collateral damage to adjacent sectors on a failed write
> +  (NO-COLLATERALS)
> +
> +
> +(see expectations.txt; note that most/all linux block-based
> +filesystems have similar expectations)
> +
> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> +	   (Note that barriers are disabled by default; use the "barrier=1"
> +	   mount option after making sure the hw can support them.)
> +
> +	   hdparm -I reports disk features. If you have "Native
> +	   Command Queueing", that is the feature you are looking for.
> +
> +
>  References
>  ==========
>  
>
>   


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 23:00                             ` Pavel Machek
  2009-08-25  0:02                               ` david
  2009-08-25  0:06                               ` Ric Wheeler
@ 2009-08-25  0:08                               ` Theodore Tso
  2009-08-25  9:42                                 ` Pavel Machek
                                                   ` (3 more replies)
  2 siblings, 4 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-25  0:08 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Tue, Aug 25, 2009 at 01:00:36AM +0200, Pavel Machek wrote:
> Then to answer your question... ext2. You expect to run fsck after
> unclean shutdown, and you expect to have to solve some problems with
> it. So the way ext2 deals with the flash media actually matches what
> the user expects. (*)

But if the 256k hole is in data blocks, fsck won't find a problem,
even with ext2.

And if the 256k hole is in the inode table, you will *still* suffer
massive data loss.  Fsck will tell you how badly screwed you are, but
it doesn't "fix" the disk; most users don't consider questions of the
form "directory entry <precious-thesis-data> points to trashed inode,
may I delete directory entry?" as being terribly helpful.  :-/

> OTOH in ext3 case you expect consistent filesystem after unplug; and
> you don't get that.

You don't get a consistent filesystem with ext2, either.  And if your
claim is that several hundred lines of fsck output detailing the
filesystem's destruction somehow makes things all better, I suspect
most users would disagree with you.

In any case, depending on where the flash was writing at the time of
the unplug, the data corruption could be silent anyway.

Maybe this came as a surprise to you, but anyone who has used a
compact flash in a digital camera knows that you ***have*** to wait
until the led has gone out before trying to eject the flash card.  I
remember seeing all sorts of horror stories from professional
photographers about how they lost an important wedding's day worth of
pictures with the attendant commercial loss, on various digital
photography forums.  It tends to be the sort of mistake that digital
photographers only make once.

(It's worse with people using Digital SLR's shooting in raw mode,
since it can take upwards of 30 seconds or more to write out a 12-30MB
raw image, and if you eject at the wrong time, you can trash the
contents of the entire CF card; in the worst case, the Flash
Translation Layer data can get corrupted, and the card is completely
ruined; you can't even reformat it at the filesystem level, but have
to get a special Windows program from the CF manufacturer to --maybe--
reset the FTL layer.  Early CF cards were especially vulnerable to
this; more recent CF cards are better, but it's a known failure mode
of CF cards.)

						- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 22:44                               ` Pavel Machek
@ 2009-08-25  0:34                                 ` Ric Wheeler
  0 siblings, 0 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-25  0:34 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Zan Lynx, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4

Pavel Machek wrote:
> On Mon 2009-08-24 16:22:22, Zan Lynx wrote:
>   
>> Ric Wheeler wrote:
>>     
>>> Pavel Machek wrote:
>>>       
>>>> Degraded MD RAID5 does not work by design; whole stripe will be
>>>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>>>> with that kind of damage. [I don't see why statistics should be
>>>> necessary for that; the same way we don't need statistics to see that
>>>> ext2 needs fsck after powerfail.]
>>>>                                     Pavel
>>>>   
>>>>         
>>> What you are describing is a double failure and RAID5 is not double  
>>> failure tolerant regardless of the file system type....
>>>       
>> Are you sure he isn't talking about how RAID must write all the data  
>> chunks to make a complete stripe and if there is a power-loss, some of  
>> the chunks may be written and some may not?
>>
>> As I read Pavel's point he is saying that the incomplete write can be  
>> detected by the incorrect parity chunk, but degraded RAID-5 has no  
>> working parity chunk so the incomplete write would go undetected.
>>     
>
> Yep.
>
>   
>> I know this is a RAID failure mode. However, I actually thought this was  
>> a problem even for a intact RAID-5. AFAIK, RAID-5 does not generally  
>> read the complete stripe and perform verification unless that is  
>> requested, because doing so would hurt performance and lose the entire  
>> point of the RAID-5 rotating parity blocks.
>>     
>
> Not sure; is not RAID expected to verify the array after unclean
> shutdown?
>
> 									Pavel
>   
 Not usually - that would take multiple hours of verification, roughly 
equivalent to doing a RAID rebuild since you have to read each sector of 
every drive (although you would do this at full speed if the array was 
offline, not throttled like we do with rebuilds).

That is part of the thing that scrubbing can do.
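
With Linux MD, such a scrub can be kicked off by hand through sysfs,
roughly like this (the md device name is only an example):

	echo check > /sys/block/md0/md/sync_action
	cat /sys/block/md0/md/mismatch_cnt   # non-zero means parity did not match somewhere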

Note that once you find a bad bit of data, it is really useful to be 
able to map that back into a humanly understandable object/repair 
action. For example, map the bad data range back to metadata which would 
translate into a fsck run or a list of impacted files or directories....

Ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25  0:02                               ` david
@ 2009-08-25  9:32                                 ` Pavel Machek
  0 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25  9:32 UTC (permalink / raw)
  To: david
  Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

Hi!

>>> Sure --- but name **any** filesystem that can deal with the fact that
>>> 128k or 256k worth of data might disappear when you pull out the flash
>>> card while it is writing a single sector?
>>
>> First... I consider myself quite competent at the OS level, yet I did
>> not realize what flash does and what that means for data
>> integrity. That means we need some documentation, or maybe we should
>> refuse to mount those devices r/w or something.
>>
>> Then to answer your question... ext2. You expect to run fsck after
>> unclean shutdown, and you expect to have to solve some problems with
>> it. So the way ext2 deals with the flash media actually matches what
>> the user expects. (*)
>
> you lose data in ext2

Yes.

>> OTOH in ext3 case you expect consistent filesystem after unplug; and
>> you don't get that.
>
> the problem is that people have been preaching that journaling 
> filesystems eliminate all data loss for no cost (or at worst for minimal 
> cost).
>
> they don't, they never did.
>
> they address one specific problem (metadata inconsistency), but they do
> not address data loss, and never did (and for the most part the 
> filesystem developers never claimed to)

Well, in the case of a flash card or a degraded MD Raid5, ext3 does _not_
address the metadata inconsistency problem. And that's why I'm trying to
fix the documentation. The current ext3 documentation says:

#Journaling Block Device layer
#-----------------------------
#The Journaling Block Device layer (JBD) isn't ext3 specific.  It was
#designed
#to add journaling capabilities to a block device.  The ext3 filesystem
#code
#will inform the JBD of modifications it is performing (called a
#transaction).
#The journal supports the transactions start and stop, and in case of a
#crash,
#the journal can replay the transactions to quickly put the partition
#back into
#a consistent state.

There's no mention that this does not work on flash cards and degraded
MD Raid5 arrays.
 
> people somehow have the expectation that ext3 does the data equivalent of 
> solving world hunger, it doesn't, it never did, and it never claimed
> to.

It claims so, above.

> personally I don't consider the two filesystems to be significantly  
> different in terms of the data loss potential. I think people are more  
> aware of the potentials with XFS than with ext3, but I believe that the  
> risk of loss is really about the same (and pretty much for the same  
> reasons)

Ack here.

>> Again, ext2 handles that in a way user expects it.
>>
>> At least I was taught "ext2 needs fsck after powerfail; ext3 can
>> handle powerfails just ok".
>
> you were taught wrong. the people making these claims for ext3 didn't
> understand what ext3 does and doesn't do.

Cool. So... can we fix the documentation?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25  0:06                               ` Ric Wheeler
@ 2009-08-25  9:34                                 ` Pavel Machek
  2009-08-25 15:34                                   ` david
  2009-08-26  3:32                                   ` Rik van Riel
  0 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25  9:34 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

Hi!

>>> If your concern is that with Linux MD, you could potentially lose an
>>> entire stripe in RAID 5 mode, then you should say that explicitly; but
>>> again, this isn't a filesystem-specific claim; it's true for all
>>> filesystems.  I don't know of any file system that can survive having
>>> a RAID stripe-shaped-hole blown into the middle of it due to a power
>>> failure.
>>>     
>>
>> Again, ext2 handles that in a way the user expects.
>>
>> At least I was taught "ext2 needs fsck after powerfail; ext3 can
>> handle powerfails just ok".
>
> So, would you be happy if ext3 fsck was always run on reboot (at least  
> for flash devices)?

For flash devices, MD Raid 5 and anything else that needs it; yes that
would make me happy ;-).
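
Something like that can already be approximated by hand with the usual
tune2fs knobs, e.g. (device name is only an example):

	tune2fs -c 1 /dev/sdb1	# force a full fsck after every single mount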

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25  0:08                               ` Theodore Tso
  2009-08-25  9:42                                 ` Pavel Machek
@ 2009-08-25  9:42                                 ` Pavel Machek
  2009-08-25 13:37                                   ` Ric Wheeler
  2009-08-25 16:11                                   ` Theodore Tso
  2009-08-27  3:34                                 ` [patch] ext2/3: document conditions when reliable operation is possible Rob Landley
  2009-08-27  8:46                                 ` David Woodhouse
  3 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25  9:42 UTC (permalink / raw)
  To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Mon 2009-08-24 20:08:42, Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 01:00:36AM +0200, Pavel Machek wrote:
> > Then to answer your question... ext2. You expect to run fsck after
> > unclean shutdown, and you expect to have to solve some problems with
> > it. So the way ext2 deals with the flash media actually matches what
> > the user expects. (*)
> 
> But if the 256k hole is in data blocks, fsck won't find a problem,
> even with ext2.

True.

> And if the 256k hole is in the inode table, you will *still* suffer
> massive data loss.  Fsck will tell you how badly screwed you are, but
> it doesn't "fix" the disk; most users don't consider questions of the
> form "directory entry <precious-thesis-data> points to trashed inode,
> may I delete directory entry?" as being terribly helpful.  :-/

Well it will fix the disk in the end. And no, "directory entry
<precious-thesis-data> points to trashed inode, may I delete directory
entry?" is not _terribly_ helpful, but it is slightly helpful and
people actually expect that from ext2.

> Maybe this came as a surprise to you, but anyone who has used a
> compact flash in a digital camera knows that you ***have*** to wait
> until the led has gone out before trying to eject the flash card.  I
> remember seeing all sorts of horror stories from professional
> photographers about how they lost an important wedding's day worth of
> pictures with the attendant commercial loss, on various digital
> photography forums.  It tends to be the sort of mistake that digital
> photographers only make once.

It actually comes as surprise to me. Actually yes and no. I know that
digital cameras use VFAT, so pulling CF card out of it may do bad
thing, unless I run fsck.vfat afterwards. If digital camera was using
ext3, I'd expect it to be safely pullable at any time.

Will IBM microdrive do any difference there?

Anyway, it was not known to me. Rather than claiming "everyone knows"
(when clearly very few people really understand all the details), can
we simply document that?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25  0:08                               ` Theodore Tso
@ 2009-08-25  9:42                                 ` Pavel Machek
  2009-08-25  9:42                                 ` Pavel Machek
                                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25  9:42 UTC (permalink / raw)
  To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel

On Mon 2009-08-24 20:08:42, Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 01:00:36AM +0200, Pavel Machek wrote:
> > Then to answer your question... ext2. You expect to run fsck after
> > unclean shutdown, and you expect to have to solve some problems with
> > it. So the way ext2 deals with the flash media actually matches what
> > the user expects. (*)
> 
> But if the 256k hole is in data blocks, fsck won't find a problem,
> even with ext2.

True.

> And if the 256k hole is in the inode table, you will *still* suffer
> massive data loss.  Fsck will tell you how badly screwed you are, but
> it doesn't "fix" the disk; most users don't consider questions of the
> form "directory entry <precious-thesis-data> points to trashed inode,
> may I delete directory entry?" as being terribly helpful.  :-/

Well it will fix the disk in the end. And no, "directory entry
<precious-thesis-data> points to trashed inode, may I delete directory
entry?" is not _terribly_ helpful, but it is slightly helpful and
people actually expect that from ext2.

> Maybe this came as a surprise to you, but anyone who has used a
> compact flash in a digital camera knows that you ***have*** to wait
> until the led has gone out before trying to eject the flash card.  I
> remember seeing all sorts of horror stories from professional
> photographers about how they lost an important wedding's day worth of
> pictures with the attendant commercial loss, on various digital
> photography forums.  It tends to be the sort of mistake that digital
> photographers only make once.

It actually comes as surprise to me. Actually yes and no. I know that
digital cameras use VFAT, so pulling CF card out of it may do bad
thing, unless I run fsck.vfat afterwards. If digital camera was using
ext3, I'd expect it to be safely pullable at any time.

Will IBM microdrive do any difference there?

Anyway, it was not known to me. Rather than claiming "everyone knows"
(when clearly very few people really understand all the details), can
we simply document that?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25  9:42                                 ` Pavel Machek
@ 2009-08-25 13:37                                   ` Ric Wheeler
  2009-08-25 13:42                                     ` Alan Cox
  2009-08-25 21:15                                     ` Pavel Machek
  2009-08-25 16:11                                   ` Theodore Tso
  1 sibling, 2 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-25 13:37 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/25/2009 05:42 AM, Pavel Machek wrote:
> On Mon 2009-08-24 20:08:42, Theodore Tso wrote:
>> On Tue, Aug 25, 2009 at 01:00:36AM +0200, Pavel Machek wrote:
>>> Then to answer your question... ext2. You expect to run fsck after
>>> unclean shutdown, and you expect to have to solve some problems with
>>> it. So the way ext2 deals with the flash media actually matches what
>>> the user expects. (*)
>>
>> But if the 256k hole is in data blocks, fsck won't find a problem,
>> even with ext2.
>
> True.
>
>> And if the 256k hole is in the inode table, you will *still* suffer
>> massive data loss.  Fsck will tell you how badly screwed you are, but
>> it doesn't "fix" the disk; most users don't consider questions of the
>> form "directory entry<precious-thesis-data>  points to trashed inode,
>> may I delete directory entry?" as being terribly helpful.  :-/
>
> Well it will fix the disk in the end. And no, "directory entry
> <precious-thesis-data>  points to trashed inode, may I delete directory
> entry?" is not _terribly_ helpful, but it is slightly helpful and
> people actually expect that from ext2.
>
>> Maybe this came as a surprise to you, but anyone who has used a
>> compact flash in a digital camera knows that you ***have*** to wait
>> until the led has gone out before trying to eject the flash card.  I
>> remember seeing all sorts of horror stories from professional
>> photographers about how they lost an important wedding's day worth of
>> pictures with the attendant commercial loss, on various digital
>> photography forums.  It tends to be the sort of mistake that digital
>> photographers only make once.
>
> It actually comes as surprise to me. Actually yes and no. I know that
> digital cameras use VFAT, so pulling CF card out of it may do bad
> thing, unless I run fsck.vfat afterwards. If digital camera was using
> ext3, I'd expect it to be safely pullable at any time.
>
> Will IBM microdrive do any difference there?
>
> Anyway, it was not known to me. Rather than claiming "everyone knows"
> (when clearly very few people really understand all the details), can
> we simply document that?
> 									Pavel

I really think that all OS's (windows, mac, even your ipod) teach you not to
hot unplug a device with any file system. Users have an
"eject" or "safe unload" in windows, your iPod tells you not to power off or 
disconnect, etc.

I don't object to making that general statement - "Don't hot unplug a device 
with an active file system or actively used raw device" - but would object to 
the overly general statement about ext3 not working on flash, RAID5 not working, 
etc...

ric




^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 13:37                                   ` Ric Wheeler
@ 2009-08-25 13:42                                     ` Alan Cox
  2009-08-27  3:16                                       ` Rob Landley
  2009-08-25 21:15                                     ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: Alan Cox @ 2009-08-25 13:42 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue, 25 Aug 2009 09:37:12 -0400
Ric Wheeler <rwheeler@redhat.com> wrote:

> I really think that all OS's (windows, mac, even your ipod) teach you not to
> hot unplug a device with any file system. Users have an
> "eject" or "safe unload" in windows, your iPod tells you not to power off or 
> disconnect, etc.

Agreed

> I don't object to making that general statement - "Don't hot unplug a device 
> with an active file system or actively used raw device" - but would object to 
> the overly general statement about ext3 not working on flash, RAID5 not working, 
> etc...

The overall general statement for all media and all OS's should be

"Do you have a backup, have you tested it recently"


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 22:39                           ` Theodore Tso
  2009-08-24 23:00                             ` Pavel Machek
  2009-08-24 23:00                             ` Pavel Machek
@ 2009-08-25 13:57                             ` Chris Adams
  2009-08-25 22:58                             ` Neil Brown
  3 siblings, 0 replies; 309+ messages in thread
From: Chris Adams @ 2009-08-25 13:57 UTC (permalink / raw)
  To: linux-kernel

Once upon a time, Theodore Tso  <tytso@mit.edu> said:
>I'll note, BTW, that AIX uses a journal to protect against these sorts
>of problems with software raid; this also means that with AIX, you
>also don't have to rebuild a RAID 1 device after an unclean shutdown,
>like you have to do with Linux MD.  This was on the EVMS's team
>development list to implement for Linux, but it got canned after LVM
>won out, lo those many years ago.

See mdadm(8) and look for "--bitmap".  It has a few issues (can't
reshape an array with a bitmap for example; you have to remove the
bitmap, reshape, and re-add the bitmap), but it is available.
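
For example, adding and removing an internal write-intent bitmap on an
existing array looks like this (array name is only an example):

	# add an internal write-intent bitmap to a running array
	mdadm --grow /dev/md0 --bitmap=internal

	# drop it again (e.g. before a reshape), re-add it afterwards
	mdadm --grow /dev/md0 --bitmap=none
	mdadm --grow /dev/md0 --bitmap=internal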
-- 
Chris Adams <cmadams@hiwaay.net>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 13:01               ` Theodore Tso
  2009-08-24 14:55                 ` Artem Bityutskiy
  2009-08-24 19:52                   ` Pavel Machek
@ 2009-08-25 14:43                 ` Florian Weimer
  2 siblings, 0 replies; 309+ messages in thread
From: Florian Weimer @ 2009-08-25 14:43 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Pavel Machek, Goswin von Brederlow, Rob Landley, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4

* Theodore Tso:

> The only one that falls into that category is the one about not being
> able to handle failed writes, and the way most failures take place,

Hmm.  What does "not being able to handle failed writes" actually
mean?  AFAICS, there are two possible answers: "all bets are off", or
"we'll tell you about the problem, and all bets are off".

>> Isn't this by design?  In other words, if the metadata doesn't survive
>> non-atomic writes, wouldn't it be an ext3 bug?
>
> Part of the problem here is that "atomic-writes" is confusing; it
> doesn't mean what many people think it means.  The assumption which
> many naive filesystem designers make is that writes succeed or they
> don't.  If they don't succeed, they don't change the previously
> existing data in any way.  

Right.  And a lot of database systems make the same assumption.
Oracle Berkeley DB cannot deal with partial page writes at all, and
PostgreSQL assumes that it's safe to flip a few bits in a sector
without proper WAL (it doesn't care if the changes actually hit the
disk, but the write shouldn't make the sector unreadable or put random
bytes there).

> Is that a file system "bug"?  Well, it's better to call that a
> mismatch between the assumptions made of physical devices, and of the
> file system code.  On Irix, SGI hardware had a powerfail interrupt,
> and the power supply and extra-big capacitors, so that when a power
> fail interrupt came in, the Irix would run around frantically shutting
> down pending DMA transfers to prevent this failure mode from causing
> problems.  PC class hardware (according to Ted's law), is cr*p, and
> doesn't have a powerfail interrupt, so it's not something that we
> have.

The DMA transaction should fail due to ECC errors, though.

> Ext3, ext4, and ocfs2 does physical block journalling, so as long as
> journal truncate hasn't taken place right before the failure, the
> replay of the physical block journal tends to repair this most (but
> not necessarily all) cases of "garbage is written right before power
> failure".  People who care about this should really use a UPS, and
> wire up the USB and/or serial cable from the UPS to the system, so
> that the OS can do a controlled shutdown if the UPS is close to
> shutting down due to an extended power failure.

I think the general idea is to protect valuable data with WAL.  You
overwrite pages on disk only after you've made a backup copy into WAL.
After a power loss event, you replay the log and overwrite all garbage
that might be there.  For the WAL, you rely on checksum and sequence
numbers.  This still doesn't help against write failures where the
system continues running (because the fsync() during checkpointing
isn't guaranteed to report errors), but it should deal with the power
failure case.  But this assumes that the file system protects its own
data structure in a similar way.  Is this really too much to demand?

Partial failures are extremely difficult to deal with because of their
asynchronous nature.  I've come to accept that, but it's still
disappointing.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25  9:34                                 ` Pavel Machek
@ 2009-08-25 15:34                                   ` david
  2009-08-26  3:32                                   ` Rik van Riel
  1 sibling, 0 replies; 309+ messages in thread
From: david @ 2009-08-25 15:34 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue, 25 Aug 2009, Pavel Machek wrote:

> Hi!
>
>>>> If your concern is that with Linux MD, you could potentially lose an
>>>> entire stripe in RAID 5 mode, then you should say that explicitly; but
>>>> again, this isn't a filesystem-specific claim; it's true for all
>>>> filesystems.  I don't know of any file system that can survive having
>>>> a RAID stripe-shaped-hole blown into the middle of it due to a power
>>>> failure.
>>>>
>>>
>>> Again, ext2 handles that in a way the user expects.
>>>
>>> At least I was taught "ext2 needs fsck after powerfail; ext3 can
>>> handle powerfails just ok".
>>
>> So, would you be happy if ext3 fsck was always run on reboot (at least
>> for flash devices)?
>
> For flash devices, MD Raid 5 and anything else that needs it; yes that
> would make me happy ;-).

the thing is that fsck would not fix the problem.

it may (if the data lost was metadata) detect the problem and tell you how 
many files you have lost, but if the data lost was all in a data file you 
would not detect it with a fsck

the only way you would detect the missing data is to read all the files on 
the filesystem and detect that the data you are reading is wrong.

but how can you tell if the data you are reading is wrong?

on a flash drive, your read can return garbage, but how do you know that 
garbage isn't the contents of the file?

on a degraded raid5 array you have no way to test data integrity, so when 
the missing drive is replaced, the rebuild algorithm will calculate the 
appropriate data to make the parity calculations work out and write 
garbage to that drive.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25  9:42                                 ` Pavel Machek
  2009-08-25 13:37                                   ` Ric Wheeler
@ 2009-08-25 16:11                                   ` Theodore Tso
  2009-08-25 22:21                                       ` Pavel Machek
                                                       ` (2 more replies)
  1 sibling, 3 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-25 16:11 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

It seems that you are really hung up on whether or not the filesystem
metadata is consistent after a power failure, when I'd argue that
users of storage devices that don't have good powerfail
properties have much bigger problems (such as the potential for silent
data corruption, or even if fsck will fix a trashed inode table with
ext2, massive data loss).  So instead of your suggested patch, it
might be better simply to have a file in Documentation/filesystems
that states something along the lines of:

"There are storage devices that high highly undesirable properties
when they are disconnected or suffer power failures while writes are
in progress; such devices include flash devices and software RAID 5/6
arrays without journals, as well as hardware RAID 5/6 devices without
battery backups.  These devices have the property of potentially
corrupting blocks being written at the time of the power failure, and
worse yet, amplifying the region where blocks are corrupted such that
adjacent sectors are also damaged during the power failure.

Users who use such storage devices are well advised to take
countermeasures, such as the use of Uninterruptible Power Supplies,
and making sure the flash device is not hot-unplugged while the device
is being used.  Regular backups when using these devices is also a
Very Good Idea.

Otherwise, file systems placed on these devices can suffer silent data
and file system corruption.  A forced use of fsck may detect metadata
corruption resulting in file system corruption, but will not suffice
to detect data corruption."

My big complaint is that you seem to think that ext3 somehow let you
down, but I'd argue that the real issue is that the storage device let
you down.  Any journaling filesystem will have the properties that you
seem to be complaining about, so the fact that your patch only
documents this as assumptions made by ext2 and ext3 is unfair; it also
applies to xfs, jfs, reiserfs, reiser4, etc.  Furthermore, most users
are even more concerned about possibility of massive data loss and/or
silent data corruption.  So if your complaint is that we don't have
documentation warning users about the potential pitfalls of using
storage devices with undesirable power fail properties, let's document
that as a shortcoming in those storage devices.

						- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 21:33               ` Pavel Machek
@ 2009-08-25 18:45                 ` Jan Kara
  0 siblings, 0 replies; 309+ messages in thread
From: Jan Kara @ 2009-08-25 18:45 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, jack, Goswin von Brederlow, kernel list,
	Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc,
	linux-ext4

On Mon 24-08-09 23:33:12, Pavel Machek wrote:
> On Mon 2009-08-24 16:11:08, Rob Landley wrote:
> > On Monday 24 August 2009 04:31:43 Pavel Machek wrote:
> > > Running journaling filesystem such as ext3 over flashdisk or degraded
> > > RAID array is a bad idea: journaling guarantees no longer apply and
> > > you will get data corruption on powerfail.
> > >
> > > We can't solve it easily, but we should certainly warn the users. I
> > > actually lost data because I did not understand these limitations...
> > >
> > > Signed-off-by: Pavel Machek <pavel@ucw.cz>
> > 
> > Acked-by: Rob Landley <rob@landley.net>
> > 
> > With a couple comments:
> > 
> > > +* write caching is disabled. ext2 does not know how to issue barriers
> > > +  as of 2.6.28. hdparm -W0 disables it on SATA disks.
> > 
> > It's coming up on 2.6.31, has it learned anything since or should that version 
> > number be bumped?
> 
> Jan, did those "barrier for ext2" patches get merged? 
  No, they did not. We were discussing how to be able to enable / disable
sending barriers; someone said he'd implement it but it somehow never got
beyond an initial attempt.
  Actually, after recent sync cleanups (and when my O_SYNC cleanups get
merged) it should be pretty easy because every filesystem now has ->fsync()
and ->sync_fs() callbacks, so we just have to add sending barriers to these
two functions and implement the possibility to set via sysfs that barriers on the
block device should be ignored.
  I've put it on my todo list but if someone else has time for this, I
certainly would not mind :). It would be a nice beginner project...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 20:24                   ` Ric Wheeler
  2009-08-24 20:52                     ` Pavel Machek
@ 2009-08-25 18:52                     ` Rob Landley
  1 sibling, 0 replies; 309+ messages in thread
From: Rob Landley @ 2009-08-25 18:52 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4

On Monday 24 August 2009 15:24:28 Ric Wheeler wrote:
> Pavel Machek wrote:

> > Actually, ext2 should be able to survive that, no? Error writing ->
> > remount ro -> fsck on next boot -> drive relocates the sectors.
>
> I think that the example and the response are both off base. If your
> head ever touches the platter, you won't be reading from a huge part of
> your drive ever again

It's not quite that simple anymore.

These days, most modern drives add an "overcoat", which is a vapor deposition 
layer of carbon (I.E. diamond) on top of the magnetic media, and then add a 
nanolayer of some kind of nonmagnetic lubricant on top of that.  That protects 
the magnetic layer from physical contact with the head; it takes a pretty 
solid whack to chip through diamond and actually gouge your disk:

  http://www.datarecoverylink.com/understanding_magnetic_media.html

You can also do fun things with various nitrides (carbon nitride, silicon
nitride, titanium nitride) which are pretty darn tough too, although I dunno
about their suitability for hard drives:

  http://www.physical-vapor-deposition.com/

So while it _is_ possible to whack your drive and scratch the platter, merely 
"touching" won't do it.  (Laptops wouldn't be feasible if they couldn't cope 
with a little jostling while running.)  In the case of repeated small whacks, 
your heads may actually go first.  (I vaguely recall the little aerofoil wing 
thingy holding up the disk touches first, and can get ground down by repeated 
contact with the diamond layer (despite the lubricant, that just buys time) so 
it gets shorter and shorter and can't reliably keep the head above the disk 
rather than in contact with it.  But I'm kind of stale myself here, not sure 
that's still current.)

Here's a nice youtube video of a 2007 defcon talk from a hard drive recovery 
professional, "What's that Clicking Noise", series starts here:
  http://www.youtube.com/watch?v=vCapEFNZAJ0

And here's that guy's web page:
  http://www.myharddrivedied.com/presentations/index.html

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 21:11                         ` Greg Freemyer
  (?)
@ 2009-08-25 20:56                         ` Rob Landley
  2009-08-25 21:08                           ` david
  -1 siblings, 1 reply; 309+ messages in thread
From: Rob Landley @ 2009-08-25 20:56 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Pavel Machek, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4

On Monday 24 August 2009 16:11:56 Greg Freemyer wrote:
> > The papers show failures in "once a year" range. I have "twice a
> > minute" failure scenario with flashdisks.
> >
> > Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
> > but I bet it would be on "once a day" scale.
>
> I agree it should be documented, but the ext3 atomicity issue is only
> an issue on unexpected shutdown while the array is degraded.  I surely
> hope most people running raid5 are not seeing that level of unexpected
> shutdown, let alone in a degraded array,
>
> If they are, the atomicity issue pretty strongly says they should not
> be using raid5 in that environment.  At least not for any filesystem I
> know.  Having writes to LBA n corrupt LBA n+128 as an example is
> pretty hard to design around from a fs perspective.

Right now, people think that a degraded raid 5 is equivalent to raid 0.  As 
this thread demonstrates, in the power failure case it's _worse_, due to write 
granularity being larger than the filesystem sector size.  (Just like flash.)

Knowing that, some people might choose to suspend writes to their raid until 
it's finished recovery.  Perhaps they'll set up a system where a degraded raid 
5 gets remounted read only until recovery completes, and then writes go to a 
new blank hot spare disk using all that volume snapshotting or unionfs stuff
people have been working on.  (The big boys already have hot spare disks 
standing by on a lot of these systems, ready to power up and go without human 
intervention.  Needing two for actual reliability isn't that big a deal.)
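
A rough approximation of the read-only half of that is already possible with
an mdadm monitor hook; everything below (paths, array and mount point names)
is just an illustration:

	# /etc/mdadm.conf:
	#   PROGRAM /usr/local/sbin/md-event
	# mdadm --monitor calls the program as: md-event <event> <md-device> [component]
	case "$1" in
	DegradedArray|Fail)
		mount -o remount,ro /mnt/raid
		;;
	esac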

Or maybe the raid guys might want to tweak the recovery logic so it's not 
entirely linear, but instead prioritizes dirty pages over clean ones.  So if 
somebody dirties a page halfway through a degraded raid 5, skip ahead to 
recover that chunk to the new disk first (yes leaving holes, it's not that
hard to track), and _then_ let the write go through.

But unless people know the issue exists, they won't even start thinking about 
ways to address it. 

> Greg

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 20:56                         ` Rob Landley
@ 2009-08-25 21:08                           ` david
  0 siblings, 0 replies; 309+ messages in thread
From: david @ 2009-08-25 21:08 UTC (permalink / raw)
  To: Rob Landley
  Cc: Greg Freemyer, Pavel Machek, Ric Wheeler, Theodore Tso,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4

On Tue, 25 Aug 2009, Rob Landley wrote:

> On Monday 24 August 2009 16:11:56 Greg Freemyer wrote:
>>> The papers show failures in "once a year" range. I have "twice a
>>> minute" failure scenario with flashdisks.
>>>
>>> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
>>> but I bet it would be on "once a day" scale.
>>
>> I agree it should be documented, but the ext3 atomicity issue is only
>> an issue on unexpected shutdown while the array is degraded.  I surely
>> hope most people running raid5 are not seeing that level of unexpected
>> shutdown, let alone in a degraded array,
>>
>> If they are, the atomicity issue pretty strongly says they should not
>> be using raid5 in that environment.  At least not for any filesystem I
>> know.  Having writes to LBA n corrupt LBA n+128 as an example is
>> pretty hard to design around from a fs perspective.
>
> Right now, people think that a degraded raid 5 is equivalent to raid 0.  As
> this thread demonstrates, in the power failure case it's _worse_, due to write
> granularity being larger than the filesystem sector size.  (Just like flash.)
>
> Knowing that, some people might choose to suspend writes to their raid until
> it's finished recovery.  Perhaps they'll set up a system where a degraded raid
> 5 gets remounted read only until recovery completes, and then writes go to a
> new blank hot spare disk using all that volume snapshotting or unionfs stuff
> people have been working on.  (The big boys already have hot spare disks
> standing by on a lot of these systems, ready to power up and go without human
> intervention.  Needing two for actual reliability isn't that big a deal.)
>
> Or maybe the raid guys might want to tweak the recovery logic so it's not
> entirely linear, but instead prioritizes dirty pages over clean ones.  So if
> somebody dirties a page halfway through a degraded raid 5, skip ahead to
> recover that chunk to the new disk first (yes leaving holes, it's not that
> hard to track), and _then_ let the write go through.
>
> But unless people know the issue exists, they won't even start thinking about
> ways to address it.

if you've got the drives available you should be running raid 6 not raid 5 
so that you have to lose two drives before you lose your error checking.

in my opinion that's a far better use of a drive than a hot spare.
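
for example, with four disks (device names are just placeholders):

	mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[bcde]1

that gives the same usable space as a three-disk raid 5 plus a hot spare,
but the array keeps its parity check even after a single drive failure.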

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 13:37                                   ` Ric Wheeler
  2009-08-25 13:42                                     ` Alan Cox
@ 2009-08-25 21:15                                     ` Pavel Machek
  2009-08-25 22:42                                       ` Ric Wheeler
  2009-08-25 23:08                                       ` Neil Brown
  1 sibling, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25 21:15 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet


>>> Maybe this came as a surprise to you, but anyone who has used a
>>> compact flash in a digital camera knows that you ***have*** to wait
>>> until the led has gone out before trying to eject the flash card.  I
>>> remember seeing all sorts of horror stories from professional
>>> photographers about how they lost an important wedding's day worth of
>>> pictures with the attendant commercial loss, on various digital
>>> photography forums.  It tends to be the sort of mistake that digital
>>> photographers only make once.
>>
>> It actually comes as a surprise to me. Actually yes and no. I know that
>> digital cameras use VFAT, so pulling CF card out of it may do bad
>> thing, unless I run fsck.vfat afterwards. If digital camera was using
>> ext3, I'd expect it to be safely pullable at any time.
>>
>> Will an IBM microdrive make any difference there?
>>
>> Anyway, it was not known to me. Rather than claiming "everyone knows"
>> (when clearly very few people really understand all the details), can
>> we simply document that?
>
> I really think that the expectation that all OS's (windows, mac, even 
> your ipod) all teach you not to hot unplug a device with any file system. 
> Users have an "eject" or "safe unload" in windows, your iPod tells you 
> not to power off or disconnect, etc.

That was before journaling filesystems...

> I don't object to making that general statement - "Don't hot unplug a 
> device with an active file system or actively used raw device" - but 
> would object to the overly general statement about ext3 not working on 
> flash, RAID5 not working, etc...

You can object any way you want, but running ext3 on flash or MD RAID5
is stupid:

* ext2 would be faster

* ext2 would provide better protection against powerfail.

"ext3 works on flash and MD RAID5, as long as you do not have
powerfail" seems to be the accurate statement, and if you don't need
to protect against powerfails, you can just use ext2.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* [patch] document flash/RAID dangers
  2009-08-25 16:11                                   ` Theodore Tso
@ 2009-08-25 22:21                                       ` Pavel Machek
  2009-08-25 22:27                                     ` [patch] document that ext2 can't handle barriers Pavel Machek
  2009-08-25 22:27                                     ` Pavel Machek
  2 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25 22:21 UTC (permalink / raw)
  To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

Hi!

> It seems that you are really hung up on whether or not the filesystem
> metadata is consistent after a power failure, when I'd argue that the
> problem with using storage devices that don't have good powerfail
> properties have much bigger problems (such as the potential for silent
> data corruption, or even if fsck will fix a trashed inode table with
> ext2, massive data loss).  So instead of your suggested patch, it
> might be better simply to have a file in Documentation/filesystems
> that states something along the lines of:
> 
> "There are storage devices that high highly undesirable properties
> when they are disconnected or suffer power failures while writes are
> in progress; such devices include flash devices and software RAID 5/6
> arrays without journals, as well as hardware RAID 5/6 devices without
> battery backups.  These devices have the property of potentially
> corrupting blocks being written at the time of the power failure, and
> worse yet, amplifying the region where blocks are corrupted such that
> adjacent sectors are also damaged during the power failure.

In the FTL case, damaged sectors are not necessarily adjacent. Otherwise
this looks okay and fair to me.

> Users who use such storage devices are well advised to take
> countermeasures, such as the use of Uninterruptible Power Supplies,
> and making sure the flash device is not hot-unplugged while the device
> is being used.  Regular backups when using these devices is also a
> Very Good Idea.
> 
> Otherwise, file systems placed on these devices can suffer silent data
> and file system corruption.  A forced use of fsck may detect metadata
> corruption resulting in file system corruption, but will not suffice
> to detect data corruption."

Ok, would you be against adding:

"Running non-journalled filesystem on these may be desirable, as
journalling can not provide meaningful protection, anyway."

> My big complaint is that you seem to think that ext3 some how let you
> down, but I'd argue that the real issue is that the storage device let
> you down.  Any journaling filesystem will have the properties that you
> seem to be complaining about, so the fact that your patch only
> documents this as assumptions made by ext2 and ext3 is unfair; it also
> applies to xfs, jfs, reiserfs, reiser4, etc.  Further more, most
> users

Yes, it applies to all journalling filesystems; it is just that I was 
clever/paranoid enough to avoid anything non-ext3.

ext3 docs still say:
# The journal supports the transactions start and stop, and in case of a
# crash, the journal can replay the transactions to quickly put the
# partition back into a consistent state.

> are even more concerned about possibility of massive data loss and/or
> silent data corruption.  So if your complaint that we don't have
> documentation warning users about the potential pitfalls of using
> storage devices with undesirable power fail properties, let's document
> that as a shortcoming in those storage devices.

Ok, works for me.

---

From: Theodore Tso <tytso@mit.edu>

Document that many devices are too broken for filesystems to protect
data in case of powerfail.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..e1a46dd
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,19 @@
+There are storage devices that have highly undesirable properties
+when they are disconnected or suffer power failures while writes are
+in progress; such devices include flash devices and software RAID 5/6
+arrays without journals, as well as hardware RAID 5/6 devices without
+battery backups.  These devices have the property of potentially
+corrupting blocks being written at the time of the power failure, and
+worse yet, amplifying the region where blocks are corrupted such that
+additional sectors are also damaged during the power failure.
+        
+Users who use such storage devices are well advised to take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used.  Regular backups when using these devices is also a
+Very Good Idea.
+        
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption.  A forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
\ No newline at end of file

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related	[flat|nested] 309+ messages in thread

* [patch] document that ext2 can't handle barriers
  2009-08-25 16:11                                   ` Theodore Tso
  2009-08-25 22:21                                       ` Pavel Machek
@ 2009-08-25 22:27                                     ` Pavel Machek
  2009-08-25 22:27                                     ` Pavel Machek
  2 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25 22:27 UTC (permalink / raw)
  To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

Document things ext2 expects from the storage subsystem, and the fact
that it cannot handle barriers. Also remove the journaling description, as
that's really ext3 material.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 67639f9..e300ca8 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,17 @@ enough 4-character names to make up unique directory entries, so they
 have to be 8 character filenames, even then we are fairly close to
 running out of unique filenames.
 
+Requirements
+============
+
+Ext2 expects disk/storage subsystem not to return write errors.
+
+It also needs write caching to be disabled for reliable fsync
+operation; ext2 does not know how to issue barriers as of
+2.6.31. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie.  It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout.  In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem.  This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash.  If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem.  If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
 Check Documentation/filesystems/ext3.txt if you want to read more about
 ext3 and journaling.
 

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-25 22:21                                       ` Pavel Machek
  (?)
@ 2009-08-25 22:33                                       ` david
  2009-08-25 22:40                                         ` Pavel Machek
  -1 siblings, 1 reply; 309+ messages in thread
From: david @ 2009-08-25 22:33 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Pavel Machek wrote:

>> It seems that you are really hung up on whether or not the filesystem
>> metadata is consistent after a power failure, when I'd argue that the
>> problem with using storage devices that don't have good powerfail
>> properties have much bigger problems (such as the potential for silent
>> data corruption, or even if fsck will fix a trashed inode table with
>> ext2, massive data loss).  So instead of your suggested patch, it
>> might be better simply to have a file in Documentation/filesystems
>> that states something along the lines of:
>>
>> "There are storage devices that high highly undesirable properties
>> when they are disconnected or suffer power failures while writes are
>> in progress; such devices include flash devices and software RAID 5/6
>> arrays without journals,

is it under all conditions, or only when you have already lost redundancy?

prior discussions make me think this was only if the redundancy is already 
lost.

also, the talk about software RAID 5/6 arrays without journals will be 
confusing (after all, if you are using ext3/XFS/etc you are using a 
journal, aren't you?)

you then go on to talk about hardware raid 5/6 without battery backup. I
think that you are being too specific here. any array without battery
backup can lead to 'interesting' situations when you lose power.

in addition, even with a single drive you will lose some data on power
loss (unless you do sync mounts with disabled write caches), full data 
journaling can help protect you from this, but the default journaling just 
protects the metadata.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-25 22:33                                       ` david
@ 2009-08-25 22:40                                         ` Pavel Machek
  2009-08-25 22:59                                           ` david
  2009-08-26  4:20                                           ` Rik van Riel
  0 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25 22:40 UTC (permalink / raw)
  To: david
  Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue 2009-08-25 15:33:08, david@lang.hm wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>>> It seems that you are really hung up on whether or not the filesystem
>>> metadata is consistent after a power failure, when I'd argue that the
>>> problem with using storage devices that don't have good powerfail
>>> properties have much bigger problems (such as the potential for silent
>>> data corruption, or even if fsck will fix a trashed inode table with
>>> ext2, massive data loss).  So instead of your suggested patch, it
>>> might be better simply to have a file in Documentation/filesystems
>>> that states something along the lines of:
>>>
>>> "There are storage devices that high highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and software RAID 5/6
>>> arrays without journals,
>
> is it under all conditions, or only when you have already lost redundancy?

I'd prefer not to specify.

> prior discussions make me think this was only if the redundancy is 
> already lost.

I'm not so sure now.

Lets say you are writing to the (healthy) RAID5 and have a powerfail.

So now data blocks do not correspond to the parity block. You don't
yet have the corruption, but you already have a problem.

If you get a disk failing at this point, you'll get corruption.
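
To make this concrete, here is a toy sketch (Python, made-up block values,
not md code) of that sequence: the parity goes stale during the powerfail,
and an unrelated block is corrupted once a disk is lost afterwards:

from functools import reduce
from operator import xor

def parity(blocks):
    return reduce(xor, blocks)

# Healthy 4-disk RAID5 stripe: three data blocks plus XOR parity.
data = [0xAA, 0xBB, 0xCC]
p = parity(data)                     # stripe is consistent

# Rewrite data[0]; power fails after the data write but before the
# parity write, so parity still describes the old stripe contents.
data[0] = 0x11
stale_p = p

# Nothing looks wrong yet: data[1] and data[2] still read back fine.
assert data[1] == 0xBB and data[2] == 0xCC

# Now the disk holding data[2] dies.  Reconstruction uses the stale
# parity and returns garbage for a block that was never being written.
reconstructed = stale_p ^ data[0] ^ data[1]
print(hex(reconstructed))            # not 0xCC: previously-safe data is gone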

> also, the talk about software RAID 5/6 arrays without journals will be  
> confusing (after all, if you are using ext3/XFS/etc you are using a  
> journal, aren't you?)

Slightly confusing, yes. Should I just say "MD RAID 5" and avoid
talking about hardware RAID arrays, where that's really
manufacturer-specific?

> in addition, even with a single drive you will lose some data on power
> loss (unless you do sync mounts with disabled write caches), full data  
> journaling can help protect you from this, but the default journaling 
> just protects the metadata.

"Data loss" here means "damaging data that were already fsynced". That
will not happen on single disk (with barriers on etc), but will happen
on RAID5 and flash.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 21:15                                     ` Pavel Machek
@ 2009-08-25 22:42                                       ` Ric Wheeler
  2009-08-25 22:51                                         ` Pavel Machek
  2009-08-25 23:08                                       ` Neil Brown
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-25 22:42 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/25/2009 05:15 PM, Pavel Machek wrote:
>
>>>> Maybe this came as a surprise to you, but anyone who has used a
>>>> compact flash in a digital camera knows that you ***have*** to wait
>>>> until the led has gone out before trying to eject the flash card.  I
>>>> remember seeing all sorts of horror stories from professional
>>>> photographers about how they lost an important wedding's day worth of
>>>> pictures with the attendant commercial loss, on various digital
>>>> photography forums.  It tends to be the sort of mistake that digital
>>>> photographers only make once.
>>>
>>> It actually comes as surprise to me. Actually yes and no. I know that
>>> digital cameras use VFAT, so pulling CF card out of it may do bad
>>> thing, unless I run fsck.vfat afterwards. If digital camera was using
>>> ext3, I'd expect it to be safely pullable at any time.
>>>
>>> Will IBM microdrive do any difference there?
>>>
>>> Anyway, it was not known to me. Rather than claiming "everyone knows"
>>> (when clearly very few people really understand all the details), can
>>> we simply document that?
>>
>> I really think that the expectation that all OS's (windows, mac, even
>> your ipod) all teach you not to hot unplug a device with any file system.
>> Users have an "eject" or "safe unload" in windows, your iPod tells you
>> not to power off or disconnect, etc.
>
> That was before journaling filesystems...

Not true - that is true today with or without journals as we have discussed in 
great detail. Including specifically ext2.

Basically, any file system (Linux, windows, OSX, etc) that writes into the page 
cache will lose data when you hot unplug its storage. End of story, don't do it!


>
>> I don't object to making that general statement - "Don't hot unplug a
>> device with an active file system or actively used raw device" - but
>> would object to the overly general statement about ext3 not working on
>> flash, RAID5 not working, etc...
>
> You can object any way you want, but running ext3 on flash or MD RAID5
> is stupid:
>
> * ext2 would be faster
>
> * ext2 would provide better protection against powerfail.

Not true in the slightest, you continue to ignore the ext2/3/4 developers 
telling you that it will lose data.

>
> "ext3 works on flash and MD RAID5, as long as you do not have
> powerfail" seems to be the accurate statement, and if you don't need
> to protect against powerfails, you can just use ext2.
> 								Pavel

Strange how your personal preference is totally out of sync with the entire 
enterprise class user base.

ric



^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 22:42                                       ` Ric Wheeler
@ 2009-08-25 22:51                                         ` Pavel Machek
  2009-08-25 23:03                                           ` david
  2009-08-25 23:03                                           ` Ric Wheeler
  0 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25 22:51 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet



>>> I really think that the expectation that all OS's (windows, mac, even
>>> your ipod) all teach you not to hot unplug a device with any file system.
>>> Users have an "eject" or "safe unload" in windows, your iPod tells you
>>> not to power off or disconnect, etc.
>>
>> That was before journaling filesystems...
>
> Not true - that is true today with or without journals as we have 
> discussed in great detail. Including specifically ext2.
>
> Basically, any file system (Linux, windows, OSX, etc) that writes into 
> the page cache will lose data when you hot unplug its storage. End of 
> story, don't do it!

No, not ext3 on SATA disk with barriers on and proper use of
fsync(). I actually tested that.

Yes, I should be able to hotunplug SATA drives and expect the data
that was fsync-ed to be there.
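
(For reference, the "proper use of fsync()" I mean is roughly the pattern
below -- just a sketch, with a made-up filename; durability still depends on
the drive honouring cache flushes or having its write cache disabled.)

import os

def durable_write(path, payload):
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)              # data must reach stable storage before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)          # atomically replace the old contents
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)           # make the rename itself durable
    finally:
        os.close(dirfd)

durable_write("important.dat", b"state that must survive power loss\n")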

>>> I don't object to making that general statement - "Don't hot unplug a
>>> device with an active file system or actively used raw device" - but
>>> would object to the overly general statement about ext3 not working on
>>> flash, RAID5 not working, etc...
>>
>> You can object any way you want, but running ext3 on flash or MD RAID5
>> is stupid:
>>
>> * ext2 would be faster
>>
>> * ext2 would provide better protection against powerfail.
>
> Not true in the slightest, you continue to ignore the ext2/3/4 developers 
> telling you that it will lose data.

I know I will lose data. Both ext2 and ext3 will lose data on
flashdisk. (That's what I'm trying to document). But... what is the
benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
protects you against kernel panic. MD RAID5 is in software, so... that
additional protection is just not there).

>> "ext3 works on flash and MD RAID5, as long as you do not have
>> powerfail" seems to be the accurate statement, and if you don't need
>> to protect against powerfails, you can just use ext2.
>
> Strange how your personal preference is totally out of sync with the 
> entire enterprise class user base.

Perhaps no one told them MD RAID5 is dangerous? You see, that's exactly
what I'm trying to document here.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-24 22:39                           ` Theodore Tso
                                               ` (2 preceding siblings ...)
  2009-08-25 13:57                             ` Chris Adams
@ 2009-08-25 22:58                             ` Neil Brown
  2009-08-25 23:10                               ` Ric Wheeler
  3 siblings, 1 reply; 309+ messages in thread
From: Neil Brown @ 2009-08-25 22:58 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Pavel Machek, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4

On Monday August 24, tytso@mit.edu wrote:
> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
> > > I have to admit that I have not paid enough attention to this specifics  
> > > of your ext3 + flash card issue - is it the ftl stuff doing out of order  
> > > IO's? 
> > 
> > The problem is that flash cards destroy whole erase block on unplug,
> > and ext3 can't cope with that.
> 
> Sure --- but name **any** filesystem that can deal with the fact that
> 128k or 256k worth of data might disappear when you pull out the flash
> card while it is writing a single sector? 

A Log structured filesystem could certainly be written to deal with
such a situation, providing by 'deal with' you mean 'only loses data
that has not yet been acknowledged to the application'.  Of course the
filesystem would need clear visibility into exactly how these blocks
are positioned.

I've been playing with just such a filesystem for some time (never
really finding enough time) with the goal of making it work over RAID5
with no data risk due to power loss.  One day it will be functional
enough for others to try....

It is entirely possible that NILFS could be made to meet that
requirement, but I haven't made time to explore NILFS so I cannot be
sure.

NeilBrown


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-25 22:40                                         ` Pavel Machek
@ 2009-08-25 22:59                                           ` david
  2009-08-25 23:37                                             ` Pavel Machek
  2009-08-26  4:20                                           ` Rik van Riel
  1 sibling, 1 reply; 309+ messages in thread
From: david @ 2009-08-25 22:59 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Pavel Machek wrote:

> On Tue 2009-08-25 15:33:08, david@lang.hm wrote:
>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>
>>>> It seems that you are really hung up on whether or not the filesystem
>>>> metadata is consistent after a power failure, when I'd argue that the
>>>> problem with using storage devices that don't have good powerfail
>>>> properties have much bigger problems (such as the potential for silent
>>>> data corruption, or even if fsck will fix a trashed inode table with
>>>> ext2, massive data loss).  So instead of your suggested patch, it
>>>> might be better simply to have a file in Documentation/filesystems
>>>> that states something along the lines of:
>>>>
>>>> "There are storage devices that high highly undesirable properties
>>>> when they are disconnected or suffer power failures while writes are
>>>> in progress; such devices include flash devices and software RAID 5/6
>>>> arrays without journals,
>>
>> is it under all conditions, or only when you have already lost redundancy?
>
> I'd prefer not to specify.

you need to, otherwise you are claiming that all linux software raid 
implementations will lose data on powerfail, which I don't think is the
case.

>> prior discussions make me think this was only if the redundancy is
>> already lost.
>
> I'm not so sure now.
>
> Lets say you are writing to the (healthy) RAID5 and have a powerfail.
>
> So now data blocks do not correspond to the parity block. You don't
> yet have the corruption, but you already have a problem.
>
> If you get a disk failing at this point, you'll get corruption.

it's the same combination of problems (non-redundant array and write lost 
to powerfail/reboot), just in a different order.

recommending a scrub of the raid after an unclean shutdown would make
sense, along with a warning that if you lose all redundancy before the
scrub is completed and there was a write failure in the unscrubbed portion 
it could corrupt things.

>> also, the talk about software RAID 5/6 arrays without journals will be
>> confusing (after all, if you are using ext3/XFS/etc you are using a
>> journal, aren't you?)
>
> Slightly confusing, yes. Should I just say "MD RAID 5" and avoid
> talking about hardware RAID arrays, where that's really
> manufacturer-specific?

what about dm raid?

I don't think you should talk about hardware raid cards.

>> in addition, even with a single drive you will lose some data on power
>> loss (unless you do sync mounts with disabled write caches), full data
>> journaling can help protect you from this, but the default journaling
>> just protects the metadata.
>
> "Data loss" here means "damaging data that were already fsynced". That
> will not happen on single disk (with barriers on etc), but will happen
> on RAID5 and flash.

this definition of data loss wasn't clear prior to this. you need to
define this, and state that the reason flash and raid arrays can
suffer from this is that both of them deal with blocks of storage larger
than the data block (eraseblock or raid stripe). there are conditions
that can cause the loss of the entire eraseblock or raid stripe, which can
affect data that was previously safe on disk (and if power had been lost
before the latest write, the prior data would still be safe).

note that this doesn't necessarily affect all flash disks. if the disk
doesn't replace the old block in the FTL until the data has all been
successfully copied to the new eraseblock you don't have this problem.

some (possibly all) cheap thumb drives don't do this, but I would expect
the expensive SATA SSDs to do things in the right order.
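
a toy model of that difference (python, made-up sizes, deliberately naive
erase-in-place FTL -- real controllers differ):

ERASEBLOCK = 8        # filesystem blocks per flash eraseblock (say 128k / 16k)

flash = ["old-%d" % i for i in range(ERASEBLOCK)]    # one eraseblock

def naive_rewrite(block_no, new_data, power_fails=False):
    # erase-in-place FTL: the whole eraseblock is wiped before rewriting
    saved = list(flash)
    for i in range(ERASEBLOCK):
        flash[i] = None           # erase wipes every block in the group
    if power_fails:
        return                    # power lost here: the neighbours are gone too
    for i in range(ERASEBLOCK):
        flash[i] = new_data if i == block_no else saved[i]

naive_rewrite(3, "new-3", power_fails=True)
print(flash)                      # all 8 blocks lost, not just the one written

the safer FTL described above copies the merged eraseblock to a spare
eraseblock first and only then updates the mapping, so an interrupted write
leaves the old contents intact.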

do this right and you are properly documenting a failure mode that most 
people don't understand, but go too far and you are crying wolf.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 22:51                                         ` Pavel Machek
@ 2009-08-25 23:03                                           ` david
  2009-08-25 23:29                                             ` Pavel Machek
  2009-08-25 23:03                                           ` Ric Wheeler
  1 sibling, 1 reply; 309+ messages in thread
From: david @ 2009-08-25 23:03 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Pavel Machek wrote:

>>>> I don't object to making that general statement - "Don't hot unplug a
>>>> device with an active file system or actively used raw device" - but
>>>> would object to the overly general statement about ext3 not working on
>>>> flash, RAID5 not working, etc...
>>>
>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>> is stupid:
>>>
>>> * ext2 would be faster
>>>
>>> * ext2 would provide better protection against powerfail.
>>
>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>> telling you that it will lose data.
>
> I know I will lose data. Both ext2 and ext3 will lose data on
> flashdisk. (That's what I'm trying to document). But... what is the
> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
> protects you against kernel panic. MD RAID5 is in software, so... that
> additional protection is just not there).

the block device can lose data, it has absolutely nothing to do with the
filesystem

>>> "ext3 works on flash and MD RAID5, as long as you do not have
>>> powerfail" seems to be the accurate statement, and if you don't need
>>> to protect against powerfails, you can just use ext2.
>>
>> Strange how your personal preference is totally out of sync with the
>> entire enterprise class user base.
>
> Perhaps no one told them MD RAID5 is dangerous? You see, that's exactly
> what I'm trying to document here.

an MD raid array that's degraded to the point where there is no redundancy
is dangerous, but I don't think that any of the enterprise users would be 
surprised.

I think they will be surprised that it's possible that a prior failed 
write that hasn't been scrubbed can cause data loss when the array later 
degrades.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 22:51                                         ` Pavel Machek
  2009-08-25 23:03                                           ` david
@ 2009-08-25 23:03                                           ` Ric Wheeler
  2009-08-25 23:26                                             ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-25 23:03 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/25/2009 06:51 PM, Pavel Machek wrote:
>
>
>>>> I really think that the expectation that all OS's (windows, mac, even
>>>> your ipod) all teach you not to hot unplug a device with any file system.
>>>> Users have an "eject" or "safe unload" in windows, your iPod tells you
>>>> not to power off or disconnect, etc.
>>>
>>> That was before journaling filesystems...
>>
>> Not true - that is true today with or without journals as we have
>> discussed in great detail. Including specifically ext2.
>>
>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>> the page cache will lose data when you hot unplug its storage. End of
>> story, don't do it!
>
> No, not ext3 on SATA disk with barriers on and proper use of
> fsync(). I actually tested that.
>
> Yes, I should be able to hotunplug SATA drives and expect the data
> that was fsync-ed to be there.

You can and will lose data (even after fsync) with any type of storage at some 
rate. What you are missing here is that data loss needs to be measured in hard 
numbers - say percentage of installed boxes that have config X that lose data.

Strangely enough, this is what high end storage companies do for a living, 
configure, deploy and then measure results.

A long winded way of saying that just because you can induce data failure by 
recreating an event that happens almost never (power loss while rebuilding a 
RAID5 group specifically) does not mean that this makes RAID5 with ext3 unreliable.

What does happen all of the time is single bad sector IO's and (less often, but 
more than your scenario) complete drive failures. In both cases, MD RAID5 will 
repair that damage before a second failure (including a power failure) happens 
99.99% of the time.

I can promise you that hot unplugging and replugging a S-ATA drive will also 
lose you data if you are actively writing to it (ext2, 3, whatever).

Your micro data-loss benchmark is not a valid reflection of the wider
experience and I fear that you will cause people to lose more data, not less,
by moving them away from ext3 and MD RAID5.

>
>>>> I don't object to making that general statement - "Don't hot unplug a
>>>> device with an active file system or actively used raw device" - but
>>>> would object to the overly general statement about ext3 not working on
>>>> flash, RAID5 not working, etc...
>>>
>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>> is stupid:
>>>
>>> * ext2 would be faster
>>>
>>> * ext2 would provide better protection against powerfail.
>>
>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>> telling you that it will lose data.
>
> I know I will lose data. Both ext2 and ext3 will lose data on
> flashdisk. (That's what I'm trying to document). But... what is the
> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
> protects you against kernel panic. MD RAID5 is in software, so... that
> additional protection is just not there).

Faster recovery time on any normal kernel crash or power outage.  Data loss 
would be equivalent with or without the journal.

>
>>> "ext3 works on flash and MD RAID5, as long as you do not have
>>> powerfail" seems to be the accurate statement, and if you don't need
>>> to protect against powerfails, you can just use ext2.
>>
>> Strange how your personal preference is totally out of sync with the
>> entire enterprise class user base.
>
> Perhaps no one told them MD RAID5 is dangerous? You see, that's exactly
> what I'm trying to document here.
> 								Pavel

Using MD RAID5 will save more people from commonly occurring errors (sector and 
disk failures) than will lose it because of your rebuild interrupted by a power 
failure worry.

What you are trying to do is to document a belief you have that is not borne out
by real data across actual user boxes running real work loads.

Unfortunately, getting that data is hard work and one of the things that we as a 
community do especially poorly.  All of the data (secret data from my past and 
published data by NetApp, Google, etc) that I have seen would directly 
contradict your assertions and you will cause harm to our users with this.

Ric



^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 21:15                                     ` Pavel Machek
  2009-08-25 22:42                                       ` Ric Wheeler
@ 2009-08-25 23:08                                       ` Neil Brown
  2009-08-25 23:44                                         ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: Neil Brown @ 2009-08-25 23:08 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tuesday August 25, pavel@ucw.cz wrote:
> 
> You can object any way you want, but running ext3 on flash or MD RAID5
> is stupid:
> 
> * ext2 would be faster
> 
> * ext2 would provide better protection against powerfail.
> 
> "ext3 works on flash and MD RAID5, as long as you do not have
> powerfail" seems to be the accurate statement, and if you don't need
> to protect against powerfails, you can just use ext2.
> 								Pavel

You are over generalising.
MD/RAID5 is only less than perfect if it is degraded.  If all devices
are present before the power failure and after the power failure,
then there is no risk.

RAID5 only promises to protect against a single failure.
Power loss plus device loss equals multiple failure.

And then there is the comment Ted made about probabilities.
While you can get data corruption if a RAID5 comes back degraded after
a power fail, I believe it is a lot less likely than the metadata
being inconsistent on an ext2 after a power fail.
So ext3 is still a good choice (especially if you put your journal on
a separate device).


While I think it is, in principle, worth documenting this sort of
thing, there are an awful lot of fine details and distinctions that
would need to be considered.

NeilBrown

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 22:58                             ` Neil Brown
@ 2009-08-25 23:10                               ` Ric Wheeler
  2009-08-25 23:32                                   ` NeilBrown
  0 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-25 23:10 UTC (permalink / raw)
  To: Neil Brown
  Cc: Theodore Tso, Pavel Machek, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4

On 08/25/2009 06:58 PM, Neil Brown wrote:
> On Monday August 24, tytso@mit.edu wrote:
>> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
>>>> I have to admit that I have not paid enough attention to this specifics
>>>> of your ext3 + flash card issue - is it the ftl stuff doing out of order
>>>> IO's?
>>>
>>> The problem is that flash cards destroy whole erase block on unplug,
>>> and ext3 can't cope with that.
>>
>> Sure --- but name **any** filesystem that can deal with the fact that
>> 128k or 256k worth of data might disappear when you pull out the flash
>> card while it is writing a single sector?
>
> A Log structured filesystem could certainly be written to deal with
> such a situation, providing by 'deal with' you mean 'only loses data
> that has not yet been acknowledged to the application'.  Of course the
> filesystem would need clear visibility into exactly how these blocks
> are positioned.
>
> I've been playing with just such a filesystem for some time (never
> really finding enough time) with the goal of making it work over RAID5
> with no data risk due to power loss.  One day it will be functional
> enough for others to try....
>
> It is entirely possible that NILFS could be made to meet that
> requirement, but I haven't made time to explore NILFS so I cannot be
> sure.
>
> NeilBrown
>

I am not sure that log structure will protect you from this scenario since once 
you clean the log, the non-logged data is assumed to be correct.

If your cheap flash storage device can nuke random regions of that clean 
storage, you will lose data....

ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:03                                           ` Ric Wheeler
@ 2009-08-25 23:26                                             ` Pavel Machek
  2009-08-25 23:40                                               ` Ric Wheeler
  2009-08-25 23:46                                               ` [patch] ext2/3: document conditions when reliable operation is possible david
  0 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25 23:26 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet


>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>> the page cache will lose data when you hot unplug its storage. End of
>>> story, don't do it!
>>
>> No, not ext3 on SATA disk with barriers on and proper use of
>> fsync(). I actually tested that.
>>
>> Yes, I should be able to hotunplug SATA drives and expect the data
>> that was fsync-ed to be there.
>
> You can and will lose data (even after fsync) with any type of storage at 
> some rate. What you are missing here is that data loss needs to be 
> measured in hard numbers - say percentage of installed boxes that have 
> config X that lose data.

I'm talking "by design" here.

I will lose data even on a SATA drive that is properly powered on if I
wait 5 years.

> I can promise you that hot unplugging and replugging a S-ATA drive will 
> also lose you data if you are actively writing to it (ext2, 3, whatever).

I can promise you that running a S-ATA drive will also lose you data,
even if you are not actively writing to it. Just wait 10 years; so
what is your point?

But ext3 is _designed_ to preserve fsynced data on SATA drive, while
it is _not_ designed to preserve fsynced data on MD RAID5.

Do you really think that's not a difference?

>>>>> I don't object to making that general statement - "Don't hot unplug a
>>>>> device with an active file system or actively used raw device" - but
>>>>> would object to the overly general statement about ext3 not working on
>>>>> flash, RAID5 not working, etc...
>>>>
>>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>>> is stupid:
>>>>
>>>> * ext2 would be faster
>>>>
>>>> * ext2 would provide better protection against powerfail.
>>>
>>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>>> telling you that it will lose data.
>>
>> I know I will lose data. Both ext2 and ext3 will lose data on
>> flashdisk. (That's what I'm trying to document). But... what is the
>> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
>> protects you against kernel panic. MD RAID5 is in software, so... that
>> additional protection is just not there).
>
> Faster recovery time on any normal kernel crash or power outage.  Data 
> loss would be equivalent with or without the journal.

No, because you'll actually repair the ext2 with fsck after the kernel
crash or power outage. Data loss will not be equivalent; in particular
you'll not lose data written to ext2 _after_ the power outage.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is  possible
  2009-08-24 13:21               ` Greg Freemyer
@ 2009-08-25 23:28                 ` Neil Brown
  -1 siblings, 0 replies; 309+ messages in thread
From: Neil Brown @ 2009-08-25 23:28 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Pavel Machek, Goswin von Brederlow, Rob Landley, kernel list,
	Andrew Morton, mtk.manpages, tytso, rdunlap, linux-doc,
	linux-ext4

On Monday August 24, greg.freemyer@gmail.com wrote:
> > +Don't damage the old data on a failed write (ATOMIC-WRITES)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> > +
> > +       Because RAM tends to fail faster than rest of system during
> > +       powerfail, special hw killing DMA transfers may be necessary;
> > +       otherwise, disks may write garbage during powerfail.
> > +       This may be quite common on generic PC machines.
> > +
> > +       Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > +       because it needs to write both changed data, and parity, to
> > +       different disks. (But it will only really show up in degraded mode).
> > +       UPS for RAID array should help.
> 
> Can someone clarify if this is true in raid-6 with just a single disk
> failure?  I don't see why it would be.

It does affect raid6 with a single drive missing.

After an unclean shutdown you cannot trust any Parity block as it
is possible that some of the blocks in the stripe have been updated,
but others have not.  So you must assume that all parity blocks are
wrong and update them.  If you have a missing disk you cannot do that.

To take a more concrete example, imagine a 5 device RAID6 with
3 data blocks D0 D1 D2 as well as P and Q on some stripe.
Suppose that we crashed while updating D0, which would have involved
writing out D0, P and Q.
On restart, suppose D2 is missing. It is possible that 0, 1, 2, or 3
of D0, P and Q have been updated and the others not.
We can try to recompute D2 from D0, D1 and P, from
D0, P and Q, or from D1, P and Q.

We could conceivably try each of those and if they all produce the
same result we might be confident of it.
If two produced the same result and the other was different we could
use a voting process to choose the 'best'.  And in this particular
case I think that would work.  If 0 or 3 had been updated, all would
be the same.  If only 1 was updated, then the combinations that
exclude it will match.  If 2 were updated, then the combinations that
exclude the non-updated block will match.

But if both D0 and D1 were being updated I think there would be too
many combinations and it would be very possible that all three
computed values for D2 would be different.
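
A small stand-alone sketch of that ambiguity (Python, made-up byte values,
assuming GF(2^8) with polynomial 0x11d and generator 2 as commonly used for
the Q syndrome; here only the new D0 reached the media before the crash, and
the disk holding D2 is the one missing on restart):

def gf_mul(a, b):                      # multiply in GF(2^8), polynomial 0x11d
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xff
        if carry:
            a ^= 0x1d
        b >>= 1
    return p

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

def pq(data):                          # P and Q for one byte per data disk
    P = Q = 0
    for i, d in enumerate(data):
        P ^= d
        Q ^= gf_mul(gf_pow(2, i), d)
    return P, Q

def recover_two(data, P, Q, x, y):
    # solve for D[x] and D[y] given P, Q and the remaining data bytes
    S, T = P, Q
    for i, d in enumerate(data):
        if i in (x, y) or d is None:
            continue
        S ^= d
        T ^= gf_mul(gf_pow(2, i), d)
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    Dx = gf_mul(T ^ gf_mul(gy, S), gf_inv(gx ^ gy))
    return Dx, S ^ Dx                  # (D[x], D[y])

old = [0x11, 0x22, 0x33]               # D0, D1, D2 before the crash
P, Q = pq(old)                         # P and Q never got rewritten

disk = [0x99, 0x22, None]              # new D0 on media, D2's disk now missing

c1 = P ^ disk[0] ^ disk[1]             # D2 from D0, D1 and P
c2 = recover_two(disk, P, Q, 1, 2)[1]  # D2 from D0, P and Q
c3 = recover_two(disk, P, Q, 0, 2)[1]  # D2 from D1, P and Q
print(hex(c1), hex(c2), hex(c3))       # three different answers; one happens to
                                       # be the old D2, but md cannot tell which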

So yes: a singly degraded RAID6 cannot promise no data corruption
after an unclean shutdown.  That is why "mdadm" will not assemble such
an array unless you use "--force" to acknowledge that there has been a
problem. 

NeilBrown

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:03                                           ` david
@ 2009-08-25 23:29                                             ` Pavel Machek
  0 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25 23:29 UTC (permalink / raw)
  To: david
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet


>>>> "ext3 works on flash and MD RAID5, as long as you do not have
>>>> powerfail" seems to be the accurate statement, and if you don't need
>>>> to protect against powerfails, you can just use ext2.
>>>
>>> Strange how your personal preference is totally out of sync with the
>>> entire enterprise class user base.
>>
>> Perhaps noone told them MD RAID5 is dangerous? You see, that's exactly
>> what I'm trying to document here.
>
> a MD raid array that's degraded to the point where there is no redundancy 
> is dangerous, but I don't think that any of the enterprise users would be 
> surprised.
>
> I think they will be surprised that it's possible that a prior failed  
> write that hasn't been scrubbed can cause data loss when the array later  
> degrades.

Cool, so Ted's "raid5 has highly undesirable properties" is actually
pretty accurate. Some raid person should write a more detailed README,
I'd say...
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is  possible
  2009-08-25 23:10                               ` Ric Wheeler
@ 2009-08-25 23:32                                   ` NeilBrown
  0 siblings, 0 replies; 309+ messages in thread
From: NeilBrown @ 2009-08-25 23:32 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Pavel Machek, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4

On Wed, August 26, 2009 9:10 am, Ric Wheeler wrote:
> On 08/25/2009 06:58 PM, Neil Brown wrote:
>> On Monday August 24, tytso@mit.edu wrote:
>>> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
>>>>> I have to admit that I have not paid enough attention to this
>>>>> specifics
>>>>> of your ext3 + flash card issue - is it the ftl stuff doing out of
>>>>> order
>>>>> IO's?
>>>>
>>>> The problem is that flash cards destroy whole erase block on unplug,
>>>> and ext3 can't cope with that.
>>>
>>> Sure --- but name **any** filesystem that can deal with the fact that
>>> 128k or 256k worth of data might disappear when you pull out the flash
>>> card while it is writing a single sector?
>>
>> A Log structured filesystem could certainly be written to deal with
>> such a situation, providing by 'deal with' you mean 'only loses data
>> that has not yet been acknowledged to the application'.  Of course the
>> filesystem would need clear visibility into exactly how these blocks
>> are positioned.
>>
>> I've been playing with just such a filesystem for some time (never
>> really finding enough time) with the goal of making it work over RAID5
>> with no data risk due to power loss.  One day it will be functional
>> enough for others to try....
>>
>> It is entirely possible that NILFS could be made to meet that
>> requirement, but I haven't made time to explore NILFS so I cannot be
>> sure.
>>
>> NeilBrown
>>
>
> I am not sure that log structure will protect you from this scenario since
> once
> you clean the log, the non-logged data is assumed to be correct.
>
> If your cheap flash storage device can nuke random regions of that clean
> storage, you will lose data....

Hence my observation that "the filesystem would need clear visibility into
exactly how these blocks are positioned".
If there is an FTL in the way that randomly relocates blocks, and a
power fail during write could corrupt data that appears to be
megabytes away in some unpredictable location, then yes: a log structure
won't help.

However I would like to imagine that even a cheap flash device, if it
only ever got writes that were exactly the size of the erase-block, would
not break those writes over multiple erase blocks, so some degree of
integrity and predictability could be preserved.  Even more so, I would
love to  be able to disable the FTL, or at least have clear and correct
documentation about how it works.

So yes, not a panacea.  But an avenue with real possibilities.
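
As a rough sketch of the damage radius in question (the 128k erase block,
the read-erase-reprogram sequence and the failure point are illustrative
assumptions, not a description of any particular card or FTL):

/* Toy model: to service a 512-byte write, the simulated device erases
 * and reprograms a whole 128k erase block, so a power failure in the
 * middle destroys sectors the filesystem never asked to write. */
#include <stdio.h>
#include <string.h>

#define SECTOR          512
#define ERASE_BLOCK     (128 * 1024)
#define SECTORS_PER_EB  (ERASE_BLOCK / SECTOR)

static unsigned char flash[2 * ERASE_BLOCK];    /* two erase blocks */
static unsigned char scratch[ERASE_BLOCK];

static void rewrite_sector(int sector, unsigned char val, int fail_after_erase)
{
        int eb = (sector / SECTORS_PER_EB) * ERASE_BLOCK;

        memcpy(scratch, flash + eb, ERASE_BLOCK);       /* read-modify-write in RAM */
        memset(scratch + (sector % SECTORS_PER_EB) * SECTOR, val, SECTOR);

        memset(flash + eb, 0xff, ERASE_BLOCK);          /* erase the whole block */
        if (fail_after_erase)
                return;                                 /* power lost: nothing reprogrammed */
        memcpy(flash + eb, scratch, ERASE_BLOCK);       /* reprogram */
}

int main(void)
{
        memset(flash, 0xaa, sizeof(flash));     /* "old, already-fsynced data" */

        rewrite_sector(3, 0x55, 1);             /* power fails mid-update of sector 3 */

        /* sector 100 shares the erase block with sector 3 and is now gone */
        printf("sector 100: 0x%02x (was 0xaa)\n", flash[100 * SECTOR]);
        /* a sector in the other erase block is untouched */
        printf("sector %d: 0x%02x\n", SECTORS_PER_EB, flash[ERASE_BLOCK]);
        return 0;
}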

NeilBrown


^ permalink raw reply	[flat|nested] 309+ messages in thread


* Re: [patch] document flash/RAID dangers
  2009-08-25 22:59                                           ` david
@ 2009-08-25 23:37                                             ` Pavel Machek
  2009-08-25 23:48                                               ` Ric Wheeler
  2009-08-25 23:56                                               ` david
  0 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25 23:37 UTC (permalink / raw)
  To: david
  Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

Hi!

>>> is it under all conditions, or only when you have already lost redundancy?
>>
>> I'd prefer not to specify.
>
> you need to, otherwise you are claiming that all linux software raid  
> implementations will lose data on powerfail, which I don't think is the
> case.

Well, I'm not saying it loses data on _every_ powerfail ;-).

>>> also, the talk about software RAID 5/6 arrays without journals will be
>>> confusing (after all, if you are using ext3/XFS/etc you are using a
>>> journal, aren't you?)
>>
>> Slightly confusing, yes. Should I just say "MD RAID 5" and avoid
>> talking about hardware RAID arrays, where that's really
>> manufacturer-specific?
>
> what about dm raid?
>
> I don't think you should talk about hardware raid cards.

Ok, fixed.

>>> in addition, even with a single drive you will lose some data on power
>>> loss (unless you do sync mounts with disabled write caches), full data
>>> journaling can help protect you from this, but the default journaling
>>> just protects the metadata.
>>
>> "Data loss" here means "damaging data that were already fsynced". That
>> will not happen on single disk (with barriers on etc), but will happen
>> on RAID5 and flash.
>
> this definition of data loss wasn't clear prior to this. you need to  

I actually think it was. write() syscall does not guarantee anything,
fsync() does.
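
A minimal sketch of what proper use of fsync() means here, assuming a
sanely behaving device underneath (working barriers or no volatile write
cache); the filename is just an example:

/* write() alone promises nothing across a power failure; only a
 * successful fsync() marks the point after which the data is expected
 * to survive (on storage that meets the expectations above). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char buf[] = "important record\n";
        int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1)) {
                perror("write");                /* not durable yet in any case */
                return 1;
        }
        if (fsync(fd) != 0) {                   /* durability point is here */
                perror("fsync");
                return 1;
        }
        close(fd);
        return 0;
}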

> define this, and state that the reason that flash and raid arrays can  
> suffer from this is that both of them deal with blocks of storage larger  
> than the data block (eraseblock or raid stripe) and there are conditions  
> that can cause the loss of the entire eraseblock or raid stripe which can 
> affect data that was previously safe on disk (and if power had been lost  
> before the latest write, the prior data would still be safe)

I actually believe Ted's writeup is good.

> note that this doesn't necessarily affect all flash disks. if the disk  
> doesn't replace the old block in the FTL until the data has all been  
> successfully copied to the new eraseblock you don't have this problem.
>
> some (possibly all) cheap thumb drives don't do this, but I would expect  
> that the expensive SATA SSDs to do things in the right order.

I'd expect SATA SSDs to have that solved, yes. Again, Ted does not say
it affects _all_ such devices, and it certainly did affect all that I have seen.

> do this right and you are properly documenting a failure mode that most  
> people don't understand, but go too far and you are crying wolf.

Ok, latest version is below, can you suggest improvements? (And yes,
details of when exactly RAID-5 misbehaves should be noted somewhere. I
don't know enough about RAID arrays, can someone help?)
									Pavel

---
There are storage devices that have highly undesirable properties
when they are disconnected or suffer power failures while writes are
in progress; such devices include flash devices and MD RAID 4/5/6
arrays.  These devices have the property of potentially
corrupting blocks being written at the time of the power failure, and
worse yet, amplifying the region where blocks are corrupted such that
additional sectors are also damaged during the power failure.
        
Users who use such storage devices are well advised to take
countermeasures, such as the use of Uninterruptible Power Supplies,
and making sure the flash device is not hot-unplugged while the device
is being used.  Regular backups when using these devices are also a
Very Good Idea.
        
Otherwise, file systems placed on these devices can suffer silent data
and file system corruption.  A forced use of fsck may detect metadata
corruption resulting in file system corruption, but will not suffice
to detect data corruption.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:26                                             ` Pavel Machek
@ 2009-08-25 23:40                                               ` Ric Wheeler
  2009-08-25 23:48                                                 ` david
                                                                   ` (2 more replies)
  2009-08-25 23:46                                               ` [patch] ext2/3: document conditions when reliable operation is possible david
  1 sibling, 3 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-25 23:40 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/25/2009 07:26 PM, Pavel Machek wrote:
>
>>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>>> the page cache will lose data when you hot unplug its storage. End of
>>>> story, don't do it!
>>>
>>> No, not ext3 on SATA disk with barriers on and proper use of
>>> fsync(). I actually tested that.
>>>
>>> Yes, I should be able to hotunplug SATA drives and expect the data
>>> that was fsync-ed to be there.
>>
>> You can and will lose data (even after fsync) with any type of storage at
>> some rate. What you are missing here is that data loss needs to be
>> measured in hard numbers - say percentage of installed boxes that have
>> config X that lose data.
>
> I'm talking "by design" here.
>
> I will lose data even on SATA drive that is properly powered on if I
> wait 5 years.
>

You are dead wrong.

For RAID5 arrays, you assume that you have a hard failure and a power outage 
before you can rebuild the RAID (order of hours at full tilt).

The failure rate of S-ATA drives is a few percent of the installed base per 
year. Some drives will fail faster than that (bad parts, bad 
environmental conditions, etc).

Why don't you hold all of your most precious data on that single S-ATA drive for 
five years on one box and put a second copy on a small RAID5 with ext3 for the 
same period?

Repeat the experiment until you get up to something like Google scale, or the 
scale of the other papers on failures in national labs in the US, and then we 
can have an informed discussion.


>> I can promise you that hot unplugging and replugging a S-ATA drive will
>> also lose you data if you are actively writing to it (ext2, 3, whatever).
>
> I can promise you that running S-ATA drive will also lose you data,
> even if you are not actively writing to it. Just wait 10 years; so
> what is your point?

I lost a S-ATA drive 24 hours after installing it in a new box. If I had MD 
RAID5, I would not have lost any data.

My point is that you fail to take into account the rate of failures of a given 
configuration and the probability of data loss given those rates.

>
> But ext3 is _designed_ to preserve fsynced data on SATA drive, while
> it is _not_ designed to preserve fsynced data on MD RAID5.

Of course it will when you properly configure your MD RAID5.

>
> Do you really think that's not a difference?

I think that you are simply wrong.

>
>>>>>> I don't object to making that general statement - "Don't hot unplug a
>>>>>> device with an active file system or actively used raw device" - but
>>>>>> would object to the overly general statement about ext3 not working on
>>>>>> flash, RAID5 not working, etc...
>>>>>
>>>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>>>> is stupid:
>>>>>
>>>>> * ext2 would be faster
>>>>>
>>>>> * ext2 would provide better protection against powerfail.
>>>>
>>>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>>>> telling you that it will lose data.
>>>
>>> I know I will lose data. Both ext2 and ext3 will lose data on
>>> flashdisk. (That's what I'm trying to document). But... what is the
>>> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
>>> protects you against kernel panic. MD RAID5 is in software, so... that
>>> additional protection is just not there).
>>
>> Faster recovery time on any normal kernel crash or power outage.  Data
>> loss would be equivalent with or without the journal.
>
> No, because you'll actually repair the ext2 with fsck after the kernel
> crash or power outage. Data loss will not be equivalent; in particular
> you'll not lose data written _after_ the power outage to ext2.
> 									Pavel


As Ted (who wrote fsck for ext*) said, you will lose data in both.  Your 
argument is not based on fact.

You need to actually prove your point, not just state it as fact.

ric

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:08                                       ` Neil Brown
@ 2009-08-25 23:44                                         ` Pavel Machek
  2009-08-26  4:08                                           ` Rik van Riel
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-25 23:44 UTC (permalink / raw)
  To: Neil Brown
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet


> While I think it is, in principle, worth documenting this sort of
> thing, there are an awful lot of fine details and distinctions that
> would need to be considered.

Ok, can you help? Having a piece of MD documentation explaining the
"powerfail nukes entire stripe" and how current filesystems do not
deal with that would be nice, along with a description of when exactly that
happens.

It seems to need two events -- one failed disk and one powerfail. I
knew that raid5 only protects against one failure, but I never
realized that simple powerfail (or kernel crash) counts as a failure
here, too.

I guess it should go at the end of md.txt.... aha, it actually already
talks about the issue a bit, in:

#Boot time assembly of degraded/dirty arrays
#-------------------------------------------
#
#If a raid5 or raid6 array is both dirty and degraded, it could have
#undetectable data corruption.  This is because the fact that it is
#'dirty' means that the parity cannot be trusted, and the fact that it
#is degraded means that some datablocks are missing and cannot reliably
#be reconstructed (due to no parity).

(Actually... that's possibly what happened to a friend of mine. One of the
disks in a raid5 stopped responding and the whole system just hung
up. Oops, two failures in one...)
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:26                                             ` Pavel Machek
  2009-08-25 23:40                                               ` Ric Wheeler
@ 2009-08-25 23:46                                               ` david
  1 sibling, 0 replies; 309+ messages in thread
From: david @ 2009-08-25 23:46 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Pavel Machek wrote:

>>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>>> the page cache will lose data when you hot unplug its storage. End of
>>>> story, don't do it!
>>>
>>> No, not ext3 on SATA disk with barriers on and proper use of
>>> fsync(). I actually tested that.
>>>
>>> Yes, I should be able to hotunplug SATA drives and expect the data
>>> that was fsync-ed to be there.
>>
>> You can and will lose data (even after fsync) with any type of storage at
>> some rate. What you are missing here is that data loss needs to be
>> measured in hard numbers - say percentage of installed boxes that have
>> config X that lose data.
>
> I'm talking "by design" here.
>
> I will lose data even on SATA drive that is properly powered on if I
> wait 5 years.
>
>> I can promise you that hot unplugging and replugging a S-ATA drive will
>> also lose you data if you are actively writing to it (ext2, 3, whatever).
>
> I can promise you that running S-ATA drive will also lose you data,
> even if you are not actively writing to it. Just wait 10 years; so
> what is your point?
>
> But ext3 is _designed_ to preserve fsynced data on SATA drive, while
> it is _not_ designed to preserve fsynced data on MD RAID5.

substitute 'degraded MD RAID 5' for 'MD RAID 5' and you have a point here. 
although the language you are using is pretty harsh. you make it sound 
like this is a problem with ext3 when the filesystem has nothing to do 
with it. the problem is that a degraded raid 5 array can be corrupted by 
an additional failure.

> Do you really think that's not a difference?
>
>>>>>> I don't object to making that general statement - "Don't hot unplug a
>>>>>> device with an active file system or actively used raw device" - but
>>>>>> would object to the overly general statement about ext3 not working on
>>>>>> flash, RAID5 not working, etc...
>>>>>
>>>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>>>> is stupid:
>>>>>
>>>>> * ext2 would be faster
>>>>>
>>>>> * ext2 would provide better protection against powerfail.
>>>>
>>>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>>>> telling you that it will lose data.
>>>
>>> I know I will lose data. Both ext2 and ext3 will lose data on
>>> flashdisk. (That's what I'm trying to document). But... what is the
>>> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
>>> protects you against kernel panic. MD RAID5 is in software, so... that
>>> additional protection is just not there).
>>
>> Faster recovery time on any normal kernel crash or power outage.  Data
>> loss would be equivalent with or without the journal.
>
> No, because you'll actually repair the ext2 with fsck after the kernel
> crash or power outage. Data loss will not be equivalent; in particular
> you'll not lose data writen _after_ power outage to ext2.

by the way, while you are thinking about failures that can happen from a 
failed write corrupting additional blocks, think about the nightmare that 
can happen if those blocks are in the journal.

the 'repair' of ext2 by a fsck actually does much less than you are thinking 
it does.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-25 23:37                                             ` Pavel Machek
@ 2009-08-25 23:48                                               ` Ric Wheeler
  2009-08-26  0:06                                                 ` Pavel Machek
  2009-08-25 23:56                                               ` david
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-25 23:48 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet


> ---
> There are storage devices that have highly undesirable properties
> when they are disconnected or suffer power failures while writes are
> in progress; such devices include flash devices and MD RAID 4/5/6
> arrays.  These devices have the property of potentially
> corrupting blocks being written at the time of the power failure, and
> worse yet, amplifying the region where blocks are corrupted such that
> additional sectors are also damaged during the power failure.

I would strike the entire mention of MD devices since it is your assertion, not 
a proven fact. You will cause more data loss from common events (single sector 
errors, complete drive failure) by steering people away from more reliable 
storage configurations because of a really rare edge case (power failure during 
split write to two raid members while doing a RAID rebuild).

>
> Users who use such storage devices are well advised take
> countermeasures, such as the use of Uninterruptible Power Supplies,
> and making sure the flash device is not hot-unplugged while the device
> is being used.  Regular backups when using these devices is also a
> Very Good Idea.

All users who care about data integrity - including those who do not use MD RAID5 but 
just regular single S-ATA disks - will get better reliability from a UPS.


>
> Otherwise, file systems placed on these devices can suffer silent data
> and file system corruption.  An forced use of fsck may detect metadata
> corruption resulting in file system corruption, but will not suffice
> to detect data corruption.
>

This is very misleading. All storage "can" have silent data loss, you are making 
a statement without specifics about frequency.

FSCK can repair the file system metadata, but will not detect any data loss or 
corruption in the data blocks allocated to user files. To detect data loss 
properly, you need to checksum (or digitally sign) all objects stored in a file 
system and verify them on a regular basis.

Also helps to keep a separate list of those objects on another device so that 
when the metadata does take a hit, you can enumerate your objects and verify 
that you have not lost anything.
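
A rough sketch of that kind of verification pass, using FNV-1a purely as a
stand-in for a real checksum or digital signature (the command-line handling
and hash choice are assumptions for illustration):

/* Hash a file's contents and compare against a value recorded earlier,
 * ideally stored on a different device. */
#include <stdint.h>
#include <stdio.h>

static uint64_t fnv1a(FILE *f)
{
        uint64_t h = 0xcbf29ce484222325ULL;     /* FNV offset basis */
        int c;

        while ((c = fgetc(f)) != EOF) {
                h ^= (uint64_t)(unsigned char)c;
                h *= 0x100000001b3ULL;          /* FNV prime */
        }
        return h;
}

int main(int argc, char **argv)
{
        if (argc != 2 && argc != 3) {
                fprintf(stderr, "usage: %s FILE [EXPECTED-HASH]\n", argv[0]);
                return 2;
        }

        FILE *f = fopen(argv[1], "rb");
        if (!f) {
                perror("fopen");
                return 2;
        }
        unsigned long long h = (unsigned long long)fnv1a(f);
        fclose(f);

        if (argc == 2) {                        /* record mode */
                printf("%016llx\n", h);
                return 0;
        }

        unsigned long long expected;            /* verify mode */
        if (sscanf(argv[2], "%llx", &expected) != 1) {
                fprintf(stderr, "bad hash value\n");
                return 2;
        }
        if (h != expected) {
                fprintf(stderr, "%s: hash mismatch - possible silent corruption\n", argv[1]);
                return 1;
        }
        printf("%s: ok\n", argv[1]);
        return 0;
}

Run once to record the hash, and again later with the recorded value to
verify that the data is still intact.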

ric



^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:40                                               ` Ric Wheeler
@ 2009-08-25 23:48                                                 ` david
  2009-08-25 23:53                                                 ` Pavel Machek
  2009-08-27  3:53                                                 ` Rob Landley
  2 siblings, 0 replies; 309+ messages in thread
From: david @ 2009-08-25 23:48 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue, 25 Aug 2009, Ric Wheeler wrote:

> On 08/25/2009 07:26 PM, Pavel Machek wrote:
>> 
>>>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>>>> the page cache will lose data when you hot unplug its storage. End of
>>>>> story, don't do it!
>>>> 
>>>> No, not ext3 on SATA disk with barriers on and proper use of
>>>> fsync(). I actually tested that.
>>>> 
>>>> Yes, I should be able to hotunplug SATA drives and expect the data
>>>> that was fsync-ed to be there.
>>> 
>>> You can and will lose data (even after fsync) with any type of storage at
>>> some rate. What you are missing here is that data loss needs to be
>>> measured in hard numbers - say percentage of installed boxes that have
>>> config X that lose data.
>> 
>> I'm talking "by design" here.
>> 
>> I will lose data even on SATA drive that is properly powered on if I
>> wait 5 years.
>> 
>
> You are dead wrong.
>
> For RAID5 arrays, you assume that you have a hard failure and a power outage 
> before you can rebuild the RAID (order of hours at full tilt).

and that the power outage causes a corrupted write.

>>> I can promise you that hot unplugging and replugging a S-ATA drive will
>>> also lose you data if you are actively writing to it (ext2, 3, whatever).
>> 
>> I can promise you that running S-ATA drive will also lose you data,
>> even if you are not actively writing to it. Just wait 10 years; so
>> what is your point?
>
> I lost a s-ata drive 24 hours after installing it in a new box. If I had MD5 
> RAID5, I would not have lost any.

me too, in fact just after I copied data from a raid array to it so that I 
could rebuild the raid array differently :-(

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:40                                               ` Ric Wheeler
  2009-08-25 23:48                                                 ` david
@ 2009-08-25 23:53                                                 ` Pavel Machek
  2009-08-26  0:11                                                   ` Ric Wheeler
  2009-08-26  3:50                                                   ` Rik van Riel
  2009-08-27  3:53                                                 ` Rob Landley
  2 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-25 23:53 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

> Why don't you hold all of your most precious data on that single S-ATA 
> drive for five year on one box and put a second copy on a small RAID5 
> with ext3 for the same period?
>
> Repeat experiment until you get up to something like google scale or the 
> other papers on failures in national labs in the US and then we can have 
> an informed discussion.

I'm not interested in discussing statistics with you. I'd rather discuss
fsync() and storage design issues.

ext3 is designed to work on single SATA disks, and it is not designed
to work on flash cards/degraded MD RAID5s, as Ted acknowledged.

Because that fact is non-obvious to users, I'd like to see it
documented, and I now have a nice short writeup from Ted.

If you want to argue that the ext3/MD RAID5/no UPS combination is still
less likely to fail than a single SATA disk given part failure
probabilities, go ahead and present nice statistics. It's just that I'm
not interested in them.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-25 23:37                                             ` Pavel Machek
  2009-08-25 23:48                                               ` Ric Wheeler
@ 2009-08-25 23:56                                               ` david
  2009-08-26  0:12                                                 ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: david @ 2009-08-25 23:56 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Pavel Machek wrote:

> There are storage devices that have highly undesirable properties
> when they are disconnected or suffer power failures while writes are
> in progress; such devices include flash devices and MD RAID 4/5/6
> arrays.

change this to say 'degraded MD RAID 4/5/6 arrays'

also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly 
suspect that they do)

then you need to add a note that if the array becomes degraded before a 
scrub cycle happens, previously hidden damage (that would have been 
repaired by the scrub) can surface.

> These devices have the property of potentially corrupting blocks being 
> written at the time of the power failure,

this is true of all devices

> and worse yet, amplifying the region where blocks are corrupted such 
> that additional sectors are also damaged during the power failure.

re-word this to something like

In addition to the standard risk of corrupting the blocks being written at 
the time of the power failure, additional blocks (in the same flash 
eraseblock or raid stripe) may also be corrupted.

> Users who use such storage devices are well advised take
> countermeasures, such as the use of Uninterruptible Power Supplies,
> and making sure the flash device is not hot-unplugged while the device
> is being used.  Regular backups when using these devices is also a
> Very Good Idea.
>
> Otherwise, file systems placed on these devices can suffer silent data
> and file system corruption.  An forced use of fsck may detect metadata
> corruption resulting in file system corruption, but will not suffice
> to detect data corruption.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-25 23:48                                               ` Ric Wheeler
@ 2009-08-26  0:06                                                 ` Pavel Machek
  2009-08-26  0:12                                                   ` Ric Wheeler
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-26  0:06 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue 2009-08-25 19:48:09, Ric Wheeler wrote:
>
>> ---
>> There are storage devices that have highly undesirable properties
>> when they are disconnected or suffer power failures while writes are
>> in progress; such devices include flash devices and MD RAID 4/5/6
>> arrays.  These devices have the property of potentially
>> corrupting blocks being written at the time of the power failure, and
>> worse yet, amplifying the region where blocks are corrupted such that
>> additional sectors are also damaged during the power failure.
>
> I would strike the entire mention of MD devices since it is your 
> assertion, not a proven fact. You will cause more data loss from common 

That actually is a fact. That's how MD RAID 5 is designed. And btw
those are originally Ted's words.

> events (single sector errors, complete drive failure) by steering people 
> away from more reliable storage configurations because of a really rare 
> edge case (power failure during split write to two raid members while 
> doing a RAID rebuild).

I'm not sure what's rare about power failures. Unlike single sector
errors, my machine actually has a button that produces exactly that
event. Running degraded raid5 arrays for extended periods may be a
slightly unusual configuration, but I suspect people should just do
that for testing. (And from the discussion, people seem to think that
degraded raid5 is equivalent to raid0).

>> Otherwise, file systems placed on these devices can suffer silent data
>> and file system corruption.  An forced use of fsck may detect metadata
>> corruption resulting in file system corruption, but will not suffice
>> to detect data corruption.
>>
>
> This is very misleading. All storage "can" have silent data loss, you are 
> making a statement without specifics about frequency.

substitute with "can (by design)"?

Now, can you suggest a useful version of that document meeting your
criteria?

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:53                                                 ` Pavel Machek
@ 2009-08-26  0:11                                                   ` Ric Wheeler
  2009-08-26  0:16                                                     ` Pavel Machek
  2009-08-26  3:50                                                   ` Rik van Riel
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26  0:11 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/25/2009 07:53 PM, Pavel Machek wrote:
>> Why don't you hold all of your most precious data on that single S-ATA
>> drive for five year on one box and put a second copy on a small RAID5
>> with ext3 for the same period?
>>
>> Repeat experiment until you get up to something like google scale or the
>> other papers on failures in national labs in the US and then we can have
>> an informed discussion.
>
> I'm not interested in discussing statistics with you. I'd rather discuss
> fsync() and storage design issues.
>
> ext3 is designed to work on single SATA disks, and it is not designed
> to work on flash cards/degraded MD RAID5s, as Ted acknowledged.

You are simply incorrect, Ted did not say that ext3 does not work with MD raid5.

>
> Because that fact is non obvious to the users, I'd like to see it
> documented, and now have nice short writeup from Ted.
>
> If you want to argue that ext3/MD RAID5/no UPS combination is still
> less likely to fail than single SATA disk given part fail
> probabilities, go ahead and present nice statistics. Its just that I'm
> not interested in them.
> 									Pavel
>

That is a proven fact and a well published one. If you choose to ignore 
published work (and common sense) that RAID makes you lose data less than 
non-RAID, why should anyone care what you write?

Ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-25 23:56                                               ` david
@ 2009-08-26  0:12                                                 ` Pavel Machek
  2009-08-26  0:20                                                   ` david
  2009-08-26  0:26                                                   ` Ric Wheeler
  0 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-26  0:12 UTC (permalink / raw)
  To: david
  Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue 2009-08-25 16:56:40, david@lang.hm wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>> There are storage devices that have highly undesirable properties
>> when they are disconnected or suffer power failures while writes are
>> in progress; such devices include flash devices and MD RAID 4/5/6
>> arrays.
>
> change this to say 'degraded MD RAID 4/5/6 arrays'
>
> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly 
> suspect that they do)

I changed it to say MD/DM.

> then you need to add a note that if the array becomes degraded before a  
> scrub cycle happens previously hidden damage (that would have been  
> repaired by the scrub) can surface.

I'd prefer not to talk about scrubbing and such details here. Better to
leave a warning here and point to the MD documentation.

>> THESE devices have the property of potentially corrupting blocks being  
>> written at the time of the power failure,
>
> this is true of all devices

Actually I don't think so. I believe SATA disks do not corrupt even
the sector they are writing to -- they just have big enough
capacitors. And yes I believe ext3 depends on that.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:06                                                 ` Pavel Machek
@ 2009-08-26  0:12                                                   ` Ric Wheeler
  2009-08-26  0:20                                                     ` Pavel Machek
  0 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26  0:12 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On 08/25/2009 08:06 PM, Pavel Machek wrote:
> On Tue 2009-08-25 19:48:09, Ric Wheeler wrote:
>>
>>> ---
>>> There are storage devices that have highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>> arrays.  These devices have the property of potentially
>>> corrupting blocks being written at the time of the power failure, and
>>> worse yet, amplifying the region where blocks are corrupted such that
>>> additional sectors are also damaged during the power failure.
>>
>> I would strike the entire mention of MD devices since it is your
>> assertion, not a proven fact. You will cause more data loss from common
>
> That actually is a fact. That's how MD RAID 5 is designed. And btw
> those are originaly Ted's words.
>

Ted did not design MD RAID5.

>> events (single sector errors, complete drive failure) by steering people
>> away from more reliable storage configurations because of a really rare
>> edge case (power failure during split write to two raid members while
>> doing a RAID rebuild).
>
> I'm not sure what's rare about power failures. Unlike single sector
> errors, my machine actually has a button that produces exactly that
> event. Running degraded raid5 arrays for extended periods may be
> slightly unusual configuration, but I suspect people should just do
> that for testing. (And from the discussion, people seem to think that
> degraded raid5 is equivalent to raid0).

Power failures after a full drive failure with a split write during a rebuild?

>
>>> Otherwise, file systems placed on these devices can suffer silent data
>>> and file system corruption.  An forced use of fsck may detect metadata
>>> corruption resulting in file system corruption, but will not suffice
>>> to detect data corruption.
>>>
>>
>> This is very misleading. All storage "can" have silent data loss, you are
>> making a statement without specifics about frequency.
>
> substitute with "can (by design)"?

By Pavel's unproven casual observation?

>
> Now, if you can suggest useful version of that document meeting your
> criteria?
>
> 								Pavel


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  0:11                                                   ` Ric Wheeler
@ 2009-08-26  0:16                                                     ` Pavel Machek
  2009-08-26  0:31                                                       ` Ric Wheeler
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-26  0:16 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Tue 2009-08-25 20:11:21, Ric Wheeler wrote:
> On 08/25/2009 07:53 PM, Pavel Machek wrote:
>>> Why don't you hold all of your most precious data on that single S-ATA
>>> drive for five year on one box and put a second copy on a small RAID5
>>> with ext3 for the same period?
>>>
>>> Repeat experiment until you get up to something like google scale or the
>>> other papers on failures in national labs in the US and then we can have
>>> an informed discussion.
>>
>> I'm not interested in discussing statistics with you. I'd rather discuss
>> fsync() and storage design issues.
>>
>> ext3 is designed to work on single SATA disks, and it is not designed
>> to work on flash cards/degraded MD RAID5s, as Ted acknowledged.
>
> You are simply incorrect, Ted did not say that ext3 does not work
> with MD raid5.

http://lkml.org/lkml/2009/8/25/312
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:12                                                 ` Pavel Machek
@ 2009-08-26  0:20                                                   ` david
  2009-08-26  0:39                                                     ` Pavel Machek
  2009-08-26  0:26                                                   ` Ric Wheeler
  1 sibling, 1 reply; 309+ messages in thread
From: david @ 2009-08-26  0:20 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Pavel Machek wrote:

> On Tue 2009-08-25 16:56:40, david@lang.hm wrote:
>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>
>>> There are storage devices that have highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>> arrays.
>>
>> change this to say 'degraded MD RAID 4/5/6 arrays'
>>
>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
>> suspect that they do)
>
> I changed it to say MD/DM.
>
>> then you need to add a note that if the array becomes degraded before a
>> scrub cycle happens previously hidden damage (that would have been
>> repaired by the scrub) can surface.
>
> I'd prefer not to talk about scrubing and such details here. Better
> leave warning here and point to MD documentation.

I disagree with that; the way you are wording this makes it sound as if 
raid isn't worth it. if you are going to say that raid is risky you need 
to properly specify when it is risky.

>>> THESE devices have the property of potentially corrupting blocks being
>>> written at the time of the power failure,
>>
>> this is true of all devices
>
> Actually I don't think so. I believe SATA disks do not corrupt even
> the sector they are writing to -- they just have big enough
> capacitors. And yes I believe ext3 depends on that.

you are incorrect on this.

ext3 (like every other filesystem) just accepts the risk (zfs makes some 
attempt to detect such corruption)

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:12                                                   ` Ric Wheeler
@ 2009-08-26  0:20                                                     ` Pavel Machek
  2009-08-26  0:26                                                       ` david
                                                                         ` (2 more replies)
  0 siblings, 3 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-26  0:20 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

>>>> ---
>>>> There are storage devices that have highly undesirable properties
>>>> when they are disconnected or suffer power failures while writes are
>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>> arrays.  These devices have the property of potentially
>>>> corrupting blocks being written at the time of the power failure, and
>>>> worse yet, amplifying the region where blocks are corrupted such that
>>>> additional sectors are also damaged during the power failure.
>>>
>>> I would strike the entire mention of MD devices since it is your
>>> assertion, not a proven fact. You will cause more data loss from common
>>
>> That actually is a fact. That's how MD RAID 5 is designed. And btw
>> those are originaly Ted's words.
>
> Ted did not design MD RAID5.

So what? He clearly knows how it works.

Instead of arguing he's wrong, will you simply label everything as
unproven?

>>> events (single sector errors, complete drive failure) by steering people
>>> away from more reliable storage configurations because of a really rare
>>> edge case (power failure during split write to two raid members while
>>> doing a RAID rebuild).
>>
>> I'm not sure what's rare about power failures. Unlike single sector
>> errors, my machine actually has a button that produces exactly that
>> event. Running degraded raid5 arrays for extended periods may be
>> slightly unusual configuration, but I suspect people should just do
>> that for testing. (And from the discussion, people seem to think that
>> degraded raid5 is equivalent to raid0).
>
> Power failures after a full drive failure with a split write during a rebuild?

Look, I don't need a full drive failure for this to happen. I can just
remove one disk from the array. I don't need a power failure, I can just
press the power button. I don't even need to rebuild anything, I can
just write to the degraded array.

Given that all events are under my control, statistics make little
sense here.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:20                                                     ` Pavel Machek
@ 2009-08-26  0:26                                                       ` david
  2009-08-26  0:28                                                       ` Ric Wheeler
  2009-08-26  4:24                                                       ` Rik van Riel
  2 siblings, 0 replies; 309+ messages in thread
From: david @ 2009-08-26  0:26 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Pavel Machek wrote:

>>>>> ---
>>>>> There are storage devices that have highly undesirable properties
>>>>> when they are disconnected or suffer power failures while writes are
>>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>>> arrays.  These devices have the property of potentially
>>>>> corrupting blocks being written at the time of the power failure, and
>>>>> worse yet, amplifying the region where blocks are corrupted such that
>>>>> additional sectors are also damaged during the power failure.
>>>>
>>>> I would strike the entire mention of MD devices since it is your
>>>> assertion, not a proven fact. You will cause more data loss from common
>>>
>>> That actually is a fact. That's how MD RAID 5 is designed. And btw
>>> those are originaly Ted's words.
>>
>> Ted did not design MD RAID5.
>
> So what? He clearly knows how it works.
>
> Instead of arguing he's wrong, will you simply label everything as
> unproven?
>
>>>> events (single sector errors, complete drive failure) by steering people
>>>> away from more reliable storage configurations because of a really rare
>>>> edge case (power failure during split write to two raid members while
>>>> doing a RAID rebuild).
>>>
>>> I'm not sure what's rare about power failures. Unlike single sector
>>> errors, my machine actually has a button that produces exactly that
>>> event. Running degraded raid5 arrays for extended periods may be
>>> slightly unusual configuration, but I suspect people should just do
>>> that for testing. (And from the discussion, people seem to think that
>>> degraded raid5 is equivalent to raid0).
>>
>> Power failures after a full drive failure with a split write during a rebuild?
>
> Look, I don't need full drive failure for this to happen. I can just
> remove one disk from array. I don't need power failure, I can just
> press the power button. I don't even need to rebuild anything, I can
> just write to degraded array.
>
> Given that all events are under my control, statistics make little
> sense here.

if you are intentionally causing several low-probability things to happen 
at once you increase the risk of corruption

note that you also need a write to take place, and be interrupted in just 
the right way.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:12                                                 ` Pavel Machek
  2009-08-26  0:20                                                   ` david
@ 2009-08-26  0:26                                                   ` Ric Wheeler
  2009-08-26  0:44                                                     ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26  0:26 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On 08/25/2009 08:12 PM, Pavel Machek wrote:
> On Tue 2009-08-25 16:56:40, david@lang.hm wrote:
>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>
>>> There are storage devices that have highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>> arrays.
>>
>> change this to say 'degraded MD RAID 4/5/6 arrays'
>>
>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
>> suspect that they do)
>
> I changed it to say MD/DM.
>
>> then you need to add a note that if the array becomes degraded before a
>> scrub cycle happens previously hidden damage (that would have been
>> repaired by the scrub) can surface.
>
> I'd prefer not to talk about scrubing and such details here. Better
> leave warning here and point to MD documentation.

Then you should punt the MD discussion to the MD documentation entirely.

I would suggest:

"Users of any file system on a single medium (SSD, flash or normal disk) 
can suffer catastrophic and complete data loss if that single medium fails. 
To reduce your exposure to data loss after a single point of failure, consider 
using either hardware or properly configured software RAID. See the 
documentation on MD RAID for how to configure it.

To ensure proper fsync() semantics, you will need to have a storage device that 
supports write barriers or have a non-volatile write cache. If not, best 
practices dictate disabling the write cache on the storage device."

>
>>> THESE devices have the property of potentially corrupting blocks being
>>> written at the time of the power failure,
>>
>> this is true of all devices
>
> Actually I don't think so. I believe SATA disks do not corrupt even
> the sector they are writing to -- they just have big enough
> capacitors. And yes I believe ext3 depends on that.
> 								Pavel

Pavel, no S-ATA drive has capacitors to hold up during a power failure (or even 
enough power to destage its write cache). I know this from direct, personal 
knowledge having built RAID boxes at EMC for years. In fact, almost all RAID 
boxes require that the write cache be hardwired to off when used in their arrays.

Drives fail partially on a very common basis - look at your remapped sector 
count with smartctl.

RAID (including MD RAID5) will protect you from this most common error as it 
will protect you from complete drive failure which is also an extremely common 
event.

Your scenario is really, really rare - doing a full rebuild after a complete 
drive failure (takes a matter of hours, depends on the size of the disk) and 
having a power failure during that rebuild.

Of course adding a UPS to any storage system (including MD RAID system) helps 
make it more reliable, specifically in your scenario.

The more important point is that having any RAID (MD RAID1, RAID5 or RAID6) will 
greatly reduce your chance of data loss if configured correctly, with ext3, ext2 
or zfs.

Ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:20                                                     ` Pavel Machek
  2009-08-26  0:26                                                       ` david
@ 2009-08-26  0:28                                                       ` Ric Wheeler
  2009-08-26  0:38                                                         ` Pavel Machek
  2009-08-26  4:24                                                       ` Rik van Riel
  2 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26  0:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On 08/25/2009 08:20 PM, Pavel Machek wrote:
>>>>> ---
>>>>> There are storage devices that have highly undesirable properties
>>>>> when they are disconnected or suffer power failures while writes are
>>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>>> arrays.  These devices have the property of potentially
>>>>> corrupting blocks being written at the time of the power failure, and
>>>>> worse yet, amplifying the region where blocks are corrupted such that
>>>>> additional sectors are also damaged during the power failure.
>>>>
>>>> I would strike the entire mention of MD devices since it is your
>>>> assertion, not a proven fact. You will cause more data loss from common
>>>
>>> That actually is a fact. That's how MD RAID 5 is designed. And btw
>>> those are originaly Ted's words.
>>
>> Ted did not design MD RAID5.
>
> So what? He clearly knows how it works.
>
> Instead of arguing he's wrong, will you simply label everything as
> unproven?
>
>>>> events (single sector errors, complete drive failure) by steering people
>>>> away from more reliable storage configurations because of a really rare
>>>> edge case (power failure during split write to two raid members while
>>>> doing a RAID rebuild).
>>>
>>> I'm not sure what's rare about power failures. Unlike single sector
>>> errors, my machine actually has a button that produces exactly that
>>> event. Running degraded raid5 arrays for extended periods may be
>>> slightly unusual configuration, but I suspect people should just do
>>> that for testing. (And from the discussion, people seem to think that
>>> degraded raid5 is equivalent to raid0).
>>
>> Power failures after a full drive failure with a split write during a rebuild?
>
> Look, I don't need full drive failure for this to happen. I can just
> remove one disk from array. I don't need power failure, I can just
> press the power button. I don't even need to rebuild anything, I can
> just write to degraded array.
>
> Given that all events are under my control, statistics make little
> sense here.
> 								Pavel
>

You are deliberately causing a double failure - pressing the power button after 
pulling a drive is exactly that scenario.

Pull your single (non-MD5) disk out while writing (hot unplug from the S-ATA 
side, leaving power on) and run some tests to verify your assertions...

ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  0:16                                                     ` Pavel Machek
@ 2009-08-26  0:31                                                       ` Ric Wheeler
  2009-08-26  1:00                                                         ` Theodore Tso
  0 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26  0:31 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/25/2009 08:16 PM, Pavel Machek wrote:
> On Tue 2009-08-25 20:11:21, Ric Wheeler wrote:
>> On 08/25/2009 07:53 PM, Pavel Machek wrote:
>>>> Why don't you hold all of your most precious data on that single S-ATA
>>>> drive for five year on one box and put a second copy on a small RAID5
>>>> with ext3 for the same period?
>>>>
>>>> Repeat experiment until you get up to something like google scale or the
>>>> other papers on failures in national labs in the US and then we can have
>>>> an informed discussion.
>>>
>>> I'm not interested in discussing statistics with you. I'd rather discuss
>>> fsync() and storage design issues.
>>>
>>> ext3 is designed to work on single SATA disks, and it is not designed
>>> to work on flash cards/degraded MD RAID5s, as Ted acknowledged.
>>
>> You are simply incorrect, Ted did not say that ext3 does not work
>> with MD raid5.
>
> http://lkml.org/lkml/2009/8/25/312
> 									Pavel

I will let Ted clarify his text on his own, but the quoted text says "... have 
potential...".

Why not ask Neil if he designed MD to not work properly with ext3?

Ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:28                                                       ` Ric Wheeler
@ 2009-08-26  0:38                                                         ` Pavel Machek
  2009-08-26  0:45                                                           ` Ric Wheeler
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-26  0:38 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

>>>> I'm not sure what's rare about power failures. Unlike single sector
>>>> errors, my machine actually has a button that produces exactly that
>>>> event. Running degraded raid5 arrays for extended periods may be
>>>> slightly unusual configuration, but I suspect people should just do
>>>> that for testing. (And from the discussion, people seem to think that
>>>> degraded raid5 is equivalent to raid0).
>>>
>>> Power failures after a full drive failure with a split write during a rebuild?
>>
>> Look, I don't need full drive failure for this to happen. I can just
>> remove one disk from array. I don't need power failure, I can just
>> press the power button. I don't even need to rebuild anything, I can
>> just write to degraded array.
>>
>> Given that all events are under my control, statistics make little
>> sense here.
>
> You are deliberately causing a double failure - pressing the power button 
> after pulling a drive is exactly that scenario.

Exactly. And now I'm trying to get that documented, so that people
don't do it and still expect their fs to be consistent.

> Pull your single (non-MD5) disk out while writing (hot unplug from the 
> S-ATA side, leaving power on) and run some tests to verify your 
> assertions...

I actually did that some time ago by pulling a SATA disk (I actually
pulled both SATA *and* power -- that was the way the hotplug envelope
worked; that's a harsher test than what you suggest, so that should
be ok). The write test was fsync-heavy, with logging to a separate drive,
checking that all the data for which fsync had succeeded were indeed
accessible. I uncovered a few bugs in ext* that jack fixed, I uncovered
some libata weirdness that is not yet fixed AFAIK, but with all the
patches applied I could not break that single SATA disk.
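
A minimal sketch of that kind of test (paths and record format invented
for illustration; this is not the harness referred to above): append a
numbered record to the disk under test, fsync() it, and only after
fsync() succeeds log the record number to a file on a separate drive.
After the unplug, every number found in the log must correspond to an
intact record.

/* fsync-verification sketch; /mnt/testdisk and /mnt/otherdisk are
 * placeholder mount points for the disk under test and the log disk. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int data = open("/mnt/testdisk/records", O_WRONLY | O_CREAT | O_APPEND, 0644);
	int log  = open("/mnt/otherdisk/fsync.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
	unsigned long n;
	char rec[512], line[32];
	int len;

	if (data < 0 || log < 0) {
		perror("open");
		return 1;
	}

	for (n = 0; ; n++) {
		memset(rec, 'a' + n % 26, sizeof(rec));
		snprintf(rec, 32, "record %lu ", n);	/* tag the block with its number */

		if (write(data, rec, sizeof(rec)) != (ssize_t) sizeof(rec))
			break;
		if (fsync(data) != 0)		/* record must be durable first...   */
			break;

		len = snprintf(line, sizeof(line), "%lu\n", n);
		if (write(log, line, len) != len || fsync(log) != 0)
			break;			/* ...before we claim that it exists */
	}
	perror("write/fsync");
	return 1;
}
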
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:20                                                   ` david
@ 2009-08-26  0:39                                                     ` Pavel Machek
  2009-08-26  1:17                                                       ` david
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-26  0:39 UTC (permalink / raw)
  To: david
  Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue 2009-08-25 17:20:13, david@lang.hm wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>> On Tue 2009-08-25 16:56:40, david@lang.hm wrote:
>>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>>
>>>> There are storage devices that have highly undesirable properties
>>>> when they are disconnected or suffer power failures while writes are
>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>> arrays.
>>>
>>> change this to say 'degraded MD RAID 4/5/6 arrays'
>>>
>>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
>>> suspect that they do)
>>
>> I changed it to say MD/DM.
>>
>>> then you need to add a note that if the array becomes degraded before a
>>> scrub cycle happens, previously hidden damage (that would have been
>>> repaired by the scrub) can surface.
>>
>> I'd prefer not to talk about scrubbing and such details here. Better
>> leave warning here and point to MD documentation.
>
> I disagree with that, the way you are wording this makes it sound as if  
> raid isn't worth it. if you are going to say that raid is risky you need  
> to properly specify when it is risky

Ok, would this help? I don't really want to go to scrubbing details.

(*) Degraded array or single disk failure "near" the powerfail is
necessary for this property of RAID arrays to bite.

>>>> THESE devices have the property of potentially corrupting blocks being
>>>> written at the time of the power failure,
>>>
>>> this is true of all devices
>>
>> Actually I don't think so. I believe SATA disks do not corrupt even
>> the sector they are writing to -- they just have big enough
>> capacitors. And yes I believe ext3 depends on that.
>
> you are incorrect on this.
>
> ext3 (like every other filesystem) just accepts the risk (zfs makes some  
> attempt to detect such corruption)

I'd like Ted to comment on this. He wrote the original document, and
I'd prefer not to introduce mistakes.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:26                                                   ` Ric Wheeler
@ 2009-08-26  0:44                                                     ` Pavel Machek
  2009-08-26  0:50                                                       ` Ric Wheeler
  2009-08-26  1:19                                                       ` david
  0 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-26  0:44 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet


>>>> THESE devices have the property of potentially corrupting blocks being
>>>> written at the time of the power failure,
>>>
>>> this is true of all devices
>>
>> Actually I don't think so. I believe SATA disks do not corrupt even
>> the sector they are writing to -- they just have big enough
>> capacitors. And yes I believe ext3 depends on that.
>
> Pavel, no S-ATA drive has capacitors to hold up during a power failure 
> (or even enough power to destage their write cache). I know this from 
> direct, personal knowledge having built RAID boxes at EMC for years. In 
> fact, almost all RAID boxes require that the write cache be hardwired to 
> off when used in their arrays.

I never claimed they have enough power to flush entire cache -- read
the paragraph again. I do believe the disks have enough capacitors to
finish writing single sector, and I do believe ext3 depends on that.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:38                                                         ` Pavel Machek
@ 2009-08-26  0:45                                                           ` Ric Wheeler
  2009-08-26 11:21                                                             ` Pavel Machek
  0 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26  0:45 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On 08/25/2009 08:38 PM, Pavel Machek wrote:
>>>>> I'm not sure what's rare about power failures. Unlike single sector
>>>>> errors, my machine actually has a button that produces exactly that
>>>>> event. Running degraded raid5 arrays for extended periods may be
>>>>> slightly unusual configuration, but I suspect people should just do
>>>>> that for testing. (And from the discussion, people seem to think that
>>>>> degraded raid5 is equivalent to raid0).
>>>>
>>>> Power failures after a full drive failure with a split write during a rebuild?
>>>
>>> Look, I don't need full drive failure for this to happen. I can just
>>> remove one disk from array. I don't need power failure, I can just
>>> press the power button. I don't even need to rebuild anything, I can
>>> just write to degraded array.
>>>
>>> Given that all events are under my control, statistics make little
>>> sense here.
>>
>> You are deliberately causing a double failure - pressing the power button
>> after pulling a drive is exactly that scenario.
>
> Exactly. And now I'm trying to get that documented, so that people
> don't do it and still expect their fs to be consistent.

The problem I have is that the way you word it steers people away from RAID5 and 
better data integrity. Your intentions are good, but your text is going to do 
considerable harm.

Most people don't intentionally drop power (or have a power failure) during RAID 
rebuilds....

>
>> Pull your single (non-MD5) disk out while writing (hot unplug from the
>> S-ATA side, leaving power on) and run some tests to verify your
>> assertions...
>
> I actually did that some time ago by pulling a SATA disk (I actually
> pulled both SATA *and* power -- that was the way the hotplug envelope
> worked; that's a harsher test than what you suggest, so that should
> be ok). The write test was fsync-heavy, with logging to a separate drive,
> checking that all the data for which fsync had succeeded were indeed
> accessible. I uncovered a few bugs in ext* that jack fixed, I uncovered
> some libata weirdness that is not yet fixed AFAIK, but with all the
> patches applied I could not break that single SATA disk.
> 									Pavel


Fsync heavy workloads with working barriers will tend to keep the write cache 
pretty empty (two barrier flushes per fsync) so this is not too surprising.

Drive behaviour depends on a lot of things though - how the firmware prioritizes 
writes over reads, etc.

ric

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:44                                                     ` Pavel Machek
@ 2009-08-26  0:50                                                       ` Ric Wheeler
  2009-08-26  1:19                                                       ` david
  1 sibling, 0 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26  0:50 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On 08/25/2009 08:44 PM, Pavel Machek wrote:
>
>>>>> THESE devices have the property of potentially corrupting blocks being
>>>>> written at the time of the power failure,
>>>>
>>>> this is true of all devices
>>>
>>> Actually I don't think so. I believe SATA disks do not corrupt even
>>> the sector they are writing to -- they just have big enough
>>> capacitors. And yes I believe ext3 depends on that.
>>
>> Pavel, no S-ATA drive has capacitors to hold up during a power failure
>> (or even enough power to destage their write cache). I know this from
>> direct, personal knowledge having built RAID boxes at EMC for years. In
>> fact, almost all RAID boxes require that the write cache be hardwired to
>> off when used in their arrays.
>
> I never claimed they have enough power to flush entire cache -- read
> the paragraph again. I do believe the disks have enough capacitors to
> finish writing single sector, and I do believe ext3 depends on that.
>
> 									Pavel

Some scary terms that drive people mention (and measure):

"high fly writes"
"over powered seeks"
"adjacent tack erasure"

If you do get a partial track written, the data integrity bits that the data is 
embedded in will flag it as invalid and give you an IO error on the next read. 
Note that the damage is not persistent, it will get repaired (in place) on the 
next write to that sector.

Also it is worth noting that ext2/3/4 write file system "blocks", not single 
sectors. Each ext3 IO is 8 distinct disk sector writes (a 4 KB block over 512-byte 
sectors), and those can span tracks on a drive, which requires a seek; all of that 
consumes power.
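
For illustration only, the arithmetic behind that "8 distinct disk sector
writes" figure (a 4 KB filesystem block over 512-byte logical sectors) can
be checked against a real device with the BLKSSZGET ioctl; the device path
below is just an example:

/* Print how many logical-sector writes one 4096-byte block turns into. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>		/* BLKSSZGET */

int main(void)
{
	int fd = open("/dev/sda", O_RDONLY);
	int sector = 0;

	if (fd < 0 || ioctl(fd, BLKSSZGET, &sector) != 0) {
		perror("/dev/sda");
		return 1;
	}
	printf("%d-byte sectors: one 4096-byte block is %d sector writes\n",
	       sector, 4096 / sector);
	close(fd);
	return 0;
}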

On power loss, a disk will immediately park the heads...

ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  0:31                                                       ` Ric Wheeler
@ 2009-08-26  1:00                                                         ` Theodore Tso
  2009-08-26  1:15                                                           ` Ric Wheeler
                                                                             ` (6 more replies)
  0 siblings, 7 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-26  1:00 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Tue, Aug 25, 2009 at 08:31:21PM -0400, Ric Wheeler wrote:
>>> You are simply incorrect, Ted did not say that ext3 does not work
>>> with MD raid5.
>>
>> http://lkml.org/lkml/2009/8/25/312
>> 									Pavel
>
> I will let Ted clarify his text on his own, but the quoted text says "... 
> have potential...".
>
> Why not ask Neil if he designed MD to not work properly with ext3?

So let me clarify by saying the following things.   

1) Filesystems are designed to expect that storage devices have
certain properties.  These include returning the same data that you
wrote, and that an error when writing a sector, or a power failure
when writing a sector, should not be amplified to cause collateral
damage to previously successfully written sectors.

2) Degraded RAID 5/6 arrays do not meet these properties.
Neither do cheap flash drives.  This increases the chances that you can
lose data, bigtime.

3) Does that mean that you shouldn't use ext3 on RAID drives?  Of
course not!  First of all, Ext3 still saves you against kernel panics
and hangs caused by device driver bugs or other kernel hangs.  You
will lose less data, and avoid needing to run a long and painful fsck
after a forced reboot, compared to if you used ext2.  You are making
an assumption that the only time the journal gets replayed is
after a power failure.  But if the system hangs, and you need to hit
the Big Red Switch, or if you are using the system in a Linux High
Availability setup and the ethernet card fails, so the STONITH ("shoot
the other node in the head") system forces a hard reset of the system,
or you get a kernel panic which forces a reboot, in all of these cases
ext3 will save you from a long fsck, and it will do so safely.

Secondly, what's the probability that a failure causes the RAID array to
become degraded, followed by a power failure, versus a power failure
while the RAID array is not running in degraded mode?  Hopefully you
are running with the RAID array in full, proper running order a much
larger percentage of the time than running with the RAID array in
degraded mode.  If not, the bug is with the system administrator!

If you are someone who tends to run for long periods of time in
degraded mode --- then better get a UPS.  And certainly if you want to
avoid the chances of failure, periodically scrubbing the disks so you
detect hard drive failures early, instead of waiting until a disk
fails before letting the rebuild find the dreaded "second failure"
which causes data loss, is a d*mned good idea.

Maybe a random OS engineer doesn't know these things --- but trust me
when I say a competent system administrator had better be familiar
with these concepts.  And someone who wants their data to be reliably
stored needs to do some basic storage engineering if they want to have
long-term data reliability.  (That, or maybe they should outsource
their long-term reliable storage to some service such as Amazon S3 ---
see Jeremy Zawodny's analysis about how it can be cheaper, here: 
http://jeremy.zawodny.com/blog/archives/007624.html)

But we *do* need to be careful that we don't write documentation which
ends up giving users the wrong impression.  The bottom line is that
you're better off using ext3 over ext2, even on a RAID array, for the
reasons listed above.

Are you better off using ext3 over ext2 on a crappy flash drive?
Maybe --- if you are also using crappy proprietary video drivers, such
as Ubuntu ships, where every single time you exit a 3d game the system
crashes (and Ubuntu users accept this as normal?!?), then ext3 might
be a better choice since you'll reduce the chance of data loss when
the system locks up or crashes thanks to the aforementioned crappy
proprietary video drivers from Nvidia.  On the other hand, crappy
flash drives *do* have really bad write amplification effects, where a
4K write can cause 128k or more worth of flash to be rewritten, such
that using ext3 could seriously degrade the lifetime of said crappy
flash drive; furthermore, the crappy flash drives have such terrible
write performance that using ext3 can be a performance nightmare.
This of course, doesn't apply to well-implemented SSD's, such as the
Intel's X25-M and X18-M.  So here your mileage may vary.  Still, if
you are using crappy proprietary drivers which cause system hangs and
crashes at a far greater rate than power fail-induced unclean
shutdowns, ext3 *still* might be the better choice, even with crappy
flash drives.

The best thing to do, of course, is to improve your storage stack; use
competently implemented SSD's instead of crap flash cards.  If your
hardware RAID card supports a battery option, *get* the battery.  Add
a UPS to your system.  Provision your RAID array with hot spares, and
regularly scrub (read-test) your array so that failed drives can be
detected early.  Make sure you configure your MD setup so that you get
e-mail when a hard drive fails and the array starts running in
degraded mode, so you can replace the failed drive ASAP.

At the end of the day, filesystems are not magic.  They can't
compensate for crap hardware, or incompetently administered machines.

							- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  1:00                                                         ` Theodore Tso
@ 2009-08-26  1:15                                                           ` Ric Wheeler
  2009-08-26  2:58                                                             ` Theodore Tso
  2009-08-26  1:15                                                           ` Ric Wheeler
                                                                             ` (5 subsequent siblings)
  6 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26  1:15 UTC (permalink / raw)
  To: Theodore Tso, Ric Wheeler, Pavel Machek, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 08/25/2009 09:00 PM, Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 08:31:21PM -0400, Ric Wheeler wrote:
>    
>>>> You are simply incorrect, Ted did not say that ext3 does not work
>>>> with MD raid5.
>>>>          
>>> http://lkml.org/lkml/2009/8/25/312
>>> 									Pavel
>>>        
>> I will let Ted clarify his text on his own, but the quoted text says "...
>> have potential...".
>>
>> Why not ask Neil if he designed MD to not work properly with ext3?
>>      
> So let me clarify by saying the following things.
>
> 1) Filesystems are designed to expect that storage devices have
> certain properties.  These include returning the same data that you
> wrote, and that an error when writing a sector, or a power failure
> when writing a sector, should not be amplified to cause collateral
> damage to previously successfully written sectors.
>
> 2) Degraded RAID 5/6 arrays do not meet these properties.
> Neither do cheap flash drives.  This increases the chances that you can
> lose data, bigtime.
>
>    

I agree with the whole write up outside of the above - degraded RAID 
does meet this requirement unless you have a second (or third, counting 
the split write) failure during the rebuild.

Note that the window of exposure during a RAID rebuild is linear with 
the size of your disk and how much you detune the rebuild...

ric

> 3) Does that mean that you shouldn't use ext3 on RAID drives?  Of
> course not!  First of all, Ext3 still saves you against kernel panics
> and hangs caused by device driver bugs or other kernel hangs.  You
> will lose less data, and avoid needing to run a long and painful fsck
> after a forced reboot, compared to if you used ext2.  You are making
> an assumption that the only time the journal gets replayed is
> after a power failure.  But if the system hangs, and you need to hit
> the Big Red Switch, or if you are using the system in a Linux High
> Availability setup and the ethernet card fails, so the STONITH ("shoot
> the other node in the head") system forces a hard reset of the system,
> or you get a kernel panic which forces a reboot, in all of these cases
> ext3 will save you from a long fsck, and it will do so safely.
>
> Secondly, what's the probability that a failure causes the RAID array to
> become degraded, followed by a power failure, versus a power failure
> while the RAID array is not running in degraded mode?  Hopefully you
> are running with the RAID array in full, proper running order a much
> larger percentage of the time than running with the RAID array in
> degraded mode.  If not, the bug is with the system administrator!
>
> If you are someone who tends to run for long periods of time in
> degraded mode --- then better get a UPS.  And certainly if you want to
> avoid the chances of failure, periodically scrubbing the disks so you
> detect hard drive failures early, instead of waiting until a disk
> fails before letting the rebuild find the dreaded "second failure"
> which causes data loss, is a d*mned good idea.
>
> Maybe a random OS engineer doesn't know these things --- but trust me
> when I say a competent system administrator had better be familiar
> with these concepts.  And someone who wants their data to be reliably
> stored needs to do some basic storage engineering if they want to have
> long-term data reliability.  (That, or maybe they should outsource
> their long-term reliable storage to some service such as Amazon S3 ---
> see Jeremy Zawodny's analysis about how it can be cheaper, here:
> http://jeremy.zawodny.com/blog/archives/007624.html)
>
> But we *do* need to be careful that we don't write documentation which
> ends up giving users the wrong impression.  The bottom line is that
> you're better off using ext3 over ext2, even on a RAID array, for the
> reasons listed above.
>
> Are you better off using ext3 over ext2 on a crappy flash drive?
> Maybe --- if you are also using crappy proprietary video drivers, such
> as Ubuntu ships, where every single time you exit a 3d game the system
> crashes (and Ubuntu users accept this as normal?!?), then ext3 might
> be a better choice since you'll reduce the chance of data loss when
> the system locks up or crashes thanks to the aforementioned crappy
> proprietary video drivers from Nvidia.  On the other hand, crappy
> flash drives *do* have really bad write amplification effects, where a
> 4K write can cause 128k or more worth of flash to be rewritten, such
> that using ext3 could seriously degrade the lifetime of said crappy
> flash drive; furthermore, the crappy flash drives have such terrible
> write performance that using ext3 can be a performance nightmare.
> This of course, doesn't apply to well-implemented SSD's, such as the
> Intel's X25-M and X18-M.  So here your mileage may vary.  Still, if
> you are using crappy proprietary drivers which cause system hangs and
> crashes at a far greater rate than power fail-induced unclean
> shutdowns, ext3 *still* might be the better choice, even with crappy
> flash drives.
>
> The best thing to do, of course, is to improve your storage stack; use
> competently implemented SSD's instead of crap flash cards.  If your
> hardware RAID card supports a battery option, *get* the battery.  Add
> a UPS to your system.  Provision your RAID array with hot spares, and
> regularly scrub (read-test) your array so that failed drives can be
> detected early.  Make sure you configure your MD setup so that you get
> e-mail when a hard drive fails and the array starts running in
> degraded mode, so you can replace the failed drive ASAP.
>
> At the end of the day, filesystems are not magic.  They can't
> compensate for crap hardware, or incompetently administered machines.
>
> 							- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>    


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  1:00                                                         ` Theodore Tso
                                                                             ` (2 preceding siblings ...)
  2009-08-26  1:16                                                           ` Pavel Machek
@ 2009-08-26  1:16                                                           ` Pavel Machek
  2009-08-26  2:55                                                             ` Theodore Tso
  2009-08-26  2:53                                                           ` Henrique de Moraes Holschuh
                                                                             ` (2 subsequent siblings)
  6 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-26  1:16 UTC (permalink / raw)
  To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

Hi!

> 3) Does that mean that you shouldn't use ext3 on RAID drives?  Of
> course not!  First of all, Ext3 still saves you against kernel panics
> and hangs caused by device driver bugs or other kernel hangs.  You
> will lose less data, and avoid needing to run a long and painful fsck
> after a forced reboot, compared to if you used ext2.  You are making

Actually... ext3 + MD RAID5 will still have a problem on kernel
panic. MD RAID5 is implemented in software, so if kernel panics, you
can still get inconsistent data in your array.

I mostly agree with the rest.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:39                                                     ` Pavel Machek
@ 2009-08-26  1:17                                                       ` david
  0 siblings, 0 replies; 309+ messages in thread
From: david @ 2009-08-26  1:17 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Pavel Machek wrote:

> On Tue 2009-08-25 17:20:13, david@lang.hm wrote:
>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>
>>> On Tue 2009-08-25 16:56:40, david@lang.hm wrote:
>>>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>>>
>>>>> There are storage devices that have highly undesirable properties
>>>>> when they are disconnected or suffer power failures while writes are
>>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>>> arrays.
>>>>
>>>> change this to say 'degraded MD RAID 4/5/6 arrays'
>>>>
>>>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
>>>> suspect that they do)
>>>
>>> I changed it to say MD/DM.
>>>
>>>> then you need to add a note that if the array becomes degraded before a
>>>> scrub cycle happens, previously hidden damage (that would have been
>>>> repaired by the scrub) can surface.
>>>
>>> I'd prefer not to talk about scrubbing and such details here. Better
>>> leave warning here and point to MD documentation.
>>
>> I disagree with that, the way you are wording this makes it sound as if
>> raid isn't worth it. if you are going to say that raid is risky you need
>> to properly specify when it is risky
>
> Ok, would this help? I don't really want to go to scrubbing details.
>
> (*) Degraded array or single disk failure "near" the powerfail is
> necessary for this property of RAID arrays to bite.

that sounds reasonable

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:44                                                     ` Pavel Machek
  2009-08-26  0:50                                                       ` Ric Wheeler
@ 2009-08-26  1:19                                                       ` david
  2009-08-26 11:25                                                         ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: david @ 2009-08-26  1:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Pavel Machek wrote:

>>>>> THESE devices have the property of potentially corrupting blocks being
>>>>> written at the time of the power failure,
>>>>
>>>> this is true of all devices
>>>
>>> Actually I don't think so. I believe SATA disks do not corrupt even
>>> the sector they are writing to -- they just have big enough
>>> capacitors. And yes I believe ext3 depends on that.
>>
>> Pavel, no S-ATA drive has capacitors to hold up during a power failure
>> (or even enough power to destage their write cache). I know this from
>> direct, personal knowledge having built RAID boxes at EMC for years. In
>> fact, almost all RAID boxes require that the write cache be hardwired to
>> off when used in their arrays.
>
> I never claimed they have enough power to flush entire cache -- read
> the paragraph again. I do believe the disks have enough capacitors to
> finish writing single sector, and I do believe ext3 depends on that.

keep in mind that in a powerfail situation the data being sent to the 
drive may be corrupt (the ram gets flaky while a DMA to the drive copies 
the bad data to the drive, which writes it before the power loss gets bad 
enough for the drive to decide there is a problem and shutdown)

you just plain cannot count on writes that are in flight when a powerfail 
happens to do predictable things, let alone what you consider sane or 
proper.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:28                 ` Neil Brown
  (?)
@ 2009-08-26  1:34                 ` david
  -1 siblings, 0 replies; 309+ messages in thread
From: david @ 2009-08-26  1:34 UTC (permalink / raw)
  To: Neil Brown
  Cc: Greg Freemyer, Pavel Machek, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, tytso, rdunlap,
	linux-doc, linux-ext4

On Wed, 26 Aug 2009, Neil Brown wrote:

> On Monday August 24, greg.freemyer@gmail.com wrote:
>>> +Don't damage the old data on a failed write (ATOMIC-WRITES)
>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> +
>>> +Either whole sector is correctly written or nothing is written during
>>> +powerfail.
>>> +
>>> +       Because RAM tends to fail faster than rest of system during
>>> +       powerfail, special hw killing DMA transfers may be necessary;
>>> +       otherwise, disks may write garbage during powerfail.
>>> +       This may be quite common on generic PC machines.
>>> +
>>> +       Note that atomic write is very hard to guarantee for RAID-4/5/6,
>>> +       because it needs to write both changed data, and parity, to
>>> +       different disks. (But it will only really show up in degraded mode).
>>> +       UPS for RAID array should help.
>>
>> Can someone clarify if this is true in raid-6 with just a single disk
>> failure?  I don't see why it would be.
>
> It does affect raid6 with a single drive missing.
>
> After an unclean shutdown you cannot trust any Parity block as it
> is possible that some of the blocks in the stripe have been updated,
> but others have not.  So you must assume that all parity blocks are
> wrong and update them.  If you have a missing disk you cannot do that.
>
> To take a more concrete example, imagine a 5 device RAID6 with
> 3 data blocks D0 D1 D2 as well a P and Q on some stripe.
> Suppose that we crashed while updating D0, which would have involved
> writing out D0, P and Q.
> On restart, suppose D2 is missing. It is possible that 0, 1, 2, or 3
> of D0, P and Q have been updated and the others not.
> We can try to recompute D2 from D0 D1 and P, from
> D0 P and Q or from D1, P and Q.
>
> We could conceivably try each of those and if they all produce the
> same result we might be confident of it.
> If two produced the same result and the other was different we could
> use a voting process to choose the 'best'.  And in this particular
> case I think that would work.  If 0 or 3 had been updated, all would
> be the same.  If only 1 was updated, then the combinations that
> exclude it will match.  If 2 were updated, then the combinations that
> exclude the non-updated block will match.
>
> But if both D0 and D1 were being updated I think there would be too
> many combinations and it would be very possible that all three
> computed values for D2 would be different.
>
> So yes: a singly degraded RAID6 cannot promise no data corruption
> after an unclean shutdown.  That is why "mdadm" will not assemble such
> an array unless you use "--force" to acknowledge that there has been a
> problem.

thanks for this detail, I would not have expected a partially degraded 
raid 6 array to be this sensitive to problems.

assuming that the degradation happens prior to the power failure, what 
could be done to make this safer and more predictable.

off the top of my head (and possibly an extreme performance hit, not 
necessarily suitable for everyone) is there something that could be done 
with ordering the writes to the various drives?

David Lang
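
To make the reconstruction ambiguity described above concrete, here is a
minimal sketch: one byte per "block", data D0 D1 D2 plus P and Q on a
5-device RAID6, using the usual GF(2^8) arithmetic (generator 2,
polynomial 0x11d); all values are invented for illustration.  An update
of D0 is interrupted so that the new D0 is on disk while P and Q are
still the old ones, and the disk holding D2 is missing on restart:

/* Neil's scenario in miniature. */
#include <stdio.h>

static unsigned char mul2(unsigned char v)
{
	return (v << 1) ^ ((v & 0x80) ? 0x1d : 0);
}

static unsigned char gmul(unsigned char a, unsigned char b)
{
	unsigned char r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = mul2(a);
		b >>= 1;
	}
	return r;
}

static unsigned char ginv(unsigned char a)
{
	unsigned char r = 1;
	int i;

	for (i = 0; i < 254; i++)	/* a^254 == a^-1 in GF(2^8) */
		r = gmul(r, a);
	return r;
}

int main(void)
{
	unsigned char d0 = 0x11, d1 = 0x22, d2 = 0x33;
	unsigned char p = d0 ^ d1 ^ d2;
	unsigned char q = d0 ^ gmul(2, d1) ^ gmul(4, d2);

	d0 = 0x99;	/* new D0 reached its disk; the P and Q updates did not */
	/* ... and the disk holding D2 is missing when the array comes back */

	/* candidate reconstructions of D2: */
	unsigned char a = p ^ d0 ^ d1;		/* from D0, D1 and P */
	/* from D1, P and Q (D0 unknown): P ^ Q = 3*D1 ^ 5*D2 */
	unsigned char b = gmul(ginv(5), p ^ q ^ gmul(3, d1));
	/* from D0, P and Q (D1 unknown): Q ^ 2*P = 3*D0 ^ 6*D2 */
	unsigned char c = gmul(ginv(6), q ^ gmul(2, p) ^ gmul(3, d0));

	printf("D2 on the lost disk was 0x%02x; candidates are 0x%02x 0x%02x 0x%02x\n",
	       d2, a, b, c);
	return 0;
}

In this particular run the candidate that excludes the partially updated
D0 happens to equal the old D2, but nothing in the surviving on-disk
state says which candidate, if any, to believe.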

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  1:00                                                         ` Theodore Tso
                                                                             ` (4 preceding siblings ...)
  2009-08-26  2:53                                                           ` Henrique de Moraes Holschuh
@ 2009-08-26  2:53                                                           ` Henrique de Moraes Holschuh
  2009-09-03  9:47                                                             ` Pavel Machek
  6 siblings, 0 replies; 309+ messages in thread
From: Henrique de Moraes Holschuh @ 2009-08-26  2:53 UTC (permalink / raw)
  To: Theodore Tso, Ric Wheeler, Pavel Machek, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Tue, 25 Aug 2009, Theodore Tso wrote:
> a UPS to your system.  Provision your RAID array with hot spares, and
> regularly scrub (read-test) your array so that failed drives can be

Can we get a proper scrub function (full rewrite of all component
disks), please?  Not every disk out there will stop a streaming read to
rewrite weak sectors it happens to come across.

> detected early.  Make sure you configure your MD setup so that you get
> e-mail when a hard drive fails and the array starts running in
> degraded mode, so you can replace the failed drive ASAP.

Debian got this right :-)

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  1:16                                                           ` Pavel Machek
@ 2009-08-26  2:55                                                             ` Theodore Tso
  2009-08-26 13:37                                                               ` Ric Wheeler
  2009-08-26 13:37                                                               ` Ric Wheeler
  0 siblings, 2 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-26  2:55 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Wed, Aug 26, 2009 at 03:16:06AM +0200, Pavel Machek wrote:
> Hi!
> 
> > 3) Does that mean that you shouldn't use ext3 on RAID drives?  Of
> > course not!  First of all, Ext3 still saves you against kernel panics
> > and hangs caused by device driver bugs or other kernel hangs.  You
> > will lose less data, and avoid needing to run a long and painful fsck
> > after a forced reboot, compared to if you used ext2.  You are making
> 
> Actually... ext3 + MD RAID5 will still have a problem on kernel
> panic. MD RAID5 is implemented in software, so if kernel panics, you
> can still get inconsistent data in your array.

Only if the MD RAID array is running in degraded mode (and again, if
the system is in this state for a long time, the bug is in the system
administrator).  And even then, it depends on how the kernel dies.  If
the system hangs due to some deadlock, or we get an OOPS that kills a
process while still holding some locks, and that leads to a deadlock,
it's likely the low-level MD driver can still complete the stripe
write, and no data will be lost.  If the kernel ties itself in knots
due to running out of memory, and the OOM handler is invoked, someone
hitting the reset button to force a reboot will also be fine.

If the RAID array is degraded, and we get an oops in interrupt
handler, such that the system is immediately halted --- then yes, data
could get lost.  But there are many system crashes where the software
RAID's ability to complete a stripe write would not be compromised.

       	       	  	     	    	  	- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  1:15                                                           ` Ric Wheeler
@ 2009-08-26  2:58                                                             ` Theodore Tso
  2009-08-26 10:39                                                               ` Ric Wheeler
                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-26  2:58 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Pavel Machek, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote:
>
> I agree with the whole write up outside of the above - degraded RAID  
> does meet this requirement unless you have a second (or third, counting  
> the split write) failure during the rebuild.

The argument is that if the degraded RAID array is running in this
state for a long time, and the power fails while the software RAID is
in the middle of writing out a stripe, such that the stripe isn't
completely written out, we could lose all of the data in that stripe.

In other words, a power failure in the middle of writing out a stripe
in a degraded RAID array counts as a second failure.

To me, this isn't a particularly interesting or newsworthy point,
since a competent system administrator who cares about his data and/or
his hardware will (a) have a UPS, and (b) be running with a hot spare
and/or will immediately replace a failed drive in a RAID array.

       	    	       	       	 	      - Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25  9:34                                 ` Pavel Machek
  2009-08-25 15:34                                   ` david
@ 2009-08-26  3:32                                   ` Rik van Riel
  2009-08-26 11:17                                     ` Pavel Machek
  2009-08-27  5:27                                     ` Rob Landley
  1 sibling, 2 replies; 309+ messages in thread
From: Rik van Riel @ 2009-08-26  3:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

Pavel Machek wrote:

>> So, would you be happy if ext3 fsck was always run on reboot (at least  
>> for flash devices)?
> 
> For flash devices, MD Raid 5 and anything else that needs it; yes that
> would make me happy ;-).

Sorry, but that just shows your naivete.

Metadata takes up such a small part of the disk that fscking
it and finding it to be OK is absolutely no guarantee that
the data on the filesystem has not been horribly mangled.

Personally, what I care about is my data.

The metadata is just a way to get to my data, while the data
is actually important.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:53                                                 ` Pavel Machek
  2009-08-26  0:11                                                   ` Ric Wheeler
@ 2009-08-26  3:50                                                   ` Rik van Riel
  1 sibling, 0 replies; 309+ messages in thread
From: Rik van Riel @ 2009-08-26  3:50 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

Pavel Machek wrote:

> If you want to argue that ext3/MD RAID5/no UPS combination is still
> less likely to fail than single SATA disk given part fail
> probabilities, go ahead and present nice statistics. Its just that I'm
> not interested in them.

The reality in your document does not match up with the reality
out there in the world.  That sounds like a good reason not to
have your (incorrect) document out there, confusing people.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:44                                         ` Pavel Machek
@ 2009-08-26  4:08                                           ` Rik van Riel
  2009-08-26 11:15                                             ` Pavel Machek
  0 siblings, 1 reply; 309+ messages in thread
From: Rik van Riel @ 2009-08-26  4:08 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Neil Brown, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Pavel Machek wrote:

> Ok, can you help? Having a piece of MD documentation explaining the
> "powerfail nukes entire stripe" and how current filesystems do not
> deal with that would be nice, along with description when exactly that
> happens.

Except of course for the inconvenient detail that a power
failure on a degraded RAID 5 array does *NOT* nuke the
entire stripe.

A 5-disk RAID 5 array will have 4 data blocks and 1 parity
block in each stripe.  A degraded array will have either
4 data blocks or 3 data blocks and 1 parity block in the
stripe.

If we are dealing with a parity-less stripe, we cannot
lose any data due to RAID 5, because each of the 4 data
blocks has a disk block available.  We could still lose
a data write due to a power failure, but this could also
happen with the RAID 5 array still intact.

If we are dealing with a 3-data, 1-parity stripe, then
3 of the 4 data blocks have an available disk block and
will not be lost (if they make it to disk).  The only
block whose recovery depends on all 3 data blocks and the parity
block being correct is the block that does not currently
have a disk to be written to.

In short, if a stripe is not written completely on a
degraded RAID 5 array, you can lose:
1) the blocks that were not written (duh)
2) the block that doesn't have a disk

The first part of this loss is also true in a non-degraded
RAID 5 array.  The fact that the array is degraded really
does not add much additional data loss here and you certainly
will not lose the entire stripe like you suggest.
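
To make the arithmetic concrete, here is a minimal sketch (illustrative
only: XOR parity, made-up block values and a hypothetical 5-disk layout;
the real md code obviously differs):

def parity(blocks):
    """XOR parity over a set of equal-sized blocks (ints stand in for blocks)."""
    p = 0
    for b in blocks:
        p ^= b
    return p

# Healthy layout: data blocks d0..d3 on disks 0-3, parity on disk 4.
old = [0x11, 0x22, 0x33, 0x44]

# Disk 2 has failed, so d2 now exists only implicitly (other data + parity).
# A stripe update starts, but power fails after d0 and the parity block were
# rewritten and before d1 reached the platter.
new = [0xAA, 0xBB, old[2], 0x44]            # what the filesystem intended
on_disk = {0: 0xAA, 1: old[1], 3: 0x44}     # d1 still holds the old value
on_disk_parity = parity(new)                # parity already reflects new data

# Rebuilding the missing d2 from what is actually on the surviving disks:
d2_rebuilt = on_disk[0] ^ on_disk[1] ^ on_disk[3] ^ on_disk_parity
print(hex(d2_rebuilt), "expected", hex(old[2]))   # 0xaa vs 0x33 -- corrupted

The surviving data blocks are no worse off than on a single disk; the
damage is confined to the block that has no physical disk, as described
above.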

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-25 22:40                                         ` Pavel Machek
  2009-08-25 22:59                                           ` david
@ 2009-08-26  4:20                                           ` Rik van Riel
  1 sibling, 0 replies; 309+ messages in thread
From: Rik van Riel @ 2009-08-26  4:20 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Theodore Tso, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Pavel Machek wrote:

> Lets say you are writing to the (healthy) RAID5 and have a powerfail.
> 
> So now data blocks do not correspond to the parity block. You don't
> yet have the corruption, but you already have a problem.
> 
> If you get a disk failing at this point, you'll get corruption.

Not necessarily.  Say you wrote out the entire stripe
in a 5 disk RAID 5 array, but only 3 data blocks and
the parity block got written out before power failure.

If the disk with the 4th (unwritten) data block were
to fail and get taken out of the RAID 5 array, the
degradation of the array could actually undo your data
corruption.

With RAID 5 and incomplete writes, you just don't know.

This kind of thing could go wrong at any level in the
system, with any kind of RAID 5 setup.

Of course, on a single disk system without RAID you can
still get incomplete writes, for the exact same reasons.

RAID 5 does not make things worse.  It will protect your
data against certain failure modes, but not against others.

With or without RAID, you still need to make backups.
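
A similar toy sketch of the non-degraded case above (again purely
illustrative, with XOR parity and invented values), showing that which
disk fails afterwards decides whether the interrupted write is exposed or
effectively completed:

def parity(blocks):
    p = 0
    for b in blocks:
        p ^= b
    return p

new  = [0xA1, 0xB2, 0xC3, 0xD4]        # the stripe the filesystem wanted
disk = [0xA1, 0xB2, 0xC3, 0x40]        # d3 never made it out before power loss
disk_parity = parity(new)              # but parity for the new stripe did

def rebuild(lost):
    """Reconstruct one lost data block from the remaining blocks plus parity."""
    r = disk_parity
    for i, b in enumerate(disk):
        if i != lost:
            r ^= b
    return r

print(hex(rebuild(3)))   # 0xd4: losing the stale disk "completes" the write
print(hex(rebuild(0)))   # garbage, not 0xa1: losing a written disk exposes
                         # the stale d3 through the mismatched parity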

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:20                                                     ` Pavel Machek
  2009-08-26  0:26                                                       ` david
  2009-08-26  0:28                                                       ` Ric Wheeler
@ 2009-08-26  4:24                                                       ` Rik van Riel
  2009-08-26 11:22                                                         ` Pavel Machek
  2 siblings, 1 reply; 309+ messages in thread
From: Rik van Riel @ 2009-08-26  4:24 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Pavel Machek wrote:

> Look, I don't need full drive failure for this to happen. I can just
> remove one disk from array. I don't need power failure, I can just
> press the power button. I don't even need to rebuild anything, I can
> just write to degraded array.
> 
> Given that all events are under my control, statistics make little
> sense here.

I recommend a sledgehammer.

If you want to lose your data, you might as well have some fun.

No need to bore yourself to tears by simulating events that are
unlikely to happen simultaneously to careful system administrators.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  2:58                                                             ` Theodore Tso
  2009-08-26 10:39                                                               ` Ric Wheeler
@ 2009-08-26 10:39                                                               ` Ric Wheeler
  2009-08-26 11:12                                                                 ` Pavel Machek
  2009-08-27  5:19                                                               ` Rob Landley
  2 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26 10:39 UTC (permalink / raw)
  To: Theodore Tso, Pavel Machek, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On 08/25/2009 10:58 PM, Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote:
>    
>> I agree with the whole write up outside of the above - degraded RAID
>> does meet this requirement unless you have a second (or third, counting
>> the split write) failure during the rebuild.
>>      
> The argument is that if the degraded RAID array is running in this
> state for a long time, and the power fails while the software RAID is
> in the middle of writing out a stripe, such that the stripe isn't
> completely written out, we could lose all of the data in that stripe.
>
> In other words, a power failure in the middle of writing out a stripe
> in a degraded RAID array counts as a second failure.
>    
> To me, this isn't a particularly interesting or newsworthy point,
> since a competent system administrator who cares about his data and/or
> his hardware will (a) have a UPS, and (b) be running with a hot spare
> and/or will immediately replace a failed drive in a RAID array.
>
>         	    	       	       	 	      - Ted
>    

I agree that this is not an interesting (or likely) scenario, certainly 
when compared to the much more frequent failures that RAID will protect 
against which is why I object to the document as Pavel suggested. It 
will steer people away from using RAID and directly increase their 
chances of losing their data if they use just a single disk.

Ric

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 10:39                                                               ` Ric Wheeler
@ 2009-08-26 11:12                                                                 ` Pavel Machek
  2009-08-26 11:28                                                                   ` david
                                                                                     ` (2 more replies)
  0 siblings, 3 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-26 11:12 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Wed 2009-08-26 06:39:14, Ric Wheeler wrote:
> On 08/25/2009 10:58 PM, Theodore Tso wrote:
>> On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote:
>>    
>>> I agree with the whole write up outside of the above - degraded RAID
>>> does meet this requirement unless you have a second (or third, counting
>>> the split write) failure during the rebuild.
>>>      
>> The argument is that if the degraded RAID array is running in this
>> state for a long time, and the power fails while the software RAID is
>> in the middle of writing out a stripe, such that the stripe isn't
>> completely written out, we could lose all of the data in that stripe.
>>
>> In other words, a power failure in the middle of writing out a stripe
>> in a degraded RAID array counts as a second failure.
>>    To me, this isn't a particularly interesting or newsworthy point,
>> since a competent system administrator who cares about his data and/or
>> his hardware will (a) have a UPS, and (b) be running with a hot spare
>> and/or will immediately replace a failed drive in a RAID array.
>
> I agree that this is not an interesting (or likely) scenario, certainly  
> when compared to the much more frequent failures that RAID will protect  
> against which is why I object to the document as Pavel suggested. It  
> will steer people away from using RAID and directly increase their  
> chances of losing their data if they use just a single disk.

So instead of fixing, or at least documenting, a known software deficiency
in the Linux MD stack, you'll try to suppress that information so that
people use more raid5 setups?

Perhaps better documentation will push them to RAID1, or maybe
make them buy a UPS?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  4:08                                           ` Rik van Riel
@ 2009-08-26 11:15                                             ` Pavel Machek
  2009-08-27  3:29                                               ` Rik van Riel
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-26 11:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Neil Brown, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Hi!

>> Ok, can you help? Having a piece of MD documentation explaining the
>> "powerfail nukes entire stripe" and how current filesystems do not
>> deal with that would be nice, along with description when exactly that
>> happens.
>
> Except of course for the inconvenient detail that a power
> failure on a degraded RAID 5 array does *NOT* nuke the
> entire stripe.

Ok, you are right. It will nuke an unrelated sector somewhere on the
stripe (one that is "old" and was not recently written) -- which is
still something ext3 cannot reliably handle.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  3:32                                   ` Rik van Riel
@ 2009-08-26 11:17                                     ` Pavel Machek
  2009-08-26 11:29                                       ` david
  2009-08-26 12:28                                       ` Theodore Tso
  2009-08-27  5:27                                     ` Rob Landley
  1 sibling, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-26 11:17 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue 2009-08-25 23:32:47, Rik van Riel wrote:
> Pavel Machek wrote:
>
>>> So, would you be happy if ext3 fsck was always run on reboot (at 
>>> least  for flash devices)?
>>
>> For flash devices, MD Raid 5 and anything else that needs it; yes that
>> would make me happy ;-).
>
> Sorry, but that just shows your naivete.
>
> Metadata takes up such a small part of the disk that fscking
> it and finding it to be OK is absolutely no guarantee that
> the data on the filesystem has not been horribly mangled.
>
> Personally, what I care about is my data.
>
> The metadata is just a way to get to my data, while the data
> is actually important.

Personally, I care about metadata consistency, and ext3 documentation
suggests that journal protects its integrity. Except that it does not
on broken storage devices, and you still need to run fsck there.

How you protect your data is another question, but the ext3
documentation does not claim that the journal protects it, so that's up to
the user I guess.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  0:45                                                           ` Ric Wheeler
@ 2009-08-26 11:21                                                             ` Pavel Machek
  2009-08-26 11:58                                                               ` Ric Wheeler
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-26 11:21 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue 2009-08-25 20:45:26, Ric Wheeler wrote:
> On 08/25/2009 08:38 PM, Pavel Machek wrote:
>>>>>> I'm not sure what's rare about power failures. Unlike single sector
>>>>>> errors, my machine actually has a button that produces exactly that
>>>>>> event. Running degraded raid5 arrays for extended periods may be
>>>>>> slightly unusual configuration, but I suspect people should just do
>>>>>> that for testing. (And from the discussion, people seem to think that
>>>>>> degraded raid5 is equivalent to raid0).
>>>>>
>>>>> Power failures after a full drive failure with a split write during a rebuild?
>>>>
>>>> Look, I don't need full drive failure for this to happen. I can just
>>>> remove one disk from array. I don't need power failure, I can just
>>>> press the power button. I don't even need to rebuild anything, I can
>>>> just write to degraded array.
>>>>
>>>> Given that all events are under my control, statistics make little
>>>> sense here.
>>>
>>> You are deliberately causing a double failure - pressing the power button
>>> after pulling a drive is exactly that scenario.
>>
>> Exactly. And now I'm trying to get that documented, so that people
>> don't do it and still expect their fs to be consistent.
>
> The problem I have is that the way you word it steers people away from 
> RAID5 and better data integrity. Your intentions are good, but your text 
> is going to do considerable harm.
>
> Most people don't intentionally drop power (or have a power failure) 
> during RAID rebuilds....

An example I saw went like this:

A drive in a raid 5 array failed; a hot spare was available (no idea about
a UPS). The system apparently locked up trying to talk to the failed drive,
or maybe the admin just was not patient enough, so he just power-cycled the
array. He lost the array.

So while most people will not aggressively power-cycle a RAID array,
drive failure still provokes little-tested error paths, and getting an
unclean shutdown is quite easy in such a case.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  4:24                                                       ` Rik van Riel
@ 2009-08-26 11:22                                                         ` Pavel Machek
  2009-08-26 14:45                                                           ` Rik van Riel
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-26 11:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Wed 2009-08-26 00:24:30, Rik van Riel wrote:
> Pavel Machek wrote:
>
>> Look, I don't need full drive failure for this to happen. I can just
>> remove one disk from array. I don't need power failure, I can just
>> press the power button. I don't even need to rebuild anything, I can
>> just write to degraded array.
>>
>> Given that all events are under my control, statistics make little
>> sense here.
>
> I recommend a sledgehammer.
>
> If you want to lose your data, you might as well have some fun.
>
> No need to bore yourself to tears by simulating events that are
> unlikely to happen simultaneously to careful system administrators.

A sledgehammer is a hardware problem; I'm demonstrating a
software/documentation problem we have here.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26  1:19                                                       ` david
@ 2009-08-26 11:25                                                         ` Pavel Machek
  2009-08-26 12:37                                                           ` Theodore Tso
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-26 11:25 UTC (permalink / raw)
  To: david
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue 2009-08-25 18:19:40, david@lang.hm wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>>>>>> THESE devices have the property of potentially corrupting blocks being
>>>>>> written at the time of the power failure,
>>>>>
>>>>> this is true of all devices
>>>>
>>>> Actually I don't think so. I believe SATA disks do not corrupt even
>>>> the sector they are writing to -- they just have big enough
>>>> capacitors. And yes I believe ext3 depends on that.
>>>
>>> Pavel, no S-ATA drive has capacitors to hold up during a power failure
>>> (or even enough power to destage their write cache). I know this from
>>> direct, personal knowledge having built RAID boxes at EMC for years. In
>>> fact, almost all RAID boxes require that the write cache be hardwired to
>>> off when used in their arrays.
>>
>> I never claimed they have enough power to flush entire cache -- read
>> the paragraph again. I do believe the disks have enough capacitors to
>> finish writing single sector, and I do believe ext3 depends on that.
>
> keep in mind that in a powerfail situation the data being sent to the  
> drive may be corrupt (the ram gets flaky while a DMA to the drive copies  
> the bad data to the drive, which writes it before the power loss gets bad 
> enough for the drive to decide there is a problem and shutdown)
>
> you just plain cannot count on writes that are in flight when a powerfail 
> happens to do predictable things, let alone what you consider sane or  
> proper.

From what I see, this kind of failure is rather harder to reproduce
than the software problems. And at least SGI machines were designed to
avoid this...

Anyway, I'd like to hear from the ext3 people... what happens on read
errors in the journal? That's what you'd expect to see in the situation above.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 11:12                                                                 ` Pavel Machek
@ 2009-08-26 11:28                                                                   ` david
  2009-08-29  9:49                                                                     ` [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek
  2009-08-26 12:01                                                                   ` [patch] ext2/3: document conditions when reliable operation is possible Ric Wheeler
  2009-08-26 12:23                                                                   ` Theodore Tso
  2 siblings, 1 reply; 309+ messages in thread
From: david @ 2009-08-26 11:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Pavel Machek wrote:

> On Wed 2009-08-26 06:39:14, Ric Wheeler wrote:
>> On 08/25/2009 10:58 PM, Theodore Tso wrote:
>>> On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote:
>>>
>>>> I agree with the whole write up outside of the above - degraded RAID
>>>> does meet this requirement unless you have a second (or third, counting
>>>> the split write) failure during the rebuild.
>>>>
>>> The argument is that if the degraded RAID array is running in this
>>> state for a long time, and the power fails while the software RAID is
>>> in the middle of writing out a stripe, such that the stripe isn't
>>> completely written out, we could lose all of the data in that stripe.
>>>
>>> In other words, a power failure in the middle of writing out a stripe
>>> in a degraded RAID array counts as a second failure.
>>>    To me, this isn't a particularly interesting or newsworthy point,
>>> since a competent system administrator who cares about his data and/or
>>> his hardware will (a) have a UPS, and (b) be running with a hot spare
>>> and/or will immediately replace a failed drive in a RAID array.
>>
>> I agree that this is not an interesting (or likely) scenario, certainly
>> when compared to the much more frequent failures that RAID will protect
>> against which is why I object to the document as Pavel suggested. It
>> will steer people away from using RAID and directly increase their
>> chances of losing their data if they use just a single disk.
>
> So instead of fixing or at least documenting known software deficiency
> in Linux MD stack, you'll try to suppress that information so that
> people use more of raid5 setups?
>
> Perhaps the better documentation will push them to RAID1, or maybe
> make them buy an UPS?

people aren't objecting to better documentation; they are objecting to 
misleading documentation.

for flash drives the danger is very straightforward (although even then 
you have to note that it depends heavily on the firmware of the device: 
some will lose lots of data, some won't lose any)

a good thing to do here would be for someone to devise a test to show this 
problem, and then gather the results of lots of people performing this 
test to see what the commonalities are.

you are generalizing that since you have lost data on flash drives, all 
flash drives are dangerous.

what if it turns out that only one manufacturer is doing things wrong? you 
will have discouraged people from using flash drives for no reason. 
(potentially causing them to lose data because they are scared away from 
using flash drives and don't implement anything better)

to be safe, all that a flash drive needs to do is to not change the FTL 
pointers until the data has fully been recorded in its new location. this 
is probably a trivial firmware change.
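
a toy sketch of that ordering (hypothetical names only; a real FTL also
has to worry about erase blocks, wear levelling and making the map update
itself atomic):

class ToyFTL:
    """Logical->physical remapping where the map is only updated after the
    new copy of the data is completely on flash."""
    def __init__(self, nblocks):
        self.flash = {}                     # physical block -> data
        self.map = {}                       # logical block  -> physical block
        self.free = list(range(nblocks))    # pre-erased physical blocks

    def write(self, logical, data):
        phys = self.free.pop()
        self.flash[phys] = data             # 1. program data in its new home
        # a power cut here loses only the in-flight write; the old mapping
        # (and therefore the old data) is still intact
        old = self.map.get(logical)
        self.map[logical] = phys            # 2. only now repoint the block
        if old is not None:
            self.free.append(old)           # old copy can be reclaimed later

    def read(self, logical):
        return self.flash[self.map[logical]]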


for raid arrays, we are still learning the nuances of what actually can 
happen. the comment that Rik made a few hours ago pointed out that 
with raid 5 you won't trash the entire stripe (which is what I thought 
happened from prior comments), but instead run the risk of losing two 
relatively well-defined chunks of data:

1. the block you are writing (which you can lose anyway)

2. the block that would live on the disk that is missing.

that drastically lessens the impact of the problem

I would like to see someone explain what would happen on raid 6, and I 
think that the possibility Neil talked about, where he said that it 
was possible to try the various combinations and see which ones agree with 
each other, would be a good thing to implement if he can do so.

but the super-simplified statement you keep trying to make significantly 
overstates and oversimplifies the problem.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 11:17                                     ` Pavel Machek
@ 2009-08-26 11:29                                       ` david
  2009-08-26 13:10                                         ` Pavel Machek
  2009-08-26 12:28                                       ` Theodore Tso
  1 sibling, 1 reply; 309+ messages in thread
From: david @ 2009-08-26 11:29 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rik van Riel, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Pavel Machek wrote:

> On Tue 2009-08-25 23:32:47, Rik van Riel wrote:
>> Pavel Machek wrote:
>>
>>>> So, would you be happy if ext3 fsck was always run on reboot (at
>>>> least  for flash devices)?
>>>
>>> For flash devices, MD Raid 5 and anything else that needs it; yes that
>>> would make me happy ;-).
>>
>> Sorry, but that just shows your naivete.
>>
>> Metadata takes up such a small part of the disk that fscking
>> it and finding it to be OK is absolutely no guarantee that
>> the data on the filesystem has not been horribly mangled.
>>
>> Personally, what I care about is my data.
>>
>> The metadata is just a way to get to my data, while the data
>> is actually important.
>
> Personally, I care about metadata consistency, and ext3 documentation
> suggests that journal protects its integrity. Except that it does not
> on broken storage devices, and you still need to run fsck there.

as the ext3 authors have stated many times over the years, you still need 
to run fsck periodically anyway.

what the journal gives you is a reasonable chance of skipping it when the 
system crashes and you want to get it back up ASAP.
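
for reference, the knobs that control the periodic check on ext2/ext3 are
the superblock mount-count and time interval, settable with tune2fs; a
small sketch (device name and values are only examples):

import subprocess

DEV = "/dev/sda1"   # example device

def enable_periodic_fsck(dev, mounts=30, interval="180d"):
    """Force a full e2fsck at boot after `mounts` mounts or after `interval`,
    whichever comes first."""
    subprocess.run(["tune2fs", "-c", str(mounts), "-i", interval, dev],
                   check=True)

def show_settings(dev):
    """Dump the superblock; look for "Maximum mount count" / "Check interval"."""
    subprocess.run(["dumpe2fs", "-h", dev], check=True)

if __name__ == "__main__":
    enable_periodic_fsck(DEV)
    show_settings(DEV)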

David Lang

> How do you protect your data is another question, but ext3
> documentation does not claim journal to protect them, so that's up to
> the user I guess.
> 									Pavel
>

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 11:21                                                             ` Pavel Machek
@ 2009-08-26 11:58                                                               ` Ric Wheeler
  2009-08-26 12:40                                                                 ` Theodore Tso
  2009-08-29  9:38                                                                 ` Pavel Machek
  0 siblings, 2 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26 11:58 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On 08/26/2009 07:21 AM, Pavel Machek wrote:
> On Tue 2009-08-25 20:45:26, Ric Wheeler wrote:
>    
>> On 08/25/2009 08:38 PM, Pavel Machek wrote:
>>      
>>>>>>> I'm not sure what's rare about power failures. Unlike single sector
>>>>>>> errors, my machine actually has a button that produces exactly that
>>>>>>> event. Running degraded raid5 arrays for extended periods may be
>>>>>>> slightly unusual configuration, but I suspect people should just do
>>>>>>> that for testing. (And from the discussion, people seem to think that
>>>>>>> degraded raid5 is equivalent to raid0).
>>>>>>>                
>>>>>> Power failures after a full drive failure with a split write during a rebuild?
>>>>>>              
>>>>> Look, I don't need full drive failure for this to happen. I can just
>>>>> remove one disk from array. I don't need power failure, I can just
>>>>> press the power button. I don't even need to rebuild anything, I can
>>>>> just write to degraded array.
>>>>>
>>>>> Given that all events are under my control, statistics make little
>>>>> sense here.
>>>>>            
>>>> You are deliberately causing a double failure - pressing the power button
>>>> after pulling a drive is exactly that scenario.
>>>>          
>>> Exactly. And now I'm trying to get that documented, so that people
>>> don't do it and still expect their fs to be consistent.
>>>        
>> The problem I have is that the way you word it steers people away from
>> RAID5 and better data integrity. Your intentions are good, but your text
>> is going to do considerable harm.
>>
>> Most people don't intentionally drop power (or have a power failure)
>> during RAID rebuilds....
>>      
> Example I seen went like this:
>
> Drive in raid 5 failed; hot spare was available (no idea about
> UPS). System apparently locked up trying to talk to the failed drive,
> or maybe admin just was not patient enough, so he just powercycled the
> array. He lost the array.
>
> So while most people will not agressively powercycle the RAID array,
> drive failure still provokes little tested error paths, and getting
> unclean shutdown is quite easy in such case.
> 								Pavel
>    

Then what we need to document is do not power cycle an array during a 
rebuild, right?

If it wasn't the admin that timed out and the box really was hung (no 
drive activity lights, etc), you will need to power cycle/reboot but 
then you should not have this active rebuild issuing writes either...

In the end, there are cascading failures that will defeat any data 
protection scheme, but that does not mean that the value of that scheme 
is zero. We need to get more people to use RAID (including MD RAID5) and 
try to enhance it as we go. Just using a single disk is not a good thing...

ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 11:12                                                                 ` Pavel Machek
  2009-08-26 11:28                                                                   ` david
@ 2009-08-26 12:01                                                                   ` Ric Wheeler
  2009-08-26 12:23                                                                   ` Theodore Tso
  2 siblings, 0 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26 12:01 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/26/2009 07:12 AM, Pavel Machek wrote:
> On Wed 2009-08-26 06:39:14, Ric Wheeler wrote:
>    
>> On 08/25/2009 10:58 PM, Theodore Tso wrote:
>>      
>>> On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote:
>>>
>>>        
>>>> I agree with the whole write up outside of the above - degraded RAID
>>>> does meet this requirement unless you have a second (or third, counting
>>>> the split write) failure during the rebuild.
>>>>
>>>>          
>>> The argument is that if the degraded RAID array is running in this
>>> state for a long time, and the power fails while the software RAID is
>>> in the middle of writing out a stripe, such that the stripe isn't
>>> completely written out, we could lose all of the data in that stripe.
>>>
>>> In other words, a power failure in the middle of writing out a stripe
>>> in a degraded RAID array counts as a second failure.
>>>     To me, this isn't a particularly interesting or newsworthy point,
>>> since a competent system administrator who cares about his data and/or
>>> his hardware will (a) have a UPS, and (b) be running with a hot spare
>>> and/or will immediately replace a failed drive in a RAID array.
>>>        
>> I agree that this is not an interesting (or likely) scenario, certainly
>> when compared to the much more frequent failures that RAID will protect
>> against which is why I object to the document as Pavel suggested. It
>> will steer people away from using RAID and directly increase their
>> chances of losing their data if they use just a single disk.
>>      
> So instead of fixing or at least documenting known software deficiency
> in Linux MD stack, you'll try to suppress that information so that
> people use more of raid5 setups?
>
> Perhaps the better documentation will push them to RAID1, or maybe
> make them buy an UPS?
> 									Pavel
>    

I am against documenting unlikely scenarios out of context that will 
lead people to do the wrong thing.

ric



^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 11:12                                                                 ` Pavel Machek
  2009-08-26 11:28                                                                   ` david
  2009-08-26 12:01                                                                   ` [patch] ext2/3: document conditions when reliable operation is possible Ric Wheeler
@ 2009-08-26 12:23                                                                   ` Theodore Tso
  2009-08-30  7:01                                                                     ` Pavel Machek
  2009-08-30  7:01                                                                     ` Pavel Machek
  2 siblings, 2 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-26 12:23 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Florian Weimer, Goswin von Brederlow, Rob Landley,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Wed, Aug 26, 2009 at 01:12:08PM +0200, Pavel Machek wrote:
> > I agree that this is not an interesting (or likely) scenario, certainly  
> > when compared to the much more frequent failures that RAID will protect  
> > against which is why I object to the document as Pavel suggested. It  
> > will steer people away from using RAID and directly increase their  
> > chances of losing their data if they use just a single disk.
> 
> So instead of fixing or at least documenting known software deficiency
> in Linux MD stack, you'll try to suppress that information so that
> people use more of raid5 setups?

First of all, it's not a "known software deficiency"; you can't do
anything about a degraded RAID array, other than to replace the failed
disk.  Secondly, what we should document is things like "don't use
crappy flash devices", "don't let the RAID array run in degraded mode
for a long time" and "if you must (which is a bad idea), better have a
UPS or a battery-backed hardware RAID".  What we should *not* document
is

"ext3 is worthless for RAID 5 arrays" (simply wrong)

and

"ext2 is better than ext3 because it forces you to run a long, slow
fsck after each boot, and that helps you to catch filesystem
corruptions when the storage devices goes bad" (Second part of the
statement is true, but it's still bad general advice, and it's
horribly misleading)

and

"ext2 and ext3 have this surprising dependency that disks act like
disks".  (alarmist)

      	       	    	 	    	       	    - Ted


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 11:17                                     ` Pavel Machek
  2009-08-26 11:29                                       ` david
@ 2009-08-26 12:28                                       ` Theodore Tso
  2009-08-27  6:06                                         ` Rob Landley
  1 sibling, 1 reply; 309+ messages in thread
From: Theodore Tso @ 2009-08-26 12:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rik van Riel, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote:
> > Metadata takes up such a small part of the disk that fscking
> > it and finding it to be OK is absolutely no guarantee that
> > the data on the filesystem has not been horribly mangled.
> >
> > Personally, what I care about is my data.
> >
> > The metadata is just a way to get to my data, while the data
> > is actually important.
> 
> Personally, I care about metadata consistency, and ext3 documentation
> suggests that journal protects its integrity. Except that it does not
> on broken storage devices, and you still need to run fsck there.

Caring about metadata consistency and not data is just weird, I'm
sorry.  I can't imagine anyone who actually *cares* about what they
have stored, whether it's digital photographs of a child taking a first
step, or their thesis research, caring more about the metadata
than the data.  Giving advice that pretends that most users have that
priority is Just Wrong.

That's why what we should document is that people should avoid broken
storage devices, and advice on how to use RAID properly.  At the end
of the day, getting people to switch from ext2 to ext3 on some
misguided notion that this way, they'll know when their metadata is
safe (at least in the power failure case; but not the system hangs and
you have to reboot case), and getting them to ignore the question of
why they are using a broken storage device in the first place, is
Documentation malpractice.

						- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 11:25                                                         ` Pavel Machek
@ 2009-08-26 12:37                                                           ` Theodore Tso
  2009-08-30  6:49                                                             ` Pavel Machek
  2009-08-30  6:49                                                             ` Pavel Machek
  0 siblings, 2 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-26 12:37 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, Aug 26, 2009 at 01:25:36PM +0200, Pavel Machek wrote:
> > you just plain cannot count on writes that are in flight when a powerfail 
> > happens to do predictable things, let alone what you consider sane or  
> > proper.
> 
> From what I see, this kind of failure is rather harder to reproduce
> than the software problems. And at least SGI machines were designed to
> avoid this...
> 
> Anyway, I'd like to hear from ext3 people... what happens on read
> errors in journal? That's what you'd expect to see in situation above.

On a power failure, what normally happens is that random garbage
gets written in the disk drive's last dying gasp, since the memory
starts going insane and sends garbage to the disk.  So the disk
successfully completes the write, but the sector contains garbage.
Since HDDs tend to be the last thing to die, being less sensitive to
voltage drops than the memory or DMA controller, my experience is that
you don't get a read error after the system comes up, you just get
garbage written into the journal.

The ext3 journalling code waits until all of the journal blocks are
written, and only then writes the commit block.  On restart, we look
for the last valid commit block.  So if the power failure is before we
write the commit block, we replay the journal up until the previous
commit block.  If the power failure is while we are writing the commit
block, garbage will be written out instead of the commit block, and so
it falls back to the previous case.

We do not allow any updates to the filesystem metadata to take place
until the commit block has been written; therefore the filesystem
stays consistent.
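
A rough sketch of that recovery rule (not the real jbd code; the record
format here is invented for illustration):

def last_committed_txn(journal):
    """Scan the journal in order and return the id of the last transaction
    whose commit record was written intact; later records are not replayed."""
    last = None
    for rec in journal:
        if rec.get("type") == "commit":
            last = rec["txn"]
    return last

journal = [
    {"type": "descriptor", "txn": 7},
    {"type": "data",       "txn": 7},
    {"type": "commit",     "txn": 7},
    {"type": "descriptor", "txn": 8},
    {"type": "data",       "txn": 8},
    {"type": "garbage"},     # power died while writing txn 8's commit block
]

print(last_committed_txn(journal))   # -> 7; transaction 8 is simply ignored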

If the journal *does* develop read errors, then the filesystem will
require a manual fsck, and so the boot operation will get stopped so a
system administrator can provide manual intervention.  The best bet
for the sysadmin is to replay as much of the journal as she can, and then
let fsck fix any resulting filesystem inconsistencies.  In practice,
though, I've not experienced or seen any reports of this happening
from a power failure; usually it happens if the laptop gets dropped or
the hard drive suffers some other kind of hardware failure.

    	       	       	  	       - Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 11:58                                                               ` Ric Wheeler
@ 2009-08-26 12:40                                                                 ` Theodore Tso
  2009-08-26 13:11                                                                   ` Ric Wheeler
                                                                                     ` (2 more replies)
  2009-08-29  9:38                                                                 ` Pavel Machek
  1 sibling, 3 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-26 12:40 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Pavel Machek, david, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, Aug 26, 2009 at 07:58:40AM -0400, Ric Wheeler wrote:
>> Drive in raid 5 failed; hot spare was available (no idea about
>> UPS). System apparently locked up trying to talk to the failed drive,
>> or maybe admin just was not patient enough, so he just powercycled the
>> array. He lost the array.
>>
>> So while most people will not agressively powercycle the RAID array,
>> drive failure still provokes little tested error paths, and getting
>> unclean shutdown is quite easy in such case.
>
> Then what we need to document is do not power cycle an array during a  
> rebuild, right?

Well, the software RAID layer could be improved so that it implements
scrubbing by default (i.e., have the md package install a cron job to
implement a periodic scrub pass automatically).  The MD code could
also regularly check to make sure the hot spare is OK; the other
possibility is that the hot spare, which hadn't been used in a long time,
had silently failed.

> In the end, there are cascading failures that will defeat any data  
> protection scheme, but that does not mean that the value of that scheme  
> is zero. We need to be get more people to use RAID (including MD5) and  
> try to enhance it as we go. Just using a single disk is not a good 
> thing...

Yep; the solution is to improve the storage devices.  It is *not* to
encourage people to think RAID is not worth it, or that somehow ext2
is better than ext3 because it runs fsck's all the time at boot up.
That's just crazy talk.

						- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 11:29                                       ` david
@ 2009-08-26 13:10                                         ` Pavel Machek
  2009-08-26 13:43                                           ` david
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-26 13:10 UTC (permalink / raw)
  To: david
  Cc: Rik van Riel, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack


>>> The metadata is just a way to get to my data, while the data
>>> is actually important.
>>
>> Personally, I care about metadata consistency, and ext3 documentation
>> suggests that journal protects its integrity. Except that it does not
>> on broken storage devices, and you still need to run fsck there.
>
> as the ext3 authors have stated many times over the years, you still need 
> to run fsck periodicly anyway.

Where is that documented? I very much agree with that, but when suse10
switched periodic fsck off, I could not find any docs to show that it
is a bad idea.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 12:40                                                                 ` Theodore Tso
  2009-08-26 13:11                                                                   ` Ric Wheeler
@ 2009-08-26 13:11                                                                   ` Ric Wheeler
  2009-08-26 13:44                                                                     ` david
  2009-08-26 13:40                                                                   ` Chris Adams
  2 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26 13:11 UTC (permalink / raw)
  To: Theodore Tso, Pavel Machek, david, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 08/26/2009 08:40 AM, Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 07:58:40AM -0400, Ric Wheeler wrote:
>>> Drive in raid 5 failed; hot spare was available (no idea about
>>> UPS). System apparently locked up trying to talk to the failed drive,
>>> or maybe admin just was not patient enough, so he just powercycled the
>>> array. He lost the array.
>>>
>>> So while most people will not agressively powercycle the RAID array,
>>> drive failure still provokes little tested error paths, and getting
>>> unclean shutdown is quite easy in such case.
>>
>> Then what we need to document is do not power cycle an array during a
>> rebuild, right?
>
> Well, the softwar raid layer could be improved so that it implements
> scrubbing by default (i.e., have the md package install a cron job to
> implement a periodict scrub pass automatically).  The MD code could
> also regularly check to make sure the hot spare is OK; the other
> possibility is that hot spare, which hadn't been used in a long time,
> had silently failed.

Actually, MD does this scan already (not automatically, but you can set up a 
simple cron job to kick off a periodic "check"). It is a delicate balance to get 
the frequency of the scrubbing correct.

On one hand, you want to make sure that you detect errors in a timely fashion, 
certainly detecting single-sector errors before you develop a second 
sector-level error on another drive.

On the other hand, running scans/scrubs continually impacts the performance of 
your real workload and can potentially impact your components' life span by 
subjecting them to a heavy workload.

The rule of thumb from my experience is that most people settle on a scan 
once every week or two (done at a throttled rate).
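
For what it's worth, the moving parts are just the md sysfs/procfs knobs; a 
sketch of the sort of throttled check such a cron job could run (the speed 
cap and array name are made-up examples, and it needs root):

ARRAY = "md0"   # example array name

def start_throttled_check(array, max_kb_per_sec=10000):
    """Kick off a read/compare pass over the whole array, capped so it does
    not swamp the real workload."""
    with open("/proc/sys/dev/raid/speed_limit_max", "w") as f:
        f.write(f"{max_kb_per_sec}\n")
    with open(f"/sys/block/{array}/md/sync_action", "w") as f:
        f.write("check\n")

if __name__ == "__main__":
    start_throttled_check(ARRAY)
    # progress shows up in /proc/mdstat; detected inconsistencies are counted
    # in /sys/block/md0/md/mismatch_cnt once the pass finishes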

>
>> In the end, there are cascading failures that will defeat any data
>> protection scheme, but that does not mean that the value of that scheme
>> is zero. We need to be get more people to use RAID (including MD5) and
>> try to enhance it as we go. Just using a single disk is not a good
>> thing...
>
> Yep; the solution is to improve the storage devices.  It is *not* to
> encourage people to think RAID is not worth it, or that somehow ext2
> is better than ext3 because it runs fsck's all the time at boot up.
> That's just crazy talk.
>
> 						- Ted

Agreed....

ric

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  2:55                                                             ` Theodore Tso
  2009-08-26 13:37                                                               ` Ric Wheeler
@ 2009-08-26 13:37                                                               ` Ric Wheeler
  1 sibling, 0 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-26 13:37 UTC (permalink / raw)
  To: Theodore Tso, Pavel Machek, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On 08/25/2009 10:55 PM, Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 03:16:06AM +0200, Pavel Machek wrote:
>> Hi!
>>
>>> 3) Does that mean that you shouldn't use ext3 on RAID drives?  Of
>>> course not!  First of all, Ext3 still saves you against kernel panics
>>> and hangs caused by device driver bugs or other kernel hangs.  You
>>> will lose less data, and avoid needing to run a long and painful fsck
>>> after a forced reboot, compared to if you used ext2.  You are making
>>
>> Actually... ext3 + MD RAID5 will still have a problem on kernel
>> panic. MD RAID5 is implemented in software, so if kernel panics, you
>> can still get inconsistent data in your array.
>
> Only if the MD RAID array is running in degraded mode (and again, if
> the system is in this state for a long time, the bug is in the system
> administrator).  And even then, it depends on how the kernel dies.  If
> the system hangs due to some deadlock, or we get an OOPS that kills a
> process while still holding some locks, and that leads to a deadlock,
> it's likely the low-level MD driver can still complete the stripe
> write, and no data will be lost.  If the kernel ties itself in knots
> due to running out of memory, and the OOM handler is invoked, someone
> hitting the reset button to force a reboot will also be fine.
>
> If the RAID array is degraded, and we get an oops in interrupt
> handler, such that the system is immediately halted --- then yes, data
> could get lost.  But there are many system crashes where the software
> RAID's ability to complete a stripe write would not be compromised.
>
>         	       	  	     	    	  	- Ted

Just to add some real-world data, Bianca Schroeder published a really good paper 
that looks at failures in national labs and has actual measured disk failure data:

http://www.cs.cmu.edu/~bianca/fast07.pdf

Her numbers showed various rates of failure, but depending on the box, drive 
type, etc., they lost between 1% and 6% of the installed drives each year.

There is also a good paper from Google:

http://labs.google.com/papers/disk_failures.html

Both of the above are largely linux boxes.

And several other FAST papers on failures in commercial RAID boxes, most notably 
by NetApp.

If reading papers is not at the top of your list of things to do, just skim 
through and look for the tables on disk failures, etc. which have great 
measurements of what really failed in these systems...

Ric






^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 12:40                                                                 ` Theodore Tso
  2009-08-26 13:11                                                                   ` Ric Wheeler
  2009-08-26 13:11                                                                   ` Ric Wheeler
@ 2009-08-26 13:40                                                                   ` Chris Adams
  2009-08-26 13:47                                                                     ` Alan Cox
  2009-08-27 21:50                                                                     ` Pavel Machek
  2 siblings, 2 replies; 309+ messages in thread
From: Chris Adams @ 2009-08-26 13:40 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-kernel

Once upon a time, Theodore Tso  <tytso@mit.edu> said:
>Well, the software raid layer could be improved so that it implements
>scrubbing by default (i.e., have the md package install a cron job to
>implement a periodic scrub pass automatically).

Fedora 11 added a cron job to kick off a RAID check for each Linux MD
RAID array every week.  Combined with running mdmonitor, root will get
an email on any failure.
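
A minimal sketch of such a weekly check job, assuming MD arrays exposed under
/sys/block/md* and a cron.d-style crontab (names and times here are only
illustrative), might look like:

  # /etc/cron.d/raid-check (sketch): start a non-destructive scrub of every
  # MD array early on Sunday; mdadm --monitor (mdmonitor) mails any failures
  0 1 * * Sun  root  for md in /sys/block/md*/md; do echo check > "$md/sync_action"; done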

The other thing about this thread is that the only RAID implementation
that is being discussed here is the MD RAID stack.  There are a lot of
RAID implementations that have the same issues:

- motherboard (aka "fake") RAID - In Linux this is typically mapped with
  device mapper via dmraid; AFAIK there is not a tool to scrub (or even
  monitor the status of and notify on failure) a Linux DM RAID setup.

- hardware RAID cards without battery backup - these have the exact same
  issues because they cannot guarantee all writes complete, nor can they
  keep track of incomplete writes across power failures

- hardware RAID cards _with_ battery backup but that don't periodically
  test the battery and have a way to notify you of battery failure while
  Linux is running

The issues being raised here are not specific to extX, MD RAID, or Linux
at all; they are problems with non-"enterprise-class" RAID setups.
There's a reason enterprise-class RAID costs a lot more money than the
card you can pick up at Fry's.

There's no reason to document the design issues of general RAID
implementations in the Linux kernel.
-- 
Chris Adams <cmadams@hiwaay.net>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 13:10                                         ` Pavel Machek
@ 2009-08-26 13:43                                           ` david
  2009-08-26 18:02                                             ` Theodore Tso
  2009-08-30  7:03                                             ` Pavel Machek
  0 siblings, 2 replies; 309+ messages in thread
From: david @ 2009-08-26 13:43 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rik van Riel, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack

On Wed, 26 Aug 2009, Pavel Machek wrote:

>>>> The metadata is just a way to get to my data, while the data
>>>> is actually important.
>>>
>>> Personally, I care about metadata consistency, and ext3 documentation
>>> suggests that journal protects its integrity. Except that it does not
>>> on broken storage devices, and you still need to run fsck there.
>>
>> as the ext3 authors have stated many times over the years, you still need
>> to run fsck periodically anyway.
>
> Where is that documented?

linux-kernel mailing list archives.

David Lang

> I very much agree with that, but when suse10
> switched periodic fsck off, I could not find any docs to show that it
> is a bad idea.
> 								Pavel
>

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 13:11                                                                   ` Ric Wheeler
@ 2009-08-26 13:44                                                                     ` david
  0 siblings, 0 replies; 309+ messages in thread
From: david @ 2009-08-26 13:44 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Pavel Machek, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed, 26 Aug 2009, Ric Wheeler wrote:

> On 08/26/2009 08:40 AM, Theodore Tso wrote:
>> On Wed, Aug 26, 2009 at 07:58:40AM -0400, Ric Wheeler wrote:
>>>> Drive in raid 5 failed; hot spare was available (no idea about
>>>> UPS). System apparently locked up trying to talk to the failed drive,
>>>> or maybe admin just was not patient enough, so he just powercycled the
>>>> array. He lost the array.
>>>> 
>>>> So while most people will not aggressively powercycle the RAID array,
>>>> drive failure still provokes little tested error paths, and getting
>>>> unclean shutdown is quite easy in such case.
>>> 
>>> Then what we need to document is do not power cycle an array during a
>>> rebuild, right?
>> 
>> Well, the software raid layer could be improved so that it implements
>> scrubbing by default (i.e., have the md package install a cron job to
>> implement a periodic scrub pass automatically).  The MD code could
>> also regularly check to make sure the hot spare is OK; the other
>> possibility is that hot spare, which hadn't been used in a long time,
>> had silently failed.
>
> Actually, MD does this scan already (not automatically, but you can set up a 
> simple cron job to kick off a periodic "check"). It is a delicate balance to 
> get the frequency of the scrubbing correct.

Debian defaults to doing this once a month (the first Sunday of each month); 
on some of my systems this scrub takes almost a week to complete.
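
The progress and outcome of such a scrub can also be watched from userspace; a
rough sketch (assuming an array named md0):

  cat /proc/mdstat                       # shows a "check" in progress with a percentage
  cat /sys/block/md0/md/sync_completed   # sectors done / total for the current pass
  cat /sys/block/md0/md/mismatch_cnt     # non-zero after a check means inconsistent stripes were found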

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 13:40                                                                   ` Chris Adams
@ 2009-08-26 13:47                                                                     ` Alan Cox
  2009-08-26 14:11                                                                       ` Chris Adams
  2009-08-27 21:50                                                                     ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: Alan Cox @ 2009-08-26 13:47 UTC (permalink / raw)
  To: Chris Adams; +Cc: Theodore Tso, linux-kernel

> The issues being raised here are not specific to extX, MD RAID, or Linux
> at all; they are problems with non-"enterprise-class" RAID setups.
> There's a reason enterprise-class RAID costs a lot more money than the
> card you can pick up at Fry's.

And you will still need backups ;)

A long time ago I worked on a fault tolerant news server with dual
alphaserver boxes and a shared disk array. A power system failure took
out both the alpha boxes and the disk controllers and all the disks.

Fortunately it was a news server so you just had to wait a week ..

Alan

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 13:47                                                                     ` Alan Cox
@ 2009-08-26 14:11                                                                       ` Chris Adams
  0 siblings, 0 replies; 309+ messages in thread
From: Chris Adams @ 2009-08-26 14:11 UTC (permalink / raw)
  To: Alan Cox; +Cc: Theodore Tso, linux-kernel

Once upon a time, Alan Cox <alan@lxorguk.ukuu.org.uk> said:
> > The issues being raised here are not specific to extX, MD RAID, or Linux
> > at all; they are problems with non-"enterprise-class" RAID setups.
> > There's a reason enterprise-class RAID costs a lot more money than the
> > card you can pick up at Fry's.
> 
> And you will still need backups ;)

Yep.  RAID (of any class) != fail safe.

> A long time ago I worked on a fault tolerant news server with dual
> alphaserver boxes and a shared disk array. A power system failure took
> out both the alpha boxes and the disk controllers and all the disks.

Hey, that's not funny!  I'm typing this on a dual AlphaServer cluster
with a shared disk array (with dual battery backup even), and we had a
power failure at the NOC yesterday (that then tripped a breaker,
although it was between the generator and the UPS, so nothing went
down).

No matter how redundant you make things, a "no single point of failure"
setup still can fail, often in "interesting" ways that nobody
anticipated.

-- 
Chris Adams <cmadams@hiwaay.net>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 11:22                                                         ` Pavel Machek
@ 2009-08-26 14:45                                                           ` Rik van Riel
  2009-08-29  9:39                                                             ` Pavel Machek
  0 siblings, 1 reply; 309+ messages in thread
From: Rik van Riel @ 2009-08-26 14:45 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Pavel Machek wrote:

> Sledgehammer is hardware problem, and I'm demonstrating
> software/documentation problem we have here.

So your argument is that a sledgehammer is a hardware
problem, while a broken hard disk and a power failure
are software/documentation issues?

I'd argue that the broken hard disk and power failure
are hardware issues, too.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 13:43                                           ` david
@ 2009-08-26 18:02                                             ` Theodore Tso
  2009-08-27  6:28                                                 ` Eric Sandeen
                                                                 ` (2 more replies)
  2009-08-30  7:03                                             ` Pavel Machek
  1 sibling, 3 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-26 18:02 UTC (permalink / raw)
  To: david
  Cc: Pavel Machek, Rik van Riel, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack

On Wed, Aug 26, 2009 at 06:43:24AM -0700, david@lang.hm wrote:
>>>
>>> as the ext3 authors have stated many times over the years, you still need
>>> to run fsck periodically anyway.
>>
>> Where is that documented?
>
> linux-kernel mailing list archives.

Probably from some 6-8 years ago, in e-mail postings that I made.  My
argument has always been that PC-class hardware is crap, and it's a
Really Good Idea to periodically check the metadata because corruption
there can end up causing massive data loss.  The main problem is that
doing it at reboot time really hurt system availability, and "after 20
reboots (plus or minus)" resulted in fsck checks at wildly varying
intervals depending on how often people reboot.

What I've been recommending for some time is that people use LVM, and
run fsck on a snapshot every week or two, at some convenient time when
the system load is at a minimum.  There is an e2croncheck script in
the e2fsprogs sources, in the contrib directory; it's short enough
that I'll attach it here.

Is it *necessary*?  In a world where hardware is perfect, no.  In a
world where people don't bother buying ECC memory because it's 10%
more expensive, and PC builders use the cheapest possible parts --- I
think it's a really good idea.

						- Ted

P.S.  Patches so that this shell script takes a config file, and/or
parses /etc/fstab to automatically figure out which filesystems should
be checked, are greatly appreciated.  Getting distro's to start
including this in their e2fsprogs packaging scripts would also be
greatly appreciated.

#!/bin/sh
#
# e2croncheck -- run e2fsck automatically out of /etc/cron.weekly
#
# This script is intended to be run by the system administrator 
# periodically from the command line, or to be run once a week
# or so by the cron daemon to check a mounted filesystem (normally
# the root filesystem, but it could be used to check other filesystems
# that are always mounted when the system is booted).
#
# Make sure you customize "VG" so it is your LVM volume group name, 
# "VOLUME" so it is the name of the filesystem's logical volume, 
# and "EMAIL" to be your e-mail address
#
# Written by Theodore Ts'o, Copyright 2007, 2008, 2009.
#
# This file may be redistributed under the terms of the 
# GNU Public License, version 2.
#

VG=ssd
VOLUME=root
SNAPSIZE=100m
EMAIL=sysadmin@example.com

TMPFILE=`mktemp -t e2fsck.log.XXXXXXXXXX`

OPTS="-Fttv -C0"
#OPTS="-Fttv -E fragcheck"

set -e
START="$(date +'%Y%m%d%H%M%S')"
lvcreate -s -L ${SNAPSIZE} -n "${VOLUME}-snap" "${VG}/${VOLUME}"
if nice logsave -as $TMPFILE e2fsck -p $OPTS "/dev/${VG}/${VOLUME}-snap" && \
   nice logsave -as $TMPFILE e2fsck -fy $OPTS "/dev/${VG}/${VOLUME}-snap" ; then
  echo 'Background scrubbing succeeded!'
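  # clean snapshot: reset the mount count and record the snapshot time as the
  # last-checked time, so no boot-time fsck will be forced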
  tune2fs -C 0 -T "${START}" "/dev/${VG}/${VOLUME}"
else
  echo 'Background scrubbing failed! Reboot to fsck soon!'
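  # corrupted snapshot: fake a huge mount count and an ancient last-check date
  # so that a full e2fsck of the real filesystem is forced at the next reboot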
  tune2fs -C 16000 -T "19000101" "/dev/${VG}/${VOLUME}"
  if test -n "$RPT-EMAIL"; then 
    mail -s "E2fsck of /dev/${VG}/${VOLUME} failed!" $EMAIL < $TMPFILE
  fi
fi
lvremove -f "${VG}/${VOLUME}-snap"
rm $TMPFILE
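
One way to wire it up, assuming a distro that runs /etc/cron.weekly (the path
is only illustrative), is roughly:

  install -m 755 e2croncheck /etc/cron.weekly/e2croncheck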


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 13:42                                     ` Alan Cox
@ 2009-08-27  3:16                                       ` Rob Landley
  0 siblings, 0 replies; 309+ messages in thread
From: Rob Landley @ 2009-08-27  3:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ric Wheeler, Pavel Machek, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Tuesday 25 August 2009 08:42:10 Alan Cox wrote:
> On Tue, 25 Aug 2009 09:37:12 -0400
>
> Ric Wheeler <rwheeler@redhat.com> wrote:
> > I really think that the expectation that all OS's (windows, mac, even
> > your ipod) all teach you not to hot unplug a device with any file system.
> > Users have an "eject" or "safe unload" in windows, your iPod tells you
> > not to power off or disconnect, etc.
>
> Agreed

Ok, I'll bite: What are journaling filesystems _for_?

> > I don't object to making that general statement - "Don't hot unplug a
> > device with an active file system or actively used raw device" - but
> > would object to the overly general statement about ext3 not working on
> > flash, RAID5 not working, etc...
>
> The overall general statement for all media and all OS's should be
>
> "Do you have a backup, have you tested it recently"

It might be nice to know when you _needed_ said backup, and when you shouldn't 
re-backup bad data over it, because your data corruption actually got detected 
before then.

And maybe a pony.

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 11:15                                             ` Pavel Machek
@ 2009-08-27  3:29                                               ` Rik van Riel
  0 siblings, 0 replies; 309+ messages in thread
From: Rik van Riel @ 2009-08-27  3:29 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Neil Brown, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Pavel Machek wrote:
> Hi!
> 
>>> Ok, can you help? Having a piece of MD documentation explaining the
>>> "powerfail nukes entire stripe" and how current filesystems do not
>>> deal with that would be nice, along with description when exactly that
>>> happens.
>> Except of course for the inconvenient detail that a power
>> failure on a degraded RAID 5 array does *NOT* nuke the
>> entire stripe.
> 
> Ok, you are right. It will nuke an unrelated sector somewhere on the
> stripe (one that is "old" and was not recently written) -- which is
> still something ext3 can not reliably handle.

Not quite unrelated.  The "nuked" sector will be the one
that used to live on the disk that is broken and no longer
a part of the RAID 5 array.
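
To spell out the arithmetic: with parity P = D1 xor D2 xor D3 and the disk
holding D2 already missing, D2 can only be reconstructed as D1 xor D3 xor P.
If power fails after D1 has been rewritten but before P has been updated, that
reconstruction now returns garbage for D2, even though D2 itself was never
being written.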

I wouldn't qualify a missing hard disk as a software issue...

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25  0:08                               ` Theodore Tso
  2009-08-25  9:42                                 ` Pavel Machek
  2009-08-25  9:42                                 ` Pavel Machek
@ 2009-08-27  3:34                                 ` Rob Landley
  2009-08-27  8:46                                 ` David Woodhouse
  3 siblings, 0 replies; 309+ messages in thread
From: Rob Landley @ 2009-08-27  3:34 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Pavel Machek, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Monday 24 August 2009 19:08:42 Theodore Tso wrote:
> And if your
> claim is that several hundred lines of fsck output detailing the
> filesystem's destruction somehow makes things all better, I suspect
> most users would disagree with you.

Suppose a small office makes nightly backups to an offsite server via rsync.  If 
a thunderstorm goes by causing their system to reboot twice in a 15 minute 
period, would they rather notice the filesystem corruption immediately upon 
reboot, or notice after the next rsync?

> In any case, depending on where the flash was writing at the time of
> the unplug, the data corruption could be silent anyway.

Yup.  Hopefully btrfs will cope less badly?  They keep talking about 
checksumming extents...

> Maybe this came as a surprise to you, but anyone who has used a
> compact flash in a digital camera knows that you ***have*** to wait
> until the led has gone out before trying to eject the flash card.

I doubt the cupholder crowd is going to stop treating USB sticks as magical 
any time soon, but I also wonder how many of them even remember Linux _exists_ 
anymore.

> I
> remember seeing all sorts of horror stories from professional
> photographers about how they lost an important wedding's day worth of
> pictures with the attendant commercial loss, on various digital
> photography forums.  It tends to be the sort of mistake that digital
> photographers only make once.

Professionals have horror stories about this issue, therefore documenting it 
is _less_ important?

Ok...

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25 23:40                                               ` Ric Wheeler
  2009-08-25 23:48                                                 ` david
  2009-08-25 23:53                                                 ` Pavel Machek
@ 2009-08-27  3:53                                                 ` Rob Landley
  2009-08-27 11:43                                                   ` Ric Wheeler
  2 siblings, 1 reply; 309+ messages in thread
From: Rob Landley @ 2009-08-27  3:53 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote:
> Repeat experiment until you get up to something like google scale or the
> other papers on failures in national labs in the US and then we can have an
> informed discussion.

On google scale anvil lightning can fry your machine out of a clear sky.

However, there are still a few non-enterprise users out there, and knowing 
that specific usage patterns don't behave like they expect might be useful to 
them.

> >> I can promise you that hot unplugging and replugging a S-ATA drive will
> >> also lose you data if you are actively writing to it (ext2, 3,
> >> whatever).
> >
> > I can promise you that running S-ATA drive will also lose you data,
> > even if you are not actively writing to it. Just wait 10 years; so
> > what is your point?
>
> I lost a s-ata drive 24 hours after installing it in a new box. If I had
> MD RAID5, I would not have lost any.
>
> My point is that you fail to take into account the rate of failures of a
> given configuration and the probability of data loss given those rates.

Actually, that's _exactly_ what he's talking about.

When writing to a degraded raid or a flash disk, journaling is essentially 
useless.  If you get a power failure, kernel panic, somebody tripping over a 
USB cable, and so on, your filesystem will not be protected by journaling.  
Your data won't be trashed _every_ time, but the likelihood is much greater 
than experience with journaling in other contexts would suggest.

Worse, the journaling may be counterproductive by _hiding_ many errors that 
fsck would promptly detect, so when the error is detected it may not be 
associated with the event that caused it.  It also may not be noticed until 
good backups of the data have been overwritten or otherwise cycled out.

You seem to be arguing that Linux is no longer used anywhere but the 
enterprise, so issues affecting USB flash keys or cheap software-only RAID 
aren't worth documenting?

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  2:58                                                             ` Theodore Tso
  2009-08-26 10:39                                                               ` Ric Wheeler
  2009-08-26 10:39                                                               ` Ric Wheeler
@ 2009-08-27  5:19                                                               ` Rob Landley
  2009-08-27 12:24                                                                 ` Theodore Tso
  2 siblings, 1 reply; 309+ messages in thread
From: Rob Landley @ 2009-08-27  5:19 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ric Wheeler, Pavel Machek, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Tuesday 25 August 2009 21:58:49 Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 09:15:00PM -0400, Ric Wheeler wrote:
> > I agree with the whole write up outside of the above - degraded RAID
> > does meet this requirement unless you have a second (or third, counting
> > the split write) failure during the rebuild.
>
> The argument is that if the degraded RAID array is running in this
> state for a long time, and the power fails while the software RAID is
> in the middle of writing out a stripe, such that the stripe isn't
> completely written out, we could lose all of the data in that stripe.
>
> In other words, a power failure in the middle of writing out a stripe
> in a degraded RAID array counts as a second failure.

Or a panic, a hang, or a drive failing because the system is overheating since the 
air conditioner suddenly died and the server room is now an oven.  (Yup, 
worked at that company too.)

> To me, this isn't a particularly interesting or newsworthy point,
> since a competent system administrator

I'm a bit concerned by the argument that we don't need to document serious 
pitfalls because every Linux system has an administrator competent enough to 
already know things that didn't even come up until the second or third day 
this was discussed on lkml.

"You're documenting it wrong" != "you shouldn't document it".

> who cares about his data and/or
> his hardware will (a) have a UPS,

I worked at a company that retested their UPSes a year after installing them 
and found that _none_ of them supplied more than 15 seconds of charge, and when 
they dismantled them the batteries had physically bloated inside their little 
plastic cases.  (Same company as the dead air conditioner; possibly 
overheating was involved, but the little _lights_ said everything was ok.)

That was by no means the first UPS I'd seen die; the suckers have a higher 
failure rate than hard drives in my experience.  This is a device where the 
batteries get constantly charged and almost never tested, because if it _does_ 
fail you have just rebooted your production server, so a lot of smaller companies 
think they have one but actually don't.

> , and (b) be running with a hot spare
> and/or will imediately replace a failed drive in a RAID array.

Here's hoping they shut the system down properly to install the new drive in 
the raid then, eh?  Not accidentally pull the plug before it's finished running 
the ~7 minutes of shutdown scripts in the last Red Hat Enterprise I messed 
with...

Does this situation apply during the rebuild?  I.E. once a hot spare has been 
supplied, is the copy to the new drive linear, or will it write dirty pages to 
the new drive out of order, even before the reconstruction's gotten that far, 
_and_ do so in an order that doesn't open this race window of the data being 
unable to be reconstructed?

If "degraded array" just means "don't have a replacement disk yet", then it 
sounds like what Pavel wants to document is "don't write to a degraded array 
at all, because power failures can cost you data due to write granularity 
being larger than filesystem block size".  (Which still comes as news to some 
of us, and you need a way to remount the degraded array read-only until 
the sysadmin can fix it.)
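
As a rough sketch of doing that by hand today (device and mount point names
are only illustrative):

  mount -o remount,ro /srv/data    # stop new writes from the filesystem
  mdadm --readonly /dev/md0        # then mark the degraded array itself read-only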

But if "degraded array" means "hasn't finished rebuilding the new disk yet", 
that could easily be several hours' window and not writing to it is less of an 
option.

(I realize a competent system administrator would obviously already know this, 
but I don't.)

>        	    	       	       	 	      - Ted

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  3:32                                   ` Rik van Riel
  2009-08-26 11:17                                     ` Pavel Machek
@ 2009-08-27  5:27                                     ` Rob Landley
  1 sibling, 0 replies; 309+ messages in thread
From: Rob Landley @ 2009-08-27  5:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Pavel Machek, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Tuesday 25 August 2009 22:32:47 Rik van Riel wrote:
> Pavel Machek wrote:
> >> So, would you be happy if ext3 fsck was always run on reboot (at least
> >> for flash devices)?
> >
> > For flash devices, MD Raid 5 and anything else that needs it; yes that
> > would make me happy ;-).
>
> Sorry, but that just shows your naivete.

Hence wanting documentation properly explaining the situation, yes.

Often the people writing the documentation aren't the people who know the most 
about the situation, but the people who found out they NEED said 
documentation, and post errors until they get sufficient corrections.

In which case "you're wrong, it's actually _this_" is helpful, and "you're 
wrong, go away and stop bothering us grown-ups" isn't.

> Metadata takes up such a small part of the disk that fscking
> it and finding it to be OK is absolutely no guarantee that
> the data on the filesystem has not been horribly mangled.
>
> Personally, what I care about is my data.
>
> The metadata is just a way to get to my data, while the data
> is actually important.

Are you saying ext3 should default to journal=data then?

It seems that the default journaling only handles the metadata, and people 
seem to think that journaled filesystems exist for a reason.

There seems to be a lot of "the guarantees you think a journal provides aren't 
worth anything, so the fact there are circumstances under which it doesn't 
provide them isn't worth telling anybody about" in this thread.  So we 
shouldn't bother with journaled filesystems?  I'm not sure what the intended 
argument is here...

I have no clue what the finished documentation on this issue should look like 
either.  But I want to read it.

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 12:28                                       ` Theodore Tso
@ 2009-08-27  6:06                                         ` Rob Landley
  2009-08-27  6:54                                           ` david
  0 siblings, 1 reply; 309+ messages in thread
From: Rob Landley @ 2009-08-27  6:06 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Pavel Machek, Rik van Riel, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Wednesday 26 August 2009 07:28:13 Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote:
> > > Metadata takes up such a small part of the disk that fscking
> > > it and finding it to be OK is absolutely no guarantee that
> > > the data on the filesystem has not been horribly mangled.
> > >
> > > Personally, what I care about is my data.
> > >
> > > The metadata is just a way to get to my data, while the data
> > > is actually important.
> >
> > Personally, I care about metadata consistency, and ext3 documentation
> > suggests that journal protects its integrity. Except that it does not
> > on broken storage devices, and you still need to run fsck there.
>
> Caring about metadata consistency and not data is just weird, I'm
> sorry.  I can't imagine anyone who actually *cares* about what they
> have stored, whether it's digital photographs of child taking a first
> step, or their thesis research, caring about more about the metadata
> than the data.  Giving advice that pretends that most users have that
> priority is Just Wrong.

I thought the reason for that was that if your metadata is horked, further 
writes to the disk can trash unrelated existing data because it's lost track 
of what's allocated and what isn't.  So back when the assumption was "what's 
written stays written", then keeping the metadata sane was still darn 
important to prevent normal operation from overwriting unrelated existing 
data.

Then Pavel notified us of a situation where interrupted writes to the disk can 
trash unrelated existing data _anyway_, because the flash block size on the 16 
gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks 
it's 4k or smaller.  It seems like what _broke_ was the assumption that the 
filesystem block size >= the disk block size, and nobody noticed for a while.  
(Except the people making jffs2 and friends, anyway.)

Today we have cheap plentiful USB keys that act like hard drives, except that 
their write block size isn't remotely the same as hard drives', but they 
pretend it is, and then the block wear levelling algorithms fuzz things 
further.  (Gee, a drive controller lying about drive geometry, the scsi crowd 
should feel right at home.)

Now Pavel's coming back with a second situation where RAID stripes (under 
certain circumstances) seem to have similar granularity issues, again breaking 
what seems to be the same assumption.  Big media use big chunks for data, and 
media is getting bigger.  It doesn't seem like this problem is going to 
diminish in future.

I agree that it seems like a good idea to have BIG RED WARNING SIGNS about 
those kind of media and how _any_ journaling filesystem doesn't really help 
here.  So specifically documenting "These kinds of media lose unrelated random 
data if writes to them are interrupted, journaling filesystems can't help with 
this and may actually hide the problem, and even an fsck will only find 
corrupted metadata not lost file contents" seems kind of useful.

That said, ext3's assumption that filesystem block size always >= disk update 
block size _is_ a fundamental part of this problem, and one that isn't shared 
by things like jffs2, and which things like btrfs might be able to address if 
they try, by adding awareness of the real media update granularity to their 
node layout algorithms.  (Heck, ext2 has a stripe size parameter already.  
Does setting that appropriately for your raid make this suck less?  I haven't 
heard anybody comment on that one yet...) 
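
For what it's worth, a rough sketch on a reasonably recent e2fsprogs (the
numbers assume a 4-disk RAID5 with 64KiB chunks and 4KiB blocks, so
stride = 64/4 = 16 and stripe-width = 16 * 3 data disks = 48; the device name
is only illustrative):

  mkfs.ext3 -b 4096 -E stride=16,stripe-width=48 /dev/md0

That only aligns allocation to the stripes; it doesn't change the atomicity of
a partial stripe write.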

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 18:02                                             ` Theodore Tso
@ 2009-08-27  6:28                                                 ` Eric Sandeen
  2009-11-09  8:53                                               ` periodic fsck was " Pavel Machek
  2009-11-09  8:53                                               ` Pavel Machek
  2 siblings, 0 replies; 309+ messages in thread
From: Eric Sandeen @ 2009-08-27  6:28 UTC (permalink / raw)
  To: Theodore Tso, david, Pavel Machek, Rik van Riel, Ric Wheeler,
	Florian Weimer, Goswin von Brederlow, Rob Landley, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4,
	corbet, jack

Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 06:43:24AM -0700, david@lang.hm wrote:
>>>> as the ext3 authors have stated many times over the years, you still need
>>>> to run fsck periodically anyway.
>>> Where is that documented?
>> linux-kernel mailing list archives.
> 
> Probably from some 6-8 years ago, in e-mail postings that I made.  My
> argument has always been that PC-class hardware is crap, and it's a
> Really Good Idea to periodically check the metadata because corruption
> there can end up causing massive data loss.  The main problem is that
> doing it at reboot time really hurt system availability, and "after 20
> reboots (plus or minus)" resulted in fsck checks at wildly varying
> intervals depending on how often people reboot.

Aside ... can we default mkfs.ext3 to not set a mandatory fsck interval 
then? :)
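
In the meantime it can at least be switched off per filesystem by hand; a
small sketch (device name only illustrative):

  tune2fs -c 0 -i 0 /dev/md0   # disable both the mount-count and time-based fsck triggers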

-Eric

> What I've been recommending for some time is that people use LVM, and
> run fsck on a snapshot every week or two, at some convenient time when
> the system load is at a minimum.  There is an e2croncheck script in
> the e2fsprogs sources, in the contrib directory; it's short enough
> that I'll attach it here.
> 
> Is it *necessary*?  In a world where hardware is perfect, no.  In a
> world where people don't bother buying ECC memory because it's 10%
> more expensive, and PC builders use the cheapest possible parts --- I
> think it's a really good idea.
> 
> 						- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27  6:06                                         ` Rob Landley
@ 2009-08-27  6:54                                           ` david
  2009-08-27  7:34                                             ` Rob Landley
  2009-08-30  7:19                                             ` Pavel Machek
  0 siblings, 2 replies; 309+ messages in thread
From: david @ 2009-08-27  6:54 UTC (permalink / raw)
  To: Rob Landley
  Cc: Theodore Tso, Pavel Machek, Rik van Riel, Ric Wheeler,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Thu, 27 Aug 2009, Rob Landley wrote:

> On Wednesday 26 August 2009 07:28:13 Theodore Tso wrote:
>> On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote:
>>>> Metadata takes up such a small part of the disk that fscking
>>>> it and finding it to be OK is absolutely no guarantee that
>>>> the data on the filesystem has not been horribly mangled.
>>>>
>>>> Personally, what I care about is my data.
>>>>
>>>> The metadata is just a way to get to my data, while the data
>>>> is actually important.
>>>
>>> Personally, I care about metadata consistency, and ext3 documentation
>>> suggests that journal protects its integrity. Except that it does not
>>> on broken storage devices, and you still need to run fsck there.
>>
>> Caring about metadata consistency and not data is just weird, I'm
>> sorry.  I can't imagine anyone who actually *cares* about what they
>> have stored, whether it's digital photographs of child taking a first
>> step, or their thesis research, caring about more about the metadata
>> than the data.  Giving advice that pretends that most users have that
>> priority is Just Wrong.
>
> I thought the reason for that was that if your metadata is horked, further
> writes to the disk can trash unrelated existing data because it's lost track
> of what's allocated and what isn't.  So back when the assumption was "what's
> written stays written", then keeping the metadata sane was still darn
> important to prevent normal operation from overwriting unrelated existing
> data.
>
> Then Pavel notified us of a situation where interrupted writes to the disk can
> trash unrelated existing data _anyway_, because the flash block size on the 16
> gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks
> it's 4k or smaller.  It seems like what _broke_ was the assumption that the
> filesystem block size >= the disk block size, and nobody noticed for a while.
> (Except the people making jffs2 and friends, anyway.)
>
> Today we have cheap plentiful USB keys that act like hard drives, except that
> their write block size isn't remotely the same as hard drives', but they
> pretend it is, and then the block wear levelling algorithms fuzz things
> further.  (Gee, a drive controller lying about drive geometry, the scsi crowd
> should feel right at home.)

actually, you don't know if your USB key works that way or not. Pavel has 
some that do; that doesn't mean that all flash drives do

when you do a write to a flash drive you have to do the following items

1. allocate an empty eraseblock to put the data on

2. read the old eraseblock

3. merge the incoming write to the eraseblock

4. write the updated data to the flash

5. update the flash translation layer to point reads at the new location 
instead of the old location.

now if the flash drive does things in this order you will not lose any 
previously written data.

if the flash drive does step 5 before it does step 4, then you have a 
window where a crash can lose data (and no, btrfs won't survive any better 
to have a large chunk of data just disappear)

it's possible that some super-cheap flash drives skip having a flash 
translation layer entirely; on those the process would be:

1. read the old data into ram

2. merge the new write into the data in ram

3. erase the old data

4. write the new data

this obviously has a significant data loss window.

but if the device doesn't have a flash translation layer, then repeated 
writes to any one sector will kill the drive fairly quickly. (updates to 
frequently written structures like the FAT, journal, root directory, or 
superblock would kill the sectors they live in, because every change to 
the disk requires an update to the FAT, for example)

> Now Pavel's coming back with a second situation where RAID stripes (under
> certain circumstances) seem to have similar granularity issues, again breaking
> what seems to be the same assumption.  Big media use big chunks for data, and
> media is getting bigger.  It doesn't seem like this problem is going to
> diminish in future.
>
> I agree that it seems like a good idea to have BIG RED WARNING SIGNS about
> those kind of media and how _any_ journaling filesystem doesn't really help
> here.  So specifically documenting "These kinds of media lose unrelated random
> data if writes to them are interrupted, journaling filesystems can't help with
> this and may actually hide the problem, and even an fsck will only find
> corrupted metadata not lost file contents" seems kind of useful.

I think an update to the documentation is a good thing (especially after 
learning that a raid 6 array that has lost a single disk can still be 
corrupted during a powerfail situation), but I also agree that Pavel's 
wording is not detailed enough

> That said, ext3's assumption that filesystem block size always >= disk update
> block size _is_ a fundamental part of this problem, and one that isn't shared
> by things like jffs2, and which things like btrfs might be able to address if
> they try, by adding awareness of the real media update granularity to their
> node layout algorithms.  (Heck, ext2 has a stripe size parameter already.
> Does setting that appropriately for your raid make this suck less?  I haven't
> heard anybody comment on that one yet...)

I thought that that assumption was in the VFS layer, not in any particular 
filesystem


David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27  6:54                                           ` david
@ 2009-08-27  7:34                                             ` Rob Landley
  2009-08-28 14:37                                               ` david
  2009-08-30  7:19                                             ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: Rob Landley @ 2009-08-27  7:34 UTC (permalink / raw)
  To: david
  Cc: Theodore Tso, Pavel Machek, Rik van Riel, Ric Wheeler,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Thursday 27 August 2009 01:54:30 david@lang.hm wrote:
> On Thu, 27 Aug 2009, Rob Landley wrote:
> > On Wednesday 26 August 2009 07:28:13 Theodore Tso wrote:
> >> On Wed, Aug 26, 2009 at 01:17:52PM +0200, Pavel Machek wrote:
> >>>> Metadata takes up such a small part of the disk that fscking
> >>>> it and finding it to be OK is absolutely no guarantee that
> >>>> the data on the filesystem has not been horribly mangled.
> >>>>
> >>>> Personally, what I care about is my data.
> >>>>
> >>>> The metadata is just a way to get to my data, while the data
> >>>> is actually important.
> >>>
> >>> Personally, I care about metadata consistency, and ext3 documentation
> >>> suggests that journal protects its integrity. Except that it does not
> >>> on broken storage devices, and you still need to run fsck there.
> >>
> >> Caring about metadata consistency and not data is just weird, I'm
> >> sorry.  I can't imagine anyone who actually *cares* about what they
> >> have stored, whether it's digital photographs of child taking a first
> >> step, or their thesis research, caring about more about the metadata
> >> than the data.  Giving advice that pretends that most users have that
> >> priority is Just Wrong.
> >
> > I thought the reason for that was that if your metadata is horked,
> > further writes to the disk can trash unrelated existing data because it's
> > lost track of what's allocated and what isn't.  So back when the
> > assumption was "what's written stays written", then keeping the metadata
> > sane was still darn important to prevent normal operation from
> > overwriting unrelated existing data.
> >
> > Then Pavel notified us of a situation where interrupted writes to the
> > disk can trash unrelated existing data _anyway_, because the flash block
> > size on the 16 gig flash key I bought retail at Fry's is 2 megabytes, and
> > the filesystem thinks it's 4k or smaller.  It seems like what _broke_ was
> > the assumption that the filesystem block size >= the disk block size, and
> > nobody noticed for a while. (Except the people making jffs2 and friends,
> > anyway.)
> >
> > Today we have cheap plentiful USB keys that act like hard drives, except
> > that their write block size isn't remotely the same as hard drives', but
> > they pretend it is, and then the block wear levelling algorithms fuzz
> > things further.  (Gee, a drive controller lying about drive geometry, the
> > scsi crowd should feel right at home.)
>
> actually, you don't know if your USB key works that way or not.

Um, yes, I think I do.

> Pavel has some that do; that doesn't mean that all flash drives do

Pretty much all the ones that present a USB disk interface to the outside 
world, and thus have to do hardware wear levelling.  Here's Valerie Aurora on 
the topic:

http://valhenson.livejournal.com/25228.html

>Let's start with hardware wear-leveling. Basically, nearly all practical
> implementations of it suck. You'd imagine that it would spread out writes
> over all the blocks in the drive, only rewriting any particular block after
> every other block has been written. But I've heard from experts several
> times that hardware wear-leveling can be as dumb as a ring buffer of 12
> blocks; each time you write a block, it pulls something out of the queue
> and sticks the old block in. If you only write one block over and over,
> this means that writes will be spread out over a staggering 12 blocks! My
> direct experience working with corrupted flash with built-in wear-leveling
> is that corruption was centered around frequently written blocks (with
> interesting patterns resulting from the interleaving of blocks from
> different erase blocks). As a file systems person, I know what it takes to
> do high-quality wear-leveling: it's called a log-structured file system and
> they are non-trivial pieces of software. Your average consumer SSD is not
> going to have sufficient hardware to implement even a half-assed
> log-structured file system, so clearly it's going to be a lot stupider than
> that.

Back to you:

> when you do a write to a flash drive you have to do the following items
>
> 1. allocate an empty eraseblock to put the data on
>
> 2. read the old eraseblock
>
> 3. merge the incoming write to the eraseblock
>
> 4. write the updated data to the flash
>
> 5. update the flash translation layer to point reads at the new location
> instead of the old location.
>
> now if the flash drive does things in this order you will not lose any
> previously written data.

That's what something like jffs2 will do, sure.  (And note that mounting those 
suckers is slow while it reads the whole disk to figure out what order to put 
the chunks in.)

However, your average consumer level device A) isn't very smart, B) is judged 
almost entirely by price/capacity ratio and thus usually won't even hide 
capacity for bad block remapping.  You expect them to have significant hidden 
capacity to do safer updates with when customers aren't demanding it yet?

> if the flash drive does step 5 before it does step 4, then you have a
> window where a crash can lose data (and no, btrfs won't survive any better
> to have a large chunk of data just disappear)
>
> it's possible that some super-cheap flash drives

I've never seen one that presented a USB disk interface that _didn't_ do this.  
(Not that this observation means much.)  Neither the Windows nor the Macintosh 
world is calling for this yet.  Even the Linux guys barely know about it.  And 
these are the same kinds of manufacturers that NOPed out the flush commands to 
make their benchmarks look better...

> but if the device doesn't have a flash translation layer, then repeated
> writes to any one sector will kill the drive fairly quickly. (updates to
> the FAT would kill the sectors the FAT, journal, root directory, or
> superblock lives in due to the fact that every change to the disk requires
> an update to this file for example)

Yup.  It's got enough of one to get past the warranty, but beyond that they're 
intended for archiving and sneakernet, not for running compiles on.

> > That said, ext3's assumption that filesystem block size always >= disk
> > update block size _is_ a fundamental part of this problem, and one that
> > isn't shared by things like jffs2, and which things like btrfs might be
> > able to address if they try, by adding awareness of the real media update
> > granularity to their node layout algorithms.  (Heck, ext2 has a stripe
> > size parameter already. Does setting that appropriately for your raid
> > make this suck less?  I haven't heard anybody comment on that one yet...)
>
> I thought that that assumption was in the VFS layer, not in any particular
> filesystem

The VFS layer cares about how to talk to the backing store?  I thought that 
was the filesystem driver's job...

I wonder how jffs2 gets around it, then?  (Or for that matter, squashfs...)

> David Lang

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-25  0:08                               ` Theodore Tso
                                                   ` (2 preceding siblings ...)
  2009-08-27  3:34                                 ` [patch] ext2/3: document conditions when reliable operation is possible Rob Landley
@ 2009-08-27  8:46                                 ` David Woodhouse
  2009-08-28 14:46                                   ` david
  3 siblings, 1 reply; 309+ messages in thread
From: David Woodhouse @ 2009-08-27  8:46 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Pavel Machek, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Mon, 2009-08-24 at 20:08 -0400, Theodore Tso wrote:
> 
> (It's worse with people using Digital SLR's shooting in raw mode,
> since it can take upwards of 30 seconds or more to write out a 12-30MB
> raw image, and if you eject at the wrong time, you can trash the
> contents of the entire CF card; in the worst case, the Flash
> Translation Layer data can get corrupted, and the card is completely
> ruined; you can't even reformat it at the filesystem level, but have
> to get a special Windows program from the CF manufacturer to --maybe--
> reset the FTL layer.

This just goes to show why having this "translation layer" done in
firmware on the device itself is a _bad_ idea. We're much better off
when we have full access to the underlying flash and the OS can actually
see what's going on. That way, we can actually debug, fix and recover
from such problems.

>   Early CF cards were especially vulnerable to
> this; more recent CF cards are better, but it's a known failure mode
> of CF cards.)

It's a known failure mode of _everything_ that uses flash to pretend to
be a block device. As I see it, there are no SSD devices which don't
lose data; there are only SSD devices which haven't lost your data
_yet_.

There's no fundamental reason why it should be this way; it just is.

(I'm kind of hoping that the shiny new expensive ones that everyone's
talking about right now, that I shouldn't really be slagging off, are
actually OK. But they're still new, and I'm certainly not trusting them
with my own data _quite_ yet.)

-- 
dwmw2


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27  3:53                                                 ` Rob Landley
@ 2009-08-27 11:43                                                   ` Ric Wheeler
  2009-08-27 20:51                                                     ` Rob Landley
  2009-08-27 22:13                                                     ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek
  0 siblings, 2 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-27 11:43 UTC (permalink / raw)
  To: Rob Landley
  Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/26/2009 11:53 PM, Rob Landley wrote:
> On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote:
>    
>> Repeat experiment until you get up to something like google scale or the
>> other papers on failures in national labs in the US and then we can have an
>> informed discussion.
>>      
> On google scale anvil lightning can fry your machine out of a clear sky.
>
> However, there are still a few non-enterprise users out there, and knowing
> that specific usage patterns don't behave like they expect might be useful to
> them.
>
>    

You are missing the broader point of both papers. They (and people like 
me, back when I was at EMC) look at large numbers of machines and try to fix 
what actually breaks and causes data loss when run in the real world. 
The motherboards, S-ATA controllers, disk types are the same class of 
parts that I have in my desktop box today.

The advantage of google, national labs, etc is that they have large 
numbers of systems and can draw conclusions that are meaningful to our 
broad user base.

Specifically, in using S-ATA drives (just like ours, maybe slightly more 
reliable) they see up to 7% of those drives fail each year.  All users 
have "soft" drive failures like single remapped sectors.

These errors happen extremely commonly and are what RAID deals with well.

What does not happen commonly is that during the RAID rebuild (kicked 
off only after a drive is kicked out), you push the power button or have 
a second failure (power outage).

We will have more users lose data if they decide to use ext2 instead of 
ext3 and use only single disk storage.

We have real numbers that show that is true. Injecting double faults 
into a system that handles single faults is frankly not that interesting.

You can get better protection from these double faults if you move to 
"cloud" like storage configs where each box is fault tolerant, but you 
also spread your data over multiple boxes in multiple locations.

Regards,

Ric

>>>> I can promise you that hot unplugging and replugging a S-ATA drive will
>>>> also lose you data if you are actively writing to it (ext2, 3,
>>>> whatever).
>>>>          
>>> I can promise you that running S-ATA drive will also lose you data,
>>> even if you are not actively writing to it. Just wait 10 years; so
>>> what is your point?
>>>        
>> I lost a s-ata drive 24 hours after installing it in a new box. If I had
>> MD5 RAID5, I would not have lost any.
>>
>> My point is that you fail to take into account the rate of failures of a
>> given configuration and the probability of data loss given those rates.
>>      
> Actually, that's _exactly_ what he's talking about.
>
> When writing to a degraded raid or a flash disk, journaling is essentially
> useless.  If you get a power failure, kernel panic, somebody tripping over a
> USB cable, and so on, your filesystem will not be protected by journaling.
> Your data won't be trashed _every_ time, but the likelihood is much greater
> than experience with journaling in other contexts would suggest.
>
> Worse, the journaling may be counterproductive by _hiding_ many errors that
> fsck would promptly detect, so when the error is detected it may not be
> associated with the event that caused it.  It also may not be noticed until
> good backups of the data have been overwritten or otherwise cycled out.
>
> You seem to be arguing that Linux is no longer used anywhere but the
> enterprise, so issues affecting USB flash keys or cheap software-only RAID
> aren't worth documenting?
>
> Rob
>    


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27  5:19                                                               ` Rob Landley
@ 2009-08-27 12:24                                                                 ` Theodore Tso
  2009-08-27 13:10                                                                   ` Ric Wheeler
                                                                                     ` (3 more replies)
  0 siblings, 4 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-27 12:24 UTC (permalink / raw)
  To: Rob Landley
  Cc: Ric Wheeler, Pavel Machek, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Thu, Aug 27, 2009 at 12:19:02AM -0500, Rob Landley wrote:
> > To me, this isn't a particularly interesting or newsworthy point,
> > since a competent system administrator
> 
> I'm a bit concerned by the argument that we don't need to document
> serious pitfalls because every Linux system has a sufficiently
> competent administrator they already know stuff that didn't even
> come up until the second or third day it was discussed on lkml.

I'm not convinced that information which needs to be known by System
Administrators is best documented in the kernel Documentation
directory.  Should there be a HOWTO document on stuff like that?
Sure, if someone wants to put something like that together, having
free documentation about ways to set up your storage stack in a sane
way is not a bad thing.  

It should be noted that these sorts of issues are discussed in various
books targeted at System Administrators, and in Usenix's System
Administration tutorials.  The computer industry is highly
specialized, and so just because an OS kernel hacker might not be
familiar with these issues, doesn't mean that professionals whose job
it is to run data centers don't know about these things!  Similarly,
you could be a whiz at Linux's networking stack, but you might not
know about certain pitfalls in configuring a Cisco router using IOS;
does that mean we should have an IOS tutorial in the kernel
documentation directory?  I'm not so sure about that!

> "You're documenting it wrong" != "you shouldn't document it".

Sure, but the fact that we don't currently say much about storage
stacks doesn't mean we should accept a patch that might actively
mislead people.   I'm NACK'ing the patch on that basis.

> > who cares about his data and/or
> > his hardware will (a) have a UPS,
> 
> I worked at a company that retested their UPSes a year after
> installing them and found that _none_ of them supplied more than 15
> seconds charge, and when they dismantled them the batteries had
> physically bloated inside their little plastic cases.  (Same company
> as the dead air conditioner, possibly overheating was involved but
> the little _lights_ said everything was ok.)
> 
> That was by no means the first UPS I'd seen die, the suckers have a
> higher failure rate than hard drives in my experience.  This is a
> device where the batteries get constantly charged and almost never
> tested because if it _does_ fail you just rebooted your production
> server, so a lot of smaller companies think they have one but
> actually don't.

Sounds like they were using really cheap UPS's; certainly not the kind
I would expect to find in a data center.  And if a company's system
administrator is using the cheapest possible consumer-grade UPS's,
then yes, they might have a problem.  Even an educational institution
like MIT, where I was a network administrator some 15 years ago, had
proper UPS's, *and* we had a diesel generator which kicked in after 15
seconds --- and we tested the diesel generator every Friday morning,
to make sure it worked properly.

> > , and (b) be running with a hot spare
> > and/or will imediately replace a failed drive in a RAID array.
> 
> Here's hoping they shut the system down properly to install the new
> drive in the raid then, eh?  Not accidentally pull the plug before
> it's finished running the ~7 minutes of shutdown scripts in the last
> Red Hat Enterprise I messed with...

Even my home RAID array uses hot-plug SATA disks, so I can replace a
failed disk without shutting down my system.  (And yes, I have a
backup battery for the hardware RAID, and the firmware runs periodic
tests on it; the hardware RAID card also will send me e-mail if a RAID
array drive fails and it needs to use my hot-spare.  At that point, I
order a new hard drive, secure in the knowledge that the system can
still suffer another drive failure before falling into degraded mode.
And no, this isn't some expensive enterprise RAID setup; this is just
a mid-range Areca RAID card.)

> If "degraded array" just means "don't have a replacement disk yet",
> then it sounds like what Pavel wants to document is "don't write to
> a degraded array at all, because power failures can cost you data
> due to write granularity being larger than filesystem block size".
> (Which still comes as news to some of us, and you need a way to
> remount mount the degraded array read only until the sysadmin can
> fix it.)

If you want to document that as a property of RAID arrays, sure.  But
it's not something that should live in Documentation/filesystems/ext2.txt
and Documentation/filesystems/ext3.txt.  The MD RAID howto might be a
better place, since it's far more likely that more users will read it.  How
many system administrators read what's in the kernel's Documentation
directory, after all?  This is basic information about how RAID
works; it's not necessarily something that someone would *expect* to
be in kernel documentation, nor would they necessarily go looking for it
there.  And the reality is that it's not like most people go reading
Documentation/* for pleasure.  :-)

BTW, the RAID write atomicity issue and the possibility of failures
causing data loss *are* documented in the Wikipedia article on RAID.
It's not written as direct practical advice to a system
administrator (you'd have to go to a book that is really targeted at
system administrators to find that sort of thing).

       		      	      	   	   - Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27 12:24                                                                 ` Theodore Tso
  2009-08-27 13:10                                                                   ` Ric Wheeler
@ 2009-08-27 13:10                                                                   ` Ric Wheeler
  2009-08-27 16:54                                                                     ` MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Jeff Garzik
  2009-08-29 10:02                                                                   ` [patch] ext2/3: document conditions when reliable operation is possible Pavel Machek
  2009-08-29 10:02                                                                   ` Pavel Machek
  3 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-27 13:10 UTC (permalink / raw)
  To: Theodore Tso, Rob Landley, Pavel Machek, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On 08/27/2009 08:24 AM, Theodore Tso wrote:
> On Thu, Aug 27, 2009 at 12:19:02AM -0500, Rob Landley wrote:
>>> To me, this isn't a particularly interesting or newsworthy point,
>>> since a competent system administrator
>>
>> I'm a bit concerned by the argument that we don't need to document
>> serious pitfalls because every Linux system has a sufficiently
>> competent administrator they already know stuff that didn't even
>> come up until the second or third day it was discussed on lkml.
>
> I'm not convinced that information which needs to be known by System
> Administrators is best documented in the kernel Documentation
> directory.  Should there be a HOWTO document on stuff like that?
> Sure, if someone wants to put something like that together, having
> free documentation about ways to set up your storage stack in a sane
> way is not a bad thing.
>
> It should be noted that these sorts of issues are discussed in various
> books targetted at System Administrators, and in Usenix's System
> Administration tutorials.  The computer industry is highly
> specialized, and so just because an OS kernel hacker might not be
> familiar with these issues, doesn't mean that professionals whose job
> it is to run data centers don't know about these things!  Similarly,
> you could be a whiz at Linux's networking stack, but you might not
> know about certain pitfalls in configuring a Cisco router using IOS;
> does that mean we should have an IOS tutorial in the kernel
> documentation directory?  I'm not so sure about that!
>
>> "You're documenting it wrong" != "you shouldn't document it".
>
> Sure, but the fact that we don't currently say much about storage
> stacks doesn't mean we should accept a patch that might actively
> mislead people.   I'm NACK'ing the patch on that basis.
>
>>> who cares about his data and/or
>>> his hardware will (a) have a UPS,
>>
>> I worked at a company that retested their UPSes a year after
>> installing them and found that _none_ of them supplied more than 15
>> seconds charge, and when they dismantled them the batteries had
>> physically bloated inside their little plastic cases.  (Same company
>> as the dead air conditioner, possibly overheating was involved but
>> the little _lights_ said everything was ok.)
>>
>> That was by no means the first UPS I'd seen die, the suckers have a
>> higher failure rate than hard drives in my experience.  This is a
>> device where the batteries get constantly charged and almost never
>> tested because if it _does_ fail you just rebooted your production
>> server, so a lot of smaller companies think they have one but
>> actually don't.
>
> Sounds like they were using really cheap UPS's; certainly not the kind
> I would expect to find in a data center.  And if company's system
> administrator is using the cheapest possible consumer-grade UPS's,
> then yes, they might have a problem.  Even an educational institution
> like MIT, where I was an network administrator some 15 years ago, had
> proper UPS's, *and* we had a diesel generator which kicked in after 15
> seconds --- and we tested the diesel generator every Friday morning,
> to make sure it worked properly.
>
>>> , and (b) be running with a hot spare
>>> and/or will imediately replace a failed drive in a RAID array.
>>
>> Here's hoping they shut the system down properly to install the new
>> drive in the raid then, eh?  Not accidentally pull the plug before
>> it's finished running the ~7 minutes of shutdown scripts in the last
>> Red Hat Enterprise I messed with...
>
> Even my home RAID array uses hot-plug SATA disks, so I can replace a
> failed disk without shutting down my system.  (And yes, I have a
> backup battery for the hardware RAID, and the firmware runs periodic
> tests on it; the hardware RAID card also will send me e-mail if a RAID
> array drive fails and it needs to use my hot-spare.  At that point, I
> order a new hard drive, secure in the knowledge that the system can
> still suffer another drive failure before falling into degraded mode.
> And no, this isn't some expensive enterprise RAID setup; this is just
> a mid-range Areca RAID card.)
>
>> If "degraded array" just means "don't have a replacement disk yet",
>> then it sounds like what Pavel wants to document is "don't write to
>> a degraded array at all, because power failures can cost you data
>> due to write granularity being larger than filesystem block size".
>> (Which still comes as news to some of us, and you need a way to
>> remount mount the degraded array read only until the sysadmin can
>> fix it.)
>
> If you want to document that as a property of RAID arrays, sure.  But
> it's not something that should live in Documentation/filesystems/ext2.txt
> and Documentation/filesystems/ext3.txt.  The MD RAID howto might be a
> better place, since it's far more likely more users will read it.  How
> many system administrators read what's in the kernel's Documentation
> directory, after all, and this is basic information about how RAID
> works; it's not necessarily something that someone would *expect* to
> be in kernel documentation, nor would necessarily go looking for it
> there.  And the reality is that it's not like most people go reading
> Documentation/* for pleasure.  :-)
>
> BTW, the RAID write atomicity issue and the possibility of failures
> cause data loss *is* documented in the Wikipedia article on RAID.
> It's not as written as direct practical advice to a system
> administrator (you'd have to go to a book that is really targetted at
> system administrators to find that sort of thing).
>
>         		      	      	   	   - Ted

One thing that does need fixing for some MD configurations is to stress again 
that we need to make sure that barrier operations are properly supported or 
users will need to disable the write cache on devices with volatile write caches.

Ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-27 13:10                                                                   ` Ric Wheeler
@ 2009-08-27 16:54                                                                     ` Jeff Garzik
  2009-08-27 18:09                                                                       ` Alasdair G Kergon
  2009-09-01 14:01                                                                       ` Pavel Machek
  0 siblings, 2 replies; 309+ messages in thread
From: Jeff Garzik @ 2009-08-27 16:54 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Theodore Tso, Rob Landley, Pavel Machek, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On 08/27/2009 09:10 AM, Ric Wheeler wrote:
> One thing that does need fixing for some MD configurations is to stress
> again that we need to make sure that barrier operations are properly
> supported or users will need to disable the write cache on devices with
> volatile write caches.

Agreed; chime in on Christoph's linux-vfs thread if people have input.

I quickly glanced at MD and DM.  Currently, upstream, we see a lot of

         if (unlikely(bio_barrier(bio))) {
                 bio_endio(bio, -EOPNOTSUPP);
                 return 0;
         }

in DM and MD make_request functions.
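
(For anyone following along without the source handy: bio_barrier() in
kernels of this vintage is, if memory serves, just a flag test --
roughly

         /* rough sketch, not a verbatim quote of bio.h: a barrier request
          * is marked by the BIO_RW_BARRIER bit in bio->bi_rw, so the
          * bio_barrier() helper is approximately */
         #define bio_barrier(bio)  ((bio)->bi_rw & (1 << BIO_RW_BARRIER))

so a stacking driver that can't preserve write ordering across its member
devices has little choice but to either pass the barrier down properly or
fail the bio as above.)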

Only md/raid1 supports barriers at present, it seems.  None of the other 
MD drivers support barriers.

DM has some barrier code...  but the above code was pasted from DM's 
make_request function, so I am guessing that DM's barrier stuff is 
incomplete and disabled at present.

I've been mentioning this issue for years... glad some people finally 
noticed :)

	Jeff




^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-27 16:54                                                                     ` MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Jeff Garzik
@ 2009-08-27 18:09                                                                       ` Alasdair G Kergon
  2009-09-01 14:01                                                                       ` Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Alasdair G Kergon @ 2009-08-27 18:09 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Ric Wheeler, Theodore Tso, Rob Landley, Pavel Machek,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet,
	Mikulas Patocka

On Thu, Aug 27, 2009 at 12:54:05PM -0400, Jeff Garzik wrote:
> DM has some barrier code...  but the above code was pasted from DM's  
> make_request function, so I am guessing that DM's barrier stuff is  
> incomplete and disabled at present.

That code is from the new request-based multipath implementation in 2.6.31
which doesn't support barriers yet.

But bio-based dm does support barriers now.  (Just missing some patches to
complete the dm-raid1 support that are still under review IIRC.)

Alasdair

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27 11:43                                                   ` Ric Wheeler
@ 2009-08-27 20:51                                                     ` Rob Landley
  2009-08-27 22:00                                                       ` Ric Wheeler
  2009-08-28 14:49                                                       ` david
  2009-08-27 22:13                                                     ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek
  1 sibling, 2 replies; 309+ messages in thread
From: Rob Landley @ 2009-08-27 20:51 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Thursday 27 August 2009 06:43:49 Ric Wheeler wrote:
> On 08/26/2009 11:53 PM, Rob Landley wrote:
> > On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote:
> >> Repeat experiment until you get up to something like google scale or the
> >> other papers on failures in national labs in the US and then we can have
> >> an informed discussion.
> >
> > On google scale anvil lightning can fry your machine out of a clear sky.
> >
> > However, there are still a few non-enterprise users out there, and
> > knowing that specific usage patterns don't behave like they expect might
> > be useful to them.
>
> You are missing the broader point of both papers.

No, I'm dismissing the papers (some of which I read when they first came out 
and got slashdotted) as irrelevant to the topic at hand.

Pavel has two failure modes which he can trivially reproduce.  The USB stick 
one is reproducible on a laptop by jostling said stick.  I myself used to have 
a literal USB keychain, and the weight of keys dangling from it pulled it out 
of the USB socket fairly easily if I wasn't careful.  At the time nobody had 
told me a journaling filesystem was not a reasonable safeguard here.

Presumably the degraded raid one can be reproduced under an emulator, with no 
hardware directly involved at all, so talking about hardware failure rates 
ignores the fact that he's actually discussing a _software_ problem.  It may 
happen in _response_ to hardware failures, but the damage he's attempting to 
document happens entirely in software.

These failure modes can cause data loss which journaling can't help, but which 
journaling might (or might not) conceivably hide so you don't immediately 
notice it.  They share a common underlying assumption that the storage 
device's update granularity is less than or equal to the filesystem's block 
size, which is not actually true of all modern storage devices.  The fact he's 
only _found_ two instances where this assumption bites doesn't mean there 
aren't more waiting to be found, especially as more new storage media types 
get introduced.
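
(To make the "update granularity" point concrete, here is a purely
illustrative user-space toy; the 128k erase block is an assumed figure,
not a measurement of any particular device:

        /* toy model: how many filesystem blocks share one flash erase
         * block, i.e. how many neighbours one interrupted write can hit */
        #include <stdio.h>

        int main(void)
        {
                unsigned long fs_block = 4096;          /* ext3 block size */
                unsigned long erase_block = 128 * 1024; /* assumed erase unit */
                unsigned long target = 1000;            /* block being rewritten */
                unsigned long per_erase = erase_block / fs_block;
                unsigned long first = (target / per_erase) * per_erase;

                printf("rewriting block %lu can disturb blocks %lu..%lu\n",
                       target, first, first + per_erase - 1);
                return 0;
        }

Journaling only covers the one block you asked to write, not the other 31
in the same erase block.)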

Pavel's response was to attempt to document this.  Not that journaling is 
_bad_, but that it doesn't protect against this class of problem.

Your response is to talk about google clusters, cloud storage, and cite 
academic papers of statistical hardware failure rates.  As I understand the 
discussion, that's not actually the issue Pavel's talking about, merely one 
potential trigger for it.

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 13:40                                                                   ` Chris Adams
  2009-08-26 13:47                                                                     ` Alan Cox
@ 2009-08-27 21:50                                                                     ` Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-27 21:50 UTC (permalink / raw)
  To: Chris Adams; +Cc: Theodore Tso, linux-kernel

Hi!

> The other thing about this thread is that the only RAID implementation
> that is being discussed here is the MD RAID stack.  There are a lot of
> RAID implementations that have the same issues:
> 
> - motherboard (aka "fake") RAID - In Linux this is typically mapped with
>   device mapper via dmraid; AFAIK there is not a tool to scrub (or even
>   monitor the status of and notify on failure) a Linux DM RAID setup.
> 
> - hardware RAID cards without battery backup - these have the exact same
>   issues because they cannot guarantee all writes complete, nor can they
>   keep track of incomplete writes across power failures
> 
> - hardware RAID cards _with_ battery backup but that don't periodically
>   test the battery and have a way to notify you of battery failure while
>   Linux is running
> 
> The issues being raised here are not specific to extX, MD RAID, or Linux
> at all; they are problems with non-"enterprise-class" RAID setups.
> There's a reason enterprise-class RAID costs a lot more money than the
> card you can pick up at Fry's.
> 
> There's no reason to document the design issues of general RAID
> implementations in the Linux kernel.

Even when we carry one of those misdesigned implementations in-tree?
(Note that fixed implementations do exist -- AIX? -- just add journal).

'I won't tell you that this pony bites, because many ponies do bite'?

WTF? I thought we had higher moral standards than this.

							Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27 20:51                                                     ` Rob Landley
@ 2009-08-27 22:00                                                       ` Ric Wheeler
  2009-08-28 14:49                                                       ` david
  1 sibling, 0 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-27 22:00 UTC (permalink / raw)
  To: Rob Landley
  Cc: Pavel Machek, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/27/2009 04:51 PM, Rob Landley wrote:
> On Thursday 27 August 2009 06:43:49 Ric Wheeler wrote:
>    
>> On 08/26/2009 11:53 PM, Rob Landley wrote:
>>      
>>> On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote:
>>>        
>>>> Repeat experiment until you get up to something like google scale or the
>>>> other papers on failures in national labs in the US and then we can have
>>>> an informed discussion.
>>>>          
>>> On google scale anvil lightning can fry your machine out of a clear sky.
>>>
>>> However, there are still a few non-enterprise users out there, and
>>> knowing that specific usage patterns don't behave like they expect might
>>> be useful to them.
>>>        
>> You are missing the broader point of both papers.
>>      
> No, I'm dismissing the papers (some of which I read when they first came out
> and got slashdotted) as irrelevant to the topic at hand.
>    

I guess I have to dismiss your dismissing then.
> Pavel has two failure modes which he can trivially reproduce.  The USB stick
> one is reproducible on a laptop by jostling said stick.  I myself used to have
> a literal USB keychain, and the weight of keys dangling from it pulled it out
> of the USB socket fairly easily if I wasn't careful.  At the time nobody had
> told me a journaling filesystem was not a reasonable safeguard here.
>
> Presumably the degraded raid one can be reproduced under an emulator, with no
> hardware directly involved at all, so talking about hardware failure rates
> ignores the fact that he's actually discussing a _software_ problem.  It may
> happen in _response_ to hardware failures, but the damage he's attempting to
> document happens entirely in software.
>
> These failure modes can cause data loss which journaling can't help, but which
> journaling might (or might not) conceivably hide so you don't immediately
> notice it.  They share a common underlying assumption that the storage
> device's update granularity is less than or equal to the filesystem's block
> size, which is not actually true of all modern storage devices.  The fact he's
> only _found_ two instances where this assumption bites doesn't mean there
> aren't more waiting to be found, especially as more new storage media types
> get introduced.
>
> Pavel's response was to attempt to document this.  Not that journaling is
> _bad_, but that it doesn't protect against this class of problem.
>
> Your response is to talk about google clusters, cloud storage, and cite
> academic papers of statistical hardware failure rates.  As I understand the
> discussion, that's not actually the issue Pavel's talking about, merely one
> potential trigger for it.
>
> Rob
>    



^ permalink raw reply	[flat|nested] 309+ messages in thread

* raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-27 11:43                                                   ` Ric Wheeler
  2009-08-27 20:51                                                     ` Rob Landley
@ 2009-08-27 22:13                                                     ` Pavel Machek
  2009-08-28  1:32                                                       ` Ric Wheeler
  1 sibling, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-27 22:13 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet


>>> Repeat experiment until you get up to something like google scale or the
>>> other papers on failures in national labs in the US and then we can have an
>>> informed discussion.
>>>      
>> On google scale anvil lightning can fry your machine out of a clear sky.
>>
>> However, there are still a few non-enterprise users out there, and knowing
>> that specific usage patterns don't behave like they expect might be useful to
>> them.
>
> You are missing the broader point of both papers. They (and people like  
> me when back at EMC) look at large numbers of machines and try to fix  
> what actually breaks when run in the real world and causes data loss.  
> The motherboards, S-ATA controllers, disk types are the same class of  
> parts that I have in my desktop box today.
...
> These errors happen extremely commonly and are what RAID deals with well.
>
> What does not happen commonly is that during the RAID rebuild (kicked  
> off only after a drive is kicked out), you push the power button or have  
> a second failure (power outage).
>
> We will have more users loose data if they decide to use ext2 instead of  
> ext3 and use only single disk storage.

So your argument basically is

'our abs brakes are broken, but let's not tell anyone; our car is still
safer than a horse'.

and

'while we know our abs brakes are broken, they are not a major factor in
accidents, so let's not tell anyone'.

Sorry, but I'd expect slightly higher moral standards. If we can
document it in a way that's non-scary, and does not push people to
single disks (horses), please go ahead; but you have to mention that
md raid breaks journalling assumptions (our abs brakes really are
broken).
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-27 22:13                                                     ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek
@ 2009-08-28  1:32                                                       ` Ric Wheeler
  2009-08-28  6:44                                                         ` Pavel Machek
                                                                           ` (2 more replies)
  0 siblings, 3 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-28  1:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/27/2009 06:13 PM, Pavel Machek wrote:
>
>>>> Repeat experiment until you get up to something like google scale or the
>>>> other papers on failures in national labs in the US and then we can have an
>>>> informed discussion.
>>>>
>>> On google scale anvil lightning can fry your machine out of a clear sky.
>>>
>>> However, there are still a few non-enterprise users out there, and knowing
>>> that specific usage patterns don't behave like they expect might be useful to
>>> them.
>>
>> You are missing the broader point of both papers. They (and people like
>> me when back at EMC) look at large numbers of machines and try to fix
>> what actually breaks when run in the real world and causes data loss.
>> The motherboards, S-ATA controllers, disk types are the same class of
>> parts that I have in my desktop box today.
> ...
>> These errors happen extremely commonly and are what RAID deals with well.
>>
>> What does not happen commonly is that during the RAID rebuild (kicked
>> off only after a drive is kicked out), you push the power button or have
>> a second failure (power outage).
>>
>> We will have more users loose data if they decide to use ext2 instead of
>> ext3 and use only single disk storage.
>
> So your argument basically is
>
> 'our abs brakes are broken, but lets not tell anyone; our car is still
> safer than a horse'.
>
> and
>
> 'while we know our abs brakes are broken, they are not major factor in
> accidents, so lets not tell anyone'.
>
> Sorry, but I'd expect slightly higher moral standards. If we can
> document it in a way that's non-scary, and does not push people to
> single disks (horses), please go ahead; but you have to mention that
> md raid breaks journalling assumptions (our abs brakes really are
> broken).
> 								Pavel
>


You continue to ignore the technical facts that everyone (both MD and ext3 
people) put in front of you.

If you have a specific bug in MD code, please propose a patch.

Ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-28  1:32                                                       ` Ric Wheeler
@ 2009-08-28  6:44                                                         ` Pavel Machek
  2009-08-28  7:31                                                             ` NeilBrown
  2009-08-28 11:16                                                           ` Ric Wheeler
  2009-08-28  7:11                                                         ` raid is dangerous but that's secret Florian Weimer
  2009-08-28 12:08                                                         ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Theodore Tso
  2 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-28  6:44 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Thu 2009-08-27 21:32:49, Ric Wheeler wrote:
> On 08/27/2009 06:13 PM, Pavel Machek wrote:
>>
>>>>> Repeat experiment until you get up to something like google scale or the
>>>>> other papers on failures in national labs in the US and then we can have an
>>>>> informed discussion.
>>>>>
>>>> On google scale anvil lightning can fry your machine out of a clear sky.
>>>>
>>>> However, there are still a few non-enterprise users out there, and knowing
>>>> that specific usage patterns don't behave like they expect might be useful to
>>>> them.
>>>
>>> You are missing the broader point of both papers. They (and people like
>>> me when back at EMC) look at large numbers of machines and try to fix
>>> what actually breaks when run in the real world and causes data loss.
>>> The motherboards, S-ATA controllers, disk types are the same class of
>>> parts that I have in my desktop box today.
>> ...
>>> These errors happen extremely commonly and are what RAID deals with well.
>>>
>>> What does not happen commonly is that during the RAID rebuild (kicked
>>> off only after a drive is kicked out), you push the power button or have
>>> a second failure (power outage).
>>>
>>> We will have more users loose data if they decide to use ext2 instead of
>>> ext3 and use only single disk storage.
>>
>> So your argument basically is
>>
>> 'our abs brakes are broken, but lets not tell anyone; our car is still
>> safer than a horse'.
>>
>> and
>>
>> 'while we know our abs brakes are broken, they are not major factor in
>> accidents, so lets not tell anyone'.
>>
>> Sorry, but I'd expect slightly higher moral standards. If we can
>> document it in a way that's non-scary, and does not push people to
>> single disks (horses), please go ahead; but you have to mention that
>> md raid breaks journalling assumptions (our abs brakes really are
>> broken).
>
> You continue to ignore the technical facts that everyone (both MD and 
> ext3) people put in front of you.
>
> If you have a specific bug in MD code, please propose a patch.

Interesting. So, what's technically wrong with the patch below?

									Pavel
---

From: Theodore Tso <tytso@mit.edu>

Document that many devices are too broken for filesystems to protect
data in case of powerfail.

Signed-of-by: Pavel Machek <pavel@ucw.cz> 

diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..2f3eec1
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,21 @@
+There are storage devices that high highly undesirable properties when
+they are disconnected or suffer power failures while writes are in
+progress; such devices include flash devices and DM/MD RAID 4/5/6 (*)
+arrays.  These devices have the property of potentially corrupting
+blocks being written at the time of the power failure, and worse yet,
+amplifying the region where blocks are corrupted such that additional
+sectors are also damaged during the power failure.
+        
+Users who use such storage devices are well advised take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used.  Regular backups when using these devices is also a
+Very Good Idea.
+        
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption.  An forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
+
+(*) Degraded array or single disk failure "near" the powerfail is
+neccessary for this property of RAID arrays to bite.


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret
  2009-08-28  1:32                                                       ` Ric Wheeler
  2009-08-28  6:44                                                         ` Pavel Machek
@ 2009-08-28  7:11                                                         ` Florian Weimer
  2009-08-28  7:23                                                           ` NeilBrown
  2009-08-28 12:08                                                         ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Theodore Tso
  2 siblings, 1 reply; 309+ messages in thread
From: Florian Weimer @ 2009-08-28  7:11 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Pavel Machek, Rob Landley, Theodore Tso, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

* Ric Wheeler:

> You continue to ignore the technical facts that everyone (both MD and
> ext3) people put in front of you.
>
> If you have a specific bug in MD code, please propose a patch.

In RAID 1 mode, it should read both copies and error out on
mismatch. 8-)

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret
  2009-08-28  7:11                                                         ` raid is dangerous but that's secret Florian Weimer
@ 2009-08-28  7:23                                                           ` NeilBrown
  0 siblings, 0 replies; 309+ messages in thread
From: NeilBrown @ 2009-08-28  7:23 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Ric Wheeler, Pavel Machek, Rob Landley, Theodore Tso,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Fri, August 28, 2009 5:11 pm, Florian Weimer wrote:
> * Ric Wheeler:
>
>> You continue to ignore the technical facts that everyone (both MD and
>> ext3) people put in front of you.
>>
>> If you have a specific bug in MD code, please propose a patch.
>
> In RAID 1 mode, it should read both copies and error out on
> mismatch. 8-)

Despite your smiley:

  no it shouldn't, and no one is making any claims about raid1 being
  unsafe, only raid4/5/6.

NeilBrown


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:  document conditions when reliable operation is possible)
  2009-08-28  6:44                                                         ` Pavel Machek
@ 2009-08-28  7:31                                                             ` NeilBrown
  2009-08-28 11:16                                                           ` Ric Wheeler
  1 sibling, 0 replies; 309+ messages in thread
From: NeilBrown @ 2009-08-28  7:31 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Rob Landley, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Fri, August 28, 2009 4:44 pm, Pavel Machek wrote:
> On Thu 2009-08-27 21:32:49, Ric Wheeler wrote:
>>>
>> If you have a specific bug in MD code, please propose a patch.
>
> Interesting. So, what's technically wrong with the patch below?
>

You mean apart from ".... that high highly undesirable ...." ??
                               ^^^^^^^^^^^

And the phrase "Regular backups when using these devices ...." should
be "Regular backups when using any devices .....".
                               ^^^
If you have a device failure near a power fail on a raid5 you might
lose some blocks of data.  If you have a device failure near (or not
near) a power failure on raid0 or jbod etc you will certainly lose lots
of blocks of data.
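
(For anyone who wants to see why a degraded raid5 read goes wrong after an
interrupted stripe update, here is a toy user-space sketch of the parity
arithmetic -- not the actual MD code, and the byte values are made up:

        /* raid5 write hole in miniature: parity = d0 ^ d1 ^ d2.  If d1's
         * new data reaches the disk but the matching parity write is lost
         * to a power failure, rebuilding the *other* member d0 from the
         * now-stale parity returns garbage. */
        #include <stdio.h>

        int main(void)
        {
                unsigned char d0 = 0x11, d1 = 0x22, d2 = 0x33;
                unsigned char parity = d0 ^ d1 ^ d2;   /* consistent stripe */

                d1 = 0x99;      /* data write completes, parity write lost */

                unsigned char rebuilt = parity ^ d1 ^ d2;  /* degraded read */
                printf("real d0 = 0x%02x, rebuilt d0 = 0x%02x\n", d0, rebuilt);
                return 0;
        }

Note that it is a block the application never touched that comes back
wrong, which is why a filesystem journal cannot protect against it.)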

I think it would be better to say:

   ".... and degraded DM/MD RAID 4/5/6(*) arrays..."
             ^^^^^^^^
with
(*) If device failure causes the array to become degraded during or
immediately after the power failure, the same problem can result.

And "necessary" only has the one 'c' :-)

NeilBrown

> 									Pavel
> ---
>
> From: Theodore Tso <tytso@mit.edu>
>
> Document that many devices are too broken for filesystems to protect
> data in case of powerfail.
>
> Signed-of-by: Pavel Machek <pavel@ucw.cz>
>
> diff --git a/Documentation/filesystems/dangers.txt
> b/Documentation/filesystems/dangers.txt
> new file mode 100644
> index 0000000..2f3eec1
> --- /dev/null
> +++ b/Documentation/filesystems/dangers.txt
> @@ -0,0 +1,21 @@
> +There are storage devices that high highly undesirable properties when
> +they are disconnected or suffer power failures while writes are in
> +progress; such devices include flash devices and DM/MD RAID 4/5/6 (*)
> +arrays.  These devices have the property of potentially corrupting
> +blocks being written at the time of the power failure, and worse yet,
> +amplifying the region where blocks are corrupted such that additional
> +sectors are also damaged during the power failure.
> +
> +Users who use such storage devices are well advised take
> +countermeasures, such as the use of Uninterruptible Power Supplies,
> +and making sure the flash device is not hot-unplugged while the device
> +is being used.  Regular backups when using these devices is also a
> +Very Good Idea.
> +
> +Otherwise, file systems placed on these devices can suffer silent data
> +and file system corruption.  An forced use of fsck may detect metadata
> +corruption resulting in file system corruption, but will not suffice
> +to detect data corruption.
> +
> +(*) Degraded array or single disk failure "near" the powerfail is
> +neccessary for this property of RAID arrays to bite.
>
>

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-28  6:44                                                         ` Pavel Machek
  2009-08-28  7:31                                                             ` NeilBrown
@ 2009-08-28 11:16                                                           ` Ric Wheeler
  2009-09-01 13:58                                                             ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-28 11:16 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/28/2009 02:44 AM, Pavel Machek wrote:
> On Thu 2009-08-27 21:32:49, Ric Wheeler wrote:
>> On 08/27/2009 06:13 PM, Pavel Machek wrote:
>>>
>>>>>> Repeat experiment until you get up to something like google scale or the
>>>>>> other papers on failures in national labs in the US and then we can have an
>>>>>> informed discussion.
>>>>>>
>>>>> On google scale anvil lightning can fry your machine out of a clear sky.
>>>>>
>>>>> However, there are still a few non-enterprise users out there, and knowing
>>>>> that specific usage patterns don't behave like they expect might be useful to
>>>>> them.
>>>>
>>>> You are missing the broader point of both papers. They (and people like
>>>> me when back at EMC) look at large numbers of machines and try to fix
>>>> what actually breaks when run in the real world and causes data loss.
>>>> The motherboards, S-ATA controllers, disk types are the same class of
>>>> parts that I have in my desktop box today.
>>> ...
>>>> These errors happen extremely commonly and are what RAID deals with well.
>>>>
>>>> What does not happen commonly is that during the RAID rebuild (kicked
>>>> off only after a drive is kicked out), you push the power button or have
>>>> a second failure (power outage).
>>>>
>>>> We will have more users lose data if they decide to use ext2 instead of
>>>> ext3 and use only single disk storage.
>>>
>>> So your argument basically is
>>>
>>> 'our abs brakes are broken, but let's not tell anyone; our car is still
>>> safer than a horse'.
>>>
>>> and
>>>
>>> 'while we know our abs brakes are broken, they are not a major factor in
>>> accidents, so let's not tell anyone'.
>>>
>>> Sorry, but I'd expect slightly higher moral standards. If we can
>>> document it in a way that's non-scary, and does not push people to
>>> single disks (horses), please go ahead; but you have to mention that
>>> md raid breaks journalling assumptions (our abs brakes really are
>>> broken).
>>
>> You continue to ignore the technical facts that everyone (both MD and
>> ext3 people) put in front of you.
>>
>> If you have a specific bug in MD code, please propose a patch.
>
> Interesting. So, what's technically wrong with the patch below?
>
> 									Pavel


My suggestion was that you stop trying to document your assertion of an issue 
and actually suggest fixes in code or implementation. I really don't think that 
you have properly diagnosed your specific failure or done sufficient analysis. However, 
if you put a full analysis and suggested code out to the MD devel lists, we can 
debate technical implementation as we normally do.

As Ted quite clearly stated, documentation on how RAID works, how to configure 
it, etc, is best put in RAID documentation.  What you claim as a key issue is an 
issue for all file systems (including ext2).

The only note that I would put in ext3/4 etc documentation would be:

"Reliable storage is important for any file system. Single disks (or FLASH or 
SSD) do fail on a regular basis.

To reduce your risk of data loss, it is advisable to use RAID which can overcome 
these common issues. If using MD software RAID, see the RAID documentation on 
how best to configure your storage.

With or without RAID, it is always important to back up your data to an external 
device and keep copies of that backup off site."

ric



> ---
>
> From: Theodore Tso<tytso@mit.edu>
>
> Document that many devices are too broken for filesystems to protect
> data in case of powerfail.
>
> Signed-of-by: Pavel Machek<pavel@ucw.cz>
>
> diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
> new file mode 100644
> index 0000000..2f3eec1
> --- /dev/null
> +++ b/Documentation/filesystems/dangers.txt
> @@ -0,0 +1,21 @@
> +There are storage devices that high highly undesirable properties when
> +they are disconnected or suffer power failures while writes are in
> +progress; such devices include flash devices and DM/MD RAID 4/5/6 (*)
> +arrays.  These devices have the property of potentially corrupting
> +blocks being written at the time of the power failure, and worse yet,
> +amplifying the region where blocks are corrupted such that additional
> +sectors are also damaged during the power failure.
> +
> +Users who use such storage devices are well advised take
> +countermeasures, such as the use of Uninterruptible Power Supplies,
> +and making sure the flash device is not hot-unplugged while the device
> +is being used.  Regular backups when using these devices is also a
> +Very Good Idea.
> +
> +Otherwise, file systems placed on these devices can suffer silent data
> +and file system corruption.  An forced use of fsck may detect metadata
> +corruption resulting in file system corruption, but will not suffice
> +to detect data corruption.
> +
> +(*) Degraded array or single disk failure "near" the powerfail is
> +neccessary for this property of RAID arrays to bite.
>
>


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-28  1:32                                                       ` Ric Wheeler
  2009-08-28  6:44                                                         ` Pavel Machek
  2009-08-28  7:11                                                         ` raid is dangerous but that's secret Florian Weimer
@ 2009-08-28 12:08                                                         ` Theodore Tso
  2009-08-30  7:51                                                           ` Pavel Machek
  2009-08-30  7:51                                                           ` Pavel Machek
  2 siblings, 2 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-28 12:08 UTC (permalink / raw)
  To: Pavel Machek, NeilBrown
  Cc: Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Fri, Aug 28, 2009 at 08:44:49AM +0200, Pavel Machek wrote:
> From: Theodore Tso <tytso@mit.edu>
> 
> Document that many devices are too broken for filesystems to protect
> data in case of powerfail.
> 
> Signed-of-by: Pavel Machek <pavel@ucw.cz> 

NACK.  I didn't write this patch, and it's disingenuous for you to try
to claim that I authored it.

You took text I wrote from the *middle* of an e-mail discussion and
you ignored multiple corrections to typos that I made --- typos that
I would have corrected if I had ultimately decided to post this as a
patch, which I did NOT.

While Neil Brown's corrections are minimally necessary so the text is
at least technically *correct*, it's still not the right advice to
give system administrators.  It's better than the fear-mongering
patches you had proposed earlier, but what would be better *still* is
telling people why running with degraded RAID arrays is bad, and to
give them further tips about how to use RAID arrays safely.

To use your ABS brakes analogy, just because it's not safe to rely on
ABS brakes if the "check brakes" light is on, that doesn't justify
writing something alarmist which claims that ABS brakes don't work
100% of the time, don't use ABS brakes, they're broken!!!!

The first part of it is true, since ABS brakes can suffer mechanical
failure.  But what we should be telling drivers is, "if the 'check
brakes' light comes on, don't keep driving with it, go to a garage and
get it fixed!!!".  Similarly, if you get a notice that your RAID is
running in degraded mode, you've already suffered one failure; you
won't survive another failure, so fix that issue ASAP!

If you're really paranoid, you could decide to "pull over to the side
of the road"; that is, you could stop writing to the RAID array as
soon as possible, and then get the RAID array rebuilt before
proceeding.  That can reduce the chances of a second failure.  But in
the real world, there are costs associated with taking a production
server off-line, and the prudent system administrator has to do a
risk-reward tradeoff.  A better approach might be to have the array
configured with a hot spare, and to regularly scrub the array, and
configure the RAID array with either a battery backup or a UPS.  And
hot-swap drives might not be a bad idea, too.

But in any case, just because ABS brakes and RAID arrays can suffer
failures, that doesn't mean you should run around telling people not
to use RAID arrays or RAID arrays are broken.  People are better off
using RAID than not using single disk storage solutions, just as
people are better off using ABS brakes than not.

Your argument basically boils down to, "if you drive like a maniac
when the roads are wet and slippery, ABS brakes might not save your
life.  Since ABS brakes might cause you to have a false sense of
security, it's better to tell users that ABS brakes are broken."

That's just silly.  What we should be telling people instead is (a)
pay attention to the check brakes light (just as you should pay
attention to the RAID array is degraded warning), and (b) while ABS
brakes will get you out of some situations with life and limb intact,
they do not repeal the laws of physics (do regular full and
incremental backups; practice disk scrubbing; use UPS's or battery
backups).

							- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27  7:34                                             ` Rob Landley
@ 2009-08-28 14:37                                               ` david
  0 siblings, 0 replies; 309+ messages in thread
From: david @ 2009-08-28 14:37 UTC (permalink / raw)
  To: Rob Landley
  Cc: Theodore Tso, Pavel Machek, Rik van Riel, Ric Wheeler,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Thu, 27 Aug 2009, Rob Landley wrote:

> On Thursday 27 August 2009 01:54:30 david@lang.hm wrote:
>> On Thu, 27 Aug 2009, Rob Landley wrote:
>>>
>>> Today we have cheap plentiful USB keys that act like hard drives, except
>>> that their write block size isn't remotely the same as hard drives', but
>>> they pretend it is, and then the block wear levelling algorithms fuzz
>>> things further.  (Gee, a drive controller lying about drive geometry, the
>>> scsi crowd should feel right at home.)
>>
>> actually, you don't know if your USB key works that way or not.
>
> Um, yes, I think I do.
>
>> Pavel has ssome that do, that doesn't mean that all flash drives do
>
> Pretty much all the ones that present a USB disk interface to the outside
> world and then thus have to do hardware levelling.  Here's Valerie Aurora on
> the topic:
>
> http://valhenson.livejournal.com/25228.html
>
>> Let's start with hardware wear-leveling. Basically, nearly all practical
>> implementations of it suck. You'd imagine that it would spread out writes
>> over all the blocks in the drive, only rewriting any particular block after
>> every other block has been written. But I've heard from experts several
>> times that hardware wear-leveling can be as dumb as a ring buffer of 12
>> blocks; each time you write a block, it pulls something out of the queue
>> and sticks the old block in. If you only write one block over and over,
>> this means that writes will be spread out over a staggering 12 blocks! My
>> direct experience working with corrupted flash with built-in wear-leveling
>> is that corruption was centered around frequently written blocks (with
>> interesting patterns resulting from the interleaving of blocks from
>> different erase blocks). As a file systems person, I know what it takes to
>> do high-quality wear-leveling: it's called a log-structured file system and
>> they are non-trivial pieces of software. Your average consumer SSD is not
>> going to have sufficient hardware to implement even a half-assed
>> log-structured file system, so clearly it's going to be a lot stupider than
>> that.
>
> Back to you:

I am not saying that all devices get this right (not by any means), but I 
_am_ saying that devices with wear-leveling _can_ avoid this problem 
entirely.

you do not need to do a log-structured filesystem. all you need to do is 
to always write to a new block rather than re-writing a block in place.

even if the disk only does a 12-block rotation for its wear leveling, 
that is enough for it to not lose other data when you write. to lose 
data you have to be updating a block in place by erasing the old one 
first. _anything_ that writes the data to a new location before it erases 
the old location will prevent you from losing other data.

I'm all for documenting that this problem can and does exist, but I'm not 
in agreement with documentation that states that _all_ flash drives have 
this problem because (with wear-leveling in a flash translation layer on 
the device) it's not inherent to the technology. so even if all existing 
flash devices had this problem, there could be one released tomorrow that 
didn't.

this is like the problem that flash SSDs had last year that could cause 
them to stall for up to a second on write-heavy workloads. it went from a 
problem that almost every drive for sale had (and something that was 
generally accepted as being a characteristic of SSDs), to being extinct in 
about one product cycle after the problem was identified.

I think this problem will also disappear rapidly once it's publicised.

so what's needed is for someone to come up with a way to test this, let 
people test the various devices, find out how broad the problem is, and 
publicise the results.

personally, I expect that the better disk-replacements will not have a 
problem with this.

I would also be surprised if the larger thumb drives had this problem.

if a flash eraseblock can be used 100k times, then if you use FAT on a 16G 
drive and write 1M files and update the FAT after each file (like you 
would with a camera), the block the FAT is on will die after filling the 
device _6_ times. if it does a 12-block rotation it would die after 72 
times, but if it can move the blocks around the entire device it would 
take 50k times of filling the device.

for a 2G device the numbers would be 50 times with no wear-leveling and 
600 times with 12-block rotation.
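
here is that arithmetic as a quick script (the 100k cycle budget, 1M files 
and one FAT update per file are just the assumptions from above; any real 
device and its eraseblock size will of course differ):

#!/bin/bash
# back-of-envelope eraseblock endurance under the assumptions above
CYCLES=100000      # program/erase cycles one eraseblock survives
FILE_MB=1          # file size in MB, i.e. one FAT update per 1M written
for SIZE_GB in 16 2; do
    FAT_WRITES_PER_FILL=$(( SIZE_GB * 1024 / FILE_MB ))
    NONE=$(( CYCLES / FAT_WRITES_PER_FILL ))       # no wear-leveling at all
    RING=$(( 12 * CYCLES / FAT_WRITES_PER_FILL ))  # 12-block ring buffer
    echo "${SIZE_GB}G: ~$NONE fills (no wear-leveling), ~$RING fills (12-block ring)"
done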

so I could see them getting away with this sort of thing for the smaller 
devices, but as the thumb drives get larger, I expect that they will start 
to gain the wear-leveling capabilities that the SSDs have.

>> when you do a write to a flash drive you have to do the following items
>>
>> 1. allocate an empty eraseblock to put the data on
>>
>> 2. read the old eraseblock
>>
>> 3. merge the incoming write to the eraseblock
>>
>> 4. write the updated data to the flash
>>
>> 5. update the flash trnslation layer to point reads at the new location
>> instead of the old location.
>>
>> now if the flash drive does things in this order you will not loose any
>> previously written data.
>
> That's what something like jffs2 will do, sure.  (And note that mounting those
> suckers is slow while it reads the whole disk to figure out what order to put
> the chunks in.)
>
> However, your average consumer level device A) isn't very smart, B) is judged
> almost entirely by price/capacity ratio and thus usually won't even hide
> capacity for bad block remapping.  You expect them to have significant hidden
> capacity to do safer updates with when customers aren't demanding it yet?

this doesn't require filesystem smarts, but it does require a device with 
enough smarts to do bad-block remapping. (if it does wear leveling, all that 
bad-block remapping amounts to is not writing to a bad eraseblock, which 
doesn't even require maintaining a map of such blocks: all it would have 
to do is check whether what is on the flash is what it intended to write; 
if it is, use it, if it isn't, try again.)

>> if the flash drive does step 5 before it does step 4, then you have a
>> window where a crash can lose data (and no btrfs won't survive any better
>> to have a large chunk of data just disappear)
>>
>> it's possible that some super-cheap flash drives
>
> I've never seen one that presented a USB disk interface that _didn't_ do this.
> (Not that this observation means much.)  Neither the windows nor the Macintosh
> world is calling for this yet.  Even the Linux guys barely know about it.  And
> these are the same kinds of manufacturers that NOPed out the flush commands to
> make their benchmarks look better...

the nature of the FAT filesystem calls for it. I've heard people talk 
about devices that try to be smart enough to take extra care of the blocks 
that the FAT is on

>> but if the device doesn't have a flash translation layer, then repeated
>> writes to any one sector will kill the drive fairly quickly. (updates to
>> the FAT would kill the sectors the FAT, journal, root directory, or
>> superblock lives in due to the fact that every change to the disk requires
>> an update to this file for example)
>
> Yup.  It's got enough of one to get past the warranty, but beyond that they're
> intended for archiving and sneakernet, not for running compiles on.

it doesn't take them being used for compiles; using them in a camera, 
media player, or phone with a FAT filesystem will exercise the FAT blocks 
enough to cause problems

>>> That said, ext3's assumption that filesystem block size always >= disk
>>> update block size _is_ a fundamental part of this problem, and one that
>>> isn't shared by things like jffs2, and which things like btrfs might be
>>> able to address if they try, by adding awareness of the real media update
>>> granularity to their node layout algorithms.  (Heck, ext2 has a stripe
>>> size parameter already. Does setting that appropriately for your raid
>>> make this suck less?  I haven't heard anybody comment on that one yet...)
>>
>> I thought that that assumption was in the VFS layer, not in any particular
>> filesystem
>
> The VFS layer cares about how to talk to the backing store?  I thought that
> was the filesystem driver's job...

I could be mistaken, but I have run into cases with filesystems where the 
filesystem was designed to be able to use large blocks, but they could 
only be used on specific architectures because the disk block size had to 
be smaller than the page size.

> I wonder how jffs2 gets around it, then?  (Or for that matter, squashfs...)

if you know where the eraseblock boundaries are, all you need to do is 
submit your writes in groups of blocks corresponding to those boundaries. 
there is no need to make the blocks themselves the size of the 
eraseblocks.

any filesystem that is doing compressed storage is going to end up dealing 
with logical changes that span many different disk blocks.

I thought that squashfs was read-only (you create a filesystem image, burn 
it to flash, then use it)

as I say I could be completely misunderstanding this interaction.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27  8:46                                 ` David Woodhouse
@ 2009-08-28 14:46                                   ` david
  2009-08-29 10:09                                     ` Pavel Machek
  0 siblings, 1 reply; 309+ messages in thread
From: david @ 2009-08-28 14:46 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Theodore Tso, Pavel Machek, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Thu, 27 Aug 2009, David Woodhouse wrote:

> On Mon, 2009-08-24 at 20:08 -0400, Theodore Tso wrote:
>>
>> (It's worse with people using Digital SLR's shooting in raw mode,
>> since it can take upwards of 30 seconds or more to write out a 12-30MB
>> raw image, and if you eject at the wrong time, you can trash the
>> contents of the entire CF card; in the worst case, the Flash
>> Translation Layer data can get corrupted, and the card is completely
>> ruined; you can't even reformat it at the filesystem level, but have
>> to get a special Windows program from the CF manufacturer to --maybe--
>> reset the FTL layer.
>
> This just goes to show why having this "translation layer" done in
> firmware on the device itself is a _bad_ idea. We're much better off
> when we have full access to the underlying flash and the OS can actually
> see what's going on. That way, we can actually debug, fix and recover
> from such problems.
>
>>   Early CF cards were especially vulnerable to
>> this; more recent CF cards are better, but it's a known failure mode
>> of CF cards.)
>
> It's a known failure mode of _everything_ that uses flash to pretend to
> be a block device. As I see it, there are no SSD devices which don't
> lose data; there are only SSD devices which haven't lost your data
> _yet_.
>
> There's no fundamental reason why it should be this way; it just is.
>
> (I'm kind of hoping that the shiny new expensive ones that everyone's
> talking about right now, that I shouldn't really be slagging off, are
> actually OK. But they're still new, and I'm certainly not trusting them
> with my own data _quite_ yet.)

so what sort of test would be needed to identify if a device has this 
problem?

people can do ad-hoc tests by pulling the devices in use and then checking 
the entire device, but something better should be available.

it seems to me that there are two things needed to define the tests.

1. a predictable write load so that it's easy to detect data getting lost

2. some statistical analysis to decide how many device pulls are needed 
(under the write load defined in #1) to make the odds high that the 
problem will be revealed.

with this we could have people test various devices and report if the test 
detects unrelated data being lost (businesses could test too, and I think the 
tech hardware sites would jump on this given some sort of accepted test)

for USB devices there may be a way to use the power management functions 
to cut power to the device without requiring it to physically be pulled, 
if this is the case (even if this only works on some specific chipsets), 
it would drastically speed up the testing
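
one software-only approximation (not an actual power cut -- VBUS stays on, so 
a device that buffers writes internally may still finish them -- but it does 
yank the device out from under the filesystem mid-write) is to unbind and 
rebind it via sysfs; the "1-4" name below is only an example, look yours up 
under /sys/bus/usb/devices/:

#!/bin/bash
DEV=1-4                                         # hypothetical port name, adjust
echo "$DEV" > /sys/bus/usb/drivers/usb/unbind   # "unplug" while writes are in flight
sleep 5
echo "$DEV" > /sys/bus/usb/drivers/usb/bind     # reattach to check for damage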

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27 20:51                                                     ` Rob Landley
  2009-08-27 22:00                                                       ` Ric Wheeler
@ 2009-08-28 14:49                                                       ` david
  2009-08-29 10:05                                                         ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: david @ 2009-08-28 14:49 UTC (permalink / raw)
  To: Rob Landley
  Cc: Ric Wheeler, Pavel Machek, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Thu, 27 Aug 2009, Rob Landley wrote:

> Pavel's response was to attempt to document this.  Not that journaling is
> _bad_, but that it doesn't protect against this class of problem.

I don't think anyone is disagreeing with the statement that journaling 
doesn't protect against this class of problems, but Pavel's statements 
didn't say that. he stated that ext3 is more dangerous than ext2.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: ext2/3: document conditions when reliable operation is possible
  2009-03-12 19:13 ` Rob Landley
  2009-03-16 12:28   ` Pavel Machek
  2009-03-16 12:30   ` Pavel Machek
@ 2009-08-29  1:33   ` Robert Hancock
  2009-08-29 13:04     ` Alan Cox
  2 siblings, 1 reply; 309+ messages in thread
From: Robert Hancock @ 2009-08-29  1:33 UTC (permalink / raw)
  To: Rob Landley
  Cc: Pavel Machek, kernel list, Andrew Morton, mtk.manpages, tytso,
	rdunlap, linux-doc, linux-ext4, Alan Cox

On 03/12/2009 01:13 PM, Rob Landley wrote:
>> +* write caching is disabled. ext2 does not know how to issue barriers
>> +  as of 2.6.28. hdparm -W0 disables it on SATA disks.
>
> And here we're talking about ext2.  Does neither one know about write
> barriers, or does this just apply to ext2?  (What about ext4?)
>
> Also I remember a historical problem that not all disks honor write barriers,
> because actual data integrity makes for horrible benchmark numbers.  Dunno how
> current that is with SATA, Alan Cox would probably know.

I've heard rumors of disks that claim to support cache flushes but 
really just ignore them, but have never heard any specifics of model 
numbers, etc. which are known to do this, so it may just be legend. If 
we do have such knowledge then we should really be blacklisting those 
drives and warning the user that we can't ensure data integrity. (Even 
powering down the system would be unsafe in this case.)
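
As a practical aside (this is only a generic sketch, not a statement about 
any particular drive): the volatile write cache can at least be inspected 
and, if you do not trust the flush path, turned off with hdparm, where 
/dev/sdX is a placeholder for the disk in question:

  hdparm -W /dev/sdX                          # report the current write-caching setting
  hdparm -I /dev/sdX | grep -i 'write cache'  # is the feature present/enabled?
  hdparm -W0 /dev/sdX                         # disable the volatile write cache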

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 11:58                                                               ` Ric Wheeler
  2009-08-26 12:40                                                                 ` Theodore Tso
@ 2009-08-29  9:38                                                                 ` Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-29  9:38 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

>> Example I seen went like this:
>>
>> Drive in raid 5 failed; hot spare was available (no idea about
>> UPS). System apparently locked up trying to talk to the failed drive,
>> or maybe admin just was not patient enough, so he just powercycled the
>> array. He lost the array.
>>
>> So while most people will not agressively powercycle the RAID array,
>> drive failure still provokes little tested error paths, and getting
>> unclean shutdown is quite easy in such case.
>
> Then what we need to document is do not power cycle an array during a  
> rebuild, right?

Yep, that and the fact that you should fsck if you do.

> If it wasn't the admin that timed out and the box really was hung (no  
> drive activity lights, etc), you will need to power cycle/reboot but  
> then you should not have this active rebuild issuing writes either...

Ok, I guess you are right here.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 14:45                                                           ` Rik van Riel
@ 2009-08-29  9:39                                                             ` Pavel Machek
  2009-08-29 11:47                                                               ` Ron Johnson
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-29  9:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Wed 2009-08-26 10:45:44, Rik van Riel wrote:
> Pavel Machek wrote:
>
>> Sledgehammer is hardware problem, and I'm demonstrating
>> software/documentation problem we have here.
>
> So your argument is that a sledgehammer is a hardware
> problem, while a broken hard disk and a power failure
> are software/documentation issues?
>
> I'd argue that the broken hard disk and power failure
> are hardware issues, too.

No one told me that degraded md raid5 is dangerous. That's documentation
issue #1. Maybe I just pulled the disk for fun.

ext3 docs told me that journal protects me against fs corruption
during power fails. It does not in this particular case. Seems like
docs issue #2. Maybe I just hit the reset button because it was there.

Randomly hitting the power button may be stupid, but it should not result in
filesystem corruption on a reasonably working filesystem/storage stack.

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-26 11:28                                                                   ` david
@ 2009-08-29  9:49                                                                     ` Pavel Machek
  2009-08-29 11:28                                                                       ` Ric Wheeler
  2009-08-29 16:35                                                                         ` david
  0 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-29  9:49 UTC (permalink / raw)
  To: david
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

[-- Attachment #1: Type: text/plain, Size: 1488 bytes --]


>> So instead of fixing or at least documenting known software deficiency
>> in Linux MD stack, you'll try to suppress that information so that
>> people use more of raid5 setups?
>>
>> Perhaps the better documentation will push them to RAID1, or maybe
>> make them buy an UPS?
>
> people aren't objecting to better documentation, they are objecting to  
> misleading documentation.

Actually Ric is. He's trying hard to make RAID5 look better than it
really is.

> for flash drives the danger is very straightforward (although even then  
> you have to note that it depends heavily on the firmware of the device,  
> some will lose lots of data, some won't lose any)

I have not seen one that works :-(.

> you are generalizing that since you have lost data on flash drives, all  
> flash drives are dangerous.

Do the flash manufacturers claim they do not cause collateral damage
during powerfail? If not, they probably are dangerous.

Anyway, you wanted a test, and one is attached. It normally takes like
4 unplugs to uncover problems.

> but the super simplified statement you keep trying to make is  
> significantly overstating and oversimplifying the problem.

Offer better docs? You are right that it does not lose the whole stripe,
it merely loses a random block on the same stripe, but the result for a
journaling filesystem is similar.
									Pavel


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: fstest --]
[-- Type: text/plain, Size: 923 bytes --]

#!/bin/bash
#
# Copyright 2008 Pavel Machek <pavel@suse.cz>, GPLv2
#
# vfat is broken with filesize=0
#
#

if [ .$MOUNTOPTS = . ]; then
# ext3 is needed, or you need to disable caches using hdparm.
# -o dirsync is needed, else modify fstest.work to fsync the directory.
    MOUNTOPTS="-o dirsync"
fi
if [ .$BDEV = . ]; then
#    BDEV=/dev/sdb3
    BDEV=/dev/nd0
fi

export FILESIZE=4000
export NUMFILES=4000

waitforcard() {
    umount /mnt
    echo Waiting for card:
    while ! mount $BDEV $MOUNTOPTS /mnt 2> /dev/null; do
	echo -n .
	sleep 1
    done
#   hdparm -W0 $BDEV
    echo
}

mkdir delme.fstest
cd delme.fstest

waitforcard
rm tmp.* final.* /mnt/tmp.* /mnt/final.*

while true; do
    ../fstest.work
    echo
    waitforcard
    echo Testing: fsck....
    umount /mnt
    fsck -fy $BDEV
    echo Testing....
    waitforcard
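    # compare the locally kept copy of each file against what actually ended up on the device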
    for A in final.*; do
	echo -n $A " "
	cmp $A /mnt/$A || exit
    done
    echo
done

[-- Attachment #3: fstest.work --]
[-- Type: text/plain, Size: 409 bytes --]

#!/bin/bash
#
# Copyright 2008 Pavel Machek <pavel@suse.cz>, GPLv2
#

echo "Writing test files: "
for A in `seq $NUMFILES`; do
    echo -n $A " "
    rm final.$A
    cat /dev/urandom | head -c $FILESIZE > tmp.$A
    dd conv=fsync if=tmp.$A of=/mnt/final.$A 2> /dev/zero || exit
#    cat /mnt/final.$A > /dev/null || exit
# sync should not be needed, as dd asks for fsync
#    sync
    mv tmp.$A final.$A
done

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27 12:24                                                                 ` Theodore Tso
  2009-08-27 13:10                                                                   ` Ric Wheeler
  2009-08-27 13:10                                                                   ` Ric Wheeler
@ 2009-08-29 10:02                                                                   ` Pavel Machek
  2009-08-29 10:02                                                                   ` Pavel Machek
  3 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-29 10:02 UTC (permalink / raw)
  To: Theodore Tso, Rob Landley, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Thu 2009-08-27 08:24:23, Theodore Tso wrote:
> On Thu, Aug 27, 2009 at 12:19:02AM -0500, Rob Landley wrote:
> > > To me, this isn't a particularly interesting or newsworthy point,
> > > since a competent system administrator
> > 
> > I'm a bit concerned by the argument that we don't need to document
> > serious pitfalls because every Linux system has a sufficiently
> > competent administrator they already know stuff that didn't even
> > come up until the second or third day it was discussed on lkml.
> 
> I'm not convinced that information which needs to be known by System
> Administrators is best documented in the kernel Documentation
> directory.  Should there be a HOWTO document on stuff like that?

It is not only for system administrators; I was trying to find out if
the kernel is buggy, and that should be in the kernel tree.


> > If "degraded array" just means "don't have a replacement disk yet",
> > then it sounds like what Pavel wants to document is "don't write to
> > a degraded array at all, because power failures can cost you data
> > due to write granularity being larger than filesystem block size".
> > (Which still comes as news to some of us, and you need a way to
> > remount mount the degraded array read only until the sysadmin can
> > fix it.)
> 
> If you want to document that as a property of RAID arrays, sure.  But
> it's not something that should live in Documentation/filesystems/ext2.txt
> and Documentation/filesystems/ext3.txt.  The MD RAID howto might be a

ext3 documentation states that journal protects fs integrity on
powerfail. If you don't want to talk about storage stacks, perhaps
that should be removed?

Now... You mocked me for 'ext3 expects disks to behave like disks
(alarmist)'. I actually believe that should be written somewhere. ext3
depends on fairly subtle storage disk characteristics, and many common
configs just do not meet the expectations (missing barriers is most
common, followed by collateral damage).

Maybe not documenting that was okay 10 years ago, but with all the USB
sticks and raid arrays around, it's just sloppy. Because those
characteristics are not documented, storage stack authors do not know
what they have to guarantee, and the result is bad. See for example
nbd -- it does not propagate barriers and is therefore unsafe.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-28 14:49                                                       ` david
@ 2009-08-29 10:05                                                         ` Pavel Machek
  2009-08-29 20:22                                                           ` Rob Landley
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-29 10:05 UTC (permalink / raw)
  To: david
  Cc: Rob Landley, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Fri 2009-08-28 07:49:38, david@lang.hm wrote:
> On Thu, 27 Aug 2009, Rob Landley wrote:
>
>> Pavel's response was to attempt to document this.  Not that journaling is
>> _bad_, but that it doesn't protect against this class of problem.
>
> I don't think anyone is disagreeing with the statement that journaling  
> doesn't protect against this class of problems, but Pavel's statements  
> didn't say that. he stated that ext3 is more dangerous than ext2.

Well, if you use 'common' fsck policy, ext3 _is_ more dangerous.

But I'm not pushing that to documentation, I'm trying to push info
everyone agrees with. (check the patches).
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-28 14:46                                   ` david
@ 2009-08-29 10:09                                     ` Pavel Machek
  2009-08-29 16:27                                       ` david
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-29 10:09 UTC (permalink / raw)
  To: david
  Cc: David Woodhouse, Theodore Tso, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Fri 2009-08-28 07:46:42, david@lang.hm wrote:
> On Thu, 27 Aug 2009, David Woodhouse wrote:
>
>> On Mon, 2009-08-24 at 20:08 -0400, Theodore Tso wrote:
>>>
>>> (It's worse with people using Digital SLR's shooting in raw mode,
>>> since it can take upwards of 30 seconds or more to write out a 12-30MB
>>> raw image, and if you eject at the wrong time, you can trash the
>>> contents of the entire CF card; in the worst case, the Flash
>>> Translation Layer data can get corrupted, and the card is completely
>>> ruined; you can't even reformat it at the filesystem level, but have
>>> to get a special Windows program from the CF manufacturer to --maybe--
>>> reset the FTL layer.
>>
>> This just goes to show why having this "translation layer" done in
>> firmware on the device itself is a _bad_ idea. We're much better off
>> when we have full access to the underlying flash and the OS can actually
>> see what's going on. That way, we can actually debug, fix and recover
>> from such problems.
>>
>>>   Early CF cards were especially vulnerable to
>>> this; more recent CF cards are better, but it's a known failure mode
>>> of CF cards.)
>>
>> It's a known failure mode of _everything_ that uses flash to pretend to
>> be a block device. As I see it, there are no SSD devices which don't
>> lose data; there are only SSD devices which haven't lost your data
>> _yet_.
>>
>> There's no fundamental reason why it should be this way; it just is.
>>
>> (I'm kind of hoping that the shiny new expensive ones that everyone's
>> talking about right now, that I shouldn't really be slagging off, are
>> actually OK. But they're still new, and I'm certainly not trusting them
>> with my own data _quite_ yet.)
>
> so what sort of test would be needed to identify if a device has this  
> problem?
>
> people can do ad-hoc tests by pulling the devices in use and then 
> checking the entire device, but something better should be available.
>
> it seems to me that there are two things needed to define the tests.
>
> 1. a predictable write load so that it's easy to detect data getting lost
>
> 2. some statistical analysis to decide how many device pulls are needed  
> (under the write load defined in #1) to make the odds high that the  
> problem will be revealed.

It's simpler than that. It usually breaks after the third unplug or so.

> for USB devices there may be a way to use the power management functions  
> to cut power to the device without requiring it to physically be pulled,  
> if this is the case (even if this only works on some specific chipsets),  
> it would drasticly speed up the testing

This is really so easy to reproduce that such a speedup is not
necessary. Just try the scripts :-).
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-29  9:49                                                                     ` [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek
@ 2009-08-29 11:28                                                                       ` Ric Wheeler
  2009-09-02 20:12                                                                         ` Pavel Machek
  2009-08-29 16:35                                                                         ` david
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-29 11:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On 08/29/2009 05:49 AM, Pavel Machek wrote:
>    
>>> So instead of fixing or at least documenting known software deficiency
>>> in Linux MD stack, you'll try to suppress that information so that
>>> people use more of raid5 setups?
>>>
>>> Perhaps the better documentation will push them to RAID1, or maybe
>>> make them buy an UPS?
>>>        
>> people aren't objecting to better documentation, they are objecting to
>> misleading documentation.
>>      
> Actually Ric is. He's trying hard to make RAID5 look better than it
> really is.
>
>    
>

I object to the misleading and dangerous documentation that you have 
proposed. I spend a lot of time working on data integrity, and talking and 
writing about it, so I care deeply that we don't misinform people.

Several times in this thread I have put out a draft that is accurate, and you 
have failed to respond to it.

The big picture that you don't agree with is:

(1) RAID (specifically MD RAID) will dramatically improve data integrity 
for real users. This is not a statement of opinion, this is a statement 
of fact that has been shown to be true in large scale deployments with 
commodity hardware.

(2) RAID5 protects you against a single failure and your test case 
purposely injects a double failure.

(3) How to configure MD reliably should be documented in MD 
documentation, not in each possible FS or raw device application

(4) Data loss occurs in non-journalling file systems and journalling 
file systems when you suffer double failures or hot unplug storage, 
especially inexpensive FLASH parts.

ric




^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-29  9:39                                                             ` Pavel Machek
@ 2009-08-29 11:47                                                               ` Ron Johnson
  2009-08-29 16:12                                                                 ` jim owens
  0 siblings, 1 reply; 309+ messages in thread
From: Ron Johnson @ 2009-08-29 11:47 UTC (permalink / raw)
  To: linux-ext4; +Cc: Rik van Riel, Ric Wheeler, Theodore Tso, corbet

On 2009-08-29 04:39, Pavel Machek wrote:
> On Wed 2009-08-26 10:45:44, Rik van Riel wrote:
>> Pavel Machek wrote:
>>
>>> Sledgehammer is hardware problem, and I'm demonstrating
>>> software/documentation problem we have here.
>> So your argument is that a sledgehammer is a hardware
>> problem, while a broken hard disk and a power failure
>> are software/documentation issues?
>>
>> I'd argue that the broken hard disk and power failure
>> are hardware issues, too.
> 
> No one told me that degraded md raid5 is dangerous. That's documentation
> issue #1. Maybe I just pulled the disk for fun.

You're kidding, right?

Or are you being too effectively sarcastic?

-- 
Obsession with "preserving cultural heritage" is a racist impediment
to moral, physical and intellectual progress.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: ext2/3: document conditions when reliable operation is possible
  2009-08-29  1:33   ` Robert Hancock
@ 2009-08-29 13:04     ` Alan Cox
  0 siblings, 0 replies; 309+ messages in thread
From: Alan Cox @ 2009-08-29 13:04 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Rob Landley, Pavel Machek, kernel list, Andrew Morton,
	mtk.manpages, tytso, rdunlap, linux-doc, linux-ext4

> I've heard rumors of disks that claim to support cache flushes but 
> really just ignore them, but have never heard any specifics of model 
> numbers, etc. which are known to do this, so it may just be legend. If 
> we do have such knowledge then we should really be blacklisting those 
> drives and warning the user that we can't ensure data integrity. (Even 
> powering down the system would be unsafe in this case.)

This should not be the case for any vaguely modern drive. The standard
requires that the drive flush the cache when sent the command, and the size of
the caches on modern drives rather requires it.

Alan

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-29 11:47                                                               ` Ron Johnson
@ 2009-08-29 16:12                                                                 ` jim owens
  0 siblings, 0 replies; 309+ messages in thread
From: jim owens @ 2009-08-29 16:12 UTC (permalink / raw)
  To: Ron Johnson; +Cc: linux-ext4, Rik van Riel, Ric Wheeler, Theodore Tso, corbet

Ron Johnson wrote:
> On 2009-08-29 04:39, Pavel Machek wrote:
>> No one told me that degraded md raid5 is dangerous. That's documentation
>> issue #1. Maybe I just pulled the disk for fun.
> 
> You're kidding, right?

No, he is not... and that is exactly why Ted and Ric have been
fighting so hard against his scare-the-children documentation.

In 20 years, I have not found a way to educate those who think
"I know computers so it must work the way I want and expect."

Tremendous amounts of information and recommendations are out
there on the web, in books, classes, etc.  But people don't
research before using or understand before they have a problem.

Pavel Machek wrote:
> It is not only for system administrators; I was trying to find
> out if the kernel is buggy, and that should be in the kernel tree.

Pavel, *THE KERNEL IS NOT BUGGY* end of story!

Everyone experienced in storage understands the "in the
edge case that Pavel hit, you will lose your data", and we
take our responsibility to tell people what works and does
not work very seriously.  And we try very hard to reduce the
amount of edge case data losses.

But as Ric and Ted and many others keep trying to explain:

- There is no such thing as "never fails" data storage.

- The goal of journaling file systems is not what you think.

- The goal of raid is not what you think.

- We do not want the vast majority of computer users who
   are not kernel engineers to stop using the technology
   that in 99.99 percent of the use cases keeps their data
   as safe as we can reasonably make it, just because they
   read Pavel's 0.01 percent scary and inaccurate case.

And the worst part is this 0.01 percent case problem
is really "I did not know what I was doing".

jim


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-29 10:09                                     ` Pavel Machek
@ 2009-08-29 16:27                                       ` david
  2009-08-29 21:33                                         ` Pavel Machek
  0 siblings, 1 reply; 309+ messages in thread
From: david @ 2009-08-29 16:27 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Woodhouse, Theodore Tso, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Sat, 29 Aug 2009, Pavel Machek wrote:

> On Fri 2009-08-28 07:46:42, david@lang.hm wrote:
>>
>>
>> so what sort of test would be needed to identify if a device has this
>> problem?
>>
>> people can do ad-hoc tests by pulling the devices in use and then
>> checking the entire device, but something better should be available.
>>
>> it seems to me that there are two things needed to define the tests.
>>
>> 1. a predictable write load so that it's easy to detect data getting lose
>>
>> 2. some statistical analysis to decide how many device pulls are needed
>> (under the write load defined in #1) to make the odds high that the
>> problem will be revealed.
>
> It's simpler than that. It usually breaks after the third unplug or so.
>
>> for USB devices there may be a way to use the power management functions
>> to cut power to the device without requiring it to physically be pulled,
>> if this is the case (even if this only works on some specific chipsets),
>> it would drasticly speed up the testing
>
> This is really so easy to reproduce that such a speedup is not
> necessary. Just try the scripts :-).

so if it doesn't get corrupted after 5 unplugs does that mean that that 
particular device doesn't have a problem? or does it just mean you got 
lucky?

would 10 sucessful unplugs mean that it's safe?

what about 20?

we need to get this beyond anecdotal evidence mode, to something that 
(even if not perfect, as you can get 100 'heads' in a row with an honest 
coin) gives you pretty good assurances that a particular device is either 
good or bad.
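
a rough way to put numbers on that, treating each unplug as an independent 
trial with some unknown per-unplug corruption probability p (the 95% 
confidence level below is just an example choice):

#!/bin/bash
# n failure-free unplugs happen with probability (1-p)^n, so to claim
# p is below a given bound with 95% confidence you need (1-p)^n < 0.05
for P in 0.5 0.25 0.10 0.05; do
    awk -v p="$P" 'BEGIN {
        n = int(log(0.05) / log(1 - p)) + 1
        printf "to rule out p >= %.2f at 95%% confidence: %d clean unplugs\n", p, n
    }'
done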

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-29  9:49                                                                     ` [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek
@ 2009-08-29 16:35                                                                         ` david
  2009-08-29 16:35                                                                         ` david
  1 sibling, 0 replies; 309+ messages in thread
From: david @ 2009-08-29 16:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1331 bytes --]

On Sat, 29 Aug 2009, Pavel Machek wrote:

>> for flash drives the danger is very straightforward (although even then
>> you have to note that it depends heavily on the firmware of the device,
>> some will lose lots of data, some won't lose any)
>
> I have not seen one that works :-(.

so let's get broader testing (including testing the SSDs as well as the 
thumb drives)

>> you are generalizing that since you have lost data on flash drives, all
>> flash drives are dangerous.
>
> Do the flash manufacturers claim they do not cause collateral damage
> during powerfail? If not, they probably are dangerous.

I think that every single one of them will tell you to not unplug the 
drive while writing to it. in fact, I'll bet they all tell you to not 
unplug the drive without unmounting ('ejecting') it at the OS level.

> Anyway, you wanted a test, and one is attached. It normally takes like
> 4 unplugs to uncover problems.

Ok, help me understand this.

I copy these two files to a system, change them to point at the correct 
device, run them and unplug the drive while it's running.

when I plug the device back in, how do I tell if it lost something 
unexpected? since you are writing from urandom I have no idea what data 
_should_ be on the drive, so how can I detect that a data block has been 
corrupted?

David Lang

[-- Attachment #2: Type: TEXT/PLAIN, Size: 923 bytes --]

#!/bin/bash
#
# Copyright 2008 Pavel Machek <pavel@suse.cz>, GPLv2
#
# vfat is broken with filesize=0
#
#

if [ .$MOUNTOPTS = . ]; then
# ext3 is needed, or you need to disable caches using hdparm.
# -o dirsync is needed, else modify fstest.work to fsync the directory.
    MOUNTOPTS="-o dirsync"
fi
if [ .$BDEV = . ]; then
#    BDEV=/dev/sdb3
    BDEV=/dev/nd0
fi

export FILESIZE=4000
export NUMFILES=4000

waitforcard() {
    umount /mnt
    echo Waiting for card:
    while ! mount $BDEV $MOUNTOPTS /mnt 2> /dev/null; do
	echo -n .
	sleep 1
    done
#   hdparm -W0 $BDEV
    echo
}

mkdir delme.fstest
cd delme.fstest

waitforcard
rm tmp.* final.* /mnt/tmp.* /mnt/final.*

while true; do
    ../fstest.work
    echo
    waitforcard
    echo Testing: fsck....
    umount /mnt
    fsck -fy $BDEV
    echo Testing....
    waitforcard
    for A in final.*; do
	echo -n $A " "
	cmp $A /mnt/$A || exit
    done
    echo
done

[-- Attachment #3: Type: TEXT/PLAIN, Size: 409 bytes --]

#!/bin/bash
#
# Copyright 2008 Pavel Machek <pavel@suse.cz>, GPLv2
#

echo "Writing test files: "
for A in `seq $NUMFILES`; do
    echo -n $A " "
    rm final.$A
    cat /dev/urandom | head -c $FILESIZE > tmp.$A
    dd conv=fsync if=tmp.$A of=/mnt/final.$A 2> /dev/zero || exit
#    cat /mnt/final.$A > /dev/null || exit
# sync should not be needed, as dd asks for fsync
#    sync
    mv tmp.$A final.$A
done

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-29 10:05                                                         ` Pavel Machek
@ 2009-08-29 20:22                                                           ` Rob Landley
  2009-08-29 21:34                                                             ` Pavel Machek
  2009-09-03 16:56                                                             ` what fsck can (and can't) do was " david
  0 siblings, 2 replies; 309+ messages in thread
From: Rob Landley @ 2009-08-29 20:22 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Saturday 29 August 2009 05:05:58 Pavel Machek wrote:
> On Fri 2009-08-28 07:49:38, david@lang.hm wrote:
> > On Thu, 27 Aug 2009, Rob Landley wrote:
> >> Pavel's response was to attempt to document this.  Not that journaling
> >> is _bad_, but that it doesn't protect against this class of problem.
> >
> > I don't think anyone is disagreeing with the statement that journaling
> > doesn't protect against this class of problems, but Pavel's statements
> > didn't say that. he stated that ext3 is more dangerous than ext2.
>
> Well, if you use 'common' fsck policy, ext3 _is_ more dangerous.

The filesystem itself isn't more dangerous, but it may provide a false sense of 
security when used on storage devices it wasn't designed for.

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-29 16:27                                       ` david
@ 2009-08-29 21:33                                         ` Pavel Machek
  0 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-29 21:33 UTC (permalink / raw)
  To: david
  Cc: David Woodhouse, Theodore Tso, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Hi!

>> This is really so easy to reproduce, that such speedup is not
>> necessary. Just try the scripts :-).
>
> so if it doesn't get corrupted after 5 unplugs does that mean that that  
> particular device doesn't have a problem? or does it just mean you got  
> lucky?
>
> would 10 successful unplugs mean that it's safe?
>
> what about 20?

I'd say 20 means it's safe.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-29 20:22                                                           ` Rob Landley
@ 2009-08-29 21:34                                                             ` Pavel Machek
  2009-09-03 16:56                                                             ` what fsck can (and can't) do was " david
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-29 21:34 UTC (permalink / raw)
  To: Rob Landley
  Cc: david, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Sat 2009-08-29 15:22:06, Rob Landley wrote:
> On Saturday 29 August 2009 05:05:58 Pavel Machek wrote:
> > On Fri 2009-08-28 07:49:38, david@lang.hm wrote:
> > > On Thu, 27 Aug 2009, Rob Landley wrote:
> > >> Pavel's response was to attempt to document this.  Not that journaling
> > >> is _bad_, but that it doesn't protect against this class of problem.
> > >
> > > I don't think anyone is disagreeing with the statement that journaling
> > > doesn't protect against this class of problems, but Pavel's statements
> > > didn't say that. he stated that ext3 is more dangerous than ext2.
> >
> > Well, if you use 'common' fsck policy, ext3 _is_ more dangerous.
> 
> The filesystem itself isn't more dangerous, but it may provide a false sense of 
> security when used on storage devices it wasn't designed for.

Agreed.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] document flash/RAID dangers
  2009-08-26 12:37                                                           ` Theodore Tso
  2009-08-30  6:49                                                             ` Pavel Machek
@ 2009-08-30  6:49                                                             ` Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-30  6:49 UTC (permalink / raw)
  To: Theodore Tso, david, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Wed 2009-08-26 08:37:09, Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 01:25:36PM +0200, Pavel Machek wrote:
> > > you just plain cannot count on writes that are in flight when a powerfail 
> > > happens to do predictable things, let alone what you consider sane or  
> > > proper.
> > 
> > From what I see, this kind of failure is rather harder to reproduce
> > than the software problems. And at least SGI machines were designed to
> > avoid this...
> > 
> > Anyway, I'd like to hear from ext3 people... what happens on read
> > errors in journal? That's what you'd expect to see in situation above.
> 
> On a power failure, what normally happens is that the random garbage
> gets written into the disk drive's last dying gasp, since the memory
> starts going insane and sends garbage to the disk.  So the disk
> successfully completes the write, but the sector contains garbage.
> Since HDD's tend to be last thing to die, being less sensitive to
> voltage drops than the memory or DMA controller, my experience is that
> you don't get a read error after the system comes up, you just get
> garbage written into the journal.
> 
> The ext3 journalling code waits until all of the journal data is
> written, and only then writes the commit block.  On restart, we look
> for the last valid commit block.  So if the power failure is before we
> write the commit block, we replay the journal up until the previous
> commit block.  If the power failure is while we are writing the commit
> block, garbage will be written out instead of the commit block, and so
> it falls back to the previous case.
> 
> We do not allow any updates to the filesystem metadata to take place
> until the commit block has been written; therefore the filesystem
> stays consistent.

Ok, cool.

> If the journal *does* develop read errors, then fsck will
> require a manual fsck, and so the boot operation will get stopped so a
> system administrator can provide manual intervention.  The best bet
> for the sysadmin is to replay as much of the journal she can, and then
> let fsck fix any resulting filesystem inconsistencies.  In practice,

...and that should result in a consistent fs with no data loss, because
a read error is essentially the same as garbage given back, right?

...plus, this is a significant difference from logical-logging
filesystems, no?

Should this go to Documentation/, somewhere?

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 12:23                                                                   ` Theodore Tso
  2009-08-30  7:01                                                                     ` Pavel Machek
@ 2009-08-30  7:01                                                                     ` Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-30  7:01 UTC (permalink / raw)
  To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Wed 2009-08-26 08:23:11, Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 01:12:08PM +0200, Pavel Machek wrote:
> > > I agree that this is not an interesting (or likely) scenario, certainly  
> > > when compared to the much more frequent failures that RAID will protect  
> > > against which is why I object to the document as Pavel suggested. It  
> > > will steer people away from using RAID and directly increase their  
> > > chances of losing their data if they use just a single disk.
> > 
> > So instead of fixing or at least documenting known software deficiency
> > in Linux MD stack, you'll try to suppress that information so that
> > people use more of raid5 setups?
> 
> First of all, it's not a "known software deficiency"; you can't do
> anything about a degraded RAID array, other than to replace the failed
> disk. 

You could add journal to raid5.

> "ext2 and ext3 have this surprising dependency that disks act like
> disks".  (alarmist)

AFAICT, you mount a block device, not a disk. Many block devices fail
the test. And since users (and block device developers) do not know in
detail how disks behave, it is hard to blame them... ("you may corrupt
the sector you are writing to and ext3 handles that ok" was a surprise
to me, for example).

					
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 13:43                                           ` david
  2009-08-26 18:02                                             ` Theodore Tso
@ 2009-08-30  7:03                                             ` Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-30  7:03 UTC (permalink / raw)
  To: david
  Cc: Rik van Riel, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack

On Wed 2009-08-26 06:43:24, david@lang.hm wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>>>>> The metadata is just a way to get to my data, while the data
>>>>> is actually important.
>>>>
>>>> Personally, I care about metadata consistency, and ext3 documentation
>>>> suggests that journal protects its integrity. Except that it does not
>>>> on broken storage devices, and you still need to run fsck there.
>>>
>>> as the ext3 authors have stated many times over the years, you still need
>>> to run fsck periodicly anyway.
>>
>> Where is that documented?
>
> linux-kernel mailing list archives.

That's not where fs documentation belongs :-(.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-29 16:35                                                                         ` david
  (?)
@ 2009-08-30  7:07                                                                         ` Pavel Machek
  -1 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-30  7:07 UTC (permalink / raw)
  To: david
  Cc: Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

Hi!

>>> for flash drives the danger is very straightforward (although even then
>>> you have to note that it depends heavily on the firmware of the device,
>>> some will lose lots of data, some won't lose any)
>>
>> I have not seen one that works :-(.
>
> so let's get broader testing (including testing the SSDs as well as the  
> thumb drives)

If someone can do an SSD test -- yes, that would be interesting.

>> Anyway, you wanted a test, and one is attached. It normally takes like
>> 4 unplugs to uncover problems.
>
> Ok, help me understand this.
>
> I copy these two files to a system, change them to point at the correct  
> device, run them and unplug the drive while it's running.

Yep.

> when I plug the device back in, how do I tell if it lost something  
> unexpected? since you are writing from urandom I have no idea what data  
> _should_ be on the drive, so how can I detect that a data block has been  
> corrupted?

I have a mirror on the disk you are not unplugging. See the cmp || exit lines.

The test continues until it detects corruption.
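
Illustration only (the exact wording depends on the cmp version): a
detected mismatch makes the loop print something roughly like

	final.1234 /mnt/final.1234 differ: char 2049, line 9

before exiting, so the last file name echoed is the damaged one.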
								Pavel


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-27  6:54                                           ` david
  2009-08-27  7:34                                             ` Rob Landley
@ 2009-08-30  7:19                                             ` Pavel Machek
  2009-08-30 12:48                                               ` david
  1 sibling, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-08-30  7:19 UTC (permalink / raw)
  To: david
  Cc: Rob Landley, Theodore Tso, Rik van Riel, Ric Wheeler,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Hi!

>> I thought the reason for that was that if your metadata is horked, further
>> writes to the disk can trash unrelated existing data because it's lost track
>> of what's allocated and what isn't.  So back when the assumption was "what's
>> written stays written", then keeping the metadata sane was still darn
>> important to prevent normal operation from overwriting unrelated existing
>> data.
>>
>> Then Pavel notified us of a situation where interrupted writes to the disk can
>> trash unrelated existing data _anyway_, because the flash block size on the 16
>> gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks
>> it's 4k or smaller.  It seems like what _broke_ was the assumption that the
>> filesystem block size >= the disk block size, and nobody noticed for a while.
>> (Except the people making jffs2 and friends, anyway.)
>>
>> Today we have cheap plentiful USB keys that act like hard drives, except that
>> their write block size isn't remotely the same as hard drives', but they
>> pretend it is, and then the block wear levelling algorithms fuzz things
>> further.  (Gee, a drive controller lying about drive geometry, the scsi crowd
>> should feel right at home.)
>
> actually, you don't know if your USB key works that way or not. Pavel has 
> some that do, that doesn't mean that all flash drives do
>
> when you do a write to a flash drive you have to do the following items
>
> 1. allocate an empty eraseblock to put the data on
>
> 2. read the old eraseblock
>
> 3. merge the incoming write to the eraseblock
>
> 4. write the updated data to the flash
>
> 5. update the flash translation layer to point reads at the new location
> instead of the old location.


That would need two erases per single sector written, no? Erase is in
the millisecond range, so the performance would be just way too bad :-(.
	   	     	 	     	      	       	       Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-28 12:08                                                         ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Theodore Tso
@ 2009-08-30  7:51                                                           ` Pavel Machek
  2009-08-30  9:01                                                             ` Christian Kujau
                                                                               ` (2 more replies)
  2009-08-30  7:51                                                           ` Pavel Machek
  1 sibling, 3 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-30  7:51 UTC (permalink / raw)
  To: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Hi!

> > From: Theodore Tso <tytso@mit.edu>
> > 
> > Document that many devices are too broken for filesystems to protect
> > data in case of powerfail.
> > 
> > Signed-of-by: Pavel Machek <pavel@ucw.cz> 
> 
> NACK.  I didn't write this patch, and it's disingenuous for you to try
> to claim that I authored it.

Well, you did write the original text, so I wanted to give you
credit. Sorry.

> While Neil Brown's corrections are minimally necessary so the text is
> at least technically *correct*, it's still not the right advice to
> give system administrators.  It's better than the fear-mongering
> patches you had proposed earlier, but what would be better *still* is
> telling people why running with degraded RAID arrays is bad, and to
> give them further tips about how to use RAID arrays safely.

Maybe this belongs to Doc*/filesystems, and more detailed RAID
description should go to md description?

> To use your ABS brakes analogy, just because it's not safe to rely on
> ABS brakes if the "check brakes" light is on, that doesn't justify
> writing something alarmist which claims that ABS brakes don't work
> 100% of the time, don't use ABS brakes, they're broken!!!!

If it only was this simple. We don't have 'check brakes' (aka
'journalling ineffective') warning light. If we had that, I would not
have problem.

It is rather that your ABS brakes are ineffective if 'check engine'
(RAID degraded) is lit. And yes, running with 'check engine' for
extended periods may be bad idea, but I know people that do
that... and I still hope their brakes work (and believe they should
have won suit for damages should their ABS brakes fail). 

> That's just silly.  What we should be telling people instead is (a)
> pay attention to the check brakes light (just as you should pay
> attention to the RAID array is degraded warning), and (b) while ABS

'your RAID array is degraded' is a very counterintuitive way to say
'...and btw your journalling is no longer effective, either'.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-30  7:51                                                           ` Pavel Machek
@ 2009-08-30  9:01                                                             ` Christian Kujau
  2009-09-02 20:55                                                               ` Pavel Machek
  2009-08-30 12:55                                                             ` david
  2009-08-30 15:20                                                             ` Theodore Tso
  2 siblings, 1 reply; 309+ messages in thread
From: Christian Kujau @ 2009-08-30  9:01 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Sun, 30 Aug 2009 at 09:51, Pavel Machek wrote:
> > give system administrators.  It's better than the fear-mongering
> > patches you had proposed earlier, but what would be better *still* is
> > telling people why running with degraded RAID arrays is bad, and to
> > give them further tips about how to use RAID arrays safely.
> 
> Maybe this belongs to Doc*/filesystems, and more detailed RAID
> description should go to md description?

Why should this be placed in *kernel* documentation anyway? The "dangers 
of RAID", the hints that "backups are a good idea" - isn't that something 
for howtos for sysadmins? No end-user will ever look into Documentation/ 
anyway. The sysadmins should know what they're doing and see the upsides 
and downsides of RAID and journalling filesystems. And they'll turn to 
howtos and tutorials to find out. And maybe seek *reference* documentation 
in Documentation/ - but I don't think Storage-101 should be covered in 
a mostly hidden place like Documentation/.

Christian.
-- 
BOFH excuse #212:

Of course it doesn't work. We've performed a software upgrade.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-30  7:19                                             ` Pavel Machek
@ 2009-08-30 12:48                                               ` david
  0 siblings, 0 replies; 309+ messages in thread
From: david @ 2009-08-30 12:48 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, Theodore Tso, Rik van Riel, Ric Wheeler,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Sun, 30 Aug 2009, Pavel Machek wrote:

>>> I thought the reason for that was that if your metadata is horked, further
>>> writes to the disk can trash unrelated existing data because it's lost track
>>> of what's allocated and what isn't.  So back when the assumption was "what's
>>> written stays written", then keeping the metadata sane was still darn
>>> important to prevent normal operation from overwriting unrelated existing
>>> data.
>>>
>>> Then Pavel notified us of a situation where interrupted writes to the disk can
>>> trash unrelated existing data _anyway_, because the flash block size on the 16
>>> gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks
>>> it's 4k or smaller.  It seems like what _broke_ was the assumption that the
>>> filesystem block size >= the disk block size, and nobody noticed for a while.
>>> (Except the people making jffs2 and friends, anyway.)
>>>
>>> Today we have cheap plentiful USB keys that act like hard drives, except that
>>> their write block size isn't remotely the same as hard drives', but they
>>> pretend it is, and then the block wear levelling algorithms fuzz things
>>> further.  (Gee, a drive controller lying about drive geometry, the scsi crowd
>>> should feel right at home.)
>>
>> actually, you don't know if your USB key works that way or not. Pavel has
>> some that do, that doesn't mean that all flash drives do
>>
>> when you do a write to a flash drive you have to do the following items
>>
>> 1. allocate an empty eraseblock to put the data on
>>
>> 2. read the old eraseblock
>>
>> 3. merge the incoming write to the eraseblock
>>
>> 4. write the updated data to the flash
>>
>> 5. update the flash translation layer to point reads at the new location
>> instead of the old location.
>
>
> That would need two erases per single sector written, no? Erase is in
> the millisecond range, so the performance would be just way too bad :-(.

no, it only needs one erase

if you don't have a pool of pre-erased blocks, then you need to do an 
erase of the new block you are allocating (before step 4)

if you do have a pool of pre-erased blocks, then you don't have to do any 
erase of the data blocks until after step 5 and you do the erase when you 
add the old data block to the pool of pre-erased blocks later.

in either case the requirements of wear leveling require that the flash 
translation layer update its records to show that an additional write 
took place.

what appears to be happening on some cheap devices is that they do the 
following instead

1. allocate an empty eraseblock to put the data on

2. read the old eraseblock

3. merge the incoming write to the eraseblock

4. erase the old eraseblock

5. write the updated data to the flash

I don't know where in (or after) this process they update the
wear-leveling/flash translation layer info.

with this algorithm, if the device loses power between step 4 and step 5
you lose all the data on the eraseblock.

with deferred erasing of blocks, the safer algorithm is actually the
faster one (up until you run out of your pool of available eraseblocks, at
which time it slows down to the same speed as the unreliable one).

most flash drives are fairly slow to write to in any case.

even the Intel X25M drives are in the same ballpark as rotating media for 
writes. as far as I know only the X25E SSD drives are faster to write to 
than rotating media, and most of them are _far_ slower.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-30  7:51                                                           ` Pavel Machek
  2009-08-30  9:01                                                             ` Christian Kujau
@ 2009-08-30 12:55                                                             ` david
  2009-08-30 14:12                                                               ` Ric Wheeler
  2009-08-30 15:05                                                               ` Pavel Machek
  2009-08-30 15:20                                                             ` Theodore Tso
  2 siblings, 2 replies; 309+ messages in thread
From: david @ 2009-08-30 12:55 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Sun, 30 Aug 2009, Pavel Machek wrote:

>>> From: Theodore Tso <tytso@mit.edu>
>>>
>> To use your ABS brakes analogy, just because it's not safe to rely on
>> ABS brakes if the "check brakes" light is on, that doesn't justify
>> writing something alarmist which claims that ABS brakes don't work
>> 100% of the time, don't use ABS brakes, they're broken!!!!
>
> If it only was this simple. We don't have 'check brakes' (aka
> 'journalling ineffective') warning light. If we had that, I would not
> have problem.
>
> It is rather that your ABS brakes are ineffective if 'check engine'
> (RAID degraded) is lit. And yes, running with 'check engine' for
> extended periods may be bad idea, but I know people that do
> that... and I still hope their brakes work (and believe they should
> have won suit for damages should their ABS brakes fail).

the 'RAID degraded' warning says that _anything_ you put on that block 
device is at risk. it doesn't matter if you are using a filesystem with a 
journal, one without, or using the raw device directly.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-30 12:55                                                             ` david
@ 2009-08-30 14:12                                                               ` Ric Wheeler
  2009-08-30 14:44                                                                 ` Michael Tokarev
  2009-08-30 15:05                                                               ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-30 14:12 UTC (permalink / raw)
  To: david
  Cc: Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 08/30/2009 08:55 AM, david@lang.hm wrote:
> On Sun, 30 Aug 2009, Pavel Machek wrote:
>
>>>> From: Theodore Tso <tytso@mit.edu>
>>>>
>>> To use your ABS brakes analogy, just because it's not safe to rely on
>>> ABS brakes if the "check brakes" light is on, that doesn't justify
>>> writing something alarmist which claims that ABS brakes don't work
>>> 100% of the time, don't use ABS brakes, they're broken!!!!
>>
>> If it only was this simple. We don't have 'check brakes' (aka
>> 'journalling ineffective') warning light. If we had that, I would not
>> have problem.
>>
>> It is rather that your ABS brakes are ineffective if 'check engine'
>> (RAID degraded) is lit. And yes, running with 'check engine' for
>> extended periods may be bad idea, but I know people that do
>> that... and I still hope their brakes work (and believe they should
>> have won suit for damages should their ABS brakes fail).
>
> the 'RAID degraded' warning says that _anything_ you put on that block 
> device is at risk. it doesn't matter if you are using a filesystem 
> with a journal, one without, or using the raw device directly.
>
> David Lang

The easiest way to lose your data in Linux - with RAID, without RAID, 
S-ATA or SAS - is to run with the write cache enabled.

If you compare the size of even a large RAID stripe it will be measured 
in KB and as this thread has mentioned already, you stand to have damage 
to just one stripe (or even just a disk sector or two).

If you lose power with the write caches enabled on that same 5 drive 
RAID set, you could lose as much as 5 * 32MB of freshly written data on  
a power loss (16-32MB write caches are common on s-ata disks these days).

For MD5 (and MD6), you really must run with the write cache disabled 
until we get barriers to work for those configurations.
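
Illustrative sketch only -- the member names below are placeholders,
substitute the real components of your array:

	# report the current volatile write cache setting per member disk
	for d in /dev/sd[b-f]; do hdparm -W $d; done

	# disable it (repeat after every boot, e.g. from an init script)
	for d in /dev/sd[b-f]; do hdparm -W0 $d; done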

It would be interesting for Pavel to retest with the write cache 
enabled/disabled on his power loss scenarios with multi-drive RAID.

Regards,

Ric




^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-30 14:12                                                               ` Ric Wheeler
@ 2009-08-30 14:44                                                                 ` Michael Tokarev
  2009-08-30 16:10                                                                   ` Ric Wheeler
  2009-08-30 16:35                                                                   ` Christoph Hellwig
  0 siblings, 2 replies; 309+ messages in thread
From: Michael Tokarev @ 2009-08-30 14:44 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Ric Wheeler wrote:
[]
> The easiest way to lose your data in Linux - with RAID, without RAID, 
> S-ATA or SAS - is to run with the write cache enabled.
> 
> If you compare the size of even a large RAID stripe it will be measured 
> in KB and as this thread has mentioned already, you stand to have damage 
> to just one stripe (or even just a disk sector or two).
> 
> If you lose power with the write caches enabled on that same 5 drive 
> RAID set, you could lose as much as 5 * 32MB of freshly written data on  
> a power loss (16-32MB write caches are common on s-ata disks these days).

This is fundamentally wrong.  Many filesystems today use either barriers
or flushes (if barriers are not supported), and the times when disk drives
were lying to the OS that the cache got flushed are long gone.

> For MD5 (and MD6), you really must run with the write cache disabled 
> until we get barriers to work for those configurations.

I highly doubt barriers will ever be supported on anything but simple
raid1, because it's impossible to guarantee ordering across multiple
drives.  Well, it *is* possible to have write barriers with journalled
(and/or with battery-backed-cache) raid[456].

Note that even if raid[456] does not support barriers, write cache
flushes still works.

/mjt

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-30 12:55                                                             ` david
  2009-08-30 14:12                                                               ` Ric Wheeler
@ 2009-08-30 15:05                                                               ` Pavel Machek
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-08-30 15:05 UTC (permalink / raw)
  To: david
  Cc: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Sun 2009-08-30 05:55:01, david@lang.hm wrote:
> On Sun, 30 Aug 2009, Pavel Machek wrote:
>
>>>> From: Theodore Tso <tytso@mit.edu>
>>>>
>>> To use your ABS brakes analogy, just because it's not safe to rely on
>>> ABS brakes if the "check brakes" light is on, that doesn't justify
>>> writing something alarmist which claims that ABS brakes don't work
>>> 100% of the time, don't use ABS brakes, they're broken!!!!
>>
>> If it only was this simple. We don't have 'check brakes' (aka
>> 'journalling ineffective') warning light. If we had that, I would not
>> have problem.
>>
>> It is rather that your ABS brakes are ineffective if 'check engine'
>> (RAID degraded) is lit. And yes, running with 'check engine' for
>> extended periods may be bad idea, but I know people that do
>> that... and I still hope their brakes work (and believe they should
>> have won suit for damages should their ABS brakes fail).
>
> the 'RAID degraded' warning says that _anything_ you put on that block  
> device is at risk. it doesn't matter if you are using a filesystem with a 
> journal, one without, or using the raw device directly.

If you are using one with a journal, you'll still need to run fsck at
boot time to make sure the metadata is still consistent... The protection
provided by journaling is not effective in this configuration.

(You have a point that pretty much all users of the block device will
be affected by powerfail degraded mode.)
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-30  7:51                                                           ` Pavel Machek
  2009-08-30  9:01                                                             ` Christian Kujau
  2009-08-30 12:55                                                             ` david
@ 2009-08-30 15:20                                                             ` Theodore Tso
  2009-08-31 17:49                                                               ` Jesse Brandeburg
                                                                                 ` (3 more replies)
  2 siblings, 4 replies; 309+ messages in thread
From: Theodore Tso @ 2009-08-30 15:20 UTC (permalink / raw)
  To: Pavel Machek
  Cc: NeilBrown, Ric Wheeler, Rob Landley, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Sun, Aug 30, 2009 at 09:51:35AM +0200, Pavel Machek wrote:
> 
> If it only was this simple. We don't have 'check brakes' (aka
> 'journalling ineffective') warning light. If we had that, I would not
> have problem.

But we do; competently designed (and in the case of software RAID,
competently packaged) RAID subsystems send notifications to the system
administrator when there is a hard drive failure.  Some hardware RAID
systems will send a page to the system administrator.  A mid-range
Areca card has a separate ethernet port so it can send e-mail to the
administrator, even if the OS is hosed for some reason.

And it's not a matter of journalling being ineffective; the much bigger deal
is, "your data is at risk"; perhaps because the file system metadata
may become subject to corruption, but more critically, because the
file data may become subject to corruption.  Metadata becoming subject
to corruption is important primarily because it leads to data becoming
corrupted; metadata is the tail; the user's data is the dog.

So we *do* have the warning light; the problem is that just as some
people may not realize that "check brakes" means, "YOU COULD DIE",
some people may not realize that "hard drive failure; RAID array
degraded" could mean, "YOU COULD LOSE DATA".

Fortunately, for software RAID, this is easily solved; if you are so
concerned, why don't you submit a patch to mdadm adjusting the e-mail
sent to the system administrator when the array is in a degraded
state, such that it states, "YOU COULD LOSE DATA".  I would gently
suggest to you this would be ***far*** more effective than a patch to
kernel documentation.
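
As a sketch of that (the alert script path and address are invented here;
MAILADDR and PROGRAM are standard mdadm.conf keywords, and mdadm --monitor
invokes PROGRAM with the event name and the md device as arguments):

	# /etc/mdadm.conf
	MAILADDR root@example.com
	PROGRAM  /usr/local/sbin/raid-alert

	# /usr/local/sbin/raid-alert
	#!/bin/bash
	# $1 = event (DegradedArray, Fail, ...), $2 = md device
	case "$1" in
	    DegradedArray|Fail)
	        echo "$2 is running degraded: YOU COULD LOSE DATA on the next" \
	             "power failure; journalling alone will not protect you." |
	          mail -s "RAID WARNING: $2 degraded" root
	        ;;
	esac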

						- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-30 14:44                                                                 ` Michael Tokarev
@ 2009-08-30 16:10                                                                   ` Ric Wheeler
  2009-08-30 16:35                                                                   ` Christoph Hellwig
  1 sibling, 0 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-30 16:10 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 08/30/2009 10:44 AM, Michael Tokarev wrote:
> Ric Wheeler wrote:
> []
>> The easiest way to lose your data in Linux - with RAID, without RAID, 
>> S-ATA or SAS - is to run with the write cache enabled.
>>
>> If you compare the size of even a large RAID stripe it will be 
>> measured in KB and as this thread has mentioned already, you stand to 
>> have damage to just one stripe (or even just a disk sector or two).
>>
>> If you lose power with the write caches enabled on that same 5 drive 
>> RAID set, you could lose as much as 5 * 32MB of freshly written data 
>> on  a power loss (16-32MB write caches are common on s-ata disks 
>> these days).
>
> This is fundamentally wrong.  Many filesystems today use either barriers
> or flushes (if barriers are not supported), and the times when disk 
> drives
> were lying to the OS that the cache got flushed are long gone.
Unfortunately not - if you mount a file system with write cache enabled 
and see "barriers disabled" messages in /var/log/messages, this is 
exactly what happens.
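
A quick way to check (sketch only; the exact message text varies with
filesystem and kernel version, and /dev/md0 is just an example):

	# ask for barriers explicitly -- ext3 does not enable them by default
	mount -o barrier=1 /dev/md0 /mnt

	# if a lower layer rejects them, the log shows something like
	#   "JBD: barrier-based sync failed on md0 - disabling barriers"
	dmesg | grep -i barrier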

File systems issue write barrier operations that in turn result in cache
flush (ATA_FLUSH_EXT) commands or the SCSI equivalent.

MD5 and MD6 do not pass these operations on currently and there is no 
other file system level mechanism that somehow bypasses the IO stack to 
invalidate or flush the cache.

Note that some devices have non-volatile write caches (specifically 
arrays or battery backed RAID cards) where this is not an issue.


>
>> For MD5 (and MD6), you really must run with the write cache disabled 
>> until we get barriers to work for those configurations.
>
> I highly doubt barriers will ever be supported on anything but simple
> raid1, because it's impossible to guarantee ordering across multiple
> drives.  Well, it *is* possible to have write barriers with journalled
> (and/or with battery-backed-cache) raid[456].
>
> Note that even if raid[456] does not support barriers, write cache
> flushes still works.
>
> /mjt

I think that you are confused - barriers are implemented using cache 
flushes.

Ric



^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-30 14:44                                                                 ` Michael Tokarev
  2009-08-30 16:10                                                                   ` Ric Wheeler
@ 2009-08-30 16:35                                                                   ` Christoph Hellwig
  2009-08-31 13:15                                                                     ` Ric Wheeler
  1 sibling, 1 reply; 309+ messages in thread
From: Christoph Hellwig @ 2009-08-30 16:35 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Ric Wheeler, david, Pavel Machek, Theodore Tso, NeilBrown,
	Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4,
	corbet

On Sun, Aug 30, 2009 at 06:44:04PM +0400, Michael Tokarev wrote:
>> If you lose power with the write caches enabled on that same 5 drive  
>> RAID set, you could lose as much as 5 * 32MB of freshly written data on 
>>  a power loss (16-32MB write caches are common on s-ata disks these 
>> days).
>
> This is fundamentally wrong.  Many filesystems today use either barriers
> or flushes (if barriers are not supported), and the times when disk drives
> were lying to the OS that the cache got flushed are long gone.

While most common filesystems do have barrier support, it is:

 - not actually enabled for the two most common filesystems
 - the support for write barriers and cache flushing tends to be buggy
   all over our software stack,

>> For MD5 (and MD6), you really must run with the write cache disabled  
>> until we get barriers to work for those configurations.
>
> I highly doubt barriers will ever be supported on anything but simple
> raid1, because it's impossible to guarantee ordering across multiple
> drives.  Well, it *is* possible to have write barriers with journalled
> (and/or with battery-backed-cache) raid[456].
>
> Note that even if raid[456] does not support barriers, write cache
> flushes still works.

All currently working barrier implementations on Linux are built upon
queue drains and cache flushes, plus sometimes setting the FUA bit.


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-30 16:35                                                                   ` Christoph Hellwig
@ 2009-08-31 13:15                                                                     ` Ric Wheeler
  2009-08-31 13:16                                                                       ` Christoph Hellwig
  0 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-31 13:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown,
	Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4,
	corbet

On 08/30/2009 12:35 PM, Christoph Hellwig wrote:
> On Sun, Aug 30, 2009 at 06:44:04PM +0400, Michael Tokarev wrote:
>>> If you lose power with the write caches enabled on that same 5 drive
>>> RAID set, you could lose as much as 5 * 32MB of freshly written data on
>>>   a power loss (16-32MB write caches are common on s-ata disks these
>>> days).
>>
>> This is fundamentally wrong.  Many filesystems today use either barriers
>> or flushes (if barriers are not supported), and the times when disk drives
>> were lying to the OS that the cache got flushed are long gone.
>
> While most common filesystems do have barrier support, it is:
>
>   - not actually enabled for the two most common filesystems
>   - the support for write barriers and cache flushing tends to be buggy
>     all over our software stack,
>

Or just missing - I think that MD5/6 simply drop the requests at present.

I wonder if it would be worth having MD probe for write cache enabled & warn if 
barriers are not supported?

>>> For MD5 (and MD6), you really must run with the write cache disabled
>>> until we get barriers to work for those configurations.
>>
>> I highly doubt barriers will ever be supported on anything but simple
>> raid1, because it's impossible to guarantee ordering across multiple
>> drives.  Well, it *is* possible to have write barriers with journalled
>> (and/or with battery-backed-cache) raid[456].
>>
>> Note that even if raid[456] does not support barriers, write cache
>> flushes still works.
>
> All currently working barrier implementations on Linux are built upon
> queue drains and cache flushes, plus sometimes setting the FUA bit.
>


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 13:15                                                                     ` Ric Wheeler
@ 2009-08-31 13:16                                                                       ` Christoph Hellwig
  2009-08-31 13:19                                                                         ` Mark Lord
  2009-08-31 13:22                                                                         ` Ric Wheeler
  0 siblings, 2 replies; 309+ messages in thread
From: Christoph Hellwig @ 2009-08-31 13:16 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, Michael Tokarev, david, Pavel Machek,
	Theodore Tso, NeilBrown, Rob Landley, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote:
>> While most common filesystems do have barrier support it is:
>>
>>   - not actually enabled for the two most common filesystems
>>   - the support for write barriers and cache flushing tends to be buggy
>>     all over our software stack,
>>
>
> Or just missing - I think that MD5/6 simply drop the requests at present.
>
> I wonder if it would be worth having MD probe for write cache enabled & 
> warn if barriers are not supported?

In my opinion even that is too weak.  We know how to control the cache
settings on all common disks (that is scsi and ata), so we should always
disable the write cache unless we know that the whole stack (filesystem,
raid, volume managers) supports barriers.  And even then we should make
sure the filesystems actually use barriers everywhere that's needed,
which is something we have failed at for years.
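
For ATA disks the knob itself is easy to reach - roughly what hdparm -W0
does (a sketch only; the device path is hypothetical, it needs root, and
newer tools may send the same ATA SETFEATURES command via SG_IO instead):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/hdreg.h>

  int main(void)
  {
      /* HDIO_DRIVE_CMD args: [0] = ATA command, [2] = feature register */
      unsigned char args[4] = { WIN_SETFEATURES, 0, SETFEATURES_DIS_WCACHE, 0 };
      int fd = open("/dev/sda", O_RDONLY | O_NONBLOCK);

      if (fd < 0) {
          perror("open");
          return 1;
      }
      if (ioctl(fd, HDIO_DRIVE_CMD, args) < 0)
          perror("SETFEATURES (disable write cache)");
      close(fd);
      return 0;
  }

The hard part is the policy of when to pull that knob, not the mechanism.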


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 13:16                                                                       ` Christoph Hellwig
@ 2009-08-31 13:19                                                                         ` Mark Lord
  2009-08-31 13:21                                                                           ` Christoph Hellwig
  2009-08-31 13:22                                                                         ` Ric Wheeler
  1 sibling, 1 reply; 309+ messages in thread
From: Mark Lord @ 2009-08-31 13:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ric Wheeler, Michael Tokarev, david, Pavel Machek, Theodore Tso,
	NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote:
>>> While most common filesystems do have barrier support it is:
>>>
>>>   - not actually enabled for the two most common filesystems
>>>   - the support for write barriers and cache flushing tends to be buggy
>>>     all over our software stack,
>>>
>> Or just missing - I think that MD5/6 simply drop the requests at present.
>>
>> I wonder if it would be worth having MD probe for write cache enabled & 
>> warn if barriers are not supported?
> 
> In my opinion even that is too weak.  We know how to control the cache
> settings on all common disks (that is scsi and ata), so we should always
> disable the write cache unless we know that the whole stack (filesystem,
> raid, volume managers) supports barriers.  And even then we should make
> sure the filesystems actually use barriers everywhere that's needed,
> which is something we have failed at for years.
..

That stack does not know that my MD device has full battery backup,
so it bloody well better NOT prevent me from enabling the write caches.

In fact, MD should have nothing to do with that.  I do like/prefer the
way that XFS currently does it:  disables barriers and logs the event,
but otherwise doesn't try to enforce policy upon me from kernel space.

Cheers

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 13:19                                                                         ` Mark Lord
@ 2009-08-31 13:21                                                                           ` Christoph Hellwig
  2009-08-31 15:14                                                                             ` jim owens
  2009-09-03  1:59                                                                             ` Ric Wheeler
  0 siblings, 2 replies; 309+ messages in thread
From: Christoph Hellwig @ 2009-08-31 13:21 UTC (permalink / raw)
  To: Mark Lord
  Cc: Christoph Hellwig, Ric Wheeler, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Mon, Aug 31, 2009 at 09:19:58AM -0400, Mark Lord wrote:
>> In my opinion even that is too weak.  We know how to control the cache
>> settings on all common disks (that is scsi and ata), so we should always
>> disable the write cache unless we know that the whole stack (filesystem,
>> raid, volume managers) supports barriers.  And even then we should make
>> sure the filesystems actually use barriers everywhere that's needed,
>> which is something we have failed at for years.
> ..
>
> That stack does not know that my MD device has full battery backup,
> so it bloody well better NOT prevent me from enabling the write caches.

No one is going to prevent you from doing it.  The question is one of
sane defaults.  And "always safe, but slower if you have advanced
equipment" is a much better default than "unsafe by default" on most of
the install base.


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 13:16                                                                       ` Christoph Hellwig
  2009-08-31 13:19                                                                         ` Mark Lord
@ 2009-08-31 13:22                                                                         ` Ric Wheeler
  2009-08-31 15:50                                                                           ` david
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-31 13:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael Tokarev, david, Pavel Machek, Theodore Tso, NeilBrown,
	Rob Landley, Florian Weimer, Goswin von Brederlow, kernel list,
	Andrew Morton, mtk.manpages, rdunlap, linux-doc, linux-ext4,
	corbet

On 08/31/2009 09:16 AM, Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote:
>>> While most common filesystems do have barrier support it is:
>>>
>>>    - not actually enabled for the two most common filesystems
>>>    - the support for write barriers and cache flushing tends to be buggy
>>>      all over our software stack,
>>>
>>
>> Or just missing - I think that MD5/6 simply drop the requests at present.
>>
>> I wonder if it would be worth having MD probe for write cache enabled&
>> warn if barriers are not supported?
>
> In my opinion even that is too weak.  We know how to control the cache
> settings on all common disks (that is scsi and ata), so we should always
> disable the write cache unless we know that the whole stack (filesystem,
> raid, volume managers) supports barriers.  And even then we should make
> sure the filesystems actually use barriers everywhere that's needed,
> which is something we have failed at for years.
>

I was thinking about that as well. Having us disable the write cache when we 
know barriers are not supported (like in the MD5 case) would certainly be *much* 
safer for almost everyone.

We would need to have a way to override the write cache disabling for people who 
know that they have a non-volatile write cache (unlikely as it would be to put 
MD5 on top of a hardware RAID/external array, but some of the new SSDs do claim 
to have a non-volatile write cache).

It would also be very useful to have all of our top tier file systems enable 
barriers by default, provide consistent barrier on/off mount options and log a 
nice warning when not enabled....
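
Today the spellings differ per filesystem (ext3 wants barrier=1, while XFS
spells the opposite nobarrier), so even scripting it is fiddly.  A minimal
sketch of turning barriers on explicitly for ext3 via mount(2) (device and
mount point invented):

  #include <stdio.h>
  #include <sys/mount.h>

  int main(void)
  {
      /* ext3 leaves barriers off unless barrier=1 is passed explicitly */
      if (mount("/dev/md0", "/mnt/data", "ext3", 0, "barrier=1") < 0) {
          perror("mount");
          return 1;
      }
      return 0;
  }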

ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 13:21                                                                           ` Christoph Hellwig
@ 2009-08-31 15:14                                                                             ` jim owens
  2009-09-03  1:59                                                                             ` Ric Wheeler
  1 sibling, 0 replies; 309+ messages in thread
From: jim owens @ 2009-08-31 15:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mark Lord, Ric Wheeler, Michael Tokarev, david, Pavel Machek,
	Theodore Tso, NeilBrown, Rob Landley, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 09:19:58AM -0400, Mark Lord wrote:
>>> In my opinion even that is too weak.  We know how to control the cache
>>> settings on all common disks (that is scsi and ata), so we should always
>>> disable the write cache unless we know that the whole stack (filesystem,
>>> raid, volume managers) supports barriers.  And even then we should make
>>> sure the filesystems actually use barriers everywhere that's needed,
>>> which is something we have failed at for years.
>> ..
>>
>> That stack does not know that my MD device has full battery backup,
>> so it bloody well better NOT prevent me from enabling the write caches.
> 
> No one is going to prevent you from doing it.  The question is one of
> sane defaults.  And "always safe, but slower if you have advanced
> equipment" is a much better default than "unsafe by default" on most of
> the install base.

I've always agreed with "be safe first" and have worked in shops where
we always shut the write cache off unless we knew it had battery backing.

But before we make disabling cache the default, this is the impact:

- users will see it as a performance regression

- trashy OS vendors who never disable cache will benchmark
   better than "out of the box" linux.

Because as we all know, users don't read release notes.

Been there, done that, felt the pain.

jim

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 13:22                                                                         ` Ric Wheeler
@ 2009-08-31 15:50                                                                           ` david
  2009-08-31 16:21                                                                             ` Ric Wheeler
  2009-08-31 18:31                                                                             ` Christoph Hellwig
  0 siblings, 2 replies; 309+ messages in thread
From: david @ 2009-08-31 15:50 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, Michael Tokarev, Pavel Machek, Theodore Tso,
	NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Mon, 31 Aug 2009, Ric Wheeler wrote:

> On 08/31/2009 09:16 AM, Christoph Hellwig wrote:
>> On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote:
>>>> While most common filesystems do have barrier support it is:
>>>>
>>>>    - not actually enabled for the two most common filesystems
>>>>    - the support for write barriers and cache flushing tends to be buggy
>>>>      all over our software stack,
>>>> 
>>> 
>>> Or just missing - I think that MD5/6 simply drop the requests at present.
>>> 
>>> I wonder if it would be worth having MD probe for write cache enabled&
>>> warn if barriers are not supported?
>> 
>> In my opinion even that is too weak.  We know how to control the cache
>> settings on all common disks (that is scsi and ata), so we should always
>> disable the write cache unless we know that the whole stack (filesystem,
>> raid, volume managers) supports barriers.  And even then we should make
>> sure the filesystems actually use barriers everywhere that's needed,
>> which is something we have failed at for years.
>> 
>
> I was thinking about that as well. Having us disable the write cache when we 
> know it is not supported (like in the MD5 case) would certainly be *much* 
> safer for almost everyone.
>
> We would need to have a way to override the write cache disabling for people 
> who either know that they have a non-volatile write cache (unlikely as it 
> would probably be to put MD5 on top of a hardware RAID/external array, but 
> some of the new SSD's claim to have non-volatile write cache).

I've done this when the hardware raid only supported raid 5 but I wanted 
raid 6. I've also done it when I had enough disks to need more than one 
hardware raid card to talk to them all, but wanted one logical drive for 
the system.

> It would also be very useful to have all of our top tier file systems enable 
> barriers by default, provide consistent barrier on/off mount options and log 
> a nice warning when not enabled....

most people are not willing to live with unbuffered write performance. 
they care about their data, but they also care about performance, and 
since performance is what they see on an ongoing basis, they tend to care 
more about performance.

given that we don't even have barriers enabled by default on ext3 due to 
the performance hit, what makes you think that disabling buffers entirely 
is going to be acceptable to people?

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 15:50                                                                           ` david
@ 2009-08-31 16:21                                                                             ` Ric Wheeler
  2009-08-31 18:31                                                                             ` Christoph Hellwig
  1 sibling, 0 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-08-31 16:21 UTC (permalink / raw)
  To: david
  Cc: Christoph Hellwig, Michael Tokarev, Pavel Machek, Theodore Tso,
	NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/31/2009 11:50 AM, david@lang.hm wrote:
> On Mon, 31 Aug 2009, Ric Wheeler wrote:
>
>> On 08/31/2009 09:16 AM, Christoph Hellwig wrote:
>>> On Mon, Aug 31, 2009 at 09:15:27AM -0400, Ric Wheeler wrote:
>>>>> While most common filesystems do have barrier support it is:
>>>>>
>>>>> - not actually enabled for the two most common filesystems
>>>>> - the support for write barriers and cache flushing tends to be buggy
>>>>> all over our software stack,
>>>>>
>>>>
>>>> Or just missing - I think that MD5/6 simply drop the requests at
>>>> present.
>>>>
>>>> I wonder if it would be worth having MD probe for write cache enabled&
>>>> warn if barriers are not supported?
>>>
>>> In my opinion even that is too weak. We know how to control the cache
>>> settings on all common disks (that is scsi and ata), so we should always
>>> disable the write cache unless we know that the whole stack (filesystem,
>>> raid, volume managers) supports barriers. And even then we should make
>>> sure the filesystems actually use barriers everywhere that's needed,
>>> which is something we have failed at for years.
>>>
>>
>> I was thinking about that as well. Having us disable the write cache
>> when we know it is not supported (like in the MD5 case) would
>> certainly be *much* safer for almost everyone.
>>
>> We would need to have a way to override the write cache disabling for
>> people who either know that they have a non-volatile write cache
>> (unlikely as it would probably be to put MD5 on top of a hardware
>> RAID/external array, but some of the new SSD's claim to have
>> non-volatile write cache).
>
> I've done this when the hardware raid only supported raid 5 but I wanted
> raid 6. I've also done it when I had enough disks to need more than one
> hardware raid card to talk to them all, but wanted one logical drive for
> the system.
>
>> It would also be very useful to have all of our top tier file systems
>> enable barriers by default, provide consistent barrier on/off mount
>> options and log a nice warning when not enabled....
>
> most people are not willing to live with unbuffered write performance.
> they care about their data, but they also care about performance, and
> since performance is what they see on an ongoing basis, they tend to care
> more about performance.
>
> given that we don't even have barriers enabled by default on ext3 due to
> the performance hit, what makes you think that disabling buffers
> entirely is going to be acceptable to people?
>
> David Lang

We do (and have for a number of years) enable barriers by default for XFS and 
reiserfs. In SLES, ext3 has default barriers as well.

Ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:  document conditions when reliable operation is possible)
  2009-08-30 15:20                                                             ` Theodore Tso
@ 2009-08-31 17:49                                                               ` Jesse Brandeburg
  2009-08-31 18:01                                                                 ` Ric Wheeler
  2009-08-31 18:07                                                                 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) martin f krafft
  2009-08-31 17:49                                                               ` Jesse Brandeburg
                                                                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 309+ messages in thread
From: Jesse Brandeburg @ 2009-08-31 17:49 UTC (permalink / raw)
  To: Theodore Tso, Pavel Machek, NeilBrown, Ric Wheeler, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Sun, Aug 30, 2009 at 8:20 AM, Theodore Tso<tytso@mit.edu> wrote:
> So we *do* have the warning light; the problem is that just as some
> people may not realize that "check brakes" means, "YOU COULD DIE",
> some people may not realize that "hard drive failure; RAID array
> degraded" could mean, "YOU COULD LOSE DATA".
>
> Fortunately, for software RAID, this is easily solved; if you are so
> concerned, why don't you submit a patch to mdadm adjusting the e-mail
> sent to the system administrator when the array is in a degraded
> state, such that it states, "YOU COULD LOSE DATA".  I would gently
> suggest to you this would be ***far*** more effective than a patch to
> kernel documentation.

In the case of a degraded array, could the kernel be more proactive
(or maybe even mdadm) and have the filesystem remount itself withOUT
journalling enabled?  This seems on the surface to be possible, but I
don't know the internal particulars that might prevent/allow it.

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 17:49                                                               ` Jesse Brandeburg
@ 2009-08-31 18:01                                                                 ` Ric Wheeler
  2009-08-31 21:01                                                                   ` MD5/6? (was Re: raid is dangerous but that's secret ...) Ron Johnson
  2009-08-31 18:07                                                                 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) martin f krafft
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-08-31 18:01 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Theodore Tso, Pavel Machek, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 08/31/2009 01:49 PM, Jesse Brandeburg wrote:
> On Sun, Aug 30, 2009 at 8:20 AM, Theodore Tso<tytso@mit.edu>  wrote:
>> So we *do* have the warning light; the problem is that just as some
>> people may not realize that "check brakes" means, "YOU COULD DIE",
>> some people may not realize that "hard drive failure; RAID array
>> degraded" could mean, "YOU COULD LOSE DATA".
>>
>> Fortunately, for software RAID, this is easily solved; if you are so
>> concerned, why don't you submit a patch to mdadm adjusting the e-mail
>> sent to the system administrator when the array is in a degraded
>> state, such that it states, "YOU COULD LOSE DATA".  I would gently
>> suggest to you this would be ***far*** more effective than a patch to
>> kernel documentation.
>
> In the case of a degraded array, could the kernel be more proactive
> (or maybe even mdadm) and have the filesystem remount itself withOUT
> journalling enabled?  This seems on the surface to be possible, but I
> don't know the internal particulars that might prevent/allow it.

This is a misconception - with or without journalling, you are open to a second 
failure during a RAID rebuild.

Also note that by default, ext3 does not mount with barriers turned on.

Even if you mount with barriers, MD5 does not handle barriers, so you stand to 
lose a lot of data if you have a power outage.

Ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 17:49                                                               ` Jesse Brandeburg
  2009-08-31 18:01                                                                 ` Ric Wheeler
@ 2009-08-31 18:07                                                                 ` martin f krafft
  2009-08-31 22:26                                                                     ` Jesse Brandeburg
  1 sibling, 1 reply; 309+ messages in thread
From: martin f krafft @ 2009-08-31 18:07 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Theodore Tso, Pavel Machek, NeilBrown, Ric Wheeler, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.08.31.1949 +0200]:
> In the case of a degraded array, could the kernel be more
> proactive (or maybe even mdadm) and have the filesystem remount
> itself withOUT journalling enabled?  This seems on the surface to
> be possible, but I don't know the internal particulars that might
> prevent/allow it.

Why would I want to disable the filesystem journal in that case?

-- 
 .''`.   martin f. krafft <madduck@d.o>      Related projects:
: :'  :  proud Debian developer               http://debiansystem.info
`. `'`   http://people.debian.org/~madduck    http://vcs-pkg.org
  `-  Debian - when you have better things to do than fixing systems
 
"i can stand brute force, but brute reason is quite unbearable. there
 is something unfair about its use. it is hitting below the
 intellect."
                                                        -- oscar wilde


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 15:50                                                                           ` david
  2009-08-31 16:21                                                                             ` Ric Wheeler
@ 2009-08-31 18:31                                                                             ` Christoph Hellwig
  2009-08-31 19:11                                                                               ` david
  1 sibling, 1 reply; 309+ messages in thread
From: Christoph Hellwig @ 2009-08-31 18:31 UTC (permalink / raw)
  To: david
  Cc: Ric Wheeler, Christoph Hellwig, Michael Tokarev, Pavel Machek,
	Theodore Tso, NeilBrown, Rob Landley, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Mon, Aug 31, 2009 at 08:50:53AM -0700, david@lang.hm wrote:
>> It would also be very useful to have all of our top tier file systems 
>> enable barriers by default, provide consistent barrier on/off mount 
>> options and log a nice warning when not enabled....
>
> most people are not willing to live with unbuffered write performance.  

I'm not sure what you mean by unbuffered write performance; the only
common use of that term is for userspace I/O using the read/write
system calls directly, in comparison to buffered I/O which uses
the stdio library.

But rest assured that the use of barriers and cache flushes in fsync does not
completely disable caching (or "buffering"); it just flushes
the disk write cache when we either commit a log buffer that needs to
be on disk, or perform an fsync where we really do want to have data
on disk instead of lying to the application about the status of the
I/O completion.  Which, btw, could be interpreted as a violation of the
Posix rules.
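
In user-space terms (a trivial sketch, file names invented): ordinary
writes stay cached as before, and only the explicit synchronization
point pays for a cache flush:

  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[4096] = { 0 };
      int scratch = open("scratch.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
      int state = open("state.db", O_WRONLY | O_CREAT, 0644);

      if (scratch < 0 || state < 0)
          return 1;

      write(scratch, buf, sizeof(buf));   /* cached; no disk flush issued */

      write(state, buf, sizeof(buf));
      fsync(state);                       /* only here is the write cache flushed */

      close(scratch);
      close(state);
      return 0;
  }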


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 18:31                                                                             ` Christoph Hellwig
@ 2009-08-31 19:11                                                                               ` david
  0 siblings, 0 replies; 309+ messages in thread
From: david @ 2009-08-31 19:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ric Wheeler, Michael Tokarev, Pavel Machek, Theodore Tso,
	NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Mon, 31 Aug 2009, Christoph Hellwig wrote:

> On Mon, Aug 31, 2009 at 08:50:53AM -0700, david@lang.hm wrote:
>>> It would also be very useful to have all of our top tier file systems
>>> enable barriers by default, provide consistent barrier on/off mount
>>> options and log a nice warning when not enabled....
>>
>> most people are not willing to live with unbuffered write performance.
>
> I'm not sure what you mean by unbuffered write performance; the only
> common use of that term is for userspace I/O using the read/write
> system calls directly, in comparison to buffered I/O which uses
> the stdio library.
>
> But rest assured that the use of barriers and cache flushes in fsync does not
> completely disable caching (or "buffering"); it just flushes
> the disk write cache when we either commit a log buffer that needs to
> be on disk, or perform an fsync where we really do want to have data
> on disk instead of lying to the application about the status of the
> I/O completion.  Which, btw, could be interpreted as a violation of the
> Posix rules.

as I understood it, the proposal that I responded to was to change the 
kernel to detect whether barriers are enabled for the entire stack and, 
if not, disable the write caches on the drives.

there are definitely times when that is the correct thing to do, but I 
am not sure that it is the correct thing to do by default.

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* MD5/6? (was Re: raid is dangerous but that's secret ...)
  2009-08-31 18:01                                                                 ` Ric Wheeler
@ 2009-08-31 21:01                                                                   ` Ron Johnson
  0 siblings, 0 replies; 309+ messages in thread
From: Ron Johnson @ 2009-08-31 21:01 UTC (permalink / raw)
  To: Linux-Ext4

On 2009-08-31 13:01, Ric Wheeler wrote:
[snip]
> 
> Even if you mount with barriers, MD5 does not handle barriers, so you 
> stand to lose a lot of data if you have a power outage.

Pardon me for asking such a seemingly obvious question, but what 
(besides "Message-Digest algorithm 5") is MD5?

(I've always seen "multiple drive" written in the lower case "md".)

-- 
Brawndo's got what plants crave.  It's got electrolytes!

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:  document conditions when reliable operation is possible)
  2009-08-31 18:07                                                                 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) martin f krafft
@ 2009-08-31 22:26                                                                     ` Jesse Brandeburg
  0 siblings, 0 replies; 309+ messages in thread
From: Jesse Brandeburg @ 2009-08-31 22:26 UTC (permalink / raw)
  To: Jesse Brandeburg, Theodore Tso, Pavel Machek, NeilBrown,
	Ric Wheeler, Rob Landley, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On Mon, Aug 31, 2009 at 11:07 AM, martin f krafft<madduck@debian.org> wrote:
> also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.08.31.1949 +0200]:
>> In the case of a degraded array, could the kernel be more
>> proactive (or maybe even mdadm) and have the filesystem remount
>> itself withOUT journalling enabled?  This seems on the surface to
>> be possible, but I don't know the internal particulars that might
>> prevent/allow it.
>
> Why would I want to disable the filesystem journal in that case?

I misspoke w.r.t. journalling; the idea I was trying to get across was
to remount with -o sync while running on a degraded array, but given
some of the other comments in this thread I'm not even sure that would
help.  The idea was to make writes as safe as possible (at the cost of
speed) when running on a degraded array, and to have the transition be
as hands-free as possible: just have the kernel (or mdadm) remount by
default.
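
The remount itself is trivial if mdadm or a monitoring script wanted to
do it automatically - a sketch (mount point hypothetical, and assuming
the policy question above is settled):

  #include <stdio.h>
  #include <sys/mount.h>

  int main(void)
  {
      /* equivalent of "mount -o remount,sync /mnt/array" */
      if (mount(NULL, "/mnt/array", NULL, MS_REMOUNT | MS_SYNCHRONOUS, NULL) < 0) {
          perror("remount,sync");
          return 1;
      }
      return 0;
  }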

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 22:26                                                                     ` Jesse Brandeburg
  (?)
@ 2009-08-31 23:19                                                                     ` Ron Johnson
  -1 siblings, 0 replies; 309+ messages in thread
From: Ron Johnson @ 2009-08-31 23:19 UTC (permalink / raw)
  To: Jesse Brandeburg; +Cc: Theodore Tso, Ric Wheeler, Linux-Ext4

On 2009-08-31 17:26, Jesse Brandeburg wrote:
> On Mon, Aug 31, 2009 at 11:07 AM, martin f krafft<madduck@debian.org> wrote:
>> also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.08.31.1949 +0200]:
>>> In the case of a degraded array, could the kernel be more
>>> proactive (or maybe even mdadm) and have the filesystem remount
>>> itself withOUT journalling enabled?  This seems on the surface to
>>> be possible, but I don't know the internal particulars that might
>>> prevent/allow it.
>> Why would I want to disable the filesystem journal in that case?
> 
> I misspoke w.r.t journalling, the idea I was trying to get across was
> to remount with -o sync while running on a degraded array, but given
> some of the other comments in this thread I'm not even sure that would
> help.  the idea was to make writes as safe as possible (at the cost of
> speed) when running on a degraded array, and to have the transition be
> as hands-free as possible, just have the kernel (or mdadm) by default
> remount.

Much better, I'd think, to "just" have it scream out DANGER!! WILL 
ROBINSON!! DANGER!! to syslog and to an email hook.

-- 
Brawndo's got what plants crave.  It's got electrolytes!

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 22:26                                                                     ` Jesse Brandeburg
  (?)
  (?)
@ 2009-09-01  5:45                                                                     ` martin f krafft
  -1 siblings, 0 replies; 309+ messages in thread
From: martin f krafft @ 2009-09-01  5:45 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Theodore Tso, Pavel Machek, NeilBrown, Ric Wheeler, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.09.01.0026 +0200]:
> I misspoke w.r.t journalling, the idea I was trying to get across
> was to remount with -o sync while running on a degraded array, but
> given some of the other comments in this thread I'm not even sure
> that would help.  the idea was to make writes as safe as possible
> (at the cost of speed) when running on a degraded array, and to
> have the transition be as hands-free as possible, just have the
> kernel (or mdadm) by default remount.

I don't see how that is any more necessary with a degraded array
than it is when you have a fully working array. Sync just ensures
that the data are written and not cached, but that has absolutely
nothing to do with the underlying storage. Or am I failing to see
the link?

-- 
 .''`.   martin f. krafft <madduck@d.o>      Related projects:
: :'  :  proud Debian developer               http://debiansystem.info
`. `'`   http://people.debian.org/~madduck    http://vcs-pkg.org
  `-  Debian - when you have better things to do than fixing systems
 
"how do you feel about women's rights?"
"i like either side of them."
                                                       -- groucho marx


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-28 11:16                                                           ` Ric Wheeler
@ 2009-09-01 13:58                                                             ` Pavel Machek
  0 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-09-01 13:58 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Rob Landley, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet


>> Interesting. So, what's technically wrong with the patch below?
>
> My suggestion was that you stop trying to document your assertion of an 
> issue and actually suggest fixes in code or implementation. I really 
> don't think that you have properly diagnosed your specific failure or 
> done sufficient. However, if you put a full analysis and suggested code 
> out to the MD devel lists, we can debate technical implementation as we 
> normally do.

I don't think I should be required to rewrite the linux md layer in order
to fix documentation. 

> The only note that I would put in ext3/4 etc documentation would be:
>
> "Reliable storage is important for any file system. Single disks (or 
> FLASH or SSD) do fail on a regular basis.

Uh, how clever, instead of documenting that our md raid code does not
always work as expected, you document that components fail. Newspeak
101?

You even failed to mention the little design problem with flash and
eraseblock size... and the fact that you don't need flash to fail to
get data loss.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-27 16:54                                                                     ` MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Jeff Garzik
  2009-08-27 18:09                                                                       ` Alasdair G Kergon
@ 2009-09-01 14:01                                                                       ` Pavel Machek
  2009-09-02 16:17                                                                         ` Michael Tokarev
  1 sibling, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-09-01 14:01 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Ric Wheeler, Theodore Tso, Rob Landley, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Thu 2009-08-27 12:54:05, Jeff Garzik wrote:
> On 08/27/2009 09:10 AM, Ric Wheeler wrote:
>> One thing that does need fixing for some MD configurations is to stress
>> again that we need to make sure that barrier operations are properly
>> supported or users will need to disable the write cache on devices with
>> volatile write caches.
>
> Agreed; chime in on Christoph's linux-vfs thread if people have input.
>
> I quickly glanced at MD and DM.  Currently, upstream, we see a lot of
>
>         if (unlikely(bio_barrier(bio))) {
>                 bio_endio(bio, -EOPNOTSUPP);
>                 return 0;
>         }
>
> in DM and MD make_request functions.
>
> Only md/raid1 supports barriers at present, it seems.  None of the other  
> MD drivers support barriers.

Not even md/raid0? Ouch :-(.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-01 14:01                                                                       ` Pavel Machek
@ 2009-09-02 16:17                                                                         ` Michael Tokarev
  0 siblings, 0 replies; 309+ messages in thread
From: Michael Tokarev @ 2009-09-02 16:17 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jeff Garzik, Ric Wheeler, Theodore Tso, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Pavel Machek wrote:
> On Thu 2009-08-27 12:54:05, Jeff Garzik wrote:
[]
>> Only md/raid1 supports barriers at present, it seems.  None of the other  
>> MD drivers support barriers.
> 
> Not even md/raid0? Ouch :-(.

Only for raid1 is there no requirement for inter-drive ordering.  Hence
only raid1 supports barriers (and gained that support very recently,
within the last 1 or 2 kernel releases).  For the rest, including raid0 and
linear, inter-drive ordering is necessary to implement barriers, or md would
need its own queue draining/flushing mechanisms.

/mjt

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-29 11:28                                                                       ` Ric Wheeler
@ 2009-09-02 20:12                                                                         ` Pavel Machek
  2009-09-02 20:42                                                                           ` Ric Wheeler
                                                                                             ` (2 more replies)
  0 siblings, 3 replies; 309+ messages in thread
From: Pavel Machek @ 2009-09-02 20:12 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet


>>> people aren't objecting to better documentation, they are objecting to
>>> misleading documentation.
>>>      
>> Actually Ric is. He's trying hard to make RAID5 look better than it
>> really is.
>
> I object to misleading and dangerous documentation that you have  
> proposed. I spend a lot of time working in data integrity, talking and  
> writing about it so I care deeply that we don't misinform people.

Yes, truth is dangerous. To vendors selling crap products. 

> In this thread, I put out a draft that is accurate several times and you  
> have failed to respond to it.

Accurate as in 'has 0 information content' :-(.

> The big picture that you don't agree with is:
>
> (1) RAID (specifically MD RAID) will dramatically improve data integrity  
> for real users. This is not a statement of opinion, this is a statement  
> of fact that has been shown to be true in large scale deployments with  
> commodity hardware.

It is also completely irrelevant.

> (2) RAID5 protects you against a single failure and your test case  
> purposely injects a double failure.

Most people would be surprised that a press of the reset button counts as
a 'failure' in this context.

> (4) Data loss occurs in non-journalling file systems and journalling  
> file systems when you suffer double failures or hot unplug storage,  
> especially inexpensive FLASH parts.

It does not happen on inexpensive DISK parts, so people do not expect
it, and it is worth pointing out.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-02 20:12                                                                         ` Pavel Machek
@ 2009-09-02 20:42                                                                           ` Ric Wheeler
  2009-09-02 23:00                                                                             ` Rob Landley
  2009-09-02 22:45                                                                           ` Rob Landley
  2009-09-02 22:49                                                                           ` [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case Rob Landley
  2 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-09-02 20:42 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Theodore Tso, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On 09/02/2009 04:12 PM, Pavel Machek wrote:
>
>>>> people aren't objecting to better documentation, they are objecting to
>>>> misleading documentation.
>>>>
>>> Actually Ric is. He's trying hard to make RAID5 look better than it
>>> really is.
>>
>> I object to misleading and dangerous documentation that you have
>> proposed. I spend a lot of time working in data integrity, talking and
>> writing about it so I care deeply that we don't misinform people.
>
> Yes, truth is dangerous. To vendors selling crap products.

Pavel, you have no information and an attitude of not wanting to listen to 
anyone who has real experience or facts. Not just me, but also Ted and others.

Totally pointless to reply to you further.

Ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-30  9:01                                                             ` Christian Kujau
@ 2009-09-02 20:55                                                               ` Pavel Machek
  0 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-09-02 20:55 UTC (permalink / raw)
  To: Christian Kujau
  Cc: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Sun 2009-08-30 02:01:10, Christian Kujau wrote:
> On Sun, 30 Aug 2009 at 09:51, Pavel Machek wrote:
> > > give system administrators.  It's better than the fear-mongering
> > > patches you had proposed earlier, but what would be better *still* is
> > > telling people why running with degraded RAID arrays is bad, and to
> > > give them further tips about how to use RAID arrays safely.
> > 
> > Maybe this belongs to Doc*/filesystems, and more detailed RAID
> > description should go to md description?
> 
> Why should this be placed in *kernel* documentation anyway? The "dangers 
> of RAID", the hints that "backups are a good idea" - isn't that something 
> for howtos for sysadmins? No end-user will ever look into

The fact that two kernel subsystems (MD RAID, journaling filesystems)
do not work well together is surprising and should be documented near
the source.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-02 20:12                                                                         ` Pavel Machek
  2009-09-02 20:42                                                                           ` Ric Wheeler
@ 2009-09-02 22:45                                                                           ` Rob Landley
  2009-09-02 22:49                                                                           ` [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case Rob Landley
  2 siblings, 0 replies; 309+ messages in thread
From: Rob Landley @ 2009-09-02 22:45 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Wednesday 02 September 2009 15:12:10 Pavel Machek wrote:
> > (2) RAID5 protects you against a single failure and your test case
> > purposely injects a double failure.
>
> Most people would be surprised that a press of the reset button counts as
> a 'failure' in this context.

Apparently because most people haven't read Documentation/md.txt:

  Boot time assembly of degraded/dirty arrays
  -------------------------------------------

  If a raid5 or raid6 array is both dirty and degraded, it could have
  undetectable data corruption.  This is because the fact that it is
  'dirty' means that the parity cannot be trusted, and the fact that it
  is degraded means that some datablocks are missing and cannot reliably
  be reconstructed (due to no parity).

And so on for several more paragraphs.  Perhaps the documentation needs to be 
extended to note that "journaling will not help here, because the lost data 
blocks render entire stripes unreconstructable"...
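
A toy illustration of why (hypothetical three-disk RAID-5 stripe, single
bytes standing in for chunks): once parity is stale and a member is
missing, reconstruction quietly produces wrong data for a block the
journal never touched:

  #include <stdio.h>

  int main(void)
  {
      /* one stripe on a 3-disk RAID-5: d0, d1 data; p parity */
      unsigned char d0 = 0x11, d1 = 0x22, p = d0 ^ d1;

      /* power fails mid-update: d0 was rewritten on disk, parity was not */
      d0 = 0x33;                          /* new data reached disk 0 */
                                          /* p still reflects the old d0 */

      /* disk 1 then fails; its block must be rebuilt from d0 and p */
      unsigned char rebuilt_d1 = d0 ^ p;

      printf("real d1 = 0x%02x, reconstructed d1 = 0x%02x\n", d1, rebuilt_d1);
      /* prints different values: data that was never being written is
       * gone, and no journal entry ever referred to it */
      return 0;
  }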

Hmmm, I'll take a stab at it.  (I'm not addressing the raid 0 issues brought 
up elsewhere in this thread because I don't comfortably understand the current 
state of play...)

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case.
  2009-09-02 20:12                                                                         ` Pavel Machek
  2009-09-02 20:42                                                                           ` Ric Wheeler
  2009-09-02 22:45                                                                           ` Rob Landley
@ 2009-09-02 22:49                                                                           ` Rob Landley
  2009-09-03  9:08                                                                             ` Pavel Machek
  2009-09-03 12:05                                                                             ` Ric Wheeler
  2 siblings, 2 replies; 309+ messages in thread
From: Rob Landley @ 2009-09-02 22:49 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

From: Rob Landley <rob@landley.net>

Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
explaining that using a journaling filesystem can't overcome this problem.

Signed-off-by: Rob Landley <rob@landley.net>
---

 Documentation/md.txt |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/md.txt b/Documentation/md.txt
index 4edd39e..52b8450 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use
 
    md-mod.start_dirty_degraded=1
 
+Note that journaling filesystems do not effectively protect data in this
+case, because the update granularity of the RAID is larger than the journal
+was designed to expect.  Reconstructing data via parity information involves
+matching together corresponding stripes, and updating only some of these
+stripes renders the corresponding data in all the unmatched stripes
+meaningless.  Thus seemingly unrelated data in other parts of the filesystem
+(stored in the unmatched stripes) can become unreadable after a partial
+update, but the journal is only aware of the parts it modified, not the
+"collateral damage" elsewhere in the filesystem which was affected by those
+changes.
+
+Thus successful journal replay proves nothing in this context, and even a
+full fsck only shows whether or not the filesystem's metadata was affected.
+(A proper solution to this problem would involve adding journaling to the RAID
+itself, at least during degraded writes.  In the meantime, try not to allow
+a system to shut down uncleanly with its RAID both dirty and degraded; it
+can handle one but not both.)
 
 Superblock formats
 ------------------


-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply related	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-02 20:42                                                                           ` Ric Wheeler
@ 2009-09-02 23:00                                                                             ` Rob Landley
  2009-09-02 23:09                                                                               ` david
  2009-09-03  0:36                                                                               ` jim owens
  0 siblings, 2 replies; 309+ messages in thread
From: Rob Landley @ 2009-09-02 23:00 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Pavel Machek, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Wednesday 02 September 2009 15:42:19 Ric Wheeler wrote:
> On 09/02/2009 04:12 PM, Pavel Machek wrote:
> >>>> people aren't objecting to better documentation, they are objecting to
> >>>> misleading documentation.
> >>>
> >>> Actually Ric is. He's trying hard to make RAID5 look better than it
> >>> really is.
> >>
> >> I object to misleading and dangerous documentation that you have
> >> proposed. I spend a lot of time working in data integrity, talking and
> >> writing about it so I care deeply that we don't misinform people.
> >
> > Yes, truth is dangerous. To vendors selling crap products.
>
> Pavel, you have no information and an attitude of not wanting to listen to
> anyone who has real experience or facts. Not just me, but also Ted and
> others.
>
> Totally pointless to reply to you further.

For the record, I've been able to follow Pavel's arguments, and I've been able 
to follow Ted's arguments.  But as far as I can tell, you're arguing about a 
different topic than the rest of us.

There's a difference between:

A) This filesystem was corrupted because the underlying hardware is permanently 
damaged, no longer functioning as it did when it was new, and never will 
again.

B) We had a transient glitch that ate the filesystem.  The underlying hardware 
is as good as new, but our data is gone.

You can argue about whether or not "new" was ever any good, but Linux has run 
on PC-class hardware from day 1.  Sure PC-class hardware remains crap in many 
different ways, but this is not a _new_ problem.  Refusing to work around what 
people actually _have_ and insisting we get a better class of user instead 
_is_ a new problem, kind of a disturbing one.

USB keys are the modern successor to floppy drives, and even now 
Documentation/blockdev/floppy.txt is still full of some of the torturous 
workarounds implemented for that over the past 2 decades.  The hardware 
existed, and instead of turning up their nose at it they made it work as best 
they could.

Perhaps what's needed for the flash thing is a userspace package, the way 
mdutils made floppies a lot more usable than the kernel managed at the time.  
For the flash problem perhaps some FUSE thing a bit like mtdblock might be 
nice, a translation layer remapping an arbitrary underlying block device into 
larger granularity chunks and being sure to do the "write the new one before 
you erase the old one" trick that so many hardware-only flash devices _don't_, 
and then maybe even use Pavel's crash tool to figure out the write granularity 
of various sticks and ship it with a whitelist people can email updates to so 
we don't have to guess large.  (Pressure on the USB vendors to give us a "raw 
view" extension bypassing the "pretend to be a hard drive, with remapping" 
hardware in future devices would be nice too, but won't help any of the 
hardware out in the field.  I'm not sure that block remapping wouldn't screw up 
_this_ approach either, but it's an example of something that could be 
_tried_.)
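
A rough sketch of that remapping idea in Python (a toy copy-on-write mapper
with an invented interface; a real FTL or FUSE driver would also have to make
the mapping update itself atomic and persistent, which this glosses over):

    # Toy copy-on-write remapper: logical chunks map to physical slots, and an
    # update always lands in a free slot before the mapping is switched over.
    class CowRemapper:
        def __init__(self, slots, chunk_size):
            self.store = [None] * slots          # physical slots
            self.map = {}                        # logical chunk number -> slot
            self.chunk_size = chunk_size

        def write(self, logical, data):
            assert len(data) == self.chunk_size
            new_slot = self.store.index(None)    # pick a free slot
            self.store[new_slot] = data          # 1. write the new copy first
            old_slot = self.map.get(logical)
            self.map[logical] = new_slot         # 2. then flip the mapping
            if old_slot is not None:
                self.store[old_slot] = None      # 3. only now reclaim the old copy

        def read(self, logical):
            return self.store[self.map[logical]]

    r = CowRemapper(slots=8, chunk_size=4)
    r.write(0, b"old!")
    r.write(0, b"new!")    # a crash between steps 1 and 2 still leaves b"old!" readable
    print(r.read(0))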

However, thinking about how to _fix_ a problem is predicated on acknowledging 
that there actually _is_ a problem.  "The hardware is not physically damaged 
but your data was lost" sounds to me like a software problem, and thus 
something software could at least _attempt_ to address.  "There's millions of 
'em, Linux can't cope" doesn't seem like a useful approach.

I already addressed the software raid thing last post.

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-02 23:00                                                                             ` Rob Landley
@ 2009-09-02 23:09                                                                               ` david
  2009-09-03  8:55                                                                                 ` Pavel Machek
  2009-09-03  0:36                                                                               ` jim owens
  1 sibling, 1 reply; 309+ messages in thread
From: david @ 2009-09-02 23:09 UTC (permalink / raw)
  To: Rob Landley
  Cc: Ric Wheeler, Pavel Machek, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Wed, 2 Sep 2009, Rob Landley wrote:

> USB keys are the modern successor to floppy drives, and even now
> Documentation/blockdev/floppy.txt is still full of some of the torturous
> workarounds implemented for that over the past 2 decades.  The hardware
> existed, and instead of turning up their nose at it they made it work as best
> they could.
>
> Perhaps what's needed for the flash thing is a userspace package, the way
> mdutils made floppies a lot more usable than the kernel managed at the time.
> For the flash problem perhaps some FUSE thing a bit like mtdblock might be
> nice, a translation layer remapping an arbitrary underlying block device into
> larger granularity chunks and being sure to do the "write the new one before
> you erase the old one" trick that so many hardware-only flash devices _don't_,
> and then maybe even use Pavel's crash tool to figure out the write granularity
> of various sticks and ship it with a whitelist people can email updates to so
> we don't have to guess large.  (Pressure on the USB vendors to give us a "raw
> view" extension bypassing the "pretend to be a hard drive, with remapping"
> hardware in future devices would be nice too, but won't help any of the
> hardware out in the field.  I'm not sure that block remapping wouldn't screw up
> _this_ approach either, but it's an example of something that could be
> _tried_.)
>
> However, thinking about how to _fix_ a problem is predicated on acknowledging
> that there actually _is_ a problem.  "The hardware is not physically damaged
> but your data was lost" sounds to me like a software problem, and thus
> something software could at least _attempt_ to address.  "There's millions of
> 'em, Linux can't cope" doesn't seem like a useful approach.

no other OS avoids this problem either.

I actually don't see how you can do this from userspace, because when you 
write to the device you have _no_ idea where on the device your data will 
actually land.

writing in larger chunks may or may not help (if you do a 128K write 
and the device is emulating 512b blocks on top of 128K eraseblocks, then 
depending on the current state of the flash translation layer you could 
end up writing to many different eraseblocks, up to the theoretical max of 
256).
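
The worst case is just arithmetic (assuming 512-byte emulated blocks and
128K eraseblocks, as above):

    write_size     = 128 * 1024    # one 128K write
    emulated_block = 512           # block size the device pretends to have
    # If the translation layer has scattered every 512b block into a different
    # eraseblock, each block of the write can hit its own eraseblock.
    print(write_size // emulated_block)   # 256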

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-02 23:00                                                                             ` Rob Landley
  2009-09-02 23:09                                                                               ` david
@ 2009-09-03  0:36                                                                               ` jim owens
  2009-09-03  2:41                                                                                 ` Rob Landley
  1 sibling, 1 reply; 309+ messages in thread
From: jim owens @ 2009-09-03  0:36 UTC (permalink / raw)
  To: Rob Landley
  Cc: Ric Wheeler, Pavel Machek, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

Rob Landley wrote:
> On Wednesday 02 September 2009 15:42:19 Ric Wheeler wrote:
>>
>> Totally pointless to reply to you further.
> 
> For the record, I've been able to follow Pavel's arguments, and I've been able 
> to follow Ted's arguments.  But as far as I can tell, you're arguing about a 
> different topic than the rest of us.

I had no trouble following what Ric was arguing about.

Ric never said "use only the best devices and you won't have problems".

Ric was arguing the exact opposite - ALL devices are crap if you define
crap as "can loose data".  What he is saying is you need to UNDERSTAND
your devices and their behavior and you must act accordingly.

PAVEL DID NOT ACT ACCORDING TO HIS DEVICE LIMITATIONS.

We understand he was clueless, but user error is still user error!

And Ric said do not stigmatize whole classes of A) devices, B) raid,
and C) filesystems with "Pavel says...".

> However, thinking about how to _fix_ a problem is predicated on acknowledging 
> that there actually _is_ a problem.  "The hardware is not physically damaged 
> but your data was lost" sounds to me like a software problem, and thus 
> something software could at least _attempt_ to address.  "There's millions of 
> 'em, Linux can't cope" doesn't seem like a useful approach.

We have been trying forever to deal with device problems and as
Ric kept trying to explain we do understand them.  The problem is
not "can we be better" it is "at what cost".  As they keep saying
"fast", "cheap", "safe"... pick any 2.  Adding software solutions
to solve it will always turn "fast" to "slow".

Most people will choose some risk they can manage (such as
don't pull the flash card you idiot), instead of snail slow.

> I already addressed the software raid thing last post.

Saw it. I am not an MD guy so I will not say anything bad about it
except all the "journal" crud.  It really is only pandering to Pavel,
because ALL filesystems can be screwed and that is what they really
need to know.  The journal stuff distracts those who are not running
a journaling filesystem.  And even if your description is correct,
as we fs people keep saying, fsck is meaningless here and again will
only give you a false sense of security that your data is OK.

jim

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-31 13:21                                                                           ` Christoph Hellwig
  2009-08-31 15:14                                                                             ` jim owens
@ 2009-09-03  1:59                                                                             ` Ric Wheeler
  2009-09-03 11:12                                                                               ` Krzysztof Halasa
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-09-03  1:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mark Lord, Michael Tokarev, david, Pavel Machek, Theodore Tso,
	NeilBrown, Rob Landley, Florian Weimer, Goswin von Brederlow,
	kernel list, Andrew Morton, mtk.manpages, rdunlap, linux-doc,
	linux-ext4, corbet

On 08/31/2009 09:21 AM, Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 09:19:58AM -0400, Mark Lord wrote:
>>> In my opinion even that is too weak.  We know how to control the cache
>>> settings on all common disks (that is scsi and ata), so we should always
>>> disable the write cache unless we know that the whole stack (filesystem,
>>> raid, volume managers) supports barriers.  And even then we should make
>>> sure the filesystems does actually use barriers everywhere that's needed
>>> which failed at for years.
>> ..
>>
>> That stack does not know that my MD device has full battery backup,
>> so it bloody well better NOT prevent me from enabling the write caches.
>
> No one is going to prevent you from doing it.  That question is one of
> sane defaults.  And always safe, but slower if you have advanced
> equipment is a much better default than unsafe by default on most of
> the install base.
>

Just to add some support to this, all of the external RAID arrays that I know of 
normally run with write cache disabled on the component drives. In addition, 
many of them will disable their internal write cache if/when they detect that 
they have lost their UPS.

I think that if we had done this kind of sane default earlier for MD levels that 
do not handle barriers, we would not have left some people worried about our 
software RAID.

To be clear, if a sophisticated user wants to override this default, that should 
be supported. It is not (in my opinion) a safe default behaviour.

Ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-03  0:36                                                                               ` jim owens
@ 2009-09-03  2:41                                                                                 ` Rob Landley
  2009-09-03 14:14                                                                                   ` jim owens
  0 siblings, 1 reply; 309+ messages in thread
From: Rob Landley @ 2009-09-03  2:41 UTC (permalink / raw)
  To: jim owens
  Cc: Ric Wheeler, Pavel Machek, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Wednesday 02 September 2009 19:36:10 jim owens wrote:
> Rob Landley wrote:
> > On Wednesday 02 September 2009 15:42:19 Ric Wheeler wrote:
> >> Totally pointless to reply to you further.
> >
> > For the record, I've been able to follow Pavel's arguments, and I've been
> > able to follow Ted's arguments.  But as far as I can tell, you're arguing
> > about a different topic than the rest of us.
>
> I had no trouble following what Ric was arguing about.
>
> Ric never said "use only the best devices and you won't have problems".
>
> Ric was arguing the exact opposite - ALL devices are crap if you define
> crap as "can loose data". 

And if you include meteor strike and flooding in your operating criteria you 
can come up with quite a straw man argument.  It still doesn't mean "X is 
highly likely to cause data loss" can never come as news to people.

> What he is saying is you need to UNDERSTAND
> your devices and their behavior and you must act accordingly.
>
> PAVEL DID NOT ACT ACCORDING TO HIS DEVICE LIMITATIONS.

Where was this limitation documented?  (Before he documented it, I mean?)

> We understand he was clueless, but user error is still user error!

I think he understands he was clueless too, that's why he investigated the 
failure and wrote it up for posterity.

> And Ric said do not stigmatize whole classes of A) devices, B) raid,
> and C) filesystems with "Pavel says...".

I don't care what "Pavel says", so you can leave the ad hominem at the door, 
thanks.

The kernel presents abstractions, such as block device nodes.  Sometimes 
implementation details bubble through those abstractions.  Presumably, we 
agree on that so far.

I was once asked to write what became Documentation/rbtree.txt, which got 
merged.  I've also read maybe half of Documentation/RCU.  Neither technique is 
specific to Linux, but this doesn't seem to have been an objection at the time.

The technique, "journaling", is widely perceived as eliminating the need for 
fsck (and thus the potential for filesystem corruption) in the case of unclean 
shutdowns.  But there are easily reproducible cases where the technique, 
"journaling", does not do this.  Thus journaling, as a concept, has 
limitations which are _not_ widely understood by the majority of people who 
purchase and use USB flash keys.

The kernel doesn't currently have any documentation on journaling theory where 
mention of journaling's limitations could go.  It does have a section on its 
internal Journaling API in Documentation/DocBook/filesystems.tmpl which links 
to two papers (both about ext3, even though reiserfs was merged first and IBM's 
JFS was implemented before either) from 1998 and 2000 respectively.  The 2000 
paper brushes against disk granularity answering a question starting at 72m, 
21s, and brushes against software raid and write ordering starting at the 72m 
32s mark.  But it never directly addresses either issue...

Sigh, I'm well into tl;dr territory here, aren't I?

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-02 23:09                                                                               ` david
@ 2009-09-03  8:55                                                                                 ` Pavel Machek
  0 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-09-03  8:55 UTC (permalink / raw)
  To: david
  Cc: Rob Landley, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

Hi!

>> However, thinking about how to _fix_ a problem is predicated on acknowledging
>> that there actually _is_ a problem.  "The hardware is not physically damaged
>> but your data was lost" sounds to me like a software problem, and thus
>> something software could at least _attempt_ to address.  "There's millions of
>> 'em, Linux can't cope" doesn't seem like a useful approach.
>
> no other OS avoids this problem either.
>
> I actually don't see how you can do this from userspace, because when you 
> write to the device you have _no_ idea where on the device your data will 
> actually land.

It certainly is not easy. Self-correcting codes could probably be
used, but that would be very special, very slow, and very
non-standard. (Basically... we could design a filesystem so that it
would survive damage to an arbitrary 512K on disk -- using
self-correcting codes in a CD-like manner). I'm not sure if it would be
practical.
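
The crudest version of that idea is just keeping copies far enough apart
that one contiguous burst of damage cannot hit both (a Python sketch with
made-up sizes; real designs would interleave proper error-correcting codes
rather than mirror):

    DISK_SIZE  = 4 * 1024 * 1024      # toy 4MB "disk"
    BURST_SIZE = 512 * 1024           # largest contiguous damage we want to survive
    BLOCK_SIZE = 4096

    def placements(block_no):
        a = (block_no * BLOCK_SIZE) % DISK_SIZE
        b = (a + DISK_SIZE // 2) % DISK_SIZE    # second copy half a disk away
        return a, b

    # No single 512K burst can cover both copies, since they are 2MB apart.
    a, b = placements(7)
    covered = any(s <= a < s + BURST_SIZE and s <= b < s + BURST_SIZE
                  for s in range(0, DISK_SIZE - BURST_SIZE + 1, BLOCK_SIZE))
    print(covered)    # False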

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case.
  2009-09-02 22:49                                                                           ` [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case Rob Landley
@ 2009-09-03  9:08                                                                             ` Pavel Machek
  2009-09-03 12:05                                                                             ` Ric Wheeler
  1 sibling, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-09-03  9:08 UTC (permalink / raw)
  To: Rob Landley
  Cc: Ric Wheeler, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Wed 2009-09-02 17:49:46, Rob Landley wrote:
> From: Rob Landley <rob@landley.net>
> 
> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
> explaining that using a journaling filesystem can't overcome this problem.
> 
> Signed-off-by: Rob Landley <rob@landley.net>

I like it! Not sure if I know enough about MD to add ack, but...

Acked-by: Pavel Machek <pavel@ucw.cz>

								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26  1:00                                                         ` Theodore Tso
@ 2009-09-03  9:47                                                             ` Pavel Machek
  2009-08-26  1:15                                                           ` Ric Wheeler
                                                                               ` (5 subsequent siblings)
  6 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-09-03  9:47 UTC (permalink / raw)
  To: Theodore Tso, Ric Wheeler, Florian Weimer, Goswin von Brederlow,
	Rob Landley, kernel list, Andrew Morton, mtk.manpages, rdunlap,
	linux-doc, linux-ext4, corbet

On Tue 2009-08-25 21:00:18, Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 08:31:21PM -0400, Ric Wheeler wrote:
> >>> You are simply incorrect, Ted did not say that ext3 does not work
> >>> with MD raid5.
> >>
> >> http://lkml.org/lkml/2009/8/25/312
> >
> > I will let Ted clarify his text on his own, but the quoted text says "... 
> > have potential...".
> >
> > Why not ask Neil if he designed MD to not work properly with ext3?
> 
> So let me clarify by saying the following things.   
> 
> 1) Filesystems are designed to expect that storage devices have
> certain properties.  These include returning the same data that you
> wrote, and that an error when writing a sector, or a power failure
> when writing a sector, should not be amplified to cause collateral
> damage with previously successfully written sectors.

Yes. Unfortunately, different filesystems expect different properties
from block devices. ext3 will work with write cache enabled/barriers
enabled, while ext2 needs write cache disabled.

The requirements are also quite surprising; AFAICT ext3 can handle
disk writing garbage to single sector during powerfail, while xfs can
not handle that.

Now, how do you expect users to know these subtle details when they are
not documented anywhere? And why are you fighting against documenting
these subtleties?

> Secondly, what's the probability that a failure causes the RAID array to
> become degraded, followed by a power failure, versus a power failure
> while the RAID array is not running in degraded mode?  Hopefully you
> are running with the RAID array in full, proper running order a much
> larger percentage of the time than running with the RAID array in
> degraded mode.  If not, the bug is with the system administrator!

As was uncovered, MD RAID does not properly support barriers,
so... you don't actually need drive failure.

> Maybe a random OS engineer doesn't know these things --- but trust me
> when I say a competent system administrator had better be familiar
> with these concepts.  And someone who wants their data to be
> reliably

Trust me, 99% of sysadmins are not competent by your definition. So
this should be documented.

> At the end of the day, filesystems are not magic.  They can't
> compensate for crap hardware, or incompetently administered machines.

ext3 greatly contributes to administrator incompetency:

# The journal supports the transactions start and stop, and in case of a
# crash, the journal can replay the transactions to quickly put the
# partition back into a consistent state.

...it does not mention that (non-default!) barrier=1 is needed to make
this reliable, nor does it mention that there are certain requirements for
this to work. It just says that the journal will magically help you.

And you wonder why people expect magic from your filesystem?

								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-03  1:59                                                                             ` Ric Wheeler
@ 2009-09-03 11:12                                                                               ` Krzysztof Halasa
  2009-09-03 11:18                                                                                 ` Ric Wheeler
  0 siblings, 1 reply; 309+ messages in thread
From: Krzysztof Halasa @ 2009-09-03 11:12 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Ric Wheeler <rwheeler@redhat.com> writes:

> Just to add some support to this, all of the external RAID arrays that
> I know of normally run with write cache disabled on the component
> drives.

Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
-- 
Krzysztof Halasa

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-03 11:12                                                                               ` Krzysztof Halasa
@ 2009-09-03 11:18                                                                                 ` Ric Wheeler
  2009-09-03 13:34                                                                                   ` Krzysztof Halasa
  0 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-09-03 11:18 UTC (permalink / raw)
  To: Krzysztof Halasa
  Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 09/03/2009 07:12 AM, Krzysztof Halasa wrote:
> Ric Wheeler<rwheeler@redhat.com>  writes:
>
>> Just to add some support to this, all of the external RAID arrays that
>> I know of normally run with write cache disabled on the component
>> drives.
>
> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?

Which drives various vendors ship changes with specific products. Usually, they 
ship drives that have carefully vetted firmware, etc., but they are close to the 
same drives you buy on the open market.

Seagate has a huge slice of the market.

ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case.
  2009-09-02 22:49                                                                           ` [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case Rob Landley
  2009-09-03  9:08                                                                             ` Pavel Machek
@ 2009-09-03 12:05                                                                             ` Ric Wheeler
  2009-09-03 12:31                                                                               ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-09-03 12:05 UTC (permalink / raw)
  To: Rob Landley
  Cc: Pavel Machek, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On 09/02/2009 06:49 PM, Rob Landley wrote:
> From: Rob Landley<rob@landley.net>
>
> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
> explaining that using a journaling filesystem can't overcome this problem.
>
> Signed-off-by: Rob Landley<rob@landley.net>
> ---
>
>   Documentation/md.txt |   17 +++++++++++++++++
>   1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/md.txt b/Documentation/md.txt
> index 4edd39e..52b8450 100644
> --- a/Documentation/md.txt
> +++ b/Documentation/md.txt
> @@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use
>
>      md-mod.start_dirty_degraded=1
>
> +Note that Journaling filesystems do not effectively protect data in this
> +case, because the update granularity of the RAID is larger than the journal
> +was designed to expect.  Reconstructing data via parity information involves
> +matching together corresponding stripes, and updating only some of these
> +stripes renders the corresponding data in all the unmatched stripes
> +meaningless.  Thus seemingly unrelated data in other parts of the filesystem
> +(stored in the unmatched stripes) can become unreadable after a partial
> +update, but the journal is only aware of the parts it modified, not the
> +"collateral damage" elsewhere in the filesystem which was affected by those
> +changes.
> +
> +Thus successful journal replay proves nothing in this context, and even a
> +full fsck only shows whether or not the filesystem's metadata was affected.
> +(A proper solution to this problem would involve adding journaling to the RAID
> +itself, at least during degraded writes.  In the meantime, try not to allow
> +a system to shut down uncleanly with its RAID both dirty and degraded, it
> +can handle one but not both.)
>
>   Superblock formats
>   ------------------
>
>

NACK.

Now you have moved the inaccurate documentation about journalling file systems 
into the MD documentation.

Repeat after me:

(1) partial writes to a RAID stripe (with or without file systems, with or 
without journals) create an invalid stripe

(2) partial writes can be prevented in most cases by running with write cache 
disabled or working barriers

(3) fsck can (for journalling fs or non journalling fs) detect and fix your file 
system. It won't give you back the data in that stripe, but you will get the 
rest of your metadata and data back and usable.

You don't need MD in the picture to test this - take fsfuzzer or just dd and 
zero out a RAID stripe width of data from a file system. If you hit data blocks, 
your fsck (for ext2) or mount (for any journalling fs) will not see an error. If 
you hit metadata, fsck in both cases will, when run, try to fix it as best it can.
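
Something like the following is enough for the dd-style version of that test
(working on a scratch filesystem image rather than a live device; the file
name, offset and stripe geometry here are made up):

    # Zero out one RAID-stripe-width worth of bytes inside a filesystem image,
    # then run fsck against the image to see whether it even notices.
    STRIPE_WIDTH = 3 * 64 * 1024     # e.g. 3 data disks x 64K chunks (made up)
    OFFSET = 10 * 1024 * 1024        # arbitrary spot inside the image

    with open("scratch-ext3.img", "r+b") as img:   # hypothetical test image
        img.seek(OFFSET)
        img.write(b"\0" * STRIPE_WIDTH)

    # Afterwards, "fsck.ext3 -f scratch-ext3.img" shows whether the damage
    # landed in data (silently accepted) or in metadata (repaired/reported).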

Also note that partial writes (similar to torn writes) can happen for multiple 
reasons on non-RAID systems and leave the same kind of damage.

Side note, proposing a half sketched out "fix" for partial stripe writes in 
documentation is not productive. Much better to submit a fully thought out 
proposal or actual patches to demonstrate the issue.

Rob, you should really try to take a few disks, build a working MD RAID5 group 
and test your ideas. Try it with and without the write cache enabled.

Measure and report, say after 20 power losses, how file integrity and fsck 
repairs were impacted.

Try the same with ext2 and ext3.

Regards,

Ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case.
  2009-09-03 12:05                                                                             ` Ric Wheeler
@ 2009-09-03 12:31                                                                               ` Pavel Machek
  0 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-09-03 12:31 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Rob Landley, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Thu 2009-09-03 08:05:31, Ric Wheeler wrote:
> On 09/02/2009 06:49 PM, Rob Landley wrote:
>> From: Rob Landley<rob@landley.net>
>>
>> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
>> explaining that using a journaling filesystem can't overcome this problem.
>>
>> Signed-off-by: Rob Landley<rob@landley.net>
>> ---
>>
>>   Documentation/md.txt |   17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/Documentation/md.txt b/Documentation/md.txt
>> index 4edd39e..52b8450 100644
>> --- a/Documentation/md.txt
>> +++ b/Documentation/md.txt
>> @@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use
>>
>>      md-mod.start_dirty_degraded=1
>>
>> +Note that Journaling filesystems do not effectively protect data in this
>> +case, because the update granularity of the RAID is larger than the journal
>> +was designed to expect.  Reconstructing data via parity information involves
>> +matching together corresponding stripes, and updating only some of these
>> +stripes renders the corresponding data in all the unmatched stripes
>> +meaningless.  Thus seemingly unrelated data in other parts of the filesystem
>> +(stored in the unmatched stripes) can become unreadable after a partial
>> +update, but the journal is only aware of the parts it modified, not the
>> +"collateral damage" elsewhere in the filesystem which was affected by those
>> +changes.
>> +
>> +Thus successful journal replay proves nothing in this context, and even a
>> +full fsck only shows whether or not the filesystem's metadata was affected.
>> +(A proper solution to this problem would involve adding journaling to the RAID
>> +itself, at least during degraded writes.  In the meantime, try not to allow
>> +a system to shut down uncleanly with its RAID both dirty and degraded, it
>> +can handle one but not both.)
>>
>>   Superblock formats
>>   ------------------
>>
>>
>
> NACK.
>
> Now you have moved the inaccurate documentation about journalling file 
> systems into the MD documentation.

What is inaccurate about it?

> Repeat after me:

> (1) partial writes to a RAID stripe (with or without file systems, with 
> or without journals) create an invalid stripe

That's what he's documenting.

> (2) partial writes can be prevented in most cases by running with write 
> cache disabled or working barriers

Given how much storage experience you claim to have, you should know by now
that MD RAID5 does not support barriers...


> Rob, you should really try to take a few disks, build a working MD RAID5 
> group and test your ideas. Try it with and without the write cache 
> enabled.

....and understand by now that statistics are irrelevant for design
problems.

Ouch and trying to silence people by telling them to fix the problem
instead of documenting it is not nice either.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-03 11:18                                                                                 ` Ric Wheeler
@ 2009-09-03 13:34                                                                                   ` Krzysztof Halasa
  2009-09-03 13:50                                                                                     ` Ric Wheeler
  2009-09-03 14:35                                                                                     ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) david
  0 siblings, 2 replies; 309+ messages in thread
From: Krzysztof Halasa @ 2009-09-03 13:34 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Ric Wheeler <rwheeler@redhat.com> writes:

>>> Just to add some support to this, all of the external RAID arrays that
>>> I know of normally run with write cache disabled on the component
>>> drives.
>>
>> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
>
> Which drives various vendors ships changes with specific products.
> Usually, they ship drives that have carefully vetted firmware, etc.
> but they are close to the same drives you buy on the open market.

But they aren't the same, are they? If they are not, the fact they can
run well with the write-through cache doesn't mean the off-the-shelf
ones can do as well.

Are they SATA (or PATA) at all? SCSI etc. are usually different
animals, though there are SCSI and SATA models which differ only in
electronics.

Do you have battery-backed write-back RAID cache (which acknowledges
flushes before the data is written out to disks)? PC can't do that.
-- 
Krzysztof Halasa

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-03 13:34                                                                                   ` Krzysztof Halasa
@ 2009-09-03 13:50                                                                                     ` Ric Wheeler
  2009-09-03 13:59                                                                                       ` Krzysztof Halasa
  2009-09-03 14:35                                                                                     ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) david
  1 sibling, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-09-03 13:50 UTC (permalink / raw)
  To: Krzysztof Halasa
  Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 09/03/2009 09:34 AM, Krzysztof Halasa wrote:
> Ric Wheeler<rwheeler@redhat.com>  writes:
>
>>>> Just to add some support to this, all of the external RAID arrays that
>>>> I know of normally run with write cache disabled on the component
>>>> drives.
>>>
>>> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
>>
>> Which drives various vendors ships changes with specific products.
>> Usually, they ship drives that have carefully vetted firmware, etc.
>> but they are close to the same drives you buy on the open market.
>
> But they aren't the same, are they? If they are not, the fact they can
> run well with the write-through cache doesn't mean the off-the-shelf
> ones can do as well.

Storage vendors have a wide range of options, but what you get today is a 
collection of S-ATA (not much any more), SAS or FC.

Some times they will have different firmware, other times it is the same.


>
> Are they SATA (or PATA) at all? SCSI etc. are usually different
> animals, though there are SCSI and SATA models which differ only in
> electronics.
>
> Do you have battery-backed write-back RAID cache (which acknowledges
> flushes before the data is written out to disks)? PC can't do that.

We (red hat) have all kinds of different raid boxes...

ric



^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-03 13:50                                                                                     ` Ric Wheeler
@ 2009-09-03 13:59                                                                                       ` Krzysztof Halasa
  2009-09-03 14:15                                                                                         ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler
  0 siblings, 1 reply; 309+ messages in thread
From: Krzysztof Halasa @ 2009-09-03 13:59 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Ric Wheeler <rwheeler@redhat.com> writes:

> We (red hat) have all kinds of different raid boxes...

I have no doubt about it, but are those you know equipped with
battery-backed write-back cache? Are they using SATA disks?

We can _at_best_ compare non-battery-backed RAID using SATA disks with
what we typically have in a PC.
-- 
Krzysztof Halasa

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-03  2:41                                                                                 ` Rob Landley
@ 2009-09-03 14:14                                                                                   ` jim owens
  2009-09-04  7:44                                                                                     ` Rob Landley
  0 siblings, 1 reply; 309+ messages in thread
From: jim owens @ 2009-09-03 14:14 UTC (permalink / raw)
  To: Rob Landley
  Cc: Ric Wheeler, Pavel Machek, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

Rob Landley wrote:
> I think he understands he was clueless too, that's why he investigated the 
> failure and wrote it up for posterity.
> 
>> And Ric said do not stigmatize whole classes of A) devices, B) raid,
>> and C) filesystems with "Pavel says...".
> 
> I don't care what "Pavel says", so you can leave the ad hominem at the door, 
> thanks.

See, this is exactly the problem we have with all the proposed
documentation.  The reader (you) did not get what the writer (me)
was trying to say.  That does not say either of us was wrong in
what we thought was meant, simply that we did not communicate.

What I meant was we did not want to accept Pavel's incorrect
documentation and post it in kernel docs.

> The kernel presents abstractions, such as block device nodes.  Sometimes 
> implementation details bubble through those abstractions.  Presumably, we 
> agree on that so far.

We don't have any problem with documenting abstractions.  But they
must be written as abstracts and accurate, not as IMO blogs.

It is not "he means well, so we will just accept it".  The rule
for kernel docs should be the same as for code.  If it is not
correct in all cases or causes problems, we don't accept it.

jim

^ permalink raw reply	[flat|nested] 309+ messages in thread

* wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-03 13:59                                                                                       ` Krzysztof Halasa
@ 2009-09-03 14:15                                                                                         ` Ric Wheeler
  2009-09-03 14:26                                                                                           ` Florian Weimer
                                                                                                             ` (3 more replies)
  0 siblings, 4 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-09-03 14:15 UTC (permalink / raw)
  To: Krzysztof Halasa
  Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 09/03/2009 09:59 AM, Krzysztof Halasa wrote:
> Ric Wheeler<rwheeler@redhat.com>  writes:
>
>> We (red hat) have all kinds of different raid boxes...
>
> I have no doubt about it, but are those you know equipped with
> battery-backed write-back cache? Are they using SATA disks?
>
> We can _at_best_ compare non-battery-backed RAID using SATA disks with
> what we typically have in a PC.

The whole thread above is about software MD using commodity drives (S-ATA or 
SAS) without battery backed write cache.

We have that (and I have it personally) and do test it.

You must disable the write cache on these commodity drives *if* the MD RAID 
level does not support barriers properly.

This will greatly reduce errors after a power loss (both in degraded state and 
non-degraded state), but it will not eliminate data loss entirely. You simply 
cannot do that with any storage device!

Note that even without MD raid, the file system issues IO's in file system block 
size (4096 bytes normally) and most commodity storage devices use a 512-byte 
sector size, which means that we have to update 8 512b sectors.

Drives can (and do) have multiple platters and surfaces and it is perfectly 
normal to have contiguous logical ranges of sectors map to non-contiguous 
sectors physically. Imagine a 4KB write stripe that straddles two adjacent 
tracks on one platter (requiring a seek) or mapped across two surfaces 
(requiring a head switch). Also, a remapped sector can require more or less a 
full surface seek from wherever you are to the remapped sector area of the drive.

These are all examples of how, after a power loss, even a local (non-MD)
device can do a partial update of that 4KB write range of sectors. Note that
unlike RAID/MD, local storage has no parity on the server to detect this
partial write.

This is why new file systems like btrfs and zfs do checksumming of data and 
metadata. This won't prevent partial updates during a write, but can at least 
detect them and try to do some kind of recovery.
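
The checksumming point boils down to something like this (a per-block CRC
kept out of band; detection only, not prevention or repair, and nothing like
the real btrfs/zfs formats):

    import zlib

    BLOCK = 4096
    checksums = {}                   # block number -> expected CRC

    def write_block(dev, n, data):
        checksums[n] = zlib.crc32(data)
        dev[n * BLOCK:(n + 1) * BLOCK] = data

    def read_block(dev, n):
        data = bytes(dev[n * BLOCK:(n + 1) * BLOCK])
        if zlib.crc32(data) != checksums.get(n):
            raise IOError("block %d failed checksum (torn/partial write?)" % n)
        return data

    dev = bytearray(BLOCK * 8)
    write_block(dev, 0, b"x" * BLOCK)
    dev[100:200] = b"\0" * 100       # simulate a partially-completed rewrite
    try:
        read_block(dev, 0)
    except IOError as e:
        print(e)                     # detected, instead of silently returning bad data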

In other words, this is not just an MD issue, it is entirely possible even with 
non-MD devices.

Also, when you enable the write cache (MD or not) you are buffering multiple 
MB's of data that can go away on power loss. That is far greater (10x) exposure
than the partial RAID rewrite case worries about.

ric

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-03 14:15                                                                                         ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler
@ 2009-09-03 14:26                                                                                           ` Florian Weimer
  2009-09-03 15:09                                                                                             ` Ric Wheeler
  2009-09-03 23:50                                                                                           ` Krzysztof Halasa
                                                                                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 309+ messages in thread
From: Florian Weimer @ 2009-09-03 14:26 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev,
	david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

* Ric Wheeler:

> Note that even without MD raid, the file system issues IO's in file
> system block size (4096 bytes normally) and most commodity storage
> devices use a 512  byte sector size which means that we have to update
> 8 512b sectors.

Database software often attempts to deal with this phenomenon
(sometimes called "torn page writes").  For example, you can make sure
that the first time you write to a database page, you keep a full copy
in your transaction log.  If the machine crashes, the log is replayed,
first completely overwriting the partially-written page.  Only after
that, you can perform logical/incremental logging.

The log itself has to be protected with a different mechanism, so that
you don't try to replay bad data.  But you haven't committed to this
data yet, so it is fine to skip bad records.

Therefore, sub-page corruption is a fundamentally different issue from
super-page corruption.
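
In outline the full-page-logging trick looks like this (invented structures,
just the shape of the technique):

    # Before the first risky in-place update of a page, append a complete copy
    # of the new page image to the log; recovery replays those images, so a
    # torn in-place write is simply overwritten with a known-good copy.
    log = []        # pretend this is a durable, checksummed transaction log
    pages = {}      # pretend this is the database file

    def update_page(page_no, new_image, first_touch):
        if first_touch:
            log.append(("full_page", page_no, new_image))   # logged before...
        pages[page_no] = new_image                          # ...the in-place write

    def recover():
        for kind, page_no, image in log:
            if kind == "full_page":
                pages[page_no] = image    # clobbers any torn version of the page

    update_page(3, b"new contents", first_touch=True)
    pages[3] = b"new cont\0\0\0\0"        # simulate a torn write at crash time
    recover()
    print(pages[3])                       # b"new contents"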

BTW, older textbooks will tell you that mirroring requires that you
read from two copies of the data and compare it (and have some sort of
tie breaker if you need availability).  And you also have to re-read
data you've just written to disk, to make sure it's actually there and
hit the expected sectors.  We can't even do this anymore, thanks to
disk caches.  And it doesn't seem to be necessary in most cases.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-03 13:34                                                                                   ` Krzysztof Halasa
  2009-09-03 13:50                                                                                     ` Ric Wheeler
@ 2009-09-03 14:35                                                                                     ` david
  1 sibling, 0 replies; 309+ messages in thread
From: david @ 2009-09-03 14:35 UTC (permalink / raw)
  To: Krzysztof Halasa
  Cc: Ric Wheeler, Christoph Hellwig, Mark Lord, Michael Tokarev,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Thu, 3 Sep 2009, Krzysztof Halasa wrote:

> Ric Wheeler <rwheeler@redhat.com> writes:
>
>>>> Just to add some support to this, all of the external RAID arrays that
>>>> I know of normally run with write cache disabled on the component
>>>> drives.
>>>
>>> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
>>
>> Which drives various vendors ships changes with specific products.
>> Usually, they ship drives that have carefully vetted firmware, etc.
>> but they are close to the same drives you buy on the open market.
>
> But they aren't the same, are they? If they are not, the fact they can
> run well with the write-through cache doesn't mean the off-the-shelf
> ones can do as well.

frequently they are exactly the same drives, with exactly the same 
firmware.

you disable the write caches on the drives themselves, but you add a large 
write cache (with battery backup) in the raid card/chassis

> Are they SATA (or PATA) at all? SCSI etc. are usually different
> animals, though there are SCSI and SATA models which differ only in
> electronics.

it depends on what raid array you use, some use SATA, some use SAS/SCSI

David Lang

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-03 14:26                                                                                           ` Florian Weimer
@ 2009-09-03 15:09                                                                                             ` Ric Wheeler
  0 siblings, 0 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-09-03 15:09 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev,
	david, Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On 09/03/2009 10:26 AM, Florian Weimer wrote:
> * Ric Wheeler:
>
>> Note that even without MD raid, the file system issues IO's in file
>> system block size (4096 bytes normally) and most commodity storage
>> devices use a 512  byte sector size which means that we have to update
>> 8 512b sectors.
>
> Database software often attempts to deal with this phenomenon
> (sometimes called "torn page writes").  For example, you can make sure
> that the first time you write to a database page, you keep a full copy
> in your transaction log.  If the machine crashes, the log is replayed,
> first completely overwriting the partially-written page.  Only after
> that, you can perform logical/incremental logging.
>
> The log itself has to be protected with a different mechanism, so that
> you don't try to replay bad data.  But you haven't committed to this
> data yet, so it is fine to skip bad records.

Yes - databases worry a lot about this. Another technique that they tend to use 
is to have state bits at the beginning and end of their logical pages. For 
example, the first byte and last byte toggle together from 1 to 0 to 1 to 0 as 
you update.

If the bits don't match, that is a quick level indication of a torn write.

Even with the above scheme, you can still have data loss of course - you just 
need an IO error in the log and in your db table that was recently updated. Not 
entirely unlikely, especially if you use write cache enabled storage and don't 
flush that cache :-)
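
Schematically (the page layout here is invented):

    # First and last byte of each page carry the same generation marker; a page
    # whose two markers disagree was torn in the middle of an update.
    PAGE = 8192

    def make_page(payload, generation):
        body = payload.ljust(PAGE - 2, b" ")
        marker = bytes([generation & 1])
        return marker + body + marker

    def is_torn(page):
        return page[0] != page[-1]

    old  = make_page(b"v1", 0)
    new  = make_page(b"v2", 1)
    torn = new[:PAGE // 2] + old[PAGE // 2:]   # front half updated, back half not
    print(is_torn(old), is_torn(new), is_torn(torn))   # False False True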

>
> Therefore, sub-page corruption is a fundamentally different issue from
> super-page corruption.

We have to be careful to keep our terms clear since the DB pages are (usually) 
larger than the FS block size which in turn is larger than non-RAID storage 
sector size. At the FS level, we send down multiples of fs blocks (not 
blocked/aligned at RAID stripe levels, etc).

In any case, we can get sub-FS block level "torn writes" even with a local S-ATA 
drive in edge conditions.


>
> BTW, older textbooks will tell you that mirroring requires that you
> read from two copies of the data and compare it (and have some sort of
> tie breaker if you need availability).  And you also have to re-read
> data you've just written to disk, to make sure it's actually there and
> hit the expected sectors.  We can't even do this anymore, thanks to
> disk caches.  And it doesn't seem to be necessary in most cases.
>

We can do something like this with the built-in RAID in btrfs. If you detect an 
IO error (or bad checksum) on a read, btrfs knows how to request/grab another copy.
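
In outline (a toy, not the actual btrfs code paths):

    import zlib

    # Two mirrored copies of a block plus its expected checksum: if the first
    # copy fails verification, fall back to the mirror and rewrite the bad
    # copy from the good one.
    def read_mirrored(copies, expected_crc):
        for data in list(copies):
            if zlib.crc32(data) == expected_crc:
                for i in range(len(copies)):
                    copies[i] = data         # repair any stale/corrupt copy
                return data
        raise IOError("all copies failed checksum")

    good = b"hello world"
    copies = [b"hellX world", good]          # copy 0 was torn/corrupted
    print(read_mirrored(copies, zlib.crc32(good)))   # b'hello world'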

Also note that SCSI T10 DIF/DIX has baked-in support for applications to 
layer on extra data integrity (look for MKP's slide decks). This is really neat 
since you can intercept bad IO's on the way down and prevent overwriting good data.

ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* what fsck can (and can't) do was Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-29 20:22                                                           ` Rob Landley
  2009-08-29 21:34                                                             ` Pavel Machek
@ 2009-09-03 16:56                                                             ` david
  2009-09-03 19:27                                                               ` Theodore Tso
  1 sibling, 1 reply; 309+ messages in thread
From: david @ 2009-09-03 16:56 UTC (permalink / raw)
  To: Rob Landley
  Cc: Pavel Machek, Ric Wheeler, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Sat, 29 Aug 2009, Rob Landley wrote:

> On Saturday 29 August 2009 05:05:58 Pavel Machek wrote:
>> On Fri 2009-08-28 07:49:38, david@lang.hm wrote:
>>> On Thu, 27 Aug 2009, Rob Landley wrote:
>>>> Pavel's response was to attempt to document this.  Not that journaling
>>>> is _bad_, but that it doesn't protect against this class of problem.
>>>
>>> I don't think anyone is disagreeing with the statement that journaling
>>> doesn't protect against this class of problems, but Pavel's statements
>>> didn't say that. he stated that ext3 is more dangerous than ext2.
>>
>> Well, if you use 'common' fsck policy, ext3 _is_ more dangerous.
>
> The filesystem itself isn't more dangerous, but it may provide a false sense of
> security when used on storage devices it wasn't designed for.

from this discussion (and the similar discussion on lwn.net) there appears 
to be confusion/disagreement over what fsck does and what the results of 
not running it are.

it has been stated here that fsck cannot fix broken data; all it tries to 
do is clean up metadata, but it would probably help to get a clear 
statement of what exactly that means.

I know that it:

finds entries that don't actually have data and deletes them

finds entries where multiple files share data blocks and duplicates the 
(bad for one file) data to separate them

finds blocks that have been orphaned (allocated, but no directory pointer 
to them) and creates entries in lost+found

but if a fsck does not get run on a filesystem that has been damaged, what 
additional damage can be done?

can it overwrite data that could have been saved?

can it cause new files that are created (or new data written to existing, 
but uncorrupted files) to be lost?

or is it just a matter of not knowing about existing corruption?

David Lang


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: what fsck can (and can't) do was Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-09-03 16:56                                                             ` what fsck can (and can't) do was " david
@ 2009-09-03 19:27                                                               ` Theodore Tso
  0 siblings, 0 replies; 309+ messages in thread
From: Theodore Tso @ 2009-09-03 19:27 UTC (permalink / raw)
  To: david
  Cc: Rob Landley, Pavel Machek, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Thu, Sep 03, 2009 at 09:56:48AM -0700, david@lang.hm wrote:
> from this discussion (and the similar discussion on lwn.net) there appears 
> to be confusion/disagreement over what fsck does and what the results of  
> not running it are.
>
> it has been stated here that fsck cannot fix broken data, all it tries to 
> do is to clean up metadata, but it would probably help to get a clear  
> statement of what exactly that means.

Let me give you my formulation of fsck which may be helpful.  Fsck
cannot fix broken data, and (particularly in fsck -y mode) may not even
recover the maximal amount of lost data caused by metadata corruption.
(This is why sometimes an expert using debugfs can recover more data
than fsck -y, and if you have some really precious data, like ten
years' worth of Ph.D. research that you've never bothered to back
up[1], the first thing you should do is buy a new hard drive and make a
sector-by-sector copy of the disk and *then* run fsck.  A new
terabyte hard drive costs $100; how much is your data worth to you?)

[1] This isn't hypothetical; while I was at MIT this sort of thing
actually happened more than once --- which brings up the philosophical
question of whether someone who is that stupid about not doing backups
on critical data *deserves* to get a Ph.D. degree.  :-)
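
A minimal sketch of the "image first, then repair the copy" step (device and
path names are examples only; GNU ddrescue handles failing media better than
plain dd, but dd shows the idea):

    # copy the suspect filesystem out to an image, skipping unreadable sectors
    dd if=/dev/sdb1 of=/backup/sdb1.img bs=1M conv=noerror,sync
    # e2fsck will happily operate on a regular image file, so repair the copy
    # and keep the original untouched for debugfs / expert recovery
    e2fsck -fy /backup/sdb1.img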

Fsck's primary job is to make sure that further writes to the
filesystem, whether you are creating new files or removing directory
hierarchies, etc., will not cause *additional* data loss due to meta
data corruption in the file system.  Its secondary goals are to
preserve as much data as possible, and to make sure that file system
metadata is valid (i.e., so that a block pointer contains a valid
block address, so that an attempt to read a file won't cause an I/O
error when the filesystem attempts to seek to a non-existent sector
on disk).

For some filesystems, invalid, corrupt metadata can actually cause a
system panic or oops message, so it's not necessarily safe to mount a
filesystem with corrupt metadata read-only without risking the need to
reboot the machine in question.  More recently, there are folks who
have been filing security bugs when they detect such cases, so there
are fewer examples of such cases, but historically it was a good idea
to run fsck because otherwise it's possible the kernel might oops or
panic when it tripped over some particularly nasty metadata corruption.

> but if a fsck does not get run on a filesystem that has been damaged, 
> what additional damage can be done?

Consider the case where there are data blocks in use by inodes,
containing precious data, but which are marked free in a filesystem
allocation data structures (e.g., ext3's block bitmaps, but this
applies to pretty much any filesystem, whether it's xfs, reiserfs,
btrfs, etc.).  When you create a new file on that filesystem, there's
a chance that blocks that really contain data belonging to other
inodes (perhaps the aforementioned ten years' of unbacked-up
Ph.D. thesis research) will get overwritten by the newly created file.

Another example is an inode which has multiple hard links, but the
hard link count is wrong by being too low.  Now when you delete one of
the hard links, the inode will be released, and the inode and its data
blocks returned to the free pool, despite the fact that it is still
accessible via another directory entry in the filesystem, and despite
the fact that the file contents should be saved.

In the case where you have a block which is claimed by more than one
file, if that file is rewritten in place, it's possible that the newly
written file could have its data corrupted, so it's not just a matter
of potential corruption to existing files; the newly created files are
at risk as well.

> can it overwrite data that could have been saved?
>
> can it cause new files that are created (or new data written to existing, 
> but uncorrupted files) to be lost?
>
> or is it just a matter of not knowing about existing corruption?

So it's yes to all of the above; yes, you can overwrite existing data
files; yes, it can cause data blocks belonging to newly created files
to be lost; and no, you won't know about data loss caused by metadata
corruption.  (Again, you won't know about data loss caused by
corruption to the data blocks.)

					- Ted


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-03 14:15                                                                                         ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler
  2009-09-03 14:26                                                                                           ` Florian Weimer
@ 2009-09-03 23:50                                                                                           ` Krzysztof Halasa
  2009-09-04  0:39                                                                                             ` Ric Wheeler
  2009-09-04 21:21                                                                                           ` Mark Lord
  2009-09-07 11:45                                                                                           ` Pavel Machek
  3 siblings, 1 reply; 309+ messages in thread
From: Krzysztof Halasa @ 2009-09-03 23:50 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Ric Wheeler <rwheeler@redhat.com> writes:

> The whole thread above is about software MD using commodity drives
> (S-ATA or SAS) without battery backed write cache.

Yes. However, you mentioned external RAID arrays disable disk caches.
That's why I asked if they are using SATA or SCSI/etc. disks, and if
they have battery-backed cache.

> Also, when you enable the write cache (MD or not) you are buffering
> multiple MB's of data that can go away on power loss. Far greater
> (10x) the exposure that the partial RAID rewrite case worries about.

The cache is flushed with working barriers. I guess it should be
superior to disabled WB cache, in both performance and expected disk
lifetime.
-- 
Krzysztof Halasa

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-03 23:50                                                                                           ` Krzysztof Halasa
@ 2009-09-04  0:39                                                                                             ` Ric Wheeler
  0 siblings, 0 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-09-04  0:39 UTC (permalink / raw)
  To: Krzysztof Halasa
  Cc: Christoph Hellwig, Mark Lord, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 09/03/2009 07:50 PM, Krzysztof Halasa wrote:
> Ric Wheeler<rwheeler@redhat.com>  writes:
>
>    
>> The whole thread above is about software MD using commodity drives
>> (S-ATA or SAS) without battery backed write cache.
>>      
> Yes. However, you mentioned external RAID arrays disable disk caches.
> That's why I asked if they are using SATA or SCSI/etc. disks, and if
> they have battery-backed cache.
>
>    

Sorry for the confusion - they disable the write caches on the component 
drives normally, but have their own write cache which is not disabled in 
most cases.

>> Also, when you enable the write cache (MD or not) you are buffering
>> multiple MB's of data that can go away on power loss. Far greater
>> (10x) the exposure that the partial RAID rewrite case worries about.
>>      
> The cache is flushed with working barriers. I guess it should be
> superior to disabled WB cache, in both performance and expected disk
> lifetime.
>    

True - barriers (especially on big, slow s-ata drives) usually give you 
an overall win. On SAS drives it seems to make less of an impact, but then 
you always need to benchmark your workload on anything to get the only 
numbers that really matter :-)

ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-03 14:14                                                                                   ` jim owens
@ 2009-09-04  7:44                                                                                     ` Rob Landley
  2009-09-04 11:49                                                                                       ` Ric Wheeler
  0 siblings, 1 reply; 309+ messages in thread
From: Rob Landley @ 2009-09-04  7:44 UTC (permalink / raw)
  To: jim owens
  Cc: Ric Wheeler, Pavel Machek, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Thursday 03 September 2009 09:14:43 jim owens wrote:
> Rob Landley wrote:
> > I think he understands he was clueless too, that's why he investigated
> > the failure and wrote it up for posterity.
> >
> >> And Ric said do not stigmatize whole classes of A) devices, B) raid,
> >> and C) filesystems with "Pavel says...".
> >
> > I don't care what "Pavel says", so you can leave the ad hominem at the
> > door, thanks.
>
> See, this is exactly the problem we have with all the proposed
> documentation.  The reader (you) did not get what the writer (me)
> was trying to say.  That does not say either of us was wrong in
> what we thought was meant, simply that we did not communicate.

That's why I've mostly stopped bothering with this thread.  I could respond to 
Ric Wheeler's latest (what does write barriers have to do with whether or not 
a multi-sector stripe is guaranteed to be atomically updated during a panic or 
power failure?) but there's just no point.

The LWN article on the topic is out, and incomplete as it is I expect it's the 
best documentation anybody will actually _read_.

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-04  7:44                                                                                     ` Rob Landley
@ 2009-09-04 11:49                                                                                       ` Ric Wheeler
  2009-09-05 10:28                                                                                         ` Pavel Machek
  0 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-09-04 11:49 UTC (permalink / raw)
  To: Rob Landley
  Cc: jim owens, Ric Wheeler, Pavel Machek, david, Theodore Tso,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 09/04/2009 03:44 AM, Rob Landley wrote:
> On Thursday 03 September 2009 09:14:43 jim owens wrote:
>    
>> Rob Landley wrote:
>>      
>>> I think he understands he was clueless too, that's why he investigated
>>> the failure and wrote it up for posterity.
>>>
>>>        
>>>> And Ric said do not stigmatize whole classes of A) devices, B) raid,
>>>> and C) filesystems with "Pavel says...".
>>>>          
>>> I don't care what "Pavel says", so you can leave the ad hominem at the
>>> door, thanks.
>>>        
>> See, this is exactly the problem we have with all the proposed
>> documentation.  The reader (you) did not get what the writer (me)
>> was trying to say.  That does not say either of us was wrong in
>> what we thought was meant, simply that we did not communicate.
>>      
> That's why I've mostly stopped bothering with this thread.  I could respond to
> Ric Wheeler's latest (what does write barriers have to do with whether or not
> a multi-sector stripe is guaranteed to be atomically updated during a panic or
> power failure?) but there's just no point.
>    

The point of that post was that the failure that you and Pavel both 
attribute to RAID and journalled fs happens whenever the storage cannot 
promise to do atomic writes of a logical FS block (prevent torn 
pages/split writes/etc). I gave a specific example of why this happens 
even with simple, single disk systems.

Further, if  you have the write cache enabled on your local S-ATA/SAS 
drives and do not have working barriers (as is the case with MD 
RAID5/6), you have a hard promise of data loss on power outage and these 
split writes are not going to be the cause of your issues.

You can verify this by testing. Or, try to find people that do storage 
and file systems that you would listen to and ask.
> The LWN article on the topic is out, and incomplete as it is I expect it's the
> best documentation anybody will actually _read_.
>
> Rob
>    


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-03 14:15                                                                                         ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler
  2009-09-03 14:26                                                                                           ` Florian Weimer
  2009-09-03 23:50                                                                                           ` Krzysztof Halasa
@ 2009-09-04 21:21                                                                                           ` Mark Lord
  2009-09-04 21:29                                                                                             ` Ric Wheeler
  2009-09-07 11:45                                                                                           ` Pavel Machek
  3 siblings, 1 reply; 309+ messages in thread
From: Mark Lord @ 2009-09-04 21:21 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Krzysztof Halasa, Christoph Hellwig, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Ric Wheeler wrote:
..
> You must disable the write cache on these commodity drives *if* the MD 
> RAID level does not support barriers properly.
..

Rather than further trying to cripple Linux on the notebook,
(it's bad enough already)..

How about instead, *fixing* the MD layer to properly support barriers?
That would be far more useful, productive, and better for end-users.

Cheers

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-04 21:21                                                                                           ` Mark Lord
@ 2009-09-04 21:29                                                                                             ` Ric Wheeler
  2009-09-05 12:57                                                                                               ` Mark Lord
  0 siblings, 1 reply; 309+ messages in thread
From: Ric Wheeler @ 2009-09-04 21:29 UTC (permalink / raw)
  To: Mark Lord
  Cc: Krzysztof Halasa, Christoph Hellwig, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 09/04/2009 05:21 PM, Mark Lord wrote:
> Ric Wheeler wrote:
> ..
>> You must disable the write cache on these commodity drives *if* the 
>> MD RAID level does not support barriers properly.
> ..
>
> Rather than further trying to cripple Linux on the notebook,
> (it's bad enough already)..

People using MD on notebooks (not sure there are that many using RAID5 
MD) could leave their write cache enabled.

>
> How about instead, *fixing* the MD layer to properly support barriers?
> That would be far more useful, productive, and better for end-users.
>
> Cheers

Fixing MD would be great - not sure that it would end up still faster 
(look at md1 devices with working barriers compared to md1 with 
write cache disabled).

In the meantime, if you are using MD to make your data more reliable, I 
would still strongly urge you to disable the write cache when you see 
"barriers disabled" messages spit out in /var/log/messages :-)

ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-04 11:49                                                                                       ` Ric Wheeler
@ 2009-09-05 10:28                                                                                         ` Pavel Machek
  2009-09-05 12:20                                                                                           ` Ric Wheeler
  2009-09-05 13:54                                                                                           ` Jonathan Corbet
  0 siblings, 2 replies; 309+ messages in thread
From: Pavel Machek @ 2009-09-05 10:28 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Rob Landley, jim owens, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Fri 2009-09-04 07:49:34, Ric Wheeler wrote:
> On 09/04/2009 03:44 AM, Rob Landley wrote:
>> On Thursday 03 September 2009 09:14:43 jim owens wrote:
>>    
>>> Rob Landley wrote:
>>>      
>>>> I think he understands he was clueless too, that's why he investigated
>>>> the failure and wrote it up for posterity.
>>>>
>>>>        
>>>>> And Ric said do not stigmatize whole classes of A) devices, B) raid,
>>>>> and C) filesystems with "Pavel says...".
>>>>>          
>>>> I don't care what "Pavel says", so you can leave the ad hominem at the
>>>> door, thanks.
>>>>        
>>> See, this is exactly the problem we have with all the proposed
>>> documentation.  The reader (you) did not get what the writer (me)
>>> was trying to say.  That does not say either of us was wrong in
>>> what we thought was meant, simply that we did not communicate.
>>>      
>> That's why I've mostly stopped bothering with this thread.  I could respond to
>> Ric Wheeler's latest (what does write barriers have to do with whether or not
>> a multi-sector stripe is guaranteed to be atomically updated during a panic or
>> power failure?) but there's just no point.
>>    
>
> The point of that post was that the failure that you and Pavel both  
> attribute to RAID and journalled fs happens whenever the storage cannot  
> promise to do atomic writes of a logical FS block (prevent torn  
> pages/split writes/etc). I gave a specific example of why this happens  
> even with simple, single disk systems.

ext3 does not expect atomic writes of 4K blocks, according to Ted. So
no, it is not broken on a single disk.
 
>> The LWN article on the topic is out, and incomplete as it is I expect it's the
>> best documentation anybody will actually _read_.

Would anyone (probably privately?) share the lwn link?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-30 15:20                                                             ` Theodore Tso
                                                                                 ` (2 preceding siblings ...)
  2009-09-05 10:34                                                               ` Pavel Machek
@ 2009-09-05 10:34                                                               ` Pavel Machek
  3 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-09-05 10:34 UTC (permalink / raw)
  To: Theodore Tso, NeilBrown, Ric Wheeler, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Hi!

> > If it only was this simple. We don't have 'check brakes' (aka
> > 'journalling ineffective') warning light. If we had that, I would not
> > have problem.
> 
> But we do; competently designed (and in the case of software RAID,
> competently packaged) RAID subsystems send notifications to the system
> administrator when there is a hard drive failure.  Some hardware RAID
> systems will send a page to the system administrator.  A mid-range
> Areca card has a separate ethernet port so it can send e-mail to the
> administrator, even if the OS is hosed for some reason.

Well, my MMC/uSD cards do not have ethernet ports to remind me that
they are unreliable :-(.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-05 10:28                                                                                         ` Pavel Machek
@ 2009-09-05 12:20                                                                                           ` Ric Wheeler
  2009-09-05 13:54                                                                                           ` Jonathan Corbet
  1 sibling, 0 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-09-05 12:20 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, jim owens, david, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On 09/05/2009 06:28 AM, Pavel Machek wrote:
> On Fri 2009-09-04 07:49:34, Ric Wheeler wrote:
>    
>> On 09/04/2009 03:44 AM, Rob Landley wrote:
>>      
>>> On Thursday 03 September 2009 09:14:43 jim owens wrote:
>>>
>>>        
>>>> Rob Landley wrote:
>>>>
>>>>          
>>>>> I think he understands he was clueless too, that's why he investigated
>>>>> the failure and wrote it up for posterity.
>>>>>
>>>>>
>>>>>            
>>>>>> And Ric said do not stigmatize whole classes of A) devices, B) raid,
>>>>>> and C) filesystems with "Pavel says...".
>>>>>>
>>>>>>              
>>>>> I don't care what "Pavel says", so you can leave the ad hominem at the
>>>>> door, thanks.
>>>>>
>>>>>            
>>>> See, this is exactly the problem we have with all the proposed
>>>> documentation.  The reader (you) did not get what the writer (me)
>>>> was trying to say.  That does not say either of us was wrong in
>>>> what we thought was meant, simply that we did not communicate.
>>>>
>>>>          
>>> That's why I've mostly stopped bothering with this thread.  I could respond to
>>> Ric Wheeler's latest (what does write barriers have to do with whether or not
>>> a multi-sector stripe is guaranteed to be atomically updated during a panic or
>>> power failure?) but there's just no point.
>>>
>>>        
>> The point of that post was that the failure that you and Pavel both
>> attribute to RAID and journalled fs happens whenever the storage cannot
>> promise to do atomic writes of a logical FS block (prevent torn
>> pages/split writes/etc). I gave a specific example of why this happens
>> even with simple, single disk systems.
>>      
> ext3 does not expect atomic write of 4K block, according to Ted. So
> no, it is not broken on single disk.
>    

I am not sure what you mean by "expect."

ext3 (and other file systems) certainly expect that acknowledged writes 
will still be there after a crash.

With your disk write cache on (and no working barriers or non-volatile 
write cache), this will always require a repair via fsck or leave you 
with corrupted data or metadata.

ext4, btrfs and zfs all do checksumming of writes, but this is a 
detection mechanism.

Repair of the partial write is done at detection time (if you have another 
copy, as in btrfs or zfs) or by a later repair (ext4's fsck).

For what it's worth, this is the same story with databases (DB2, Oracle, 
etc). They spend a lot of energy trying to detect partial writes from 
the application level's point of view and their granularity is often 
multiple fs blocks....

>
>    
>>> The LWN article on the topic is out, and incomplete as it is I expect it's the
>>> best documentation anybody will actually _read_.
>>>        
> Would anyone (probably privately?) share the lwn link?
> 								Pavel
>    


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-04 21:29                                                                                             ` Ric Wheeler
@ 2009-09-05 12:57                                                                                               ` Mark Lord
  2009-09-05 13:40                                                                                                 ` Ric Wheeler
  2009-09-05 21:43                                                                                                 ` NeilBrown
  0 siblings, 2 replies; 309+ messages in thread
From: Mark Lord @ 2009-09-05 12:57 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Krzysztof Halasa, Christoph Hellwig, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Ric Wheeler wrote:
> On 09/04/2009 05:21 PM, Mark Lord wrote:
..
>> How about instead, *fixing* the MD layer to properly support barriers?
>> That would be far more useful, productive, and better for end-users.
..
> Fixing MD would be great - not sure that it would end up still faster 
> (look at md1 devices with working barriers with compared to md1 with 
> write cache disabled).
..

There's no inherent reason for it to be slower, except possibly
drives with b0rked FUA support.

So the first step is to fix MD to pass barriers to the LLDs
for most/all RAID types. 

Then, if it has performance issues, those can be addressed
by more application of little grey cells.  :)

Cheers

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-05 12:57                                                                                               ` Mark Lord
@ 2009-09-05 13:40                                                                                                 ` Ric Wheeler
  2009-09-05 21:43                                                                                                 ` NeilBrown
  1 sibling, 0 replies; 309+ messages in thread
From: Ric Wheeler @ 2009-09-05 13:40 UTC (permalink / raw)
  To: Mark Lord
  Cc: Krzysztof Halasa, Christoph Hellwig, Michael Tokarev, david,
	Pavel Machek, Theodore Tso, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On 09/05/2009 08:57 AM, Mark Lord wrote:
> Ric Wheeler wrote:
>> On 09/04/2009 05:21 PM, Mark Lord wrote:
> ..
>>> How about instead, *fixing* the MD layer to properly support barriers?
>>> That would be far more useful, productive, and better for end-users.
> ..
>> Fixing MD would be great - not sure that it would end up still faster 
>> (look at md1 devices with working barriers with compared to md1 with 
>> write cache disabled).
> ..
>
> There's no inherent reason for it to be slower, except possibly
> drives with b0rked FUA support.
>
> So the first step is to fix MD to pass barriers to the LLDs
> for most/all RAID types.
> Then, if it has performance issues, those can be addressed
> by more application of little grey cells.  :)
>
> Cheers

The performance issue with MD is that the "simple" answer is to not only 
pass on those downstream barrier ops, but also to block and wait until 
all of those dependent barrier ops complete before ack'ing the IO.

With that implementation, at least, you will see a very large 
performance impact, and I am not sure it would end up any better than 
just turning off the write caches.

Sounds like we should actually do some testing and measure; I do 
think that it will vary quite a lot with the class of device, just 
like we see with single-disk barriers vs write cache disabled on SAS vs 
S-ATA, etc...

ric


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-05 10:28                                                                                         ` Pavel Machek
  2009-09-05 12:20                                                                                           ` Ric Wheeler
@ 2009-09-05 13:54                                                                                           ` Jonathan Corbet
  2009-09-05 21:27                                                                                             ` Pavel Machek
  1 sibling, 1 reply; 309+ messages in thread
From: Jonathan Corbet @ 2009-09-05 13:54 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Rob Landley, jim owens, david, Theodore Tso,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4

On Sat, 5 Sep 2009 12:28:10 +0200
Pavel Machek <pavel@ucw.cz> wrote:

> >> The LWN article on the topic is out, and incomplete as it is I expect it's the
> >> best documentation anybody will actually _read_.  
> 
> Would anyone (probably privately?) share the lwn link?

	http://lwn.net/SubscriberLink/349970/9875eff987190551/

assuming you've not already gotten one from elsewhere.

jon

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-05 13:54                                                                                           ` Jonathan Corbet
@ 2009-09-05 21:27                                                                                             ` Pavel Machek
  2009-09-05 21:56                                                                                               ` Theodore Tso
  0 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-09-05 21:27 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Ric Wheeler, Rob Landley, jim owens, david, Theodore Tso,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4

On Sat 2009-09-05 07:54:24, Jonathan Corbet wrote:
> On Sat, 5 Sep 2009 12:28:10 +0200
> Pavel Machek <pavel@ucw.cz> wrote:
> 
> > >> The LWN article on the topic is out, and incomplete as it is I expect it's the
> > >> best documentation anybody will actually _read_.  
> > 
> > Would anyone (probably privately?) share the lwn link?
> 
> 	http://lwn.net/SubscriberLink/349970/9875eff987190551/
> 
> assuming you've not already gotten one from elsewhere.

Thanks, and thanks for nice article!
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-05 12:57                                                                                               ` Mark Lord
  2009-09-05 13:40                                                                                                 ` Ric Wheeler
@ 2009-09-05 21:43                                                                                                 ` NeilBrown
  1 sibling, 0 replies; 309+ messages in thread
From: NeilBrown @ 2009-09-05 21:43 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ric Wheeler, Krzysztof Halasa, Christoph Hellwig,
	Michael Tokarev, david, Pavel Machek, Theodore Tso, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

On Sat, September 5, 2009 10:57 pm, Mark Lord wrote:
> Ric Wheeler wrote:
>> On 09/04/2009 05:21 PM, Mark Lord wrote:
> ..
>>> How about instead, *fixing* the MD layer to properly support barriers?
>>> That would be far more useful, productive, and better for end-users.
> ..
>> Fixing MD would be great - not sure that it would end up still faster
>> (look at md1 devices with working barriers with compared to md1 with
>> write cache disabled).
> ..
>
> There's no inherent reason for it to be slower, except possibly
> drives with b0rked FUA support.
>
> So the first step is to fix MD to pass barriers to the LLDs
> for most/all RAID types.

Having MD "pass barriers" to LLDs isn't really very useful.
The barrier needs to act with respect to all addresses of the device,
and once you pass it down, it can only act with respect to addresses
on that device.
What any striping RAID level needs to do when it sees a barrier
is:
   suspend all future writes
   drain and flush all queues
   submit the barrier write
   drain and flush all queues
   unsuspend writes

I guess "drain can flush all queues" can be done with an empty barrier
so maybe that is exactly what you meant.

The double flush which (I think) is required by the barrier semantic
is unfortunate.  I wonder if it would actually make things slower than
necessary.

NeilBrown

>
> Then, if it has performance issues, those can be addressed
> by more application of little grey cells.  :)
>
> Cheers
>


^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-09-05 21:27                                                                                             ` Pavel Machek
@ 2009-09-05 21:56                                                                                               ` Theodore Tso
  0 siblings, 0 replies; 309+ messages in thread
From: Theodore Tso @ 2009-09-05 21:56 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jonathan Corbet, Ric Wheeler, Rob Landley, jim owens, david,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4

On Sat, Sep 05, 2009 at 11:27:32PM +0200, Pavel Machek wrote:
> 
> Thanks, and thanks for nice article!

I agree; it's very nicely written, balanced, and doesn't scare users
unduly.

						- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-03 14:15                                                                                         ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler
                                                                                                             ` (2 preceding siblings ...)
  2009-09-04 21:21                                                                                           ` Mark Lord
@ 2009-09-07 11:45                                                                                           ` Pavel Machek
  2009-09-07 13:10                                                                                             ` Theodore Tso
  3 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-09-07 11:45 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Krzysztof Halasa, Christoph Hellwig, Mark Lord, Michael Tokarev,
	david, Theodore Tso, NeilBrown, Rob Landley, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

Hi!

> Note that even without MD raid, the file system issues IO's in file 
> system block size (4096 bytes normally) and most commodity storage 
> devices use a 512  byte sector size which means that we have to update 8 
> 512b sectors.
>
> Drives can (and do) have multiple platters and surfaces and it is 
> perfectly normal to have contiguous logical ranges of sectors map to 
> non-contiguous sectors physically. Imagine a 4KB write stripe that 
> straddles two adjacent tracks on one platter (requiring a seek) or mapped 
> across two surfaces (requiring a head switch). Also, a remapped sector 
> can require more or less a full surface seek from where ever you are to 
> the remapped sector area of the drive.

Yes, but ext3 was designed to handle the partial write  (according to
tytso).

> These are all examples that can after a power loss,  even a local 
> (non-MD) device,  do a partial update of that 4KB write range of
> sectors. 

Yes, but ext3 journal protects metadata integrity in that case.

> In other words, this is not just an MD issue, it is entirely possible 
> even with non-MD devices.
>
> Also, when you enable the write cache (MD or not) you are buffering 
> multiple MB's of data that can go away on power loss. Far greater (10x) 
> the exposure that the partial RAID rewrite case worries about.

Yes, that's what barriers are for. Except that they are not there on
MD0/MD5/MD6. They actually work on local sata drives...

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage
  2009-09-07 11:45                                                                                           ` Pavel Machek
@ 2009-09-07 13:10                                                                                             ` Theodore Tso
  2010-04-04 13:47                                                                                                 ` Pavel Machek
  0 siblings, 1 reply; 309+ messages in thread
From: Theodore Tso @ 2009-09-07 13:10 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Krzysztof Halasa, Christoph Hellwig, Mark Lord,
	Michael Tokarev, david, NeilBrown, Rob Landley, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Mon, Sep 07, 2009 at 01:45:34PM +0200, Pavel Machek wrote:
> 
> Yes, but ext3 was designed to handle the partial write  (according to
> tytso).

I'm not sure what made you think that I said that.  In practice things
usually work out, as a consequence of the fact that ext3 uses physical
block journaling, but it's not perfect, because...

> > Also, when you enable the write cache (MD or not) you are buffering 
> > multiple MB's of data that can go away on power loss. Far greater (10x) 
> > the exposure that the partial RAID rewrite case worries about.
> 
> Yes, that's what barriers are for. Except that they are not there on
> MD0/MD5/MD6. They actually work on local sata drives...

Yes, but ext3 does not enable barriers by default (the patch has been
submitted but akpm has balked because he doesn't like the performance
degradation and doesn't believe that Chris Mason's "workload of doom"
is a common case).  Note though that it is possible for dirty blocks
to remain in the track buffer for *minutes* without being written to
spinning rust platters without a barrier.

See Chris Mason's report of this phenomenon here:

	http://lkml.org/lkml/2009/3/30/297

Here's Chris Mason "barrier test" which will corrupt ext3 filesystems
50% of the time after a power drop if the filesystem is mounted with
barriers disabled (which is the default; use the mount option
barrier=1 to enable barriers):

	http://lkml.indiana.edu/hypermail/linux/kernel/0805.2/1518.html

(Yes, ext4 has barriers enabled by default.)
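
For reference, turning them on by hand looks something like this (a sketch;
the mount point and device name are examples):

    mount -o remount,barrier=1 /mnt/data
    # or persistently, via /etc/fstab:
    # /dev/sdXN  /mnt/data  ext3  defaults,barrier=1  0  2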

							- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* periodic fsck was Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-08-26 18:02                                             ` Theodore Tso
  2009-08-27  6:28                                                 ` Eric Sandeen
  2009-11-09  8:53                                               ` periodic fsck was " Pavel Machek
@ 2009-11-09  8:53                                               ` Pavel Machek
  2009-11-09 14:05                                                 ` Theodore Tso
  2 siblings, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2009-11-09  8:53 UTC (permalink / raw)
  To: Theodore Tso, david, Rik van Riel, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack

On Wed 2009-08-26 14:02:48, Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 06:43:24AM -0700, david@lang.hm wrote:
> >>>
> >>> as the ext3 authors have stated many times over the years, you still need
> >>> to run fsck periodicly anyway.
> >>
> >> Where is that documented?
> >
> > linux-kernel mailing list archives.
> 
> Probably from some 6-8 years ago, in e-mail postings that I made.  My
> argument has always been that PC-class hardware is crap, and it's a

Well, in SUSE11-or-so, the distro stopped periodic fscks, silently :-(. I
believed that it was a really bad idea at that point, but because I
could not find a piece of documentation recommending them, I lost the
argument.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)
  2009-08-28  7:31                                                             ` NeilBrown
  (?)
@ 2009-11-09 10:50                                                             ` Pavel Machek
  -1 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2009-11-09 10:50 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ric Wheeler, Rob Landley, Theodore Tso, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

Hi!

> >> If you have a specific bug in MD code, please propose a patch.
> >
> > Interesting. So, what's technically wrong with the patch below?
> >
> 
> You mean apart from ".... that high highly undesirable ...." ??
>                                ^^^^^^^^^^^
> 

Ok, I still believe kernel documentation should be ... well... in the
kernel, not in an LWN article, so I fixed the patch according to your
comments.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..14d0324
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,21 @@
+There are storage devices that have highly undesirable properties when
+they are disconnected or suffer power failures while writes are in
+progress; such devices include flash devices and degraded DM/MD RAID
+4/5/6 (*) arrays.  These devices have the property of potentially
+corrupting blocks being written at the time of the power failure, and
+worse yet, amplifying the region where blocks are corrupted such that
+additional sectors are also damaged during the power failure.
+        
+Users who use such storage devices are well advised to take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used.  Regular backups when using any devices, and these
+devices in particular, are also a Very Good Idea.
+        
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption.  A forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
+
+(*) If device failure causes the array to become degraded during or
+immediately after the power failure, the same problem can result.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related	[flat|nested] 309+ messages in thread

* Re: periodic fsck was Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-11-09  8:53                                               ` Pavel Machek
@ 2009-11-09 14:05                                                 ` Theodore Tso
  2009-11-09 15:58                                                   ` Andreas Dilger
  0 siblings, 1 reply; 309+ messages in thread
From: Theodore Tso @ 2009-11-09 14:05 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Rik van Riel, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet, jack

On Mon, Nov 09, 2009 at 09:53:18AM +0100, Pavel Machek wrote:
> 
> Well, in SUSE11-or-so, distro stopped period fscks, silently :-(. I
> believed that it was really bad idea at that point, but because I
> could not find piece of documentation recommending them, I lost the
> argument.

It's an engineering trade-off.  If you have perfect memory that never
has cosmic-ray hiccups, and hard drives that never write data to the
wrong place, etc., then you don't need periodic fsck's.

If you do have imperfect hardware, the question then is how imperfect
your hardware is, and how frequently it introduces errors.  If you
check too frequently, though, users get upset, especially when it
happens at the most inconvenient time (when you're trying to recover
from unscheduled downtime by rebooting); if you check too infrequently
then it doesn't help you too much since too much data gets damaged
before fsck notices.

So these days, what I strongly recommend is that people use LVM
snapshots, and schedule weekly checks during some low usage period
(e.g., 3am on Saturdays), using something like the e2croncheck shell
script.
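
A stripped-down sketch of that approach (the VG/LV names are examples; the
real e2croncheck script does more, e.g. updating the filesystem's
last-checked timestamp after a clean result):

    VG=vg0; LV=home; SNAP=${LV}-check
    lvcreate -s -L 1G -n "$SNAP" "$VG/$LV"     # snapshot the live volume
    if e2fsck -fy "/dev/$VG/$SNAP"; then       # check the snapshot, not the live fs
        echo "$VG/$LV: snapshot checked clean"
    else
        echo "$VG/$LV: fsck found problems, schedule a real repair" >&2
    fi
    lvremove -f "/dev/$VG/$SNAP"               # drop the snapshot either way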

						- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: periodic fsck was Re: [patch] ext2/3: document conditions when reliable operation is possible
  2009-11-09 14:05                                                 ` Theodore Tso
@ 2009-11-09 15:58                                                   ` Andreas Dilger
  0 siblings, 0 replies; 309+ messages in thread
From: Andreas Dilger @ 2009-11-09 15:58 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Pavel Machek, david, Rik van Riel, Ric Wheeler, Florian Weimer,
	Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, ext4 development, corbet,
	Jan Kara, Bryan Kadzban, Karel Zak, LVM Mailing List

On 2009-11-09, at 07:05, Theodore Tso wrote:
> So these days, what I strongly recommend is that people use LVM
> snapshots, and schedule weekly checks during some low usage period
> (i.e., 3am on Saturdays), using something like the e2croncheck shell
> script.


There was another script written to do this that handled the e2fsck,
reiserfsck and xfs_check, detecting all volume groups automatically,
along with e.g. validating that the snapshot volume doesn't exist
before starting the check (which may indicate that the previous e2fsck
is still running), and not running while on AC power.

The last version was in the thread "forced fsck (again?)" dated
2008-01-28.  Would it be better to use that one?  In that thread we
discussed not clobbering the last checked time as e2croncheck does, so
the admin can see how long it was since the filesystem was last
checked.

Maybe it makes more sense to get the lvcheck script included into
util-linux-ng or lvm2 packages, and have it added automatically to the
cron.weekly directory?  Then the distros could disable the at-boot
checking safely, while still being able to detect corruption caused by
cables/RAM/drives/software.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 309+ messages in thread

* fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)
  2009-09-07 13:10                                                                                             ` Theodore Tso
@ 2010-04-04 13:47                                                                                                 ` Pavel Machek
  0 siblings, 0 replies; 309+ messages in thread
From: Pavel Machek @ 2010-04-04 13:47 UTC (permalink / raw)
  To: Theodore Tso, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig,
	Mark Lord, Michael Tokarev, david, NeilBrown, Rob Landley,
	Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton,
	mtk.manpages, rdunlap, linux-doc, linux-ext4, corbet

Hi!

> > Yes, but ext3 was designed to handle the partial write  (according to
> > tytso).
> 
> I'm not sure what made you think that I said that.  In practice things
> usually work out, as a conseuqence of the fact that ext3 uses physical
> block journaling, but it's not perfect, becase...

Ok; so the journalling actually is not reliable on many machines --
not even disk drive manufacturers guarantee full block writes AFAICT.

Maybe there's time to revive the patch to increase the mount count by >1
when the journal is replayed, to do fsck more often when powerfails are
present?
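
For illustration, the existing mount-count machinery such a patch would
build on looks like this from user space (the device name is an example):

    dumpe2fs -h /dev/sdXN | grep -Ei 'mount count|last checked|check interval'
    tune2fs -c 25 /dev/sdXN    # force a full fsck at least every 25 mounts
    tune2fs -C 24 /dev/sdXN    # bumping the current count brings the next check
                               # closer, roughly what the patch would do on replay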


> > > Also, when you enable the write cache (MD or not) you are buffering 
> > > multiple MB's of data that can go away on power loss. Far greater (10x) 
> > > the exposure that the partial RAID rewrite case worries about.
> > 
> > Yes, that's what barriers are for. Except that they are not there on
> > MD0/MD5/MD6. They actually work on local sata drives...
> 
> Yes, but ext3 does not enable barriers by default (the patch has been
> submitted but akpm has balked because he doesn't like the performance
> degredation and doesn't believe that Chris Mason's "workload of doom"
> is a common case).  Note though that it is possible for dirty blocks
> to remain in the track buffer for *minutes* without being written to
> spinning rust platters without a barrier.

So we do the wrong thing by default. Another reason to do fsck more often
when powerfails are present?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)
  2010-04-04 13:47                                                                                                 ` Pavel Machek
  (?)
@ 2010-04-04 17:39                                                                                                 ` tytso
  -1 siblings, 0 replies; 309+ messages in thread
From: tytso @ 2010-04-04 17:39 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ric Wheeler, Krzysztof Halasa, Christoph Hellwig, Mark Lord,
	Michael Tokarev, david, NeilBrown, Rob Landley, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Sun, Apr 04, 2010 at 03:47:29PM +0200, Pavel Machek wrote:
> > Yes, but ext3 does not enable barriers by default (the patch has been
> > submitted but akpm has balked because he doesn't like the performance
> > degradation and doesn't believe that Chris Mason's "workload of doom"
> > is a common case).  Note though that it is possible for dirty blocks
> > to remain in the track buffer for *minutes* without being written to
> > spinning rust platters without a barrier.
> 
> So we do the wrong thing by default. Another reason to run fsck more
> often when power failures have occurred?

Or migrate to ext4, which does use barriers by default, as well as
journal-level checksumming.  :-)
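
For anyone who wants the safer behaviour without waiting for the
default to change, both can be asked for per mount; for example
(device names and mount points below are placeholders):

    # ext3: enable write barriers explicitly, e.g. in /etc/fstab:
    #   /dev/sda1  /      ext3  defaults,barrier=1         0 1
    # or for an already-mounted filesystem:
    mount -o remount,barrier=1 /

    # ext4: barriers are already on by default; journal checksumming
    # can be requested in addition:
    #   /dev/sda2  /home  ext4  defaults,journal_checksum  0 2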

As far as changing the default to enable barriers for ext3, you'll
need to talk to akpm about that; he's the one who has been against it
in the past.

					- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)
  2010-04-04 13:47                                                                                                 ` Pavel Machek
  (?)
  (?)
@ 2010-04-04 17:59                                                                                                 ` Rob Landley
  2010-04-04 18:45                                                                                                   ` Pavel Machek
  2010-04-04 19:29                                                                                                   ` tytso
  -1 siblings, 2 replies; 309+ messages in thread
From: Rob Landley @ 2010-04-04 17:59 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig,
	Mark Lord, Michael Tokarev, david, NeilBrown, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Sunday 04 April 2010 08:47:29 Pavel Machek wrote:
> Maybe there's time to revive the patch to increase the mount count by >1
> when the journal is replayed, so that fsck runs more often when power
> failures have occurred?

Wow, you mean there are Linux users left who _don't_ rip that out?

The auto-fsck stuff is an instance of "we the developers know what you the 
users need far more than you ever could, so let me ram this down your throat".  
I don't know of a server anywhere that can afford an unscheduled extra four 
hours of downtime due to the system deciding to fsck itself, and I don't know 
a Linux laptop user anywhere who would be happy to fire up their laptop and 
suddenly be told "oh, you can't do anything with it for two hours, and you 
can't power it down either".

I keep my laptop backed up to an external terabyte USB drive and the volatile 
subset of it to a network drive (rsync is great for both), and when it dies, 
it dies.  But I've never lost data due to an issue fsck would have fixed.  I've 
lost data to disks overheating, disks wearing out, disks being run undervolt 
because the cat chewed on the power supply cord... I've copied floppy images to 
/dev/hda instead of /dev/fd0... I even ran over my laptop with my car once.  
(Amazingly enough, that hard drive survived.)

But fsck has never once protected any data of mine, that I am aware of, since 
journaling was introduced.

I'm all for btrfs coming along and being able to fsck itself behind my back 
where I don't have to care about it.  (Although I want to tell it _not_ to do 
that when on battery power.)  But the "fsck lottery" at powerup is just 
stupid.

> > > > Also, when you enable the write cache (MD or not) you are buffering
> > > > multiple MB's of data that can go away on power loss. Far greater
> > > > (10x) the exposure that the partial RAID rewrite case worries about.
> > >
> > > Yes, that's what barriers are for. Except that they are not there on
> > > MD0/MD5/MD6. They actually work on local sata drives...
> >
> > Yes, but ext3 does not enable barriers by default (the patch has been
> > submitted but akpm has balked because he doesn't like the performance
> > degradation and doesn't believe that Chris Mason's "workload of doom"
> > is a common case).  Note though that it is possible for dirty blocks
> > to remain in the track buffer for *minutes* without being written to
> > spinning rust platters without a barrier.
>
> So we do the wrong thing by default. Another reason to run fsck more
> often when power failures have occurred?

My laptop power fails all the time, due to battery exhaustion.  Back under KDE 
it was decent about suspending when it ran low on power, but ever since 
KDE 4 came out and I had to switch to XFCE, it's using the gnome 
infrastructure, which collects funky statistics and heuristics but can never 
quite save them to disk because suddenly running out of power when it thinks 
it's got 20 minutes left doesn't give it the opportunity to save its database.  
So it'll never auto-suspend, just suddenly die if I don't hit the button.

As a result of one of these, two large media files in my "anime" subdirectory 
are not only crosslinked, but the common sector they share is bad.  (It ran 
out of power in the act of writing that sector.  I left it copying large files 
to the drive and forgot to plug it in, and it did the loud click emergency 
park and power down thing when the hardware voltage regulator tripped.)

This corruption has been there for a year now.  Presumably if it overwrote 
that sector it might recover (perhaps by allocating one of the spares), but 
the drive firmware has proven unwilling to do so in response to _reading_ the 
bad sector, and I'm largely ignoring it because it's by no means the worst 
thing wrong with this laptop's hardware, and some glorious day I'll probably 
break down and buy a Macintosh.  The stuff I have on it is backed up, and in the 
year since it hasn't developed a second bad sector and I haven't deleted those 
files.  (Yes, I could replace the hard drive _again_ but this laptop's on its 
third hard drive already and it's just not worth the effort.)

I'm much more comfortable living with this until I can get a new laptop than 
with the idea of running fsck on the system and letting it do who knows what 
in response to something that is not actually a problem.

> 									Pavel

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)
  2010-04-04 17:59                                                                                                 ` Rob Landley
@ 2010-04-04 18:45                                                                                                   ` Pavel Machek
  2010-04-04 19:35                                                                                                     ` tytso
  2010-04-04 19:29                                                                                                   ` tytso
  1 sibling, 1 reply; 309+ messages in thread
From: Pavel Machek @ 2010-04-04 18:45 UTC (permalink / raw)
  To: Rob Landley
  Cc: Theodore Tso, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig,
	Mark Lord, Michael Tokarev, david, NeilBrown, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Sun 2010-04-04 12:59:16, Rob Landley wrote:
> On Sunday 04 April 2010 08:47:29 Pavel Machek wrote:
> > Maybe there's time to revive the patch to increase the mount count by >1
> > when the journal is replayed, so that fsck runs more often when power
> > failures have occurred?
> 
> Wow, you mean there are Linux users left who _don't_ rip that out?

Yes, there are. It actually helped pinpoint corruption here; 4 times it
was major corruption.

And yes, I'd like fsck more often when there are power failures and
less often when the shutdowns are orderly...

I'm not sure what the right intervals between checks are for you, but
I'd say that an fsck once a year, or every 100 mounts, or every 10 power
failures is probably a good idea for everybody...
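
Two of those three criteria can already be expressed with tune2fs; only
the "every N power failures" part needs something new, which is what
the mount-count-bump-on-journal-replay idea would provide. For example
(device name is a placeholder):

    # force a check after at most 100 mounts or 12 months,
    # whichever comes first
    tune2fs -c 100 -i 12m /dev/sda1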

> The auto-fsck stuff is an instance of "we the developers know what you the 
> users need far more than you ever could, so let me ram this down your throat".  
> I don't know of a server anywhere that can afford an unscheduled extra four 
> hours of downtime due to the system deciding to fsck itself, and I don't know 
> a Linux laptop user anywhere who would be happy to fire up their laptop and 
> suddenly be told "oh, you can't do anything with it for two hours, and you 
> can't power it down either".

On a laptop the situation is easy. Pull the plug, hit reset, wait for
fsck, plug the AC back in. Done that, too :-).

Yep, it would be nice if fsck had an "escape" button.

> I'm all for btrfs coming along and being able to fsck itself behind my back 
> where I don't have to care about it.  (Although I want to tell it _not_ to do 
> that when on battery power.)  But the "fsck lottery" at powerup is just 
> stupid.

fsck lottery. :-).
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)
  2010-04-04 17:59                                                                                                 ` Rob Landley
  2010-04-04 18:45                                                                                                   ` Pavel Machek
@ 2010-04-04 19:29                                                                                                   ` tytso
  2010-04-04 23:58                                                                                                     ` Rob Landley
  1 sibling, 1 reply; 309+ messages in thread
From: tytso @ 2010-04-04 19:29 UTC (permalink / raw)
  To: Rob Landley
  Cc: Pavel Machek, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig,
	Mark Lord, Michael Tokarev, david, NeilBrown, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Sun, Apr 04, 2010 at 12:59:16PM -0500, Rob Landley wrote:
> I don't know of a server anywhere that can afford an unscheduled
> extra four hours of downtime due to the system deciding to fsck
> itself, and I don't know a Linux laptop user anywhere who would be
> happy to fire up their laptop and suddenly be told "oh, you can't do
> anything with it for two hours, and you can't power it down either".

So what I recommend for server class machines is to either turn off
the automatic fscks (it's the default, but it's documented and there
are supported ways of turning it off --- that's hardly developers
"ramming" it down users' throats), or, more preferably, to use LVM,
take a snapshot, and run fsck on the snapshot.

> I'm all for btrfs coming along and being able to fsck itself behind
> my back where I don't have to care about it.  (Although I want to
> tell it _not_ to do that when on battery power.)  

You can do this with ext3/ext4 today, now.  Just take a look at
e2croncheck in the contrib directory of e2fsprogs.  Changing it to not
do this when on battery power is a trivial exercise.  
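
For example, the cron job could simply bail out when the machine is on
battery; a guard along these lines would do (on_ac_power comes from the
powermgmt-base package, and the sysfs path varies between machines):

    # skip the background check when not running on AC power
    if command -v on_ac_power >/dev/null 2>&1; then
        on_ac_power || exit 0
    elif [ "$(cat /sys/class/power_supply/AC/online 2>/dev/null)" = "0" ]; then
        exit 0
    fi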

> My laptop power fails all the time, due to battery exhaustion.  Back
> under KDE it was decent about suspending when it ran low on
> power, but ever since KDE 4 came out and I had to switch to XFCE,
> it's using the gnome infrastructure, which collects funky statistics
> and heuristics but can never quite save them to disk because
> suddenly running out of power when it thinks it's got 20 minutes
> left doesn't give it the opportunity to save its database.  So it'll
> never auto-suspend, just suddenly die if I don't hit the button.

Hmm, why are you running on battery so often?  I make a point of
running connected to the AC mains whenever possible, because a Li-ion
battery only has about 200 full charge/discharge cycles in it, and
given the cost of Li-ion batteries, each charge/discharge cycle
basically costs a dollar.  So I only run on batteries when I
absolutely have to, and in practice it's rare that I dip below 30% or
so.

> As a result of one of these, two large media files in my "anime"
> subdirectory are not only crosslinked, but the common sector they
> share is bad.  (It ran out of power in the act of writing that
> sector.  I left it copying large files to the drive and forgot to
> plug it in, and it did the loud click emergency park and power down
> thing when the hardware voltage regulator tripped.)

So e2fsck would fix the cross-linking.  We do need to have some better
tools to do forced rewrite of sectors that have gone bad in a HDD.  It
can be done by using badblocks -n, but it requires translating the
sector number emitted by the device driver (which for some drivers is
relative to the beginning of the partition, and for others is relative
to the beginning of the disk).  It is possible to run badblocks -w on
the whole disk, of course, but it's better to just run it on the
specific block in question.
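
As a rough sketch of that procedure (the device and block number are
made up for illustration; the dd step destroys the contents of that one
block, so only use it on a block you have already written off):

    # non-destructive read-write test of just the suspect 4 KiB block
    badblocks -b 4096 -n /dev/sda 123456 123456

    # force the drive to remap a pending sector by rewriting it
    dd if=/dev/zero of=/dev/sda bs=4096 count=1 seek=123456 oflag=direct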

> I'm much more comfortable living with this until I can get a new laptop than 
> with the idea of running fsck on the system and letting it do who knows what 
> in response to something that is not actually a problem.

Well, it actually is a problem.  And there may be other problems
hiding that you're not aware of.  Running "badblocks -b 4096 -n" may
discover other blocks that have failed, and you can then decide
whether you want to let fsck fix things up.  If you don't, though,
it's probably not fair to blame ext3 or e2fsck for any future
failures (not that it's likely to stop you :-).

	      	   	       	       	   - Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)
  2010-04-04 18:45                                                                                                   ` Pavel Machek
@ 2010-04-04 19:35                                                                                                     ` tytso
  0 siblings, 0 replies; 309+ messages in thread
From: tytso @ 2010-04-04 19:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rob Landley, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig,
	Mark Lord, Michael Tokarev, david, NeilBrown, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Sun, Apr 04, 2010 at 08:45:46PM +0200, Pavel Machek wrote:
> 
> I'm not sure what the right intervals between checks are for you, but
> I'd say that an fsck once a year, or every 100 mounts, or every 10 power
> failures is probably a good idea for everybody...

For people using e2croncheck, where you can check it when the system
is idle and without needing to do a power cycle, I'd recommend once a
week, actually.

> > hours of downtime due to the system deciding to fsck itself, and I
> > don't know a Linux laptop user anywhere who would be happy to fire
> > up their laptop and suddenly be told "oh, you can't do anything
> > with it for two hours, and you can't power it down either".
> 
> On a laptop the situation is easy. Pull the plug, hit reset, wait for
> fsck, plug the AC back in. Done that, too :-).

Some distributions will allow you to cancel an fsck; either by using
^C, or hitting escape.  That's a matter for the boot scripts, which
are distribution specific.  Ubuntu has a way of doing this, for
example, if I recall correctly --- although since I've started using
e2croncheck, I've never had an issue with an e2fsck taking place on
bootup.  Also, ext4, fscks are so much much faster that even before I
upgraded to using an SSD, it's never been an issue for me.  It's
certainly not hours any more....

> Yep, it would be nice if fsck had an "escape" button.

Complain to your distribution.  :-)

Or this is Linux and open source; fix it yourself, and submit the
patches back to your distribution.  If all you want to do is whine,
then maybe Rob's choice is the best way, go switch to the velvet-lined
closed system/jail which is the Macintosh.  :-)

(I created e2croncheck to solve my problem; if that isn't good enough
for you, I encourage you to find/create your own fixes.)

							- Ted

^ permalink raw reply	[flat|nested] 309+ messages in thread

* Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)
  2010-04-04 19:29                                                                                                   ` tytso
@ 2010-04-04 23:58                                                                                                     ` Rob Landley
  0 siblings, 0 replies; 309+ messages in thread
From: Rob Landley @ 2010-04-04 23:58 UTC (permalink / raw)
  To: tytso
  Cc: Pavel Machek, Ric Wheeler, Krzysztof Halasa, Christoph Hellwig,
	Mark Lord, Michael Tokarev, david, NeilBrown, Florian Weimer,
	Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages,
	rdunlap, linux-doc, linux-ext4, corbet

On Sunday 04 April 2010 14:29:12 tytso@mit.edu wrote:
> On Sun, Apr 04, 2010 at 12:59:16PM -0500, Rob Landley wrote:
> > I don't know of a server anywhere that can afford an unscheduled
> > extra four hours of downtime due to the system deciding to fsck
> > itself, and I don't know a Linux laptop user anywhere who would be
> > happy to fire up their laptop and suddenly be told "oh, you can't do
> > anything with it for two hours, and you can't power it down either".
>
> So what I recommend for server class machines is to either turn off
> the automatic fscks (it's the default, but it's documented and there
> are supported ways of turning it off --- that's hardly developers
> "ramming" it down users' throats), or, more preferably, to use LVM,
> take a snapshot, and run fsck on the snapshot.

Turning off the automatic fsck is what I see people do, yes.
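
For reference, the supported way to do that is a one-liner (device name
is a placeholder):

    # disable both the mount-count and the time-based boot checks
    tune2fs -c 0 -i 0 /dev/sda1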

My point is that if you don't force the thing to run memtest86 overnight every 
20 boots, forcing it to run fsck seems a bit silly.

> > I'm all for btrfs coming along and being able to fsck itself behind
> > my back where I don't have to care about it.  (Although I want to
> > tell it _not_ to do that when on battery power.)
>
> You can do this with ext3/ext4 today, now.  Just take a look at
> e2croncheck in the contrib directory of e2fsprogs.  Changing it to not
> do this when on battery power is a trivial exercise.
>
> > My laptop power fails all the time, due to battery exhaustion.  Back
> > under KDE it was decent about suspending when it ran low on
> > power, but ever since KDE 4 came out and I had to switch to XFCE,
> > it's using the gnome infrastructure, which collects funky statistics
> > and heuristics but can never quite save them to disk because
> > suddenly running out of power when it thinks it's got 20 minutes
> > left doesn't give it the opportunity to save its database.  So it'll
> > never auto-suspend, just suddenly die if I don't hit the button.
>
> Hmm, why are you running on battery so often?

Personal working style?

When I was in Pittsburgh, I used the laptop on the bus to and from work every 
day.  Here in Austin, my laundromat has free wifi.  It also gets usable free 
wifi from the coffee shop to the right, the japanese restaurant to the left, and 
the ice cream shop across the street.  (And when I'm not in a wifi area, my 
cell phone can bluetooth associate to give me net access too.)

I like coffee shops.  (Of course the fact that if I try to work from home I 
have to fight off the affections of four cats might have something to do with it 
too...)

> I make a point of
> running connected to the AC mains whenever possible, because a Li-ion
> battery only has about 200 full charge/discharge cycles in it, and
> given the cost of Li-ion batteries, each charge/discharge cycle
> basically costs a dollar.

Actually the battery's about $50, so that would be 25 cents each.

My laptop is on its third battery.  It's also on its third hard drive.

> So I only run on batteries when I
> absolutely have to, and in practice it's rare that I dip below 30% or
> so.

Actually I find the suckers die just as quickly from simply being plugged in 
and kept hot by the electronics, never used, so they're pegged at 100% with a 
slight trickle current beyond that constantly overcharging them.

> > As a result of one of these, two large media files in my "anime"
> > subdirectory are not only crosslinked, but the common sector they
> > share is bad.  (It ran out of power in the act of writing that
> > sector.  I left it copying large files to the drive and forgot to
> > plug it in, and it did the loud click emergency park and power down
> > thing when the hardware voltage regulator tripped.)
>
> So e2fsck would fix the cross-linking.  We do need to have some better
> tools to do forced rewrite of sectors that have gone bad in a HDD.  It
> can be done by using badblocks -n, but it requires translating the
> sector number emitted by the device driver (which for some drivers is
> relative to the beginning of the partition, and for others is relative
> to the beginning of the disk).  It is possible to run badblocks -w on the
> whole disk, of course, but it's better to just run it on the specific
> block in question.

The point I was trying to make is that running "preemptive" fsck is imposing a 
significant burden on users in an attempt to find purely theoretical problems, 
with the expectation that a given run will _not_ find them.  I've had systems 
taken out by actual hardware issues often enough that keeping good backups and 
being prepared to lose the entire laptop at any time is just common sense.

I knocked my laptop into the bathtub last month.  Luckily there wasn't any 
water in the thing at the time, but it made a very loud bang when it hit, and 
it was on at the time.  (Checked dmesg several times over the next few days 
and it didn't start spitting errors at me, so that's something...)

> > I'm much more comfortable living with this until I can get a new laptop
> > than with the idea of running fsck on the system and letting it do who
> > knows what in response to something that is not actually a problem.
>
> Well, it actually is a problem.  And there may be other problems
> hiding that you're not aware of.  Running "badblocks -b 4096 -n" may
> discover other blocks that have failed, and you can then decide
> whether you want to let fsck fix things up.  If you don't, though,
> it's probably not fair to blame ext3 or e2fsck for any future
> failures (not that it's likely to stop you :-).

I'm not blaming ext2.  I'm saying I've spilled sodas into my working machines 
on so many occasions over the years I've lost _track_.  (The vast majority of 
'em survived, actually.)

Random example of current cascading badness: The latch sensor on my laptop is 
no longer debounced.  That happened when I upgraded to Ubuntu 9.04 but I'm not 
sure how that _can_ screw that up, you'd think the bios would be in charge of 
that.  So anyway, it now has a nasty habit of waking itself up in the nice 
insulated pocket in my backpack and then shutting itself down hard five minutes 
later when the thermal sensors trip (at the bios level I think, not in the 
OS).  So I now regularly suspend to disk instead of to ram because that way it 
can't spuriously wake itself back up just because it got jostled slightly.  
Except that when it resumes from disk, the console it suspended in is totally 
misprogrammed (vertical lines on what it _thinks_ is text mode), and sometimes 
the chip is so horked I can hear the sucker making a screeching noise.  The 
easy workaround is to Ctrl-Alt-F1 and suspend from a text console, then Ctrl-
Alt-F7 gets me back to the desktop.  But going back to that text console 
remembers the misprogramming, and I get vertical lines and an audible whine 
coming from something that isn't a speaker.  (Luckily cursor-up and enter works 
to re-suspend, so I can just sacrifice one console to the suspend bug.)

The _fun_ part is that the last system I had where X11 regularly misprogrammed 
it so badly I could _hear_ the video chip, said video chip eventually 
overheated and melted bits of the motherboard.  (That was a toshiba laptop.  
It took out the keyboard controller first, and I used it for a few months with 
an external keyboard until the whole thing just went one day.  The display you 
get when your video chip finally goes can be pretty impressive.  Way prettier 
than the time I was caught in a thunderstorm and my laptop got soaked and two 
vertical sections of the display were flickering white while the rest was 
displaying normally -- that system actually started working again when it dried 
out...)

It just wouldn't be a Linux box to me if I didn't have workarounds for the 
side effects of my workarounds.

Anyway, this is the perspective from which I say that the fsck to look for 
purely theoretical badness on my otherwise perfect system is not worth 2 hours 
to never find anything wrong.

If Ubuntu's little upgrade icon had a "recommend fsck" thing that lights up 
every 3 months which I could hit some weekend when I was going out anyway, 
that would be one thing.  But "Ah, Ubuntu 9.04 moved DRM from X11 into the 
kernel and the Intel 945 3D driver is now psychotic and it froze your machine 
for the second time this week.  Since you're rebooting anyway, you won't mind 
if I add an extra 3 hours to the process"...?  That stopped really being a 
viable assumption some time before hard drives were regularly measured in 
terabytes.

> 	      	   	       	       	   - Ted

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds

^ permalink raw reply	[flat|nested] 309+ messages in thread

end of thread, other threads:[~2010-04-04 23:58 UTC | newest]

Thread overview: 309+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-12  9:21 ext2/3: document conditions when reliable operation is possible Pavel Machek
2009-03-12 11:40 ` Jochen Voß
2009-03-21 11:24   ` Pavel Machek
2009-03-21 11:24     ` Pavel Machek
2009-03-12 19:13 ` Rob Landley
2009-03-16 12:28   ` Pavel Machek
2009-03-16 19:26     ` Rob Landley
2009-03-23 10:45       ` Pavel Machek
2009-03-30 15:06         ` Goswin von Brederlow
2009-08-24  9:26           ` Pavel Machek
2009-08-24  9:31           ` [patch] " Pavel Machek
2009-08-24 11:19             ` Florian Weimer
2009-08-24 13:01               ` Theodore Tso
2009-08-24 14:55                 ` Artem Bityutskiy
2009-08-24 22:30                   ` Rob Landley
2009-08-24 19:52                 ` Pavel Machek
2009-08-24 19:52                   ` Pavel Machek
2009-08-24 20:24                   ` Ric Wheeler
2009-08-24 20:52                     ` Pavel Machek
2009-08-24 21:08                       ` Ric Wheeler
2009-08-24 21:25                         ` Pavel Machek
2009-08-24 22:05                           ` Ric Wheeler
2009-08-24 22:22                             ` Zan Lynx
2009-08-24 22:44                               ` Pavel Machek
2009-08-25  0:34                                 ` Ric Wheeler
2009-08-24 23:42                               ` david
2009-08-24 22:41                             ` Pavel Machek
2009-08-24 22:39                           ` Theodore Tso
2009-08-24 23:00                             ` Pavel Machek
2009-08-25  0:02                               ` david
2009-08-25  9:32                                 ` Pavel Machek
2009-08-25  0:06                               ` Ric Wheeler
2009-08-25  9:34                                 ` Pavel Machek
2009-08-25 15:34                                   ` david
2009-08-26  3:32                                   ` Rik van Riel
2009-08-26 11:17                                     ` Pavel Machek
2009-08-26 11:29                                       ` david
2009-08-26 13:10                                         ` Pavel Machek
2009-08-26 13:43                                           ` david
2009-08-26 18:02                                             ` Theodore Tso
2009-08-27  6:28                                               ` Eric Sandeen
2009-08-27  6:28                                                 ` Eric Sandeen
2009-11-09  8:53                                               ` periodic fsck was " Pavel Machek
2009-11-09  8:53                                               ` Pavel Machek
2009-11-09 14:05                                                 ` Theodore Tso
2009-11-09 15:58                                                   ` Andreas Dilger
2009-08-30  7:03                                             ` Pavel Machek
2009-08-26 12:28                                       ` Theodore Tso
2009-08-27  6:06                                         ` Rob Landley
2009-08-27  6:54                                           ` david
2009-08-27  7:34                                             ` Rob Landley
2009-08-28 14:37                                               ` david
2009-08-30  7:19                                             ` Pavel Machek
2009-08-30 12:48                                               ` david
2009-08-27  5:27                                     ` Rob Landley
2009-08-25  0:08                               ` Theodore Tso
2009-08-25  9:42                                 ` Pavel Machek
2009-08-25  9:42                                 ` Pavel Machek
2009-08-25 13:37                                   ` Ric Wheeler
2009-08-25 13:42                                     ` Alan Cox
2009-08-27  3:16                                       ` Rob Landley
2009-08-25 21:15                                     ` Pavel Machek
2009-08-25 22:42                                       ` Ric Wheeler
2009-08-25 22:51                                         ` Pavel Machek
2009-08-25 23:03                                           ` david
2009-08-25 23:29                                             ` Pavel Machek
2009-08-25 23:03                                           ` Ric Wheeler
2009-08-25 23:26                                             ` Pavel Machek
2009-08-25 23:40                                               ` Ric Wheeler
2009-08-25 23:48                                                 ` david
2009-08-25 23:53                                                 ` Pavel Machek
2009-08-26  0:11                                                   ` Ric Wheeler
2009-08-26  0:16                                                     ` Pavel Machek
2009-08-26  0:31                                                       ` Ric Wheeler
2009-08-26  1:00                                                         ` Theodore Tso
2009-08-26  1:15                                                           ` Ric Wheeler
2009-08-26  2:58                                                             ` Theodore Tso
2009-08-26 10:39                                                               ` Ric Wheeler
2009-08-26 10:39                                                               ` Ric Wheeler
2009-08-26 11:12                                                                 ` Pavel Machek
2009-08-26 11:28                                                                   ` david
2009-08-29  9:49                                                                     ` [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek
2009-08-29 11:28                                                                       ` Ric Wheeler
2009-09-02 20:12                                                                         ` Pavel Machek
2009-09-02 20:42                                                                           ` Ric Wheeler
2009-09-02 23:00                                                                             ` Rob Landley
2009-09-02 23:09                                                                               ` david
2009-09-03  8:55                                                                                 ` Pavel Machek
2009-09-03  0:36                                                                               ` jim owens
2009-09-03  2:41                                                                                 ` Rob Landley
2009-09-03 14:14                                                                                   ` jim owens
2009-09-04  7:44                                                                                     ` Rob Landley
2009-09-04 11:49                                                                                       ` Ric Wheeler
2009-09-05 10:28                                                                                         ` Pavel Machek
2009-09-05 12:20                                                                                           ` Ric Wheeler
2009-09-05 13:54                                                                                           ` Jonathan Corbet
2009-09-05 21:27                                                                                             ` Pavel Machek
2009-09-05 21:56                                                                                               ` Theodore Tso
2009-09-02 22:45                                                                           ` Rob Landley
2009-09-02 22:49                                                                           ` [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case Rob Landley
2009-09-03  9:08                                                                             ` Pavel Machek
2009-09-03 12:05                                                                             ` Ric Wheeler
2009-09-03 12:31                                                                               ` Pavel Machek
2009-08-29 16:35                                                                       ` [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible) david
2009-08-29 16:35                                                                         ` david
2009-08-30  7:07                                                                         ` Pavel Machek
2009-08-26 12:01                                                                   ` [patch] ext2/3: document conditions when reliable operation is possible Ric Wheeler
2009-08-26 12:23                                                                   ` Theodore Tso
2009-08-30  7:01                                                                     ` Pavel Machek
2009-08-30  7:01                                                                     ` Pavel Machek
2009-08-27  5:19                                                               ` Rob Landley
2009-08-27 12:24                                                                 ` Theodore Tso
2009-08-27 13:10                                                                   ` Ric Wheeler
2009-08-27 13:10                                                                   ` Ric Wheeler
2009-08-27 16:54                                                                     ` MD/DM and barriers (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Jeff Garzik
2009-08-27 18:09                                                                       ` Alasdair G Kergon
2009-09-01 14:01                                                                       ` Pavel Machek
2009-09-02 16:17                                                                         ` Michael Tokarev
2009-08-29 10:02                                                                   ` [patch] ext2/3: document conditions when reliable operation is possible Pavel Machek
2009-08-29 10:02                                                                   ` Pavel Machek
2009-08-26  1:15                                                           ` Ric Wheeler
2009-08-26  1:16                                                           ` Pavel Machek
2009-08-26  1:16                                                           ` Pavel Machek
2009-08-26  2:55                                                             ` Theodore Tso
2009-08-26 13:37                                                               ` Ric Wheeler
2009-08-26 13:37                                                               ` Ric Wheeler
2009-08-26  2:53                                                           ` Henrique de Moraes Holschuh
2009-08-26  2:53                                                           ` Henrique de Moraes Holschuh
2009-09-03  9:47                                                           ` Pavel Machek
2009-09-03  9:47                                                             ` Pavel Machek
2009-08-26  3:50                                                   ` Rik van Riel
2009-08-27  3:53                                                 ` Rob Landley
2009-08-27 11:43                                                   ` Ric Wheeler
2009-08-27 20:51                                                     ` Rob Landley
2009-08-27 22:00                                                       ` Ric Wheeler
2009-08-28 14:49                                                       ` david
2009-08-29 10:05                                                         ` Pavel Machek
2009-08-29 20:22                                                           ` Rob Landley
2009-08-29 21:34                                                             ` Pavel Machek
2009-09-03 16:56                                                             ` what fsck can (and can't) do was " david
2009-09-03 19:27                                                               ` Theodore Tso
2009-08-27 22:13                                                     ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Pavel Machek
2009-08-28  1:32                                                       ` Ric Wheeler
2009-08-28  6:44                                                         ` Pavel Machek
2009-08-28  7:31                                                           ` NeilBrown
2009-08-28  7:31                                                             ` NeilBrown
2009-11-09 10:50                                                             ` Pavel Machek
2009-08-28 11:16                                                           ` Ric Wheeler
2009-09-01 13:58                                                             ` Pavel Machek
2009-08-28  7:11                                                         ` raid is dangerous but that's secret Florian Weimer
2009-08-28  7:23                                                           ` NeilBrown
2009-08-28 12:08                                                         ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) Theodore Tso
2009-08-30  7:51                                                           ` Pavel Machek
2009-08-30  9:01                                                             ` Christian Kujau
2009-09-02 20:55                                                               ` Pavel Machek
2009-08-30 12:55                                                             ` david
2009-08-30 14:12                                                               ` Ric Wheeler
2009-08-30 14:44                                                                 ` Michael Tokarev
2009-08-30 16:10                                                                   ` Ric Wheeler
2009-08-30 16:35                                                                   ` Christoph Hellwig
2009-08-31 13:15                                                                     ` Ric Wheeler
2009-08-31 13:16                                                                       ` Christoph Hellwig
2009-08-31 13:19                                                                         ` Mark Lord
2009-08-31 13:21                                                                           ` Christoph Hellwig
2009-08-31 15:14                                                                             ` jim owens
2009-09-03  1:59                                                                             ` Ric Wheeler
2009-09-03 11:12                                                                               ` Krzysztof Halasa
2009-09-03 11:18                                                                                 ` Ric Wheeler
2009-09-03 13:34                                                                                   ` Krzysztof Halasa
2009-09-03 13:50                                                                                     ` Ric Wheeler
2009-09-03 13:59                                                                                       ` Krzysztof Halasa
2009-09-03 14:15                                                                                         ` wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage Ric Wheeler
2009-09-03 14:26                                                                                           ` Florian Weimer
2009-09-03 15:09                                                                                             ` Ric Wheeler
2009-09-03 23:50                                                                                           ` Krzysztof Halasa
2009-09-04  0:39                                                                                             ` Ric Wheeler
2009-09-04 21:21                                                                                           ` Mark Lord
2009-09-04 21:29                                                                                             ` Ric Wheeler
2009-09-05 12:57                                                                                               ` Mark Lord
2009-09-05 13:40                                                                                                 ` Ric Wheeler
2009-09-05 21:43                                                                                                 ` NeilBrown
2009-09-07 11:45                                                                                           ` Pavel Machek
2009-09-07 13:10                                                                                             ` Theodore Tso
2010-04-04 13:47                                                                                               ` fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage) Pavel Machek
2010-04-04 13:47                                                                                                 ` Pavel Machek
2010-04-04 17:39                                                                                                 ` tytso
2010-04-04 17:59                                                                                                 ` Rob Landley
2010-04-04 18:45                                                                                                   ` Pavel Machek
2010-04-04 19:35                                                                                                     ` tytso
2010-04-04 19:29                                                                                                   ` tytso
2010-04-04 23:58                                                                                                     ` Rob Landley
2009-09-03 14:35                                                                                     ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) david
2009-08-31 13:22                                                                         ` Ric Wheeler
2009-08-31 15:50                                                                           ` david
2009-08-31 16:21                                                                             ` Ric Wheeler
2009-08-31 18:31                                                                             ` Christoph Hellwig
2009-08-31 19:11                                                                               ` david
2009-08-30 15:05                                                               ` Pavel Machek
2009-08-30 15:20                                                             ` Theodore Tso
2009-08-31 17:49                                                               ` Jesse Brandeburg
2009-08-31 18:01                                                                 ` Ric Wheeler
2009-08-31 21:01                                                                   ` MD5/6? (was Re: raid is dangerous but that's secret ...) Ron Johnson
2009-08-31 18:07                                                                 ` raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible) martin f krafft
2009-08-31 22:26                                                                   ` Jesse Brandeburg
2009-08-31 22:26                                                                     ` Jesse Brandeburg
2009-08-31 23:19                                                                     ` Ron Johnson
2009-09-01  5:45                                                                     ` martin f krafft
2009-08-31 17:49                                                               ` Jesse Brandeburg
2009-09-05 10:34                                                               ` Pavel Machek
2009-09-05 10:34                                                               ` Pavel Machek
2009-08-30  7:51                                                           ` Pavel Machek
2009-08-25 23:46                                               ` [patch] ext2/3: document conditions when reliable operation is possible david
2009-08-25 23:08                                       ` Neil Brown
2009-08-25 23:44                                         ` Pavel Machek
2009-08-26  4:08                                           ` Rik van Riel
2009-08-26 11:15                                             ` Pavel Machek
2009-08-27  3:29                                               ` Rik van Riel
2009-08-25 16:11                                   ` Theodore Tso
2009-08-25 22:21                                     ` [patch] document flash/RAID dangers Pavel Machek
2009-08-25 22:21                                       ` Pavel Machek
2009-08-25 22:33                                       ` david
2009-08-25 22:40                                         ` Pavel Machek
2009-08-25 22:59                                           ` david
2009-08-25 23:37                                             ` Pavel Machek
2009-08-25 23:48                                               ` Ric Wheeler
2009-08-26  0:06                                                 ` Pavel Machek
2009-08-26  0:12                                                   ` Ric Wheeler
2009-08-26  0:20                                                     ` Pavel Machek
2009-08-26  0:26                                                       ` david
2009-08-26  0:28                                                       ` Ric Wheeler
2009-08-26  0:38                                                         ` Pavel Machek
2009-08-26  0:45                                                           ` Ric Wheeler
2009-08-26 11:21                                                             ` Pavel Machek
2009-08-26 11:58                                                               ` Ric Wheeler
2009-08-26 12:40                                                                 ` Theodore Tso
2009-08-26 13:11                                                                   ` Ric Wheeler
2009-08-26 13:11                                                                   ` Ric Wheeler
2009-08-26 13:44                                                                     ` david
2009-08-26 13:40                                                                   ` Chris Adams
2009-08-26 13:47                                                                     ` Alan Cox
2009-08-26 14:11                                                                       ` Chris Adams
2009-08-27 21:50                                                                     ` Pavel Machek
2009-08-29  9:38                                                                 ` Pavel Machek
2009-08-26  4:24                                                       ` Rik van Riel
2009-08-26 11:22                                                         ` Pavel Machek
2009-08-26 14:45                                                           ` Rik van Riel
2009-08-29  9:39                                                             ` Pavel Machek
2009-08-29 11:47                                                               ` Ron Johnson
2009-08-29 16:12                                                                 ` jim owens
2009-08-25 23:56                                               ` david
2009-08-26  0:12                                                 ` Pavel Machek
2009-08-26  0:20                                                   ` david
2009-08-26  0:39                                                     ` Pavel Machek
2009-08-26  1:17                                                       ` david
2009-08-26  0:26                                                   ` Ric Wheeler
2009-08-26  0:44                                                     ` Pavel Machek
2009-08-26  0:50                                                       ` Ric Wheeler
2009-08-26  1:19                                                       ` david
2009-08-26 11:25                                                         ` Pavel Machek
2009-08-26 12:37                                                           ` Theodore Tso
2009-08-30  6:49                                                             ` Pavel Machek
2009-08-30  6:49                                                             ` Pavel Machek
2009-08-26  4:20                                           ` Rik van Riel
2009-08-25 22:27                                     ` [patch] document that ext2 can't handle barriers Pavel Machek
2009-08-25 22:27                                     ` Pavel Machek
2009-08-27  3:34                                 ` [patch] ext2/3: document conditions when reliable operation is possible Rob Landley
2009-08-27  8:46                                 ` David Woodhouse
2009-08-28 14:46                                   ` david
2009-08-29 10:09                                     ` Pavel Machek
2009-08-29 16:27                                       ` david
2009-08-29 21:33                                         ` Pavel Machek
2009-08-24 23:00                             ` Pavel Machek
2009-08-25 13:57                             ` Chris Adams
2009-08-25 22:58                             ` Neil Brown
2009-08-25 23:10                               ` Ric Wheeler
2009-08-25 23:32                                 ` NeilBrown
2009-08-25 23:32                                   ` NeilBrown
2009-08-24 21:11                       ` Greg Freemyer
2009-08-24 21:11                         ` Greg Freemyer
2009-08-25 20:56                         ` Rob Landley
2009-08-25 21:08                           ` david
2009-08-25 18:52                     ` Rob Landley
2009-08-25 14:43                 ` Florian Weimer
2009-08-24 13:50               ` Theodore Tso
2009-08-24 18:48                 ` Pavel Machek
2009-08-24 18:48                 ` Pavel Machek
2009-08-24 18:39               ` Pavel Machek
2009-08-24 13:21             ` Greg Freemyer
2009-08-24 13:21               ` Greg Freemyer
2009-08-24 18:44               ` Pavel Machek
2009-08-25 23:28               ` Neil Brown
2009-08-25 23:28                 ` Neil Brown
2009-08-26  1:34                 ` david
2009-08-24 21:11             ` Rob Landley
2009-08-24 21:33               ` Pavel Machek
2009-08-25 18:45                 ` Jan Kara
2009-03-16 12:30   ` Pavel Machek
2009-03-16 19:03     ` Theodore Tso
2009-03-23 18:23       ` Pavel Machek
2009-03-23 18:23         ` Pavel Machek
2009-03-16 19:40     ` Sitsofe Wheeler
2009-03-16 21:43       ` Rob Landley
2009-03-17  4:55         ` Kyle Moffett
2009-03-23 11:00       ` Pavel Machek
2009-08-29  1:33   ` Robert Hancock
2009-08-29 13:04     ` Alan Cox
2009-03-16 19:45 ` Greg Freemyer
2009-03-16 19:45   ` Greg Freemyer
2009-03-16 21:48   ` Pavel Machek