* writing file to disk: not as easy as it looks
@ 2008-12-02  9:40 Pavel Machek
  2008-12-02 14:04 ` Theodore Tso
  2008-12-02 23:01 ` Mikulas Patocka
  0 siblings, 2 replies; 25+ messages in thread
From: Pavel Machek @ 2008-12-02  9:40 UTC (permalink / raw)
  To: mikulas, clock, kernel list, aviro

Actually, it looks like the POSIX file interface is on the lowest step of
Rusty's scale: one that is impossible to use correctly. Yes, it seems
impossible to reliably and safely write a file to disk under Linux. Double
plus uncool.

So... how do you write a file to disk and wait for it to reach stable
storage, with proper error handling?

> file

	...does not work, because it fails to check for errors.

touch file || error_handling

	Is not a lot better, unless you mount your filesystems "sync"
	... and no one does that.

dd conv=fsync if=something of=file 2> /dev/null || error_handling

	Is a bit better; not much, unless you mount your filesystems
	"dirsync", because you have the file data on disk, but there is no
	directory entry pointing to it. No one uses dirsync.

	So you need something like

dd conv=fsync if=something of=file 2> /dev/null || error_handling
fsync . || error_handling
fsync .. || error_handling
fsync ../.. || error_handling
fsync ../../.. || error_handling

	... which mostly works... if you are alone on the filesystem.
	fsync only returns errors to the first process that asks, so if
	some other process does fsync ., it may get "your" error and you
	never learn of the problem.
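
For reference, the same dance in C looks roughly like this --- a sketch
only: write_file_durably is a made-up name, error handling is minimal,
and it assumes the file lives in the current directory:

	/* write data, sync it, then sync the directory entry */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	static void write_file_durably(const char *path, const char *buf, size_t len)
	{
		int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) { perror("open"); exit(1); }
		if (write(fd, buf, len) != (ssize_t)len) { perror("write"); exit(1); }
		if (fsync(fd) < 0) { perror("fsync file"); exit(1); }	/* data + inode */
		if (close(fd) < 0) { perror("close"); exit(1); }

		int dirfd = open(".", O_RDONLY);	/* the containing directory */
		if (dirfd < 0) { perror("open dir"); exit(1); }
		if (fsync(dirfd) < 0) { perror("fsync dir"); exit(1); }	/* the dentry */
		close(dirfd);
	}

...and even this is subject to the error-stealing race above.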

The question is... Is there a way that I missed and that actually works?
									Pavel	
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: writing file to disk: not as easy as it looks
  2008-12-02  9:40 writing file to disk: not as easy as it looks Pavel Machek
@ 2008-12-02 14:04 ` Theodore Tso
  2008-12-02 15:26   ` Pavel Machek
  2008-12-02 23:01 ` Mikulas Patocka
  1 sibling, 1 reply; 25+ messages in thread
From: Theodore Tso @ 2008-12-02 14:04 UTC (permalink / raw)
  To: Pavel Machek; +Cc: mikulas, clock, kernel list, aviro

On Tue, Dec 02, 2008 at 10:40:59AM +0100, Pavel Machek wrote:
> Actually, it looks like the POSIX file interface is on the lowest step of
> Rusty's scale: one that is impossible to use correctly. Yes, it seems
> impossible to reliably and safely write a file to disk under Linux. Double
> plus uncool.
> 
> So... how do you write a file to disk and wait for it to reach stable
> storage, with proper error handling?

Are you trying to do this in C or shell?  There is no "fsync" shell
command as far as I know, which is what is confusing me.  And whether
"> file" checks for errors or not obviously depends on the application
which is writing to stdout.  Some might check for errors, some might
not....

Why do you feel the need to error check "fsync ../.." and "fsync
../../..", et al.?

I can understand why you might want to fsync the containing directory
to make sure the directory entry got written to disk --- but if you're
that paranoid, consider that many modern filesystems use some kind of tree
structure for the directory, and there is always the chance that a second
later, in a b-tree node split, the directory entry gets lost due to a
disk error.

What exactly are your requirements here, and what are you trying to
do?  What are you worried about?  Most MTA's are quite happy settling
with an fsync() to make sure the data made it to the disk safely and
the super-paranoid might also keep an open fd on the spool directory
and fsync that too.  That's been enough for most POSIX programs.

More generally, if you have a higher need for making sure, most system
administrators will spend effort robustifying the storage layer (i.e.,
RAID, battery-backed journals, etc.) rather than obsessing over some
API that can tell an application --- "you know that file you just
finished writing 50 milliseconds ago?  Well, another application
created 100 files, which forced a b-tree node split, and
golly-gee-willickers, when I tried to modify the directory to
accommodate the node split, we ended up losing 50 directory entries,
including that file you just finished writing, fsyncing, and
closing...."

						- Ted


* Re: writing file to disk: not as easy as it looks
  2008-12-02 14:04 ` Theodore Tso
@ 2008-12-02 15:26   ` Pavel Machek
  2008-12-02 16:37     ` Theodore Tso
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2008-12-02 15:26 UTC (permalink / raw)
  To: Theodore Tso, mikulas, clock, kernel list, aviro

On Tue 2008-12-02 09:04:39, Theodore Tso wrote:
> On Tue, Dec 02, 2008 at 10:40:59AM +0100, Pavel Machek wrote:
> > Actually, it looks like the POSIX file interface is on the lowest step of
> > Rusty's scale: one that is impossible to use correctly. Yes, it seems
> > impossible to reliably and safely write a file to disk under Linux. Double
> > plus uncool.
> > 
> > So... how do you write a file to disk and wait for it to reach stable
> > storage, with proper error handling?
> 
> Are you trying to do this in C or shell?  There is no "fsync" shell
> command as far as I know, which is what is confusing me.  And whether
> "> file" checks for errors or not obviously depends on the application
> which is writing to stdout.  Some might check for errors, some might
> not....

True. I'd prefer to use shell, but C is okay, too. An 'fsync' shell
command seems to exist on openSUSE; sorry for the confusion.
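
(It is a trivial wrapper around the syscall; roughly the following
sketch:

	/* minimal fsync(1)-style utility */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		if (argc != 2) { fprintf(stderr, "usage: fsync <path>\n"); return 1; }
		int fd = open(argv[1], O_RDONLY);
		if (fd < 0) { perror("open"); return 1; }
		if (fsync(fd) < 0) { perror("fsync"); return 1; }
		return close(fd) < 0;
	}

The exit status is all the error reporting you get.)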

> Why do you feel the need to error check "fsync ../.." and "fsync
> ../../..", et al.?

> I can understand why you might want to fsync the containing directory
> to make sure the directory entry got written to disk --- but if you're
> that paranoid, consider that many modern filesystems use some kind of tree
> structure

If I'm trying to write foo/bar/baz/file, and the file/baz inodes/dentries
are written to disk, but foo is not, the file still will not be found
under its full name --- and recovering it from lost+found is hard to do
automatically.

> for the directory, and there is always the chance that a second later,
> in a b-tree node split, the directory entry gets lost due to a disk
> error.

If the disk loses data after acknowledging the write, all hope is lost.
Otherwise I expect the filesystem to preserve the data I successfully synced.

     (In the failed b-tree split case I'd expect the transaction commit to
     fail because the new data could not be written; at that point
     disk+journal should still contain all the data needed for
     recovery of synced/old files, right?)

> What exactly are your requirements here, and what are you trying to
> do?  What are you worried about?  Most MTA's are quite happy
> settling

I'm trying to put my main filesystem on an SD card. The hp2133 has only 4GB
of internal flash, so I got a 32GB SDHC card. Unfortunately, the SD card on
the hp is very easy to eject by mistake.

> with an fsync() to make sure the data made it to the disk safely and
> the super-paranoid might also keep an open fd on the spool directory
> and fsync that too.  That's been enough for most POSIX programs.

Well.. I believe those POSIX programs are unsafe on removable media.

mta #1                                  mta #2

cat > mail1
fsync mail1
                                        cat > mail2
                                        fsync mail2
                (spool media removed)
                                        fsync . -> ERROR
                                        correctly reports
                                        mail2 as undelivered
fsync . -> success; the first fsync
           cleared the error condition


I'm trying to figure out why I'm losing data on flash. So far it
seems that both SD cards and USB flash disks have problems, and that
ext2/3 have problems... and that the combination of ext2/3 + flash can't
even work in theory :-(.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: writing file to disk: not as easy as it looks
  2008-12-02 15:26   ` Pavel Machek
@ 2008-12-02 16:37     ` Theodore Tso
  2008-12-02 17:22       ` Chris Friesen
  2008-12-02 19:10       ` Folkert van Heusden
  0 siblings, 2 replies; 25+ messages in thread
From: Theodore Tso @ 2008-12-02 16:37 UTC (permalink / raw)
  To: Pavel Machek; +Cc: mikulas, clock, kernel list, aviro

On Tue, Dec 02, 2008 at 04:26:18PM +0100, Pavel Machek wrote:
> > I can understand why you might want to fsync the containing directory
> > to make sure the directory entry got written to disk --- but if you're
> > that paranoid, consider that many modern filesystems use some kind of tree
> > structure
> 
> If I'm trying to write foo/bar/baz/file, and the file/baz inodes/dentries
> are written to disk, but foo is not, the file still will not be found
> under its full name --- and recovering it from lost+found is hard to do
> automatically.

Only if you've freshly created the foo/bar/baz directories...  If you
have, then yes, you'll need to sync each one.  Paranoid programs normally
do this after each mkdir call, though.
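
Something along these lines --- a sketch, with a made-up helper name;
for foo/bar/baz you would call it once per component:

	/* create a directory, then make the new entry durable */
	#include <fcntl.h>
	#include <sys/stat.h>
	#include <unistd.h>

	static int mkdir_durable(const char *parent, const char *path)
	{
		if (mkdir(path, 0755) < 0)
			return -1;
		int fd = open(parent, O_RDONLY);	/* the containing directory */
		if (fd < 0)
			return -1;
		int err = fsync(fd);			/* flush the new dentry */
		close(fd);
		return err;
	}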

For ext3/ext4, because of the entangled commit factor, fsync()'ing
the file is sufficient, but that's not something you can properly
count upon.

> If the disk loses data after acknowledging the write, all hope is lost.
> Otherwise I expect the filesystem to preserve the data I successfully synced.
> 
>      (In the failed b-tree split case I'd expect the transaction commit to
>      fail because the new data could not be written; at that point
>      disk+journal should still contain all the data needed for
>      recovery of synced/old files, right?)

Not necessarily.  For filesystems that do logical journalling (i.e.,
xfs, jfs, et al.), the only thing written in the journal is the
logical change (i.e., "new dir entry 'file_that_causes_the_node_split'").

The transaction commits *first*, and then the filesystem tries to
update the filesystem with the change, and it's only then that the
write fails.  Data can very easily get lost.

Even for ext3/ext4 which is doing physical journalling, it's still the
case that the journal commits first, and it's only later when the
write happens that we write out the change.  If the disk fails some of
the writes, it's possible to lose data, especially if the two blocks
involved in the node split are far apart, and the write to the
existing old btree block fails.

> > What exactly are your requirements here, and what are you trying to
> > do?  What are you worried about?  Most MTA's are quite happy
> > settling
> 
> I'm trying to put my main filesystem on an SD card. The hp2133 has only 4GB
> of internal flash, so I got a 32GB SDHC card. Unfortunately, the SD card on
> the hp is very easy to eject by mistake.

So what you really want is some way of constantly flushing data to the
disk, probably after every single mkdir, every single close operation.
Of course, that has the tradeoff that your flash card will get a lot of
extra wear.  I hate to say this, but have you considered something
like tape or velcro to secure the SD card?

						- Ted


* Re: writing file to disk: not as easy as it looks
  2008-12-02 16:37     ` Theodore Tso
@ 2008-12-02 17:22       ` Chris Friesen
  2008-12-02 20:55         ` Theodore Tso
  2008-12-02 19:10       ` Folkert van Heusden
  1 sibling, 1 reply; 25+ messages in thread
From: Chris Friesen @ 2008-12-02 17:22 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Pavel Machek, mikulas, clock, kernel list, aviro

Theodore Tso wrote:

> Even for ext3/ext4 which is doing physical journalling, it's still the
> case that the journal commits first, and it's only later when the
> write happens that we write out the change.  If the disk fails some of
> the writes, it's possible to lose data, especially if the two blocks
> involved in the node split are far apart, and the write to the
> existing old btree block fails.

Yikes.  I was under the impression that once the journal hit the platter 
then the data were safe (barring media corruption).

It seems like the more I learn about filesystems, the more failure modes 
there are and the fewer guarantees can be made.  It's amazing that 
things work as well as they do...

Chris


* Re: writing file to disk: not as easy as it looks
  2008-12-02 16:37     ` Theodore Tso
  2008-12-02 17:22       ` Chris Friesen
@ 2008-12-02 19:10       ` Folkert van Heusden
  1 sibling, 0 replies; 25+ messages in thread
From: Folkert van Heusden @ 2008-12-02 19:10 UTC (permalink / raw)
  To: Theodore Tso, Pavel Machek, mikulas, clock, kernel list, aviro

> > If the disk loses data after acknowledging the write, all hope is lost.
> > Otherwise I expect the filesystem to preserve the data I successfully synced.
> >      (In the failed b-tree split case I'd expect the transaction commit to
> >      fail because the new data could not be written; at that point
> >      disk+journal should still contain all the data needed for
> >      recovery of synced/old files, right?)
> 
> Not necessarily.  For filesystems that do logical journalling (i.e.,
> xfs, jfs, et al.), the only thing written in the journal is the
> logical change (i.e., "new dir entry 'file_that_causes_the_node_split'").

> The transaction commits *first*, and then the filesystem tries to
> update the filesystem with the change, and it's only then that the
> write fails.  Data can very easily get lost.

> Even for ext3/ext4 which is doing physical journalling, it's still the

So do I understand this right that ext3/4 are more robust?


Folkert van Heusden

-- 
MultiTail is a versatile tool for watching logfiles and output of
commands. Filtering, coloring, merging, diff-view, etc.
http://www.vanheusden.com/multitail/
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com


* Re: writing file to disk: not as easy as it looks
  2008-12-02 17:22       ` Chris Friesen
@ 2008-12-02 20:55         ` Theodore Tso
  2008-12-02 22:44           ` Pavel Machek
  2008-12-15 11:03           ` Pavel Machek
  0 siblings, 2 replies; 25+ messages in thread
From: Theodore Tso @ 2008-12-02 20:55 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Pavel Machek, mikulas, clock, kernel list, aviro

On Tue, Dec 02, 2008 at 11:22:58AM -0600, Chris Friesen wrote:
> Theodore Tso wrote:
>
>> Even for ext3/ext4 which is doing physical journalling, it's still the
>> case that the journal commits first, and it's only later when the
>> write happens that we write out the change.  If the disk fails some of
>> the writes, it's possible to lose data, especially if the two blocks
>> involved in the node split are far apart, and the write to the
>> existing old btree block fails.
>
> Yikes.  I was under the impression that once the journal hit the platter  
> then the data were safe (barring media corruption).

Well, this is a case of media corruption (or a cosmic ray hitting
a ribbon cable in the disk controller sending the write to the
wrong location on disk, or someone bumping the server causing the disk
head to lift up a little higher than normal while it was writing the
disk sector, etc.).  But it is a case of the hard drive misbehaving. 

Heck, if you have a hiccup while writing an inode table block out to
disk (for example a power failure at just the wrong time), so the
memory (which is more voltage sensitive than hard drives) DMA's
garbage which gets written to the inode table, you could lose a large
number of adjacent inodes when garbage gets splatted over the inode
table.

Ext3 tends to recover from this better than other filesystems, thanks
to the fact that it does physical block journalling, but you do pay
for this in terms of performance if you have a metadata-intensive
workload, because you're writing more bytes to the journal for each
metadata operation.

> It seems like the more I learn about filesystems, the more failure modes  
> there are and the fewer guarantees can be made.  It's amazing that  
> things work as well as they do...

There are certainly things you can do.  Put your fileservers on
UPSes.  Use RAID.  Make backups.  Do all three.  :-)

	    	   		      	  - Ted


* Re: writing file to disk: not as easy as it looks
  2008-12-02 20:55         ` Theodore Tso
@ 2008-12-02 22:44           ` Pavel Machek
  2008-12-02 22:50             ` Pavel Machek
  2008-12-03  5:07             ` Theodore Tso
  2008-12-15 11:03           ` Pavel Machek
  1 sibling, 2 replies; 25+ messages in thread
From: Pavel Machek @ 2008-12-02 22:44 UTC (permalink / raw)
  To: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro

On Tue 2008-12-02 15:55:58, Theodore Tso wrote:
> On Tue, Dec 02, 2008 at 11:22:58AM -0600, Chris Friesen wrote:
> > Theodore Tso wrote:
> >
> >> Even for ext3/ext4 which is doing physical journalling, it's still the
> >> case that the journal commits first, and it's only later when the
> >> write happens that we write out the change.  If the disk fails some of
> >> the writes, it's possible to lose data, especially if the two blocks
> >> involved in the node split are far apart, and the write to the
> >> existing old btree block fails.
> >
> > Yikes.  I was under the impression that once the journal hit the platter  
> > then the data were safe (barring media corruption).
> 
> Well, this is a case of media corruption (or a cosmic ray hitting
> a ribbon cable in the disk controller sending the write to the
> wrong location on disk, or someone bumping the server causing the disk
> head to lift up a little higher than normal while it was writing the
> disk sector, etc.).  But it is a case of the hard drive misbehaving. 

I could not parse this. Negation seems to be missing somewhere.

> Heck, if you have a hiccup while writing an inode table block out to
> disk (for example a power failure at just the wrong time), so the
> memory (which is more voltage sensitive than hard drives) DMA's
> garbage which gets written to the inode table, you could lose a large
> number of adjacent inodes when garbage gets splatted over the inode
> table.

Ok, "memory failed before disk" is ... bad hardware.

...but... you seem to be saying that modern filesystems can damage
data even on "sane" hardware.

Let's define sane as:

1) if disk says sector was successfully written, it is so, until you
start writing to that sector again.

	(but disk may say "error writing". Filesystem should propagate
	that back to the userland, reliably. "Error writing" is
	extremely rare on modern disks, but can happen if you run out
	of spare blocks.)

	(and if you ask for a sector write, the sector is in undefined
	state until the drive returns success. Flash devices behave like
	this -- reads return errors. Do disks?)

2) connection to the disk either works or fails totally. Bit errors
are reliably detected at connection level.

3) power may fail any time.

You seem to be saying that ext2/ext3 only work if these are met:

1) power may fail any time.

2) writes are always successful.

3) connection to the disk always works.

AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
without unmounting (missing fsync error propagation), and it is unsafe
to run ext2/ext3 on any flash-based storage with block interface (SD
cards, flash sticks).
 
> Ext3 tends to recover from this better than other filesystems, thanks
> to the fact that it does physical block journalling, but you do pay
> for this in terms of performance if you have a metadata-intensive
> workload, because you're writing more bytes to the journal for each
> metadata operation.

And thanks for that! Actually I'd be willing to pay some more
performance to get reliability up.

> > It seems like the more I learn about filesystems, the more failure modes  
> > there are and the fewer guarantees can be made.  It's amazing that  
> > things work as well as they do...
> 
> There are certainly things you can do.  Put your fileservers on
> UPSes.  Use RAID.  Make backups.  Do all three.  :-)

I was almost stupid enough to move the primary copy of ~ and my linux trees
to SD... I do have UPSes; unfortunately they are li-ion and I'm
running off them most of the time. I do have backups, but restoring
them all the time is boring & time consuming. I'll try to stick two
MMC cards into the SD slot to make it RAID 1 :-).

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: writing file to disk: not as easy as it looks
  2008-12-02 22:44           ` Pavel Machek
@ 2008-12-02 22:50             ` Pavel Machek
  2008-12-03  5:07             ` Theodore Tso
  1 sibling, 0 replies; 25+ messages in thread
From: Pavel Machek @ 2008-12-02 22:50 UTC (permalink / raw)
  To: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro



...and it is unsafe to run ext2/ext3 on any media that can return an
error on write. That includes perfectly working disk drives that just
ran out of spare blocks.

> AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
> without unmounting (missing fsync error propagation), and it is

To be fair, bad fsync semantics (the error is only reported to the first
process that asks) look like a fundamental Unix problem, nothing ext2/3
specific...

> to run ext2/ext3 on any flash-based storage with block interface (SD
> cards, flash sticks).

...and I'm aware of no filesystem that _can_ reliably work on SD
cards/USB flash sticks...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: writing file to disk: not as easy as it looks
  2008-12-02  9:40 writing file to disk: not as easy as it looks Pavel Machek
  2008-12-02 14:04 ` Theodore Tso
@ 2008-12-02 23:01 ` Mikulas Patocka
  1 sibling, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2008-12-02 23:01 UTC (permalink / raw)
  To: Pavel Machek; +Cc: clock, kernel list, aviro

On Tue, 2 Dec 2008, Pavel Machek wrote:

> Actually, it looks like the POSIX file interface is on the lowest step of
> Rusty's scale: one that is impossible to use correctly. Yes, it seems
> impossible to reliably and safely write a file to disk under Linux. Double
> plus uncool.
> 
> So... how do you write a file to disk and wait for it to reach stable
> storage, with proper error handling?
> 
> > file
> 
> 	...does not work, because it fails to check for errors.
> 
> touch file || error_handling
> 
> 	Is not a lot better, unless you mount your filesystems "sync"
> 	... and no one does that.
> 
> dd conv=fsync if=something of=file 2> /dev/null || error_handling
> 
> 	Is a bit better; not much, unless you mount your filesystems
> 	"dirsync", because you have the file data on disk, but there is no
> 	directory entry pointing to it. No one uses dirsync.
> 
> 	So you need something like
> 
> dd conv=fsync if=something of=file 2> /dev/null || error_handling
> fsync . || error_handling
> fsync .. || error_handling
> fsync ../.. || error_handling
> fsync ../../.. || error_handling
> 
> 	... which mostly works... if you are alone on the filesystem.
> 	fsync only returns errors to the first process that asks, so if
> 	some other process does fsync ., it may get "your" error and you
> 	never learn of the problem.
> 
> The question is... Is there a way that I missed and that actually works?
> 									Pavel	

Hi!

I think you are right about this. There's no way to fsync a directory 
reliably.


My idea is that when the filesystem hits a metadata write error, it should 
stop committing any transactions and return an error to all writes.

Write errors don't happen because of physical errors on media --- all 
current disks have sector reallocation. Write errors can happen because of 
bad cabling, voltage drops, firmware bugs, corruption of the PCI bus by a 
rogue card, etc. Most of these cases are fixable by the administrator.

If you continue operating the filesystem after a write error (it doesn't 
matter if you report the error to userspace or not), you are risking 
filesystem damage (for example, cross-linked files if there was an error 
writing the bitmap) or a security breach (users reading blocks containing 
deleted data of other users). If you freeze the filesystem on a write error 
and do not allow further writes, the administrator can fix the underlying 
problem and the computer will run without any data damage or security 
problems.


It happened to me just yesterday: my disk was spinning down & up 
repeatedly and returning errors because of insufficient power. My kernel 
kicked the spadfs filesystem off on the first write error and didn't allow 
any further commits. I fixed the problem by adding a second power supply 
and connecting some disks to it --- and now, after the incident, there are 
zero corruptions. Just imagine what a massacre would have happened on the 
filesystem if the kernel hadn't kicked it off and if it had been operating 
under the condition "some writes get through - some not", unattended, for 
some time.

Mikulas


* Re: writing file to disk: not as easy as it looks
  2008-12-02 22:44           ` Pavel Machek
  2008-12-02 22:50             ` Pavel Machek
@ 2008-12-03  5:07             ` Theodore Tso
  2008-12-03  8:46               ` Pavel Machek
                                 ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Theodore Tso @ 2008-12-03  5:07 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Chris Friesen, mikulas, clock, kernel list, aviro

On Tue, Dec 02, 2008 at 11:44:03PM +0100, Pavel Machek wrote:
> > >
> > > Yikes.  I was under the impression that once the journal hit the platter  
> > > then the data were safe (barring media corruption).
> > 
> > Well, this is a case of media corruption (or a cosmic ray hitting
> > a ribbon cable in the disk controller sending the write to the
> > wrong location on disk, or someone bumping the server causing the disk
> > head to lift up a little higher than normal while it was writing the
> > disk sector, etc.).  But it is a case of the hard drive misbehaving. 
> 
> I could not parse this. Negation seems to be missing somewhere.

I was agreeing with your original statement.  Once the journal hits
the platter, the data is safe, barring hard drive malfunctions (not
just media corruption).  I was just listing the many other types of
hard drive failures that could cause data loss.

> > Heck, if you have a hiccup while writing an inode table block out to
> > disk (for example a power failure at just the wrong time), so the
> > memory (which is more voltage sensitive than hard drives) DMA's
> > garbage which gets written to the inode table, you could lose a large
> > number of adjacent inodes when garbage gets splatted over the inode
> > table.
> 
> Ok, "memory failed before disk" is ... bad hardware.

It's PC class hardware.  Live with it.  Back when SGI made their own
hardware, they noticed this problem, and so they wired up their SGI
machines with powerfail interrupts, and extra big capacitors in their
power supplies, and when Irix got a powerfail interrupt, it would
frantically run around aborting DMA transfers to avoid this particular
problem.  At least, that's what an old-timer SGI engineer (who is
unfortunately no longer at SGI) told me.

PC-class hardware doesn't have power fail interrupts.  Hence, my advice
to you is that if you use a filesystem that does logical journalling
--- better have a UPS.

> ...but... you seem to be saying that modern filesystems can damage
> data even on "sane" hardware.

The example I gave was one where a disk failure could cause a file
that had previously been successfully written to disk and fsync()'ed to
be damaged by another filesystem operation ***in the face of hard
drive failure***.  Surely that is obvious.  The most obvious case of
that might be if the disk controller gets confused and slams a data
block into the wrong location on disk (there's a reason why DIF
includes the sector number in its checksum and why some enterprise
databases do the same thing in their tablespace blocks --- it happens
often enough that paranoid data integrity engineers worry about it).

The example I gave, where a b-tree is doing a split, and there is a
failure writing to the b-tree causing ancillary damage to files
referenced in the b-tree node getting split, can happen with **any**
filesystem.  The only thing that will save you here would be a
copy-on-write type filesystem, such as WAFL or btrfs.

> You seem to be saying that ext2/ext3 only work if these are met:
> 
> 1) power may fail any time.

Well, ext2/ext3 will work fine if the power is always reliable, too.  :-)

> 2) writes are always successful.

To the extent that write failures while writing filesystem metadata
can, if you are unlucky, be catastrophic, yeah.  Fortunately such
write failures are normally fairly rare, but if you worry about such
things, RAID is the answer.  As I said, I believe this is going to be
true for pretty much any update-in-place filesystem.  It's always
possible to construct failure scenarios if the hardware is unreliable.

> 
> 3) connection to the disk always works.
> 
> AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
> without unmounting (missing fsync error propagation), and it is unsafe
> to run ext2/ext3 on any flash-based storage with block interface (SD
> cards, flash sticks).

The data on the disk before the connection is yanked should be safe
(although as we mentioned in another thread, the flash drive itself
may not be happy if you are writing to the Flash Translation Layer at
the time when power is cut; if that causes a previously written sector
to disappear, that's an example of a hardware failure that **any**
filesystem won't necessarily be able to recover from).

Your definition of "safe" seems to include worrying about making sure
that all processes that may have previously touched a file or a
directory get an error when they try to do an fsync() on that file or
directory; given that fsync clears the error condition after it
returns, it is therefore "unsafe".

The reality is that most applications don't do proper error checking, and
even fewer actually call fsync(), so if you are putting your root
filesystem on a 32G flash card, and it pops out easily due to hardware
design issues, the question of whether fsync() errors get properly
propagated to all potentially interested applications is the ***least***
of your worries.

							- Ted


* Re: writing file to disk: not as easy as it looks
  2008-12-03  5:07             ` Theodore Tso
@ 2008-12-03  8:46               ` Pavel Machek
  2008-12-03 15:50                 ` Mikulas Patocka
  2008-12-03 16:42                 ` Theodore Tso
  2008-12-03 15:34               ` Mikulas Patocka
  2008-12-15 10:24               ` [patch] " Pavel Machek
  2 siblings, 2 replies; 25+ messages in thread
From: Pavel Machek @ 2008-12-03  8:46 UTC (permalink / raw)
  To: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro

On Wed 2008-12-03 00:07:09, Theodore Tso wrote:
> On Tue, Dec 02, 2008 at 11:44:03PM +0100, Pavel Machek wrote:
> > > >
> > > > Yikes.  I was under the impression that once the journal hit the platter  
> > > > then the data were safe (barring media corruption).
> > > 
> > > Well, this is a case of media corruption (or a cosmic ray hitting
> > > a ribbon cable in the disk controller sending the write to the
> > > wrong location on disk, or someone bumping the server causing the disk
> > > head to lift up a little higher than normal while it was writing the
> > > disk sector, etc.).  But it is a case of the hard drive misbehaving. 
> > 
> > I could not parse this. Negation seems to be missing somewhere.
> 
> I was agreeing with your original statement.  Once the journal hits
> the platter, the data is safe, barring hard drive malfunctions (not
> just media corruption).  I was just listing the many other types of
> hard drive failures that could cause data loss.

Aha, ok, sorry for the confusion.

> > Ok, "memory failed before disk" is ... bad hardware.
> 
> It's PC class hardware.  Live with it.  Back when SGI made their own
> hardware, they noticed this problem, and so they wired up their SGI
> machines with powerfail interrupts, and extra big capacitors in their
> power supplies, and when Irix got a powerfail interrupt, it would
> frantically run around aborting DMA transfers to avoid this particular
> problem.  At least, that's what an old-timer SGI engineer (who is
> unfortunately no longer at SGI) told me.
> 
> PC-class hardware doesn't have power fail interrupts.  Hence, my advice
> to you is that if you use a filesystem that does logical journalling
> --- better have a UPS.

Hmm, 'just avoid logical journalling' seems like a better solution
:-).

> > ...but... you seem to be saying that modern filesystems can damage
> > data even on "sane" hardware.
> 
> The example I gave was one where a disk failure could cause a file
> that had previously been successfully written to disk and fsync()'ed to
> be damaged by another filesystem operation ***in the face of hard
> drive failure***.  Surely that is obvious.  The most obvious case of

Ok.

> The example I gave, where a b-tree is doing a split, and there is a
> failure writing to the b-tree causing ancillary damage to files
> referenced in the b-tree node getting split, can happen with **any**
> filesystem.  The only thing that will save you here would be a
> copy-on-write type filesystem, such as WAFL or btrfs.

ext3-like physical journaling could be extended to handle write
failures (at a speed penalty), no?

Write 'I will rewrite block A containing B with C' into the journal... ok,
I guess I should wait for btrfs.
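
(To illustrate the idea, a hypothetical record format for such a
journal --- both the old and the new contents ride along, so a failed
in-place write can be undone during recovery:

	#include <stdint.h>

	/* hypothetical undo+redo journal record, 4K blocks assumed */
	struct journal_record {
		uint64_t block_nr;		/* A: block being rewritten */
		uint8_t  old_data[4096];	/* B: contents before the change */
		uint8_t  new_data[4096];	/* C: contents that should land on disk */
		uint32_t checksum;		/* guards the record itself */
	};

	/* recovery: if the in-place write of C failed or was interrupted,
	   write B back; either way the filesystem ends up consistent */

Roughly twice the journal traffic of journalling only the new data ---
the speed penalty mentioned above.)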

> > You seem to be saying that ext2/ext3 only work if these are met:
> > 
> > 1) power may fail any time.
> 
> Well, ext2/ext3 will work fine if the power is always reliable, too.  :-)

:-) ok.

> > 2) writes are always successful.
> 
> To the extent that write failures while writing filesystem metadata
> can, if you are unlucky, be catastrophic, yeah.  Fortunately such
> write failures are normally fairly rare, but if you worry about such
> things, RAID is the answer.  As I said, I believe this is going to be
> true for pretty much any update-in-place filesystem.  It's always
> possible to construct failure scenarios if the hardware is unreliable.

Ok.

> > 3) connection to the disk always works.
> > 
> > AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
> > without unmounting (missing fsync error propagation), and it is unsafe
> > to run ext2/ext3 on any flash-based storage with block interface (SD
> > cards, flash sticks).
> 
> The data on the disk before the connection is yanked should be safe
> (although as we mentioned in another thread, the flash drive itself
> may not be happy if you are writing to the Flash Translation Layer at
> the time when power is cut; if that causes a previously written sector
> to disappear, that's an example of a hardware failure that **any**
> filesystem won't necessarily be able to recover from).
> 
> Your definition of "safe" seems to include worrying about making sure
> that all processes that may have previously touched a file or a
> directory get an error when they try to do an fsync() on that file or
> directory; given that fsync clears the error condition after it
> returns, it is therefore "unsafe".

Yes. fsync() seems surprisingly high on Rusty's classification of
broken interfaces ('impossible to use correctly').

I wonder if some reasonable solution exists? Marking the filesystem as
failed on the first write error is one option (and the default for
ext2/3?). Did SGI/big unixen solve this somehow?

> The reality is that most applications don't do proper error checking, and
> even fewer actually call fsync(), so if you are putting your root
> filesystem on a 32G flash card, and it pops out easily due to hardware
> design issues, the question of whether fsync() errors get properly
> propagated to all potentially interested applications is the ***least***
> of your worries.

Yes, most applications are bad. Yes, I should just glue the card into
the slot. No, the fsync interface does not look properly designed. No, it
is not causing me immediate problems (mount -o dirsync mostly works
around it). I wonder if a good, long-term solution exists...


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: writing file to disk: not as easy as it looks
  2008-12-03  5:07             ` Theodore Tso
  2008-12-03  8:46               ` Pavel Machek
@ 2008-12-03 15:34               ` Mikulas Patocka
  2008-12-15 10:24               ` [patch] " Pavel Machek
  2 siblings, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2008-12-03 15:34 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Pavel Machek, Chris Friesen, clock, kernel list, aviro



On Wed, 3 Dec 2008, Theodore Tso wrote:

> > Ok, "memory failed before disk" is ... bad hardware.
> 
> It's PC class hardware. Live with it.  Back when SGI made their own
> hardware, they noticed this problem, and so they wired up their SGI
> machines with powerfail interrupts, and extra big capacitors in their
> power supplies, and when Irix got a powerfail interrupt, it would
> frantically run around aborting DMA transfers to avoid this particular
> problem.  At least, that's what an old-timer SGI engineer (who is
> unfortunately no longer at SGI) told me.

I heard this too --- I just don't understand why they routed it to an 
interrupt and undertook the complicated sequence of aborting the commands 
in the kernel --- instead of simply routing it to the PCI reset line, which 
would reset the controller and stop it from feeding data to the disks.

Also, if they had ECC memory, the chipset should detect the unrecoverable 
garbage and respond with a target-abort or a full system reset, not feed 
bad data to the controller.

> PC-class hardware doesn't have power fail interrupts.  Hence, my advice
> to you is that if you use a filesystem that does logical journalling
> --- better have a UPS.

ATX has a PWR_OK pin that should be deasserted on power failure before the 
voltage drops.

I don't know if motherboards use it --- but there should be no problem 
routing the pin to the chipset reset and stopping it before the power goes low.

> > ...but... you seem to be saying that modern filesystems can damage 
> > data even on "sane" hardware.
> 
> The example I gave was one where a disk failure could cause a file
> that had previously been successfully written to disk and fsync()'ed to
> be damaged by another filesystem operation ***in the face of hard
> drive failure***.  Surely that is obvious.  The most obvious case of
> that might be if the disk controller gets confused and slams a data
> block into the wrong location on disk (there's a reason why DIF
> includes the sector number in its checksum and why some enterprise
> databases do the same thing in their tablespace blocks --- it happens
> often enough that paranoid data integrity engineers worry about it).

You can read the block number back from the ATA disk after you write it and 
before you submit the command.

Mikulas


* Re: writing file to disk: not as easy as it looks
  2008-12-03  8:46               ` Pavel Machek
@ 2008-12-03 15:50                 ` Mikulas Patocka
  2008-12-03 15:54                   ` Alan Cox
  2008-12-03 16:42                 ` Theodore Tso
  1 sibling, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2008-12-03 15:50 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Theodore Tso, Chris Friesen, kernel list, aviro

> Yes. fsync() seems surprisingly high on Rusty's classification of
> broken interfaces ('impossible to use correctly').
> 
> I wonder if some reasonable solution exists? Marking the filesystem as
> failed on the first write error is one option (and the default for
> ext2/3?). Did SGI/big unixen solve this somehow?

When OS/2 hit a write error, it wrote to another location on disk and added 
this to its sector remap table. It could remap both metadata and data this 
way. But today this is meaningless, because the same algorithm is 
implemented in disk firmware. Write errors are reported for disk 
connection problems, not media problems.

For connection problems, another solution may be to retry writes 
indefinitely until the admin aborts it or reconnects the disk. But I don't 
know how common these recoverable disk connection errors are.

> > The reality is that most applications don't do proper error checking, and
> > even fewer actually call fsync(), so if you are putting your root
> > filesystem on a 32G flash card, and it pops out easily due to hardware
> > design issues, the question of whether fsync() errors get properly
> > propagated to all potentially interested applications is the ***least***
> > of your worries.

If you are running transaction processing software, then it is a very 
important worry. All database software is written with the assumption 
that when the database returns "transaction committed", the changes are 
permanent.

Most business software can deal with the fact that the server 
crashes, but can't deal with the database returning committed 
status for a transaction that wasn't really committed.

Mikulas


* Re: writing file to disk: not as easy as it looks
  2008-12-03 15:50                 ` Mikulas Patocka
@ 2008-12-03 15:54                   ` Alan Cox
  2008-12-03 17:37                     ` Mikulas Patocka
  0 siblings, 1 reply; 25+ messages in thread
From: Alan Cox @ 2008-12-03 15:54 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Pavel Machek, Theodore Tso, Chris Friesen, kernel list, aviro

> implemented in disk firmware. Write errors are reported for disk 
> connection problems, not media problems.

Media errors are reported for writes when the drive knows there are
problems. That may be deferred to the cache flush afterwards but the
information is still generated and shipped back to us - eventually.

> For connection problems, another solution may be to retry writes 
> indefinitely until the admin aborts it or reconnects the disk. But I don't 
> know how common these recoverable disk connection errors are.

CRC errors, lost IRQs and the like are retried by the midlayer and
drivers and the error handling strategies will also try things like
reducing link speeds on repeated CRC errors.

Alan


* Re: writing file to disk: not as easy as it looks
  2008-12-03  8:46               ` Pavel Machek
  2008-12-03 15:50                 ` Mikulas Patocka
@ 2008-12-03 16:42                 ` Theodore Tso
  2008-12-03 17:43                   ` Mikulas Patocka
  1 sibling, 1 reply; 25+ messages in thread
From: Theodore Tso @ 2008-12-03 16:42 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Chris Friesen, mikulas, clock, kernel list, aviro

On Wed, Dec 03, 2008 at 09:46:40AM +0100, Pavel Machek wrote:
> Yes. fsync() seems surprisingly high on Rusty's classification of
> broken interfaces ('impossible to use correctly').

To be fair, fsync() was primarily intended for making sure that the
data had been written to disk, and not necessarily as a way of making
sure that write errors would be properly reflected back to the
application.  As you've pointed out, it's not really adequate for that
purpose.

						- Ted


* Re: writing file to disk: not as easy as it looks
  2008-12-03 15:54                   ` Alan Cox
@ 2008-12-03 17:37                     ` Mikulas Patocka
  2008-12-03 17:52                       ` Alan Cox
  2008-12-03 18:16                       ` Pavel Machek
  0 siblings, 2 replies; 25+ messages in thread
From: Mikulas Patocka @ 2008-12-03 17:37 UTC (permalink / raw)
  To: Alan Cox; +Cc: Pavel Machek, Theodore Tso, Chris Friesen, kernel list, aviro

On Wed, 3 Dec 2008, Alan Cox wrote:

> > implemented in disk firmware. Write errors are reported for disk 
> > connection problems, not media problems.
> 
> Media errors are reported for writes when the drive knows there are
> problems. That may be deferred to the cache flush afterwards but the
> information is still generated and shipped back to us - eventually.

It is a question how to process cache flush errors correctly. A cache flush 
error reported for one filesystem may belong to data written by another 
filesystem. So should some "there was an error" flag be set for all 
partitions and reported to every filesystem when it does a cache flush? Or 
should we record the time of the last error in the driver and let the 
filesystem query it (so that the filesystem can tell if the error happened 
before or after it was mounted)?
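
(The second variant, sketched --- the field names are made up:

	#include <time.h>

	struct disk_state { time_t last_flush_error; };	/* kept by the driver */
	struct fs_state {
		time_t mounted_at;
		const struct disk_state *disk;
	};

	/* the filesystem asks: did any flush fail since I was mounted? */
	static int flush_error_since_mount(const struct fs_state *fs)
	{
		return fs->disk->last_flush_error >= fs->mounted_at;
	}

Coarse, but at least it cannot lose an error the way per-caller
reporting does.)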

BTW, how does SCSI report cache flush errors? Does it report them on 
the SYNCHRONIZE CACHE command or does it report them as deferred senses?

Another point is that unless the sector remap table is full, there should 
be no cache flush errors.

> > For connection problems, another solution may be to retry writes 
> > indefinitely until the admin aborts it or reconnects the disk. But I don't 
> > know how common these recoverable disk connection errors are.
> 
> CRC errors, lost IRQs and the like are retried by the midlayer and
> drivers and the error handling strategies will also try things like
> reducing link speeds on repeated CRC errors.

I meant for example a loose cable or so --- does it make sense to retry 
indefinitely (until the admin plugs the cable back in or unmounts the 
filesystem) or to return an error to the filesystem after a few retries?

Mikulas

> Alan
> 


* Re: writing file to disk: not as easy as it looks
  2008-12-03 16:42                 ` Theodore Tso
@ 2008-12-03 17:43                   ` Mikulas Patocka
  2008-12-03 18:26                     ` Pavel Machek
  0 siblings, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2008-12-03 17:43 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Pavel Machek, Chris Friesen, kernel list, aviro

> On Wed, Dec 03, 2008 at 09:46:40AM +0100, Pavel Machek wrote:
> > Yes. fsync() seems surprisingly high on Rusty's classification of
> > broken interfaces ('impossible to use correctly').

BTW, where is that list?

> To be fair, fsync() was primarily intended for making sure that the
> data had been written to disk, and not necessarily as a way of making
> sure that write errors would be properly reflected back to the
> application.  As you've pointed out, it's not really adequate for that
> purpose.
> 
> 						- Ted

Well, what else do you want to use for databases? (where crashing the 
whole computer does less damage than pretending that a transaction was 
committed while it wasn't).
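
(The usual commit path --- sketched here with made-up names --- leans
entirely on fsync() telling the truth:

	#include <unistd.h>

	/* append a commit record to the write-ahead log and make it durable */
	static int commit_transaction(int log_fd, const void *rec, size_t len)
	{
		if (write(log_fd, rec, len) != (ssize_t)len)
			return -1;	/* could not append to the log */
		if (fsync(log_fd) < 0)
			return -1;	/* must NOT report "committed" */
		return 0;		/* only now tell the client "committed" */
	}

If fsync() quietly hands the error to some other process instead, the
database ends up lying to its client.)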

Mikulas


* Re: writing file to disk: not as easy as it looks
  2008-12-03 17:37                     ` Mikulas Patocka
@ 2008-12-03 17:52                       ` Alan Cox
  2008-12-03 18:16                       ` Pavel Machek
  1 sibling, 0 replies; 25+ messages in thread
From: Alan Cox @ 2008-12-03 17:52 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Pavel Machek, Theodore Tso, Chris Friesen, kernel list, aviro

> error reported for one filesystem may belong to data written by another 
> filesystem. So should some "there was an error" flag be set for all 
> partitions and reported to every filesystem when it does a cache flush? Or 
> should we record the time of the last error in the driver and let the 
> filesystem query it (so that the filesystem can tell if the error happened 
> before or after it was mounted)?

Good question - not working that high up the stack I don't know the right
answer there.
> 
> BTW, how does SCSI report cache flush errors? Does it report them on 
> the SYNCHRONIZE CACHE command or does it report them as deferred senses?

Not sure. I thought the same way.

> Another point is that unless the sector remap table is full, there should 
> be no cache flush errors.

You can get them on partial writes to large sector devices, assorted
errors on SSD devices and various 'catastrophic' errors.

> I meant for example a loose cable or so --- does it make sense to retry 
> indefinitely (until the admin plugs the cable back in or unmounts the 
> filesystem) or to return an error to the filesystem after a few retries?

At the low level we have to return an error so that RAID and the like can
work.

Alan


* Re: writing file to disk: not as easy as it looks
  2008-12-03 17:37                     ` Mikulas Patocka
  2008-12-03 17:52                       ` Alan Cox
@ 2008-12-03 18:16                       ` Pavel Machek
  2008-12-03 18:33                         ` Mikulas Patocka
  1 sibling, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2008-12-03 18:16 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Alan Cox, Theodore Tso, Chris Friesen, kernel list, aviro

> > CRC errors, lost IRQs and the like are retried by the midlayer and
> > drivers and the error handling strategies will also try things like
> > reducing link speeds on repeated CRC errors.
> 
> I meant for example a loose cable or so --- does it make sense to retry 
> indefinitely (until the admin plugs the cable back in or unmounts the 
> filesystem) or to return an error to the filesystem after a few retries?

It is quite non-trivial to detect if it is "disk plugged back in"
vs. "faulty disk unplugged, new one plugged in"... so I suppose
automatic retry after failure of connection to disk is quite hard to
get right.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: writing file to disk: not as easy as it looks
  2008-12-03 17:43                   ` Mikulas Patocka
@ 2008-12-03 18:26                     ` Pavel Machek
  0 siblings, 0 replies; 25+ messages in thread
From: Pavel Machek @ 2008-12-03 18:26 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Theodore Tso, Chris Friesen, kernel list, aviro

On Wed 2008-12-03 18:43:18, Mikulas Patocka wrote:
> > On Wed, Dec 03, 2008 at 09:46:40AM +0100, Pavel Machek wrote:
> > > Yes. fsync() seems surprisingly high on Rusty's classification of
> > > broken interfaces ('impossible to use correctly').
> 
> BTW, where is that list?

http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html

> > To be fair, fsync() was primarily intended for making sure that the
> > data had been written to disk, and not necessarily as a way of making
> > sure that write errors would be properly reflected back to the
> > application.  As you've pointed out, it's not really adequate for that
> > purpose.
> 
> Well, what else do you want to use for databases? (where crashing the 
> whole computer does less damage than pretending that a transaction was 
> committed while it wasn't).

I guess we could modify fsync() to fail if there was _ever_ a write
problem on the same filesystem. That would make it "safe". And as
ext2/ext3 can't handle metadata write errors anyway... maybe that
should be done?
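
A hand-waving sketch of the idea --- the structure and helpers are
hypothetical, nothing like this exists in the kernel today:

	#include <errno.h>

	/* per-filesystem error state, set once and never cleared */
	struct fs_error_state {
		int write_error_seen;
	};

	/* called from the writeback path on any failed write */
	static void fs_note_write_error(struct fs_error_state *fs)
	{
		fs->write_error_seen = 1;
	}

	/* fsync() would then fail for every caller, forever */
	static int fs_fsync_status(const struct fs_error_state *fs, int wb_status)
	{
		return fs->write_error_seen ? -EIO : wb_status;
	}

It trades false positives (an error in an unrelated file fails your
fsync too) for never reporting a lost write as success.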
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: writing file to disk: not as easy as it looks
  2008-12-03 18:16                       ` Pavel Machek
@ 2008-12-03 18:33                         ` Mikulas Patocka
  0 siblings, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2008-12-03 18:33 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Alan Cox, Theodore Tso, Chris Friesen, kernel list, aviro

On Wed, 3 Dec 2008, Pavel Machek wrote:

> > > CRC errors, lost IRQs and the like are retried by the midlayer and
> > > drivers and the error handling strategies will also try things like
> > > reducing link speeds on repeated CRC errors.
> > 
> > I meant for example a loose cable or so --- does it make sense to retry 
> > indefinitely (until the admin plugs the cable back in or unmounts the 
> > filesystem) or to return an error to the filesystem after a few retries?
> 
> It is quite non-trivial to detect if it is "disk plugged back in"
> vs. "faulty disk unplugged, new one plugged in"... so I suppose
> automatic retry after failure of connection to disk is quite hard to
> get right.

Unless the SATA controller has a plug interrupt (very few have), there 
is no way for the kernel to detect that an old SATA disk was unplugged and 
a new one was plugged in.

So the answer is that the admin must not hot-swap a disk without unmounting 
the filesystem or notifying the RAID layer about it first. If you hot-swap a 
softraid1/4/5 disk, you definitely damage data, because the softraid layer 
has no way to find out about the hot-swap.

Mikulas


* [patch] Re: writing file to disk: not as easy as it looks
  2008-12-03  5:07             ` Theodore Tso
  2008-12-03  8:46               ` Pavel Machek
  2008-12-03 15:34               ` Mikulas Patocka
@ 2008-12-15 10:24               ` Pavel Machek
  2 siblings, 0 replies; 25+ messages in thread
From: Pavel Machek @ 2008-12-15 10:24 UTC (permalink / raw)
  To: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro
  Cc: Andrew Morton

Hi!

> > > Heck, if you have a hiccup while writing an inode table block out to
> > > disk (for example a power failure at just the wrong time), so the
> > > memory (which is more voltage sensitive than hard drives) DMA's
> > > garbage which gets written to the inode table, you could lose a large
> > > number of adjacent inodes when garbage gets splatted over the inode
> > > table.
> > 
> > Ok, "memory failed before disk" is ... bad hardware.
> 
> It's PC class hardware.  Live with it.  Back when SGI made their own
> hardware, they noticed this problem, and so they wired up their SGI
> machines with powerfail interrupts, and extra big capacitors in
> their

Seems like bad hardware is very common indeed. Anyway, I guess it
would be fair to document what ext3 expects from the disk subsystem for
safe operation. Does this summary sound correct/fair?

Signed-off-by: Pavel Machek <pavel@suse.cz>

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..3855fbd 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +188,34 @@ mke2fs: 	create a ext3 partition with th
 debugfs: 	ext2 and ext3 file system debugger.
 ext2online:	online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* writes to media never fail. Even if the disk returns an error condition
+  during a write, ext3 can't handle that correctly, because success on fsync
+  was already returned when the data hit the journal.
+
+	   (Fortunately, write failures are very uncommon on disks, as they
+	   have spare sectors they use when a write fails.)
+
+* either the whole sector is correctly written or nothing is written
+  during a power failure.
+
+	   (Unfortunately, none of the cheap USB/SD flash cards I have seen
+	   behave like this, so they are unsuitable for ext3. Because RAM
+	   tends to fail faster than the rest of the system during a power
+	   failure, special hw killing DMA transfers may be necessary. Not
+	   sure how common that problem is on generic PC machines.)
+
+* either write caching is disabled, or the hw can do barriers and they are enabled.
+
+	   (Note that barriers are disabled by default; use the "barrier=1"
+	   mount option after making sure the hw can support them.)
+
 
 References
 ==========





-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: writing file to disk: not as easy as it looks
  2008-12-02 20:55         ` Theodore Tso
  2008-12-02 22:44           ` Pavel Machek
@ 2008-12-15 11:03           ` Pavel Machek
  2008-12-15 20:08             ` Folkert van Heusden
  1 sibling, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2008-12-15 11:03 UTC (permalink / raw)
  To: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro

On Tue 2008-12-02 15:55:58, Theodore Tso wrote:
> On Tue, Dec 02, 2008 at 11:22:58AM -0600, Chris Friesen wrote:
> > Theodore Tso wrote:
> >
> >> Even for ext3/ext4 which is doing physical journalling, it's still the
> >> case that the journal commits first, and it's only later when the
> >> write happens that we write out the change.  If the disk fails some of
> >> the writes, it's possible to lose data, especially if the two blocks
> >> involved in the node split are far apart, and the write to the
> >> existing old btree block fails.
> >
> > Yikes.  I was under the impression that once the journal hit the platter  
> > then the data were safe (barring media corruption).
> 
> Well, this is a case of media corruption (or a cosmic ray hitting
> a ribbon cable in the disk controller sending the write to the
> wrong location on disk, or someone bumping the server causing the disk
> head to lift up a little higher than normal while it was writing the
> disk sector, etc.).  But it is a case of the hard drive misbehaving. 
> 
> Heck, if you have a hiccup while writing an inode table block out to
> disk (for example a power failure at just the wrong time), so the
...
> Ext3 tends to recover from this better than other filesystems, thanks
> to the fact that it does physical block journalling, but you do pay
> for this in terms of performance if you have a metadata-intensive
> workload, because you're writing more bytes to the journal for each
> metadata operation.
> 
> > It seems like the more I learn about filesystems, the more failure modes  
> > there are and the fewer guarantees can be made.  It's amazing that  
> > things work as well as they do...
> 
> There are certainly things you can do.  Put your fileservers on
> UPSes.  Use RAID.  Make backups.  Do all three.  :-)

Okay, so we pretty much know that ext3 journalling helps in the "user hit
the reset button" case. (And we are pretty sure ext2/ext3 work in the
"clean unmount" case.) Otherwise:

*) kernel bug -> journalling does not help.

*) sudden powerfail -> journalling works on SGI high-end
hardware. It may or may not help on PC-class hardware.

We already do periodic checks, even on ext3. Maybe we should do fsck
more often if we see evidence of unclean shutdowns (because we know
PC hardware is crap...). I actually have a patch somewhere; should I
resurrect it?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: writing file to disk: not as easy as it looks
  2008-12-15 11:03           ` Pavel Machek
@ 2008-12-15 20:08             ` Folkert van Heusden
  0 siblings, 0 replies; 25+ messages in thread
From: Folkert van Heusden @ 2008-12-15 20:08 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro

> > There are certainly things you can do.  Put your fileservers on
> > UPSes.  Use RAID.  Make backups.  Do all three.  :-)
> 
> Okay, so we pretty much know that ext3 journalling helps in the "user hit
> the reset button" case. (And we are pretty sure ext2/ext3 work in the
> "clean unmount" case.) Otherwise:
> 
> *) kernel bug -> journalling does not help.
> 
> *) sudden powerfail -> journalling works on SGI high-end
> hardware. It may or may not help on PC-class hardware.
> 
> We already do periodic checks, even on ext3. Maybe we should do fsck
> more often if we see evidence of unclean shutdowns (because we know
> PC hardware is crap...).

What we might need is on-line fsck, i.e., fsck while the fs is still
mounted.

Might be tricky to implement.



Folkert van Heusden

-- 
MultiTail is a versatile tool for watching logfiles and output of
commands. Filtering, coloring, merging, diff-view, etc.
http://www.vanheusden.com/multitail/
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com


end of thread

Thread overview: 25+ messages
2008-12-02  9:40 writing file to disk: not as easy as it looks Pavel Machek
2008-12-02 14:04 ` Theodore Tso
2008-12-02 15:26   ` Pavel Machek
2008-12-02 16:37     ` Theodore Tso
2008-12-02 17:22       ` Chris Friesen
2008-12-02 20:55         ` Theodore Tso
2008-12-02 22:44           ` Pavel Machek
2008-12-02 22:50             ` Pavel Machek
2008-12-03  5:07             ` Theodore Tso
2008-12-03  8:46               ` Pavel Machek
2008-12-03 15:50                 ` Mikulas Patocka
2008-12-03 15:54                   ` Alan Cox
2008-12-03 17:37                     ` Mikulas Patocka
2008-12-03 17:52                       ` Alan Cox
2008-12-03 18:16                       ` Pavel Machek
2008-12-03 18:33                         ` Mikulas Patocka
2008-12-03 16:42                 ` Theodore Tso
2008-12-03 17:43                   ` Mikulas Patocka
2008-12-03 18:26                     ` Pavel Machek
2008-12-03 15:34               ` Mikulas Patocka
2008-12-15 10:24               ` [patch] " Pavel Machek
2008-12-15 11:03           ` Pavel Machek
2008-12-15 20:08             ` Folkert van Heusden
2008-12-02 19:10       ` Folkert van Heusden
2008-12-02 23:01 ` Mikulas Patocka
