* writing file to disk: not as easy as it looks
@ 2008-12-02 9:40 Pavel Machek
2008-12-02 14:04 ` Theodore Tso
2008-12-02 23:01 ` Mikulas Patocka
0 siblings, 2 replies; 25+ messages in thread
From: Pavel Machek @ 2008-12-02 9:40 UTC (permalink / raw)
To: mikulas, clock, kernel list, aviro
Actually, it looks like the POSIX file interface is on the lowest step of
Rusty's scale: one that is impossible to use correctly. Yes, it seems
impossible to reliably and safely write a file to disk under Linux.
Double plus uncool.
So... how do you write a file to disk and wait for it to reach stable
storage, with proper error handling?
> file
...does not work, because it fails to check for errors.
touch file || error_handling
is not a lot better, unless you mount your filesystems "sync"
... and no one does that.
dd conv=fsync if=something of=file 2> /dev/null || error_handling
is a bit better, but not much, unless you mount your filesystems
"dirsync": you have the file data on disk, but there is no directory
entry pointing to it. No one uses dirsync.
So you need something like
dd conv=fsync if=something of=file 2> /dev/null || error_handling
fsync . || error_handling
fsync .. || error_handling
fsync ../.. || error_handling
fsync ../../.. || error_handling
... which mostly works...
If you are alone on the filesystem, that is: fsync only returns
errors to the first process that asks. If some other process also
does fsync ., it may get "your" error, and you never learn of the
problem.
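The shell chain above maps onto plain POSIX calls. Here is a minimal sketch in Python (whose os module is a thin wrapper over open/write/fsync/rename), checking errors at every step; the helper name `durable_write` and the write-to-temp-then-rename scheme are illustrative additions, not something from the thread, and it assumes Linux, where fsync() on a directory fd is permitted:

```python
import os

def durable_write(path, data):
    """Write data to path and push both the file contents and the
    directory entry toward stable storage; any failure raises OSError."""
    # Write under a temporary name first, so a crash never leaves a
    # half-written file visible under the final name.
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # flush file data + inode to the device
    finally:
        os.close(fd)
    os.rename(tmp, path)      # atomically publish the final name
    # fsync the containing directory so the new entry itself is durable;
    # freshly created ancestor directories would need the same treatment.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Even this sketch is subject to the complaint above: if another process fsyncs the same directory first, it may consume the error that should have been ours.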
The question is... is there a way that I missed and that actually works?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: writing file to disk: not as easy as it looks
2008-12-02 9:40 writing file to disk: not as easy as it looks Pavel Machek
@ 2008-12-02 14:04 ` Theodore Tso
2008-12-02 15:26 ` Pavel Machek
2008-12-02 23:01 ` Mikulas Patocka
1 sibling, 1 reply; 25+ messages in thread
From: Theodore Tso @ 2008-12-02 14:04 UTC (permalink / raw)
To: Pavel Machek; +Cc: mikulas, clock, kernel list, aviro
On Tue, Dec 02, 2008 at 10:40:59AM +0100, Pavel Machek wrote:
> Actually, it looks like the POSIX file interface is on the lowest step of
> Rusty's scale: one that is impossible to use correctly. Yes, it seems
> impossible to reliably and safely write a file to disk under Linux.
> Double plus uncool.
>
> So... how do you write a file to disk and wait for it to reach stable
> storage, with proper error handling?
Are you trying to do this in C or shell? There is no "fsync" shell
command as far as I know, which is what is confusing me. And whether
"> file" checks for errors or not obviously depends on the application
that is writing to stdout. Some might check for errors, some might
not...
Why do you feel the need to error-check "fsync ../.." and "fsync
../../..", et al.? I can understand why you might want to fsync the
containing directory to make sure the directory entry got written to
disk --- but if you're that paranoid, note that many modern filesystems
use some kind of tree structure for the directory, and there is always
the chance that a second later, during a b-tree node split, a disk
error causes the directory entry to get lost.
What exactly are your requirements here, and what are you trying to
do? What are you worried about? Most MTAs are quite happy settling
for an fsync() to make sure the data made it to the disk safely, and
the super-paranoid might also keep an open fd on the spool directory
and fsync that too. That's been enough for most POSIX programs.
More generally, if you have a higher need for assurance, most system
administrators will spend their effort making the storage layer more
robust (RAID, battery-backed journals, etc.) rather than obsessing
over some API that can tell an application --- "you know that file you
just finished writing 50 milliseconds ago? Well, another application
created 100 files, which forced a b-tree node split, and
golly-gee-willickers, when I tried to modify the directory to
accommodate the node split, we ended up losing 50 directory entries,
including that file you just finished writing, fsyncing, and
closing..."
- Ted
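The "super-paranoid MTA" pattern described above, keep one open fd on the spool directory and fsync it after each delivery, can be sketched as follows. The class and method names (`SpoolWriter`, `deliver`) are illustrative, not a real MTA API; the Python os calls are thin wrappers over the POSIX calls under discussion:

```python
import os

class SpoolWriter:
    """Hold one open fd on the spool directory and fsync it after
    every delivery, so each message's directory entry is made durable."""
    def __init__(self, spool_dir):
        self.spool_dir = spool_dir
        self.dirfd = os.open(spool_dir, os.O_RDONLY)

    def deliver(self, name, body):
        path = os.path.join(self.spool_dir, name)
        # O_EXCL: never silently overwrite an existing spool file.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.write(fd, body)
            os.fsync(fd)        # message data on stable storage
        finally:
            os.close(fd)
        os.fsync(self.dirfd)    # directory entry on stable storage

    def close(self):
        os.close(self.dirfd)
```

As the thread goes on to discuss, this is still vulnerable to the error-reporting race: one fsync of the directory can consume an error that logically belonged to a different delivery.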
* Re: writing file to disk: not as easy as it looks
2008-12-02 14:04 ` Theodore Tso
@ 2008-12-02 15:26 ` Pavel Machek
2008-12-02 16:37 ` Theodore Tso
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2008-12-02 15:26 UTC (permalink / raw)
To: Theodore Tso, mikulas, clock, kernel list, aviro
On Tue 2008-12-02 09:04:39, Theodore Tso wrote:
> Are you trying to do this in C or shell? There is no "fsync" shell
> command as far as I know, which is what is confusing me. And whether
> "> file" checks for errors or not obviously depends on the application
> that is writing to stdout. Some might check for errors, some might
> not...
True. I'd prefer to use shell, but C is okay, too. An 'fsync' shell
command seems to exist on openSUSE; sorry for the confusion.
> Why do you feel the need to error-check "fsync ../.." and "fsync
> ../../..", et al.?
> I can understand why you might want to fsync the containing directory
> to make sure the directory entry got written to disk --- but if you're
> that paranoid, many modern filesystems use some kind of tree
> structure
If I'm trying to write foo/bar/baz/file, and the file/baz
inodes/dentries are written to disk but foo is not, the file still will
not be found under its full name, and recovering it from lost+found is
hard to do automatically.
> for the directory, and there is always the chance that a second later,
> in a b-tree node split, due to a disk error the directory entry gets
> lost.
If the disk loses data after acknowledging the write, all hope is lost.
Otherwise I expect the filesystem to preserve data I successfully
synced. (In the failed b-tree split case I'd expect the transaction
commit to fail because the new data could not be written; at that point
disk+journal should still contain all the data needed for recovery of
synced/old files, right?)
> What exactly are your requirements here, and what are you trying to
> do? What are you worried about? Most MTAs are quite happy settling
I'm trying to put my main filesystem on an SD card. The hp2133 has only
4GB of internal flash, so I got a 32GB SDHC card. Unfortunately, the SD
card on the hp is very easy to eject by mistake.
> for an fsync() to make sure the data made it to the disk safely, and
> the super-paranoid might also keep an open fd on the spool directory
> and fsync that too. That's been enough for most POSIX programs.
Well... I believe those POSIX programs are unsafe on removable media:
mta #1                        mta #2
cat > mail1
fsync mail1
                              cat > mail2
                              fsync mail2
(spool media removed)
                              fsync .  -> ERROR
                              (correctly reports mail2 as undelivered)
fsync .  -> success; the first fsync cleared the error condition
I'm trying to figure out why I'm losing data on flash. So far it seems
that both SD cards and USB flash disks have problems, that ext2/3 have
problems... and that the combination of ext2/3+flash can't even work in
theory :-(.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: writing file to disk: not as easy as it looks
2008-12-02 15:26 ` Pavel Machek
@ 2008-12-02 16:37 ` Theodore Tso
2008-12-02 17:22 ` Chris Friesen
2008-12-02 19:10 ` Folkert van Heusden
0 siblings, 2 replies; 25+ messages in thread
From: Theodore Tso @ 2008-12-02 16:37 UTC (permalink / raw)
To: Pavel Machek; +Cc: mikulas, clock, kernel list, aviro
On Tue, Dec 02, 2008 at 04:26:18PM +0100, Pavel Machek wrote:
> > I can understand why you might want to fsync the containing directory
> > to make sure the directory entry got written to disk --- but if you're
> > that paranoid, many modern filesystems use some kind of tree
> > structure
>
> If I'm trying to write foo/bar/baz/file, and the file/baz
> inodes/dentries are written to disk but foo is not, the file still will
> not be found under its full name, and recovering it from lost+found is
> hard to do automatically.
Only if you've freshly created the foo/bar/baz directories. If you
have, then yes, you'll need to sync each one. Normally the paranoid
programs do this after each mkdir call, though. For ext3/ext4, because
of the entangled-commit factor, fsync()'ing the file is sufficient, but
that's not something you can properly count upon.
> If the disk loses data after acknowledging the write, all hope is lost.
> Otherwise I expect the filesystem to preserve data I successfully
> synced.
>
> (In the failed b-tree split case I'd expect the transaction commit to
> fail because the new data could not be written; at that point
> disk+journal should still contain all the data needed for recovery of
> synced/old files, right?)
Not necessarily. For filesystems that do logical journalling (xfs,
jfs, et al.), the only thing written in the journal is the logical
change ("new dir entry 'file_that_causes_the_node_split'"). The
transaction commits *first*, and then the filesystem tries to update
the on-disk structures with the change, and it's only then that the
write fails. Data can very easily get lost.
Even for ext3/ext4, which do physical journalling, it's still the case
that the journal commits first, and it's only later, when the write
happens, that we write out the change. If the disk fails some of the
writes, it's possible to lose data, especially if the two blocks
involved in the node split are far apart and the write to the existing
old btree block fails.
> > What exactly are your requirements here, and what are you trying to
> > do? What are you worried about? Most MTAs are quite happy settling
>
> I'm trying to put my main filesystem on an SD card. The hp2133 has only
> 4GB of internal flash, so I got a 32GB SDHC card. Unfortunately, the SD
> card on the hp is very easy to eject by mistake.
So what you really want is some way of constantly flushing data to the
disk, probably after every single mkdir and every single close
operation. Of course, that has the tradeoff that your flash card will
get a lot of extra wear. I hate to say this, but have you considered
something like tape or velcro to secure the SD card?
- Ted
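The "sync each freshly created directory after each mkdir call" discipline mentioned above can be sketched as follows. This is an illustrative helper (the name `makedirs_durable` is not from the thread, and it is not a drop-in os.makedirs replacement); it assumes Linux, where a directory opened O_RDONLY can be fsync()ed:

```python
import os

def makedirs_durable(path):
    """Create each missing component of path, fsyncing the parent
    directory after every mkdir so the new entry survives a crash."""
    # Collect the missing components, deepest first.
    missing = []
    head = path
    while head and not os.path.isdir(head):
        missing.append(head)
        head = os.path.dirname(head)
    # Create them shallowest first, syncing the parent each time.
    for d in reversed(missing):
        os.mkdir(d)
        parent = os.path.dirname(d) or "."
        pfd = os.open(parent, os.O_RDONLY)
        try:
            os.fsync(pfd)   # persist the new directory entry in the parent
        finally:
            os.close(pfd)
```

This is exactly the extra-wear tradeoff Ted points out: every mkdir now costs a synchronous flush.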
* Re: writing file to disk: not as easy as it looks
2008-12-02 16:37 ` Theodore Tso
@ 2008-12-02 17:22 ` Chris Friesen
2008-12-02 20:55 ` Theodore Tso
1 sibling, 1 reply; 25+ messages in thread
From: Chris Friesen @ 2008-12-02 17:22 UTC (permalink / raw)
To: Theodore Tso; +Cc: Pavel Machek, mikulas, clock, kernel list, aviro
Theodore Tso wrote:
> Even for ext3/ext4 which is doing physical journalling, it's still the
> case that the journal commits first, and it's only later when the
> write happens that we write out the change. If the disk fails some of
> the writes, it's possible to lose data, especially if the two blocks
> involved in the node split are far apart, and the write to the
> existing old btree block fails.
Yikes. I was under the impression that once the journal hit the platter
then the data were safe (barring media corruption).
It seems like the more I learn about filesystems, the more failure modes
there are and the fewer guarantees can be made. It's amazing that
things work as well as they do...
Chris
* Re: writing file to disk: not as easy as it looks
2008-12-02 17:22 ` Chris Friesen
@ 2008-12-02 20:55 ` Theodore Tso
2008-12-02 22:44 ` Pavel Machek
2008-12-15 11:03 ` Pavel Machek
0 siblings, 2 replies; 25+ messages in thread
From: Theodore Tso @ 2008-12-02 20:55 UTC (permalink / raw)
To: Chris Friesen; +Cc: Pavel Machek, mikulas, clock, kernel list, aviro
On Tue, Dec 02, 2008 at 11:22:58AM -0600, Chris Friesen wrote:
> Theodore Tso wrote:
>
>> Even for ext3/ext4 which is doing physical journalling, it's still the
>> case that the journal commits first, and it's only later when the
>> write happens that we write out the change. If the disk fails some of
>> the writes, it's possible to lose data, especially if the two blocks
>> involved in the node split are far apart, and the write to the
>> existing old btree block fails.
>
> Yikes. I was under the impression that once the journal hit the platter
> then the data were safe (barring media corruption).
Well, this is a case of media corruption (or a cosmic ray hitting a
ribbon cable in the disk controller and sending the write to the wrong
location on disk, or someone bumping the server causing the disk head
to lift up a little higher than normal while it was writing the disk
sector, etc.). But it is a case of the hard drive misbehaving.
Heck, if you have a hiccup while writing an inode table block out to
disk (for example a power failure at just the wrong time), such that
the memory (which is more voltage-sensitive than hard drives) DMAs
garbage which gets written to the inode table, you could lose a large
number of adjacent inodes when the garbage gets splatted over the inode
table.
Ext3 tends to recover from this better than other filesystems, thanks
to the fact that it does physical block journalling, but you do pay for
this in terms of performance if you have a metadata-intensive workload,
because you're writing more bytes to the journal for each metadata
operation.
> It seems like the more I learn about filesystems, the more failure modes
> there are and the fewer guarantees can be made. It's amazing that
> things work as well as they do...
There are certainly things you can do. Put your file servers on UPSes.
Use RAID. Make backups. Do all three. :-)
- Ted
* Re: writing file to disk: not as easy as it looks
2008-12-02 20:55 ` Theodore Tso
@ 2008-12-02 22:44 ` Pavel Machek
2008-12-02 22:50 ` Pavel Machek
2008-12-03 5:07 ` Theodore Tso
1 sibling, 2 replies; 25+ messages in thread
From: Pavel Machek @ 2008-12-02 22:44 UTC (permalink / raw)
To: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro
On Tue 2008-12-02 15:55:58, Theodore Tso wrote:
> Well, this is a case of media corruption (or a cosmic ray hitting a
> ribbon cable in the disk controller and sending the write to the wrong
> location on disk, or someone bumping the server causing the disk head
> to lift up a little higher than normal while it was writing the disk
> sector, etc.). But it is a case of the hard drive misbehaving.
I could not parse this. A negation seems to be missing somewhere.
> Heck, if you have a hiccup while writing an inode table block out to
> disk (for example a power failure at just the wrong time), such that
> the memory (which is more voltage-sensitive than hard drives) DMAs
> garbage which gets written to the inode table, you could lose a large
> number of adjacent inodes when the garbage gets splatted over the
> inode table.
Ok, "memory failed before disk" is... bad hardware.
...but... you seem to be saying that modern filesystems can damage data
even on "sane" hardware. Let's define sane as:
1) If the disk says a sector was successfully written, it is so, until
you start writing to that sector again. (But the disk may say "error
writing". The filesystem should propagate that back to userland,
reliably. "Error writing" is extremely rare on modern disks, but can
happen if you run out of spare blocks.) (And if you ask for a sector
write, the sector is in an undefined state until the drive returns
success. Flashes behave like this -- reads return errors. Do disks?)
2) The connection to the disk either works or fails totally. Bit
errors are reliably detected at the connection level.
3) Power may fail at any time.
You seem to be saying that ext2/ext3 only work if these are met:
1) Power may fail at any time.
2) Writes are always successful.
3) The connection to the disk always works.
AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
without unmounting (missing fsync error propagation), and it is unsafe
to run ext2/ext3 on any flash-based storage with a block interface (SD
cards, flash sticks).
> Ext3 tends to recover from this better than other filesystems, thanks
> to the fact that it does physical block journalling, but you do pay
> for this in terms of performance if you have a metadata-intensive
> workload, because you're writing more bytes to the journal for each
> metadata operation.
And thanks for that! Actually I'd be willing to pay some more
performance to get reliability up.
> > It seems like the more I learn about filesystems, the more failure modes
> > there are and the fewer guarantees can be made. It's amazing that
> > things work as well as they do...
>
> There are certainly things you can do. Put your file servers on
> UPSes. Use RAID. Make backups. Do all three. :-)
I was almost stupid enough to move the primary copy of ~ and my Linux
trees to SD... I do have UPSes; unfortunately they are li-ion and I'm
running off them most of the time. I do have backups, but restoring
them all the time is boring & time-consuming.
I'll try to stick two MMC cards into the SD slot to make it RAID 1 :-).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: writing file to disk: not as easy as it looks
2008-12-02 22:44 ` Pavel Machek
@ 2008-12-02 22:50 ` Pavel Machek
2008-12-03 5:07 ` Theodore Tso
1 sibling, 0 replies; 25+ messages in thread
From: Pavel Machek @ 2008-12-02 22:50 UTC (permalink / raw)
To: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro
...and it is unsafe to run ext2/ext3 on any media that can return an
error on write. That includes perfectly working disk drives that have
just run out of spare blocks.
> AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
> without unmounting (missing fsync error propagation), and it is
To be fair, the bad fsync semantics (error only reported to the first
caller that asks) look like a fundamental Unix problem, nothing
ext2/3-specific...
> to run ext2/ext3 on any flash-based storage with a block interface (SD
> cards, flash sticks).
...and I'm aware of no filesystem that _can_ reliably work on SD
cards/USB flash sticks...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: writing file to disk: not as easy as it looks
2008-12-02 22:44 ` Pavel Machek
2008-12-02 22:50 ` Pavel Machek
@ 2008-12-03 5:07 ` Theodore Tso
2008-12-03 8:46 ` Pavel Machek
` (2 more replies)
1 sibling, 3 replies; 25+ messages in thread
From: Theodore Tso @ 2008-12-03 5:07 UTC (permalink / raw)
To: Pavel Machek; +Cc: Chris Friesen, mikulas, clock, kernel list, aviro
On Tue, Dec 02, 2008 at 11:44:03PM +0100, Pavel Machek wrote:
> > > Yikes. I was under the impression that once the journal hit the platter
> > > then the data were safe (barring media corruption).
> >
> > Well, this is a case of media corruption (or a cosmic ray hitting a
> > ribbon cable in the disk controller and sending the write to the wrong
> > location on disk, or someone bumping the server causing the disk head
> > to lift up a little higher than normal while it was writing the disk
> > sector, etc.). But it is a case of the hard drive misbehaving.
>
> I could not parse this. A negation seems to be missing somewhere.
I was agreeing with your original statement. Once the journal hits the
platter, the data is safe, barring hard drive malfunctions (not just
media corruption). I was just listing the many other types of hard
drive failures that could cause data loss.
> Ok, "memory failed before disk" is... bad hardware.
It's PC-class hardware. Live with it.
Back when SGI made their own hardware, they noticed this problem, so
they wired up their SGI machines with power-fail interrupts and extra
big capacitors in their power supplies, and when Irix got a power-fail
interrupt, it would frantically run around aborting DMA transfers to
avoid this particular problem. At least, that's what an old-timer SGI
engineer (who is unfortunately no longer at SGI) told me.
PC-class hardware doesn't have power-fail interrupts. Hence, my advice
to you is that if you use a filesystem that does logical journalling
--- better have a UPS.
> ...but... you seem to be saying that modern filesystems can damage
> data even on "sane" hardware.
The example I gave was one where a disk failure could cause a file that
had previously been successfully written to disk and fsync()'ed to be
damaged by another filesystem operation ***in the face of hard drive
failure***. Surely that is obvious. The most obvious case of that
might be if the disk controller gets confused and slams a data block
into the wrong location on disk (there's a reason why DIF includes the
sector number in its checksum, and why some enterprise databases do the
same thing in their tablespace blocks --- it happens often enough that
paranoid data-integrity engineers worry about it).
The example I gave, where a b-tree is doing a split and a failure
writing the b-tree causes ancillary damage to files referenced in the
b-tree node getting split, can happen with **any** filesystem. The
only thing that will save you here is a copy-on-write type filesystem,
such as WAFL or btrfs.
> You seem to be saying that ext2/ext3 only work if these are met:
>
> 1) Power may fail at any time.
Well, ext2/ext3 will work fine if the power is always reliable, too. :-)
> 2) Writes are always successful.
To the extent that write failures while writing filesystem metadata
can, if you are unlucky, be catastrophic, yeah. Fortunately such write
failures are normally fairly rare, but if you worry about such things,
RAID is the answer. As I said, I believe this is going to be true for
pretty much any update-in-place filesystem. It's always possible to
construct failure scenarios if the hardware is unreliable.
> 3) The connection to the disk always works.
>
> AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
> without unmounting (missing fsync error propagation), and it is unsafe
> to run ext2/ext3 on any flash-based storage with a block interface (SD
> cards, flash sticks).
The data on the disk before the connection is yanked should be safe
(although, as we mentioned in another thread, the flash drive itself
may not be happy if you are writing to the Flash Translation Layer at
the moment when power is cut; if that causes a previously written
sector to disappear, that's an example of a hardware failure that
**any** filesystem won't necessarily be able to recover from).
Your definition of "safe" seems to include making sure that every
process that may have previously touched a file or a directory gets an
error when it tries to fsync() that file or directory; given that
fsync() clears the error condition after it returns it, ext2/ext3 are
therefore "unsafe" by that definition. The reality is that most
applications don't do proper error checking, and even fewer actually
call fsync(), so if you are putting your root filesystem on a 32G flash
card that pops out easily due to hardware design issues, the question
of whether fsync() errors get properly propagated to all potentially
interested applications is the ***least*** of your worries.
- Ted
* Re: writing file to disk: not as easy as it looks
2008-12-03 5:07 ` Theodore Tso
@ 2008-12-03 8:46 ` Pavel Machek
2008-12-03 15:50 ` Mikulas Patocka
2008-12-03 16:42 ` Theodore Tso
2 siblings, 2 replies; 25+ messages in thread
From: Pavel Machek @ 2008-12-03 8:46 UTC (permalink / raw)
To: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro
On Wed 2008-12-03 00:07:09, Theodore Tso wrote:
> I was agreeing with your original statement. Once the journal hits the
> platter, the data is safe, barring hard drive malfunctions (not just
> media corruption). I was just listing the many other types of hard
> drive failures that could cause data loss.
Aha, ok, sorry for the confusion.
> > Ok, "memory failed before disk" is... bad hardware.
>
> It's PC-class hardware. Live with it.
>
> Back when SGI made their own hardware, they noticed this problem, so
> they wired up their SGI machines with power-fail interrupts and extra
> big capacitors in their power supplies, and when Irix got a power-fail
> interrupt, it would frantically run around aborting DMA transfers to
> avoid this particular problem. At least, that's what an old-timer SGI
> engineer (who is unfortunately no longer at SGI) told me.
>
> PC-class hardware doesn't have power-fail interrupts. Hence, my advice
> to you is that if you use a filesystem that does logical journalling
> --- better have a UPS.
Hmm, 'just avoid logical journalling' seems like a better solution :-).
> > ...but... you seem to be saying that modern filesystems can damage
> > data even on "sane" hardware.
>
> The example I gave was one where a disk failure could cause a file
> that had previously been successfully written to disk and fsync()'ed
> to be damaged by another filesystem operation ***in the face of hard
> drive failure***. Surely that is obvious.
Ok.
> The example I gave, where a b-tree is doing a split and a failure
> writing the b-tree causes ancillary damage to files referenced in the
> b-tree node getting split, can happen with **any** filesystem. The
> only thing that will save you here is a copy-on-write type
> filesystem, such as WAFL or btrfs.
Ext3-like physical journalling could be extended to handle write
failures (at a speed penalty), no? Write 'I will rewrite block A,
containing B, with C' into the journal... ok, I guess I should wait for
btrfs.
> > You seem to be saying that ext2/ext3 only work if these are met:
> >
> > 1) Power may fail at any time.
>
> Well, ext2/ext3 will work fine if the power is always reliable, too. :-)
:-) ok.
> > 2) Writes are always successful.
>
> To the extent that write failures while writing filesystem metadata
> can, if you are unlucky, be catastrophic, yeah. Fortunately such
> write failures are normally fairly rare, but if you worry about such
> things, RAID is the answer. As I said, I believe this is going to be
> true for pretty much any update-in-place filesystem. It's always
> possible to construct failure scenarios if the hardware is unreliable.
Ok.
> > 3) The connection to the disk always works.
> >
> > AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
> > without unmounting (missing fsync error propagation), and it is unsafe
> > to run ext2/ext3 on any flash-based storage with a block interface (SD
> > cards, flash sticks).
>
> The data on the disk before the connection is yanked should be safe
> (although, as we mentioned in another thread, the flash drive itself
> may not be happy if you are writing to the Flash Translation Layer at
> the moment when power is cut; if that causes a previously written
> sector to disappear, that's an example of a hardware failure that
> **any** filesystem won't necessarily be able to recover from).
>
> Your definition of "safe" seems to include making sure that every
> process that may have previously touched a file or a directory gets
> an error when it tries to fsync() that file or directory; given that
> fsync() clears the error condition after it returns it, ext2/ext3 are
> therefore "unsafe" by that definition.
Yes. fsync() seems surprisingly high on Rusty's classification of
broken interfaces ('impossible to use correctly').
I wonder if some reasonable solution exists? Marking the filesystem as
failed on the first write error is one of those (and the default for
ext2/3?). Did SGI/the big Unixen solve this somehow?
> The reality is that most applications don't do proper error checking,
> and even fewer actually call fsync(), so if you are putting your root
> filesystem on a 32G flash card that pops out easily due to hardware
> design issues, the question of whether fsync() errors get properly
> propagated to all potentially interested applications is the
> ***least*** of your worries.
Yes, most applications are bad. Yes, I should just glue the card into
the slot. No, the fsync interface does not look properly designed. No,
it is not causing me immediate problems (mount -o dirsync mostly works
around it). I wonder if a good, long-term solution exists...
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: writing file to disk: not as easy as it looks
2008-12-03 8:46 ` Pavel Machek
@ 2008-12-03 15:50 ` Mikulas Patocka
2008-12-03 15:54 ` Alan Cox
2008-12-03 16:42 ` Theodore Tso
1 sibling, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2008-12-03 15:50 UTC (permalink / raw)
To: Pavel Machek; +Cc: Theodore Tso, Chris Friesen, kernel list, aviro
> Yes. fsync() seems surprisingly high on Rusty's classification of
> broken interfaces ('impossible to use correctly').
>
> I wonder if some reasonable solution exists? Marking the filesystem as
> failed on the first write error is one of those (and the default for
> ext2/3?). Did SGI/the big Unixen solve this somehow?
When OS/2 hit a write error, it wrote to another location on the disk
and added the sector to its remap table. It could remap both metadata
and data this way. But today that is meaningless, because the same
algorithm is implemented in disk firmware. Write errors are reported
for disk connection problems, not media problems.
For connection problems, another solution may be to retry writes
indefinitely until the admin aborts it or reconnects the disk. But I
don't know how common these recoverable disk connection errors are.
> > The reality is that most applications don't do proper error checking,
> > and even fewer actually call fsync(), so if you are putting your root
> > filesystem on a 32G flash card that pops out easily due to hardware
> > design issues, the question of whether fsync() errors get properly
> > propagated to all potentially interested applications is the
> > ***least*** of your worries.
If you are running transaction-processing software, then it is a very
important worry. All database software is written with the assumption
that when the database reports a transaction committed, the changes are
permanent. Most business software can deal with the fact that the
server crashes, but can't deal with the database returning committed
status for a transaction that wasn't really committed.
Mikulas
* Re: writing file to disk: not as easy as it looks 2008-12-03 15:50 ` Mikulas Patocka @ 2008-12-03 15:54 ` Alan Cox 2008-12-03 17:37 ` Mikulas Patocka 0 siblings, 1 reply; 25+ messages in thread From: Alan Cox @ 2008-12-03 15:54 UTC (permalink / raw) To: Mikulas Patocka Cc: Pavel Machek, Theodore Tso, Chris Friesen, kernel list, aviro > implemented in disk firmware. Write errors are reported for disk > connection problems, not media problems. Media errors are reported for writes when the drive knows there are problems. That may be deferred to the cache flush afterwards but the information is still generated and shipped back to us - eventually. > For connection problems, another solution may be to retry writes > indefinitely until the admin aborts it or reconnects the disk. But I don't > know how common these recoverable disk connection errors are. CRC errors, lost IRQs and the like are retried by the midlayer and drivers and the error handling strategies will also try things like reducing link speeds on repeated CRC errors. Alan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: writing file to disk: not as easy as it looks 2008-12-03 15:54 ` Alan Cox @ 2008-12-03 17:37 ` Mikulas Patocka 2008-12-03 17:52 ` Alan Cox 2008-12-03 18:16 ` Pavel Machek 0 siblings, 2 replies; 25+ messages in thread From: Mikulas Patocka @ 2008-12-03 17:37 UTC (permalink / raw) To: Alan Cox; +Cc: Pavel Machek, Theodore Tso, Chris Friesen, kernel list, aviro On Wed, 3 Dec 2008, Alan Cox wrote: > > implemented in disk firmware. Write errors are reported for disk > > connection problems, not media problems. > > Media errors are reported for writes when the drive knows there are > problems. That may be deferred to the cache flush afterwards but the > information is still generated and shipped back to us - eventually. It is a question how to process cache flush errors correctly. A cache flush error reported for one filesystem may belong to the data written by other filesystem. So should some flag "there was an error" be set for all partitions and report it to every filesystem when it does cache flush? Or record the time of the last error in the driver and let the filesystem query it (so that the filesystem can tell if the error happened before or after it was mounted). BTW, how does SCSI report cache flush errors? Does it report them on the SYNCHRONIZE CACHE command or does it report them on deferred senses? Another point is that unless the sector remap table is full, there should be no cache flush errors. > > For connection problems, another solution may be to retry writes > > indefinitely until the admin aborts it or reconnects the disk. But I don't > > know how common these recoverable disk connection errors are. > > CRC errors, lost IRQs and the like are retried by the midlayer and > drivers and the error handling strategies will also try things like > reducing link speeds on repeated CRC errors. 
I meant for example loose cable or so --- does it make sense to retry indefinitely (until the admin plugs the cable or unmounts the filesystem) or return error to the filesystem after few retries? Mikulas > Alan > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: writing file to disk: not as easy as it looks 2008-12-03 17:37 ` Mikulas Patocka @ 2008-12-03 17:52 ` Alan Cox 2008-12-03 18:16 ` Pavel Machek 1 sibling, 0 replies; 25+ messages in thread From: Alan Cox @ 2008-12-03 17:52 UTC (permalink / raw) To: Mikulas Patocka Cc: Pavel Machek, Theodore Tso, Chris Friesen, kernel list, aviro > error reported for one filesystem may belong to the data written by other > filesystem. So should some flag "there was an error" be set for all > partitions and report it to every filesystem when it does cache flush? Or > record the time of the last error in the driver and let the filesystem > query it (so that the filesystem can tell if the error happened before or > after it was mounted). Good question - not working that high up the stack I don't know the right answer there. > > BTW. how does SCSI report cache flush errors? Does it report them on > SYNCHRONIZE CACHE command or does it report them on defered senses? Not sure. I thought the same way. > Another point is that unless the sector remap table is full, there should > be no cache flush errors. You can get them on partial writes to large sector devices, assorted errors on SSD devices and various 'catastrophic' errors. > I meant for example loose cable or so --- does it make sense to retry > indefinitely (until the admin plugs the cable or unmounts the filesystem) > or return error to the filesystem after few retries? At the low level we have to return an error so that RAID and the like can work. Alan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: writing file to disk: not as easy as it looks 2008-12-03 17:37 ` Mikulas Patocka 2008-12-03 17:52 ` Alan Cox @ 2008-12-03 18:16 ` Pavel Machek 2008-12-03 18:33 ` Mikulas Patocka 1 sibling, 1 reply; 25+ messages in thread From: Pavel Machek @ 2008-12-03 18:16 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Alan Cox, Theodore Tso, Chris Friesen, kernel list, aviro > > CRC errors, lost IRQs and the like are retried by the midlayer and > > drivers and the error handling strategies will also try things like > > reducing link speeds on repeated CRC errors. > > I meant for example loose cable or so --- does it make sense to retry > indefinitely (until the admin plugs the cable or unmounts the filesystem) > or return error to the filesystem after few retries? It is quite non-trivial to detect if it is "disk plugged back in" vs. "faulty disk unplugged, new one plugged in"... so I suppose automatic retry after failure of connection to disk is quite hard to get right. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: writing file to disk: not as easy as it looks 2008-12-03 18:16 ` Pavel Machek @ 2008-12-03 18:33 ` Mikulas Patocka 0 siblings, 0 replies; 25+ messages in thread From: Mikulas Patocka @ 2008-12-03 18:33 UTC (permalink / raw) To: Pavel Machek; +Cc: Alan Cox, Theodore Tso, Chris Friesen, kernel list, aviro On Wed, 3 Dec 2008, Pavel Machek wrote: > > > CRC errors, lost IRQs and the like are retried by the midlayer and > > > drivers and the error handling strategies will also try things like > > > reducing link speeds on repeated CRC errors. > > > > I meant for example loose cable or so --- does it make sense to retry > > indefinitely (until the admin plugs the cable or unmounts the filesystem) > > or return error to the filesystem after few retries? > > It is quite non-trivial to detect if it is "disk plugged back in" > vs. "faulty disk unplugged, new one plugged in"... so I suppose > automatic retry after failure of connection to disk is quite hard to > get right. Unless the SATA controller has the plug interrupt (very few have), there is no way for the kernel to detect that an old SATA disk was unplugged and a new one was plugged in. So the answer is that the admin must not hot-swap disk unless unmounting the filesystem or notifying the RAID layer about it. If you hot-swap softraid1/4/5 disk, you definitely damage data, because the softraid layer has no way to find out about the hotswap. Mikulas ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: writing file to disk: not as easy as it looks 2008-12-03 8:46 ` Pavel Machek 2008-12-03 15:50 ` Mikulas Patocka @ 2008-12-03 16:42 ` Theodore Tso 2008-12-03 17:43 ` Mikulas Patocka 1 sibling, 1 reply; 25+ messages in thread From: Theodore Tso @ 2008-12-03 16:42 UTC (permalink / raw) To: Pavel Machek; +Cc: Chris Friesen, mikulas, clock, kernel list, aviro On Wed, Dec 03, 2008 at 09:46:40AM +0100, Pavel Machek wrote: > Yes. fsync() seems surprisingly high on Rusty's list of broken > interfaces classification ('impossible to use correctly'). To be fair, fsync() was primarily intended for making sure that the data had been written to disk, and not necessarily as a way of making sure that write errors would be properly reflected back to the application. As you've pointed out, it's not really adequate for that purpose. - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: writing file to disk: not as easy as it looks 2008-12-03 16:42 ` Theodore Tso @ 2008-12-03 17:43 ` Mikulas Patocka 2008-12-03 18:26 ` Pavel Machek 0 siblings, 1 reply; 25+ messages in thread From: Mikulas Patocka @ 2008-12-03 17:43 UTC (permalink / raw) To: Theodore Tso; +Cc: Pavel Machek, Chris Friesen, kernel list, aviro > On Wed, Dec 03, 2008 at 09:46:40AM +0100, Pavel Machek wrote: > > Yes. fsync() seems surprisingly high on Rusty's list of broken > > interfaces classification ('impossible to use correctly'). BTW, where is that list? > To be fair, fsync() was primarily intended for making sure that the > data had been written to disk, and not necessarily as a way of making > sure that write errors would be properly reflected back to the > application. As you've pointed out, it's not really adequate for that > purpose. > > - Ted Well, what else do you want to use for databases? (where crashing the whole computer does less damage than pretending that a transaction was committed while it wasn't). Mikulas ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: writing file to disk: not as easy as it looks 2008-12-03 17:43 ` Mikulas Patocka @ 2008-12-03 18:26 ` Pavel Machek 0 siblings, 0 replies; 25+ messages in thread From: Pavel Machek @ 2008-12-03 18:26 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Theodore Tso, Chris Friesen, kernel list, aviro On Wed 2008-12-03 18:43:18, Mikulas Patocka wrote: > > On Wed, Dec 03, 2008 at 09:46:40AM +0100, Pavel Machek wrote: > > > Yes. fsync() seems surprisingly high on Rusty's list of broken > > > interfaces classification ('impossible to use correctly'). > > BTW, where is that list? http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html > > To be fair, fsync() was primarily intended for making sure that the > > data had been written to disk, and not necessarily as a way of making > > sure that write errors would be properly reflected back to the > > application. As you've pointed out, it's not really adequate for that > > purpose. > > Well, what else do you want to use for databases? (where crashing the > whole computer does less damage than pretending that a transaction was > committed while it wasn't). I guess we could modify fsync() to fail if there was _ever_ a write problem on the same filesystem. That would make it "safe". And as ext2/ext3 can't handle metadata write errors anyway... maybe that should be done? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 25+ messages in thread
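Pavel's proposal --- make fsync() fail if there was ever a write problem on the same filesystem --- amounts to a sticky per-filesystem error flag. A minimal user-space model of the idea follows; note this is only an illustrative sketch, not kernel code, and the `ErrorLatch` name and its methods are invented for the example:

```python
class ErrorLatch:
    """Sticky per-filesystem error state: once any write error is
    recorded, every later fsync() on the filesystem reports failure,
    no matter which process calls it."""

    def __init__(self):
        self.write_error_seen = False

    def record_write_error(self):
        # called by the block layer when a write to the media fails
        self.write_error_seen = True

    def fsync(self):
        # unlike the behavior Pavel complains about in the opening
        # mail (only the first caller sees the error), the latch
        # reports the failure to every caller, every time
        if self.write_error_seen:
            raise OSError("filesystem has seen a write error")


fs = ErrorLatch()
fs.fsync()                 # succeeds: no error recorded yet
fs.record_write_error()    # some write on this filesystem failed
```

With such a latch, a database doing fsync() after commit can no longer be told "committed" when an earlier write was silently lost; the cost is that a single bad write makes the whole filesystem refuse syncs until the admin intervenes, which is exactly the freeze-on-error trade-off Mikulas argues for later in the thread.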
* Re: writing file to disk: not as easy as it looks 2008-12-03 5:07 ` Theodore Tso 2008-12-03 8:46 ` Pavel Machek @ 2008-12-03 15:34 ` Mikulas Patocka 2008-12-15 10:24 ` [patch] " Pavel Machek 2 siblings, 0 replies; 25+ messages in thread From: Mikulas Patocka @ 2008-12-03 15:34 UTC (permalink / raw) To: Theodore Tso; +Cc: Pavel Machek, Chris Friesen, clock, kernel list, aviro On Wed, 3 Dec 2008, Theodore Tso wrote: > > Ok, "memory failed before disk" is ... bad hardware. > > It's PC class hardware. Live with it. Back when SGI made their own > hardware, they noticed this problem, and so they wired up their SGI > machines with powerfail interrupts, and extra big capacitors in their > power supplies, and when Irix got a powerfail interrupt, it would > frantically run around aborting DMA transfers to avoid this particular > problem. At least, that's what an old-timer SGI engineer (who is > unfortunately no longer at SGI) told me. I heard this too --- I just don't understand why they routed it to an interrupt and undertook the complicated sequence of aborting the commands in the kernel --- instead of simply routing it to the PCI reset line --- that would reset the controller and stop it from feeding data to disks. Also, if they had ECC memory, the chipset should detect unrecoverable garbage and respond with target-abort or full system reset and not feed bad data to the controller. > PC class hardware don't have power fail interrupts. Hence, my advice > to you is that if you use a filesystem that does logical journalling > --- better have a UPS. ATX has a PWR_OK pin that should be deasserted on power failure before the voltage drops. I don't know if motherboards use it --- but there should be no problem routing the pin to the chipset reset and stopping it before power goes low. > > ...but... you seem to be saying that modern filesystems can damage > > data even on "sane" hardware. 
> > The example I gave was one where a disk failure could cause a file > that had previously been successfully written to disk and fsync()'ed to > be damaged by another filesystem operation ***in the face of hard > drive failure***. Surely that is obvious. The most obvious case of > that might be if the disk controller gets confused and slams a data > block into the wrong location on disk (there's a reason why DIF > includes the sector number in its checksum and why some enterprise > databases do the same thing in their tablespace blocks --- it happens > often enough that paranoid data integrity engineers worry about it). You can read the block number back from the ATA disk after you write it and before you submit the command. Mikulas ^ permalink raw reply [flat|nested] 25+ messages in thread
* [patch] Re: writing file to disk: not as easy as it looks 2008-12-03 5:07 ` Theodore Tso 2008-12-03 8:46 ` Pavel Machek 2008-12-03 15:34 ` Mikulas Patocka @ 2008-12-15 10:24 ` Pavel Machek 2 siblings, 0 replies; 25+ messages in thread From: Pavel Machek @ 2008-12-15 10:24 UTC (permalink / raw) To: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro Cc: Andrew Morton Hi! > > > Heck, if you have a hiccup while writing an inode table block out to > > > disk (for example a power failure at just the wrong time), so the > > > memory (which is more voltage sensitive than hard drives) DMA's > > > garbage which gets written to the inode table, you could lose a large > > > number of adjacent inodes when garbage gets splatted over the inode > > > table. > > > > Ok, "memory failed before disk" is ... bad hardware. > > It's PC class hardware. Live with it. Back when SGI made their own > hardware, they noticed this problem, and so they wired up their SGI > machines with powerfail interrupts, and extra big capacitors in their Seems like bad hardware is very common indeed. Anyway, I guess it would be fair to document what ext3 expects from the disk subsystem for safe operation. Does that summary sound correct/fair?

Signed-off-by: Pavel Machek <pavel@suse.cz>

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..3855fbd 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +188,34 @@ mke2fs: create a ext3 partition with th
 debugfs: ext2 and ext3 file system debugger.
 ext2online: online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 expects the disk/storage subsystem to behave sanely. On a sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* writes to media never fail. Even if the disk returns an error condition
+  during write, ext3 can't handle that correctly, because success on fsync
+  was already returned when the data hit the journal.
+
+  (Fortunately, failing writes are very uncommon on disks, as they
+  have spare sectors they use when a write fails.)
+
+* either the whole sector is correctly written or nothing is written during
+  powerfail.
+
+  (Unfortunately, none of the cheap USB/SD flash cards I have seen behave
+  like this, and they are unsuitable for ext3. Because RAM tends to fail
+  faster than the rest of the system during powerfail, special hw killing
+  DMA transfers may be necessary. Not sure how common that problem
+  is on generic PC machines.)
+
+* either write caching is disabled, or the hw can do barriers and they are
+  enabled.
+
+  (Note that barriers are disabled by default; use the "barrier=1"
+  mount option after making sure the hw can support them.)
+
 References
 ==========

-- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: writing file to disk: not as easy as it looks 2008-12-02 20:55 ` Theodore Tso 2008-12-02 22:44 ` Pavel Machek @ 2008-12-15 11:03 ` Pavel Machek 2008-12-15 20:08 ` Folkert van Heusden 1 sibling, 1 reply; 25+ messages in thread From: Pavel Machek @ 2008-12-15 11:03 UTC (permalink / raw) To: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro On Tue 2008-12-02 15:55:58, Theodore Tso wrote: > On Tue, Dec 02, 2008 at 11:22:58AM -0600, Chris Friesen wrote: > > Theodore Tso wrote: > > > >> Even for ext3/ext4 which is doing physical journalling, it's still the > >> case that the journal commits first, and it's only later when the > >> write happens that we write out the change. If the disk fails some of > >> the writes, it's possible to lose data, especially if the two blocks > >> involved in the node split are far apart, and the write to the > >> existing old btree block fails. > > > > Yikes. I was under the impression that once the journal hit the platter > > then the data were safe (barring media corruption). > > Well, this is a case of media corruption (or a cosmic ray > hitting a ribbon cable in the disk controller sending the write to the > wrong location on disk, or someone bumping the server causing the disk > head to lift up a little higher than normal while it was writing the > disk sector, etc.). But it is a case of the hard drive misbehaving. > > Heck, if you have a hiccup while writing an inode table block out to > disk (for example a power failure at just the wrong time), so the ... > Ext3 tends to recover from this better than other filesystems, thanks > to the fact that it does physical block journalling, but you do pay > for this in terms of performance if you have a metadata-intensive > workload, because you're writing more bytes to the journal for each > metadata operation. > > > It seems like the more I learn about filesystems, the more failure modes > > there are and the fewer guarantees can be made. It's amazing that > > things work as well as they do... > > There are certainly things you can do. Put your fileservers on > UPS's. Use RAID. Make backups. Do all three. :-) Okay, so we pretty much know that ext3 journalling helps in the "user hit the reset button" case. (And we are pretty sure ext2/ext3 works in the "clean unmount" case). Otherwise *) kernel bug -> journalling does not help. *) sudden powerfail -> journalling works on SGI high-end hardware. It may or may not help on PC-class hardware. We already do periodic checks, even on ext3. Maybe we should do fsck more often if we see evidence of unclean shutdowns (because we know PC hardware is crap...). I actually have a patch somewhere; should I resurrect it? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 25+ messages in thread
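Pavel's "fsck more often after unclean shutdowns" suggestion is just a small scheduling policy. A hypothetical sketch of the decision rule follows --- the function name and thresholds are invented for illustration and are not the actual e2fsprogs logic:

```python
def should_force_fsck(mounts_since_check, unclean_shutdowns_since_check,
                      max_mounts=30, max_unclean=1):
    """Force a full check after max_mounts clean mounts, but much
    sooner once unclean shutdowns have been observed, since crashes
    and powerfails are when journal replay can silently go wrong."""
    if unclean_shutdowns_since_check >= max_unclean:
        # evidence of crashes/powerfail on crap PC hardware:
        # shrink the check interval to a fifth of the normal one
        return mounts_since_check >= max_mounts // 5
    return mounts_since_check >= max_mounts

print(should_force_fsck(10, 0))  # False: clean history, not due yet
print(should_force_fsck(10, 2))  # True: repeated unclean shutdowns
```

In real deployments the coarse version of this knob already exists as the maximum mount count and check interval stored in the superblock; the sketch only adds the "weight unclean shutdowns more heavily" idea from the mail above.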
* Re: writing file to disk: not as easy as it looks 2008-12-15 11:03 ` Pavel Machek @ 2008-12-15 20:08 ` Folkert van Heusden 0 siblings, 0 replies; 25+ messages in thread From: Folkert van Heusden @ 2008-12-15 20:08 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Chris Friesen, mikulas, clock, kernel list, aviro > > There are certainly things you can do. Put your fileservers on > > UPS's. Use RAID. Make backups. Do all three. :-) > > Okay, so we pretty much know that ext3 journalling helps in the "user hit > the reset button" case. (And we are pretty sure ext2/ext3 works in the > "clean unmount" case). Otherwise > > *) kernel bug -> journalling does not help. > > *) sudden powerfail -> journalling works on SGI high-end > hardware. It may or may not help on PC-class hardware. > > We already do periodic checks, even on ext3. Maybe we should do fsck > more often if we see evidence of unclean shutdowns (because we know > PC hardware is crap...). What we might need is on-line fsck, i.e. fsck while the fs is still mounted. Might be tricky to implement. Folkert van Heusden -- MultiTail is a versatile tool for watching logfiles and output of commands. Filtering, coloring, merging, diff-view, etc. http://www.vanheusden.com/multitail/ ---------------------------------------------------------------------- Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: writing file to disk: not as easy as it looks 2008-12-02 16:37 ` Theodore Tso 2008-12-02 17:22 ` Chris Friesen @ 2008-12-02 19:10 ` Folkert van Heusden 1 sibling, 0 replies; 25+ messages in thread From: Folkert van Heusden @ 2008-12-02 19:10 UTC (permalink / raw) To: Theodore Tso, Pavel Machek, mikulas, clock, kernel list, aviro > > If disk loses data after acknowledging the write, all hope is lost. > > Else I expect the filesystem to preserve data I successfully synced. > > (In the b-tree split failed case I'd expect the transaction commit to > > fail because new data could not be written; at that point > > disk+journal should still contain all the data needed for > > recovery of synced/old files, right?) > > Not necessarily. For filesystems that do logical journalling (i.e., > xfs, jfs, et al.), the only thing written in the journal is the > logical change (i.e., "new dir entry 'file_that_causes_the_node_split'"). > The transaction commits *first*, and then the filesystem tries to > update the filesystem with the change, and it's only then that > the write fails. Data can very easily get lost. > Even for ext3/ext4 which is doing physical journalling, it's still the So do I understand this right that ext3/4 are more robust? Folkert van Heusden -- MultiTail is a versatile tool for watching logfiles and output of commands. Filtering, coloring, merging, diff-view, etc. http://www.vanheusden.com/multitail/ ---------------------------------------------------------------------- Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: writing file to disk: not as easy as it looks 2008-12-02 9:40 writing file to disk: not as easy as it looks Pavel Machek 2008-12-02 14:04 ` Theodore Tso @ 2008-12-02 23:01 ` Mikulas Patocka 1 sibling, 0 replies; 25+ messages in thread From: Mikulas Patocka @ 2008-12-02 23:01 UTC (permalink / raw) To: Pavel Machek; +Cc: clock, kernel list, aviro On Tue, 2 Dec 2008, Pavel Machek wrote: > Actually, it looks like POSIX file interface is on the lowest step of > Rusty's scale: one that is impossible to use correctly. Yes, it seems > impossible to reliably&safely write file to disk under Linux. Double > plus uncool. > > So... how to write file to disk and wait for it to reach the stable > storage, with proper error handling? > > > file > > ...does not work, because it fails to check for errors. > > touch file || error_handling. > > Is not a lot better, unless you mount your filesystems "sync" > ... and no one does that. > > dd conv=fsync if=something of=file 2> /dev/zero || error_handling > > Is a bit better; not much, unless you mount your filesystems > "dirsync", because you have file data on disk, but they do not have > directory entry pointing to them. No one uses dirsync. > > So you need something like > > dd conv=fsync if=something of=file 2> /dev/zero || error_handling > fsync . || error_handling > fsync .. || error_handling > fsync ../.. || error_handling > fsync ../../.. || error_handling > > ... which mostly works... > > If you are alone on the filesystem... fsync only returns > errors to the first process. So if you have other process that > does fsync ., maybe it gets "your" error and you do not learn > of the problem. > > Question is... Is there a way that I missed and that actually works? > Pavel Hi! I think you are right about this. There's no way to fsync a directory reliably. My idea is that when the filesystem hits a metadata write error, it should stop committing any transactions and return an error to all writes. 
Write errors don't happen because of physical errors on media --- all current disks have sector reallocation. Write errors can happen because of bad cabling, voltage drops, firmware bugs, corruption of the PCI bus by a rogue card, etc. Most of these cases are fixable by the administrator. If you continue operating the filesystem after a write error (it doesn't matter if you report the error to userspace or not), you are risking filesystem damage (for example, cross-linked files if there was an error writing the bitmap) or a security breach (users reading blocks containing deleted data of other users). If you freeze the filesystem on a write error and do not allow further writes, the administrator can fix the underlying problem and the computer will run without any data damage and security problems. It happened to me just yesterday: my disk was spinning down & up repeatedly and returning errors because of insufficient power. My kernel kicked the spadfs filesystem off on the first write error and didn't allow any further commits. I fixed the problem by adding a second power supply and connecting some disks to it --- and now, after the incident, there are zero corruptions. Just imagine what massacre would happen on the filesystem if the kernel didn't kick it off and if it were operating under the condition "some writes get through - some not" unattended for some time. Mikulas ^ permalink raw reply [flat|nested] 25+ messages in thread
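The shell-level sequence Pavel opens the thread with (dd conv=fsync plus a hypothetical fsync-the-directory utility) maps directly onto the POSIX calls. A minimal sketch in Python follows (the same open/write/fsync sequence applies in C); note it only makes sure every call that can report an error is actually issued and checked --- it cannot dodge the "fsync reports the error to the first caller only" problem discussed above:

```python
import os

def durable_write(path, data):
    """Write data to path so that both the file contents and the new
    directory entry are pushed to stable storage; any I/O error
    surfaces as OSError instead of being silently dropped."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                        # os.write may be partial
            view = view[os.write(fd, view):]
        os.fsync(fd)                       # file data + inode to disk
    finally:
        os.close(fd)
    # fsync the parent directory so the directory entry is durable too
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

This is the pattern mail transfer agents use for spool files, per Ted's earlier message; as the thread establishes, fsyncing every ancestor directory is still not guaranteed against the metadata-failure cases Mikulas describes.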
end of thread, other threads:[~2008-12-15 20:09 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2008-12-02 9:40 writing file to disk: not as easy as it looks Pavel Machek 2008-12-02 14:04 ` Theodore Tso 2008-12-02 15:26 ` Pavel Machek 2008-12-02 16:37 ` Theodore Tso 2008-12-02 17:22 ` Chris Friesen 2008-12-02 20:55 ` Theodore Tso 2008-12-02 22:44 ` Pavel Machek 2008-12-02 22:50 ` Pavel Machek 2008-12-03 5:07 ` Theodore Tso 2008-12-03 8:46 ` Pavel Machek 2008-12-03 15:50 ` Mikulas Patocka 2008-12-03 15:54 ` Alan Cox 2008-12-03 17:37 ` Mikulas Patocka 2008-12-03 17:52 ` Alan Cox 2008-12-03 18:16 ` Pavel Machek 2008-12-03 18:33 ` Mikulas Patocka 2008-12-03 16:42 ` Theodore Tso 2008-12-03 17:43 ` Mikulas Patocka 2008-12-03 18:26 ` Pavel Machek 2008-12-03 15:34 ` Mikulas Patocka 2008-12-15 10:24 ` [patch] " Pavel Machek 2008-12-15 11:03 ` Pavel Machek 2008-12-15 20:08 ` Folkert van Heusden 2008-12-02 19:10 ` Folkert van Heusden 2008-12-02 23:01 ` Mikulas Patocka