* Re: fsync() errors is unsafe and risks data loss [not found] ` <20180410184356.GD3563@thunk.org> @ 2018-04-10 19:47 ` Martin Steigerwald 2018-04-18 16:52 ` J. Bruce Fields 0 siblings, 1 reply; 57+ messages in thread From: Martin Steigerwald @ 2018-04-10 19:47 UTC (permalink / raw) To: Theodore Y. Ts'o; +Cc: Joshua D. Drake, linux-ext4, linux-fsdevel Hi Theodore, Darrick, Joshua. CC´d fsdevel as it does not appear to be Ext4 specific to me (and to you as well, Theodore). Theodore Y. Ts'o - 10.04.18, 20:43: > This isn't actually an ext4 issue, but a long-standing VFS/MM issue. […] > First of all, what storage devices will do when they hit an exception > condition is quite non-deterministic. For example, the vast majority > of SSD's are not power fail certified. What this means is that if > they suffer a power drop while they are doing a GC, it is quite > possible for data written six months ago to be lost as a result. The > LBA could potentialy be far, far away from any LBA's that were > recently written, and there could have been multiple CACHE FLUSH > operations in the since the LBA in question was last written six > months ago. No matter; for a consumer-grade SSD, it's possible for > that LBA to be trashed after an unexpected power drop. Guh. I was not aware of this. I knew consumer-grade SSDs often do not have power loss protection, but still thought they´d handle garble collection in an atomic way. Sometimes I am tempted to sing an "all hardware is crap" song (starting with Meltdown/Spectre, then probably heading over to storage devices and so on… including firmware crap like Intel ME). > Next, the reason why fsync() has the behaviour that it does is one > ofhe the most common cases of I/O storage errors in buffered use > cases, certainly as seen by the community distros, is the user who > pulls out USB stick while it is in use. In that case, if there are > dirtied pages in the page cache, the question is what can you do? > Sooner or later the writes will time out, and if you leave the pages > dirty, then it effectively becomes a permanent memory leak. You can't > unmount the file system --- that requires writing out all of the pages > such that the dirty bit is turned off. And if you don't clear the > dirty bit on an I/O error, then they can never be cleaned. You can't > even re-insert the USB stick; the re-inserted USB stick will get a new > block device. Worse, when the USB stick was pulled, it will have > suffered a power drop, and see above about what could happen after a > power drop for non-power fail certified flash devices --- it goes > double for the cheap sh*t USB sticks found in the checkout aisle of > Micro Center. >From the original PostgreSQL mailing list thread I did not get on how exactly FreeBSD differs in behavior, compared to Linux. I am aware of one operating system that from a user point of view handles this in almost the right way IMHO: AmigaOS. When you removed a floppy disk from the drive while the OS was writing to it it showed a "You MUST insert volume somename into drive somedrive:" and if you did, it just continued writing. (The part that did not work well was that with the original filesystem if you did not insert it back, the whole disk was corrupted, usually to the point beyond repair, so the "MUST" was no joke.) In my opinion from a user´s point of view this is the only sane way to handle the premature removal of removable media. I have read of a GSoC project to implement something like this for NetBSD but I did not check on the outcome of it. But in MS-DOS I think there has been something similar, however MS-DOS is not an multitasking operating system as AmigaOS is. Implementing something like this for Linux would be quite a feat, I think, cause in addition to the implementation in the kernel, the desktop environment or whatever other userspace you use would need to handle it as well, so you´d have to adapt udev / udisks / probably Systemd. And probably this behavior needs to be restricted to anything that is really removable and even then in order to prevent memory exhaustion in case processes continue to write to an removed and not yet re-inserted USB harddisk the kernel would need to halt I/O processes which dirty I/O to this device. (I believe this is what AmigaOS did. It just blocked all subsequent I/O to the device still it was re-inserted. But then the I/O handling in that OS at that time is quite different from what Linux does.) > So this is the explanation for why Linux handles I/O errors by > clearing the dirty bit after reporting the error up to user space. > And why there is not eagerness to solve the problem simply by "don't > clear the dirty bit". For every one Postgres installation that might > have a better recover after an I/O error, there's probably a thousand > clueless Fedora and Ubuntu users who will have a much worse user > experience after a USB stick pull happens. I was not aware that flash based media may be as crappy as you hint at. >From my tests with AmigaOS 4.something or AmigaOS 3.9 + 3rd Party Poseidon USB stack the above mechanism worked even with USB sticks. I however did not test this often and I did not check for data corruption after a test. Thanks, -- Martin ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-10 19:47 ` fsync() errors is unsafe and risks data loss Martin Steigerwald @ 2018-04-18 16:52 ` J. Bruce Fields 2018-04-19 8:39 ` Christoph Hellwig 0 siblings, 1 reply; 57+ messages in thread From: J. Bruce Fields @ 2018-04-18 16:52 UTC (permalink / raw) To: Martin Steigerwald Cc: Theodore Y. Ts'o, Joshua D. Drake, linux-ext4, linux-fsdevel > Theodore Y. Ts'o - 10.04.18, 20:43: > > First of all, what storage devices will do when they hit an exception > > condition is quite non-deterministic. For example, the vast majority > > of SSD's are not power fail certified. What this means is that if > > they suffer a power drop while they are doing a GC, it is quite > > possible for data written six months ago to be lost as a result. The > > LBA could potentialy be far, far away from any LBA's that were > > recently written, and there could have been multiple CACHE FLUSH > > operations in the since the LBA in question was last written six > > months ago. No matter; for a consumer-grade SSD, it's possible for > > that LBA to be trashed after an unexpected power drop. Pointers to documentation or papers or anything? The only google results I can find for "power fail certified" are your posts. I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature. --b. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-18 16:52 ` J. Bruce Fields @ 2018-04-19 8:39 ` Christoph Hellwig 2018-04-19 14:10 ` J. Bruce Fields 0 siblings, 1 reply; 57+ messages in thread From: Christoph Hellwig @ 2018-04-19 8:39 UTC (permalink / raw) To: J. Bruce Fields Cc: Martin Steigerwald, Theodore Y. Ts'o, Joshua D. Drake, linux-ext4, linux-fsdevel On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote: > > Theodore Y. Ts'o - 10.04.18, 20:43: > > > First of all, what storage devices will do when they hit an exception > > > condition is quite non-deterministic. For example, the vast majority > > > of SSD's are not power fail certified. What this means is that if > > > they suffer a power drop while they are doing a GC, it is quite > > > possible for data written six months ago to be lost as a result. The > > > LBA could potentialy be far, far away from any LBA's that were > > > recently written, and there could have been multiple CACHE FLUSH > > > operations in the since the LBA in question was last written six > > > months ago. No matter; for a consumer-grade SSD, it's possible for > > > that LBA to be trashed after an unexpected power drop. > > Pointers to documentation or papers or anything? The only google > results I can find for "power fail certified" are your posts. > > I've always been confused by SSD power-loss protection, as nobody seems > completely clear whether it's a safety or a performance feature. Devices from reputable vendors should always be power fail safe, bugs notwithstanding. What power-loss protection in marketing slides usually means is that an SSD has a non-volatile write cache. That is once a write is ACKed data is persisted and no additional cache flush needs to be sent. This is a feature only available in expensive eterprise SSDs as the required capacitors are expensive. Cheaper consumer or boot driver SSDs have a volatile write cache, that is we need to do a separate cache flush to persist data (REQ_OP_FLUSH in Linux). But a reasonable implementation of those still won't corrupt previously written data, they will just lose the volatile write cache that hasn't been flushed. Occasional bugs, bad actors or other issues might still happen. > > --b. ---end quoted text--- ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-19 8:39 ` Christoph Hellwig @ 2018-04-19 14:10 ` J. Bruce Fields 0 siblings, 0 replies; 57+ messages in thread From: J. Bruce Fields @ 2018-04-19 14:10 UTC (permalink / raw) To: Christoph Hellwig Cc: Martin Steigerwald, Theodore Y. Ts'o, Joshua D. Drake, linux-ext4, linux-fsdevel On Thu, Apr 19, 2018 at 01:39:04AM -0700, Christoph Hellwig wrote: > On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote: > > > Theodore Y. Ts'o - 10.04.18, 20:43: > > > > First of all, what storage devices will do when they hit an exception > > > > condition is quite non-deterministic. For example, the vast majority > > > > of SSD's are not power fail certified. What this means is that if > > > > they suffer a power drop while they are doing a GC, it is quite > > > > possible for data written six months ago to be lost as a result. The > > > > LBA could potentialy be far, far away from any LBA's that were > > > > recently written, and there could have been multiple CACHE FLUSH > > > > operations in the since the LBA in question was last written six > > > > months ago. No matter; for a consumer-grade SSD, it's possible for > > > > that LBA to be trashed after an unexpected power drop. > > > > Pointers to documentation or papers or anything? The only google > > results I can find for "power fail certified" are your posts. > > > > I've always been confused by SSD power-loss protection, as nobody seems > > completely clear whether it's a safety or a performance feature. > > Devices from reputable vendors should always be power fail safe, bugs > notwithstanding. What power-loss protection in marketing slides usually > means is that an SSD has a non-volatile write cache. That is once a > write is ACKed data is persisted and no additional cache flush needs to > be sent. This is a feature only available in expensive eterprise SSDs > as the required capacitors are expensive. Cheaper consumer or boot > driver SSDs have a volatile write cache, that is we need to do a > separate cache flush to persist data (REQ_OP_FLUSH in Linux). But > a reasonable implementation of those still won't corrupt previously > written data, they will just lose the volatile write cache that hasn't > been flushed. Occasional bugs, bad actors or other issues might still > happen. Thanks! That was my understanding too. But then the name is terrible. As is all the vendor documentation I can find: https://insights.samsung.com/2016/03/22/power-loss-protection-how-ssds-are-protecting-data-integrity-white-paper/ "Power loss protection is a critical aspect of ensuring data integrity, especially in servers or data centers." https://www.intel.com/content/.../ssd-320-series-power-loss-data-protection-brief.pdf "Data safety features prepare for unexpected power-loss and protect system and user data." Why do they all neglect to mention that their consumer drives are also perfectly capable of well-defined behavior after power loss, just at the expense of flush performance? It's ridiculously confusing. --b. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss @ 2018-04-10 22:07 Andres Freund 2018-04-11 21:52 ` Andreas Dilger 2018-04-13 14:48 ` Matthew Wilcox 0 siblings, 2 replies; 57+ messages in thread From: Andres Freund @ 2018-04-10 22:07 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: linux-ext4, linux-fsdevel, Joshua D. Drake, Andreas Dilger (Sorry if I screwed up the thread structure - I'd to reconstruct the reply-to and CC list from web archive as I've not found a way to properly download an mbox or such of old content. Was subscribed to fsdevel but not ext4 lists) Hi, 2018-04-10 18:43:56 Ted wrote: > I'll try to give as unbiased a description as possible, but certainly > some of this is going to be filtered by my own biases no matter how > careful I can be. Same ;) 2018-04-10 18:43:56 Ted wrote: > So for better or for worse, there has not been as much investment in > buffered I/O and data robustness in the face of exception handling of > storage devices. That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced. 2018-04-10 18:43:56 Ted wrote: > So this is the explanation for why Linux handles I/O errors by > clearing the dirty bit after reporting the error up to user space. > And why there is not eagerness to solve the problem simply by "don't > clear the dirty bit". For every one Postgres installation that might > have a better recover after an I/O error, there's probably a thousand > clueless Fedora and Ubuntu users who will have a much worse user > experience after a USB stick pull happens. I don't think these necessarily are as contradictory goals as you paint them. At least in postgres' case we can deal with the fact that an fsync retry isn't going to fix the problem by reentering crash recovery or just shutting down - therefore we don't need to keep all the dirty buffers around. A per-inode or per-superblock bit that causes further fsyncs to fail would be entirely sufficent for that. While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed. Both in postgres, *and* a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file writtten. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing. You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says "I died" and set that instead of keeping per inode/whatever information. 2018-04-10 18:43:56 Ted wrote: > If you are aware of a company who is willing to pay to have a new > kernel feature implemented to meet your needs, we might be able to > refer you to a company or a consultant who might be able to do that > work. I find it a bit dissapointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable. 2018-04-10 19:44:48 Andreas wrote: > The confusion is whether fsync() is a "level" state (return error > forever if there were pages that could not be written), or an "edge" > state (return error only for any write failures since the previous > fsync() call). I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, *THAT'S* the big issue. Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory fsync everything, and *then* assume you're safe. But unless I severaly misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons. 2018-04-10 18:43:56 Ted wrote: > I think Anthony Iliopoulos was pretty clear in his multiple > descriptions in that thread of why the current behaviour is needed > (OOM of the whole system if dirty pages are kept around forever), but > many others were stuck on "I can't believe this is happening??? This > is totally unacceptable and every kernel needs to change to match my > expectations!!!" without looking at the larger picture of what is > practical to change and where the issue should best be fixed. Everone can participate in discussions... Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-10 22:07 Andres Freund @ 2018-04-11 21:52 ` Andreas Dilger 2018-04-12 0:09 ` Dave Chinner 2018-04-12 2:17 ` Andres Freund 2018-04-13 14:48 ` Matthew Wilcox 1 sibling, 2 replies; 57+ messages in thread From: Andreas Dilger @ 2018-04-11 21:52 UTC (permalink / raw) To: 20180410184356.GD3563 Cc: Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake, andres [-- Attachment #1: Type: text/plain, Size: 9079 bytes --] On Apr 10, 2018, at 4:07 PM, Andres Freund <andres@anarazel.de> wrote: > 2018-04-10 18:43:56 Ted wrote: >> So for better or for worse, there has not been as much investment in >> buffered I/O and data robustness in the face of exception handling of >> storage devices. > > That's a bit of a cop out. It's not just databases that care. Even more > basic tools like SCM, package managers and editors care whether they can > proper responses back from fsync that imply things actually were synced. Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error. If an editor tries to write a file, then calls fsync and gets an error, the user will enter a new pathname and retry the write. The package manager will assume the package installation failed, and uninstall the parts of the package that were already written. There is no way the filesystem can handle the package manager failure case, and keeping the pages dirty and retrying indefinitely may never work (e.g. disk is dead or disconnected, is a sparse volume without any free space, etc). This (IMHO) implies that the higher layer (which knows more about what the write failure implies) needs to deal with this. > 2018-04-10 18:43:56 Ted wrote: >> So this is the explanation for why Linux handles I/O errors by >> clearing the dirty bit after reporting the error up to user space. >> And why there is not eagerness to solve the problem simply by "don't >> clear the dirty bit". For every one Postgres installation that might >> have a better recover after an I/O error, there's probably a thousand >> clueless Fedora and Ubuntu users who will have a much worse user >> experience after a USB stick pull happens. > > I don't think these necessarily are as contradictory goals as you paint > them. At least in postgres' case we can deal with the fact that an > fsync retry isn't going to fix the problem by reentering crash recovery > or just shutting down - therefore we don't need to keep all the dirty > buffers around. A per-inode or per-superblock bit that causes further > fsyncs to fail would be entirely sufficent for that. > > While there's some differing opinions on the referenced postgres thread, > the fundamental problem isn't so much that a retry won't fix the > problem, it's that we might NEVER see the failure. If writeback happens > in the background, encounters an error, undirties the buffer, we will > happily carry on because we've never seen that. That's when we're > majorly screwed. I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)". If there is background data writeback *without an open file descriptor*, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back. Consider if there was a per-inode "there was once an error writing this inode" flag. Then fsync() would return an error on the inode forever, since there is no way in POSIX to clear this state, since it would need to be kept in case some new fd is opened on the inode and does an fsync() and wants the error to be returned. IMHO, the only alternative would be to keep the dirty pages in memory until they are written to disk. If that was not possible, what then? It would need a reboot to clear the dirty pages, or truncate the file (discarding all data)? > Both in postgres, *and* a lot of other applications, it's not at all > guaranteed to consistently have one FD open for every file > written. Therefore even the more recent per-fd errseq logic doesn't > guarantee that the failure will ever be seen by an application > diligently fsync()ing. ... only if the application closes all fds for the file before calling fsync. If any fd is kept open from the time of the failure, it will return the original error on fsync() (and then no longer return it). It's not that you need to keep every fd open forever. You could put them into a shared pool, and re-use them if the file is "re-opened", and call fsync on each fd before it is closed (because the pool is getting too big or because you want to flush the data for that file, or shut down the DB). That wouldn't require a huge re-architecture of PG, just a small library to handle the shared fd pool. That might even improve performance, because opening and closing files is itself not free, especially if you are working with remote filesystems. > You'd not even need to have per inode information or such in the case > that the block device goes away entirely. As the FS isn't generally > unmounted in that case, you could trivially keep a per-mount (or > superblock?) bit that says "I died" and set that instead of keeping per > inode/whatever information. The filesystem will definitely return an error in this case, I don't think this needs any kind of changes: int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb)))) return -EIO; > 2018-04-10 18:43:56 Ted wrote: >> If you are aware of a company who is willing to pay to have a new >> kernel feature implemented to meet your needs, we might be able to >> refer you to a company or a consultant who might be able to do that >> work. > > I find it a bit dissapointing response. I think it's fair to say that > for advanced features, but we're talking about the basic guarantee that > fsync actually does something even remotely reasonable. Linux (as PG) is run by people who develop it for their own needs, or are paid to develop it for the needs of others. Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. If PG developers want to add a tunable "keep dirty pages in RAM on IO failure", I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this. That said, _even_ if a fix was available for Linux tomorrow, it would be *years* before a majority of users would have it available on their system, that includes even the errseq mechanism that was landed a few months ago. That implies to me that you'd want something that fixes PG *now* so that it works around whatever (perceived) breakage exists in the Linux fsync() implementation. Since the thread indicates that non-Linux kernels have the same fsync() behaviour, it makes sense to do that even if the Linux fix was available. > 2018-04-10 19:44:48 Andreas wrote: >> The confusion is whether fsync() is a "level" state (return error >> forever if there were pages that could not be written), or an "edge" >> state (return error only for any write failures since the previous >> fsync() call). > > I don't think that's the full issue. We can deal with the fact that an > fsync failure is edge-triggered if there's a guarantee that every > process doing so would get it. The fact that one needs to have an FD > open from before any failing writes occurred to get a failure, *THAT'S* > the big issue. > > Beyond postgres, it's a pretty common approach to do work on a lot of > files without fsyncing, then iterate over the directory fsync > everything, and *then* assume you're safe. But unless I severaly > misunderstand something that'd only be safe if you kept an FD for > every file open, which isn't realistic for pretty obvious reasons. I can't say how common or uncommon such a workload is, though PG is the only application that I've heard of doing it, and I've been working on filesystems for 20 years. I'm a bit surprised that anyone expects fsync() on a newly-opened fd to have any state from write() calls that predate the open. I can understand fsync() returning an error for any IO that happens within the context of that fsync(), but how far should it go back for reporting errors on that file? Forever? The only way to clear the error would be to reboot the system, since I'm not aware of any existing POSIX code to clear such an error. > 2018-04-10 19:44:48 Andreas wrote: >> I think Anthony Iliopoulos was pretty clear in his multiple >> descriptions in that thread of why the current behaviour is needed >> (OOM of the whole system if dirty pages are kept around forever), but >> many others were stuck on "I can't believe this is happening??? This >> is totally unacceptable and every kernel needs to change to match my >> expectations!!!" without looking at the larger picture of what is >> practical to change and where the issue should best be fixed. > > Everone can participate in discussions... > > Greetings, > > Andres Freund Cheers, Andreas [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 873 bytes --] ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-11 21:52 ` Andreas Dilger @ 2018-04-12 0:09 ` Dave Chinner 2018-04-12 2:32 ` Andres Freund 2018-04-12 2:17 ` Andres Freund 1 sibling, 1 reply; 57+ messages in thread From: Dave Chinner @ 2018-04-12 0:09 UTC (permalink / raw) To: Andreas Dilger Cc: 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake, andres On Wed, Apr 11, 2018 at 03:52:44PM -0600, Andreas Dilger wrote: > On Apr 10, 2018, at 4:07 PM, Andres Freund <andres@anarazel.de> wrote: > > 2018-04-10 18:43:56 Ted wrote: > >> So for better or for worse, there has not been as much investment in > >> buffered I/O and data robustness in the face of exception handling of > >> storage devices. > > > > That's a bit of a cop out. It's not just databases that care. Even more > > basic tools like SCM, package managers and editors care whether they can > > proper responses back from fsync that imply things actually were synced. > > Sure, but it is mostly PG that is doing (IMHO) crazy things like writing > to thousands(?) of files, closing the file descriptors, then expecting > fsync() on a newly-opened fd to return a historical error. Yeah, this seems like a recipe for disaster, especially on cross-platform code where every OS platform behaves differently and almost never to expectation. And speaking of "behaving differently to expectations", nobody has mentioned that close() can also return write errors. Hence if you do write - close - open - fsync the the write error might get reported on close, not fsync. IOWs, the assumption that "async writeback errors will persist across close to open" is fundamentally broken to begin with. It's even documented as a slient data loss vector in the close(2) man page: $ man 2 close ..... Dealing with error returns from close() A careful programmer will check the return value of close(), since it is quite possible that errors on a previous write(2) operation are reported only on the final close() that releases the open file description. Failing to check the return value when closing a file may lead to silent loss of data. This can especially be observed with NFS and with disk quota. Yeah, ensuring data integrity in the face of IO errors is a really hard problem. :/ ---- To pound the broken record: there are many good reasons why Linux filesystem developers have said "you should use direct IO" to the PG devs each time we have this "the kernel doesn't do <complex things PG needs>" discussion. In this case, robust IO error reporting is easy with DIO. It's one of the reasons most of the high performance database engines are either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also being driven by the availability of high performance, high IOPS solid state storage where buffering in RAM to optimise IO patterns and throughput provides no real performance benefit. Using the AIO+DIO infrastructure ensures errors are reported for the specific write that fails at failure time (i.e. in the aio completion event for the specific IO), yet high IO throughput can be maintained without the application needing it's own threading infrastructure to prevent blocking. This means the application doesn't have to guess where the write error occurred to retry/recover, have to handle async write errors on close(), have to use fsync() to gather write IO errors and then infer where the IO failure was, or require kernels on every supported platform to jump through hoops to try to do exactly the right thing in error conditions for everyone in all circumstances at all times.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 0:09 ` Dave Chinner @ 2018-04-12 2:32 ` Andres Freund 2018-04-12 2:51 ` Andres Freund ` (3 more replies) 0 siblings, 4 replies; 57+ messages in thread From: Andres Freund @ 2018-04-12 2:32 UTC (permalink / raw) To: Dave Chinner Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake Hi, On 2018-04-12 10:09:16 +1000, Dave Chinner wrote: > To pound the broken record: there are many good reasons why Linux > filesystem developers have said "you should use direct IO" to the PG > devs each time we have this "the kernel doesn't do <complex things > PG needs>" discussion. I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to do be able to do the same on windows (solaris, hpux, aix, ...) etc. And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al). > In this case, robust IO error reporting is easy with DIO. It's one > of the reasons most of the high performance database engines are > either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use > O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also > being driven by the availability of high performance, high IOPS > solid state storage where buffering in RAM to optimise IO patterns > and throughput provides no real performance benefit. > > Using the AIO+DIO infrastructure ensures errors are reported for the > specific write that fails at failure time (i.e. in the aio > completion event for the specific IO), yet high IO throughput can be > maintained without the application needing it's own threading > infrastructure to prevent blocking. > > This means the application doesn't have to guess where the write > error occurred to retry/recover, have to handle async write errors > on close(), have to use fsync() to gather write IO errors and then > infer where the IO failure was, or require kernels on every > supported platform to jump through hoops to try to do exactly the > right thing in error conditions for everyone in all circumstances at > all times.... Most of that sounds like a good thing to do, but you got to recognize that that's a lot of linux specific code. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 2:32 ` Andres Freund @ 2018-04-12 2:51 ` Andres Freund 2018-04-12 5:09 ` Theodore Y. Ts'o ` (2 subsequent siblings) 3 siblings, 0 replies; 57+ messages in thread From: Andres Freund @ 2018-04-12 2:51 UTC (permalink / raw) To: Dave Chinner Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake Hi, On 2018-04-11 19:32:21 -0700, Andres Freund wrote: > And there's cases where that just doesn't help at all. Being able to > untar a database from backup / archive / timetravel / whatnot, and then > fsyncing the directory tree to make sure it's actually safe, is really > not an insane idea. Or even just cp -r ing it, and then starting up a > copy of the database. What you're saying is that none of that is doable > in a safe way, unless you use special-case DIO using tooling for the > whole operation (or at least tools that fsync carefully without ever > closing a fd, which certainly isn't the case for cp et al). And before somebody argues that that's a too small window to trigger the problem realistically: Restoring large databases happens pretty commonly (for new replicas, testcases, or actual fatal issues), takes time, and it's where a lot of storage is actually written to for the first time in a while, so it's far from unlikely to trigger bad block errors or such. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 2:32 ` Andres Freund 2018-04-12 2:51 ` Andres Freund @ 2018-04-12 5:09 ` Theodore Y. Ts'o 2018-04-12 5:45 ` Dave Chinner 2018-04-12 10:19 ` Lukas Czerner 3 siblings, 0 replies; 57+ messages in thread From: Theodore Y. Ts'o @ 2018-04-12 5:09 UTC (permalink / raw) To: Andres Freund Cc: Dave Chinner, Andreas Dilger, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote: > > Most of that sounds like a good thing to do, but you got to recognize > that that's a lot of linux specific code. I know it's not what PG has chosen, but realistically all of the other major databases and userspace based storage systems have used DIO precisely *because* it's the way to avoid OS-specific behavior or require OS-specific code. DIO is simple, and pretty much the same everywhere. In contrast, the exact details of how buffered I/O workrs can be quite different on different OS's. This is especially true if you take performance related details (e.g., the cleaning algorithm, how pages get chosen for eviction, etc.) As I read the PG-hackers thread, I thought I saw acknowledgement that some of the behaviors you don't like with Linux also show up on other Unix or Unix-like systems? - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 2:32 ` Andres Freund 2018-04-12 2:51 ` Andres Freund 2018-04-12 5:09 ` Theodore Y. Ts'o @ 2018-04-12 5:45 ` Dave Chinner 2018-04-12 11:24 ` Jeff Layton 2018-04-12 10:19 ` Lukas Czerner 3 siblings, 1 reply; 57+ messages in thread From: Dave Chinner @ 2018-04-12 5:45 UTC (permalink / raw) To: Andres Freund Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote: > Hi, > > On 2018-04-12 10:09:16 +1000, Dave Chinner wrote: > > To pound the broken record: there are many good reasons why Linux > > filesystem developers have said "you should use direct IO" to the PG > > devs each time we have this "the kernel doesn't do <complex things > > PG needs>" discussion. > > I personally am on board with doing that. But you also gotta recognize > that an efficient DIO usage is a metric ton of work, and you need a > large amount of differing logic for different platforms. It's just not > realistic to do so for every platform. Postgres is developed by a small > number of people, isn't VC backed etc. The amount of resources we can > throw at something is fairly limited. I'm hoping to work on adding > linux DIO support to pg, but I'm sure as hell not going to do be able to > do the same on windows (solaris, hpux, aix, ...) etc. > > And there's cases where that just doesn't help at all. Being able to > untar a database from backup / archive / timetravel / whatnot, and then > fsyncing the directory tree to make sure it's actually safe, is really > not an insane idea. Yes it is. This is what syncfs() is for - making sure a large amount of of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible. > Or even just cp -r ing it, and then starting up a > copy of the database. What you're saying is that none of that is doable > in a safe way, unless you use special-case DIO using tooling for the > whole operation (or at least tools that fsync carefully without ever > closing a fd, which certainly isn't the case for cp et al). No, Just saying fsyncing individual files and directories is about the most inefficient way you could possible go about doing this. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 5:45 ` Dave Chinner @ 2018-04-12 11:24 ` Jeff Layton 2018-04-12 21:11 ` Andres Freund 0 siblings, 1 reply; 57+ messages in thread From: Jeff Layton @ 2018-04-12 11:24 UTC (permalink / raw) To: Dave Chinner, Andres Freund Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote: > On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote: > > Hi, > > > > On 2018-04-12 10:09:16 +1000, Dave Chinner wrote: > > > To pound the broken record: there are many good reasons why Linux > > > filesystem developers have said "you should use direct IO" to the PG > > > devs each time we have this "the kernel doesn't do <complex things > > > PG needs>" discussion. > > > > I personally am on board with doing that. But you also gotta recognize > > that an efficient DIO usage is a metric ton of work, and you need a > > large amount of differing logic for different platforms. It's just not > > realistic to do so for every platform. Postgres is developed by a small > > number of people, isn't VC backed etc. The amount of resources we can > > throw at something is fairly limited. I'm hoping to work on adding > > linux DIO support to pg, but I'm sure as hell not going to do be able to > > do the same on windows (solaris, hpux, aix, ...) etc. > > > > And there's cases where that just doesn't help at all. Being able to > > untar a database from backup / archive / timetravel / whatnot, and then > > fsyncing the directory tree to make sure it's actually safe, is really > > not an insane idea. > > Yes it is. > > This is what syncfs() is for - making sure a large amount of of data > and metadata spread across many files and subdirectories in a single > filesystem is pushed to stable storage in the most efficient manner > possible. > Just note that the error return from syncfs is somewhat iffy. It doesn't necessarily return an error when one inode fails to be written back. I think it mainly returns errors when you get a metadata writeback error. > > Or even just cp -r ing it, and then starting up a > > copy of the database. What you're saying is that none of that is doable > > in a safe way, unless you use special-case DIO using tooling for the > > whole operation (or at least tools that fsync carefully without ever > > closing a fd, which certainly isn't the case for cp et al). > > No, Just saying fsyncing individual files and directories is about > the most inefficient way you could possible go about doing this. > You can still use syncfs but what you'd probably have to do is call syncfs while you still hold all of the fd's open, and then fsync each one afterward to ensure that they all got written back properly. That should work as you'd expect. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 11:24 ` Jeff Layton @ 2018-04-12 21:11 ` Andres Freund 0 siblings, 0 replies; 57+ messages in thread From: Andres Freund @ 2018-04-12 21:11 UTC (permalink / raw) To: Jeff Layton Cc: Dave Chinner, Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On 2018-04-12 07:24:12 -0400, Jeff Layton wrote: > On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote: > > On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote: > > > Hi, > > > > > > On 2018-04-12 10:09:16 +1000, Dave Chinner wrote: > > > > To pound the broken record: there are many good reasons why Linux > > > > filesystem developers have said "you should use direct IO" to the PG > > > > devs each time we have this "the kernel doesn't do <complex things > > > > PG needs>" discussion. > > > > > > I personally am on board with doing that. But you also gotta recognize > > > that an efficient DIO usage is a metric ton of work, and you need a > > > large amount of differing logic for different platforms. It's just not > > > realistic to do so for every platform. Postgres is developed by a small > > > number of people, isn't VC backed etc. The amount of resources we can > > > throw at something is fairly limited. I'm hoping to work on adding > > > linux DIO support to pg, but I'm sure as hell not going to do be able to > > > do the same on windows (solaris, hpux, aix, ...) etc. > > > > > > And there's cases where that just doesn't help at all. Being able to > > > untar a database from backup / archive / timetravel / whatnot, and then > > > fsyncing the directory tree to make sure it's actually safe, is really > > > not an insane idea. > > > > Yes it is. > > > > This is what syncfs() is for - making sure a large amount of of data > > and metadata spread across many files and subdirectories in a single > > filesystem is pushed to stable storage in the most efficient manner > > possible. syncfs isn't standardized, it operates on an entire filesystem (thus writing out unnecessary stuff), it has no meaningful documentation of it's return codes. Yes, using syncfs() might better performancewise, but it doesn't seem like it actually solves anything, performance aside: > Just note that the error return from syncfs is somewhat iffy. It doesn't > necessarily return an error when one inode fails to be written back. I > think it mainly returns errors when you get a metadata writeback error. > You can still use syncfs but what you'd probably have to do is call > syncfs while you still hold all of the fd's open, and then fsync each > one afterward to ensure that they all got written back properly. That > should work as you'd expect. Which again doesn't allow one to use any non-bespoke tooling (like tar or whatnot). And it means you'll have to call syncfs() every few hundred files, because you'll obviously run into filehandle limitations. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 2:32 ` Andres Freund ` (2 preceding siblings ...) 2018-04-12 5:45 ` Dave Chinner @ 2018-04-12 10:19 ` Lukas Czerner 2018-04-12 19:46 ` Andres Freund 3 siblings, 1 reply; 57+ messages in thread From: Lukas Czerner @ 2018-04-12 10:19 UTC (permalink / raw) To: Andres Freund Cc: Dave Chinner, Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote: > And there's cases where that just doesn't help at all. Being able to > untar a database from backup / archive / timetravel / whatnot, and then > fsyncing the directory tree to make sure it's actually safe, is really > not an insane idea. Or even just cp -r ing it, and then starting up a > copy of the database. What you're saying is that none of that is doable > in a safe way, unless you use special-case DIO using tooling for the > whole operation (or at least tools that fsync carefully without ever > closing a fd, which certainly isn't the case for cp et al). Does not seem like a problem to me, just checksum the thing if you really need to be extra safe. You should probably be doing it anyway if you backup / archive / timetravel / whatnot. -Lukas ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 10:19 ` Lukas Czerner @ 2018-04-12 19:46 ` Andres Freund 0 siblings, 0 replies; 57+ messages in thread From: Andres Freund @ 2018-04-12 19:46 UTC (permalink / raw) To: Lukas Czerner Cc: Dave Chinner, Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake Hi, On 2018-04-12 12:19:26 +0200, Lukas Czerner wrote: > On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote: > > > And there's cases where that just doesn't help at all. Being able to > > untar a database from backup / archive / timetravel / whatnot, and then > > fsyncing the directory tree to make sure it's actually safe, is really > > not an insane idea. Or even just cp -r ing it, and then starting up a > > copy of the database. What you're saying is that none of that is doable > > in a safe way, unless you use special-case DIO using tooling for the > > whole operation (or at least tools that fsync carefully without ever > > closing a fd, which certainly isn't the case for cp et al). > > Does not seem like a problem to me, just checksum the thing if you > really need to be extra safe. You should probably be doing it anyway if > you backup / archive / timetravel / whatnot. That doesn't really help, unless you want to sync() and then re-read all the data to make sure it's the same. Rereading multi-TB backups just to know whether there was an error that the OS knew about isn't particularly fun. Without verifying after sync it's not going to improve the situation measurably, you're still only going to discover that $data isn't available when it's needed. What you're saying here is that there's no way to use standard linux tools to manipulate files and know whether it failed, without filtering kernel logs for IO errors. Or am I missing something? Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-11 21:52 ` Andreas Dilger 2018-04-12 0:09 ` Dave Chinner @ 2018-04-12 2:17 ` Andres Freund 2018-04-12 3:02 ` Matthew Wilcox ` (2 more replies) 1 sibling, 3 replies; 57+ messages in thread From: Andres Freund @ 2018-04-12 2:17 UTC (permalink / raw) To: Andreas Dilger Cc: 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake Hi, On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote: > On Apr 10, 2018, at 4:07 PM, Andres Freund <andres@anarazel.de> wrote: > > 2018-04-10 18:43:56 Ted wrote: > >> So for better or for worse, there has not been as much investment in > >> buffered I/O and data robustness in the face of exception handling of > >> storage devices. > > > > That's a bit of a cop out. It's not just databases that care. Even more > > basic tools like SCM, package managers and editors care whether they can > > proper responses back from fsync that imply things actually were synced. > > Sure, but it is mostly PG that is doing (IMHO) crazy things like writing > to thousands(?) of files, closing the file descriptors, then expecting > fsync() on a newly-opened fd to return a historical error. It's not just postgres. dpkg (underlying apt, on debian derived distros) to take an example I just randomly guessed, does too: /* We want to guarantee the extracted files are on the disk, so that the * subsequent renames to the info database do not end up with old or zero * length files in case of a system crash. As neither dpkg-deb nor tar do * explicit fsync()s, we have to do them here. * XXX: This could be avoided by switching to an internal tar extractor. */ dir_sync_contents(cidir); (a bunch of other places too) Especially on ext3 but also on newer filesystems it's performancewise entirely infeasible to fsync() every single file individually - the performance becomes entirely attrocious if you do that. I think there's some legitimate arguments that a database should use direct IO (more on that as a reply to David), but claiming that all sorts of random utilities need to use DIO with buffering etc is just insane. > If an editor tries to write a file, then calls fsync and gets an > error, the user will enter a new pathname and retry the write. The > package manager will assume the package installation failed, and > uninstall the parts of the package that were already written. Except that they won't notice that they got a failure, at least in the dpkg case. And happily continue installing corrupted data > There is no way the filesystem can handle the package manager failure case, > and keeping the pages dirty and retrying indefinitely may never work (e.g. > disk is dead or disconnected, is a sparse volume without any free space, > etc). This (IMHO) implies that the higher layer (which knows more about > what the write failure implies) needs to deal with this. Yea, I agree that'd not be sane. As far as I understand the dpkg code (all of 10min reading it), that'd also be unnecessary. It can abort the installation, but only if it detects the error. Which isn't happening. > > While there's some differing opinions on the referenced postgres thread, > > the fundamental problem isn't so much that a retry won't fix the > > problem, it's that we might NEVER see the failure. If writeback happens > > in the background, encounters an error, undirties the buffer, we will > > happily carry on because we've never seen that. That's when we're > > majorly screwed. > > > I think there are two issues here - "fsync() on an fd that was just opened" > and "persistent error state (without keeping dirty pages in memory)". > > If there is background data writeback *without an open file descriptor*, > there is no mechanism for the kernel to return an error to any application > which may exist, or may not ever come back. And that's *horrible*. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus. Or even more extreme, you untar/zip/git clone a directory. Then do a sync. And you don't know whether anything actually succeeded. > Consider if there was a per-inode "there was once an error writing this > inode" flag. Then fsync() would return an error on the inode forever, > since there is no way in POSIX to clear this state, since it would need > to be kept in case some new fd is opened on the inode and does an fsync() > and wants the error to be returned. The data in the file also is corrupt. Having to unmount or delete the file to reset the fact that it can't safely be assumed to be on disk isn't insane. > > Both in postgres, *and* a lot of other applications, it's not at all > > guaranteed to consistently have one FD open for every file > > written. Therefore even the more recent per-fd errseq logic doesn't > > guarantee that the failure will ever be seen by an application > > diligently fsync()ing. > > ... only if the application closes all fds for the file before calling > fsync. If any fd is kept open from the time of the failure, it will > return the original error on fsync() (and then no longer return it). > > It's not that you need to keep every fd open forever. You could put them > into a shared pool, and re-use them if the file is "re-opened", and call > fsync on each fd before it is closed (because the pool is getting too big > or because you want to flush the data for that file, or shut down the DB). > That wouldn't require a huge re-architecture of PG, just a small library > to handle the shared fd pool. Except that postgres uses multiple processes. And works on a lot of architectures. If we started to fsync all opened files on process exit our users would *lynch* us. We'd need a complicated scheme that sends processes across sockets between processes, then deduplicate them on the receiving side, somehow figuring out which is the oldest filedescriptors (handling clockdrift safely). Note that it'd be perfectly fine that we've "thrown away" the buffer contents if we'd get notified that the fsync failed. We could just do WAL replay, and restore the contents (just was we do after crashes and/or for replication). > That might even improve performance, because opening and closing files > is itself not free, especially if you are working with remote filesystems. There's already a per-process cache of open files. > > You'd not even need to have per inode information or such in the case > > that the block device goes away entirely. As the FS isn't generally > > unmounted in that case, you could trivially keep a per-mount (or > > superblock?) bit that says "I died" and set that instead of keeping per > > inode/whatever information. > > The filesystem will definitely return an error in this case, I don't > think this needs any kind of changes: > > int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > { > if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb)))) > return -EIO; Well, I'm making that argument because several people argued that throwing away buffer contents in this case is the only way to not cause OOMs, and that that's incompatible with reporting errors. It's clearly not... > > 2018-04-10 18:43:56 Ted wrote: > >> If you are aware of a company who is willing to pay to have a new > >> kernel feature implemented to meet your needs, we might be able to > >> refer you to a company or a consultant who might be able to do that > >> work. > > > > I find it a bit dissapointing response. I think it's fair to say that > > for advanced features, but we're talking about the basic guarantee that > > fsync actually does something even remotely reasonable. > > Linux (as PG) is run by people who develop it for their own needs, or > are paid to develop it for the needs of others. Sure. > Everyone already has too much work to do, so you need to find someone > who has an interest in fixing this (IMHO very peculiar) use case. If > PG developers want to add a tunable "keep dirty pages in RAM on IO > failure", I don't think that it would be too hard for someone to do. > It might be harder to convince some of the kernel maintainers to > accept it, and I've been on the losing side of that battle more than > once. However, like everything you don't pay for, you can't require > someone else to do this for you. It wouldn't hurt to see if Jeff > Layton, who wrote the errseq patches, would be interested to work on > something like this. I don't think this is that PG specific, as explained above. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 2:17 ` Andres Freund @ 2018-04-12 3:02 ` Matthew Wilcox 2018-04-12 11:09 ` Jeff Layton 2018-04-12 5:34 ` Theodore Y. Ts'o 2018-04-18 18:09 ` J. Bruce Fields 2 siblings, 1 reply; 57+ messages in thread From: Matthew Wilcox @ 2018-04-12 3:02 UTC (permalink / raw) To: Andres Freund Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote: > > > While there's some differing opinions on the referenced postgres thread, > > > the fundamental problem isn't so much that a retry won't fix the > > > problem, it's that we might NEVER see the failure. If writeback happens > > > in the background, encounters an error, undirties the buffer, we will > > > happily carry on because we've never seen that. That's when we're > > > majorly screwed. > > > > > > I think there are two issues here - "fsync() on an fd that was just opened" > > and "persistent error state (without keeping dirty pages in memory)". > > > > If there is background data writeback *without an open file descriptor*, > > there is no mechanism for the kernel to return an error to any application > > which may exist, or may not ever come back. > > And that's *horrible*. If I cp a file, and writeback fails in the > background, and I then cat that file before restarting, I should be able > to see that that failed. Instead of returning something bogus. At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open("file")); would report the most recent error. That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's "good enough" to only support it for recent files. Jeff, what do you think? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 3:02 ` Matthew Wilcox @ 2018-04-12 11:09 ` Jeff Layton 2018-04-12 11:19 ` Matthew Wilcox ` (3 more replies) 0 siblings, 4 replies; 57+ messages in thread From: Jeff Layton @ 2018-04-12 11:09 UTC (permalink / raw) To: Matthew Wilcox, Andres Freund Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote: > On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote: > > > > While there's some differing opinions on the referenced postgres thread, > > > > the fundamental problem isn't so much that a retry won't fix the > > > > problem, it's that we might NEVER see the failure. If writeback happens > > > > in the background, encounters an error, undirties the buffer, we will > > > > happily carry on because we've never seen that. That's when we're > > > > majorly screwed. > > > > > > > > > I think there are two issues here - "fsync() on an fd that was just opened" > > > and "persistent error state (without keeping dirty pages in memory)". > > > > > > If there is background data writeback *without an open file descriptor*, > > > there is no mechanism for the kernel to return an error to any application > > > which may exist, or may not ever come back. > > > > And that's *horrible*. If I cp a file, and writeback fails in the > > background, and I then cat that file before restarting, I should be able > > to see that that failed. Instead of returning something bogus. What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read? > At the moment, when we open a file, we sample the current state of the > writeback error and only report new errors. We could set it to zero > instead, and report the most recent error as soon as anything happens > which would report an error. That way err = close(open("file")); would > report the most recent error. > > That's not going to be persistent across the data structure for that inode > being removed from memory; we'd need filesystem support for persisting > that. But maybe it's "good enough" to only support it for recent files. > > Jeff, what do you think? I hate it :). We could do that, but....yecchhhh. Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful. I think the crux of the matter here is not really about error reporting, per-se. I asked this at LSF last year, and got no real answer: When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do? One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event. Maybe that's ok in the face of a writeback error though? IDK. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 11:09 ` Jeff Layton @ 2018-04-12 11:19 ` Matthew Wilcox 2018-04-12 12:01 ` Dave Chinner ` (2 subsequent siblings) 3 siblings, 0 replies; 57+ messages in thread From: Matthew Wilcox @ 2018-04-12 11:19 UTC (permalink / raw) To: Jeff Layton Cc: Andres Freund, Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote: > On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote: > > At the moment, when we open a file, we sample the current state of the > > writeback error and only report new errors. We could set it to zero > > instead, and report the most recent error as soon as anything happens > > which would report an error. That way err = close(open("file")); would > > report the most recent error. > > > > That's not going to be persistent across the data structure for that inode > > being removed from memory; we'd need filesystem support for persisting > > that. But maybe it's "good enough" to only support it for recent files. > > > > Jeff, what do you think? > > I hate it :). We could do that, but....yecchhhh. > > Reporting errors only in the case where the inode happened to stick > around in the cache seems too unreliable for real-world usage, and might > be problematic for some use cases. I'm also not sure it would really be > helpful. Yeah, it's definitely half-arsed. We could make further changes to improve the situation, but they'd have wider impact. For example, we can tell if the error has been sampled by any existing fd, so we could bias our inode reaping to have inodes with unreported errors stick around in the cache for longer. > I think the crux of the matter here is not really about error reporting, > per-se. I asked this at LSF last year, and got no real answer: > > When there is a writeback error, what should be done with the dirty > page(s)? Right now, we usually just mark them clean and carry on. Is > that the right thing to do? I suspect it isn't. If there's a transient error then we should reattempt the write. OTOH if the error is permanent then reattempting the write isn't going to do any good and it's just going to cause the drive to go through the whole error handling dance again. And what do we do if we're low on memory and need these pages back to avoid going OOM? There's a lot of options here, all of them bad in one situation or another. > One possibility would be to invalidate the range that failed to be > written (or the whole file) and force the pages to be faulted in again > on the next access. It could be surprising for some applications to not > see the results of their writes on a subsequent read after such an > event. > > Maybe that's ok in the face of a writeback error though? IDK. I don't know either. It'd force the application to face up to the fact that the data is gone immediately rather than only finding it out after a reboot. Again though that might cause more problems than it solves. It's hard to know what the right thing to do is. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 11:09 ` Jeff Layton 2018-04-12 11:19 ` Matthew Wilcox @ 2018-04-12 12:01 ` Dave Chinner 2018-04-12 15:08 ` Jeff Layton 2018-04-12 15:16 ` Theodore Y. Ts'o 2018-04-12 20:24 ` Andres Freund 2018-04-21 18:14 ` Jan Kara 3 siblings, 2 replies; 57+ messages in thread From: Dave Chinner @ 2018-04-12 12:01 UTC (permalink / raw) To: Jeff Layton Cc: Matthew Wilcox, Andres Freund, Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote: > When there is a writeback error, what should be done with the dirty > page(s)? Right now, we usually just mark them clean and carry on. Is > that the right thing to do? There isn't a right thing. Whatever we do will be wrong for someone. > One possibility would be to invalidate the range that failed to be > written (or the whole file) and force the pages to be faulted in again > on the next access. It could be surprising for some applications to not > see the results of their writes on a subsequent read after such an > event. Not to mention a POSIX IO ordering violation. Seeing stale data after a "successful" write is simply not allowed. > Maybe that's ok in the face of a writeback error though? IDK. No matter what we do for async writeback error handling, it will be slightly different from filesystem to filesystem, not to mention OS to OS. The is no magic bullet here, so I'm not sure we should worry too much. There's direct IO for anyone who cares that need to know about the completion status of every single write IO.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 12:01 ` Dave Chinner @ 2018-04-12 15:08 ` Jeff Layton 2018-04-12 22:44 ` Dave Chinner 2018-04-12 15:16 ` Theodore Y. Ts'o 1 sibling, 1 reply; 57+ messages in thread From: Jeff Layton @ 2018-04-12 15:08 UTC (permalink / raw) To: Dave Chinner, lsf-pc Cc: Matthew Wilcox, Andres Freund, Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote: > On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote: > > When there is a writeback error, what should be done with the dirty > > page(s)? Right now, we usually just mark them clean and carry on. Is > > that the right thing to do? > > There isn't a right thing. Whatever we do will be wrong for someone. > > > One possibility would be to invalidate the range that failed to be > > written (or the whole file) and force the pages to be faulted in again > > on the next access. It could be surprising for some applications to not > > see the results of their writes on a subsequent read after such an > > event. > > Not to mention a POSIX IO ordering violation. Seeing stale data > after a "successful" write is simply not allowed. > I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back? Given that the pages are clean after these failures, we aren't doing this even today: Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. Poof -- your write just disappeared. That can even happen before you get the chance to call fsync, so even a write()+read()+fsync() is not guaranteed to be safe in this regard today, given sufficient memory pressure. I think the current situation is fine from a "let's not OOM at all costs" standpoint, but not so good for application predictability. We should really consider ways to do better here. > > Maybe that's ok in the face of a writeback error though? IDK. > > No matter what we do for async writeback error handling, it will be > slightly different from filesystem to filesystem, not to mention OS > to OS. The is no magic bullet here, so I'm not sure we should worry > too much. There's direct IO for anyone who cares that need to know > about the completion status of every single write IO.... I think we we have an opportunity here to come up with better defined and hopefully more useful behavior for buffered I/O in the face of writeback errors. The first step would be to hash out what we'd want it to look like. Maybe we need a plenary session at LSF/MM? -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 15:08 ` Jeff Layton @ 2018-04-12 22:44 ` Dave Chinner 2018-04-13 13:18 ` Jeff Layton 0 siblings, 1 reply; 57+ messages in thread From: Dave Chinner @ 2018-04-12 22:44 UTC (permalink / raw) To: Jeff Layton Cc: lsf-pc, Matthew Wilcox, Andres Freund, Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote: > On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote: > > On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote: > > > When there is a writeback error, what should be done with the dirty > > > page(s)? Right now, we usually just mark them clean and carry on. Is > > > that the right thing to do? > > > > There isn't a right thing. Whatever we do will be wrong for someone. > > > > > One possibility would be to invalidate the range that failed to be > > > written (or the whole file) and force the pages to be faulted in again > > > on the next access. It could be surprising for some applications to not > > > see the results of their writes on a subsequent read after such an > > > event. > > > > Not to mention a POSIX IO ordering violation. Seeing stale data > > after a "successful" write is simply not allowed. > > I'm not so sure here, given that we're dealing with an error condition. > Are we really obligated not to allow any changes to pages that we can't > write back? Posix says this about write(): After a write() to a regular file has successfully returned: Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified. IOWs, even if there is a later error, we told the user the write was successful, and so according to POSIX we are not allowed to wind back the data to what it was before the write() occurred. > Given that the pages are clean after these failures, we aren't doing > this even today: > > Suppose we're unable to do writes but can do reads vs. the backing > store. After a wb failure, the page has the dirty bit cleared. If it > gets kicked out of the cache before the read occurs, it'll have to be > faulted back in. Poof -- your write just disappeared. Yes - I was pointing out what the specification we supposedly conform to says about this behaviour, not that our current behaviour conforms to the spec. Indeed, have you even noticed xfs_aops_discard_page() and it's surrounding context on page writeback submission errors? To save you looking, XFS will trash the page contents completely on a filesystem level ->writepage error. It doesn't mark them "clean", doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written "sucessfully" to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements. This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to chose between. Hence if we chose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error..... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 22:44 ` Dave Chinner @ 2018-04-13 13:18 ` Jeff Layton 2018-04-13 13:25 ` Andres Freund 2018-04-13 14:02 ` Matthew Wilcox 0 siblings, 2 replies; 57+ messages in thread From: Jeff Layton @ 2018-04-13 13:18 UTC (permalink / raw) To: Dave Chinner Cc: lsf-pc, Matthew Wilcox, Andres Freund, Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote: > On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote: > > On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote: > > > On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote: > > > > When there is a writeback error, what should be done with the dirty > > > > page(s)? Right now, we usually just mark them clean and carry on. Is > > > > that the right thing to do? > > > > > > There isn't a right thing. Whatever we do will be wrong for someone. > > > > > > > One possibility would be to invalidate the range that failed to be > > > > written (or the whole file) and force the pages to be faulted in again > > > > on the next access. It could be surprising for some applications to not > > > > see the results of their writes on a subsequent read after such an > > > > event. > > > > > > Not to mention a POSIX IO ordering violation. Seeing stale data > > > after a "successful" write is simply not allowed. > > > > I'm not so sure here, given that we're dealing with an error condition. > > Are we really obligated not to allow any changes to pages that we can't > > write back? > > Posix says this about write(): > > After a write() to a regular file has successfully returned: > > Any successful read() from each byte position in the file that > was modified by that write shall return the data specified by > the write() for that position until such byte positions are > again modified. > > IOWs, even if there is a later error, we told the user the write was > successful, and so according to POSIX we are not allowed to wind > back the data to what it was before the write() occurred. > > > Given that the pages are clean after these failures, we aren't doing > > this even today: > > > > Suppose we're unable to do writes but can do reads vs. the backing > > store. After a wb failure, the page has the dirty bit cleared. If it > > gets kicked out of the cache before the read occurs, it'll have to be > > faulted back in. Poof -- your write just disappeared. > > Yes - I was pointing out what the specification we supposedly > conform to says about this behaviour, not that our current behaviour > conforms to the spec. Indeed, have you even noticed > xfs_aops_discard_page() and it's surrounding context on page > writeback submission errors? > > To save you looking, XFS will trash the page contents completely on > a filesystem level ->writepage error. It doesn't mark them "clean", > doesn't attempt to redirty and rewrite them - it clears the uptodate > state and may invalidate it completely. IOWs, the data written > "sucessfully" to the cached page is now gone. It will be re-read > from disk on the next read() call, in direct violation of the above > POSIX requirements. > > This is my point: we've done that in XFS knowing that we violate > POSIX specifications in this specific corner case - it's the lesser > of many evils we have to chose between. Hence if we chose to encode > that behaviour as the general writeback IO error handling algorithm, > then it needs to done with the knowledge it is a specification > violation. Not to mention be documented as a POSIX violation in the > various relevant man pages and that this is how all filesystems will > behave on async writeback error..... > Got it, thanks. Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems. So to summarize, at this point in the discussion, I think we want to consider doing the following: * better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out. * invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this. Did I miss anything? Would that be enough to help the Pg usecase? I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-13 13:18 ` Jeff Layton @ 2018-04-13 13:25 ` Andres Freund 2018-04-13 14:02 ` Matthew Wilcox 1 sibling, 0 replies; 57+ messages in thread From: Andres Freund @ 2018-04-13 13:25 UTC (permalink / raw) To: Jeff Layton Cc: Dave Chinner, lsf-pc, Matthew Wilcox, Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake Hi, On 2018-04-13 09:18:56 -0400, Jeff Layton wrote: > Yes, I think we ought to probably do the same thing globally. It's nice > to know that xfs has already been doing this. That makes me feel better > about making this behavior the gold standard for Linux filesystems. > > So to summarize, at this point in the discussion, I think we want to > consider doing the following: > > * better reporting from syncfs (report an error when even one inode > failed to be written back since last syncfs call). We'll probably > implement this via a per-sb errseq_t in some fashion, though there are > some implementation issues to work out. > > * invalidate or clear uptodate flag on pages that experience writeback > errors, across filesystems. Encourage this as standard behavior for > filesystems and maybe add helpers to make it easier to do this. > > Did I miss anything? Would that be enough to help the Pg usecase? > > I don't see us ever being able to reasonably support its current > expectation that writeback errors will be seen on fd's that were opened > after the error occurred. That's a really thorny problem from an object > lifetime perspective. It's not perfect, but I think the amount of hacky OS specific code should be acceptable. And it does allow for a wrapper tool that can be used around backup restores etc to syncfs all the necessary filesystems. Let me mull with others for a bit. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-13 13:18 ` Jeff Layton 2018-04-13 13:25 ` Andres Freund @ 2018-04-13 14:02 ` Matthew Wilcox 2018-04-14 1:47 ` Dave Chinner 1 sibling, 1 reply; 57+ messages in thread From: Matthew Wilcox @ 2018-04-13 14:02 UTC (permalink / raw) To: Jeff Layton Cc: Dave Chinner, lsf-pc, Andres Freund, Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote: > On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote: > > To save you looking, XFS will trash the page contents completely on > > a filesystem level ->writepage error. It doesn't mark them "clean", > > doesn't attempt to redirty and rewrite them - it clears the uptodate > > state and may invalidate it completely. IOWs, the data written > > "sucessfully" to the cached page is now gone. It will be re-read > > from disk on the next read() call, in direct violation of the above > > POSIX requirements. > > > > This is my point: we've done that in XFS knowing that we violate > > POSIX specifications in this specific corner case - it's the lesser > > of many evils we have to chose between. Hence if we chose to encode > > that behaviour as the general writeback IO error handling algorithm, > > then it needs to done with the knowledge it is a specification > > violation. Not to mention be documented as a POSIX violation in the > > various relevant man pages and that this is how all filesystems will > > behave on async writeback error..... > > > > Got it, thanks. > > Yes, I think we ought to probably do the same thing globally. It's nice > to know that xfs has already been doing this. That makes me feel better > about making this behavior the gold standard for Linux filesystems. > > So to summarize, at this point in the discussion, I think we want to > consider doing the following: > > * better reporting from syncfs (report an error when even one inode > failed to be written back since last syncfs call). We'll probably > implement this via a per-sb errseq_t in some fashion, though there are > some implementation issues to work out. > > * invalidate or clear uptodate flag on pages that experience writeback > errors, across filesystems. Encourage this as standard behavior for > filesystems and maybe add helpers to make it easier to do this. > > Did I miss anything? Would that be enough to help the Pg usecase? > > I don't see us ever being able to reasonably support its current > expectation that writeback errors will be seen on fd's that were opened > after the error occurred. That's a really thorny problem from an object > lifetime perspective. I think we can do better than XFS is currently doing (but I agree that we should have the same behaviour across all Linux filesystems!) 1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty. 2. Background writebacks should skip pages which are PageError. 3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate. I think kupdate writes are the same as for_background writes. for_reclaim is tougher. I don't want to see us getting into OOM because we're hanging onto stale data, but we don't necessarily have an open fd to report the error on. I think I'm leaning towards behaving the same for for_reclaim as for_sync, but this is probably a subject on which reasonable people can disagree. And this logic all needs to be on one place, although invoked from each filesystem. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-13 14:02 ` Matthew Wilcox @ 2018-04-14 1:47 ` Dave Chinner 2018-04-14 2:04 ` Andres Freund 2018-04-14 2:38 ` Matthew Wilcox 0 siblings, 2 replies; 57+ messages in thread From: Dave Chinner @ 2018-04-14 1:47 UTC (permalink / raw) To: Matthew Wilcox Cc: Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote: > On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote: > > On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote: > > > To save you looking, XFS will trash the page contents completely on > > > a filesystem level ->writepage error. It doesn't mark them "clean", > > > doesn't attempt to redirty and rewrite them - it clears the uptodate > > > state and may invalidate it completely. IOWs, the data written > > > "sucessfully" to the cached page is now gone. It will be re-read > > > from disk on the next read() call, in direct violation of the above > > > POSIX requirements. > > > > > > This is my point: we've done that in XFS knowing that we violate > > > POSIX specifications in this specific corner case - it's the lesser > > > of many evils we have to chose between. Hence if we chose to encode > > > that behaviour as the general writeback IO error handling algorithm, > > > then it needs to done with the knowledge it is a specification > > > violation. Not to mention be documented as a POSIX violation in the > > > various relevant man pages and that this is how all filesystems will > > > behave on async writeback error..... > > > > > > > Got it, thanks. > > > > Yes, I think we ought to probably do the same thing globally. It's nice > > to know that xfs has already been doing this. That makes me feel better > > about making this behavior the gold standard for Linux filesystems. > > > > So to summarize, at this point in the discussion, I think we want to > > consider doing the following: > > > > * better reporting from syncfs (report an error when even one inode > > failed to be written back since last syncfs call). We'll probably > > implement this via a per-sb errseq_t in some fashion, though there are > > some implementation issues to work out. > > > > * invalidate or clear uptodate flag on pages that experience writeback > > errors, across filesystems. Encourage this as standard behavior for > > filesystems and maybe add helpers to make it easier to do this. > > > > Did I miss anything? Would that be enough to help the Pg usecase? > > > > I don't see us ever being able to reasonably support its current > > expectation that writeback errors will be seen on fd's that were opened > > after the error occurred. That's a really thorny problem from an object > > lifetime perspective. > > I think we can do better than XFS is currently doing (but I agree that > we should have the same behaviour across all Linux filesystems!) > > 1. If we get an error while wbc->for_background is true, we should not clear > uptodate on the page, rather SetPageError and SetPageDirty. So you're saying we should treat it as a transient error rather than a permanent error. > 2. Background writebacks should skip pages which are PageError. That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems? e.g. XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine. This is a recipe for "I lost data that I wrote /days/ before the system crashed" bug reports. > 3. for_sync writebacks should attempt one last write. Maybe it'll > succeed this time. If it does, just ClearPageError. If not, we have > somebody to report this writeback error to, and ClearPageUptodate. Which may well be unmount. Are we really going to wait until unmount to report fatal errors? We used to do this with XFS metadata. We'd just keep trying to write metadata and keep the filesystem running (because it's consistent in memory and it might be a transient error) rather than shutting down the filesystem after a couple of retries. the result was that users wouldn't notice there were problems until unmount, and the most common sympton of that was "why is system shutdown hanging?". We now don't hang at unmount by default: $ cat /sys/fs/xfs/dm-0/error/fail_at_unmount 1 $ And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/... We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of "one size doesn't fit all" and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....) > And this logic all needs to be on one place, although invoked from > each filesystem. Perhaps so, but as there's no "one-size-fits-all" behaviour, I really want to extend the XFS error config infrastructure to control what the filesystem does on error here. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-14 1:47 ` Dave Chinner @ 2018-04-14 2:04 ` Andres Freund 2018-04-18 23:59 ` Dave Chinner 2018-04-14 2:38 ` Matthew Wilcox 1 sibling, 1 reply; 57+ messages in thread From: Andres Freund @ 2018-04-14 2:04 UTC (permalink / raw) To: Dave Chinner Cc: Matthew Wilcox, Jeff Layton, lsf-pc, Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake Hi, On 2018-04-14 11:47:52 +1000, Dave Chinner wrote: > And we treat different errors according to their seriousness. EIO > and device ENOSPC we default to retry forever because they are often > transient, but for ENODEV we fail and shutdown immediately (someone > pulled the USB stick out). metadata failure behaviour is configured > via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/... > > We've planned to extend this failure configuration to data IO, too, > but never quite got around to it yet. this is a clear example of > "one size doesn't fit all" and I think we'll end up doing the same > sort of error behaviour configuration in XFS for these cases. > (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....) Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter? Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-14 2:04 ` Andres Freund @ 2018-04-18 23:59 ` Dave Chinner 2018-04-19 0:23 ` Eric Sandeen 0 siblings, 1 reply; 57+ messages in thread From: Dave Chinner @ 2018-04-18 23:59 UTC (permalink / raw) To: Andres Freund Cc: Matthew Wilcox, Jeff Layton, lsf-pc, Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote: > Hi, > > On 2018-04-14 11:47:52 +1000, Dave Chinner wrote: > > And we treat different errors according to their seriousness. EIO > > and device ENOSPC we default to retry forever because they are often > > transient, but for ENODEV we fail and shutdown immediately (someone > > pulled the USB stick out). metadata failure behaviour is configured > > via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/... > > > > We've planned to extend this failure configuration to data IO, too, > > but never quite got around to it yet. this is a clear example of > > "one size doesn't fit all" and I think we'll end up doing the same > > sort of error behaviour configuration in XFS for these cases. > > (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....) > > Have you considered adding an ext/fat/jfs > errors=remount-ro/panic/continue style mount parameter? That's for metadata writeback error behaviour, not data writeback IO errors. We are definitely not planning to add mount options to configure IO error behaviors. Mount options are a horrible way to configure filesystem behaviour and we've already got other, fine-grained configuration infrastructure for configuring IO error behaviour. Which, as I just pointed out, was designed to be be extended to data writeback and other operational error handling in the filesystem (e.g. dealing with ENOMEM in different ways). Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-18 23:59 ` Dave Chinner @ 2018-04-19 0:23 ` Eric Sandeen 0 siblings, 0 replies; 57+ messages in thread From: Eric Sandeen @ 2018-04-19 0:23 UTC (permalink / raw) To: Dave Chinner, Andres Freund Cc: Matthew Wilcox, Jeff Layton, lsf-pc, Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On 4/18/18 6:59 PM, Dave Chinner wrote: > On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote: >> Hi, >> >> On 2018-04-14 11:47:52 +1000, Dave Chinner wrote: >>> And we treat different errors according to their seriousness. EIO >>> and device ENOSPC we default to retry forever because they are often >>> transient, but for ENODEV we fail and shutdown immediately (someone >>> pulled the USB stick out). metadata failure behaviour is configured >>> via changing fields in /sys/fs/xfs/<dev>/error/metadata/<errno>/... >>> >>> We've planned to extend this failure configuration to data IO, too, >>> but never quite got around to it yet. this is a clear example of >>> "one size doesn't fit all" and I think we'll end up doing the same >>> sort of error behaviour configuration in XFS for these cases. >>> (i.e. /sys/fs/xfs/<dev>/error/writeback/<errno>/....) >> >> Have you considered adding an ext/fat/jfs >> errors=remount-ro/panic/continue style mount parameter? > > That's for metadata writeback error behaviour, not data writeback > IO errors. /me points casually at data_err=abort & data_err=ignore in ext4... data_err=ignore Just print an error message if an error occurs in a file data buffer in ordered mode. data_err=abort Abort the journal if an error occurs in a file data buffer in ordered mode. Just sayin' > We are definitely not planning to add mount options to configure IO > error behaviors. Mount options are a horrible way to configure > filesystem behaviour and we've already got other, fine-grained > configuration infrastructure for configuring IO error behaviour. > Which, as I just pointed out, was designed to be be extended to data > writeback and other operational error handling in the filesystem > (e.g. dealing with ENOMEM in different ways). I don't disagree, but there are already mount-option knobs in ext4, FWIW. -Eric > Cheers, > > Dave. > ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-14 1:47 ` Dave Chinner 2018-04-14 2:04 ` Andres Freund @ 2018-04-14 2:38 ` Matthew Wilcox 2018-04-19 0:13 ` Dave Chinner 1 sibling, 1 reply; 57+ messages in thread From: Matthew Wilcox @ 2018-04-14 2:38 UTC (permalink / raw) To: Dave Chinner Cc: Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote: > On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote: > > 1. If we get an error while wbc->for_background is true, we should not clear > > uptodate on the page, rather SetPageError and SetPageDirty. > > So you're saying we should treat it as a transient error rather than > a permanent error. Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else. > > 2. Background writebacks should skip pages which are PageError. > > That seems decidedly dodgy in the case where there is a transient > error - it requires a user to specifically run sync to get the data > to disk after the transient error has occurred. Say they don't > notice the problem because it's fleeting and doesn't cause any > obvious problems? That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to). > e.g. XFS gets to enospc, runs out of reserve pool blocks so can't > allocate space to write back the page, then space is freed up a few > seconds later and so the next write will work just fine. > > This is a recipe for "I lost data that I wrote /days/ before the > system crashed" bug reports. So ... exponential backoff on retries? > > 3. for_sync writebacks should attempt one last write. Maybe it'll > > succeed this time. If it does, just ClearPageError. If not, we have > > somebody to report this writeback error to, and ClearPageUptodate. > > Which may well be unmount. Are we really going to wait until unmount > to report fatal errors? Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-14 2:38 ` Matthew Wilcox @ 2018-04-19 0:13 ` Dave Chinner 2018-04-19 0:40 ` Matthew Wilcox 0 siblings, 1 reply; 57+ messages in thread From: Dave Chinner @ 2018-04-19 0:13 UTC (permalink / raw) To: Matthew Wilcox Cc: Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote: > On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote: > > On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote: > > > 1. If we get an error while wbc->for_background is true, we should not clear > > > uptodate on the page, rather SetPageError and SetPageDirty. > > > > So you're saying we should treat it as a transient error rather than > > a permanent error. > > Yes, I'm proposing leaving the data in memory in case the user wants to > try writing it somewhere else. And if it's getting IO errors because of USB stick pull? What then? > > > 2. Background writebacks should skip pages which are PageError. > > > > That seems decidedly dodgy in the case where there is a transient > > error - it requires a user to specifically run sync to get the data > > to disk after the transient error has occurred. Say they don't > > notice the problem because it's fleeting and doesn't cause any > > obvious problems? > > That's fair. What I want to avoid is triggering the same error every > 30 seconds (or whatever the periodic writeback threshold is set to). So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing? > > e.g. XFS gets to enospc, runs out of reserve pool blocks so can't > > allocate space to write back the page, then space is freed up a few > > seconds later and so the next write will work just fine. > > > > This is a recipe for "I lost data that I wrote /days/ before the > > system crashed" bug reports. > > So ... exponential backoff on retries? Maybe, but I don't think that actually helps anything and adds yet more "when should we write this" complication to inode writeback.... > > > 3. for_sync writebacks should attempt one last write. Maybe it'll > > > succeed this time. If it does, just ClearPageError. If not, we have > > > somebody to report this writeback error to, and ClearPageUptodate. > > > > Which may well be unmount. Are we really going to wait until unmount > > to report fatal errors? > > Goodness, no. The errors would be immediately reportable using the wb_err > mechanism, as soon as the first error was encountered. But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs().... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-19 0:13 ` Dave Chinner @ 2018-04-19 0:40 ` Matthew Wilcox 2018-04-19 1:08 ` Theodore Y. Ts'o 2018-04-19 23:28 ` Dave Chinner 0 siblings, 2 replies; 57+ messages in thread From: Matthew Wilcox @ 2018-04-19 0:40 UTC (permalink / raw) To: Dave Chinner Cc: Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote: > On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote: > > On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote: > > > On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote: > > > > 1. If we get an error while wbc->for_background is true, we should not clear > > > > uptodate on the page, rather SetPageError and SetPageDirty. > > > > > > So you're saying we should treat it as a transient error rather than > > > a permanent error. > > > > Yes, I'm proposing leaving the data in memory in case the user wants to > > try writing it somewhere else. > > And if it's getting IO errors because of USB stick pull? What > then? I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny. > > > > 2. Background writebacks should skip pages which are PageError. > > > > > > That seems decidedly dodgy in the case where there is a transient > > > error - it requires a user to specifically run sync to get the data > > > to disk after the transient error has occurred. Say they don't > > > notice the problem because it's fleeting and doesn't cause any > > > obvious problems? > > > > That's fair. What I want to avoid is triggering the same error every > > 30 seconds (or whatever the periodic writeback threshold is set to). > > So if kernel ring buffer overflows and so users miss the first error > report, they'll have no idea that the data writeback is still > failing? I wasn't thinking about kernel ringbuffer based reporting; I was thinking about errseq_t based reporting, so the application can tell the fsync failed and maybe does something application-level to recover like send the transactions across to another node in the cluster (or whatever this hypothetical application is). > > > > 3. for_sync writebacks should attempt one last write. Maybe it'll > > > > succeed this time. If it does, just ClearPageError. If not, we have > > > > somebody to report this writeback error to, and ClearPageUptodate. > > > > > > Which may well be unmount. Are we really going to wait until unmount > > > to report fatal errors? > > > > Goodness, no. The errors would be immediately reportable using the wb_err > > mechanism, as soon as the first error was encountered. > > But if there are no open files when the error occurs, that error > won't get reported to anyone. Which means the next time anyone > accesses that inode from a user context could very well be unmount > or a third party sync/syncfs().... Right. But then that's on the application. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-19 0:40 ` Matthew Wilcox @ 2018-04-19 1:08 ` Theodore Y. Ts'o 2018-04-19 17:40 ` Matthew Wilcox 2018-04-19 23:28 ` Dave Chinner 1 sibling, 1 reply; 57+ messages in thread From: Theodore Y. Ts'o @ 2018-04-19 1:08 UTC (permalink / raw) To: Matthew Wilcox Cc: Dave Chinner, Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote: > > I've been thinking about this. Ideally we want to pass some kind of > notification all the way up to the desktop and tell the user to plug the > damn stick back in. Then have the USB stick become the same blockdev > that it used to be, and complete the writeback. We are so far from > being able to do that right now that it's not even funny.o Maybe we shouldn't be trying to do any of this in the kernel, or at least as little as possible in the kernel? Perhaps it would be better to do most of this as a device mapper hack; I suspect we'll need userspace help to igure out whether the user has plugged the same USB stick in, or a different USB stick, anyway. - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-19 1:08 ` Theodore Y. Ts'o @ 2018-04-19 17:40 ` Matthew Wilcox 2018-04-19 23:27 ` Theodore Y. Ts'o 0 siblings, 1 reply; 57+ messages in thread From: Matthew Wilcox @ 2018-04-19 17:40 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Dave Chinner, Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Wed, Apr 18, 2018 at 09:08:19PM -0400, Theodore Y. Ts'o wrote: > On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote: > > > > I've been thinking about this. Ideally we want to pass some kind of > > notification all the way up to the desktop and tell the user to plug the > > damn stick back in. Then have the USB stick become the same blockdev > > that it used to be, and complete the writeback. We are so far from > > being able to do that right now that it's not even funny.o > > Maybe we shouldn't be trying to do any of this in the kernel, or at > least as little as possible in the kernel? Perhaps it would be better > to do most of this as a device mapper hack; I suspect we'll need > userspace help to igure out whether the user has plugged the same USB > stick in, or a different USB stick, anyway. The device mapper target (dm-removable?) was my first idea too, but I kept thinking through use cases and I think we end up wanting this functionality in the block layer. Let's try a story. Stephen the PFY goes into the data centre looking to hotswap a failed drive. Due to the eight pints of lager he had for lunch, he pulls out the root drive instead of the failed drive. The air raid siren warbles and he realises his mistake, shoving the drive back in. CYOA: Currently: All writes are lost, calamities ensue. The PFY is fired. With dm-removable: Nobody thought to set up dm-removable on the root drive. Calamities still ensue, but now it's the BOFH's fault instead of the PFY's fault. Built into the block layer: After a brief hiccup while we reattach the drive to its block_device, the writes resume and nobody loses their job. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-19 17:40 ` Matthew Wilcox @ 2018-04-19 23:27 ` Theodore Y. Ts'o 0 siblings, 0 replies; 57+ messages in thread From: Theodore Y. Ts'o @ 2018-04-19 23:27 UTC (permalink / raw) To: Matthew Wilcox Cc: Dave Chinner, Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, Apr 19, 2018 at 10:40:10AM -0700, Matthew Wilcox wrote: > With dm-removable: Nobody thought to set up dm-removable on the root > drive. Calamities still ensue, but now it's the BOFH's fault instead > of the PFY's fault. > > Built into the block layer: After a brief hiccup while we reattach the > drive to its block_device, the writes resume and nobody loses their job. What you're talking about is a deployment issue, though. Ultimately the distribution will set up dm-removable automatically if the user requests it, much like it sets up dm-crypt automatically for laptop users upon request. My concern is that not all removable devices have a globally unique id number available in hardware so the kernel can tell whether or not it's the same device that has been plugged in. There are hueristics you could use -- for example, you could look at the file system uuid plus the last fsck time. But they tend to be very file system specific, and not things we would want ot have in the kernel. - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-19 0:40 ` Matthew Wilcox 2018-04-19 1:08 ` Theodore Y. Ts'o @ 2018-04-19 23:28 ` Dave Chinner 1 sibling, 0 replies; 57+ messages in thread From: Dave Chinner @ 2018-04-19 23:28 UTC (permalink / raw) To: Matthew Wilcox Cc: Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote: > On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote: > > On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote: > > > On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote: > > > > On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote: > > > > > 1. If we get an error while wbc->for_background is true, we should not clear > > > > > uptodate on the page, rather SetPageError and SetPageDirty. > > > > > > > > So you're saying we should treat it as a transient error rather than > > > > a permanent error. > > > > > > Yes, I'm proposing leaving the data in memory in case the user wants to > > > try writing it somewhere else. > > > > And if it's getting IO errors because of USB stick pull? What > > then? > > I've been thinking about this. Ideally we want to pass some kind of > notification all the way up to the desktop and tell the user to plug the > damn stick back in. Then have the USB stick become the same blockdev > that it used to be, and complete the writeback. We are so far from > being able to do that right now that it's not even funny. *nod* But in the meantime, device unplug (should give ENODEV, not EIO) is a fatal error and we need to toss away the data. > > > > > 2. Background writebacks should skip pages which are PageError. > > > > > > > > That seems decidedly dodgy in the case where there is a transient > > > > error - it requires a user to specifically run sync to get the data > > > > to disk after the transient error has occurred. Say they don't > > > > notice the problem because it's fleeting and doesn't cause any > > > > obvious problems? > > > > > > That's fair. What I want to avoid is triggering the same error every > > > 30 seconds (or whatever the periodic writeback threshold is set to). > > > > So if kernel ring buffer overflows and so users miss the first error > > report, they'll have no idea that the data writeback is still > > failing? > > I wasn't thinking about kernel ringbuffer based reporting; I was thinking > about errseq_t based reporting, so the application can tell the fsync > failed and maybe does something application-level to recover like send > the transactions across to another node in the cluster (or whatever this > hypothetical application is). But if it's still failing, then we should be still trying to report the error. i.e. if fsync fails and the page remains dirty, then the next attmept to write it is a new error and fsync should report that. IOWs, I think we should be returning errors at every occasion errors need to be reported if we have a persistent writeback failure... > > > > > 3. for_sync writebacks should attempt one last write. Maybe it'll > > > > > succeed this time. If it does, just ClearPageError. If not, we have > > > > > somebody to report this writeback error to, and ClearPageUptodate. > > > > > > > > Which may well be unmount. Are we really going to wait until unmount > > > > to report fatal errors? > > > > > > Goodness, no. The errors would be immediately reportable using the wb_err > > > mechanism, as soon as the first error was encountered. > > > > But if there are no open files when the error occurs, that error > > won't get reported to anyone. Which means the next time anyone > > accesses that inode from a user context could very well be unmount > > or a third party sync/syncfs().... > > Right. But then that's on the application. Which we know don't do the right thing. Seems like a lot of hoops to jump through given it still won't work if the appliction isn't changed to support linux specific error handling requirements... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 12:01 ` Dave Chinner 2018-04-12 15:08 ` Jeff Layton @ 2018-04-12 15:16 ` Theodore Y. Ts'o 2018-04-12 20:13 ` Andres Freund 1 sibling, 1 reply; 57+ messages in thread From: Theodore Y. Ts'o @ 2018-04-12 15:16 UTC (permalink / raw) To: Dave Chinner Cc: Jeff Layton, Matthew Wilcox, Andres Freund, Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, Apr 12, 2018 at 10:01:22PM +1000, Dave Chinner wrote: > On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote: > > When there is a writeback error, what should be done with the dirty > > page(s)? Right now, we usually just mark them clean and carry on. Is > > that the right thing to do? > > There isn't a right thing. Whatever we do will be wrong for someone. That's the problem. The best that could be done (and it's not enough) would be to have a mode which does with the PG folks want (or what they *think* they want). It seems what they want is to have an error result in the page being marked clean. When they discover the outcome (OOM-city and the unability to unmount a file system on a failed drive), then they will complain to us *again*, at which point we can tell them that want they really want is another variation on O_PONIES, and welcome to the real world and real life. Which is why, even if they were to pay someone to implement what they want, I'm not sure we would want to accept it upstream --- or distro's might consider it a support nightmare, and refuse to allow that mode to be enabled on enterprise distro's. But at least, it will have been some PG-based company who will have implemented it, so they're not wasting other people's time or other people's resources... We could try to get something like what Google is doing upstream, which is to have the I/O errors sent to userspace via a netlink channel (without changing anything else about how buffered writeback is handled in the face of errors). Then userspace applications could switch to Direct I/O like all of the other really serious userspace storage solutions I'm aware of, and then someone could try to write some kind of HDD health monitoring system that tries to do the right thing when a disk is discovered to have developed some media errors or something more serious (e.g., a head failure). That plus some kind of RAID solution is I think the only thing which is really realistic for a typical PG site. It's certainly that's what *I* would do if I didn't decide to use a hosted cloud solution, such as Cloud SQL for Postgres, and let someone else solve the really hard problems of dealing with real-world HDD failures. :-) - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 15:16 ` Theodore Y. Ts'o @ 2018-04-12 20:13 ` Andres Freund 2018-04-12 20:28 ` Matthew Wilcox 0 siblings, 1 reply; 57+ messages in thread From: Andres Freund @ 2018-04-12 20:13 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Dave Chinner, Jeff Layton, Matthew Wilcox, Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List, Linux FS Devel, Joshua D. Drake Hi, On 2018-04-12 11:16:46 -0400, Theodore Y. Ts'o wrote: > That's the problem. The best that could be done (and it's not enough) > would be to have a mode which does with the PG folks want (or what > they *think* they want). It seems what they want is to have an error > result in the page being marked clean. When they discover the outcome > (OOM-city and the unability to unmount a file system on a failed > drive), then they will complain to us *again*, at which point we can > tell them that want they really want is another variation on O_PONIES, > and welcome to the real world and real life. I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient. I don't see that that'd realistically would trigger OOM or the inability to unmount a filesystem. If the drive is entirely gone there's obviously no point in keeping per-file information around, so per-blockdev/fs information suffices entirely to return an error on fsync (which at least on ext4 appears to happen if the underlying blockdev is gone). Have fun making up things we want, but I'm not sure it's particularly productive. > Which is why, even if they were to pay someone to implement what they > want, I'm not sure we would want to accept it upstream --- or distro's > might consider it a support nightmare, and refuse to allow that mode > to be enabled on enterprise distro's. But at least, it will have been > some PG-based company who will have implemented it, so they're not > wasting other people's time or other people's resources... Well, that's why I'm discussing here so we can figure out what's acceptable before considering wasting money and revew cycles doing or paying somebody to do some crazy useless shit. > We could try to get something like what Google is doing upstream, > which is to have the I/O errors sent to userspace via a netlink > channel (without changing anything else about how buffered writeback > is handled in the face of errors). Ah, darn. After you'd mentioned that in an earlier mail I'd hoped that'd be upstream. And yes, that'd be perfect. > Then userspace applications could switch to Direct I/O like all of the > other really serious userspace storage solutions I'm aware of, and > then someone could try to write some kind of HDD health monitoring > system that tries to do the right thing when a disk is discovered to > have developed some media errors or something more serious (e.g., a > head failure). That plus some kind of RAID solution is I think the > only thing which is really realistic for a typical PG site. As I said earlier, I think there's good reason to move to DIO for postgres. But to keep that performant is going to need some serious work. But afaict such a solution wouldn't really depend on applications using DIO or not. Before finishing a checkpoint (logging it persistently and allowing to throw older data away), we could check if any errors have been reported and give up if there have been any. And after starting postgres on a directory restored from backup using $tool, we can fsync the directory recursively, check for such errors, and give up if there've been any. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 20:13 ` Andres Freund @ 2018-04-12 20:28 ` Matthew Wilcox 2018-04-12 21:14 ` Jeff Layton 2018-04-12 21:21 ` Theodore Y. Ts'o 0 siblings, 2 replies; 57+ messages in thread From: Matthew Wilcox @ 2018-04-12 20:28 UTC (permalink / raw) To: Andres Freund Cc: Theodore Y. Ts'o, Dave Chinner, Jeff Layton, Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote: > I think a per-file or even per-blockdev/fs error state that'd be > returned by fsync() would be more than sufficient. Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd. > I don't see that > that'd realistically would trigger OOM or the inability to unmount a > filesystem. Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 20:28 ` Matthew Wilcox @ 2018-04-12 21:14 ` Jeff Layton 2018-04-12 21:31 ` Matthew Wilcox 2018-04-12 21:21 ` Theodore Y. Ts'o 1 sibling, 1 reply; 57+ messages in thread From: Jeff Layton @ 2018-04-12 21:14 UTC (permalink / raw) To: Matthew Wilcox, Andres Freund Cc: Theodore Y. Ts'o, Dave Chinner, Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote: > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote: > > I think a per-file or even per-blockdev/fs error state that'd be > > returned by fsync() would be more than sufficient. > > Ah; this was my suggestion to Jeff on IRC. That we add a per- > superblock > wb_err and then allow syncfs() to return it. So you'd open an fd on > a directory (for example), and call syncfs() which would return -EIO > or -ENOSPC if either of those conditions had occurred since you > opened > the fd. Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping. We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file->f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor? > > I don't see that > > that'd realistically would trigger OOM or the inability to unmount > > a > > filesystem. > > Ted's referring to the current state of affairs where the writeback > error > is held in the inode; if we can't evict the inode because it's > holding > the error indicator, that can send us OOM. If instead we transfer > the > error indicator to the superblock, then there's no problem. > -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 21:14 ` Jeff Layton @ 2018-04-12 21:31 ` Matthew Wilcox 2018-04-13 12:56 ` Jeff Layton 0 siblings, 1 reply; 57+ messages in thread From: Matthew Wilcox @ 2018-04-12 21:31 UTC (permalink / raw) To: Jeff Layton Cc: Andres Freund, Theodore Y. Ts'o, Dave Chinner, Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote: > On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote: > > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote: > > > I think a per-file or even per-blockdev/fs error state that'd be > > > returned by fsync() would be more than sufficient. > > > > Ah; this was my suggestion to Jeff on IRC. That we add a per- > > superblock > > wb_err and then allow syncfs() to return it. So you'd open an fd on > > a directory (for example), and call syncfs() which would return -EIO > > or -ENOSPC if either of those conditions had occurred since you > > opened > > the fd. > > Not a bad idea and shouldn't be too costly. mapping_set_error could > flag the superblock one before or after the one in the mapping. > > We'd need to define what happens if you interleave fsync and syncfs > calls on the same inode though. How do we handle file->f_wb_err in that > case? Would we need a second field in struct file to act as the per-sb > error cursor? Ooh. I hadn't thought that through. Bleh. I don't want to add a field to struct file for this uncommon case. Maybe O_PATH could be used for this? It gets you a file descriptor on a particular filesystem, so syncfs() is defined, but it can't report a writeback error. So if you open something O_PATH, you can use the file's f_wb_err for the mapping's error cursor. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 21:31 ` Matthew Wilcox @ 2018-04-13 12:56 ` Jeff Layton 0 siblings, 0 replies; 57+ messages in thread From: Jeff Layton @ 2018-04-13 12:56 UTC (permalink / raw) To: Matthew Wilcox Cc: Andres Freund, Theodore Y. Ts'o, Dave Chinner, Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, 2018-04-12 at 14:31 -0700, Matthew Wilcox wrote: > On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote: > > On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote: > > > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote: > > > > I think a per-file or even per-blockdev/fs error state that'd be > > > > returned by fsync() would be more than sufficient. > > > > > > Ah; this was my suggestion to Jeff on IRC. That we add a per- > > > superblock > > > wb_err and then allow syncfs() to return it. So you'd open an fd on > > > a directory (for example), and call syncfs() which would return -EIO > > > or -ENOSPC if either of those conditions had occurred since you > > > opened > > > the fd. > > > > Not a bad idea and shouldn't be too costly. mapping_set_error could > > flag the superblock one before or after the one in the mapping. > > > > We'd need to define what happens if you interleave fsync and syncfs > > calls on the same inode though. How do we handle file->f_wb_err in that > > case? Would we need a second field in struct file to act as the per-sb > > error cursor? > > Ooh. I hadn't thought that through. Bleh. I don't want to add a field > to struct file for this uncommon case. > > Maybe O_PATH could be used for this? It gets you a file descriptor on > a particular filesystem, so syncfs() is defined, but it can't report > a writeback error. So if you open something O_PATH, you can use the > file's f_wb_err for the mapping's error cursor. > That might work. It'd be a syscall behavioral change so we'd need to document that well. It's probably innocuous though -- I doubt we have a lot of callers in the field opening files with O_PATH and calling syncfs on them. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 20:28 ` Matthew Wilcox 2018-04-12 21:14 ` Jeff Layton @ 2018-04-12 21:21 ` Theodore Y. Ts'o 2018-04-12 21:24 ` Matthew Wilcox 2018-04-12 21:37 ` Andres Freund 1 sibling, 2 replies; 57+ messages in thread From: Theodore Y. Ts'o @ 2018-04-12 21:21 UTC (permalink / raw) To: Matthew Wilcox Cc: Andres Freund, Dave Chinner, Jeff Layton, Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote: > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote: > > I think a per-file or even per-blockdev/fs error state that'd be > > returned by fsync() would be more than sufficient. > > Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock > wb_err and then allow syncfs() to return it. So you'd open an fd on > a directory (for example), and call syncfs() which would return -EIO > or -ENOSPC if either of those conditions had occurred since you opened > the fd. When or how would the per-superblock wb_err flag get cleared? Would all subsequent fsync() calls on that file system now return EIO? Or would only all subsequent syncfs() calls return EIO? > > I don't see that > > that'd realistically would trigger OOM or the inability to unmount a > > filesystem. > > Ted's referring to the current state of affairs where the writeback error > is held in the inode; if we can't evict the inode because it's holding > the error indicator, that can send us OOM. If instead we transfer the > error indicator to the superblock, then there's no problem. Actually, I was referring to the pg-hackers original ask, which was that after an error, all of the dirty pages that couldn't be written out would stay dirty. If it's only as single inode which is pinned in memory with the dirty flag, that's bad, but it's not as bad as pinning all of the memory pages for which there was a failed write. We would still need to invent some mechanism or define some semantic when it would be OK to clear the per-inode flag and let the memory associated with that pinned inode get released, though. - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 21:21 ` Theodore Y. Ts'o @ 2018-04-12 21:24 ` Matthew Wilcox 2018-04-12 21:37 ` Andres Freund 1 sibling, 0 replies; 57+ messages in thread From: Matthew Wilcox @ 2018-04-12 21:24 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Andres Freund, Dave Chinner, Jeff Layton, Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, Apr 12, 2018 at 05:21:44PM -0400, Theodore Y. Ts'o wrote: > On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote: > > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote: > > > I think a per-file or even per-blockdev/fs error state that'd be > > > returned by fsync() would be more than sufficient. > > > > Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock > > wb_err and then allow syncfs() to return it. So you'd open an fd on > > a directory (for example), and call syncfs() which would return -EIO > > or -ENOSPC if either of those conditions had occurred since you opened > > the fd. > > When or how would the per-superblock wb_err flag get cleared? That's not how errseq works, Ted ;-) > Would all subsequent fsync() calls on that file system now return EIO? > Or would only all subsequent syncfs() calls return EIO? Only ones which occur after the last sampling get reported through this particular file descriptor. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 21:21 ` Theodore Y. Ts'o 2018-04-12 21:24 ` Matthew Wilcox @ 2018-04-12 21:37 ` Andres Freund 1 sibling, 0 replies; 57+ messages in thread From: Andres Freund @ 2018-04-12 21:37 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Matthew Wilcox, Dave Chinner, Jeff Layton, Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List, Linux FS Devel, Joshua D. Drake Hi, On 2018-04-12 17:21:44 -0400, Theodore Y. Ts'o wrote: > On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote: > > On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote: > > > I think a per-file or even per-blockdev/fs error state that'd be > > > returned by fsync() would be more than sufficient. > > > > Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock > > wb_err and then allow syncfs() to return it. So you'd open an fd on > > a directory (for example), and call syncfs() which would return -EIO > > or -ENOSPC if either of those conditions had occurred since you opened > > the fd. > > When or how would the per-superblock wb_err flag get cleared? I don't think unmount + resettable via /sys would be an insane approach. Requiring explicit action to acknowledge data loss isn't a crazy concept. But I think that's something reasonable minds could disagree with. > Would all subsequent fsync() calls on that file system now return EIO? > Or would only all subsequent syncfs() calls return EIO? If it were tied to syncfs, I wonder if there's a way to have some errseq type logic. Store a per superblock (or whatever equivalent thing) errseq value of errors. For each fd calling syncfs() report the error once, but then store the current value in a separate per-fd field. And if that's considered too weird, only report the errors to fds that have been opened from before the error occurred. I can see writing a tool 'pg_run_and_sync /directo /ries -- command' which opens an fd for each of the filesystems the directories reside on, and calls syncfs() after. That'd allow to use backup/restore tools at least semi safely. > > > I don't see that > > > that'd realistically would trigger OOM or the inability to unmount a > > > filesystem. > > > > Ted's referring to the current state of affairs where the writeback error > > is held in the inode; if we can't evict the inode because it's holding > > the error indicator, that can send us OOM. If instead we transfer the > > error indicator to the superblock, then there's no problem. > > Actually, I was referring to the pg-hackers original ask, which was > that after an error, all of the dirty pages that couldn't be written > out would stay dirty. Well, it's an open list, everyone can argue. And initially people at first didn't know the OOM explanation, and then it takes some time to revise ones priors :). I think it's a design question that reasonable people can disagree upon (if "hot" removed devices are handled by throwing data away regardless, at least). But as it's clearly not something viable, we can move on to something that can solve the problem. > If it's only as single inode which is pinned in memory with the dirty > flag, that's bad, but it's not as bad as pinning all of the memory > pages for which there was a failed write. We would still need to > invent some mechanism or define some semantic when it would be OK to > clear the per-inode flag and let the memory associated with that > pinned inode get released, though. Yea, I agree that that's not obvious. One way would be to say that it's only automatically cleared when you unlink the file. A bit heavyhanded, but not too crazy. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 11:09 ` Jeff Layton 2018-04-12 11:19 ` Matthew Wilcox 2018-04-12 12:01 ` Dave Chinner @ 2018-04-12 20:24 ` Andres Freund 2018-04-12 21:27 ` Jeff Layton 2018-04-21 18:14 ` Jan Kara 3 siblings, 1 reply; 57+ messages in thread From: Andres Freund @ 2018-04-12 20:24 UTC (permalink / raw) To: Jeff Layton Cc: Matthew Wilcox, Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On 2018-04-12 07:09:14 -0400, Jeff Layton wrote: > On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote: > > On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote: > > > > > While there's some differing opinions on the referenced postgres thread, > > > > > the fundamental problem isn't so much that a retry won't fix the > > > > > problem, it's that we might NEVER see the failure. If writeback happens > > > > > in the background, encounters an error, undirties the buffer, we will > > > > > happily carry on because we've never seen that. That's when we're > > > > > majorly screwed. > > > > > > > > > > > > I think there are two issues here - "fsync() on an fd that was just opened" > > > > and "persistent error state (without keeping dirty pages in memory)". > > > > > > > > If there is background data writeback *without an open file descriptor*, > > > > there is no mechanism for the kernel to return an error to any application > > > > which may exist, or may not ever come back. > > > > > > And that's *horrible*. If I cp a file, and writeback fails in the > > > background, and I then cat that file before restarting, I should be able > > > to see that that failed. Instead of returning something bogus. > > What are you expecting to happen in this case? Are you expecting a read > error due to a writeback failure? Or are you just saying that we should > be invalidating pages that failed to be written back, so that they can > be re-read? Yes, I'd hope for a read error after a writeback failure. I think that's sane behaviour. But I don't really care *that* much. At the very least *some* way to *know* that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background. If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing. Even if it'd were just a perf-fs /sys/$something file that'd return the current count of unreported errors in a filesystem independent way, it'd be better than what we have right now. 1) figure out /sys/$whatnot $directory belongs to 2) oldcount=$(cat /sys/$whatnot/unreported_errors) 3) filesystem operations in $directory 4) sync;sync; 5) newcount=$(cat /sys/$whatnot/unreported_errors) 6) test "$oldcount" -eq "$newcount" || die-with-horrible-message Isn't beautiful to script, but it's also not absolutely terrible. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 20:24 ` Andres Freund @ 2018-04-12 21:27 ` Jeff Layton 2018-04-12 21:53 ` Andres Freund 0 siblings, 1 reply; 57+ messages in thread From: Jeff Layton @ 2018-04-12 21:27 UTC (permalink / raw) To: Andres Freund Cc: Matthew Wilcox, Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote: > On 2018-04-12 07:09:14 -0400, Jeff Layton wrote: > > On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote: > > > On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote: > > > > > > While there's some differing opinions on the referenced postgres thread, > > > > > > the fundamental problem isn't so much that a retry won't fix the > > > > > > problem, it's that we might NEVER see the failure. If writeback happens > > > > > > in the background, encounters an error, undirties the buffer, we will > > > > > > happily carry on because we've never seen that. That's when we're > > > > > > majorly screwed. > > > > > > > > > > > > > > > I think there are two issues here - "fsync() on an fd that was just opened" > > > > > and "persistent error state (without keeping dirty pages in memory)". > > > > > > > > > > If there is background data writeback *without an open file descriptor*, > > > > > there is no mechanism for the kernel to return an error to any application > > > > > which may exist, or may not ever come back. > > > > > > > > And that's *horrible*. If I cp a file, and writeback fails in the > > > > background, and I then cat that file before restarting, I should be able > > > > to see that that failed. Instead of returning something bogus. > > > > What are you expecting to happen in this case? Are you expecting a read > > error due to a writeback failure? Or are you just saying that we should > > be invalidating pages that failed to be written back, so that they can > > be re-read? > > Yes, I'd hope for a read error after a writeback failure. I think that's > sane behaviour. But I don't really care *that* much. > I'll have to respectfully disagree. Why should I interpret an error on a read() syscall to mean that writeback failed? Note that the data is still potentially intact. What _might_ make sense, IMO, is to just invalidate the pages that failed to be written back. Then you could potentially do a read to fault them in again (i.e. sync the pagecache and the backing store) and possibly redirty them for another try. Note that you can detect this situation by checking the return code from fsync. It should report the latest error once per file description. > At the very least *some* way to *know* that such a failure occurred from > userland without having to parse the kernel log. As far as I understand, > neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report > an error if it was encountered by writeback in the background. > > If that's indeed true for syncfs(2), even if the fd has been opened > before (which I can see how it could happen from an implementation POV, > nothing would associate a random FD with failures on different files), > it's really impossible to detect this stuff from userland without text > parsing. > syncfs could use some work. I'm warming to willy's idea to add a per-sb errseq_t. I think that might be a simple way to get better semantics here. Not sure how we want to handle the reporting end yet though... We probably also need to consider how to better track metadata writeback errors (on e.g. ext2). We don't really do that properly at quite yet either. > Even if it'd were just a perf-fs /sys/$something file that'd return the > current count of unreported errors in a filesystem independent way, it'd > be better than what we have right now. > > 1) figure out /sys/$whatnot $directory belongs to > 2) oldcount=$(cat /sys/$whatnot/unreported_errors) > 3) filesystem operations in $directory > 4) sync;sync; > 5) newcount=$(cat /sys/$whatnot/unreported_errors) > 6) test "$oldcount" -eq "$newcount" || die-with-horrible-message > > Isn't beautiful to script, but it's also not absolutely terrible. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 21:27 ` Jeff Layton @ 2018-04-12 21:53 ` Andres Freund 2018-04-12 21:57 ` Theodore Y. Ts'o 0 siblings, 1 reply; 57+ messages in thread From: Andres Freund @ 2018-04-12 21:53 UTC (permalink / raw) To: Jeff Layton Cc: Matthew Wilcox, Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On 2018-04-12 17:27:54 -0400, Jeff Layton wrote: > On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote: > > At the very least *some* way to *know* that such a failure occurred from > > userland without having to parse the kernel log. As far as I understand, > > neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report > > an error if it was encountered by writeback in the background. > > > > If that's indeed true for syncfs(2), even if the fd has been opened > > before (which I can see how it could happen from an implementation POV, > > nothing would associate a random FD with failures on different files), > > it's really impossible to detect this stuff from userland without text > > parsing. > > > > syncfs could use some work. It's really too bad that it doesn't have a flags argument. > We probably also need to consider how to better track metadata > writeback errors (on e.g. ext2). We don't really do that properly at > quite yet either. > > > > Even if it'd were just a perf-fs /sys/$something file that'd return the > > current count of unreported errors in a filesystem independent way, it'd > > be better than what we have right now. > > > > 1) figure out /sys/$whatnot $directory belongs to > > 2) oldcount=$(cat /sys/$whatnot/unreported_errors) > > 3) filesystem operations in $directory > > 4) sync;sync; > > 5) newcount=$(cat /sys/$whatnot/unreported_errors) > > 6) test "$oldcount" -eq "$newcount" || die-with-horrible-message > > > > Isn't beautiful to script, but it's also not absolutely terrible. ext4 seems to have something roughly like that (/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to be incremented from the necessary places. By my reading XFS doesn't seem to have something similar. Wouldn't be bad to standardize... Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 21:53 ` Andres Freund @ 2018-04-12 21:57 ` Theodore Y. Ts'o 0 siblings, 0 replies; 57+ messages in thread From: Theodore Y. Ts'o @ 2018-04-12 21:57 UTC (permalink / raw) To: Andres Freund Cc: Jeff Layton, Matthew Wilcox, Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu, Apr 12, 2018 at 02:53:19PM -0700, Andres Freund wrote: > > > > > > Isn't beautiful to script, but it's also not absolutely terrible. > > ext4 seems to have something roughly like that > (/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to > be incremented from the necessary places. This is only for file system inconsistencies noticed by the kernel. We don't bump that count for data block I/O errors. The same idea could be used on a block device level. It would be pretty simple to maintain a counter for I/O errors, and when the last error was detected on a particular device. You could evne break out and track read errors and write errors eparately if that would be useful. If you don't care what block was bad, but just that _some_ I/O error had happened, a counter is definitely the simplest approach, and less hair to implemnet and use than something like a netlink channel or scraping dmesg.... - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 11:09 ` Jeff Layton ` (2 preceding siblings ...) 2018-04-12 20:24 ` Andres Freund @ 2018-04-21 18:14 ` Jan Kara 3 siblings, 0 replies; 57+ messages in thread From: Jan Kara @ 2018-04-21 18:14 UTC (permalink / raw) To: Jeff Layton Cc: Matthew Wilcox, Andres Freund, Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Joshua D. Drake On Thu 12-04-18 07:09:14, Jeff Layton wrote: > On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote: > > At the moment, when we open a file, we sample the current state of the > > writeback error and only report new errors. We could set it to zero > > instead, and report the most recent error as soon as anything happens > > which would report an error. That way err = close(open("file")); would > > report the most recent error. > > > > That's not going to be persistent across the data structure for that inode > > being removed from memory; we'd need filesystem support for persisting > > that. But maybe it's "good enough" to only support it for recent files. > > > > Jeff, what do you think? > > I hate it :). We could do that, but....yecchhhh. > > Reporting errors only in the case where the inode happened to stick > around in the cache seems too unreliable for real-world usage, and might > be problematic for some use cases. I'm also not sure it would really be > helpful. So this is never going to be perfect but I think we could do good enough by: 1) Mark inodes that hit IO error. 2) If the inode gets evicted from memory we store the fact that we hit an error for this IO in a more space efficient data structure (sparse bitmap, radix tree, extent tree, whatever). 3) If the underlying device gets destroyed, we can just switch the whole SB to an error state and forget per inode info. 4) If there's too much of per-inode error info (probably per-fs configurable limit in terms of number of inodes), we would yell in the kernel log, switch the whole fs to the error state and forget per inode info. This way there won't be silent loss of IO errors. Memory usage would be reasonably limited. It could happen the whole fs would switch to error state "prematurely" but if that's a problem for the machine, admin could tune the limit for number of inodes to keep IO errors for... > I think the crux of the matter here is not really about error reporting, > per-se. I think this is related but a different question. > I asked this at LSF last year, and got no real answer: > > When there is a writeback error, what should be done with the dirty > page(s)? Right now, we usually just mark them clean and carry on. Is > that the right thing to do? > > One possibility would be to invalidate the range that failed to be > written (or the whole file) and force the pages to be faulted in again > on the next access. It could be surprising for some applications to not > see the results of their writes on a subsequent read after such an > event. > > Maybe that's ok in the face of a writeback error though? IDK. I can see the admin wanting to rather kill the machine with OOM than having to deal with data loss due to IO errors (e.g. if he has HA server fail over set up). Or retry for some time before dropping the dirty data. Or do what we do now (possibly with invalidating pages as you say). As Dave said elsewhere there's not one strategy that's going to please everybody. So it might be beneficial to have this configurable like XFS has it for metadata. OTOH if I look at the problem from application developer POV, most apps will just declare game over at the face of IO errors (if they take care to check for them at all). And the sophisticated apps that will try some kind of error recovery have to be prepared that the data is just gone (as depending on what exactly the kernel does is rather fragile) so I'm not sure how much practical value the configurable behavior on writeback errors would bring. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 2:17 ` Andres Freund 2018-04-12 3:02 ` Matthew Wilcox @ 2018-04-12 5:34 ` Theodore Y. Ts'o 2018-04-12 19:55 ` Andres Freund 2018-04-18 18:09 ` J. Bruce Fields 2 siblings, 1 reply; 57+ messages in thread From: Theodore Y. Ts'o @ 2018-04-12 5:34 UTC (permalink / raw) To: Andres Freund Cc: Andreas Dilger, 20180410184356.GD3563, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote: > > If there is background data writeback *without an open file descriptor*, > > there is no mechanism for the kernel to return an error to any application > > which may exist, or may not ever come back. > > And that's *horrible*. If I cp a file, and writeback fails in the > background, and I then cat that file before restarting, I should be able > to see that that failed. Instead of returning something bogus. If there is no open file descriptor, and in many cases, no process (because it has already exited), it may be horrible, but what the h*ll else do you expect the OS to do? The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon. If it detects errors on a particular hard drive, it tells the cluster file system to stop using that disk, and to reconstruct from erasure code all of the data chunks on that disk onto other disks in the cluster. We then run a series of disk diagnostics to make sure we find all of the bad sectors (every often, where there is one bad sector, there are several more waiting to be found), and then afterwards, put the disk back into service. By making it be a separate health monitoring process, we can have HDD experts write much more sophisticated code that can ask the disk firmware for more information (e.g., SMART, the grown defect list), do much more careful scrubbing of the disk media, etc., before returning the disk back to service. > > Everyone already has too much work to do, so you need to find someone > > who has an interest in fixing this (IMHO very peculiar) use case. If > > PG developers want to add a tunable "keep dirty pages in RAM on IO > > failure", I don't think that it would be too hard for someone to do. > > It might be harder to convince some of the kernel maintainers to > > accept it, and I've been on the losing side of that battle more than > > once. However, like everything you don't pay for, you can't require > > someone else to do this for you. It wouldn't hurt to see if Jeff > > Layton, who wrote the errseq patches, would be interested to work on > > something like this. > > I don't think this is that PG specific, as explained above. The reality is that recovering from disk errors is tricky business, and I very much doubt most userspace applications, including distro package managers, are going to want to engineer for trying to detect and recover from disk errors. If that were true, then Red Hat and/or SuSE have kernel engineers, and they would have implemented everything everything on your wish list. They haven't, and that should tell you something. The other reality is that once a disk starts developing errors, in reality you will probably need to take the disk off-line, scrub it to find any other media errors, and there's a good chance you'll need to rewrite bad sectors (incluing some which are on top of file system metadata, so you probably will have to run fsck or reformat the whole file system). I certainly don't think it's realistic to assume adding lots of sophistication to each and every userspace program. If you have tens or hundreds of thousands of disk drives, then you will need to do tsomething automated, but I claim that you really don't want to smush all of that detailed exception handling and HDD repair technology into each database or cluster file system component. It really needs to be done in a separate health-monitor and machine-level management system. Regards, - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 5:34 ` Theodore Y. Ts'o @ 2018-04-12 19:55 ` Andres Freund 2018-04-12 21:52 ` Theodore Y. Ts'o 0 siblings, 1 reply; 57+ messages in thread From: Andres Freund @ 2018-04-12 19:55 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Andreas Dilger, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake Hi, On 2018-04-12 01:34:45 -0400, Theodore Y. Ts'o wrote: > The solution we use at Google is that we watch for I/O errors using a > completely different process that is responsible for monitoring > machine health. It used to scrape dmesg, but we now arrange to have > I/O errors get sent via a netlink channel to the machine health > monitoring daemon. Any pointers to that the underling netlink mechanism? If we can force postgres to kill itself when such an error is detected (via a dedicated monitoring process), I'd personally be happy enough. It'd be nicer if we could associate that knowledge with particular filesystems etc (which'd possibly hard through dm etc?), but this'd be much better than nothing. > The reality is that recovering from disk errors is tricky business, > and I very much doubt most userspace applications, including distro > package managers, are going to want to engineer for trying to detect > and recover from disk errors. If that were true, then Red Hat and/or > SuSE have kernel engineers, and they would have implemented everything > everything on your wish list. They haven't, and that should tell you > something. The problem really isn't about *recovering* from disk errors. *Knowing* about them is the crucial part. We do not want to give back clients the information that an operation succeeded, when it actually didn't. There could be improvements above that, but as long as it's guaranteed that "we" get the error (rather than just some kernel log we don't have access to, which looks different due to config etc), it's ok. We can throw our hands up in the air and give up. > The other reality is that once a disk starts developing errors, in > reality you will probably need to take the disk off-line, scrub it to > find any other media errors, and there's a good chance you'll need to > rewrite bad sectors (incluing some which are on top of file system > metadata, so you probably will have to run fsck or reformat the whole > file system). I certainly don't think it's realistic to assume adding > lots of sophistication to each and every userspace program. > If you have tens or hundreds of thousands of disk drives, then you > will need to do tsomething automated, but I claim that you really > don't want to smush all of that detailed exception handling and HDD > repair technology into each database or cluster file system component. > It really needs to be done in a separate health-monitor and > machine-level management system. Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 19:55 ` Andres Freund @ 2018-04-12 21:52 ` Theodore Y. Ts'o 2018-04-12 22:03 ` Andres Freund 0 siblings, 1 reply; 57+ messages in thread From: Theodore Y. Ts'o @ 2018-04-12 21:52 UTC (permalink / raw) To: Andres Freund Cc: Andreas Dilger, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake On Thu, Apr 12, 2018 at 12:55:36PM -0700, Andres Freund wrote: > > Any pointers to that the underling netlink mechanism? If we can force > postgres to kill itself when such an error is detected (via a dedicated > monitoring process), I'd personally be happy enough. It'd be nicer if > we could associate that knowledge with particular filesystems etc > (which'd possibly hard through dm etc?), but this'd be much better than > nothing. Yeah, sorry, it never got upstreamed. It's not really all that complicated, it was just that there were some other folks who wanted to do something similar, and there was a round of bike-sheddingh several years ago, and nothing ever went upstream. Part of the problem was that our orignial scheme sent up information about file system-level corruption reports --- e.g, those stemming from calls to ext4_error() --- and lots of people had different ideas about how tot get all of the possible information up in some structured format. (Think something like uerf from Digtial's OSF/1.) We did something *really* simple/stupid. We just sent essentially an ascii test string out the netlink socket. That's because what we were doing before was essentially scraping the output of dmesg (e.g. /dev/kmssg). That's actually probably the simplest thing to do, and it has the advantage that it will work even on ancient enterprise kernels that PG users are likely to want to use. So you will need to implement the dmesg text scraper anyway, and that's probably good enough for most use cases. > The problem really isn't about *recovering* from disk errors. *Knowing* > about them is the crucial part. We do not want to give back clients the > information that an operation succeeded, when it actually didn't. There > could be improvements above that, but as long as it's guaranteed that > "we" get the error (rather than just some kernel log we don't have > access to, which looks different due to config etc), it's ok. We can > throw our hands up in the air and give up. Right, it's a little challenging because the actual regexp's you would need to use do vary from device driver to device driver. Fortunately nearly everything is a SCSI/SATA device these days, so there isn't _that_ much variability. > Yea, agreed on all that. I don't think anybody actually involved in > postgres wants to do anything like that. Seems far outside of postgres' > remit. Some people on the pg-hackers list were talking about wanting to retry the fsync() and hoping that would cause the write to somehow suceed. It's *possible* that might help, but it's not likely to be helpful in my experience. Cheers, - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 21:52 ` Theodore Y. Ts'o @ 2018-04-12 22:03 ` Andres Freund 0 siblings, 0 replies; 57+ messages in thread From: Andres Freund @ 2018-04-12 22:03 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Andreas Dilger, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake Hi, On 2018-04-12 17:52:52 -0400, Theodore Y. Ts'o wrote: > We did something *really* simple/stupid. We just sent essentially an > ascii test string out the netlink socket. That's because what we were > doing before was essentially scraping the output of dmesg > (e.g. /dev/kmssg). > > That's actually probably the simplest thing to do, and it has the > advantage that it will work even on ancient enterprise kernels that PG > users are likely to want to use. So you will need to implement the > dmesg text scraper anyway, and that's probably good enough for most > use cases. The worst part of that is, as you mention below, needing to handle a lot of different error message formats. I guess it's reasonable enough if you control your hardware, but no such luck. Aren't there quite realistic scenarios where one could miss kmsg style messages due to it being a ringbuffer? > Right, it's a little challenging because the actual regexp's you would > need to use do vary from device driver to device driver. Fortunately > nearly everything is a SCSI/SATA device these days, so there isn't > _that_ much variability. There's also SAN / NAS type stuff - not all of that presents as a SCSI/SATA device, right? > > Yea, agreed on all that. I don't think anybody actually involved in > > postgres wants to do anything like that. Seems far outside of postgres' > > remit. > > Some people on the pg-hackers list were talking about wanting to retry > the fsync() and hoping that would cause the write to somehow suceed. > It's *possible* that might help, but it's not likely to be helpful in > my experience. Depends on the type of error and storage. ENOSPC, especially over NFS, has some reasonable chances of being cleared up. And for networked block storage it's also not impossible to think of scenarios where that'd work for EIO. But I think besides hope of clearing up itself, it has the advantage that it trivially can give *some* feedback to the user. The user'll get back strerror(ENOSPC) with some decent SQL error code, which'll hopefully cause them to investigate (well, once monitoring detects high error rates). It's much nicer for the user to type COMMIT; get an appropriate error back etc, than if the database just commits suicide. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-12 2:17 ` Andres Freund 2018-04-12 3:02 ` Matthew Wilcox 2018-04-12 5:34 ` Theodore Y. Ts'o @ 2018-04-18 18:09 ` J. Bruce Fields 2 siblings, 0 replies; 57+ messages in thread From: J. Bruce Fields @ 2018-04-18 18:09 UTC (permalink / raw) To: Andres Freund Cc: Andreas Dilger, 20180410184356.GD3563, Theodore Y. Ts'o, Ext4 Developers List, Linux FS Devel, Jeff Layton, Joshua D. Drake On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote: > Hi, > > On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote: > > On Apr 10, 2018, at 4:07 PM, Andres Freund <andres@anarazel.de> wrote: > > > 2018-04-10 18:43:56 Ted wrote: > > >> So for better or for worse, there has not been as much investment in > > >> buffered I/O and data robustness in the face of exception handling of > > >> storage devices. > > > > > > That's a bit of a cop out. It's not just databases that care. Even more > > > basic tools like SCM, package managers and editors care whether they can > > > proper responses back from fsync that imply things actually were synced. > > > > Sure, but it is mostly PG that is doing (IMHO) crazy things like writing > > to thousands(?) of files, closing the file descriptors, then expecting > > fsync() on a newly-opened fd to return a historical error. > > It's not just postgres. dpkg (underlying apt, on debian derived distros) > to take an example I just randomly guessed, does too: > /* We want to guarantee the extracted files are on the disk, so that the > * subsequent renames to the info database do not end up with old or zero > * length files in case of a system crash. As neither dpkg-deb nor tar do > * explicit fsync()s, we have to do them here. > * XXX: This could be avoided by switching to an internal tar extractor. */ > dir_sync_contents(cidir); > > (a bunch of other places too) > > Especially on ext3 but also on newer filesystems it's performancewise > entirely infeasible to fsync() every single file individually - the > performance becomes entirely attrocious if you do that. Is that still true if you're able to use some kind of parallelism? (async io, or fsync from multiple processes?) --b. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-10 22:07 Andres Freund 2018-04-11 21:52 ` Andreas Dilger @ 2018-04-13 14:48 ` Matthew Wilcox 2018-04-21 16:59 ` Jan Kara 1 sibling, 1 reply; 57+ messages in thread From: Matthew Wilcox @ 2018-04-13 14:48 UTC (permalink / raw) To: Andres Freund Cc: Theodore Y. Ts'o, linux-ext4, linux-fsdevel, Joshua D. Drake, Andreas Dilger On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote: > I don't think that's the full issue. We can deal with the fact that an > fsync failure is edge-triggered if there's a guarantee that every > process doing so would get it. The fact that one needs to have an FD > open from before any failing writes occurred to get a failure, *THAT'S* > the big issue. > > Beyond postgres, it's a pretty common approach to do work on a lot of > files without fsyncing, then iterate over the directory fsync > everything, and *then* assume you're safe. But unless I severaly > misunderstand something that'd only be safe if you kept an FD for every > file open, which isn't realistic for pretty obvious reasons. While accepting that under memory pressure we can still evict the error indicators, we can do a better job than we do today. The current design of error reporting says that all errors which occurred before you opened the file descriptor are of no interest to you. I don't think that's necessarily true, and it's actually a change of behaviour from before the errseq work. Consider Stupid Task A which calls open(), write(), close(), and Smart Task B which calls open(), write(), fsync(), close() operating on the same file. If A goes entirely before B and encounters an error, before errseq_t, B would see the error from A's write. If A and B overlap, even a little bit, then B still gets to see A's error today. But if writeback happens for A's write before B opens the file then B will never see the error. B doesn't want to see historical errors that a previous invocation of B has already handled, but we know whether *anyone* has seen the error or not. So here's a patch which restores the historical behaviour of seeing old unhandled errors on a fresh file descriptor: Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> diff --git a/lib/errseq.c b/lib/errseq.c index df782418b333..093f1fba4ee0 100644 --- a/lib/errseq.c +++ b/lib/errseq.c @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set); errseq_t errseq_sample(errseq_t *eseq) { errseq_t old = READ_ONCE(*eseq); - errseq_t new = old; - /* - * For the common case of no errors ever having been set, we can skip - * marking the SEEN bit. Once an error has been set, the value will - * never go back to zero. - */ - if (old != 0) { - new |= ERRSEQ_SEEN; - if (old != new) - cmpxchg(eseq, old, new); - } - return new; + /* If nobody has seen this error yet, then we can be the first. */ + if (!(old & ERRSEQ_SEEN)) + old = 0; + return old; } EXPORT_SYMBOL(errseq_sample); ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: fsync() errors is unsafe and risks data loss 2018-04-13 14:48 ` Matthew Wilcox @ 2018-04-21 16:59 ` Jan Kara 0 siblings, 0 replies; 57+ messages in thread From: Jan Kara @ 2018-04-21 16:59 UTC (permalink / raw) To: Matthew Wilcox Cc: Andres Freund, Theodore Y. Ts'o, linux-ext4, linux-fsdevel, Joshua D. Drake, Andreas Dilger On Fri 13-04-18 07:48:07, Matthew Wilcox wrote: > On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote: > > I don't think that's the full issue. We can deal with the fact that an > > fsync failure is edge-triggered if there's a guarantee that every > > process doing so would get it. The fact that one needs to have an FD > > open from before any failing writes occurred to get a failure, *THAT'S* > > the big issue. > > > > Beyond postgres, it's a pretty common approach to do work on a lot of > > files without fsyncing, then iterate over the directory fsync > > everything, and *then* assume you're safe. But unless I severaly > > misunderstand something that'd only be safe if you kept an FD for every > > file open, which isn't realistic for pretty obvious reasons. > > While accepting that under memory pressure we can still evict the error > indicators, we can do a better job than we do today. The current design > of error reporting says that all errors which occurred before you opened > the file descriptor are of no interest to you. I don't think that's > necessarily true, and it's actually a change of behaviour from before > the errseq work. > > Consider Stupid Task A which calls open(), write(), close(), and Smart > Task B which calls open(), write(), fsync(), close() operating on the > same file. If A goes entirely before B and encounters an error, before > errseq_t, B would see the error from A's write. > > If A and B overlap, even a little bit, then B still gets to see A's > error today. But if writeback happens for A's write before B opens the > file then B will never see the error. > > B doesn't want to see historical errors that a previous invocation of > B has already handled, but we know whether *anyone* has seen the error > or not. So here's a patch which restores the historical behaviour of > seeing old unhandled errors on a fresh file descriptor: > > Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> So I agree with going to the old semantics of reporting errors from before a file was open at least once to someone. As the PG case shows apps are indeed relying on the old behavior. As much as it is unreliable, it ends up doing the right thing for these apps in 99% of cases and we shouldn't break them (BTW IMO the changelog should contain a note that this fixes a regression of PostgreSQL, a reference to this thread and CC to stable). Anyway feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Oh, and to make myself clear I do think we need to find a better way of reporting IO errors. I consider this just an immediate band-aid to avoid userspace regressions. Honza > diff --git a/lib/errseq.c b/lib/errseq.c > index df782418b333..093f1fba4ee0 100644 > --- a/lib/errseq.c > +++ b/lib/errseq.c > @@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set); > errseq_t errseq_sample(errseq_t *eseq) > { > errseq_t old = READ_ONCE(*eseq); > - errseq_t new = old; > > - /* > - * For the common case of no errors ever having been set, we can skip > - * marking the SEEN bit. Once an error has been set, the value will > - * never go back to zero. > - */ > - if (old != 0) { > - new |= ERRSEQ_SEEN; > - if (old != new) > - cmpxchg(eseq, old, new); > - } > - return new; > + /* If nobody has seen this error yet, then we can be the first. */ > + if (!(old & ERRSEQ_SEEN)) > + old = 0; > + return old; -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 57+ messages in thread
end of thread, other threads:[~2018-04-21 23:19 UTC | newest] Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <8da874c9-cf9c-d40a-3474-b773190878e7@commandprompt.com> [not found] ` <20180410184356.GD3563@thunk.org> 2018-04-10 19:47 ` fsync() errors is unsafe and risks data loss Martin Steigerwald 2018-04-18 16:52 ` J. Bruce Fields 2018-04-19 8:39 ` Christoph Hellwig 2018-04-19 14:10 ` J. Bruce Fields 2018-04-10 22:07 Andres Freund 2018-04-11 21:52 ` Andreas Dilger 2018-04-12 0:09 ` Dave Chinner 2018-04-12 2:32 ` Andres Freund 2018-04-12 2:51 ` Andres Freund 2018-04-12 5:09 ` Theodore Y. Ts'o 2018-04-12 5:45 ` Dave Chinner 2018-04-12 11:24 ` Jeff Layton 2018-04-12 21:11 ` Andres Freund 2018-04-12 10:19 ` Lukas Czerner 2018-04-12 19:46 ` Andres Freund 2018-04-12 2:17 ` Andres Freund 2018-04-12 3:02 ` Matthew Wilcox 2018-04-12 11:09 ` Jeff Layton 2018-04-12 11:19 ` Matthew Wilcox 2018-04-12 12:01 ` Dave Chinner 2018-04-12 15:08 ` Jeff Layton 2018-04-12 22:44 ` Dave Chinner 2018-04-13 13:18 ` Jeff Layton 2018-04-13 13:25 ` Andres Freund 2018-04-13 14:02 ` Matthew Wilcox 2018-04-14 1:47 ` Dave Chinner 2018-04-14 2:04 ` Andres Freund 2018-04-18 23:59 ` Dave Chinner 2018-04-19 0:23 ` Eric Sandeen 2018-04-14 2:38 ` Matthew Wilcox 2018-04-19 0:13 ` Dave Chinner 2018-04-19 0:40 ` Matthew Wilcox 2018-04-19 1:08 ` Theodore Y. Ts'o 2018-04-19 17:40 ` Matthew Wilcox 2018-04-19 23:27 ` Theodore Y. Ts'o 2018-04-19 23:28 ` Dave Chinner 2018-04-12 15:16 ` Theodore Y. Ts'o 2018-04-12 20:13 ` Andres Freund 2018-04-12 20:28 ` Matthew Wilcox 2018-04-12 21:14 ` Jeff Layton 2018-04-12 21:31 ` Matthew Wilcox 2018-04-13 12:56 ` Jeff Layton 2018-04-12 21:21 ` Theodore Y. Ts'o 2018-04-12 21:24 ` Matthew Wilcox 2018-04-12 21:37 ` Andres Freund 2018-04-12 20:24 ` Andres Freund 2018-04-12 21:27 ` Jeff Layton 2018-04-12 21:53 ` Andres Freund 2018-04-12 21:57 ` Theodore Y. Ts'o 2018-04-21 18:14 ` Jan Kara 2018-04-12 5:34 ` Theodore Y. Ts'o 2018-04-12 19:55 ` Andres Freund 2018-04-12 21:52 ` Theodore Y. Ts'o 2018-04-12 22:03 ` Andres Freund 2018-04-18 18:09 ` J. Bruce Fields 2018-04-13 14:48 ` Matthew Wilcox 2018-04-21 16:59 ` Jan Kara
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).