* [LSF/MM TOPIC] improving writeback error handling
From: Jeff Layton @ 2018-04-17 11:08 UTC
To: lsf-pc; +Cc: linux-fsdevel, Matthew Wilcox, Andres Freund

I'd like to have a discussion on how to improve the handling of errors
that occur during writeback. I think there are 4 main issues that would
be helpful to cover:

1) In v4.16, we added the errseq mechanism to the kernel to make
writeback error reporting more reliable. That has helped some use cases,
but there are some applications (e.g. PostgreSQL) that were relying on
the ability to see writeback errors that occurred before the file
description existed. Do we need to tweak this mechanism to help those
use cases, or would that do more harm than good?

2) Most filesystems report data wb errors on all fds that were open at
the time of the error now, but metadata writeback can also fail, and
those don't get reported in the same way so far. Should we extend those
semantics to metadata writeback? How do we get there if so?

3) The question of what to do with pages in the pagecache that fail
writeback is still unresolved. Most filesystems just clear the dirty bit
and carry on, but some invalidate them or just clear the uptodate bit.
This sort of inconsistency (and lack of documentation) is problematic
and has led to applications assuming behavior that doesn't exist. I
believe we need to declare an "ideal behavior" for Linux filesystems in
this regard, add VM/FS helpers to make it easier for filesystems to
conform to that behavior, and document it well. The big question is:
what sort of behavior makes the most sense here?

4) syncfs doesn't currently report an error when a single inode fails
writeback, only when syncing out the block device. Should it report
errors in that case as well?

Thanks!
-- 
Jeff Layton <jlayton@kernel.org>
* Re: [LSF/MM TOPIC] improving writeback error handling
From: Dave Chinner @ 2018-04-17 22:53 UTC
To: Jeff Layton; +Cc: lsf-pc, linux-fsdevel, Matthew Wilcox, Andres Freund

On Tue, Apr 17, 2018 at 07:08:01AM -0400, Jeff Layton wrote:
> I'd like to have a discussion on how to improve the handling of errors
> that occur during writeback. I think there are 4 main issues that would
> be helpful to cover:
>
> 1) In v4.16, we added the errseq mechanism to the kernel to make
> writeback error reporting more reliable. That has helped some use cases,
> but there are some applications (e.g. PostgreSQL) that were relying on
> the ability to see writeback errors that occurred before the file
> description existed. Do we need to tweak this mechanism to help those
> use cases, or would that do more harm than good?

More harm than good, IMO. Too many wacky corner cases...

> 2) Most filesystems report data wb errors on all fds that were open at
> the time of the error now, but metadata writeback can also fail, and
> those don't get reported in the same way so far. Should we extend those
> semantics to metadata writeback? How do we get there if so?

No. Metadata writeback errors are intimately related to internal
filesystem consistency - if there's a problem with metadata writeback,
it's up to the filesystem to determine if fsync needs to report it or
not.

And, FWIW, journalling filesystems will follow their global "fatal
error" shutdown configuration if there's a fatal metadata writeback
error. Such an error is essentially filesystem corruption, so once this
error condition is set, any subsequent attempt to fsync or modify the
filesystem should return errors (i.e. EIO or EFSCORRUPTED).

> 3) The question of what to do with pages in the pagecache that fail
> writeback is still unresolved. Most filesystems just clear the dirty bit
> and carry on, but some invalidate them or just clear the uptodate bit.
> This sort of inconsistency (and lack of documentation) is problematic
> and has led to applications assuming behavior that doesn't exist. I
> believe we need to declare an "ideal behavior" for Linux filesystems in
> this regard, add VM/FS helpers to make it easier for filesystems to
> conform to that behavior, and document it well. The big question is:
> what sort of behavior makes the most sense here?

User configurable on a per-filesystem basis, just like metadata error
handling. The default should be what is best for system stability when
users do things like pull USB sticks with GB of dirty data still in
memory.

> 4) syncfs doesn't currently report an error when a single inode fails
> writeback, only when syncing out the block device. Should it report
> errors in that case as well?

Yes.

And there's more than that: the filesystem ->sync_fs() method also needs
to return an error, because that's where most filesystems persist their
metadata. And because sync_fs has historically thrown the returned error
away, most filesystems don't bother checking or returning errors.

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
From: Jeff Layton @ 2018-04-18 16:00 UTC
To: Dave Chinner; +Cc: linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox

On Wed, 2018-04-18 at 08:53 +1000, Dave Chinner wrote:
> On Tue, Apr 17, 2018 at 07:08:01AM -0400, Jeff Layton wrote:
> > I'd like to have a discussion on how to improve the handling of errors
> > that occur during writeback. I think there are 4 main issues that would
> > be helpful to cover:
> >
> > 1) In v4.16, we added the errseq mechanism to the kernel to make
> > writeback error reporting more reliable. That has helped some use cases,
> > but there are some applications (e.g. PostgreSQL) that were relying on
> > the ability to see writeback errors that occurred before the file
> > description existed. Do we need to tweak this mechanism to help those
> > use cases, or would that do more harm than good?
>
> More harm than good, IMO. Too many wacky corner cases...

One idea that willy had was to set f_wb_err in the file to 0 if the
inode's wb_err has the SEEN bit set. That would allow the Pg use case to
continue working, as long as they don't wait so long to open the file
that the inode gets evicted from the cache.

That latter bit is what worries me. Such behavior is non-deterministic.

> > 2) Most filesystems report data wb errors on all fds that were open at
> > the time of the error now, but metadata writeback can also fail, and
> > those don't get reported in the same way so far. Should we extend those
> > semantics to metadata writeback? How do we get there if so?
>
> No. Metadata writeback errors are intimately related to internal
> filesystem consistency - if there's a problem with metadata writeback,
> it's up to the filesystem to determine if fsync needs to report it or
> not.
>
> And, FWIW, journalling filesystems will follow their global "fatal
> error" shutdown configuration if there's a fatal metadata writeback
> error. Such an error is essentially filesystem corruption, so once this
> error condition is set, any subsequent attempt to fsync or modify the
> filesystem should return errors (i.e. EIO or EFSCORRUPTED).

Ok, fair enough. A few months ago, I played around with some patches to
do this for ext4. It meant growing struct file though, which was sort of
nasty. I also had no idea how to test it properly, so I punted on the
idea. We could revisit it, but most folks are using errors=remount-ro on
ext4, and that makes it sort of a moot point.

> > 3) The question of what to do with pages in the pagecache that fail
> > writeback is still unresolved. Most filesystems just clear the dirty bit
> > and carry on, but some invalidate them or just clear the uptodate bit.
> > This sort of inconsistency (and lack of documentation) is problematic
> > and has led to applications assuming behavior that doesn't exist. I
> > believe we need to declare an "ideal behavior" for Linux filesystems in
> > this regard, add VM/FS helpers to make it easier for filesystems to
> > conform to that behavior, and document it well. The big question is:
> > what sort of behavior makes the most sense here?
>
> User configurable on a per-filesystem basis, just like metadata error
> handling. The default should be what is best for system stability when
> users do things like pull USB sticks with GB of dirty data still in
> memory.

Maybe. I think though we need to consider offering a bit more of a
guarantee here.

As you pointed out recently, xfs will clear the uptodate bit on a
writeback failure, so you'll end up having to re-read the page from disk
later if someone issues a read against it.

But...that means that we offer no guarantee of the posix requirement to
see the result of a write in a subsequent read. If writeback occurs
between the two, that write could vanish.

So, if you're relying on being able to see the result of a write in your
read, then you really _must_ issue fsync prior to any read and check the
error code. That sort of sucks, performance-wise.

It might be nice if we could ensure that the data sticks around for a
short period of time (a few seconds at least) so that you wouldn't have
to issue fsync so frequently to get such a guarantee. That may be too
difficult to do though. idk.

In any case, we should document this much better than we do today, so
that userland devs have proper expectations of the behavior here.

> > 4) syncfs doesn't currently report an error when a single inode fails
> > writeback, only when syncing out the block device. Should it report
> > errors in that case as well?
>
> Yes.

I have a small patch that implements this that I posted a few days ago.
I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that
it's the right approach.

Instead of growing struct file to accommodate a second errseq cursor,
this only implements that behavior when the file is opened with O_PATH
(as we know that that will have the fsync op set to NULL). Maybe we can
do this better somehow though.

> And there's more than that: the filesystem ->sync_fs() method also needs
> to return an error, because that's where most filesystems persist their
> metadata. And because sync_fs has historically thrown the returned error
> away, most filesystems don't bother checking or returning errors.

That sounds like a good goal too. The sync_fs prototype already returns
an error, but it looks like all of the callers ignore it. Making it pass
that up would be a good initial goal, and then we could fix up the
sync_fs operations one-by-one.

-- 
Jeff Layton <jlayton@kernel.org>
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
From: Dave Chinner @ 2018-04-19 0:44 UTC
To: Jeff Layton; +Cc: linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox

On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote:
> On Wed, 2018-04-18 at 08:53 +1000, Dave Chinner wrote:
> > On Tue, Apr 17, 2018 at 07:08:01AM -0400, Jeff Layton wrote:
> > > I'd like to have a discussion on how to improve the handling of errors
> > > that occur during writeback. I think there are 4 main issues that would
> > > be helpful to cover:
> > >
> > > 1) In v4.16, we added the errseq mechanism to the kernel to make
> > > writeback error reporting more reliable. That has helped some use cases,
> > > but there are some applications (e.g. PostgreSQL) that were relying on
> > > the ability to see writeback errors that occurred before the file
> > > description existed. Do we need to tweak this mechanism to help those
> > > use cases, or would that do more harm than good?
> >
> > More harm than good, IMO. Too many wacky corner cases...
>
> One idea that willy had was to set f_wb_err in the file to 0 if the
> inode's wb_err has the SEEN bit set. That would allow the Pg use case to
> continue working, as long as they don't wait so long to open the file
> that the inode gets evicted from the cache.
>
> That latter bit is what worries me. Such behavior is non-deterministic.

Yup. IMO, unreliable error reporting is a far worse situation than no
error reporting at all. If we can't guarantee delivery of the error,
then we should not be trying to expose errors through that mechanism.

We could prevent eviction of inodes with pending errors, but we've
already rejected that idea because it's a known OOM vector.

.....

> > > 3) The question of what to do with pages in the pagecache that fail
> > > writeback is still unresolved. Most filesystems just clear the dirty bit
> > > and carry on, but some invalidate them or just clear the uptodate bit.
> > > This sort of inconsistency (and lack of documentation) is problematic
> > > and has led to applications assuming behavior that doesn't exist. I
> > > believe we need to declare an "ideal behavior" for Linux filesystems in
> > > this regard, add VM/FS helpers to make it easier for filesystems to
> > > conform to that behavior, and document it well. The big question is:
> > > what sort of behavior makes the most sense here?
> >
> > User configurable on a per-filesystem basis, just like metadata error
> > handling. The default should be what is best for system stability when
> > users do things like pull USB sticks with GB of dirty data still in
> > memory.
>
> Maybe. I think though we need to consider offering a bit more of a
> guarantee here.
>
> As you pointed out recently, xfs will clear the uptodate bit on a
> writeback failure, so you'll end up having to re-read the page from disk
> later if someone issues a read against it.
>
> But...that means that we offer no guarantee of the posix requirement to
> see the result of a write in a subsequent read. If writeback occurs
> between the two, that write could vanish.
>
> So, if you're relying on being able to see the result of a write in your
> read, then you really _must_ issue fsync prior to any read and check the
> error code. That sort of sucks, performance-wise.
>
> It might be nice if we could ensure that the data sticks around for a
> short period of time (a few seconds at least) so that you wouldn't have
> to issue fsync so frequently to get such a guarantee.

Well, that's the point of the configurable error behaviour. Look at the
XFS metadata config for a moment:

$ ls /sys/fs/xfs/sdc/error/metadata/
EIO  ENODEV  ENOSPC  default
$ ls /sys/fs/xfs/sdc/error/metadata/EIO/
max_retries  retry_timeout_seconds
$ cat /sys/fs/xfs/sdc/error/metadata/EIO/max_retries
-1
$ cat /sys/fs/xfs/sdc/error/metadata/EIO/retry_timeout_seconds
-1
$ cat /sys/fs/xfs/sdc/error/metadata/ENODEV/max_retries
0
$ cat /sys/fs/xfs/sdc/error/metadata/ENODEV/retry_timeout_seconds
0

We have different config capability for EIO, ENOSPC and ENODEV, and
default behaviour for all other metadata writeback errors.

The defaults are "retry forever" for EIO (-1 means keep retrying forever
and retry whenever the log pushes on the journal to free up space), but
for ENODEV (i.e. the "USB stick pull") we default to fail-immediately
semantics.

i.e. by default we treat EIO as transient and ENODEV as permanent.

If we want to change EIO or ENOSPC to be "try a few times, then give
up", we set max_retries = 3 and retry_timeout_seconds = 300 to retry
writeback 3 times five minutes apart before considering the error
permanent. Once a permanent error is triggered, we shut down the
filesystem, as permanent errors in metadata writeback result in
filesystem corruption.

This is trickier to track for per-page data writeback errors because we
don't really have object-based IO control context. However, I didn't
really see these error configs applying to per-page errors, but rather
to the inode itself. That part is still to be worked out...

> > > 4) syncfs doesn't currently report an error when a single inode fails
> > > writeback, only when syncing out the block device. Should it report
> > > errors in that case as well?
> >
> > Yes.
>
> I have a small patch that implements this that I posted a few days ago.
> I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that
> it's the right approach.
>
> Instead of growing struct file to accommodate a second errseq cursor,
> this only implements that behavior when the file is opened with O_PATH
> (as we know that that will have the fsync op set to NULL). Maybe we can
> do this better somehow though.

No idea whether this is possible, or even a good idea, but could we just
have syncfs() create a temporary struct file for the duration of the
syncfs call, so it works on any fd passed in? (sorta like an internal
dupfd() call?)

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
From: Trond Myklebust @ 2018-04-19 1:47 UTC
To: david, jlayton; +Cc: andres, lsf-pc, linux-fsdevel, willy

On Thu, 2018-04-19 at 10:44 +1000, Dave Chinner wrote:
> On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote:
> > Instead of growing struct file to accommodate a second errseq cursor,
> > this only implements that behavior when the file is opened with O_PATH
> > (as we know that that will have the fsync op set to NULL). Maybe we can
> > do this better somehow though.
>
> No idea whether this is possible, or even a good idea, but could we just
> have syncfs() create a temporary struct file for the duration of the
> syncfs call, so it works on any fd passed in? (sorta like an internal
> dupfd() call?)

If the main use case is something like Postgresql, where you care about
just one or two critical files, rather than monitoring the entire
filesystem, could we perhaps use a dedicated mmap() mode? It should be
possible to throw up a bitmap that displays the exact blocks or pages
that are affected, once the file has been damaged.

For buffered files, you might even consider allowing an appropriately
privileged process to discard the dirty pages (perhaps by clearing the
bits in the bitmap?) once they have been identified and possibly copied
back out of the page cache.

Cheers,
Trond
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
From: Matthew Wilcox @ 2018-04-19 1:57 UTC
To: Trond Myklebust; +Cc: david, jlayton, andres, lsf-pc, linux-fsdevel

On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> If the main use case is something like Postgresql, where you care about
> just one or two critical files, rather than monitoring the entire
> filesystem, could we perhaps use a dedicated mmap() mode? It should be
> possible to throw up a bitmap that displays the exact blocks or pages
> that are affected, once the file has been damaged.

Perhaps we need to have a quick summary of the postgres problem ...
they're not concerned with "one or two files", otherwise they could just
keep those files open and the wb_err mechanism would work fine. The
problem is that they have too many files to keep open in their
checkpointer process, and when they come along and open the files, they
don't see the error.
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
From: Trond Myklebust @ 2018-04-19 2:12 UTC
To: willy; +Cc: lsf-pc, david, andres, jlayton, linux-fsdevel

On Wed, 2018-04-18 at 18:57 -0700, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> > If the main use case is something like Postgresql, where you care about
> > just one or two critical files, rather than monitoring the entire
> > filesystem, could we perhaps use a dedicated mmap() mode? It should be
> > possible to throw up a bitmap that displays the exact blocks or pages
> > that are affected, once the file has been damaged.
>
> Perhaps we need to have a quick summary of the postgres problem ...
> they're not concerned with "one or two files", otherwise they could just
> keep those files open and the wb_err mechanism would work fine. The
> problem is that they have too many files to keep open in their
> checkpointer process, and when they come along and open the files, they
> don't see the error.

I thought I understood that there were at least two issues here:

1) Monitoring lots of files to figure out which ones may have an error.
2) Drilling down to see what might be wrong with an individual file.

Unless you are in a situation where you can have millions of files all
go wrong at the same time, it would seem that the former is the
operation that needs to scale. Once you're talking about large numbers
of files all getting errors, it would appear that an fsck-like recovery
would be necessary. Am I wrong?

Cheers,
Trond
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
From: andres @ 2018-04-19 18:57 UTC
To: Trond Myklebust; +Cc: willy, lsf-pc, david, jlayton, linux-fsdevel

Hi,

On 2018-04-19 02:12:33 +0000, Trond Myklebust wrote:
> I thought I understood that there were at least two issues here:
>
> 1) Monitoring lots of files to figure out which ones may have an error.
> 2) Drilling down to see what might be wrong with an individual file.
>
> Unless you are in a situation where you can have millions of files all
> go wrong at the same time, it would seem that the former is the
> operation that needs to scale. Once you're talking about large numbers
> of files all getting errors, it would appear that an fsck-like recovery
> would be necessary. Am I wrong?

Well, the correctness issue really only centers around 1). Currently
there are scenarios (some made less, some more likely by the errseq_t
changes) where we don't notice IO errors. The result can either be that
we wrongly report back to the client that a "COMMIT;" was successful,
even though it wasn't persisted, or we might throw away journal data
because we think a checkpoint was successful even though it wasn't. To
fix the correctness issue we really only need 1).

That said, it'd obviously be nice to be able to report back a decent
error pointing to individual files, and a more descriptive error message
than "PANIC: An IO error occurred somewhere. Perhaps look in the kernel
logs?" wouldn't hurt either.

To give a short overview of how PostgreSQL issues fsync and does the
surrounding buffer management:

1) There's a traditional journal (WAL), addressed by LSN. Every
   modification needs to first be in the WAL before buffers (and thus
   on-disk data) can be modified.

2) There's a postgres-internal buffer cache. Pages are tagged with the
   WAL LSN that needs to be flushed to disk before the data can be
   written back.

3) Reads and writes between OS and buffer cache are done using buffered
   IO. There are valid reasons to change that, but it'll require new
   infrastructure. Each process has a limited-size path -> fd cache.

4) Buffers are written out by:
   - the checkpointing process during checkpoints
   - the background writer, which attempts to keep some "victim"
     buffers clean
   - backends (client connection associated) when they have to reuse
     dirty victim buffers

   Whenever such a writeout happens, information about the file
   containing that dirty buffer is forwarded to the checkpointer. The
   checkpointer keeps track of each file that'll need to be fsynced in
   a hashmap.

   It's worth noting that each table / index / whatnot is a separate
   file, and large relations are segmented into 1GB segments. So it's
   pretty common to have tens of thousands of files in a larger
   database.

5) During checkpointing, which is paced in most cases and often will be
   configured to take 30-60min, all buffers from before the start of
   the checkpoint will be written out. We'll issue
   SYNC_FILE_RANGE_WRITE requests occasionally to keep the amount of
   dirty kernel buffers under control. After that we'll fsync each of
   the dirty files. Once that and some other boring internal stuff have
   succeeded, we'll issue a checkpoint record, and allow discarding WAL
   from before the checkpoint.

Because we cannot realistically keep each of the files open between 4)
and the end of 5), and because the fds used in 4) are not the same as
the ones in 5) (different processes), we currently aren't guaranteed
notification of writeback failures.

Realistically we're not going to do much file-specific handling in case
there are errors. Either a retry is going to fix the issue (ooops RN,
because the error has been "eaten"), or we're doing a crash-recovery
cycle from WAL (oops, because we don't necessarily know an error
occurred).

It's worth noting that for us syncfs() is better than nothing, but it's
not perfect. It's pretty common to have temporary files (sort spool
files, temporary tables, ...) on the same filesystem as the persistent
database. syncfs() has the potential to flush out a lot of unnecessary
dirty data.

Note that it'd be very unlikely that the temp data files would be moved
to DIO - it's *good* that the kernel manages the amount of dirty /
cached data. It has a heck of a lot more knowledge about how much
memory pressure the system has than postgres ever will have.

One reason, besides some architectural issues inside PG, we've been
concerned about when considering DIO, is along similar lines. A lot of
people use databases as a part of their stack without focusing on them,
which usually means the database will be largely untuned. With buffered
IO that's not so bad; the kernel will dynamically adapt to some extent.
With DIO the consequences of a mistuned buffer cache size or such are
way worse. It's good for critical databases maintained by dedicated
people, not so good outside of that.

Matthew, I'm not sure what kind of summary you had in mind. Please let
me know if you want more detail in any of the areas, happy to expand.

Greetings,

Andres Freund
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
From: andres @ 2018-04-19 2:15 UTC
To: Matthew Wilcox; +Cc: Trond Myklebust, david, jlayton, lsf-pc, linux-fsdevel

On 2018-04-18 18:57:23 -0700, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> > If the main use case is something like Postgresql, where you care about
> > just one or two critical files, rather than monitoring the entire
> > filesystem, could we perhaps use a dedicated mmap() mode? It should be
> > possible to throw up a bitmap that displays the exact blocks or pages
> > that are affected, once the file has been damaged.
>
> Perhaps we need to have a quick summary of the postgres problem ...
> they're not concerned with "one or two files", otherwise they could just
> keep those files open and the wb_err mechanism would work fine. The
> problem is that they have too many files to keep open in their
> checkpointer process, and when they come along and open the files, they
> don't see the error.

Correct. Do you want a summary here or at LSF/MM? Thanks to Jon Corbet
I've got a last-minute invitation.

Greetings,

Andres Freund
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
From: Trond Myklebust @ 2018-04-19 2:19 UTC
To: andres, willy; +Cc: david, lsf-pc, jlayton, linux-fsdevel

On Wed, 2018-04-18 at 19:15 -0700, andres@anarazel.de wrote:
> On 2018-04-18 18:57:23 -0700, Matthew Wilcox wrote:
> > On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> > > If the main use case is something like Postgresql, where you care
> > > about just one or two critical files, rather than monitoring the
> > > entire filesystem, could we perhaps use a dedicated mmap() mode? It
> > > should be possible to throw up a bitmap that displays the exact
> > > blocks or pages that are affected, once the file has been damaged.
> >
> > Perhaps we need to have a quick summary of the postgres problem ...
> > they're not concerned with "one or two files", otherwise they could
> > just keep those files open and the wb_err mechanism would work fine.
> > The problem is that they have too many files to keep open in their
> > checkpointer process, and when they come along and open the files,
> > they don't see the error.
>
> Correct. Do you want a summary here or at LSF/MM? Thanks to Jon Corbet
> I've got a last-minute invitation.

I'm not attending LSF/MM this year.

Cheers,
Trond
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling 2018-04-19 0:44 ` Dave Chinner 2018-04-19 1:47 ` Trond Myklebust @ 2018-04-19 17:14 ` Jeff Layton 2018-04-19 23:47 ` Dave Chinner 1 sibling, 1 reply; 14+ messages in thread From: Jeff Layton @ 2018-04-19 17:14 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox On Thu, 2018-04-19 at 10:44 +1000, Dave Chinner wrote: > On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote: > > On Wed, 2018-04-18 at 08:53 +1000, Dave Chinner wrote: > > > On Tue, Apr 17, 2018 at 07:08:01AM -0400, Jeff Layton wrote: > > > > I'd like to have a discussion on how to improve the handling of errors > > > > that occur during writeback. I think there are 4 main issues that would > > > > be helpful to cover: > > > > > > > > 1) In v4.16, we added the errseq mechanism to the kernel to make > > > > writeback error reporting more reliable. That has helped some use cases, > > > > but there are some applications (e.g. PostgreSQL) that were relying on > > > > the ability to see writeback errors that occurred before the file > > > > description existed. Do we need to tweak this mechanism to help those > > > > use cases, or would that do more harm than good? > > > > > > More harm than good, IMO. Too many wacky corner cases... > > > > > > > One idea that willy had was to set f_wb_err in the file to 0 if the > > inode's wb_err has the SEEN bit set. That would allow the Pg use case to Sorry...that should have been "wb_err has the SEEN bit clear". > > continue working, as long as they don't wait so long to open the file > > that the inode gets evicted from the cache. > > > > That latter bit is what worries me. Such behavior is non-deterministic. > > Yup. IMO, unreliable error reporting is a far worse situation than > no error reporting at all. If we can't guarantee delivery of the > error, then we should not be trying to expose errors through that > mechanism. 
> > We could prevent eviction of inodes with pending errors, but we've > already rejected that idea because it's a known OOM vector. > > ..... > > > > > 3) The question of what to do with pages in the pagecache that fail > > > > writeback is still unresolved. Most filesystems just clear the dirty bit > > > > and carry on, but some invalidate them or just clear the uptodate > > > > bit. This sort of inconsistency (and lack of documentation) is > > > > problematic and has led to applications assuming behavior that doesn't > > > > exist. I believe we need to declare an "ideal behavior" for Linux > > > > filesystems in this regard, add VM/FS helpers to make it easier for > > > > filesystems to conform to that behavior, and document it well. The big > > > > question is: what sort of behavior makes the most sense here? > > > > > > User configurable on a per-filesystem basis, just like metadata > > > error handling. The default should be what is best for system > > > stability when users do things like pull USB sticks with GB of dirty > > > data still in memory. > > > > > > > Maybe. I think though we need to consider offering a bit more of a > > guarantee here. > > > > As you pointed out recently, xfs will clear the uptodate bit on a > > writeback failure, so you'll end up having to re-read the page from disk > > later if someone issues a read against it. > > > > But...that means that we offer no guarantee of the posix requirement to > > see the result of a write in a subsequent read. If writeback occurs > > between the two, that write could vanish. > > > > So, if you're relying on being able to see the result of a write in your > > read, then you really _must_ issue fsync prior to any read and check the > > error code. That sort of sucks, performance-wise. > > > > It might be nice if we could ensure that the data sticks around for a > > short period of time (a few seconds at least) so that you wouldn't have > > to issue fsync so frequently to get such a guarantee. 
> > Well, that's the point of the configurable error behaviour. Look at > the XFS metadata config for a moment: > > $ ls /sys/fs/xfs/sdc/error/metadata/ > EIO ENODEV ENOSPC default > $ ls /sys/fs/xfs/sdc/error/metadata/EIO/ > max_retries retry_timeout_seconds > $ cat /sys/fs/xfs/sdc/error/metadata/EIO/max_retries > -1 > $ cat /sys/fs/xfs/sdc/error/metadata/EIO/retry_timeout_seconds > -1 > $ cat /sys/fs/xfs/sdc/error/metadata/ENODEV/max_retries > 0 > $ cat /sys/fs/xfs/sdc/error/metadata/ENODEV/retry_timeout_seconds > 0 > > We have different config capability for EIO, ENOSPC and ENODEV and > default behaviour for all other metadata writeback errors. > > The defaults are "retry forever" for EIO (-1 means keep retrying > forever and retry whenever the log pushes on the journal to free up > space) but for ENODEV (i.e. the "USB stick pull") we default to > fail-immediately semantics. > > i.e. by default we treat EIO as transient and ENODEV as permanent. > > If we want to change EIO or ENOSPC to be "try a few times, then give > up", we set max_retries = 3 and retry_timeout_seconds = 300 to retry > writeback 3 times five minutes apart before considering the error as > permanent. Once a permanent error is triggered, we shut down the > filesystem, as permanent errors in metadata writeback result in > filesystem corruption. > > This is trickier to track for per-page data writeback errors because > we don't really have object-based IO control context. However, I > really didn't see these error configs as applying to per-page errors but > to the inode itself. That part is still to be worked out... > That's an interesting approach. Maybe we should expand that sort of interface across filesystems? I worry a bit about knob proliferation though -- when things are that configurable it's easy for users to hang themselves, end up with nonsensical configuration, etc. Still, we may have no choice given the fact that there doesn't seem to be any one-size-fits-all way to do this. 
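The knob semantics above (-1 meaning "no limit", 0 meaning "fail immediately") reduce to a small decision function. The sketch below is a hedged illustration of the policy, not XFS's actual implementation; the struct and function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-error-class writeback retry config, mirroring the sysfs knobs
 * shown above (max_retries / retry_timeout_seconds, -1 = no limit). */
struct wb_err_cfg {
	int max_retries;            /* -1: retry forever */
	int retry_timeout_seconds;  /* -1: no timeout */
};

/* Has this error become permanent under the configured policy?
 * A permanent metadata writeback error leads to filesystem shutdown,
 * since it implies on-disk corruption. */
static bool wb_err_permanent(const struct wb_err_cfg *cfg,
			     int retries, int elapsed_secs)
{
	if (cfg->max_retries >= 0 && retries > cfg->max_retries)
		return true;
	if (cfg->retry_timeout_seconds >= 0 &&
	    elapsed_secs > cfg->retry_timeout_seconds)
		return true;
	return false;
}
```

Under this model the defaults quoted above fall out directly: {-1, -1} for EIO never declares the error permanent, while {0, 0} for ENODEV does so on the first failed retry.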
> > > > 4) syncfs doesn't currently report an error when a single inode fails > > > > writeback, only when syncing out the block device. Should it report > > > > errors in that case as well? > > > > > > Yes. > > > > I have a small patch that implements this that I posted a few days ago. > > I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that > > it's the right approach. > > > > Instead of growing struct file to accommodate a second errseq cursor, > > this only implements that behavior when the file is opened with O_PATH > > (as we know that that will have the fsync op set to NULL). Maybe we can > > do this better somehow though. > > No idea whether this is possible, or even a good idea, but could we > just have syncfs() create a temporary struct file for the duration > of the syncfs call, so it works on any fd passed in? (sorta like an > internal dupfd() call?) > No, we need something that will persist the errseq_t cursor between syncfs calls. If we did what you're suggesting, once your temporary file goes away, you'd lose your place in the error stream and you'd end up reporting the same errors more than once on subsequent calls to syncfs. We'll need some way to store that in the struct file; the question is whether we are willing to grow the struct to accommodate it. -- Jeff Layton <jlayton@kernel.org>
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling 2018-04-19 17:14 ` Jeff Layton @ 2018-04-19 23:47 ` Dave Chinner 2018-04-20 11:24 ` Jeff Layton 2018-04-21 17:21 ` Jan Kara 0 siblings, 2 replies; 14+ messages in thread From: Dave Chinner @ 2018-04-19 23:47 UTC (permalink / raw) To: Jeff Layton; +Cc: linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox On Thu, Apr 19, 2018 at 01:14:04PM -0400, Jeff Layton wrote: > On Thu, 2018-04-19 at 10:44 +1000, Dave Chinner wrote: > > On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote: > > > > > 4) syncfs doesn't currently report an error when a single inode fails > > > > > writeback, only when syncing out the block device. Should it report > > > > > errors in that case as well? > > > > > > > > Yes. > > > > > > I have a small patch that implements this that I posted a few days ago. > > > I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that > > > it's the right approach. > > > > > > Instead of growing struct file to accommodate a second errseq cursor, > > > this only implements that behavior when the file is opened with O_PATH > > > (as we know that that will have the fsync op set to NULL). Maybe we can > > > do this better somehow though. > > > > No idea whether this is possible, or even a good idea, but could we > > just have syncfs() create a temporary struct file for the duration > > of the syncfs call, so it works on any fd passed in? (sorta like an > > internal dupfd() call?) > > > > No, we need something that will persist the errseq_t cursor between > syncfs calls. > > If we did what you're suggesting, once your temporary file goes away, > you'd lose your place in the error stream and you'd end up reporting > the same errors more than once on subsequent calls to syncfs. Which, IMO, is correct behaviour. 
If there is a persistent data writeback error, then syncfs() should report it *every time it is called* because that syncfs() call did not result in the filesystem being fully committed to stable storage. I don't care whether the error has been reported before - the context of syncfs() is "commit the entire filesystem to stable storage". If any IO failed then we have not achieved what the application asked us to do and so we should be returning an error on every call that sees a data writeback error. Cheers, Dave. -- Dave Chinner david@fromorbit.com
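In errseq terms, report-every-time semantics amount to checking the error stream without ever advancing the caller's cursor. A toy sketch of what that syncfs() would do (illustrative only, not the kernel's code; the types here are deliberately minimal):

```c
#include <assert.h>
#include <errno.h>

/* Toy model of an inode's error stream: zero means no recorded
 * writeback error, nonzero holds the positive errno. */
typedef unsigned int errseq_t;

/* Writeback failure: record the error in the stream. */
static void errseq_set(errseq_t *eseq, int err)
{
	*eseq = (errseq_t)err;
}

/* "Report every time": compare against the caller's snapshot but never
 * advance it, so a persistent error fails every syncfs() call. */
static int errseq_check(const errseq_t *eseq, errseq_t since)
{
	if (*eseq != since)
		return -(int)*eseq;
	return 0;
}
```

The cost of these semantics is the flip side Jeff raises below: without advancing a cursor there is no way to distinguish "this error was already reported and the data was since rewritten" from "the error is still outstanding".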
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling 2018-04-19 23:47 ` Dave Chinner @ 2018-04-20 11:24 ` Jeff Layton 2018-04-21 17:21 ` Jan Kara 1 sibling, 0 replies; 14+ messages in thread From: Jeff Layton @ 2018-04-20 11:24 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox On Fri, 2018-04-20 at 09:47 +1000, Dave Chinner wrote: > On Thu, Apr 19, 2018 at 01:14:04PM -0400, Jeff Layton wrote: > > On Thu, 2018-04-19 at 10:44 +1000, Dave Chinner wrote: > > > On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote: > > > > > > 4) syncfs doesn't currently report an error when a single inode fails > > > > > > writeback, only when syncing out the block device. Should it report > > > > > > errors in that case as well? > > > > > > > > > > Yes. > > > > > > > > I have a small patch that implements this that I posted a few days ago. > > > > I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that > > > > it's the right approach. > > > > > > > > Instead of growing struct file to accommodate a second errseq cursor, > > > > this only implements that behavior when the file is opened with O_PATH > > > > (as we know that that will have the fsync op set to NULL). Maybe we can > > > > do this better somehow though. > > > > > > No idea whether this is possible, or even a good idea, but could we > > > just have syncfs() create a temporary struct file for the duration > > > of the syncfs call, so it works on any fd passed in? (sorta like an > > > internal dupfd() call?) > > > > > > > No, we need something that will persist the errseq_t cursor between > > syncfs calls. > > > > If we did what you're suggesting, once your temporary file goes away, > > you'd lose your place in the error stream and you'd end up reporting > > the same errors more than once on subsequent calls to syncfs. > > Which, IMO, is correct behaviour. 
If there is a persistent data > writeback error, then syncfs() should report it *every time it is > called* because that syncfs() call did not result in the filesystem > being fully committed to stable storage. > > I don't care whether the error has been reported before - the > context of syncfs() is "commit the entire filesystem to stable > storage". If any IO failed then we have not achieved what the > application asked us to do and so we should be returning an error on > every call that sees a data writeback error. > The problem is that you would get back errors even if the problem went away: Suppose you write some data, and that fails writeback. You call syncfs and that returns an error. So, you write the data again, and this time it succeeds, and you call syncfs again on the same fd on which you originally called it. It will still return an error because we have no way to persist the fact that you already saw the original error on the first syncfs call. That's wrong behavior, IMO. The errseq_t mechanism requires that you keep a record of the last error that you saw, so that we know whether to report it again or not. -- Jeff Layton <jlayton@kernel.org>
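Jeff's scenario maps onto cursor-advancing semantics. A simplified model (again illustrative; lib/errseq.c differs in detail, and the counter layout below is invented) shows why advancing the cursor reports each distinct error exactly once while still catching genuinely new failures:

```c
#include <assert.h>
#include <errno.h>

/* Simplified errseq model: low bits hold a positive errno, higher bits
 * are a counter bumped per new error, so a cursor can tell an old,
 * already-reported error from a fresh one. */
typedef unsigned int errseq_t;

#define ERRSEQ_ERRMASK 0x0fffu
#define ERRSEQ_CTR     0x1000u  /* lowest counter increment */

/* Writeback failure: record err and bump the counter. */
static void errseq_set(errseq_t *eseq, int err)
{
	errseq_t ctr = (*eseq & ~ERRSEQ_ERRMASK) + ERRSEQ_CTR;

	*eseq = ctr | ((errseq_t)err & ERRSEQ_ERRMASK);
}

/* syncfs()/fsync(): report an error only if it is newer than what this
 * cursor already reported, then advance the cursor past it. */
static int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
{
	if (*since == *eseq)
		return 0;
	*since = *eseq;
	return -(int)(*eseq & ERRSEQ_ERRMASK);
}
```

In the write-fails / rewrite-succeeds sequence described above, the first syncfs reports -EIO and advances the cursor; the second returns 0 because no new error was recorded after the successful rewrite, which is exactly the behavior Jeff argues for.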
* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling 2018-04-19 23:47 ` Dave Chinner 2018-04-20 11:24 ` Jeff Layton @ 2018-04-21 17:21 ` Jan Kara 1 sibling, 0 replies; 14+ messages in thread From: Jan Kara @ 2018-04-21 17:21 UTC (permalink / raw) To: Dave Chinner Cc: Jeff Layton, linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox On Fri 20-04-18 09:47:45, Dave Chinner wrote: > On Thu, Apr 19, 2018 at 01:14:04PM -0400, Jeff Layton wrote: > > On Thu, 2018-04-19 at 10:44 +1000, Dave Chinner wrote: > > > On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote: > > > > > > 4) syncfs doesn't currently report an error when a single inode fails > > > > > > writeback, only when syncing out the block device. Should it report > > > > > > errors in that case as well? > > > > > > > > > > Yes. > > > > > > > > I have a small patch that implements this that I posted a few days ago. > > > > I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that > > > > it's the right approach. > > > > > > > > Instead of growing struct file to accommodate a second errseq cursor, > > > > this only implements that behavior when the file is opened with O_PATH > > > > (as we know that that will have the fsync op set to NULL). Maybe we can > > > > do this better somehow though. > > > > > > No idea whether this is possible, or even a good idea, but could we > > > just have syncfs() create a temporary struct file for the duration > > > of the syncfs call, so it works on any fd passed in? (sorta like an > > > internal dupfd() call?) > > > > > > > No, we need something that will persist the errseq_t cursor between > > syncfs calls. > > > > If we did what you're suggesting, once your temporary file goes away, > > you'd lose your place in the error stream and you'd end up reporting > > the same errors more than once on subsequent calls to syncfs. > > Which, IMO, is correct behaviour. 
If there is a persistent data > writeback error, then syncfs() should report it *every time it is > called* because that syncfs() call did not result in the filesystem > being fully committed to stable storage. > > I don't care whether the error has been reported before - the > context of syncfs() is "commit the entire filesystem to stable > storage". If any IO failed then we have not achieved what the > application asked us to do and so we should be returning an error on > every call that sees a data writeback error. I believe syncfs() behavior should be consistent with fsync() behavior. Otherwise it would be rather confusing for userspace. So what you suggest would mean that fsync() would have to keep reporting an error once it reported one, since we've dropped the dirty bits from pages and thus cannot ever make user data durable anymore. Or we could do what Jeff proposes - i.e., make syncfs() behave like fsync() currently by reporting each error for each fd only once. And I find both behaviors justifiable. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
end of thread. Thread overview: 14+ messages -- 2018-04-17 11:08 [LSF/MM TOPIC] improving writeback error handling Jeff Layton 2018-04-17 22:53 ` Dave Chinner 2018-04-18 16:00 ` [Lsf-pc] " Jeff Layton 2018-04-19 0:44 ` Dave Chinner 2018-04-19 1:47 ` Trond Myklebust 2018-04-19 1:57 ` Matthew Wilcox 2018-04-19 2:12 ` Trond Myklebust 2018-04-19 18:57 ` andres 2018-04-19 2:15 ` andres 2018-04-19 2:19 ` Trond Myklebust 2018-04-19 17:14 ` Jeff Layton 2018-04-19 23:47 ` Dave Chinner 2018-04-20 11:24 ` Jeff Layton 2018-04-21 17:21 ` Jan Kara