* [LSF/MM TOPIC] improving writeback error handling
@ 2018-04-17 11:08 Jeff Layton
  2018-04-17 22:53 ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff Layton @ 2018-04-17 11:08 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-fsdevel, Matthew Wilcox, Andres Freund

I'd like to have a discussion on how to improve the handling of errors
that occur during writeback. I think there are 4 main issues that would
be helpful to cover:

1) In v4.16, we added the errseq mechanism to the kernel to make
writeback error reporting more reliable. That has helped some use cases,
but there are some applications (e.g. PostgreSQL) that were relying on
the ability to see writeback errors that occurred before the file
description existed. Do we need to tweak this mechanism to help those
use cases, or would that do more harm than good?

2) Most filesystems report data wb errors on all fds that were open at
the time of the error now, but metadata writeback can also fail, and
those don't get reported in the same way so far. Should we extend those
semantics to metadata writeback? How do we get there if so?

3) The question of what to do with pages in the pagecache that fail
writeback is still unresolved. Most filesystems just clear the dirty bit
and carry on, but some invalidate them or just clear the uptodate
bit. This sort of inconsistency (and lack of documentation) is
problematic and has led to applications assuming behavior that doesn't
exist. I believe we need to declare an "ideal behavior" for Linux
filesystems in this regard, add VM/FS helpers to make it easier for
filesystems to conform to that behavior, and document it well. The big
question is: what sort of behavior makes the most sense here?

4) syncfs doesn't currently report an error when a single inode fails
writeback, only when syncing out the block device. Should it report
errors in that case as well?

Thanks!
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [LSF/MM TOPIC] improving writeback error handling
  2018-04-17 11:08 [LSF/MM TOPIC] improving writeback error handling Jeff Layton
@ 2018-04-17 22:53 ` Dave Chinner
  2018-04-18 16:00   ` [Lsf-pc] " Jeff Layton
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2018-04-17 22:53 UTC (permalink / raw)
  To: Jeff Layton; +Cc: lsf-pc, linux-fsdevel, Matthew Wilcox, Andres Freund

On Tue, Apr 17, 2018 at 07:08:01AM -0400, Jeff Layton wrote:
> I'd like to have a discussion on how to improve the handling of errors
> that occur during writeback. I think there are 4 main issues that would
> be helpful to cover:
> 
> 1) In v4.16, we added the errseq mechanism to the kernel to make
> writeback error reporting more reliable. That has helped some use cases,
> but there are some applications (e.g. PostgreSQL) that were relying on
> the ability to see writeback errors that occurred before the file
> description existed. Do we need to tweak this mechanism to help those
> use cases, or would that do more harm than good?

More harm than good, IMO. Too many wacky corner cases...

> 2) Most filesystems report data wb errors on all fds that were open at
> the time of the error now, but metadata writeback can also fail, and
> those don't get reported in the same way so far. Should we extend those
> semantics to metadata writeback? How do we get there if so?

No. Metadata writeback errors are intimately related to internal
filesystem consistency - if there's a problem with metadata
writeback, it's up to the filesystem to determine if fsync needs to
report it or not.

And, FWIW, journalling filesystems will follow their global "fatal
error" shutdown configuration if there's a fatal metadata writeback
error. Such an error is essentially filesystem corruption, so once
this error condition is set, any subsequent attempt to fsync or
modify the filesystem should return errors (i.e. EIO or
EFSCORRUPTED).

> 3) The question of what to do with pages in the pagecache that fail
> writeback is still unresolved. Most filesystems just clear the dirty bit
> and and carry on, but some invalidate them or just clear the uptodate
> bit. This sort of inconsistency (and lack of documentation) is
> problematic and has led to applications assuming behavior that doesn't
> exist. I believe we need to declare an "ideal behavior" for Linux
> filesystems in this regard, add VM/FS helpers to make it easier for
> filesystems to conform to that behavior, and document it well. The big
> question is : what sort of behavior makes the most sense here?

User configurable on a per-filesystem basis, just like metadata
error handling. The default should be what is best for system
stability when users do things like pull USB sticks with GB of dirty
data still in memory.

> 4) syncfs doesn't currently report an error when a single inode fails
> writeback, only when syncing out the block device. Should it report
> errors in that case as well?

Yes.

And there's more than that. The filesystem ->sync_fs() method also
needs to return an error, because that's where most filesystems
persist their metadata. And because sync_fs has historically thrown
away the returned error, most filesystems don't bother checking or
returning errors.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-17 22:53 ` Dave Chinner
@ 2018-04-18 16:00   ` Jeff Layton
  2018-04-19  0:44     ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff Layton @ 2018-04-18 16:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox

On Wed, 2018-04-18 at 08:53 +1000, Dave Chinner wrote:
> On Tue, Apr 17, 2018 at 07:08:01AM -0400, Jeff Layton wrote:
> > I'd like to have a discussion on how to improve the handling of errors
> > that occur during writeback. I think there are 4 main issues that would
> > be helpful to cover:
> > 
> > 1) In v4.16, we added the errseq mechanism to the kernel to make
> > writeback error reporting more reliable. That has helped some use cases,
> > but there are some applications (e.g. PostgreSQL) that were relying on
> > the ability to see writeback errors that occurred before the file
> > description existed. Do we need to tweak this mechanism to help those
> > use cases, or would that do more harm than good?
> 
> More harm than good, IMO. Too many wacky corner cases...
> 

One idea that willy had was to set f_wb_err in the file to 0 if the
inode's wb_err has the SEEN bit set. That would allow the Pg use case to
continue working, as long as they don't wait so long to open the file
that the inode gets evicted from the cache.

That latter bit is what worries me. Such behavior is non-deterministic.
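
To make the mechanism concrete, here's a rough user-space model of the
errseq_t idea: a counter plus the latest error, sampled per file at open
and checked at fsync. It's a simplification with made-up names, not the
kernel code (the real errseq_t also packs in the SEEN flag that willy's
tweak keys off), but it shows why an error recorded before a file is
opened is invisible to that fd:

#include <stdio.h>

/* Toy model of errseq-style error tracking; not the kernel implementation. */
struct toy_wb_err {
        unsigned int counter;   /* bumped whenever a new error is recorded */
        int error;              /* most recent writeback error (0 == none) */
};

struct toy_file {
        struct toy_wb_err *inode_err;   /* shared per-inode state */
        unsigned int sampled;           /* cursor sampled at open time */
};

static void record_wb_error(struct toy_wb_err *e, int err)
{
        e->error = err;
        e->counter++;
}

static void toy_open(struct toy_file *f, struct toy_wb_err *e)
{
        f->inode_err = e;
        f->sampled = e->counter;  /* errors before this point are invisible */
}

static int toy_fsync(struct toy_file *f)
{
        if (f->inode_err->counter != f->sampled) {
                f->sampled = f->inode_err->counter;  /* advance: report once per fd */
                return f->inode_err->error;
        }
        return 0;
}

int main(void)
{
        struct toy_wb_err inode = { 0, 0 };
        struct toy_file fd;

        record_wb_error(&inode, 5);               /* EIO before anyone opens the file */
        toy_open(&fd, &inode);
        printf("fsync -> %d\n", toy_fsync(&fd));  /* 0: the pre-open error is missed */
        record_wb_error(&inode, 5);               /* error while the fd is open */
        printf("fsync -> %d\n", toy_fsync(&fd));  /* 5: reported once */
        printf("fsync -> %d\n", toy_fsync(&fd));  /* 0: already consumed by this fd */
        return 0;
}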

> > 2) Most filesystems report data wb errors on all fds that were open at
> > the time of the error now, but metadata writeback can also fail, and
> > those don't get reported in the same way so far. Should we extend those
> > semantics to metadata writeback? How do we get there if so?
> 
> No. Metadata writeback errors are intimately related to internal
> filesystem consistency - if there's a problem with metadata
> writeback, it's up to the filesystem to determine if fsync needs to
> report it or not.
> 
> And, FWIW, journalling filesystems will follow their global "fatal
> error" shutdown configuration if there's  fatal metadata writeback
> error. Such an error is essentially filesystem corruption, so once
> this error condition is set, any subsequent attempt to fsync or
> modify the filesystem should return errors (i.e. EIO or
> EFSCORRUPTED).
> 

Ok, fair enough.

A few months ago, I played around with some patches to do this for ext4.
It meant growing struct file though, which was sort of nasty. I also had
no idea how to test it properly, so I punted on the idea.

We could revisit it, but most folks are using errors=remount-ro on ext4,
and that makes it sort of a moot point.

> > 3) The question of what to do with pages in the pagecache that fail
> > writeback is still unresolved. Most filesystems just clear the dirty bit
> > and and carry on, but some invalidate them or just clear the uptodate
> > bit. This sort of inconsistency (and lack of documentation) is
> > problematic and has led to applications assuming behavior that doesn't
> > exist. I believe we need to declare an "ideal behavior" for Linux
> > filesystems in this regard, add VM/FS helpers to make it easier for
> > filesystems to conform to that behavior, and document it well. The big
> > question is : what sort of behavior makes the most sense here?
> 
> User configurable on a per-filesystem basis, just like metadata
> error handling. The default should be what is best for system
> stability when users do things like pull USB sticks with GB of dirty
> data still in memory.
> 

Maybe. I think, though, that we need to consider offering a bit more of
a guarantee here.

As you pointed out recently, xfs will clear the uptodate bit on a
writeback failure, so you'll end up having to re-read the page from disk
later if someone issues a read against it.

But...that means that we offer no guarantee of the posix requirement to
see the result of a write in a subsequent read. If writeback occurs
between the two, that write could vanish.

So, if you're relying on being able to see the result of a write in your
read, then you really _must_ issue fsync prior to any read and check the
error code. That sort of sucks, performance-wise.
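
In other words, something like the following (an illustrative user-space
sketch; the function name is made up):

#include <unistd.h>

/* Read back data we just wrote, without trusting the page cache to still
 * hold it if writeback failed in between.  Sketch only. */
static ssize_t write_then_read(int fd, const char *out, char *in,
                               size_t len, off_t off)
{
        if (pwrite(fd, out, len, off) != (ssize_t)len)
                return -1;              /* short write or error */
        if (fsync(fd) != 0)
                return -1;              /* writeback failed; the write may be lost */
        return pread(fd, in, len, off); /* only now is the read-back trustworthy */
}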

It might be nice if we could ensure that the data sticks around for a
short period of time (a few seconds at least) so that you wouldn't have
to issue fsync so frequently to get such a guarantee.

That may be too difficult to do though. idk.

In any case, we should document this much better than we do today, so
that userland devs have proper expectations of the behavior here.

> > 4) syncfs doesn't currently report an error when a single inode fails
> > writeback, only when syncing out the block device. Should it report
> > errors in that case as well?
> 
> Yes.
> 

I posted a small patch a few days ago that implements this. I meant to
mark it as an RFC, but didn't, FWIW. I'm not convinced that it's the
right approach.

Instead of growing struct file to accommodate a second errseq cursor,
this only implements that behavior when the file is opened with O_PATH
(as we know that that will have the fsync op set to NULL). Maybe we can
do this better somehow though.
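
With that patch applied, the intended usage from something like the Pg
checkpointer might look roughly like this (a sketch based on my reading
of the patch; the helper name is made up, and calling syncfs() on an
O_PATH descriptor with per-fd error tracking is the proposed behavior,
not something current kernels promise):

#define _GNU_SOURCE             /* for syncfs() and O_PATH */
#include <fcntl.h>
#include <unistd.h>

/* Keep one long-lived O_PATH fd per filesystem; under the proposal it
 * carries the errseq cursor across syncfs() calls. */
static int checkpoint_fs(const char *mountpoint)
{
        static int fs_fd = -1;

        if (fs_fd < 0)
                fs_fd = open(mountpoint, O_PATH);
        if (fs_fd < 0)
                return -1;
        return syncfs(fs_fd);   /* <0 if writeback failed since the last call */
}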

> And there's more than that. the filesystem ->sync_fs() method also
> needs to return an error because that's where most fileystems
> persist their metadata. And because sync_fs has historically thrown
> the error returned away, most filesystems don't bother checking or
> returning errors.
> 

That sounds like a good goal too. The sync_fs prototype already returns
an error, but it looks like all of the callers ignore it. Making the
callers pass that up would be a good first step, and then we could fix
up the sync_fs operations one-by-one.
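
To make that concrete, the plumbing change is roughly the following (an
illustrative kernel-context sketch, not the actual fs/sync.c code):

/* Sketch only: stop throwing away what ->sync_fs() returns. */
static int sketch_sync_filesystem(struct super_block *sb, int wait)
{
        int ret = 0;

        if (sb->s_op->sync_fs)
                ret = sb->s_op->sync_fs(sb, wait);  /* historically dropped */
        return ret;     /* propagate up to the sync(2)/syncfs(2) paths */
}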
 
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-18 16:00   ` [Lsf-pc] " Jeff Layton
@ 2018-04-19  0:44     ` Dave Chinner
  2018-04-19  1:47       ` Trond Myklebust
  2018-04-19 17:14       ` Jeff Layton
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2018-04-19  0:44 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox

On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote:
> On Wed, 2018-04-18 at 08:53 +1000, Dave Chinner wrote:
> > On Tue, Apr 17, 2018 at 07:08:01AM -0400, Jeff Layton wrote:
> > > I'd like to have a discussion on how to improve the handling of errors
> > > that occur during writeback. I think there are 4 main issues that would
> > > be helpful to cover:
> > > 
> > > 1) In v4.16, we added the errseq mechanism to the kernel to make
> > > writeback error reporting more reliable. That has helped some use cases,
> > > but there are some applications (e.g. PostgreSQL) that were relying on
> > > the ability to see writeback errors that occurred before the file
> > > description existed. Do we need to tweak this mechanism to help those
> > > use cases, or would that do more harm than good?
> > 
> > More harm than good, IMO. Too many wacky corner cases...
> > 
> 
> One idea that willy had was to set f_wb_err in the file to 0 if the
> inode's wb_err has the SEEN bit set. That would allow the Pg use case to
> continue working, as long as they don't wait so long to open the file
> that the inode gets evicted from the cache.
> 
> That latter bit is what worries me. Such behavior is non-deterministic.

Yup. IMO, unreliable error reporting is a far worse situation than
no error reporting at all. If we can't guarantee delivery of the
error, then we should not be trying to expose errors through that
mechanism.

We could prevent eviction of inodes with pending errors, but we've
already rejected that idea because it's a known OOM vector.

.....

> > > 3) The question of what to do with pages in the pagecache that fail
> > > writeback is still unresolved. Most filesystems just clear the dirty bit
> > > and and carry on, but some invalidate them or just clear the uptodate
> > > bit. This sort of inconsistency (and lack of documentation) is
> > > problematic and has led to applications assuming behavior that doesn't
> > > exist. I believe we need to declare an "ideal behavior" for Linux
> > > filesystems in this regard, add VM/FS helpers to make it easier for
> > > filesystems to conform to that behavior, and document it well. The big
> > > question is : what sort of behavior makes the most sense here?
> > 
> > User configurable on a per-filesystem basis, just like metadata
> > error handling. The default should be what is best for system
> > stability when users do things like pull USB sticks with GB of dirty
> > data still in memory.
> > 
> 
> Maybe. I think though we need to consider offering a bit more of a
> guarantee here.
> 
> As you pointed out recently, xfs will clear the uptodate bit on a
> writeback failure, so you'll end up having to re-read the page from disk
> later if someone issues a read against it.
> 
> But...that means that we offer no guarantee of the posix requirement to
> see the result of a write in a subsequent read. If writeback occurs
> between the two, that write could vanish.
> 
> So, if you're relying on being able to see the result of a write in your
> read, then you really _must_ issue fsync prior to any read and check the
> error code. That sort of sucks, performance-wise.
> 
> It might be nice if we could ensure that the data sticks around for a
> short period of time (a few seconds at least) so that you wouldn't have
> to issue fsync so frequently to get such a guarantee.

Well, that's the point of the configurable error behaviour. Look at
the XFS metadata config for a moment:

$ ls /sys/fs/xfs/sdc/error/metadata/    
EIO  ENODEV  ENOSPC  default
$ ls /sys/fs/xfs/sdc/error/metadata/EIO/
max_retries  retry_timeout_seconds
$ cat /sys/fs/xfs/sdc/error/metadata/EIO/max_retries 
-1
$ cat /sys/fs/xfs/sdc/error/metadata/EIO/retry_timeout_seconds 
-1
$ cat /sys/fs/xfs/sdc/error/metadata/ENODEV/max_retries 
0
$ cat /sys/fs/xfs/sdc/error/metadata/ENODEV/retry_timeout_seconds 
0

We have different config capability for EIO, ENOSPC and ENODEV and
default behaviour for all other metadata writeback errors.

The defaults are "retry forever" for EIO (-1 means keep retrying
forever and retry whenever the log pushes on the journal to free up
space) but for ENODEV (i.e. the "USB stick pull") we default to
fail-immediately semantics.

i.e. by default we treat EIO as transient and ENODEV as permanent.

If we want to change EIO or ENOSPC to be "try a few times, then give
up", we set max_retries = 3 and retry_timeout_seconds = 300 to retry
writeback 3 times five minutes apart before considering the error as
permanent.  Once a permanent error is triggered, we shut down the
filesystem, as permanent errors in metadata writeback result in
filesystem corruption.

This is trickier to track for per-page data writeback errors because
we don't really have object-based IO control context. However, I
didn't really envision these error configs applying to per-page errors,
but to the inode itself. That part is still to be worked out...

> > > 4) syncfs doesn't currently report an error when a single inode fails
> > > writeback, only when syncing out the block device. Should it report
> > > errors in that case as well?
> > 
> > Yes.
> 
> I have a small patch that implements this that I posted a few days ago.
> I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that
> it's the right approach.
> 
> Instead of growing struct file to accommodate a second errseq cursor,
> this only implements that behavior when the file is opened with O_PATH
> (as we know that that will have the fsync op set to NULL). Maybe we can
> do this better somehow though.

No idea whether this is possible, or even a good idea, but could we
just have syncfs() create a temporary struct file for the duration
of the syncfs call, so it works on any fd passed in? (sorta like an
internal dupfd() call?)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-19  0:44     ` Dave Chinner
@ 2018-04-19  1:47       ` Trond Myklebust
  2018-04-19  1:57         ` Matthew Wilcox
  2018-04-19 17:14       ` Jeff Layton
  1 sibling, 1 reply; 14+ messages in thread
From: Trond Myklebust @ 2018-04-19  1:47 UTC (permalink / raw)
  To: david, jlayton; +Cc: andres, lsf-pc, linux-fsdevel, willy

On Thu, 2018-04-19 at 10:44 +1000, Dave Chinner wrote:
> On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote:
> > Instead of growing struct file to accommodate a second errseq
> > cursor,
> > this only implements that behavior when the file is opened with
> > O_PATH
> > (as we know that that will have the fsync op set to NULL). Maybe we
> > can
> > do this better somehow though.
> 
> No idea whether this is possible, or even a good idea, but could we
> just have syncfs() create a temporary struct file for the duration
> of the syncfs call, so it works on any fd passed in? (sorta like an
> internal dupfd() call?)
> 

If the main use case is something like PostgreSQL, where you care about
just one or two critical files rather than monitoring the entire
filesystem, could we perhaps use a dedicated mmap() mode? It should be
possible to throw up a bitmap that displays the exact blocks or pages
that are affected, once the file has been damaged.

For buffered files, you might even consider allowing an appropriately
privileged process to discard the dirty pages (perhaps by clearing the
bits in the bitmap?) once they have been identified and possibly copied
back out of the page cache.

Cheers
  Trond

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-19  1:47       ` Trond Myklebust
@ 2018-04-19  1:57         ` Matthew Wilcox
  2018-04-19  2:12           ` Trond Myklebust
  2018-04-19  2:15           ` andres
  0 siblings, 2 replies; 14+ messages in thread
From: Matthew Wilcox @ 2018-04-19  1:57 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: david, jlayton, andres, lsf-pc, linux-fsdevel

On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> If the main use case is something like Postgresql, where you care about
> just one or two critical files, rather than monitoring the entire
> filesystem could we perhaps use a dedicated mmap() mode? It should be
> possible to throw up a bitmap that displays the exact blocks or pages
> that are affected, once the file has been damaged.

Perhaps we need to have a quick summary of the postgres problem ...
they're not concerned with "one or two files", otherwise they could
just keep those files open and the wb_err mechanism would work fine.
The problem is that they have too many files to keep open in their
checkpointer process, and when they come along and open the files,
they don't see the error.

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-19  1:57         ` Matthew Wilcox
@ 2018-04-19  2:12           ` Trond Myklebust
  2018-04-19 18:57             ` andres
  2018-04-19  2:15           ` andres
  1 sibling, 1 reply; 14+ messages in thread
From: Trond Myklebust @ 2018-04-19  2:12 UTC (permalink / raw)
  To: willy; +Cc: lsf-pc, david, andres, jlayton, linux-fsdevel

On Wed, 2018-04-18 at 18:57 -0700, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> > If the main use case is something like Postgresql, where you care
> > about
> > just one or two critical files, rather than monitoring the entire
> > filesystem could we perhaps use a dedicated mmap() mode? It should
> > be
> > possible to throw up a bitmap that displays the exact blocks or
> > pages
> > that are affected, once the file has been damaged.
> 
> Perhaps we need to have a quick summary of the postgres problem ...
> they're not concerned with "one or two files", otherwise they could
> just keep those files open and the wb_err mechanism would work fine.
> The problem is that they have too many files to keep open in their
> checkpointer process, and when they come along and open the files,
> they don't see the error..

I thought I understood that there were at least two issues here:

1) Monitoring lots of files to figure out which ones may have an error.
2) Drilling down to see what might be wrong with an individual file.

Unless you are in a situation where you can have millions of files all
go wrong at the same time, it would seem that the former is the
operation that needs to scale. Once you're talking about large numbers
of files all getting errors, it would appear that an fsck-like recovery
would be necessary. Am I wrong?

Cheers
  Trond

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-19  1:57         ` Matthew Wilcox
  2018-04-19  2:12           ` Trond Myklebust
@ 2018-04-19  2:15           ` andres
  2018-04-19  2:19             ` Trond Myklebust
  1 sibling, 1 reply; 14+ messages in thread
From: andres @ 2018-04-19  2:15 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Trond Myklebust, david, jlayton, lsf-pc, linux-fsdevel

On 2018-04-18 18:57:23 -0700, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> > If the main use case is something like Postgresql, where you care about
> > just one or two critical files, rather than monitoring the entire
> > filesystem could we perhaps use a dedicated mmap() mode? It should be
> > possible to throw up a bitmap that displays the exact blocks or pages
> > that are affected, once the file has been damaged.
> 
> Perhaps we need to have a quick summary of the postgres problem ...
> they're not concerned with "one or two files", otherwise they could
> just keep those files open and the wb_err mechanism would work fine.
> The problem is that they have too many files to keep open in their
> checkpointer process, and when they come along and open the files,
> they don't see the error..

Correct. Do you want a summary here or at LSF/MM? Thanks to Jon Corbet
I've got a last minute invitation.

Greetings,

Andres Freund

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-19  2:15           ` andres
@ 2018-04-19  2:19             ` Trond Myklebust
  0 siblings, 0 replies; 14+ messages in thread
From: Trond Myklebust @ 2018-04-19  2:19 UTC (permalink / raw)
  To: andres, willy; +Cc: david, lsf-pc, jlayton, linux-fsdevel

On Wed, 2018-04-18 at 19:15 -0700, andres@anarazel.de wrote:
> On 2018-04-18 18:57:23 -0700, Matthew Wilcox wrote:
> > On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> > > If the main use case is something like Postgresql, where you care
> > > about
> > > just one or two critical files, rather than monitoring the entire
> > > filesystem could we perhaps use a dedicated mmap() mode? It
> > > should be
> > > possible to throw up a bitmap that displays the exact blocks or
> > > pages
> > > that are affected, once the file has been damaged.
> > 
> > Perhaps we need to have a quick summary of the postgres problem ...
> > they're not concerned with "one or two files", otherwise they could
> > just keep those files open and the wb_err mechanism would work
> > fine.
> > The problem is that they have too many files to keep open in their
> > checkpointer process, and when they come along and open the files,
> > they don't see the error..
> 
> Correct. Do you want a summary here or at LSF/MM? Thanks to Jon
> Corbet
> I've got a last minute invitation.
> 

I'm not attending LSF/MM this year.

Cheers
  Trond

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-19  0:44     ` Dave Chinner
  2018-04-19  1:47       ` Trond Myklebust
@ 2018-04-19 17:14       ` Jeff Layton
  2018-04-19 23:47         ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff Layton @ 2018-04-19 17:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox

On Thu, 2018-04-19 at 10:44 +1000, Dave Chinner wrote:
> On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote:
> > On Wed, 2018-04-18 at 08:53 +1000, Dave Chinner wrote:
> > > On Tue, Apr 17, 2018 at 07:08:01AM -0400, Jeff Layton wrote:
> > > > I'd like to have a discussion on how to improve the handling of errors
> > > > that occur during writeback. I think there are 4 main issues that would
> > > > be helpful to cover:
> > > > 
> > > > 1) In v4.16, we added the errseq mechanism to the kernel to make
> > > > writeback error reporting more reliable. That has helped some use cases,
> > > > but there are some applications (e.g. PostgreSQL) that were relying on
> > > > the ability to see writeback errors that occurred before the file
> > > > description existed. Do we need to tweak this mechanism to help those
> > > > use cases, or would that do more harm than good?
> > > 
> > > More harm than good, IMO. Too many wacky corner cases...
> > > 
> > 
> > One idea that willy had was to set f_wb_err in the file to 0 if the
> > inode's wb_err has the SEEN bit set. That would allow the Pg use case to


Sorry...that should have been "wb_err has the SEEN bit clear".

> > continue working, as long as they don't wait so long to open the file
> > that the inode gets evicted from the cache.
> > 
> > That latter bit is what worries me. Such behavior is non-deterministic.
> 
> Yup. IMO, unreliable error reporting is a far worse situation than
> no error reporting at all. If we can't guarantee delivery of the
> error, then we should not be trying to expose errors through that
> mechanism.
> 
> We could prevent eviction of inodes with pending errors, but we've
> already rejected that idea because it's a known OOM vector.
> 
> .....
> 
> > > > 3) The question of what to do with pages in the pagecache that fail
> > > > writeback is still unresolved. Most filesystems just clear the dirty bit
> > > > and and carry on, but some invalidate them or just clear the uptodate
> > > > bit. This sort of inconsistency (and lack of documentation) is
> > > > problematic and has led to applications assuming behavior that doesn't
> > > > exist. I believe we need to declare an "ideal behavior" for Linux
> > > > filesystems in this regard, add VM/FS helpers to make it easier for
> > > > filesystems to conform to that behavior, and document it well. The big
> > > > question is : what sort of behavior makes the most sense here?
> > > 
> > > User configurable on a per-filesystem basis, just like metadata
> > > error handling. The default should be what is best for system
> > > stability when users do things like pull USB sticks with GB of dirty
> > > data still in memory.
> > > 
> > 
> > Maybe. I think though we need to consider offering a bit more of a
> > guarantee here.
> > 
> > As you pointed out recently, xfs will clear the uptodate bit on a
> > writeback failure, so you'll end up having to re-read the page from disk
> > later if someone issues a read against it.
> > 
> > But...that means that we offer no guarantee of the posix requirement to
> > see the result of a write in a subsequent read. If writeback occurs
> > between the two, that write could vanish.
> > 
> > So, if you're relying on being able to see the result of a write in your
> > read, then you really _must_ issue fsync prior to any read and check the
> > error code. That sort of sucks, performance-wise.
> > 
> > It might be nice if we could ensure that the data sticks around for a
> > short period of time (a few seconds at least) so that you wouldn't have
> > to issue fsync so frequently to get such a guarantee.
> 
> Well, that's the point of the configurable error behaviour. Look at
> the XFS metadata config dor a moment:
> 
> $ ls /sys/fs/xfs/sdc/error/metadata/    
> EIO  ENODEV  ENOSPC  default
> $ ls /sys/fs/xfs/sdc/error/metadata/EIO/
> max_retries  retry_timeout_seconds
> $ cat /sys/fs/xfs/sdc/error/metadata/EIO/max_retries 
> -1
> $ cat /sys/fs/xfs/sdc/error/metadata/EIO/retry_timeout_seconds 
> -1
> $ cat /sys/fs/xfs/sdc/error/metadata/ENODEV/max_retries 
> 0
> $ cat /sys/fs/xfs/sdc/error/metadata/ENODEV/retry_timeout_seconds 
> 0
> 
> We have different config capability for EIO, ENOSPC and ENODEV and
> default behaviour for all other metadata writeback errors.
> 
> The defaults are "retry forever" for EIO (-1 means keep retrying
> forever and retry whenever the log pushes on the journal to free up
> space) but for ENODEV (i.e. the "USB stick pull") we default to
> fail-immediately semantics.
> 
> i.e. by default we treat EIO as transient and ENODEV as permanent.
> 
> If we want to change EIO or ENOSPC to be "try a few times, then give
> up", we set max_retries = 3 and retry_timeout_seconds = 300 to retry
> writeback 3 times five minutes apart before considering the error as
> permanent.  Once a permanent error is triggered, we shut down the
> filesystem, as permanent errors in metadata writeback result in
> filesystem corruption.
> 
> This is trickier to track for per-page data writeback errors because
> we don't really have object-based IO control context. However, I
> really didn't see these error configs to apply per-page errors but
> to the inode itself. That part is still to be worked out...
> 

That's an interesting approach. Maybe we should expand that sort of
interface across filesystems? I worry a bit about knob proliferation
though -- when things are that configurable it's easy for users to hang
themselves, end up with nonsensical configuration, etc.

Still, we may have no choice given the fact that there doesn't seem to
be any one-size-fits-all way to do this.

> > > > 4) syncfs doesn't currently report an error when a single inode fails
> > > > writeback, only when syncing out the block device. Should it report
> > > > errors in that case as well?
> > > 
> > > Yes.
> > 
> > I have a small patch that implements this that I posted a few days ago.
> > I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that
> > it's the right approach.
> > 
> > Instead of growing struct file to accommodate a second errseq cursor,
> > this only implements that behavior when the file is opened with O_PATH
> > (as we know that that will have the fsync op set to NULL). Maybe we can
> > do this better somehow though.
> 
> No idea whether this is possible, or even a good idea, but could we
> just have syncfs() create a temporary struct file for the duration
> of the syncfs call, so it works on any fd passed in? (sorta like an
> internal dupfd() call?)
> 

No, we need something that will persist the errseq_t cursor between
syncfs calls.

If we did what you're suggesting, once your temporary file goes away,
you'd lose your place in the error stream and you'd end up reporting
the same errors more than once on subsequent calls to syncfs.

We'll need some way to store that in the struct file; the question is
whether we are willing to grow the struct to accommodate it.
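
For concreteness, the "grow the struct" option might look something like
the sketch below (invented field names, not an actual patch): the
superblock gets its own errseq_t stream that inode writeback failures are
also recorded into, and each struct file carries a second cursor into it.

/* Sketch only; field names are invented. */
struct super_block {
        /* ... existing fields ... */
        errseq_t        s_wb_err;       /* filesystem-wide writeback error stream */
};

struct file {
        /* ... existing fields ... */
        errseq_t        f_wb_err;       /* existing: cursor into the mapping's errors */
        errseq_t        f_sb_err;       /* new: cursor into sb->s_wb_err for syncfs() */
};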

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-19  2:12           ` Trond Myklebust
@ 2018-04-19 18:57             ` andres
  0 siblings, 0 replies; 14+ messages in thread
From: andres @ 2018-04-19 18:57 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: willy, lsf-pc, david, jlayton, linux-fsdevel

Hi,

On 2018-04-19 02:12:33 +0000, Trond Myklebust wrote:
> On Wed, 2018-04-18 at 18:57 -0700, Matthew Wilcox wrote:
> > On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> > > If the main use case is something like Postgresql, where you care
> > > about
> > > just one or two critical files, rather than monitoring the entire
> > > filesystem could we perhaps use a dedicated mmap() mode? It should
> > > be
> > > possible to throw up a bitmap that displays the exact blocks or
> > > pages
> > > that are affected, once the file has been damaged.
> >
> > Perhaps we need to have a quick summary of the postgres problem ...
> > they're not concerned with "one or two files", otherwise they could
> > just keep those files open and the wb_err mechanism would work fine.
> > The problem is that they have too many files to keep open in their
> > checkpointer process, and when they come along and open the files,
> > they don't see the error..
>
> I thought I understood that there were at least two issues here:
>
> 1) Monitoring lots of files to figure out which ones may have an error.
> 2) Drilling down to see what might be wrong with an individual file.
>
> Unless you are in a situation where you can have millions of files all
> go wrong at the same time, it would seems that the former is the
> operation that needs to scale. Once you're talking about large numbers
> of files all getting errors, it would appear that an fsck-like recovery
>  would be necessary. Am I wrong?

Well, the correctness issue really only centers around 1). Currently
there are scenarios (some made less, some more likely by the errseq_t
changes) where we don't notice IO errors. The result can either be that
we wrongly report back to the client that a "COMMIT;" was successful,
even though it wasn't persisted, or we might throw away journal data
because we think a checkpoint was successful even though it wasn't.  To
fix the correctness issue we really only need 1).  That said, it'd
obviously be nice to be able to report back a decent error pointing to
individual files, and a more descriptive error message than
"PANIC: An IO error occurred somewhere. Perhaps look in the kernel logs?"
wouldn't hurt either.

To give a short overview of how PostgreSQL issues fsync and does the
surrounding buffer management:

1) There's a traditional journal (WAL), addressed by LSN. Every
   modification needs to first be in the WAL before buffers (and thus
   on-disk data) can be modified.

2) There's a postgres internal buffer cache. Pages are tagged with the
   WAL LSN that need to be flushed to disk before the data can be
   written back.

3) Reads and writes between OS and buffer cache are done using buffered
   IO. There are valid reasons to change that, but it'll require new
   infrastructure.  Each process has a limited-size path -> fd cache.

4) Buffers are written out by:
   - checkpointing process during checkpoints
   - background writer that attempts to keep some "victim" buffers clean
   - backends (client connection associated) when they have to reuse
     dirty victim buffers

   whenever such a writeout happens, information about the file
   containing that dirty buffer is forwarded to the checkpointer. The
   checkpointer keeps track of each file that'll need to be fsynced in a
   hashmap.

   It's worth noting that each table / index / whatnot is a separate
   file, and large relations are segmented into 1GB segments. So it's
   pretty common to have tens of thousands of files in a larger
   database.

5) During checkpointing, which is paced in most cases and often will be
   configured to take 30-60min, all buffers from before the start of the
   checkpoint will be written out.  We'll issue SYNC_FILE_RANGE_WRITE
   requests occasionally to keep the amount of dirty kernel buffers
   under control.   After that we'll fsync each of the dirty files.
   Once that and some other boring internal stuff has succeeded, we'll issue
   a checkpoint record, and allow discarding WAL from before the checkpoint.

Because we cannot realistically keep each of the files open between 4)
and the end of 5), and because the fds used in 4) are not the same as
the ones in 5) (different processes), we currently aren't guaranteed
notification of writeback failures.
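
For reference, the pattern from 4) and 5) boils down to roughly the
following (a heavily simplified sketch, not actual PostgreSQL code, with
error handling elided). The key point is that the write plus
SYNC_FILE_RANGE_WRITE and the eventual fsync() usually happen on
different fds in different processes, which is where errors get lost
today:

#define _GNU_SOURCE                     /* for sync_file_range() */
#include <fcntl.h>
#include <unistd.h>

/* backend/bgwriter side: write the dirty buffer, kick off writeback,
 * and move on. */
static void writer_side(int fd, const char *buf, size_t len, off_t off)
{
        (void)pwrite(fd, buf, len, off);
        (void)sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
        /* ...then queue "this file needs an fsync" for the checkpointer */
}

/* checkpointer side: much later, in another process, on a freshly
 * opened fd. */
static int checkpointer_side(const char *path)
{
        int fd = open(path, O_WRONLY);
        int ret;

        if (fd < 0)
                return -1;
        ret = fsync(fd);  /* may never see an error that hit the writer's fd */
        close(fd);
        return ret;
}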

Realistically we're not going to do much file-specific handling if
there are errors. Either a retry is going to fix the issue (oops, right
now, because the error has been "eaten"), or we're doing a crash-recovery
cycle from WAL (oops, because we don't necessarily know an error
occurred).

It's worth noting that for us syncfs() is better than nothing, but it's
not perfect.  It's pretty common to have temporary files (sort spool
files, temporary tables, ...) on the same filesystem as the persistent
database. syncfs() has the potential to flush out a lot of unnecessary
dirty data.  Note that it'd be very unlikely that the temp data files
would be moved to DIO - it's *good* that the kernel manages the amount
of dirty / cached data. It has a heck of a lot more knowledge about
how much memory pressure the system has than postgres ever will have.

One reason we've been concerned when considering DIO, besides some
architectural issues inside PG, is along similar lines. A lot of people
use databases as part of their stack without focusing on them, which
usually means the database will be largely untuned. With buffered IO
that's not so bad, since the kernel will dynamically adapt to some
extent. With DIO the consequences of a mistuned buffer cache size or
the like are way worse. It's good for critical databases maintained by
dedicated people, not so good outside of that.

Matthew, I'm not sure what kind of summary you had in mind. Please let
me know if you want more detail in any of the areas, happy to expand.

Greetings,

Andres Freund

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-19 17:14       ` Jeff Layton
@ 2018-04-19 23:47         ` Dave Chinner
  2018-04-20 11:24           ` Jeff Layton
  2018-04-21 17:21           ` Jan Kara
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2018-04-19 23:47 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox

On Thu, Apr 19, 2018 at 01:14:04PM -0400, Jeff Layton wrote:
> On Thu, 2018-04-19 at 10:44 +1000, Dave Chinner wrote:
> > On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote:
> > > > > 4) syncfs doesn't currently report an error when a single inode fails
> > > > > writeback, only when syncing out the block device. Should it report
> > > > > errors in that case as well?
> > > > 
> > > > Yes.
> > > 
> > > I have a small patch that implements this that I posted a few days ago.
> > > I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that
> > > it's the right approach.
> > > 
> > > Instead of growing struct file to accommodate a second errseq cursor,
> > > this only implements that behavior when the file is opened with O_PATH
> > > (as we know that that will have the fsync op set to NULL). Maybe we can
> > > do this better somehow though.
> > 
> > No idea whether this is possible, or even a good idea, but could we
> > just have syncfs() create a temporary struct file for the duration
> > of the syncfs call, so it works on any fd passed in? (sorta like an
> > internal dupfd() call?)
> > 
> 
> No, we need something that will persist the errseq_t cursor between
> syncfs calls.
> 
> If we did what you're suggesting, once your temporary file goes away,
> you'd lose your place in the error stream and you'd end up reporting
> the same errors more than once on subsequent calls to syncfs.

Which, IMO, is correct behaviour. If there is a persistent data
writeback error, then syncfs() should report it *every time it is
called* because that syncfs() call did not result in the filesystem
being fully committed to stable storage.

I don't care whether the error has been reported before - the
context of syncfs() is "commit the entire filesystem to stable
storage". If any IO failed then we have not achieved what the
application asked us to do and so we should be returning an error on
every call that sees a data writeback error.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-19 23:47         ` Dave Chinner
@ 2018-04-20 11:24           ` Jeff Layton
  2018-04-21 17:21           ` Jan Kara
  1 sibling, 0 replies; 14+ messages in thread
From: Jeff Layton @ 2018-04-20 11:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox

On Fri, 2018-04-20 at 09:47 +1000, Dave Chinner wrote:
> On Thu, Apr 19, 2018 at 01:14:04PM -0400, Jeff Layton wrote:
> > On Thu, 2018-04-19 at 10:44 +1000, Dave Chinner wrote:
> > > On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote:
> > > > > > 4) syncfs doesn't currently report an error when a single inode fails
> > > > > > writeback, only when syncing out the block device. Should it report
> > > > > > errors in that case as well?
> > > > > 
> > > > > Yes.
> > > > 
> > > > I have a small patch that implements this that I posted a few days ago.
> > > > I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that
> > > > it's the right approach.
> > > > 
> > > > Instead of growing struct file to accommodate a second errseq cursor,
> > > > this only implements that behavior when the file is opened with O_PATH
> > > > (as we know that that will have the fsync op set to NULL). Maybe we can
> > > > do this better somehow though.
> > > 
> > > No idea whether this is possible, or even a good idea, but could we
> > > just have syncfs() create a temporary struct file for the duration
> > > of the syncfs call, so it works on any fd passed in? (sorta like an
> > > internal dupfd() call?)
> > > 
> > 
> > No, we need something that will persist the errseq_t cursor between
> > syncfs calls.
> > 
> > If we did what you're suggesting, once your temporary file goes away,
> > you'd lose your place in the error stream and you'd end up reporting
> > the same errors more than once on subsequent calls to syncfs.
> 
> Which, IMO, is correct behaviour. If there is a persistent data
> writeback errors, then syncfs() should report it *every time it is
> called* because that syncfs() call did not result in the filesystem
> being fully committed to stable storage.
>
> I don't care whether the error has been reported before - the
> context of syncfs() is "commit the entire filesystem to stable
> storage". If any IO failed then we have not acheived what the
> application asked us to do and so we should be returning an error on
> every call that sees a data writeback error.
> 

The problem is that you would get back errors even if the problem
went away:

Suppose you write some data, and that fails writeback. You call syncfs
and that returns an error. So, you write the data again, and this time
it succeeds, and you call syncfs again on the same fd that you
originally called it on. It will still return an error because we have
no way to persist the fact that you already saw the original error on
the first syncfs call. That's wrong behavior, IMO.

The errseq_t mechanism requires that you keep a record of the last error
that you saw, so that we know whether to report it again or not.

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
  2018-04-19 23:47         ` Dave Chinner
  2018-04-20 11:24           ` Jeff Layton
@ 2018-04-21 17:21           ` Jan Kara
  1 sibling, 0 replies; 14+ messages in thread
From: Jan Kara @ 2018-04-21 17:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Layton, linux-fsdevel, lsf-pc, Andres Freund, Matthew Wilcox

On Fri 20-04-18 09:47:45, Dave Chinner wrote:
> On Thu, Apr 19, 2018 at 01:14:04PM -0400, Jeff Layton wrote:
> > On Thu, 2018-04-19 at 10:44 +1000, Dave Chinner wrote:
> > > On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote:
> > > > > > 4) syncfs doesn't currently report an error when a single inode fails
> > > > > > writeback, only when syncing out the block device. Should it report
> > > > > > errors in that case as well?
> > > > > 
> > > > > Yes.
> > > > 
> > > > I have a small patch that implements this that I posted a few days ago.
> > > > I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that
> > > > it's the right approach.
> > > > 
> > > > Instead of growing struct file to accommodate a second errseq cursor,
> > > > this only implements that behavior when the file is opened with O_PATH
> > > > (as we know that that will have the fsync op set to NULL). Maybe we can
> > > > do this better somehow though.
> > > 
> > > No idea whether this is possible, or even a good idea, but could we
> > > just have syncfs() create a temporary struct file for the duration
> > > of the syncfs call, so it works on any fd passed in? (sorta like an
> > > internal dupfd() call?)
> > > 
> > 
> > No, we need something that will persist the errseq_t cursor between
> > syncfs calls.
> > 
> > If we did what you're suggesting, once your temporary file goes away,
> > you'd lose your place in the error stream and you'd end up reporting
> > the same errors more than once on subsequent calls to syncfs.
> 
> Which, IMO, is correct behaviour. If there is a persistent data
> writeback errors, then syncfs() should report it *every time it is
> called* because that syncfs() call did not result in the filesystem
> being fully committed to stable storage.
> 
> I don't care whether the error has been reported before - the
> context of syncfs() is "commit the entire filesystem to stable
> storage". If any IO failed then we have not acheived what the
> application asked us to do and so we should be returning an error on
> every call that sees a data writeback error.

I believe syncfs() behavior should be consistent with fsync() behavior.
Otherwise it would be rather confusing for userspace. So what you suggest
would mean that fsync() would have to keep reporting an error once it has
reported one, since we've dropped the dirty bits from the pages and thus
can never make the user data durable anymore.

Or we could do what Jeff proposes - i.e., make syncfs() behave like fsync()
currently does, reporting each error for each fd only once.

And I find both behaviors justifiable. 

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
