Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling

From: Jeff Layton <jlayton@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	lsf-pc@lists.linux-foundation.org,
	Andres Freund <andres@anarazel.de>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
Date: Thu, 19 Apr 2018 13:14:04 -0400	[thread overview]
Message-ID: <1524158044.2943.17.camel@kernel.org> (raw)
In-Reply-To: <20180419004411.GG27893@dastard>

On Thu, 2018-04-19 at 10:44 +1000, Dave Chinner wrote:
> On Wed, Apr 18, 2018 at 12:00:10PM -0400, Jeff Layton wrote:
> > On Wed, 2018-04-18 at 08:53 +1000, Dave Chinner wrote:
> > > On Tue, Apr 17, 2018 at 07:08:01AM -0400, Jeff Layton wrote:
> > > > I'd like to have a discussion on how to improve the handling of errors
> > > > that occur during writeback. I think there are 4 main issues that would
> > > > be helpful to cover:
> > > > 
> > > > 1) In v4.16, we added the errseq mechanism to the kernel to make
> > > > writeback error reporting more reliable. That has helped some use cases,
> > > > but there are some applications (e.g. PostgreSQL) that were relying on
> > > > the ability to see writeback errors that occurred before the file
> > > > description existed. Do we need to tweak this mechanism to help those
> > > > use cases, or would that do more harm than good?
> > > 
> > > More harm than good, IMO. Too many wacky corner cases...
> > > 
> > 
> > One idea that willy had was to set f_wb_err in the file to 0 if the
> > inode's wb_err has the SEEN bit set. That would allow the Pg use case to

Sorry...that should have been "wb_err has the SEEN bit clear".

> > continue working, as long as they don't wait so long to open the file
> > that the inode gets evicted from the cache.
> > 
> > That latter bit is what worries me. Such behavior is non-deterministic.
> 
> Yup. IMO, unreliable error reporting is a far worse situation than
> no error reporting at all. If we can't guarantee delivery of the
> error, then we should not be trying to expose errors through that
> mechanism.
> 
> We could prevent eviction of inodes with pending errors, but we've
> already rejected that idea because it's a known OOM vector.
> 
> .....
> 
> > > > 3) The question of what to do with pages in the pagecache that fail
> > > > writeback is still unresolved. Most filesystems just clear the dirty bit
> > > > and and carry on, but some invalidate them or just clear the uptodate
> > > > bit. This sort of inconsistency (and lack of documentation) is
> > > > problematic and has led to applications assuming behavior that doesn't
> > > > exist. I believe we need to declare an "ideal behavior" for Linux
> > > > filesystems in this regard, add VM/FS helpers to make it easier for
> > > > filesystems to conform to that behavior, and document it well. The big
> > > > question is : what sort of behavior makes the most sense here?
> > > 
> > > User configurable on a per-filesystem basis, just like metadata
> > > error handling. The default should be what is best for system
> > > stability when users do things like pull USB sticks with GB of dirty
> > > data still in memory.
> > > 
> > 
> > Maybe. I think though we need to consider offering a bit more of a
> > guarantee here.
> > 
> > As you pointed out recently, xfs will clear the uptodate bit on a
> > writeback failure, so you'll end up having to re-read the page from disk
> > later if someone issues a read against it.
> > 
> > But...that means that we offer no guarantee of the posix requirement to
> > see the result of a write in a subsequent read. If writeback occurs
> > between the two, that write could vanish.
> > 
> > So, if you're relying on being able to see the result of a write in your
> > read, then you really _must_ issue fsync prior to any read and check the
> > error code. That sort of sucks, performance-wise.
> > 
> > It might be nice if we could ensure that the data sticks around for a
> > short period of time (a few seconds at least) so that you wouldn't have
> > to issue fsync so frequently to get such a guarantee.
> 
> Well, that's the point of the configurable error behaviour. Look at
> the XFS metadata config dor a moment:
> 
> $ ls /sys/fs/xfs/sdc/error/metadata/    
> EIO  ENODEV  ENOSPC  default
> $ ls /sys/fs/xfs/sdc/error/metadata/EIO/
> max_retries  retry_timeout_seconds
> $ cat /sys/fs/xfs/sdc/error/metadata/EIO/max_retries 
> -1
> $ cat /sys/fs/xfs/sdc/error/metadata/EIO/retry_timeout_seconds 
> -1
> $ cat /sys/fs/xfs/sdc/error/metadata/ENODEV/max_retries 
> 0
> $ cat /sys/fs/xfs/sdc/error/metadata/ENODEV/retry_timeout_seconds 
> 0
> 
> We have different config capability for EIO, ENOSPC and ENODEV and
> default behaviour for all other metadata writeback errors.
> 
> The defaults are "retry forever" for EIO (-1 means keep retrying
> forever and retry whenever the log pushes on the journal to free up
> space) but for ENODEV (i.e. the "USB stick pull") we default to
> fail-immediately semantics.
> 
> i.e. by default we treat EIO as transient and ENODEV as permanent.
> 
> If we want to change EIO or ENOSPC to be "try a few times, then give
> up", we set max_retries = 3 and retry_timeout_seconds = 300 to retry
> writeback 3 times five minutes apart before considering the error as
> permanent.  Once a permanent error is triggered, we shut down the
> filesystem, as permanent errors in metadata writeback result in
> filesystem corruption.
> 
> This is trickier to track for per-page data writeback errors because
> we don't really have object-based IO control context. However, I
> really didn't see these error configs to apply per-page errors but
> to the inode itself. That part is still to be worked out...
> 

That's an interesting approach. Maybe we should expand that sort of
interface across filesystems? I worry a bit about knob proliferation
though -- when things are that configurable it's easy for users to hang
themselves, end up with nonsensical configuration, etc.

Still, we may have no choice given the fact that there doesn't seem to
be any one-size-fits-all way to do this.

> > > > 4) syncfs doesn't currently report an error when a single inode fails
> > > > writeback, only when syncing out the block device. Should it report
> > > > errors in that case as well?
> > > 
> > > Yes.
> > 
> > I have a small patch that implements this that I posted a few days ago.
> > I meant to mark it as an RFC, but didn't, fwiw. I'm not convinced that
> > it's the right approach.
> > 
> > Instead of growing struct file to accommodate a second errseq cursor,
> > this only implements that behavior when the file is opened with O_PATH
> > (as we know that that will have the fsync op set to NULL). Maybe we can
> > do this better somehow though.
> 
> No idea whether this is possible, or even a good idea, but could we
> just have syncfs() create a temporary struct file for the duration
> of the syncfs call, so it works on any fd passed in? (sorta like an
> internal dupfd() call?)
> 

No, we need something that will persist the errseq_t cursor between
syncfs calls.

If we did what you're suggesting, once your temporary file goes away,
you'd lose your place in the error stream and you'd end up reporting
the same errors more than once on subsequent calls to syncfs.

We'll need some way to store that in the struct file, the question is
whether we are willing to grow the struct to accomodate it.

-- 
Jeff Layton <jlayton@kernel.org>