From: Jeff Layton <jlayton@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	lsf-pc@lists.linux-foundation.org,
	Andres Freund <andres@anarazel.de>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
Date: Wed, 18 Apr 2018 12:00:10 -0400	[thread overview]
Message-ID: <1524067210.27056.28.camel@kernel.org> (raw)
In-Reply-To: <20180417225309.GA27893@dastard>

On Wed, 2018-04-18 at 08:53 +1000, Dave Chinner wrote:
> On Tue, Apr 17, 2018 at 07:08:01AM -0400, Jeff Layton wrote:
> > I'd like to have a discussion on how to improve the handling of errors
> > that occur during writeback. I think there are 4 main issues that would
> > be helpful to cover:
> > 
> > 1) In v4.16, we added the errseq mechanism to the kernel to make
> > writeback error reporting more reliable. That has helped some use cases,
> > but there are some applications (e.g. PostgreSQL) that were relying on
> > the ability to see writeback errors that occurred before the file
> > description existed. Do we need to tweak this mechanism to help those
> > use cases, or would that do more harm than good?
> 
> More harm than good, IMO. Too many wacky corner cases...
> 

One idea that willy had was to set f_wb_err in the struct file to 0 at
open time if the inode's wb_err has the SEEN bit set. That would allow
the PostgreSQL use case to keep working, as long as they don't wait so
long to open the file that the inode gets evicted from the cache.

That latter bit is what worries me. Such behavior is non-deterministic.
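
For reference, the open-time sampling would look roughly like this (just
a sketch of the idea as I understand it, not actual kernel code;
ERRSEQ_SEEN is currently private to lib/errseq.c, so a real version
would have to live there):

/*
 * Sketch only: sample the inode's wb_err at open time, but hand back 0
 * when the error has already been seen, so a later fsync on the newly
 * opened fd still reports it.
 */
static errseq_t sample_wb_err_at_open(struct address_space *mapping)
{
	errseq_t old = READ_ONCE(mapping->wb_err);

	if (old & ERRSEQ_SEEN)
		return 0;
	return old;
}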

> > 2) Most filesystems report data wb errors on all fds that were open at
> > the time of the error now, but metadata writeback can also fail, and
> > those don't get reported in the same way so far. Should we extend those
> > semantics to metadata writeback? How do we get there if so?
> 
> No. Metadata writeback errors are intimately related to internal
> filesystem consistency - if there's a problem with metadata
> writeback, it's up to the filesystem to determine if fsync needs to
> report it or not.
> 
> And, FWIW, journalling filesystems will follow their global "fatal
> error" shutdown configuration if there's  fatal metadata writeback
> error. Such an error is essentially filesystem corruption, so once
> this error condition is set, any subsequent attempt to fsync or
> modify the filesystem should return errors (i.e. EIO or
> EFSCORRUPTED).
> 

Ok, fair enough.

A few months ago, I played around with some patches to do this for ext4.
It meant growing struct file though, which was sort of nasty. I also had
no idea how to test it properly, so I punted on the idea.
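
For illustration, the extra state amounts to something like this (both
new names below are made up; only f_wb_err and the data-side
file_check_and_advance_wb_err() exist today):

/*
 * Illustrative only: a second errseq cursor for metadata writeback
 * errors.  Neither file->f_md_err nor inode->i_md_wb_err exists; this
 * just shows where the struct file growth comes from, mirroring what
 * file_check_and_advance_wb_err() does for data errors.
 */
static int file_check_and_advance_md_err(struct file *file)
{
	struct inode *inode = file_inode(file);

	return errseq_check_and_advance(&inode->i_md_wb_err,
					&file->f_md_err);
}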

We could revisit it, but most folks are using errors=remount-ro on ext4,
and that makes it sort of a moot point.

> > 3) The question of what to do with pages in the pagecache that fail
> > writeback is still unresolved. Most filesystems just clear the dirty bit
> and carry on, but some invalidate them or just clear the uptodate
> > bit. This sort of inconsistency (and lack of documentation) is
> > problematic and has led to applications assuming behavior that doesn't
> > exist. I believe we need to declare an "ideal behavior" for Linux
> > filesystems in this regard, add VM/FS helpers to make it easier for
> > filesystems to conform to that behavior, and document it well. The big
> > question is : what sort of behavior makes the most sense here?
> 
> User configurable on a per-filesystem basis, just like metadata
> error handling. The default should be what is best for system
> stability when users do things like pull USB sticks with GB of dirty
> data still in memory.
> 

Maybe, though I think we need to consider offering a bit more of a
guarantee here.

As you pointed out recently, xfs will clear the uptodate bit on a
writeback failure, so you'll end up having to re-read the page from disk
later if someone issues a read against it.

But... that means we offer no guarantee of the POSIX requirement to see
the result of a write in a subsequent read. If writeback occurs between
the two, that write could vanish.

So, if you're relying on being able to see the result of a write in your
read, then you really _must_ issue fsync prior to any read and check the
error code. That sort of sucks, performance-wise.
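
Concretely, getting that guarantee today means doing something like this
on every read-after-write path (simplified userspace example, error
handling trimmed):

#include <sys/types.h>
#include <unistd.h>

/*
 * Paranoid read-after-write: flush and check for writeback errors
 * before trusting that a read returns what we just wrote.
 */
static ssize_t write_then_read(int fd, const void *wbuf, void *rbuf,
			       size_t len, off_t off)
{
	if (pwrite(fd, wbuf, len, off) != (ssize_t)len)
		return -1;
	if (fsync(fd) != 0)	/* writeback may have failed; */
		return -1;	/* the data could now be gone */
	return pread(fd, rbuf, len, off);
}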

It might be nice if we could ensure that the data sticks around for a
short period of time (a few seconds at least) so that you wouldn't have
to issue fsync so frequently to get such a guarantee.

That may be too difficult to do, though. I don't know.

In any case, we should document this much better than we do today, so
that userland devs have proper expectations of the behavior here.

> > 4) syncfs doesn't currently report an error when a single inode fails
> > writeback, only when syncing out the block device. Should it report
> > errors in that case as well?
> 
> Yes.
> 

I posted a small patch a few days ago that implements this. I meant to
mark it as an RFC but didn't, FWIW. I'm not convinced that it's the
right approach.

Instead of growing struct file to accommodate a second errseq cursor,
it only implements this behavior when the file is opened with O_PATH
(since we know such a file will have its fsync op set to NULL). Maybe
we can do this better somehow, though.
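
The gist is roughly this (a very rough sketch, not the actual patch;
sb->s_wb_err is a hypothetical per-superblock errseq_t that inode
writeback failures would also bump):

/*
 * Sketch: report filesystem-wide writeback errors via syncfs() on fds
 * opened with O_PATH, reusing f_wb_err as the cursor, since such fds
 * can never be fsync'd.
 */
static int syncfs_check_wb_err(struct file *file)
{
	struct super_block *sb = file_inode(file)->i_sb;

	if (!(file->f_mode & FMODE_PATH))	/* opened with O_PATH? */
		return 0;

	return errseq_check_and_advance(&sb->s_wb_err, &file->f_wb_err);
}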

> And there's more than that. The filesystem ->sync_fs() method also
> needs to return an error, because that's where most filesystems
> persist their metadata. And because sync_fs has historically thrown
> the error returned away, most filesystems don't bother checking or
> returning errors.
> 

That sounds like a good goal too. The ->sync_fs() prototype already
returns an error, but it looks like all of the callers ignore it.
Making the callers pass that error up would be a good first step, and
then we could fix up the individual sync_fs operations one by one.
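
Roughly, something like this in the generic path (based on my reading
of fs/sync.c; details approximate and untested):

/* Sketch: stop discarding the ->sync_fs() return value. */
static int __sync_filesystem(struct super_block *sb, int wait)
{
	int err = 0, ret;

	if (wait)
		sync_inodes_sb(sb);
	else
		writeback_inodes_sb(sb, WB_REASON_SYNC);

	if (sb->s_op->sync_fs) {
		/* today this return value is simply dropped */
		ret = sb->s_op->sync_fs(sb, wait);
		if (ret)
			err = ret;
	}

	ret = __sync_blockdev(sb->s_bdev, wait);
	return err ? err : ret;
}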
 
-- 
Jeff Layton <jlayton@kernel.org>

Thread overview: 14+ messages
2018-04-17 11:08 [LSF/MM TOPIC] improving writeback error handling Jeff Layton
2018-04-17 22:53 ` Dave Chinner
2018-04-18 16:00   ` Jeff Layton [this message]
2018-04-19  0:44     ` [Lsf-pc] " Dave Chinner
2018-04-19  1:47       ` Trond Myklebust
2018-04-19  1:57         ` Matthew Wilcox
2018-04-19  2:12           ` Trond Myklebust
2018-04-19 18:57             ` andres
2018-04-19  2:15           ` andres
2018-04-19  2:19             ` Trond Myklebust
2018-04-19 17:14       ` Jeff Layton
2018-04-19 23:47         ` Dave Chinner
2018-04-20 11:24           ` Jeff Layton
2018-04-21 17:21           ` Jan Kara
