From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:14374
	"EHLO ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1750904AbeDNBrz
	(ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Fri, 13 Apr 2018 21:47:55 -0400
Date: Sat, 14 Apr 2018 11:47:52 +1000
From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox
Cc: Jeff Layton, lsf-pc, Andres Freund, Andreas Dilger,
	"Theodore Y. Ts'o", Ext4 Developers List, Linux FS Devel,
	"Joshua D. Drake"
Subject: Re: fsync() errors is unsafe and risks data loss
Message-ID: <20180414014752.GG23861@dastard>
References: <20180410220726.vunhvwuzxi5bm6e5@alap3.anarazel.de>
	<190CF56C-C03D-4504-8B35-5DB479801513@dilger.ca>
	<20180412021752.2wykkutkmzh4ikbf@alap3.anarazel.de>
	<20180412030248.GA8509@bombadil.infradead.org>
	<1523531354.4532.21.camel@redhat.com>
	<20180412120122.GE23861@dastard>
	<1523545730.4532.82.camel@redhat.com>
	<20180412224404.GA5572@dastard>
	<1523625536.4847.21.camel@redhat.com>
	<20180413140232.GA24379@bombadil.infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20180413140232.GA24379@bombadil.infradead.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:
> On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:
> > On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:
> > > To save you looking, XFS will trash the page contents completely on
> > > a filesystem level ->writepage error. It doesn't mark them "clean",
> > > doesn't attempt to redirty and rewrite them - it clears the uptodate
> > > state and may invalidate the page completely. IOWs, the data written
> > > "successfully" to the cached page is now gone. It will be re-read
> > > from disk on the next read() call, in direct violation of the above
> > > POSIX requirements.
> > >
> > > This is my point: we've done that in XFS knowing that we violate
> > > POSIX specifications in this specific corner case - it's the lesser
> > > of many evils we have to choose between. Hence if we choose to
> > > encode that behaviour as the general writeback IO error handling
> > > algorithm, then it needs to be done with the knowledge that it is a
> > > specification violation. Not to mention be documented as a POSIX
> > > violation in the various relevant man pages, along with the fact
> > > that this is how all filesystems will behave on async writeback
> > > error.....
> >
> > Got it, thanks.
> >
> > Yes, I think we probably ought to do the same thing globally. It's
> > nice to know that XFS has already been doing this. That makes me feel
> > better about making this behavior the gold standard for Linux
> > filesystems.
> >
> > So to summarize, at this point in the discussion, I think we want to
> > consider doing the following:
> >
> > * better reporting from syncfs (report an error when even one inode
> > failed to be written back since the last syncfs call). We'll probably
> > implement this via a per-sb errseq_t in some fashion, though there
> > are some implementation issues to work out.
> >
> > * invalidate or clear the uptodate flag on pages that experience
> > writeback errors, across filesystems. Encourage this as standard
> > behavior for filesystems and maybe add helpers to make it easier to
> > do.
> >
> > Did I miss anything? Would that be enough to help the Pg use case?
> >
> > I don't see us ever being able to reasonably support its current
> > expectation that writeback errors will be seen on fd's that were
> > opened after the error occurred. That's a really thorny problem from
> > an object lifetime perspective.
>
> I think we can do better than XFS is currently doing (but I agree that
> we should have the same behaviour across all Linux filesystems!)
>
> 1. If we get an error while wbc->for_background is true, we should not
> clear uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than
a permanent error.

> 2. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient
error - it requires a user to specifically run sync to get the data
to disk after the transient error has occurred. What if they don't
notice the problem because it's fleeting and doesn't cause any
obvious symptoms? e.g. XFS hits ENOSPC, runs out of reserve pool
blocks so it can't allocate space to write back the page, then space
is freed up a few seconds later and the next write works just fine.

This is a recipe for "I lost data that I wrote /days/ before the
system crashed" bug reports.

> 3. for_sync writebacks should attempt one last write. Maybe it'll
> succeed this time. If it does, just ClearPageError. If not, we have
> somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount
to report fatal errors?

We used to do this with XFS metadata. We'd just keep trying to write
metadata and keep the filesystem running (because it's consistent in
memory and it might be a transient error) rather than shutting down
the filesystem after a couple of retries. The result was that users
wouldn't notice there were problems until unmount, and the most
common symptom of that was "why is system shutdown hanging?". We now
don't hang at unmount by default:

$ cat /sys/fs/xfs/dm-0/error/fail_at_unmount
1
$

And we treat different errors according to their seriousness. EIO and
device ENOSPC default to retrying forever because they are often
transient, but on ENODEV we fail and shut the filesystem down
immediately (someone pulled the USB stick out). Metadata failure
behaviour is configured by changing fields in
/sys/fs/xfs/<dev>/error/metadata/<error>/...

We've planned to extend this failure configuration to data IO, too,
but we've never quite got around to it. This is a clear example of
"one size doesn't fit all", and I think we'll end up doing the same
sort of error behaviour configuration in XFS for these cases. (i.e.
/sys/fs/xfs/<dev>/error/writeback/<error>/....)
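
To make that concrete, the existing metadata knobs look roughly like
this from userspace (an illustrative session against a dm-0 device -
exact defaults vary by kernel version, and the writeback/ directory
is the proposed extension, not something that exists yet):

$ ls /sys/fs/xfs/dm-0/error/metadata/
default  EIO  ENODEV  ENOSPC
$ ls /sys/fs/xfs/dm-0/error/metadata/EIO/
max_retries  retry_timeout_seconds
$ cat /sys/fs/xfs/dm-0/error/metadata/EIO/max_retries
-1
$ cat /sys/fs/xfs/dm-0/error/metadata/ENODEV/max_retries
0
$

A max_retries of -1 means "retry forever", 0 means "fail immediately"
and N means "give up after N retries". A data writeback version would
presumably grow the same per-error knobs under
/sys/fs/xfs/<dev>/error/writeback/<error>/.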

> And this logic all needs to be in one place, although invoked from
> each filesystem.

Perhaps so, but as there's no "one-size-fits-all" behaviour, I really
want to extend the XFS error config infrastructure to control what
the filesystem does on error here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com