Re: [LSF/MM TOPIC] I/O error handling and fsync()

From: Kevin Wolf <kwolf@redhat.com>
To: NeilBrown <neilb@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Christoph Hellwig <hch@infradead.org>,
	Ric Wheeler <rwheeler@redhat.com>, Rik van Riel <riel@redhat.com>
Subject: Re: [LSF/MM TOPIC] I/O error handling and fsync()
Date: Fri, 13 Jan 2017 12:51:28 +0100
Message-ID: <20170113115128.GB4981@noname.redhat.com>
In-Reply-To: <87y3yfftqa.fsf@notabene.neil.brown.name>

On 13.01.2017 at 05:51, NeilBrown wrote:
> On Wed, Jan 11 2017, Kevin Wolf wrote:
> 
> > On 11.01.2017 at 06:03, Theodore Ts'o wrote:
> >> A couple of thoughts.
> >> 
> >> First of all, one of the reasons why this probably hasn't been
> >> addressed for so long is because programs who really care about issues
> >> like this tend to use Direct I/O, and don't use the page cache at all.
> >> And perhaps this is an option open to qemu as well?
> >
> > For our immediate case, yes, O_DIRECT can be enabled as an option in
> > qemu, and it is generally recommended to do that at least for long-lived
> > VMs. For other cases it might be nice to use the cache e.g. for quicker
> > startup, but those might be cases where error recovery isn't as
> > important.
> >
> > I just see a much broader problem here than just for qemu. Essentially
> > this approach would mean that every program that cares about the state
> > it sees being safe on disk after a successful fsync() would have to use
> > O_DIRECT. I'm not sure if that's what we want.
> 
> This is not correct.  If an application has exclusive write access to a
> file (which is common, even if only enforced by convention) and if that
> program checks the return of every write() and every fsync() (which, for
> example, stdio does, allowing ferror() to report if there have ever been
> errors), then it will know if its data is safe.
> 
> If any of these writes returned an error, then there is NOTHING IT CAN
> DO about that file.  It should be considered to be toast.
> If there is a separate filesystem it can use, then maybe there is a way
> forward, but normally it would just report an error in whatever way is
> appropriate.
> 
> My position on this is primarily that if you get a single write error,
> then you cannot trust anything any more.

But why? Do you think this is inevitable and therefore it is the most
useful approach to handle errors, or is it just more convenient because
then you don't have to think as much about error cases?

The semantics I know are that a failed write leaves the contents of the
blocks touched by that write request undefined, but why can't I trust
anything else in the same file (which, in the case of qemu, is often a
whole block device) any more?

> You suggested before that NFS problems can cause errors which can be
> fixed by the sysadmin so subsequent writes succeed.  I disagreed - NFS
> will block, not return an error.  Your last paragraph below indicates
> that you agree.  So I ask again: can you provide a genuine example of a
> case where a write might result in an error, but that sysadmin
> involvement can allow a subsequent attempt to write to succeed.   I
> don't think you can, but I'm open...

I think I have replied to that in the other email now, so to keep the
discussion in one place I won't repeat my answer here.

> I note that ext4 has an option "errors=remount-ro".  I think that
> actually makes a lot of sense.  I could easily see an argument for
> supporting this at the file level, when it isn't enabled at the
> filesystem level. If there is any write error, then all subsequent
> writes should cause an error, only reads should be allowed.

Obviously, that doesn't solve the problems we have with recovery; it
only makes them worse. However, I admit it would be the only reasonable
choice if "after a single write error, you can't trust the whole file"
were the official semantics. (Which I hope they aren't.)
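
For comparison, the per-file behaviour suggested above can be emulated
in userspace with a small wrapper. This is only a sketch under my own
assumptions: StickyErrorFile is a hypothetical name, and the choice of
EROFS for the subsequent failures is mine, made by analogy with a
read-only remount; nothing like this exists in the kernel today.

```python
import errno
import os


class StickyErrorFile:
    """Hypothetical wrapper: after one write error, only reads are allowed."""

    def __init__(self, fd):
        self.fd = fd
        self.failed = False  # set on the first write or fsync error

    def _check(self):
        if self.failed:
            # Assumption: report EROFS, mirroring errors=remount-ro.
            raise OSError(errno.EROFS, "read-only after an earlier write error")

    def pwrite(self, data, offset):
        self._check()
        try:
            return os.pwrite(self.fd, data, offset)
        except OSError:
            self.failed = True
            raise

    def fsync(self):
        self._check()
        try:
            os.fsync(self.fd)
        except OSError:
            self.failed = True
            raise

    def pread(self, length, offset):
        # Reads stay allowed even after a write error.
        return os.pread(self.fd, length, offset)
```

Note that the wrapper never clears the failed flag, which is exactly why
it offers no path to recovery: the caller cannot retry the failed
request on the same file, even if the underlying problem has been fixed.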

Kevin
