Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

From: Ted Ts'o <tytso@mit.edu>
To: Dave Chinner <david@fromorbit.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Hellwig <hch@infradead.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	LKML <linux-kernel@vger.kernel.org>,
	Edward Shishkin <edward@redhat.com>
Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error
Date: Mon, 23 Jan 2012 16:47:09 -0500
Message-ID: <20120123214709.GB17974@thunk.org>
In-Reply-To: <20120123030422.GE15102@dastard>

On Mon, Jan 23, 2012 at 02:04:22PM +1100, Dave Chinner wrote:
> 
> Sure, but the buffer contents are dirty until the IO completes
> successfully and what is on disk matches the contents of the buffer
> in memory. It doesn't magically become clean when we clear the dirty
> bit. We only clear the dirty bit before submitting the IO to stop
> multiple callers from trying to submit it for write at the same
> time. IOWs, the buffer dirty bit doesn't really track the dirty
> state of the buffer correctly.

Doesn't BH_Lock prevent multiple callers from submitting it for write
at the same time?  If memory serves, one of the reasons why we cleared
the dirty bit before submitting the write was because we allowed
writers to dirty the buffer_head while the write was "in flight".  Of
course, this becomes problematic if we're trying to support DIF/DIX.

What if we simply disallow BH_Dirty from being set (and disallow the
modification of the buffer) while the buffer is locked?  Then the
dirty bit would indeed correctly track the state of the buffer.
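
As a rough illustration of that rule, here is a minimal userspace
sketch.  It is only a model: struct model_bh and the model_* helpers
are hypothetical names, not the kernel's buffer_head API, and
re-dirtying on a failed write is an assumption about how the dirty
bit would then be kept accurate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; NOT the kernel's struct buffer_head. */
struct model_bh {
	bool locked;	/* models BH_Lock: a write is in flight */
	bool dirty;	/* models BH_Dirty */
};

/* Try to dirty the buffer.  Refused while the buffer is locked, so the
 * contents cannot change under an in-flight write. */
static bool model_mark_dirty(struct model_bh *bh)
{
	assert(bh != NULL);
	if (bh->locked)
		return false;	/* caller would have to wait for I/O completion */
	bh->dirty = true;
	return true;
}

/* Start a write: take the lock, then clear the dirty bit. */
static bool model_submit_write(struct model_bh *bh)
{
	assert(bh != NULL);
	if (bh->locked || !bh->dirty)
		return false;	/* already in flight, or nothing to write */
	bh->locked = true;
	bh->dirty = false;
	return true;
}

/* Complete a write.  Assumption: on failure the buffer is re-dirtied,
 * because what is on disk still does not match memory. */
static void model_end_write(struct model_bh *bh, bool success)
{
	assert(bh != NULL);
	if (!success)
		bh->dirty = true;
	bh->locked = false;
}
```

In this model nobody can dirty the buffer between model_submit_write()
and model_end_write(), so after completion the dirty bit alone says
whether memory and disk agree.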

> I can only assume that you didn't read what I said about how
> different filesystems can (and do) handle write errors differently.
> Indeed, even within a filesystem there can be different error
> handling methods for different types of write IO errors (e.g.
> transient vs unrecoverable).  Hence there are any number of valid
> error handling strategies that can be added to the above list. One
> size does not fit all...

That's another problem, which is that we need more context than just
!uptodate.  We need to know what sort of write I/O errors occurred, so
we can determine whether it's likely to be transient or permanent.
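
To make "more context" concrete, here is a hedged sketch of a
completion path that reports an error class instead of a single
boolean.  Everything in it is hypothetical: the enum, the function
names, and especially the mapping of errno values to "transient",
which is an illustrative guess and not how any existing filesystem or
block driver classifies errors.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical error classes a write completion could report. */
enum model_io_error_class {
	MODEL_IO_OK,
	MODEL_IO_TRANSIENT,	/* a retry may succeed */
	MODEL_IO_PERMANENT,	/* retrying is not expected to help */
};

/* Classify a positive errno value.  The mapping below is only an
 * assumption chosen for illustration. */
static enum model_io_error_class model_classify_write_error(int err)
{
	assert(err >= 0);
	switch (err) {
	case 0:
		return MODEL_IO_OK;
	case ETIMEDOUT:		/* e.g. a timeout during path failover */
	case EAGAIN:
	case ENOMEM:
		return MODEL_IO_TRANSIENT;
	default:
		return MODEL_IO_PERMANENT;
	}
}

/* A filesystem could then choose its policy per class. */
static bool model_should_retry(enum model_io_error_class cls)
{
	return cls == MODEL_IO_TRANSIENT;
}
```

The point is only that the caller sees the class and picks its own
policy; where the classification would actually come from is exactly
the open question.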

> The thing is, transient write errors tend to be isolated and go away
> when a retry occurs (think of IO timeouts when multipath failover
> occurs). When non-isolated IO or unrecoverable problems occur (e.g.
> no paths left to fail over onto), critical other metadata reads and
> writes will fail and shut down the filesystem, thereby terminating
> the "try forever" background writeback loop those delayed write
> buffers may be in. So the truth is that "trying forever" on write
> errors can handle a whole class of write IO errors very
> effectively....

So how does XFS decide whether a write should fail and shut down the
file system, or just "try forever"?

						- Ted